Join our team! We are hiring two full-time roles to support our growing initiative. The Senior Partnerships Manager and the Communications and Community Manager will work directly with IDI's Executive Director and Program Director to advance IDI's mission. Interested candidates should submit a resume and cover letter via the careers.harvard.edu portal.
About us
The Institutional Data Initiative is a research initiative at Harvard Law School Library. We work with knowledge institutions—from libraries and museums to cultural groups and government agencies—to refine and publish their collections as data. Our goal is to help build a vast commons of well-understood data, gather a diverse community to investigate and improve it, and affirm the role of institutions as stewards of knowledge in the age of AI.
- Website: http://institutional.org
- Industry: Research Services
- Company size: 2-10 employees
- Type: Nonprofit
Updates
Institutional Data Initiative at Harvard reposted this
When libraries participate in Google Books, Google not only scans their books, it also makes a wealth of image, OCR, and metadata available to them via the Google Return Interface (GRIN). But working with GRIN can be challenging. We learned this lesson over the months it took to download 1M of Harvard Library's books for our Institutional Books release. As a result, many libraries have yet to take full advantage of the wonderful resources GRIN provides.

So today we're releasing GRIN Transfer: a tool for libraries to download their collections, big or small, from GRIN. GRIN Transfer handles request batching, failure recovery, and data aggregation so that libraries can focus on using the data rather than simply gaining access to it. We're also sharing the pipeline we developed for Institutional Books that seamlessly dedupes, classifies, and enhances the data once GRIN Transfer brings it down.

If you're a Google Books partner library, you can find more information in our blog post: https://lnkd.in/e-EcKhR8

Or if you're simply curious about what it's like to work with GRIN, you can find a wealth of details in our technical report: https://lnkd.in/eBEjtNxq
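The post above names two of the patterns GRIN Transfer implements: request batching and failure recovery. As a rough illustration only (this is not GRIN Transfer's actual API; the function names, the `fetch` callback, and the barcode strings below are all hypothetical), a batched download loop with per-item retry might look like:

```python
def download_in_batches(barcodes, fetch, batch_size=3, max_retries=2):
    """Download items in fixed-size batches, retrying items that fail.

    `fetch` takes one barcode and returns its payload, or raises IOError
    on a transient failure. Returns (results dict, barcodes that never
    succeeded). A toy sketch, not the real GRIN Transfer tool.
    """
    results, failed = {}, []
    for start in range(0, len(barcodes), batch_size):
        batch = list(barcodes[start:start + batch_size])
        for _attempt in range(max_retries + 1):
            remaining = []
            for barcode in batch:
                try:
                    results[barcode] = fetch(barcode)
                except IOError:
                    remaining.append(barcode)  # retry on the next pass
            batch = remaining
            if not batch:
                break
        failed.extend(batch)  # exhausted retries for these items
    return results, failed


# Usage with a flaky fetch that fails once before succeeding:
attempts = {}

def flaky_fetch(barcode):
    attempts[barcode] = attempts.get(barcode, 0) + 1
    if barcode == "HX0002" and attempts[barcode] == 1:
        raise IOError("transient network error")
    return f"ocr-text-for-{barcode}"

results, failed = download_in_batches(["HX0001", "HX0002", "HX0003"], flaky_fetch)
```

The key design point, which the post attributes to GRIN Transfer, is that failure recovery happens per item rather than per batch, so one bad volume does not force re-downloading its whole batch.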
Join us tomorrow at 10AM EST: https://lnkd.in/guiPMssY
Can a small visual language model read documents as effectively as models 27 times its size? Next Friday, IDI will host Michele Dolfi, PhD and Peter W. J. Staar from IBM Research Zurich to discuss their work on SmolDocling, an “ultra-compact” model for diverse OCR tasks.
Institutional Data Initiative at Harvard reposted this
This Monday, the Institutional Data Initiative at Harvard will host Petr Knoth to talk about his experience leading CORE ("The world’s largest collection of open access research papers") as the rise of AI brings new meaning, and challenges, to stewarding knowledge repositories. Join us virtually on June 23rd at 12:45pm ET using the RSVP link below. https://lnkd.in/eakNbKSy
Institutional Data Initiative at Harvard reposted this
Tomorrow, it's our pleasure to host Ayah Bdeir to talk about the power of data in building an AI ecosystem that's open, transparent, and fair. 11am ET on June 17th. Register at the link below to attend virtually. Cohosted by the Institutional Data Initiative at Harvard and Berkman Klein Center for Internet & Society at Harvard University. https://lnkd.in/eHfuRExD
Today we released Institutional Books 1.0, a 242B-token dataset from Harvard Library's collections, refined for accuracy and usability. In our analysis of the dataset's coverage across time, topic, and language, we found:
- 43% English text, plus a long tail of 254 other languages
- 20 clear topical tranches
- Volumes largely published in the 19th and 20th centuries

The dataset also includes extensive volume-level metadata with both original and generated components, such as results from text-level language detection. As part of our refinement work, we supplemented the original OCR-extracted text with a post-processed version that uses line detection to reassemble the text according to line type.

Looking forward, we hope to continue growing Institutional Books with the community. We invite collaboration from researchers and model makers as we:
- Evaluate the dataset's impact on model outputs
- Continue refining our OCR pipelines

We see Institutional Books as the beginning of a process that will make millions more books accessible to the public for a variety of uses. We welcome feedback as we continue to expand this dataset, refine its contents, and sharpen our process.
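The post mentions using line detection to reassemble OCR text according to line type. As a toy illustration only (the heuristics and line types below are assumptions, not IDI's actual pipeline), the idea of classifying lines and rejoining body text might be sketched as:

```python
def reassemble(lines):
    """Rejoin OCR'd page lines into flowing text.

    Toy heuristics, purely illustrative: drop blank lines and bare page
    numbers, drop short all-caps lines (treated as running headers), and
    merge words hyphenated across line breaks.
    """
    out = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.isdigit():
            continue  # blank line or page number
        if stripped.isupper() and len(stripped) < 40:
            continue  # likely a running header
        if out and out[-1].endswith("-"):
            out[-1] = out[-1][:-1] + stripped  # rejoin hyphenated word
        else:
            out.append(stripped)
    return " ".join(out)


# One scanned page: header, two body lines with a hyphen break, page number.
page = ["CHAPTER ONE", "The insti-", "tutional data", "42"]
text = reassemble(page)
```

A real pipeline would classify lines with a trained detector rather than string heuristics, but the reassembly step, deciding per line type whether to drop, join, or keep, has this overall shape.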