Join our team! We are hiring two full-time roles to support our growing initiative. The Senior Partnerships Manager and the Communications and Community Manager will work directly with IDI's Executive Director and Program Director to advance IDI's mission. Interested candidates should submit a resume and cover letter via the careers.harvard.edu portal.
About us
The Institutional Data Initiative is a research initiative at Harvard Law School Library. We work with knowledge institutions—from libraries and museums to cultural groups and government agencies—to refine and publish their collections as data. Our goal is to help build a vast commons of well-understood data, gather a diverse community to investigate and improve it, and affirm the role of institutions as stewards of knowledge in the age of AI.
- Website: http://institutional.org
- Industry: Research Services
- Company size: 2-10 employees
- Type: Nonprofit
Updates
Institutional Data Initiative at Harvard reposted this
When libraries participate in Google Books, Google not only scans their books, it also makes a wealth of image, OCR, and metadata available to them via the Google Return Interface (GRIN). But working with GRIN can be challenging. We learned this lesson over the months it took to download 1M of Harvard Library's books for our Institutional Books release. As a result, many libraries have yet to take full advantage of the wonderful resources GRIN provides.

So today we're releasing GRIN Transfer: a tool for libraries to download their collections, big or small, from GRIN. GRIN Transfer handles request batching, failure recovery, and data aggregation so that libraries can focus on using the data rather than simply gaining access to it. We're also sharing the pipeline we developed for Institutional Books that seamlessly dedupes, classifies, and enhances the data once GRIN Transfer brings it down.

If you're a Google Books partner library, you can find more information in our blog post: https://lnkd.in/e-EcKhR8

Or if you're simply curious about what it's like to work with GRIN, you can find a wealth of details in our technical report: https://lnkd.in/eBEjtNxq
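The post above names two of the patterns GRIN Transfer implements: request batching and failure recovery. As a rough illustration only (this is not GRIN Transfer's actual API; the function names, the `fetch` callback, and the barcode strings below are all hypothetical), a batched download loop with per-item retry might look like:

```python
def download_in_batches(barcodes, fetch, batch_size=3, max_retries=2):
    """Download items in fixed-size batches, retrying items that fail.

    `fetch` takes one barcode and returns its payload, or raises IOError
    on a transient failure. Returns (results dict, barcodes that never
    succeeded). A toy sketch, not the real GRIN Transfer tool.
    """
    results, failed = {}, []
    for start in range(0, len(barcodes), batch_size):
        batch = list(barcodes[start:start + batch_size])
        for _attempt in range(max_retries + 1):
            remaining = []
            for barcode in batch:
                try:
                    results[barcode] = fetch(barcode)
                except IOError:
                    remaining.append(barcode)  # retry on the next pass
            batch = remaining
            if not batch:
                break
        failed.extend(batch)  # exhausted retries for these items
    return results, failed


# Usage with a flaky fetch that fails once before succeeding:
attempts = {}

def flaky_fetch(barcode):
    attempts[barcode] = attempts.get(barcode, 0) + 1
    if barcode == "HX0002" and attempts[barcode] == 1:
        raise IOError("transient network error")
    return f"ocr-text-for-{barcode}"

results, failed = download_in_batches(["HX0001", "HX0002", "HX0003"], flaky_fetch)
```

The key design point, which the post attributes to GRIN Transfer, is that failure recovery happens per item rather than per batch, so one bad volume does not force re-downloading its whole batch.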
Join us tomorrow at 10AM EST: https://lnkd.in/guiPMssY
Can a small visual language model read documents as effectively as models 27 times its size? Next Friday, IDI will host Michele Dolfi, PhD and Peter W. J. Staar from IBM Research Zurich to discuss their work on SmolDocling, an “ultra-compact” model for diverse OCR tasks.
Institutional Data Initiative at Harvard reposted this
This Monday, the Institutional Data Initiative at Harvard will host Petr Knoth to talk about his experience leading CORE ("The world’s largest collection of open access research papers") as the rise of AI brings new meaning, and challenges, to stewarding knowledge repositories. Join us virtually on June 23rd at 12:45pm ET using the RSVP link below. https://lnkd.in/eakNbKSy
Institutional Data Initiative at Harvard reposted this
Tomorrow, it's our pleasure to host Ayah Bdeir to talk about the power of data in building an AI ecosystem that's open, transparent, and fair. 11am ET on June 17th. Register at the link below to attend virtually. Cohosted by the Institutional Data Initiative at Harvard and Berkman Klein Center for Internet & Society at Harvard University. https://lnkd.in/eHfuRExD
Today we released Institutional Books 1.0, a 242B-token dataset from Harvard Library's collections, refined for accuracy and usability. In our analysis of the dataset's coverage across time, topic, and language, we found:
- 43% English text, plus a long tail of 254 other languages
- 20 clear topical tranches
- Volumes largely published in the 19th and 20th centuries

The dataset also includes extensive volume-level metadata with both original and generated components, such as results from text-level language detection. As part of our refinement work, we supplemented the original OCR-extracted text with a post-processed version that uses line detection to reassemble the text according to line type.

Looking forward, we hope to continue growing Institutional Books with the community. We invite collaboration from researchers and model makers as we:
- Evaluate the dataset's impact on model outputs
- Continue refining our OCR pipelines

We see Institutional Books as the beginning of a process that will make millions more books accessible to the public for a variety of uses. We welcome feedback as we continue to expand this dataset, refine its contents, and sharpen our process.
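The post mentions using line detection to reassemble OCR text according to line type. As a toy illustration only (the heuristics and line types below are assumptions, not IDI's actual pipeline), the idea of classifying lines and rejoining body text might be sketched as:

```python
def reassemble(lines):
    """Rejoin OCR'd page lines into flowing text.

    Toy heuristics, purely illustrative: drop blank lines and bare page
    numbers, drop short all-caps lines (treated as running headers), and
    merge words hyphenated across line breaks.
    """
    out = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.isdigit():
            continue  # blank line or page number
        if stripped.isupper() and len(stripped) < 40:
            continue  # likely a running header
        if out and out[-1].endswith("-"):
            out[-1] = out[-1][:-1] + stripped  # rejoin hyphenated word
        else:
            out.append(stripped)
    return " ".join(out)


# One scanned page: header, two body lines with a hyphen break, page number.
page = ["CHAPTER ONE", "The insti-", "tutional data", "42"]
text = reassemble(page)
```

A real pipeline would classify lines with a trained detector rather than string heuristics, but the reassembly step, deciding per line type whether to drop, join, or keep, has this overall shape.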