Eliana Diodati
This project develops an open dataset and transparent methodology to identify research institutes in bibliographic databases and classify them as public, private, or hybrid, using large language models (LLMs) and retrieval-augmented generation (RAG) over Wikipedia and official webpages.
Fellows program
| Name | Affiliation | Research Topic | Cohort |
|---|---|---|---|
| Guilherme Junqueira | University of Florida, Finance | A new comprehensive dataset documenting financing sources for young innovative firms. | 2025 |
| Kyoungah Noh | University at Albany SUNY, Economics | A comprehensive dataset of standardized priority dates for patents. | 2025 |
| Laura Shupp | MIT Sloan | A comprehensive patent dataset covering the Middle East and North Africa. | 2025 |
| Matthew Lee Chen | Harvard Economics | Categorization of citations as “deep” or “shallow” for a broad corpus of 19th- and 20th-century British and American scientific articles and patents. | 2025 |
| Mihai Codreanu | Stanford Economics | A database of electronics products using 20th-century historical data, matched to patents via LLMs. | 2025 |
| Rebekah Dix | MIT Economics | A dataset of combination innovations in medicine using LLMs and clinical trials data. | 2025 |
| Tianshu Lyu | Yale School of Management | A new matched dataset of consumer products with patents using detailed product-level data and topic modeling techniques. | 2025 |
| Alexander Kann | University of Mannheim | Alexander developed a sophisticated classifier using BERT (Bidirectional Encoder Representations from Transformers) to identify and match defensive disclosures with their corresponding patent technology classifications. | 2024 |
| Bernardo Dionisi | Duke University | Bernardo created Pydrad, an open-source Python package designed to streamline the construction, transformation, combination, and comparison of innovation datasets. | 2024 |
| Matteo Tranchero | UC Berkeley | Matteo utilized BioBERT to develop a novel innovation impact metric based on “knowledge entities” rather than traditional citation counts. | 2024 |
| Maya Durvasula | Stanford University | Maya applied large language models to integrate three critical data sources: clinical trial records from ClinicalTrials.gov, scientific publications, and FDA approval data. | 2024 |
| Saqib Mumtaz | UC Berkeley | Saqib’s research connects scientific publications with their media coverage by linking EurekAlert! press release data to OpenAlex author information. | 2024 |