i3 Fellows

Fellows program

Current Fellows (2026)

Eliana Diodati

This project develops an open dataset and transparent methodology to identify research institutes in bibliographic databases and classify them as public, private, or hybrid using LLMs and RAG on Wikipedia and official webpages.

Georgi Demirev

Northwestern

Georgi will build and continuously update a public dataset of corporate product launch announcements by scraping press releases (e.g., PRNewswire) and classifying them with a fine-tuned transformer model.

Marco Panuzi

Stanford

Using LLMs and clustering, this project will extract and categorize author-action sentences (e.g., “we propose an estimator”) from millions of paper abstracts to create the first large-scale dataset of fine-grained scientific tasks.

Piyasha Majumdar

Virginia Tech

The project will digitize and standardize all historical Indian patents from 1900 to 2005, link them to UK and US counterparts, and classify them using the CPC system.

Randol Yao

MIT

Randol will integrate structured Traditional Chinese Medicine data (herbs, compounds, formulas from TCM Bank) with modern drug, patent, and clinical-trial databases, standardizing names and mapping ancient indications to MeSH/ICD codes.

Taoyu Long

University of Georgia

The project will produce and maintain an open patent-ID–firm–month panel of in-force U.S. patents for Compustat firms, correctly accounting for subsidiary patents, assignments, and maintenance-fee events.

Yanuo Zhou

University of Toronto

LLM_AuditKit is an open-source Python package that audits large language models for embedded producer-specific biases that trade off factual accuracy for perceived harmlessness.

Yujing Huang

UCLA

Yujing will create an open, reproducible dataset tracking Chinese scientists trained in the U.S. (1911–1953) and their return to China, linking historical student directories, biographical records, and modern bibliometric data (OpenAlex, CNKI).

Alumni

Name	Affiliation	Research Topic	Cohort
Guilherme Junqueira	University of Florida, Finance	A new comprehensive dataset documenting financing sources for young innovative firms.	2025
Kyoungah Noh	University at Albany SUNY, Economics	A comprehensive dataset of standardized priority dates for patents.	2025
Laura Shupp	MIT Sloan	A comprehensive patent dataset covering the Middle East and North Africa.	2025
Matthew Lee Chen	Harvard Economics	Categorization of citations as “deep” or “shallow” for a broad corpus of 19th- and 20th-century British and American scientific articles and patents.	2025
Mihai Codreanu	Stanford Economics	A database of electronics products using 20th-century historical data, matched to patents via LLMs.	2025
Rebekah Dix	MIT Economics	A dataset of combination innovations in medicine using LLMs and clinical trials data.	2025
Tianshu Lyu	Yale School of Management	A new matched dataset of consumer products with patents using detailed product-level data and topic modeling techniques.	2025
Alexander Kann	University of Mannheim	Alexander developed a sophisticated classifier using BERT (Bidirectional Encoder Representations from Transformers) to identify and match defensive disclosures with their corresponding patent technology classifications.	2024
Bernardo Dionisi	Duke University	Bernardo created Pydrad, an open-source Python package designed to streamline the construction, transformation, combination, and comparison of innovation datasets.	2024
Matteo Tranchero	UC Berkeley	Matteo used BioBERT to develop a novel innovation impact metric based on “knowledge entities” rather than traditional citation counts.	2024
Maya Durvasula	Stanford University	Maya applied large language models to integrate three critical data sources: clinical trial records from ClinicalTrials.gov, scientific publications, and FDA approval data.	2024
Saqib Mumtaz	UC Berkeley	Saqib’s research connects scientific publications with their media coverage by linking EurekAlert! press release data to OpenAlex author information.	2024