Neha Hulkund | MIT CSAIL

About Me

I'm a second-year AI/ML PhD student at MIT CSAIL. I work on data curation in pre- and post-training of machine learning systems, with a focus on practical deployment.

Previously, I was a master's student at MIT CSAIL in the HealthyML Lab, advised by Prof. Marzyeh Ghassemi, and a visiting student at ETH Zurich, hosted by Prof. Fanny Yang. I also went to MIT for undergrad, where I double-majored in computer science and mathematics (Course 6 & 18) with a concentration in Ancient and Medieval Studies. I've had the pleasure of interning at Microsoft Research (×2), Apple Research, and Amadeus.

My research interests include:

Data Curation
Instruction Selection
Coreset Methods
Pre- & Post-Training
Distribution Shift
Optimal Transport

Publications

* denotes equal contribution

Preprints

DataS³: Dataset Subset Selection for Specialization
Neha Hulkund, Alaa Maalouf, ..., Sara Beery (15+ authors)

A large-scale study of dataset curation methods for specializing foundation models, benchmarking subset selection algorithms across domains and modalities.

Preprint arxiv

A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)
Nihal Nayak, Paula Rodriguez-Diaz, Neha Hulkund, Sara Beery, David Alvarez-Melis

Identifies which factors in instruction data selection actually drive LLM fine-tuning performance, separating signal from noise across commonly used selection heuristics.

Preprint arxiv

Conference & Journal Papers

Predicting Out-of-Domain Generalization with Local Manifold Smoothness
Nathan Ng, Neha Hulkund, Kyunghyun Cho, Marzyeh Ghassemi

Proposes a geometry-based metric, local manifold smoothness, that predicts how well a model will generalize to out-of-distribution data without requiring target-domain labels.

TMLR paper

Workshop Papers

Exploration into Gradient-Based Coreset Methods for Targeted Subset Selection
Evelyn Zhu, Neha Hulkund, Sara Beery

Evaluates gradient-based coreset selection methods for curating task-specific training subsets, with analysis of their scalability and effectiveness for targeted fine-tuning.

ICLR Data-FM Workshop

Interpretable Distribution Shift Detection using Optimal Transport
Neha Hulkund, Nicolo Fusi, Jennifer Wortman Vaughan, David Alvarez-Melis

Uses optimal transport distances to detect and explain distribution shift in a human-interpretable way, identifying which features are responsible for the shift.

ICML 2022 DataPerf Workshop arxiv

The Limits of Algorithmic Stability for Robustness to Distribution Shift
Neha Hulkund, Vinith Suriyakumar, Taylor Killian, Marzyeh Ghassemi

Provides theoretical bounds showing that algorithmic stability alone is insufficient to guarantee robustness under distribution shift, motivating data-centric approaches.

NeurIPS 2022 WiML Workshop pdf poster

GAN-based Data Augmentation for Chest X-ray Classification
Shobhita Sundaram*, Neha Hulkund*

Shows that GAN-generated synthetic chest X-rays improve classifier performance in low-data regimes, with analysis of which augmentation strategies transfer most effectively.

Spotlight KDD 2021 DSHealth Workshop arxiv