About Me

I'm a second-year AI/ML PhD student at MIT CSAIL. I work on data curation for pre- and post-training of machine learning systems, with a focus on practical deployment.

Previously, I was a master's student at MIT CSAIL in the HealthyML Lab, advised by Prof. Marzyeh Ghassemi, and a visiting student at ETH Zurich, hosted by Prof. Fanny Yang. I also went to MIT for undergrad, where I double-majored in computer science and mathematics (Course 6 & 18) with a concentration in Ancient and Medieval Studies. I've had the pleasure of interning at Microsoft Research (×2), Apple Research, and Amadeus.

My research interests include:

  • Data Curation
  • Instruction Selection
  • Coreset Methods
  • Pre- & Post-Training
  • Distribution Shift
  • Optimal Transport

Publications

* denotes equal contribution

Preprints

DataS³: Dataset Subset Selection for Specialization
Neha Hulkund, Alaa Maalouf, ..., Sara Beery (15+ authors)

A large-scale study of dataset curation methods for specializing foundation models, benchmarking subset selection algorithms across domains and modalities.

Preprint arxiv
A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)
Nihal Nayak, Paula Rodriguez-Diaz, Neha Hulkund, Sara Beery, David Alvarez-Melis

Identifies which factors in instruction data selection actually drive LLM fine-tuning performance, separating signal from noise across commonly used selection heuristics.

Preprint arxiv

Conference & Journal Papers

Predicting Out-of-Domain Generalization with Local Manifold Smoothness
Nathan Ng, Neha Hulkund, Kyunghyun Cho, Marzyeh Ghassemi

Proposes a geometry-based metric, local manifold smoothness, that predicts how well a model will generalize to out-of-distribution data without requiring target-domain labels.

TMLR paper

Workshop Papers

Exploration into Gradient-Based Coreset Methods for Targeted Subset Selection
Evelyn Zhu, Neha Hulkund, Sara Beery

Evaluates gradient-based coreset selection methods for curating task-specific training subsets, with analysis of their scalability and effectiveness for targeted fine-tuning.

ICLR Data-FM Workshop
Interpretable Distribution Shift Detection using Optimal Transport
Neha Hulkund, Nicolo Fusi, Jennifer Wortman Vaughan, David Alvarez-Melis

Uses optimal transport distances to detect and explain distribution shift in a human-interpretable way, identifying which features are responsible for the shift.

ICML 2022 DataPerf Workshop arxiv
The Limits of Algorithmic Stability for Robustness to Distribution Shift
Neha Hulkund, Vinith Suriyakumar, Taylor Killian, Marzyeh Ghassemi

Provides theoretical bounds showing that algorithmic stability alone is insufficient to guarantee robustness under distribution shift, motivating data-centric approaches.

NeurIPS 2022 WiML Workshop pdf poster
GAN-based Data Augmentation for Chest X-ray Classification
Shobhita Sundaram*, Neha Hulkund*

Shows that GAN-generated synthetic chest X-rays improve classifier performance in low-data regimes, with analysis of which augmentation strategies transfer most effectively.

Spotlight KDD 2021 DSHealth Workshop arxiv