Research

Building Malleable Machine Learning (ML) Systems

September 22, 2021
4 min read

Defining and Building Malleable ML Systems – Machine Learning Whiteboard (MLW) Open-Source Series

As you may know, earlier this year, we started our machine learning whiteboard (MLW) series, an open-invite space to brainstorm ideas and discuss the latest papers, techniques, and workflows in the AI space. We emphasize an informal and open environment to everyone interested in learning about machine learning. In this episode, Karan Goel, a Ph.D. student at Stanford working on research, machine learning, and systems, walks through some of the big-picture questions around making machine learning models more robust and easier to use, specifically diving into defining and building malleable ML systems, with “Robustness Gym: Unifying the NLP Evaluation Landscape,” a research paper by Karan Goel, Nazneen Rajani, Jesse Vig, Samson Tan, Jason Wu, Stephan Zheng, Caiming Xiong, Mohit Bansal, and Christopher Ré, along with “Mandoline: Model Evaluation under Distribution Shift,” authored by Mayee Chen, Karan Goel, Nimit Sohoni, Fait Poms, Kayvon Fatahalian, Christopher Ré, presented at ICML 2021.This episode is part of the #MLwhiteboard video series hosted by Snorkel AI. Check out the episode here:

Paper Abstracts:

Robustness Gym: Unifying the NLP Evaluation Landscape

Despite impressive performance on standard benchmarks, deep neural networks are often brittle when deployed in real-world systems. Consequently, recent research has focused on testing the robustness of such models, resulting in a diverse set of evaluation methodologies ranging from adversarial attacks to rule-based data transformations. In this work, we identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies four standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks. By providing a common platform for evaluation, Robustness Gym enables practitioners to compare results from all four evaluation paradigms with just a few clicks and to easily develop and share novel evaluation methods using a built-in set of abstractions. To validate Robustness Gym’s utility to practitioners, we conducted a real-world case study with a sentiment-modeling team, revealing performance degradations of 18%+.To verify that Robustness Gym can aid novel research analyses, we perform the first study of state-of-the-art commercial and academic named entity linking (NEL) systems, as well as a fine-grained analysis of state-of-the-art summarization models. For NEL, commercial systems struggle to link rare entities and lag their academic counterparts by 10%+, while state-of-the-art summarization models struggle on examples that require abstraction and distillation, degrading by 9%+. For more information, please visit Robustness Gym.

Mandoline: Model Evaluation under Distribution Shift

Machine learning models are often deployed in different settings than they were trained and validated on, posing a challenge to practitioners who wish to predict how well the deployed model will perform on a target distribution. If an unlabeled sample from the target distribution is available, along with a labeled sample from a possibly different source distribution, standard approaches such as importance weighting can be applied to estimate performance on the target. However, importance weighting struggles when the source and target distributions have non-overlapping support or are high-dimensional. Taking inspiration from fields such as epidemiology and polling, we develop Mandoline, a new evaluation framework that mitigates these issues. Our key insight is that practitioners may have prior knowledge about the ways in which the distribution shifts, which we can use to guide the importance weighting procedure better. Specifically, users write simple “slicing functions” – noisy, potentially correlated binary functions intended to capture possible axes of distribution shift – to compute reweighted performance estimates. We further describe a density ratio estimation framework for the slices and show how its estimation error scales with slice quality and dataset size. Empirical validation on NLP and vision tasks shows that \name can estimate performance on the target distribution up to 3x more accurately compared to standard baselines.Where to connect with Karan: TwitterLinkedinGithub.


If you are interested in learning with us, consider joining us at our biweekly ML whiteboard.Stay in touch with Snorkel AI, follow us on TwitterLinkedInFacebookYoutube, or Instagram, and if you’re interested in joining the Snorkel team, we’re hiring! Please apply on our careers page.

Share this article

Recommended articles

View all articles
agentic-in-action
The Standard for Agents You Can Trust: Lessons from the Federal Front Lines
In the first installment of Agentic in Action — a series about real AI deployments, not demos — Snorkel AI’s Kevin Olivieri sat down with three people who have spent their careers where trust isn’t optional: Chris Sniffen, Federal Applied AI Lead at Snorkel AI; John Hickey, President of August Schell; and Mike Baca, CIO of August Schell. The conversation focused on
June 5, 2026
Snorkel Team
collab-gym-thumbnail
Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration
At our latest Snorkel AI Reading Group, Yijia Shao (Stanford NLP) stopped by our San Francisco office to present Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration. As LLM agents get better at automating tasks on their own, a large class of real-world problems still needs a human in the loop – for their preferences, their domain expertise, or simply for control.
June 4, 2026
Alexis Sobel
Image
Benchtalks #2: The future of coding benchmarks
For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench. Highlights More on ProgramBench: See the benchmark and the upcoming leaderboard at programbench.com. More from John Yang: Publications and writing at john-b-yang.github.io. Snorkel
June 3, 2026
Vincent Sunn Chen
Image

Join our newsletter

For expert advice, the latest research, and exclusive events.
By submitting this form, I acknowledge I will receive email updates from Snorkel AI, and I agree to the Terms of Use and acknowledge that my information will be used in accordance with the Privacy Policy.