Yanda Chen
@yanda_chen_
Member of Technical Staff @AnthropicAI CodeRL/Alignment | PhD @ColumbiaCompSci | NLP & ML | Prev Intern @MSFTResearch, @AmazonScience
San Francisco, CA
Joined January 2019
Posts
  • user avatar
    My first paper @AnthropicAI is out! We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing challenges for safety monitoring. It’s been an incredible 6 months pushing the frontier toward safe AGI with brilliant colleagues. Huge thanks to the team! 🙏
    New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
    Title card for the paper "Reasoning Models Don't Always Say What They Think", by Chen et al.
  • user avatar
    Life update: I’m excited to share that I’m joining the Alignment Science team at @AnthropicAI as a Member of Technical Staff/Research Scientist. I’ll be focusing on AI safety. Looking forward to it!
  • user avatar
    [1/6] Have you noticed that slightly changing your prompt makes your favorite language model produce different outputs? Curious why, and how to capitalize on such sensitivity? Check out our work “On the Relation between Sensitivity and Accuracy in In-context Learning”.
  • user avatar
    [1/9] Large Language Models (LLMs) can mimic humans to explain human decisions. But can they explain THEMSELVES? How can we evaluate explanations along this axis? Check out our work “Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations”!
  • user avatar
    [1/7] Pre-trained LMs can do in-context learning, but this is unexpected given the distribution shift between pre-training data and ICL prompts. What structures of pre-training data yield ICL? Check out our work “Parallel Structures in Pre-training Data Yield In-Context Learning”
  • user avatar
    [1/8] Large LMs (e.g., GPT-3) are good at few-shot learning. But prompted LMs exhibit artifacts like oversensitivity to example order/choice & instruction wording.☹️ Our work “Meta-learning via LM In-context Tuning” proposes a fix: meta-training LMs to learn in context!😃
  • user avatar
    Come check out our ACL poster today at Session B (14:00-15:30). We study how in-context learning emerges, and find it emerges from parallel structures in the pre-training data.
  • user avatar
    I'll be at ICML and presenting our spotlight paper, "Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations," on Wednesday, 7/24 11:30 AM - 1:00 PM CEST. Please check it out! Looking forward to discussing explainability, LLMs, alignment, etc.
  • user avatar
    --Repost due to accidental deletion-- [1/n] Large Language Models can generate fluent explanations, but their explanations are often INCONSISTENT on different inputs. How to improve their consistency? Check out our work “Towards Consistent Natural-Language Explanations via
  • user avatar
    Check out our #ACL2021NLP paper "Cross-language Sentence Selection via Data Augmentation and Rationale Training" in Session 11D at 12-1pm (EDT) Tuesday Aug 3, 2021! We present an effective approach for cross-language sentence retrieval in low-resource settings.
  • user avatar
    Replying to @yanda_chen_
    [5/6] Experiments show that sensitivity is a strong signal for selective prediction, as SenSel consistently outperforms the MaxProb baseline by up to 5.1 AUC pts. We argue that ICL sensitivity is not merely an isolated artifact, but reflects how confidently the LM learns a task.
  • user avatar
    Replying to @yanda_chen_
    [3/9] We propose to evaluate the counterfactual simulatability of natural language explanations: whether an explanation can enable humans to precisely infer the model's outputs on diverse counterfactuals of the explained input.
  • user avatar
    Replying to @yanda_chen_
    [4/8] Unlike MAML and multi-task fine-tuning, which adapt to new tasks by gradient descent on task examples, in-context tuning keeps the LM weights frozen during task adaptation. Our approach removes both fine-tuning during adaptation and the nested optimization of meta-training.