Tanya Goyal
@tanyaagoyal
Faculty @Cornell_CS. she/her
Ithaca, NY
Born July 9
Joined September 2019
Posts
  •
    ✨New preprint✨ Zero-shot GPT-3 does *better* at news summarization than any of our fine-tuned models. Humans like these summaries better. But all of our metrics think they’re MUCH worse. Work w/ @jessyjli, @gregd_nlp. Check it out here: arxiv.org/abs/2209.12356 [1/6]
  •
    Replying to @tanyaagoyal
    I will join Cornell CS @cs_cornell as an assistant professor in Fall 2024 after spending a year at @princeton_nlp working on all things language models. I will be posting about open positions in my lab soon but you can read about my research here: tagoyal.github.io
  •
    I successfully defended my PhD @UTCompSci last week. A BIG thank you to my advisor @gregd_nlp and mentor @jessyjli for being incredibly supportive throughout my PhD, and esp. over the last few months on the job market! Next, ... 1/2
  •
    Excited to share SNaC (Summary Narrative Coherence), a dataset of 9.6K error annotations across 150 long summaries + a data collection framework for fine-grained coherence errors in summarization. arxiv.org/abs/2205.09641 (work w/ @jessyjli @gregd_nlp)
  •
    Replying to @tanyaagoyal
    What does this mean for evaluation? All our metrics (ROUGE + factuality work from the last 2+ years, etc.) fail to evaluate GPT-3 summaries that look nothing like past generated or reference summaries of standard datasets! We need to rethink automatic evaluation going forward!
  •
    New work "Neural Syntactic Preordering for Controlled Paraphrase Generation" (with @gregd_nlp) at #acl2020nlp! Basic Idea: Break paraphrasing into 2 steps: "soft" reordering of input (like preordering in MT) followed by rearrangement aware paraphrasing arxiv.org/abs/2005.02013 1/
  •
    Our paper "Evaluating Factuality in Generation with Dependency-level Entailment" (w/ @gregd_nlp) to appear in Findings of #EMNLP2020! arxiv.org/abs/2010.05478 We decompose the sen-level factuality into entailment evaluation of smaller units (dependency arcs) of the hypotheses
  •
    Presenting our work (w/ @JiachengNLP @jessyjli @gregd_nlp) “Training Dynamics for Text Summarization Models” at #acl2022 Findings: aclanthology.org/2022.findings-… In-person PS5-3 Summarization: 3:15p, May 24 (Dublin time) Virtual PS3 Summarization: 7.30a, May 25 (Dublin time)
  •
    Excited to present our work on multi-decoder summarization models "HydraSum" later this week at #EMNLP! work w/ @nazneenrajani @owenhaoliu & @iam_wkr during my Salesforce internship! arxiv.org/abs/2110.04400 I will present this in person on 9th Dec, 11am in the summ session 🧵 1/
    [Image: HydraSum training and inference. During training, the single decoder is replaced by 2 decoders in a mixture-of-experts. During inference, you can sample from individual decoders or their mixture (the gate can even be specified manually).]
  •
    Replying to @tanyaagoyal
    We collect human preference annotations for news summaries generated by current SOTA and zero-shot GPT-3 models. For multiple settings (generic + keyword) and datasets (CNN + BBC), GPT-3 summaries beat prior fine-tuned models! [2/6]
  •
    I will be presenting this (+ Hydrasum tinyurl.com/hydrasum), both in the summarization oral session at 11 am today. Come say hi!
    Excited to share SNaC (Summary Narrative Coherence), a dataset of 9.6K error annotations across 150 long summaries + a data collection framework for fine-grained coherence errors in summarization. arxiv.org/abs/2205.09641 (work w/ @jessyjli @gregd_nlp)
  •
    Replying to @tanyaagoyal
    Browse examples of generated summaries and human annotations at: tagoyal.github.io/zeroshot-news-… [6/6]
  •
    Replying to @tanyaagoyal
    Furthermore, GPT-3 can emulate multiple different styles and is keyword controllable. It does great in all the settings we tested it in, and doesn’t present the kinds of factual errors we’ve seen in the literature. [3/6]
  •
    Replying to @tanyaagoyal
    This also means we can now break away from noisy benchmark datasets, e.g. XSum, that (we observe) cannot produce systems for real settings. Instead, actual use cases and not data availability can now dictate future research directions (task goals, domains, etc.) [4/6]