✨New preprint✨
Zero-shot GPT-3 does *better* at news summarization than any of our fine-tuned models. Humans like these summaries better. But all of our metrics think they’re MUCH worse.
Work w/ @jessyjli, @gregd_nlp. Check it out here: arxiv.org/abs/2209.12356
[1/6]
Tanya Goyal
- Replying to @tanyaagoyal: I will join Cornell CS @cs_cornell as an assistant professor in Fall 2024 after spending a year at @princeton_nlp working on all things language models. I will be posting about open positions in my lab soon, but you can read about my research here: tagoyal.github.io
- I successfully defended my PhD @UTCompSci last week. A BIG thank you to my advisor @gregd_nlp and mentor @jessyjli for being incredibly supportive throughout my PhD, and esp. over the last few months on the job market! Next, ... 1/2
- Excited to share SNaC (Summary Narrative Coherence), a dataset of 9.6K error annotations across 150 long summaries + a data collection framework for fine-grained coherence errors in summarization. arxiv.org/abs/2205.09641 (work w/ @jessyjli @gregd_nlp)
- Replying to @tanyaagoyal: What does this mean for evaluation? All our metrics (ROUGE + factuality work from the last 2+ years, etc.) fail to evaluate GPT-3 summaries that look nothing like past generated or reference summaries of standard datasets! We need to rethink automatic evaluation going forward!
- New work "Neural Syntactic Preordering for Controlled Paraphrase Generation" (with @gregd_nlp) at #acl2020nlp! Basic idea: break paraphrasing into 2 steps: "soft" reordering of the input (like preordering in MT) followed by rearrangement-aware paraphrasing. arxiv.org/abs/2005.02013 1/
- Our paper "Evaluating Factuality in Generation with Dependency-level Entailment" (w/ @gregd_nlp) to appear in Findings of #EMNLP2020! arxiv.org/abs/2010.05478 We decompose sentence-level factuality into entailment evaluation of smaller units (dependency arcs) of the hypotheses.
- Presenting our work (w/ @JiachengNLP @jessyjli @gregd_nlp) “Training Dynamics for Text Summarization Models” at #acl2022 Findings: aclanthology.org/2022.findings-… In-person PS5-3 Summarization: 3:15p, May 24 (Dublin time) Virtual PS3 Summarization: 7.30a, May 25 (Dublin time)
- Excited to present our work on multi-decoder summarization models "HydraSum" later this week at #EMNLP! work w/ @nazneenrajani @owenhaoliu & @iam_wkr during my Salesforce internship! arxiv.org/abs/2110.04400 I will present this in person on 9th Dec, 11am in the summ session 🧵 1/
- Replying to @tanyaagoyal: We collect human preference annotations for news summaries generated by current SOTA and zero-shot GPT-3 models. For multiple settings (generic + keyword) and datasets (CNN + BBC), GPT-3 summaries beat prior fine-tuned models! [2/6]
- I will be presenting this (+ Hydrasum tinyurl.com/hydrasum), both in the summarization oral session at 11 am today. Come say hi! (Quoted post: "Excited to share SNaC (Summary Narrative Coherence), a dataset of 9.6K error annotations across 150 long summaries + a data collection framework for fine-grained coherence errors in summarization. arxiv.org/abs/2205.09641 (work w/ @jessyjli @gregd_nlp)")
- Replying to @tanyaagoyal: Browse examples of generated summaries and human annotations at: tagoyal.github.io/zeroshot-news-… [6/6]
- Replying to @tanyaagoyal: Furthermore, GPT-3 can emulate multiple different styles and is keyword controllable. It does great in all the settings we tested it in, and doesn’t present the kinds of factual errors we’ve seen in the literature. [3/6]
- Replying to @tanyaagoyal: This also means we can now break away from noisy benchmark datasets, e.g. XSum, that (we observe) cannot produce systems for real settings. Instead, actual use cases and not data availability can now dictate future research directions (task goals, domains, etc.) [4/6]