Tom McCoy
@RTomMcCoy
Assistant professor @YaleLinguistics. Studying computational linguistics, cognitive science, and AI. He/him.
New Haven, CT
Joined December 2018
  • Pinned
    🤖🧠NOW OUT IN PNAS🧠🤖 Language models show many surprising behaviors. E.g., they can count 30 items more easily than 29. In Embers of Autoregression, we explain such effects by analyzing what LMs are trained to do: pnas.org/doi/10.1073/pn… Major updates since the preprint! 1/n
    At the top is the title of the paper: "Embers of autoregression show how large language models are shaped by the problem they are trained to solve". Below on the left is a screenshot of ChatGPT being asked to count how many words are in a list. The correct answer is 29, but it says 30. Next to it is a plot showing ChatGPT's accuracy at counting elements in a list; in general, it does well on multiples of 10 but poorly on other numbers. The explanation offered at the bottom of the image is: In training sets, round numbers are much more common than other numbers.
  •
    🤖🧠NEW PAPER🧠🤖 Language models are so broadly useful that it's easy to forget what they are: next-word prediction systems. Remembering this fact reveals surprising behavioral patterns: 🔥Embers of Autoregression🔥 (counterpart to "Sparks of AGI") arxiv.org/abs/2309.13638 1/8
    The top says: “Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve. By R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L. Griffiths.”

The bottom left shows a ClipArt image of fire. The top of the fire is labeled “Sparks of AGI,” and the bottom is labeled “Embers of autoregression”.

The bottom right shows a box labeled “Shift ciphers” with two examples of GPT-4 responses. First, when asked to shift each letter in a message back by 13, GPT-4 gets the correct answer: “I think everyone has their own path, and they can make it happen.” But when it has to shift back by 8, now GPT-4 incorrectly answers “I think therefore I am the best, and they can come at me with all their might.” The bottom says: “GPT-4 is much better at shifting back 13 letters (accuracy 0.51) than 8 letters (accuracy: 0.00). Explanation: In natural corpora, shifting by 13 is about 400x more common than shifting by 8.”
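For readers unfamiliar with shift ciphers, here is a minimal Python sketch of the decoding task described above. The function names `shift_back` and `shift_forward` are hypothetical and chosen for illustration; this is not code from the paper. Mechanically, shifting back by 13 and shifting back by 8 are the same operation with a different offset, which is why the accuracy gap reported for GPT-4 is notable.

```python
def shift_back(text: str, k: int) -> str:
    """Decode a shift cipher by moving each ASCII letter back k positions.

    Non-letter characters (spaces, punctuation) are left unchanged.
    """
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            # Wrap around the 26-letter alphabet with modular arithmetic.
            out.append(chr((ord(ch) - base - k) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


def shift_forward(text: str, k: int) -> str:
    """Encode by shifting forward k positions (the inverse of shift_back)."""
    return shift_back(text, -k)


message = "I think everyone has their own path, and they can make it happen."

# Encode with each shift, then decode; both round trips recover the message.
for k in (13, 8):
    encoded = shift_forward(message, k)
    decoded = shift_back(encoded, k)
    print(k, decoded == message)
```

A deterministic program handles every shift equally well; the contrast with a language model's uneven accuracy is the point of the example in the image.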
  •
    It has become acceptable for acronyms to use any letters within a word, not just the first letter. E.g., ORNATE = acrOnyms fRom noN-initial chAracTErs. But why stick with whole letters? In my new paradigm CLIP, an acronym can use any curves or line segments from the base phrase!
    At the top of the image is the text "choosing letter-internal parts," written in gray. Inside each word, one letter is capitalized. Each of these capital letters has part of it highlighted in red to form a different letter. For example, the first E in the word "letter" has its vertical line segment and the bottom horizontal line segment highlighted in red to form the letter L. Overall, the letters that are highlighted in this way are the letter O of "choosing" (with the letter C highlighted within it), the letter E of "letter" (with the letter L highlighted within it), the letter T of "internal" (with the letter I highlighted within it), and the letter R of "parts" (with the letter P highlighted within it). Together these sub-letters spell CLIP, which is written at the bottom of the image, with arrows connecting each letter in CLIP to the position it was drawn from in the original phrase "choosing letter-internal parts."
  •
    How am I only learning now that Latvia's prime minister has a PhD in linguistics from Penn?? I've seen many lists of "jobs for linguists outside academia" but they never include Prime Minister of Latvia.
  •
    Linguists: In case you could use a diversion, I've made a phonetic crossword - all the answers must be written in the IPA, one phoneme per square. (Non-linguists: Here's a chance to learn some phonetics!) Puzzle: rtmccoy.com/crosswords/cha… Answers: rtmccoy.com/crosswords/cha…
  •
    🤖🧠NEW PAPER🧠🤖 What explains the dramatic recent progress in AI? The standard answer is scale (more data & compute). But this misses a crucial factor: a new type of computation. Shorter opinion piece: arxiv.org/abs/2205.01128 Longer tutorial: microsoft.com/en-us/research… 1/5
    At the top is a paper title and list of authors. The title is “Neurocompositional computing: From the Central Paradox of Cognition to a new generation of AI systems.” The authors are Paul Smolensky, R. Thomas McCoy, Roland Fernandez, Matthew Goldrick, and Jianfeng Gao. Below are two images. On the left is a sentence (“the door is unlockable”) paired with a vector. On the right is some sheet music paired with a sound wave. The caption for these images reads: A sentence can be “in” a neural network’s vector representations in the same way that music notes are “in” a sound wave. The vector, like the sound wave, is both compositional and continuous.
  •
    🤖🧠NEW PAPER🧠🤖 Bayesian models can learn rapidly. Neural networks can handle messy, naturalistic data. How can we combine these strengths? Our answer: Use meta-learning to distill Bayesian priors into a neural network! Paper: arxiv.org/abs/2305.14701 1/n
    A schematic of our method. On the left are shown Bayesian inference (visualized using Bayes’ rule and a portrait of the Reverend Bayes) and neural networks (visualized as a weight matrix). Then, an arrow labeled “meta-learning” combines Bayesian inference and neural networks into a “prior-trained neural network”, described as a neural network that has the priors of a Bayesian model – visualized as the same portrait of Reverend Bayes but made out of numbers. Finally, an arrow labeled “learning” goes from the prior-trained neural network to two examples of what it can learn: formal languages (visualized with a finite-state automaton) and aspects of English syntax (visualized with a parse tree for the sentence “colorless green ideas sleep furiously”).
  •
    *NEW PREPRINT* Neural-network language models (e.g., GPT-2) can generate high-quality text. Are they simply copying text they have seen before, or do they have generalizable linguistic abilities? Answer: Some of both! Paper: arxiv.org/abs/2111.09509 1/n
    Paper title: “How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN”
Authors: R. Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, Asli Celikyilmaz
Images: A plot showing how often n-grams are novel, for various values of n. When n is less than 5, models are less novel than the baseline. After 5, they are more novel. The image also shows a list of novel words coined by GPT-2, namely IKEA-ness, Smurfverse, quackdom, non-airbender, Brazilianisms, Thirteenthly, bagshare, nonneotropical, hill-elves, and Disqusiquette.
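As a rough illustration of what "how often n-grams are novel" could mean, here is a minimal Python sketch that computes the fraction of n-grams in a generated text that never appear in a training text. This is an assumption-laden toy, not the RAVEN implementation: the function names `ngrams` and `novelty_rate`, the whitespace tokenization, and the tiny example corpus are all hypothetical.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novelty_rate(generated, training, n):
    """Fraction of n-grams in `generated` that do not occur in `training`."""
    seen = set(ngrams(training, n))
    candidates = ngrams(generated, n)
    if not candidates:
        return 0.0  # text shorter than n tokens: nothing to measure
    novel = sum(1 for gram in candidates if gram not in seen)
    return novel / len(candidates)


# Toy data (hypothetical), tokenized by whitespace for simplicity.
training = "the cat sat on the mat".split()
generated = "the cat sat on the rug".split()

for n in (1, 2, 5):
    print(n, novelty_rate(generated, training, n))
```

In this toy example, longer n-grams are more likely to be novel, since a single unseen word makes every n-gram containing it novel.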
  •
    🤖🧠Paper out in Nature Communications! 🧠🤖 Bayesian models can learn rapidly. Neural networks can handle messy, naturalistic data. How can we combine these strengths? Our answer: Use meta-learning to distill Bayesian priors into a neural network! nature.com/articles/s4146… 1/n
    A schematic of our method. On the left are shown Bayesian inference (visualized using Bayes’ rule and a portrait of the Reverend Bayes) and neural networks (visualized as a weight matrix). Then, an arrow labeled “meta-learning” combines Bayesian inference and neural networks into a “prior-trained neural network”, described as a neural network that has the priors of a Bayesian model – visualized as the same portrait of Reverend Bayes but made out of numbers. Finally, an arrow labeled “learning” goes from the prior-trained neural network to two examples of what it can learn: formal languages (visualized with a finite-state automaton) and aspects of English syntax (visualized with a parse tree for the sentence “colorless green ideas sleep furiously”).
  •
    Transformers are the current state of the art, but one day LSTMs may overtake them. That would make LSTMs current again. You could even say…re-current.
  •
    Flying home from #LSA2020? Remember to put your liquids in a separate bag!
  •
    🤖🧠 I'll be considering applications for postdocs & PhD students to start at Yale in Fall 2025! If you are interested in the intersection of linguistics, cognitive science, & AI, I encourage you to apply! Postdoc link: rtmccoy.com/prospective_po… PhD link: rtmccoy.com/prospective_st…
    Top: A syntax tree for the sentence "the doctor by the lawyer saw the artist".

Bottom: A continuous vector.
  •
    Excited to share some updates, which all still feel surreal: - Just defended my dissertation advised by @TalLinzen & @Paul_Smolensky! - Next up: Postdoc w/ Tom Griffiths @cocosci_lab! - Then joining @YaleLinguistics as an asst prof w 2ndary appt @YaleCompsci! A thank-you thread:
  •
    Takeaways from #NeurIPS: 1) In-distribution generalization is out; 2) Out-of-distribution generalization is in; 3) We want compositionality (whatever it is); 4) "GPT-2" is very hard to say