Is OpenAI's o1 a good calculator? We tested it on up to 20x20 multiplication—o1 solves up to 9x9 multiplication with decent accuracy, while gpt-4o struggles beyond 4x4. For context, this task is solvable by a small LM using implicit CoT with stepwise internalization. 1/4
Can we teach LMs to internalize chain-of-thought (CoT) reasoning steps? We found a simple method: start with an LM trained with CoT, gradually remove CoT steps and finetune, forcing the LM to internalize reasoning.
Paper: bit.ly/internalize_st…
Done w/ @YejinChoinka @pmphlt 1/5
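A minimal sketch of the "gradually remove CoT steps" idea, not the paper's implementation. Assumptions: CoT tokens are dropped from the front, a fixed number per stage, and the model is finetuned on the shortened targets at each stage (the finetuning itself is omitted). `removal_schedule` and `build_target` are hypothetical names.

```python
def removal_schedule(cot_len, step):
    """Number of CoT tokens removed at each stage, ending with all removed."""
    schedule = list(range(0, cot_len, step))
    schedule.append(cot_len)  # final stage: no CoT left, answer only
    return schedule


def build_target(question, cot_tokens, answer, n_removed):
    """Training sequence for one stage: question, remaining CoT, answer."""
    kept = cot_tokens[n_removed:]  # assumption: drop from the front
    return question + kept + answer


# Toy example: 12 * 34 = 48 + 360 = 408
question = ["12", "*", "34", "="]
cot = ["48", "+", "360"]
answer = ["408"]

for n_removed in removal_schedule(len(cot), step=1):
    # A real run would finetune the LM on targets built this way
    # before moving to the next stage.
    print(n_removed, build_target(question, cot, answer, n_removed))
```

By the last stage the target holds only the question and the answer, so any intermediate work has to happen in the hidden states.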
For those curious about how o3-mini performs on multi-digit multiplication, here's the result. It does much better than o1 but still struggles past 13×13. (Same evaluation setup as before, but with 40 test examples per cell.)
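A rough sketch of how a per-cell accuracy grid like this could be computed, assuming exact-match scoring on random operands. `solve` is a hypothetical stand-in for whatever calls the model and parses its answer; the actual prompts and parsing are not described in the thread.

```python
import random


def random_n_digit(n, rng):
    """Uniform random integer with exactly n digits (no leading zero)."""
    return rng.randint(10 ** (n - 1), 10 ** n - 1)


def accuracy_grid(solve, max_digits, per_cell=40, seed=0):
    """Exact-match accuracy for every (m digits) x (n digits) cell."""
    rng = random.Random(seed)
    grid = {}
    for m in range(1, max_digits + 1):
        for n in range(1, max_digits + 1):
            correct = 0
            for _ in range(per_cell):
                a = random_n_digit(m, rng)
                b = random_n_digit(n, rng)
                if solve(a, b) == a * b:
                    correct += 1
            grid[(m, n)] = correct / per_cell
    return grid


# Sanity check with a perfect "model": every cell should be 1.0.
perfect = accuracy_grid(lambda a, b: a * b, max_digits=3)
print(perfect[(3, 3)])
```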
We trained GPT2 to predict the product of two numbers up to 🌟20🌟 digits w/o intermediate reasoning steps, surpassing our previous 15-digit demo! How does a 12-layer LM solve 20-digit multiplication w/o CoT?🤯
Try our demo: huggingface.co/spaces/yuntian…
Paper: bit.ly/internalize_st…
Lastly, this task is solvable even by a small language model: Implicit CoT with Stepwise Internalization can solve up to 20x20 multiplication with 99.5% accuracy, using a gpt-2 small architecture (117M parameters). 4/4
x.com/yuntiandeng/st…
Today I learned a student of mine from China gave up waiting for his Canadian visa after over a year without updates:
1. He was a Vector Scholarship awardee.
2. He had to set aside $20K under the Direct Stream (for faster visa processing), despite being my funded student.
3. He
I am hiring NLP/ML PhD students at UWaterloo, home to 5 NLP professors! Apply by Dec 1
Strong consideration will be given to those who can tackle the challenge below: Can we use an LM's hidden states to reason about multiple problems simultaneously?
Retweets/shares appreciated🥰
Can LMs solve reasoning tasks without showing their work? "Implicit Chain of Thought Reasoning via Knowledge Distillation" teaches LMs to reason internally to solve tasks like 5×5 multiplication. Here's how we bypass human-like step-by-step reasoning bit.ly/implicitCoT 1/6
How many reasoning tokens does OpenAI o1 use? It turns out they are almost always multiples of 64 (99+% of the time in 100K collected turns)🤔Could it be that the model only uses multiples of 64 tokens to think? Or maybe OpenAI rounds the token count in the returned usage? 1/4
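A small sketch of the check behind a claim like this: given the reasoning-token counts reported per turn, compute the share that are exact multiples of 64. The sample counts below are made up for illustration and are not from the collected data.

```python
def fraction_multiple_of(counts, base=64):
    """Share of token counts that are exact multiples of `base`."""
    if not counts:
        raise ValueError("counts must be non-empty")
    hits = sum(1 for c in counts if c % base == 0)
    return hits / len(counts)


# Illustrative values only (hypothetical, not real measurements).
sample_counts = [64, 128, 192, 200, 3584]
print(fraction_multiple_of(sample_counts))
```

Note that this check cannot distinguish between the two explanations in the tweet: a model that only thinks in blocks of 64 tokens and a usage field that is rounded would produce the same counts.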
Interestingly, the number of private reasoning tokens grows sublinearly with problem size, but is beyond what human-written CoT requires. For example, for 20x20, o1 uses ~3600 reasoning tokens, but human CoT needs ~400 for partial products and ~400 for sums, totaling ~800. 2/4
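A quick sanity check of the arithmetic above. The partial-product estimate assumes one token per digit, which is an assumption on my part; the thread does not say how tokens were counted. `partial_product_digits` is a hypothetical helper.

```python
def partial_product_digits(a, b):
    """Total digits across the one-digit partial products of a * b."""
    return sum(len(str(a * int(d))) for d in str(b))


# Each of the 20 partial products of a 20-digit number has 20 or 21 digits
# (unless the multiplier digit is 0), so the total lands near 400.
low = partial_product_digits(10 ** 19, int("1" * 20))
high = partial_product_digits(int("9" * 20), int("9" * 20))
print(low, high)

# Figures quoted in the tweet.
human_cot_tokens = 400 + 400   # partial products + sums
o1_reasoning_tokens = 3600
print(o1_reasoning_tokens / human_cot_tokens)
```

On these numbers o1 spends roughly 4.5 times as many reasoning tokens as the human-style CoT for 20x20.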
Excited to share that I'm joining @UWCheritonCS as an Assistant Professor and @VectorInst as a Faculty Affiliate in Fall '24. Before that, I'm doing a postdoc at @allen_ai with @YejinChoinka. Immensely grateful to my PhD advisors @srush_nlp and @pmphlt. This journey wouldn't have
Ever wondered how nondeterministic GPT-4 is even with greedy decoding (T=0)? I built a website that asks GPT-4 to draw a unicorn every hour and tracks if the results stay consistent over time (spoiler alert: they don't! 🦄).
Explore the findings:
openaiwatch.com
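A minimal sketch of how consistency over time could be tracked, assuming the hourly responses are stored as strings. This is not how openaiwatch.com is implemented; `fingerprint`, `count_distinct`, and `change_points` are hypothetical helpers.

```python
import hashlib


def fingerprint(text):
    """Stable hash of a response, so identical outputs compare equal."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def count_distinct(outputs):
    """Number of different responses seen across all runs."""
    return len({fingerprint(o) for o in outputs})


def change_points(outputs):
    """Indices where the response differs from the previous run."""
    prints = [fingerprint(o) for o in outputs]
    return [i for i in range(1, len(prints)) if prints[i] != prints[i - 1]]


# Placeholder strings standing in for hourly model outputs.
hourly = ["<svg>A</svg>", "<svg>A</svg>", "<svg>B</svg>", "<svg>A</svg>"]
print(count_distinct(hourly), change_points(hourly))
```

With fully deterministic decoding, `count_distinct` would stay at 1 and `change_points` would stay empty.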
🚀New dataset release: WildChat-4.8M
4.8M real user-ChatGPT conversations collected from our public chatbots:
- 122K from reasoning models (o1-preview, o1-mini): these capture real in-the-wild usage and were very costly to collect
- 2.5M from GPT-4o
🔗 hf.co/datasets/allen… (1/4)
Thrilled to see WildChat featured by @_akhaliq, just as predicted by AKSelectionPredictor!😊
Explore 1 million user-ChatGPT conversations, plus details like country, state, timestamp, hashed IP, and request headers here:
huggingface.co/datasets/allen…