Mehrdad Farajtabar (@MFarajtabar) / X

252 posts

Mehrdad Farajtabar

@MFarajtabar

Research Scientist at @Apple, prev @DeepMind, prev @GeorgiaTech

Seattle Area

sites.google.com/view/mehrdad

Joined January 2021

Pinned
Mehrdad Farajtabar
@MFarajtabar
May 12
🧵 1/11 Everyone's doing on-policy distillation now (Qwen3, Deepseek V4, GLM-5). But here's what nobody's asking: at any given token or for a question and a teacher, when does the teacher's guidance actually help, and when does it quietly make things worse? We found a way to
30K
Mehrdad Farajtabar
@MFarajtabar
Oct 10, 2024
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models like Llama, Phi, Gemma, and Mistral and leading closed models, including the
1.6M
Mehrdad Farajtabar
@MFarajtabar
Jun 5, 2025
🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching? The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks,
908K
Mehrdad Farajtabar
@MFarajtabar
Oct 10, 2024
Replying to @MFarajtabar
13/ Overall, we found no evidence of formal reasoning in language models including open-source models like #Llama, #Phi, #Gemma, and #Mistral and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series. Their behavior is better explained by sophisticated
124K
Mehrdad Farajtabar
@MFarajtabar
Sep 27, 2022
Our team at Apple is looking for interns to work on Continual/Lifelong/Transfer Learning, Multi-Modal Large Models, ML Efficiency, and likewise. The position is available as early as next month and the duration is >6 months. Feel free to send your resumes to m_farajtabar@apple
Mehrdad Farajtabar
@MFarajtabar
Nov 15, 2023
My team at #Apple is looking for interns to work on Large Language Models (#LLM) especially on efficient "inference" and training. Please email your CV and highlighted related research or codes to m_farajtabarATappleDOTcom. The ideal candidate must:
208K
Mehrdad Farajtabar
@MFarajtabar
Oct 10, 2024
Replying to @MFarajtabar
5/ #Result 2: The fragility of supposed LLM reasoning. LLMs remain sensitive to changes in proper names (e.g., people, foods, objects), and even more so when numbers are altered. Would a grade-school student's math test score vary by ~10% if we only changed the names?
60K
Mehrdad Farajtabar
@MFarajtabar
Oct 10, 2024
Replying to @MFarajtabar
12/ Understanding LLMs' true reasoning capabilities is crucial for deploying them in real-world scenarios where accuracy and consistency are non-negotiable—especially in #AI_safety, #alignment, #education, #health_care, and #decision_making systems. Our findings emphasize the
40K
Mehrdad Farajtabar
@MFarajtabar
Oct 22, 2024
1/ LLM inference is very expensive; and LLMs don't necessarily use their full capacity to respond to a specific prompt. That's why many researchers have been investigating adaptive computation methods such as early exiting, layer/expert pruning, speculative decoding, mixture of
27K
Mehrdad Farajtabar
@MFarajtabar
Oct 10, 2024
Replying to @MFarajtabar
8/ This begs the question: Do these models truly understand mathematical concepts? Introducing #GSM_NoOp! We add a single clause that seems relevant but doesn't contribute to the overall reasoning (hence "no-op"). Check out what happens next!
57K
Mehrdad Farajtabar
@MFarajtabar
Oct 10, 2024
Replying to @MFarajtabar
9/ #Result 4: A massive performance drop! All models, including o1 models, show significant declines. While it’ll be interesting to see how grade-school students perform on similar datasets, I doubt the drop would be this severe.
39K
Mehrdad Farajtabar
@MFarajtabar
Jun 5, 2025
Replying to @MFarajtabar
8/8 Scaling compute is helpful, but not enough to close the reasoning gaps 🧠 Our findings challenge assumptions about LRM capabilities. Despite sophisticated self-reflection mechanisms from RL training, our results suggest that these models can't follow algorithm steps and
machinelearning.apple.com
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the...
Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes…
19K
Mehrdad Farajtabar
@MFarajtabar
Oct 10, 2024
Replying to @MFarajtabar
2/ When OpenAI released GSM8K ~3 years ago, GPT-3 (175B) scored 35% on the GSM8K test. Today, models with ~3B parameters are surpassing 85%, and larger ones are hitting >95%. But has model 'reasoning' really improved? How much of this is genuine #logical/#symbolic reasoning? vs.
63K
Mehrdad Farajtabar
@MFarajtabar
Oct 10, 2024
Replying to @MFarajtabar
3/ Introducing GSM-Symbolic—our new tool to test the limits of LLMs in mathematical reasoning. We create symbolic templates from the #GSM8K test set, enabling the generation of numerous instances and the design of controllable experiments. We generate 50 unique GSM-Symbolic
60K
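
A rough illustration of the symbolic-template idea described in post 3/: one question with placeholders for names and numbers can be instantiated many times, with the ground-truth answer recomputed for each instance. This is only a minimal sketch under stated assumptions, not the authors' tooling. The template text, the name list, the value ranges, and the helpers `make_instance` and `make_dataset` are all hypothetical.

```python
import random

# Hypothetical template: the wording, names, and ranges are illustrative
# assumptions and are not taken from GSM8K or GSM-Symbolic.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)
NAMES = ["Sophie", "Liam", "Ava", "Noah"]


def make_instance(rng):
    """Fill the template with sampled values and recompute the answer."""
    name = rng.choice(NAMES)
    x = rng.randint(2, 50)
    y = rng.randint(2, 50)
    return {
        "question": TEMPLATE.format(name=name, x=x, y=y),
        "answer": x + y,  # ground truth follows from the sampled values
        "name": name,
        "x": x,
        "y": y,
    }


def make_dataset(n, seed=0):
    """Generate n instances reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_instance(rng) for _ in range(n)]


# The instance count here is arbitrary.
dataset = make_dataset(5)
for item in dataset:
    print(item["question"], "->", item["answer"])
```

Because only the surface values change while the underlying arithmetic stays fixed, accuracy differences across instances of one template point to sensitivity to names or numbers rather than to a change in problem difficulty.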