Zengzhi Wang (@SinclairWang1) / X

Zengzhi Wang

1,709 posts

@SinclairWang1

PhDing @sjtu1896. Working on Pre-training Data Engineering for Foundation Models: MathPile (2023), 🫐 ProX (2024), 💎 MegaMath (2025), 🐙 OctoThinker (2025)

sinclaircoder.github.io

Joined November 2020

Pinned
Zengzhi Wang
@SinclairWang1
Jun 26, 2025
What Makes a Base Language Model Suitable for RL? Rumors in the community say RL (i.e., RLVR) on LLMs is full of “mysteries”: (1) Is the magic only happening on Qwen + Math? (2) Does the "aha moment" only spark during math reasoning? (3) Is evaluation hiding some tricky traps?
93K
Zengzhi Wang
@SinclairWang1
Jun 30, 2025
Just finished reading it quickly. It was truly impressive.
21K
Zengzhi Wang
@SinclairWang1
Apr 24, 2025
🚨New blog alert! Working on LLM x RL? You don’t want to miss this. Most SOTA RL results today rely on Qwen2.5 base models, but swap in Llama at the same model size and RL training dynamics shift drastically—RL from base often fails. Why? We ran a series of carefully controlled
21K
Zengzhi Wang
@SinclairWang1
May 28, 2025
I believe we need a deeper understanding of the relationship between pre-training and RL scaling. How can we pre-train better, so that language models are more suitable for smooth RL scaling? That is to say, pre-training for RL. If you are interested in it, welcome to
Stella Li
@StellaLisy
May 27, 2025
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
34K
Zengzhi Wang
@SinclairWang1
May 7, 2023
A moment worth remembering: although it may seem trivial, it is very important to me.😅🥳🥳🥳
24K
Zengzhi Wang
@SinclairWang1
Apr 20, 2025
A nice summary blog about mid-training.
22K
Zengzhi Wang
@SinclairWang1
Sep 26, 2024
🚨New paper!🚨 Still worried about the low quality of your rule-cleaned pre-training corpora? Try 🫐 ProX! 1. Dramatically boosts pre-training corpus quality with a language model that generates executable programs. 2. A 1.7B model, trained on a corpus refined by 🫐 ProX with
26K
Zengzhi Wang
@SinclairWang1
Jun 10, 2025
Interesting effort on pre-training. While I appreciate the effort that went into this, I respectfully hold some different opinions on certain aspects of it. (1) The choices of pre-training data. As the paper mentioned, "The proposed approach allows RL to be scaled to the
Qingxiu Dong
@qx_dong
Jun 10, 2025
⏰ We introduce Reinforcement Pre-Training (RPT🍒), reframing next-token prediction as a reasoning task using RLVR ✅ General-purpose reasoning 📑 Scalable RL on web corpus 📈 Stronger pre-training + RLVR results 🚀 Allows allocating more compute to specific tokens
11K
Zengzhi Wang
@SinclairWang1
May 7, 2025
MegaMath, currently the largest open-source math pre-training corpora collection, reaches 70k+ downloads. Check our paper for more details: arxiv.org/pdf/2504.02807 Download data: huggingface.co/datasets/LLM36…
21K
Zengzhi Wang
@SinclairWang1
Dec 29, 2023
Replying to @_akhaliq
Hi, thanks very much for your interest in our MathPile. Our data is now open source on huggingface.co/datasets/GAIR/….
GAIR/MathPile · Datasets at Hugging Face
14K
Zengzhi Wang
@SinclairWang1
Apr 30, 2024
If Model A beats B on benchmarks, is it really better? Not if it trained on those benchmarks; that's an unfair edge! How can you tell if a model used benchmark data for training? 🤔 Check out our latest work: huggingface.co/papers/2404.18… (1/n)
Paper page - Benchmarking Benchmark Leakage in Large Language Models
26K
Zengzhi Wang
@SinclairWang1
Jun 18, 2025
Finally had a bit of time to jot down some thoughts on this solid, open data engineering work from @essential_ai. This work brings Essential-Web, a 24T-token pre-training corpus, to the open-source community. I've always appreciated open-source research, as it can significantly
Essential AI
@essential_ai
Jun 18, 2025
[1/5] 🚀 Meet Essential-Web v1.0, a 24-trillion-token pre-training dataset with rich metadata built to effortlessly curate high-performing datasets across domains and use cases!
11K
Zengzhi Wang
@SinclairWang1
May 7, 2025
I believe rewriting (refining) pre-training corpora is indeed promising. Last month, we dropped MegaMath, including MegaMath-Web-Pro (15.1B tokens), refined by LLMs at scale, currently the top-quality (maybe best) open-source math pre-training corpus. Check our paper for more
PapersAnon
@papers_anon
May 7, 2025
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code Two datasets. Found that rewriting instead of filtering produced better results by eliminating noise and redundancy. With a fixed 50B training budget, continual pre-training of Llama-3.1-8B boosts pass@1 by
6.1K
Zengzhi Wang
@SinclairWang1
Jun 13, 2025
Just finished a quick review of Magistral's technical report. I believe it's definitely worth a look for its many insightful details and implementation highlights. (1) A tricky and smart language-consistency reward that uses a fastText classifier. I believe
Mistral AI
@MistralAI
Jun 10, 2025
Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning.
6.1K