Muyu He (@HeMuyu0327) / X

469 posts

Muyu He

@HeMuyu0327

RL @ZyphraAI | Expert of mixtures, or 21st century schizoid man.

San Francisco

riddlehe.github.io

Joined September 2023

Pinned
Muyu He
@HeMuyu0327
May 30
In our new paper, we naturally derive a new attention variant based on the surprising finding that deep layers benefit the most from learning context-free value vectors, without input from the residual stream. The attention variant: since the value vector does not depend
45K
Muyu He
@HeMuyu0327
Sep 13, 2025
Yep, @thinkymachines called it: the chunk size of the prefill strategy does cause LLM outputs to be non-deterministic. When I partition the attention reduction into 1, 2, 4 and 16 chunks, the logits drift significantly, across all three dtypes. So there are at least two
83K
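A minimal sketch of why chunking a reduction can change the result, assuming nothing about the actual experiment in the post above: floating-point addition is not associative, so summing the same values in a different grouping can round differently. The helpers `sequential_sum` and `chunked_sum` and the toy inputs are hypothetical, chosen so the effect shows up in plain Python floats rather than in real attention logits.

```python
def sequential_sum(values):
    """Add values left to right with an explicit accumulator."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def chunked_sum(values, num_chunks):
    """Sum each contiguous chunk, then sum the per-chunk partial results."""
    size = -(-len(values) // num_chunks)  # ceiling division
    partials = [
        sequential_sum(values[i:i + size])
        for i in range(0, len(values), size)
    ]
    return sequential_sum(partials)


# Toy inputs (hypothetical): the large terms absorb the small ones
# differently depending on how the additions are grouped.
values = [1e16, 1.0, -1e16, 1.0]
results = {n: chunked_sum(values, n) for n in (1, 2, 4)}
print(results)
```

With these inputs, one chunk and two chunks give different sums even though the values are identical. The explicit accumulator is deliberate: the built-in `sum` may use compensated summation in recent Python versions, which would hide the effect.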
Muyu He
@HeMuyu0327
Sep 22, 2025
Did a bit of the math in this new Flow-RL paper and it's fascinating. To ensure models don't lose diversity by only maximizing reward, they predict the sum of the exponentials of all possible rewards for a question, and then train the model to get rewards according to a
29K
Muyu He
@HeMuyu0327
Sep 26, 2025
Finally!! Validated the most important contention in @thinkymachines's nondeterminism blog: split reduction along the kv dimension causes batch-variant outputs. Given a decode step with a large context size (e.g. 4096), depending on batch sizes, attention implementations such as
25K
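A rough sketch of the general split-reduction idea the post above refers to, not the author's code or any library's kernel: attention for one query is computed over separate KV chunks, and the partial results are merged by rescaling to a shared maximum. The function names, the scalar values, and the toy inputs are all assumptions made for illustration.

```python
import math


def partial_attention(scores, values):
    """Softmax-weighted partial sum over one KV chunk, for a single query.

    Returns (chunk max, sum of exp weights, weighted value sum).
    """
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return m, sum(weights), sum(w * v for w, v in zip(weights, values))


def split_kv_attention(scores, values, num_splits):
    """Reduce over the KV dimension in num_splits chunks, then merge the partials."""
    size = -(-len(scores) // num_splits)  # ceiling division
    parts = [
        partial_attention(scores[i:i + size], values[i:i + size])
        for i in range(0, len(scores), size)
    ]
    global_max = max(m for m, _, _ in parts)
    # Rescale every chunk to the shared max before adding.
    denom = sum(l * math.exp(m - global_max) for m, l, _ in parts)
    numer = sum(o * math.exp(m - global_max) for m, _, o in parts)
    return numer / denom


# Toy single-query inputs (hypothetical), with scalar values for brevity.
scores = [0.1 * ((7 * i) % 13) for i in range(64)]
values = [math.sin(i) for i in range(64)]
outputs = {n: split_kv_attention(scores, values, n) for n in (1, 2, 4, 16)}
print(outputs)
```

Every split count computes the same quantity mathematically, but the additions happen in a different order, so the low-order bits can differ. If a kernel picks the number of splits based on batch size, the same request can therefore produce slightly different outputs in different batches.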
Muyu He
@HeMuyu0327
Sep 11, 2025
Loving the @thinkymachines article on how batch size causes LLMs to be nondeterministic, so I launched some exps and found something unexpected: - Even when using scaled dot-product attention, there are perceivable logit shifts for batch size = 1 vs > 1, showing nondeterministic
27K
Muyu He
@HeMuyu0327
Nov 7, 2025
On-policy distillation is powerful, but @thinkymachines's tinker only supports distilling from a teacher model within the same family, making it impossible for qwen to learn from deepseek, gpt-oss, etc. For the first time, we enabled model-agnostic distillations natively using
23K
Muyu He
@HeMuyu0327
Oct 13, 2025
The original attention sink paper finds that the sink token occurs regardless of semantic content, but it simply assumes that the absolute position of the token is what causes the sink. I did some layer sweep exps and the results are interesting -- by changing the positional
23K
Muyu He
@HeMuyu0327
Nov 1, 2025
Tinker for training exists, but Tinker for data doesn't. Yet researchers spend most of their time on data preprocessing / generation and training integration. This Halloween, we introduce spider, i.e. Tinker for data. It spins up a client for users to define a production-grade
43K
Muyu He
@HeMuyu0327
Sep 29, 2025
Since this hasn't been done, I patched the standard flash-attention repo (FA2) to make attention outputs fully deterministic, inspired by the @thinkymachines blog. TLDR: there must be some extra unreported sources of nondeterminism, since after the patch we clearly removed all
20K
Muyu He
@HeMuyu0327
Oct 24, 2025
There is a very interesting transformers "bug" that invalidates our prev results on attention sinks, and suggests something far more interesting. The "bug": masking_utils automatically masks out tokens in a sequence if the position ids are not incrementing by one (p1), so the
20K
Muyu He
@HeMuyu0327
Oct 16, 2025
After finding that models choose attention sinks to be the token with the smallest **relative**, not absolute, position id in the global context, we find another unexpected behavior not covered by the original paper: Attention sinks are only chosen from tokens after the **last
16K
Muyu He
@HeMuyu0327
Aug 2, 2025
We randomly came across this relatively unknown model, Kwaipilot, on LiveCodeBench pro where it’s beating the ass of R1 and the like. Turns out they used some really inspiring RL techniques. RL on coding or coding + math produces short reasoning traces that are unhelpful. So
16K
Muyu He
@HeMuyu0327
Nov 9, 2025
There is a "bug" in how @huggingface implements their on-policy distillation for teacher-student models with different tokenizers, and we have fixed it in our implementation using native Tinker. The "bug": the student's rollout is retrieved raw for the computation of the
27K
Muyu He
@HeMuyu0327
Oct 19, 2025
Interested in what causally induces attention sink, we ask this question: if query tokens never update as much as attention sink requires, do attention sinks still emerge? The result: No. Experiment setup: after each attention layer, before computing attn * V, we halve the
13K