We are ByteDance Seed team.
Seamlessly Endowing LLMs with Test-Time Training Ability
Guhao Feng*, Shengjie Luo*, Kai Hua, Ge Zhang, Wenhao Huang, Di He, Tianle Cai
In-Place TTT is a drop-in test-time training method for Transformer LLMs. This repository provides the training, checkpoint conversion, inference, and evaluation stack built on VeOmni, together with recommended configs for Qwen3-8B and LLaMA-3.1-8B.
[2026/03] The codebase is open-sourced.
[2026/02] In-Place TTT is accepted to ICLR 2026 as an Oral presentation.
Current large language models follow a static "train then deploy" paradigm. Once deployed, model weights are frozen and cannot adapt to new information encountered during inference. This limits long-context reasoning, where useful information arrives progressively and the model would benefit from updating itself as it reads.
In-Place Test-Time Training (In-Place TTT) addresses this by updating a subset of model parameters, the MLP down-projection fast weights, during inference. Unlike prior TTT approaches that require architectural side modules or external memory, In-Place TTT stays inside the standard Transformer block and remains compatible with off-the-shelf autoregressive LLMs.
The method is centered around three ideas:
- Architectural compatibility. Fast weights live in the existing MLP down-projection matrix, so no extra attention heads or memory modules are introduced.
- LM-aligned objective. The fast-weight update is aligned with next-token prediction instead of a generic reconstruction target.
- Chunk-wise update. Long sequences are split into chunks so updates can be computed efficiently and scaled to long contexts.
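The chunk-wise idea can be illustrated with a deliberately tiny sketch. It is not the repository's implementation: the "model" is a single scalar fast weight, and the update is one gradient step on a squared-error target, used here only as a stand-in for the LM-aligned objective, which this README does not spell out. The names `chunk_spans` and `chunkwise_pass` are hypothetical. What the sketch does show is the ordering: each chunk is processed with the weights produced by earlier chunks, and only afterwards contributes its own update.

```python
from typing import List, Sequence, Tuple


def chunk_spans(seq_len: int, chunk: int) -> List[Tuple[int, int]]:
    """Return [start, end) spans that cover seq_len tokens in blocks of `chunk`."""
    if seq_len < 0 or chunk <= 0:
        raise ValueError("seq_len must be >= 0 and chunk must be > 0")
    return [(s, min(s + chunk, seq_len)) for s in range(0, seq_len, chunk)]


def chunkwise_pass(
    xs: Sequence[float],
    targets: Sequence[float],
    w: float,
    lr: float,
    chunk: int,
) -> Tuple[List[float], float]:
    """Toy apply-then-update loop for a scalar model y = w * x.

    Assumption: the squared-error step below is a placeholder objective,
    not the LM-aligned update used by In-Place TTT.
    """
    outputs: List[float] = []
    for start, end in chunk_spans(len(xs), chunk):
        cx, ct = xs[start:end], targets[start:end]
        # 1) Apply: this chunk only sees updates from earlier chunks.
        outputs.extend(w * x for x in cx)
        # 2) Update: one gradient step on the chunk's mean squared error.
        grad = sum(2.0 * (w * x - t) * x for x, t in zip(cx, ct)) / len(cx)
        w -= lr * grad
    return outputs, w


if __name__ == "__main__":
    outs, w_final = chunkwise_pass([1.0] * 4, [2.0] * 4, w=0.0, lr=0.25, chunk=2)
    print(outs, w_final)  # [0.0, 0.0, 1.0, 1.0] 1.5
```

In the toy run, the second chunk's outputs move toward the target because the weight was updated after the first chunk, which is the behaviour the chunk-wise design is meant to provide at scale.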
As used in this repo, the end-to-end workflow is:
- Provide your own VeOmni-compatible processed dataset and base model assets.
- Launch continual pretraining with VeOmni through train.sh and tasks/train_torch.py.
- Export DCP checkpoints into HuggingFace format with scripts/merge_dcp_to_hf.py.
- Run TTT-aware inference and RULER evaluation with inference_model/, eval.sh, and eval_config/.
The repository includes recommended training configs for Qwen3-8B and LLaMA-3.1-8B, checkpoint conversion utilities, and a full RULER evaluation pipeline via OpenCompass from 4K to 256K context lengths.
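For orientation, the three command-line stages of the workflow can be sketched as a small dry-run driver. This is a sketch under assumptions, not a script shipped with the repo: the helper names are hypothetical, and the only flags used are the ones that appear in the commands in this README. It prints the commands rather than running them.

```python
import json
import shlex
from typing import Dict, List, Optional


def train_command(
    config: str,
    output_dir: str,
    data_path: Optional[str] = None,
    foundation: Optional[Dict] = None,
) -> List[str]:
    """Build the training command as an argv list."""
    cmd = ["bash", "train.sh", "tasks/train_torch.py", config,
           "--train.output_dir", output_dir]
    if data_path is not None:
        cmd += ["--data.train_path", data_path]
    if foundation is not None:
        # Compact JSON, matching the form of the --model.foundation overrides.
        cmd += ["--model.foundation", json.dumps(foundation, separators=(",", ":"))]
    return cmd


def export_command(
    load_dir: str,
    save_dir: Optional[str] = None,
    model_assets_dir: Optional[str] = None,
) -> List[str]:
    """Build the DCP-to-HuggingFace conversion command."""
    cmd = ["python", "scripts/merge_dcp_to_hf.py", "--load-dir", load_dir]
    if save_dir is not None:
        cmd += ["--save-dir", save_dir]
    if model_assets_dir is not None:
        cmd += ["--model-assets-dir", model_assets_dir]
    return cmd


def eval_command() -> List[str]:
    """Build the default RULER sweep command."""
    return ["bash", "eval.sh"]


if __name__ == "__main__":
    steps = [
        train_command("configs/pretrain/qwen3_longct.yaml",
                      "/path/to/your_output_dir", "/path/to/your_data"),
        export_command("/path/to/your_checkpoint_dir",
                       "/path/to/your_hf_checkpoint_dir", "/path/to/your_base_model"),
        eval_command(),
    ]
    for step in steps:
        # Dry run only; replace with subprocess.run(step, check=True) to execute.
        print(shlex.join(step))
```

Building argv lists instead of shell strings avoids quoting problems when the JSON override is passed through.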
Step 1. Install PyTorch and FlashAttention:
pip3 install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip3 install flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
rm flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
Step 2. Install VeOmni from the validated commit:
pip3 install "veomni @ git+https://github.com/ByteDance-Seed/VeOmni.git@9b91e164bea9e17f17ed490aab5e076c2335ca25"
Step 3. Install the remaining dependencies:
pip3 install liger-kernel
pip3 install byted-wandb torchdata blobfile datasets diffusers tiktoken timm
pip3 install transformers==4.57.3
pip3 install opt_einsum einops
pip3 uninstall -y byted-wandb wandb
pip3 install byted-wandb
Step 4. Optionally verify the installed VeOmni source:
python3 - <<'PY'
import json, pathlib, veomni
p = pathlib.Path(veomni.__file__).resolve().parents[1] / "veomni-0.1.0.dist-info" / "direct_url.json"
print("veomni file:", veomni.__file__)
print("direct_url:", json.loads(p.read_text()) if p.exists() else "not found")
PY
This repository no longer ships data-processing scripts. Provide your own processed dataset through data.train_path.
The recommended configs assume:
- data.data_type=plaintext
- data.datasets_type=iterable
- data.text_keys=content_split
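Before launching a long run, it can help to check that your records actually carry the text field named by data.text_keys. The sketch below assumes, purely for illustration, that the data is stored as JSON Lines with one object per line; this README does not state the on-disk format, so confirm it against the VeOmni docs. The function name `count_usable_records` is hypothetical.

```python
import json
from typing import Iterable, Tuple


def count_usable_records(
    lines: Iterable[str], text_key: str = "content_split"
) -> Tuple[int, int]:
    """Count (usable, skipped) records in an iterable of JSON Lines strings.

    A record is usable if it is a JSON object whose `text_key` value is a
    non-empty string. Blank lines are ignored. JSON Lines storage is an
    assumption of this sketch, not something the README specifies.
    """
    usable = skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        text = record.get(text_key) if isinstance(record, dict) else None
        if isinstance(text, str) and text.strip():
            usable += 1
        else:
            skipped += 1
    return usable, skipped


if __name__ == "__main__":
    sample = [
        '{"content_split": "hello world"}',
        '{"text": "wrong key"}',
        "not json",
        "",
        '{"content_split": ""}',
    ]
    print(count_usable_records(sample))  # (1, 3)
```

To check a real file, pass an open file handle, for example `count_usable_records(open(path, encoding="utf-8"))`.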
For dataset argument definitions and supported loading modes, refer to the official VeOmni docs.
Example:
bash train.sh tasks/train_torch.py configs/pretrain/qwen3_longct.yaml \
  --data.train_path /path/to/your_data \
  --train.output_dir /path/to/your_output_dir
Below is the recommended model config pattern used in the provided Qwen and LLaMA examples.
model:
  model_path: /path/to/your_base_model
  foundation:
    ttt_layers: [0, 6, 12, 18, 24, 30, 36]
    ttt_mode: true
    ttt_proj: true
    ttt_lr: 3
    ttt_chunk: 4096
data:
  train_path: /path/to/your_data
  train_size: 20000000000
  dataloader_type: native
  datasets_type: iterable
  data_type: plaintext
  max_seq_len: 65536
  text_keys: content_split
  drop_last: true
train:
  output_dir: /path/to/your_output_dir
  data_parallel_mode: fsdp2
  global_batch_size: 64
  micro_batch_size: 1
  optimizer: adamw
  lr: 5.0e-6
  lr_warmup_ratio: 0.02
  lr_decay_style: cosine
  lr_decay_ratio: 0.90
  weight_decay: 0.1
  max_grad_norm: 1.0
  max_steps: 5000
  enable_mixed_precision: true
  enable_gradient_checkpointing: true
  enable_full_shard: true
  init_device: meta
  ckpt_manager: dcp
  save_steps: 500
  save_hf_weights: true
  use_wandb: true
The corresponding recommended config files are:
- configs/pretrain/qwen3_longct.yaml
- configs/pretrain/llama3_longct.yaml
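The numbers in the recommended config pattern can be sanity-checked with a little arithmetic. The sketch below is a back-of-the-envelope aid, not part of the repo: `RunShape` is a hypothetical helper, the token counts are upper bounds that assume every sequence is filled to max_seq_len, and the gradient-accumulation formula is the usual data-parallel relationship, which this README does not confirm for VeOmni.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunShape:
    """Values copied from the recommended config pattern in this README."""

    global_batch_size: int = 64
    micro_batch_size: int = 1
    max_seq_len: int = 65536
    max_steps: int = 5000
    ttt_chunk: int = 4096

    def tokens_per_step(self) -> int:
        # Upper bound: assumes every sequence is filled to max_seq_len.
        return self.global_batch_size * self.max_seq_len

    def total_tokens(self) -> int:
        return self.tokens_per_step() * self.max_steps

    def chunks_per_sequence(self) -> int:
        # Ceiling division, so a trailing partial chunk still counts.
        return -(-self.max_seq_len // self.ttt_chunk)

    def grad_accum_steps(self, data_parallel_ranks: int) -> int:
        # Assumption: global batch = micro batch * ranks * accumulation steps.
        per_pass = self.micro_batch_size * data_parallel_ranks
        if data_parallel_ranks <= 0 or self.global_batch_size % per_pass:
            raise ValueError("global_batch_size must be divisible by micro_batch_size * ranks")
        return self.global_batch_size // per_pass


if __name__ == "__main__":
    shape = RunShape()
    print(shape.tokens_per_step())      # 4194304
    print(shape.total_tokens())         # 20971520000
    print(shape.chunks_per_sequence())  # 16
    print(shape.grad_accum_steps(8))    # 8
```

The total of roughly 2.1e10 tokens is of the same order as train_size: 20000000000 in the config; how VeOmni reconciles train_size with max_steps is not covered here. The eight-rank example is an arbitrary illustration, not a hardware recommendation.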
Quick smoke run:
bash train.sh tasks/train_torch.py configs/pretrain/qwen3_longct.yaml \
  --train.output_dir /path/to/your_output_dir \
  --train.max_steps 1 \
  --train.use_wandb false
Recommended Qwen config override:
bash train.sh tasks/train_torch.py configs/pretrain/qwen3_longct.yaml \
  --train.wandb_project your_wandb_project \
  --train.wandb_name your_run_name \
  --train.output_dir /path/to/your_output_dir \
  --model.foundation '{"ttt_layers":[0,6,12,18,24,30,36],"ttt_mode":true,"ttt_proj":true,"ttt_lr":3,"ttt_chunk":4096}'
Recommended LLaMA config override:
bash train.sh tasks/train_torch.py configs/pretrain/llama3_longct.yaml \
  --train.wandb_project your_wandb_project \
  --train.wandb_name your_run_name \
  --train.output_dir /path/to/your_output_dir \
  --model.foundation '{"ttt_layers":[0,6,12,18,24,30,36],"ttt_mode":true,"ttt_proj":true,"ttt_lr":3,"ttt_chunk":4096}'
Convert VeOmni DCP checkpoints into HuggingFace format:
python scripts/merge_dcp_to_hf.py \
  --load-dir /path/to/your_checkpoint_dir
python scripts/merge_dcp_to_hf.py \
  --load-dir /path/to/your_checkpoint_dir \
  --save-dir /path/to/your_hf_checkpoint_dir \
  --model-assets-dir /path/to/your_base_model \
  --shard-size 5000000000
Run the default RULER evaluation sweep:
bash eval.sh
Single-config smoke run:
CUDA_VISIBLE_DEVICES=0 python3 -c \
  "import inference_model; from opencompass.cli.main import main; import sys; sys.argv=['opencompass','eval_config/ruler_4k.py','--debug']; main()"
To evaluate your own checkpoints, update eval_config/models.py with your model name and HuggingFace checkpoint path.
- Drop-in TTT for standard Transformers. In-Place TTT updates the MLP down-projection fast weights without introducing extra architectural side modules.
- LM-aligned fast-weight updates. The optimization target is derived for autoregressive language modeling instead of a generic reconstruction objective.
- Long-context continual pretraining stack. The repo includes recommended Qwen3-8B and LLaMA-3.1-8B configs built on VeOmni and FSDP2.
- Checkpoint export path. scripts/merge_dcp_to_hf.py converts VeOmni DCP checkpoints into HuggingFace format.
- TTT-aware inference and evaluation. inference_model/, eval.sh, and eval_config/ cover inference and RULER evaluation through OpenCompass.
- Long-context coverage. The evaluation setup spans 4K, 8K, 16K, 32K, 64K, and 128K, and includes a 256K config.
This project is licensed under the Apache License 2.0.
If you find this work useful for your research and applications, feel free to give us a star or cite us using:
@inproceedings{feng2026inplace,
title = {In-Place Test-Time Training},
author = {Feng, Guhao and Luo, Shengjie and Hua, Kai and Zhang, Ge and Huang, Wenhao and He, Di and Cai, Tianle},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026},
note = {Oral Presentation},
url = {https://openreview.net/forum?id=dTWfCLSoyl}
}
About ByteDance Seed Team
Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.

