SlideFormer is a PyTorch-based heterogeneous runtime for full-parameter fine-tuning on a single GPU. It co-designs layer-sliding execution, heterogeneous memory management, and asynchronous transfer/compute pipelines across GPU memory, CPU RAM, and optional NVMe storage. Only a compact active layer window is materialized on the GPU, while persistent training states are kept in CPU memory and cross-tier data movement is carefully scheduled throughout execution.
In our evaluation, SlideFormer achieves 1.40×–6.27× higher throughput than related offloading baselines, reduces peak GPU memory by up to 50% and peak CPU memory by up to 40%, supports up to 8× larger batch sizes, and enables full-parameter fine-tuning of 100B+ models on a single commodity GPU.
If this repository is useful to your work, please consider starring it.
- 2026.06: We released SlideFormer with expanded model compatibility, reproducibility scripts, and a new fine-grained chunked pipeline. A multi-GPU extension of SlideFormer's heterogeneous runtime is under active development and will be documented soon.
- 2026.03: SlideFormer was accepted to DAC 2026. The conference presentation page is available here, and the paper is available on arXiv.
- Early 2025: The initial SlideFormer system was developed and evaluated; subsequent development has focused on system-level optimization.
- Layer-sliding execution: keeps only a compact active layer window on the GPU, allowing full-parameter fine-tuning of models that exceed GPU memory capacity.
- Lightweight asynchronous engine: overlaps GPU computation with CPU optimizer updates, FP32-to-BF16 conversion and H2D parameter transfers, D2H gradient movement, activation offload/prefetch, and optional NVMe I/O.
- Heterogeneous memory hierarchy: coordinates GPU memory, CPU RAM, and optional local NVMe storage to reduce memory pressure while preserving full-parameter mixed-precision training semantics.
- Memory-efficient implementation: uses pre-allocated GPU cache units, layer-shared host buffers, and a layer-wise CPU LayerAdam path to reduce allocation overhead and peak GPU/CPU memory usage.
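The layer-sliding idea in the list above can be illustrated with a small, framework-free sketch. It is not SlideFormer's scheduler: the function name `sliding_window_schedule` and the `window` parameter are hypothetical, and the sketch only tracks which layer indices would be resident during a forward pass, assuming each layer is evicted right after it is computed and the next pending layer is loaded in its place.

```python
from collections import deque
from typing import Iterator, Tuple


def sliding_window_schedule(num_layers: int, window: int) -> Iterator[Tuple[int, Tuple[int, ...]]]:
    """Yield (layer_to_compute, resident_layers) for one forward pass.

    Hypothetical illustration: at most `window` layers are resident at any
    time; a computed layer is evicted and the next pending layer is loaded.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    resident = deque()
    next_to_load = 0

    # Pre-fill the window before the first layer runs.
    while next_to_load < min(window, num_layers):
        resident.append(next_to_load)
        next_to_load += 1

    for layer in range(num_layers):
        yield layer, tuple(resident)
        resident.popleft()  # evict the layer that was just computed
        if next_to_load < num_layers:
            resident.append(next_to_load)  # slide the window forward
            next_to_load += 1


if __name__ == "__main__":
    for layer, resident in sliding_window_schedule(num_layers=5, window=2):
        print(f"compute layer {layer}, resident: {resident}")
```

In a real runtime the load of the next layer would be issued asynchronously so that it overlaps with computation of the current one; the sketch leaves that out and only shows the residency pattern.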
SlideFormer is designed for single-GPU machines with sufficient CPU memory. The exact memory requirement depends on the model size, sequence length, batch size, optimizer state, and offload configuration.
From our evaluation sweep, a practical estimate is about 12 GB CPU RAM per 1B parameters + batch-dependent activation/buffer overhead. NVMe optimizer/activation offload can further reduce the CPU-memory requirement at the cost of lower throughput.
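As a rough planning aid, the 12 GB per 1B parameters rule of thumb can be turned into a small calculator. The per-parameter breakdown below (FP32 master weights plus two FP32 Adam moments, 12 bytes in total) is an assumption that happens to match the estimate, not an accounting stated by the authors; `estimate_cpu_ram_gb` and `overhead_gb` are hypothetical names, and the batch-dependent activation/buffer overhead must be supplied by the user.

```python
# Assumed per-parameter CPU-resident state in bytes. This is an illustrative
# accounting, not a breakdown published for SlideFormer.
BYTES_PER_PARAM = {
    "fp32_master_weights": 4,
    "adam_first_moment_fp32": 4,
    "adam_second_moment_fp32": 4,
}


def estimate_cpu_ram_gb(num_params: float, overhead_gb: float = 0.0) -> float:
    """Estimate CPU RAM in GB (10^9 bytes) for persistent training state.

    `overhead_gb` stands in for batch-dependent activation and buffer
    memory, which this sketch does not model.
    """
    state_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return state_bytes / 1e9 + overhead_gb


if __name__ == "__main__":
    for name, num_params in [("8B", 8e9), ("32B", 32e9), ("100B", 100e9)]:
        print(f"{name}: ~{estimate_cpu_ram_gb(num_params):.0f} GB + overhead")
```

Treat the result as a lower bound for sizing a machine; sequence length, batch size, and NVMe offload settings all shift the real requirement.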
git clone https://github.com/RegiaYoung/SlideFormer.git
cd SlideFormer
conda env create -f environment.yml
conda activate slideformer

Core dependencies are torch, transformers, and datasets. Since the core runtime is mostly torch-native, SlideFormer can also run on AMD/ROCm GPUs with a compatible PyTorch/ROCm stack.

Optional dependencies for acceleration/offload are flash-attn, liger-kernel, and tensornvme; the code falls back when they are not available. We recommend installing all optional dependencies to match the reported performance.

Run a dummy-data end-to-end sanity check:
python scripts/main_dummy.py \
--model_path /path/to/Llama-3.1-8B-Instruct \
--seq_len 1024 \
--batch_size 64

On multi-NUMA machines, binding the process to the GPU's NUMA node can improve CPU-GPU bandwidth:
numactl --cpunodebind=0 --membind=0 python scripts/main_dummy.py

Useful options:
| Option | Description |
|---|---|
| `--model_path` | Local path or HuggingFace model ID |
| `--seq_len` | Sequence length |
| `--batch_size` | Per-step batch size |
| `--epochs` | Number of training epochs |
| `--attn_implementation` | `flash_attention_2` or `sdpa` |
| `--ac_offload_nvme` | Offload saved activations to NVMe |
| `--nvme_offload_fraction` | Optimizer-state NVMe offload fraction |
| `--offload_dir` | Directory used when NVMe offload is enabled |

scripts/main_real.py provides a real fine-tuning example on
QizhiPei/MathFusionQA.
python scripts/main_real.py \
--model_path /path/to/Llama-3.1-8B-Instruct \
--seq_len 4096 \
--batch_size 16 \
--epochs 1 \
--output_dir ./mathfusion-ft-results

The script saves the trained model, tokenizer, loss curve, and loss CSV to --output_dir.

SlideFormer tracks a DeepSpeed ZeRO-3 CPU-offload reference on both 40 distinct real-data batches and a repeated-batch overfit probe.
Single benchmark run:
python scripts/main_bench.py \
--model_path /path/to/Llama-3.1-8B-Instruct \
--seq_len 1024 \
--batch_size 64 \
--warm_step 3 \
--test_step 10 \
--result_file ./outputs/bench.csv

Sweep over configured models and batch sizes:
bash scripts/bench.sh

For reproducibility, baseline implementations and adapted configs are collected under bench/, including DeepSpeed, ColossalAI, MegaTrain, and LoHan. The supplemental figures below add a MegaTrain comparison alongside the paper baselines.

(a) Batch-size scaling and CPU memory for Llama-3.1-8B on RTX 4090.

(b) Model-size scaling and CPU memory for Qwen3 models on RTX 4090.

(c) GPU allocated memory vs. batch size for Llama-3.1-8B.

(d) Long-context training on RTX 4090. Left: sequence-length scaling for Llama-3.1-8B at batch size 1. Right: maximum trainable sequence length for Qwen3 models at batch size 1; labels above bars report TFLOPS at the corresponding maximum.

With the flash attention used here, activation-related memory scales
approximately with batch_size * seq_len. For example, SlideFormer shows nearly
identical CPU and allocated GPU memory at bs=64, seq=1K and
bs=1, seq=64K. Compute scales differently: self-attention costs
O(B * S^2), so increasing sequence length adds substantially more attention
work even at a fixed token budget and makes training more compute-intensive.
Long-context TFLOPS can consequently exceed the 1K no-offload peak, which is
a workload-specific reference rather than a hardware peak.
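The two scaling rules in this paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: it counts tokens and attention score entries in relative units, ignores constant factors such as hidden size and layer count, and assumes "1K" means 1024 and "64K" means 65536 tokens.

```python
def token_budget(batch_size: int, seq_len: int) -> int:
    """Proxy for activation memory with flash attention: B * S."""
    return batch_size * seq_len


def attention_work(batch_size: int, seq_len: int) -> int:
    """Proxy for self-attention compute: B * S^2 (constant factors dropped)."""
    return batch_size * seq_len ** 2


short_ctx = (64, 1024)    # bs=64, seq=1K
long_ctx = (1, 65536)     # bs=1,  seq=64K

# Same token budget, so activation-related memory is roughly equal.
print(token_budget(*short_ctx), token_budget(*long_ctx))

# Attention work grows with S at a fixed token budget.
ratio = attention_work(*long_ctx) // attention_work(*short_ctx)
print(f"attention work ratio (long / short): {ratio}x")
```

At an equal token budget, the long-context configuration performs 64 times more attention work in this simplified count, which is why it is more compute-intensive even though its memory footprint is similar.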
Note: For fairness, all evaluated systems use the same workload and enable the same fused kernels unless a framework already provides an equivalent implementation. SlideFormer and all baselines except MegaTrain follow mixed-precision training semantics with BF16 compute, FP32 master parameters, and FP32 optimizer states. In the official MegaTrain single-GPU path, CPU-resident model parameters are loaded and updated in BF16. Its CPU-memory footprints are therefore shown for reference only.
The 2026.06 release adds a chunked asynchronous transfer/update pipeline that splits long FP32-to-BF16 conversion, H2D parameter movement, D2H gradient return, and CPU Adam update into smaller overlapped segments. This reduces exposed transfer/update time under pipeline contention and improves Qwen3-8B single-GPU throughput, especially at small batch sizes.
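Why chunking reduces exposed time can be seen in a toy timing model. This is a sketch under strong assumptions, not a description of the released pipeline: each stage is assumed to run on its own resource without contention, chunks are equal-sized, and the stage order and per-layer timings are hypothetical values chosen for illustration.

```python
from typing import Sequence


def sequential_time(stage_times: Sequence[float]) -> float:
    """Time for one layer when stages run back to back with no overlap."""
    return float(sum(stage_times))


def chunked_pipeline_time(stage_times: Sequence[float], num_chunks: int) -> float:
    """Makespan of a linear pipeline with equal-sized chunks.

    The first chunk passes through every stage, after which the slowest
    stage finishes one chunk per step:
        sum(t_i / n) + (n - 1) * max(t_i / n)
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    per_chunk = [t / num_chunks for t in stage_times]
    return sum(per_chunk) + (num_chunks - 1) * max(per_chunk)


if __name__ == "__main__":
    # Hypothetical per-layer stage times in milliseconds:
    # D2H gradient return, CPU Adam update, FP32-to-BF16 conversion, H2D transfer.
    stages_ms = [2.0, 4.0, 1.0, 2.0]
    print("unchunked:", sequential_time(stages_ms))
    for n in (2, 4, 8):
        print(f"{n} chunks:", chunked_pipeline_time(stages_ms, n))
```

In this model the total approaches the duration of the slowest stage as the chunk count grows. Real systems add per-chunk launch and synchronization overhead, so the useful chunk count is bounded in practice.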
- Qwen2, Qwen2.5, and Qwen3
- Llama 3, 3.1, 3.2, and 3.3
- Mistral models
- Other HuggingFace decoder-only Transformers
Some model families may require light adapter changes for full compatibility. We will continue expanding tested model support.
SlideFormer/
├── offload_transformer.py # Runtime engine and scheduling
├── transformer_layer.py # Layer wrappers and CPU-GPU transfers
├── sliding_checkpoint.py # Activation offload and prefetch
├── optimizer/ # Layer-wise CPU Adam optimizer
├── utils/ # Datasets, metrics, and monitor helpers
├── scripts/ # Train, bench, and profile entries
├── bench/ # Baselines and reproducibility artifacts
└── environment.yml # Conda environment
If you use SlideFormer, its code, or its design ideas in your research, please cite our DAC 2026 paper.
- Paper: An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU
- DAC 2026 presentation: 63rd ACM/IEEE Design Automation Conference
- DOI: 10.1145/3770743.3804125
- ISBN: 979-8-4007-2254-7
@misc{yang2026efficientheterogeneouscodesignfinetuning,
  title={An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU},
  author={Ruijia Yang and Zeyi Wen},
  year={2026},
  eprint={2603.16428},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2603.16428},
}

This project is released under the Apache-2.0 License. See NOTICE for attribution and bundled third-party software information.




