RegiaYoung/SlideFormer

SlideFormer

An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU

Installation | Quick Start | Benchmarking | Supported Models | Citation

SlideFormer is a PyTorch-based heterogeneous runtime for full-parameter fine-tuning on a single GPU. It co-designs layer-sliding execution, heterogeneous memory management, and asynchronous transfer/compute pipelines across GPU memory, CPU RAM, and optional NVMe storage. Only a compact active layer window is materialized on the GPU, while persistent training states are kept in CPU memory and cross-tier data movement is carefully scheduled throughout execution.

In our evaluation, SlideFormer achieves 1.40×–6.27× higher throughput than related offloading baselines, reduces peak GPU memory by up to 50% and peak CPU memory by up to 40%, supports up to 8× larger batch sizes, and enables full-parameter fine-tuning of 100B+ models on a single commodity GPU.

If this repository is useful to your work, please consider starring it.

News

  • 2026.06: We released SlideFormer with expanded model compatibility, reproducibility scripts, and a new fine-grained chunked pipeline. A multi-GPU extension of SlideFormer's heterogeneous runtime is under active development and will be documented soon.
  • 2026.03: SlideFormer was accepted to DAC 2026. A conference presentation page is available, and the paper is on arXiv (https://arxiv.org/abs/2603.16428).
  • Early 2025: The initial SlideFormer system was developed and evaluated; subsequent development has focused on system-level optimization.

Highlights

  • Layer-sliding execution: keeps only a compact active layer window on the GPU, allowing full-parameter fine-tuning of models that exceed GPU memory capacity.

  • Lightweight asynchronous engine: overlaps GPU computation with CPU optimizer updates, FP32-to-BF16 conversion and H2D parameter transfers, D2H gradient movement, activation offload/prefetch, and optional NVMe I/O.

  • Heterogeneous memory hierarchy: coordinates GPU memory, CPU RAM, and optional local NVMe storage to reduce memory pressure while preserving full-parameter mixed-precision training semantics.

  • Memory-efficient implementation: uses pre-allocated GPU cache units, layer-shared host buffers, and a layer-wise CPU LayerAdam path to reduce allocation overhead and peak GPU/CPU memory usage.
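
As a conceptual illustration of layer-sliding execution, the sketch below lists which layers would be resident on the GPU at each step of a forward pass. It is a toy model only: `sliding_window_schedule` and its `window` parameter are hypothetical names, not part of SlideFormer's API, and the real runtime overlaps transfers with compute asynchronously and also handles the backward pass.

```python
from collections import deque


def sliding_window_schedule(num_layers, window=2):
    """Yield (layer_to_compute, resident_layers) for one forward pass.

    Hypothetical helper: at most `window` layers are resident at any time.
    After a layer is computed it is evicted and the next layer is loaded.
    """
    resident = deque()
    next_to_load = 0

    # Pre-fill the window before the first layer runs.
    while next_to_load < min(window, num_layers):
        resident.append(next_to_load)
        next_to_load += 1

    for layer in range(num_layers):
        yield layer, list(resident)
        resident.popleft()  # evict the layer that was just computed
        if next_to_load < num_layers:
            resident.append(next_to_load)  # bring in the next layer
            next_to_load += 1


if __name__ == "__main__":
    for layer, resident in sliding_window_schedule(num_layers=4, window=2):
        print(f"compute layer {layer}, resident on GPU: {resident}")
```

The point of the sketch is that the number of resident layers is bounded by the window size rather than by the model depth.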

Installation

SlideFormer is designed for single-GPU machines with sufficient CPU memory. The exact memory requirement depends on the model size, sequence length, batch size, optimizer state, and offload configuration.

From our evaluation sweep, a practical estimate is about 12 GB of CPU RAM per 1B parameters, plus batch-dependent activation and buffer overhead. NVMe optimizer/activation offload can further reduce the CPU-memory requirement at the cost of lower throughput.
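
The rule of thumb above can be turned into a quick back-of-the-envelope calculation. This is a rough sketch, not a sizing tool: `estimate_cpu_ram_gb` is a hypothetical helper, and the overhead term is left as an input because it depends on batch size, sequence length, and offload configuration.

```python
def estimate_cpu_ram_gb(num_params_billion, gb_per_billion=12.0, overhead_gb=0.0):
    """Rough CPU RAM estimate from the ~12 GB per 1B parameters rule of thumb.

    `overhead_gb` stands in for the batch-dependent activation/buffer
    overhead, which this sketch does not try to model.
    """
    return num_params_billion * gb_per_billion + overhead_gb


if __name__ == "__main__":
    # Illustrative model sizes only.
    for size_b in (8, 100):
        base = estimate_cpu_ram_gb(size_b)
        print(f"{size_b}B parameters: ~{base:.0f} GB + activation/buffer overhead")
```

For an 8B model this gives roughly 96 GB before overhead, and roughly 1200 GB for a 100B model without NVMe offload.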

git clone https://github.com/RegiaYoung/SlideFormer.git
cd SlideFormer
conda env create -f environment.yml
conda activate slideformer

Core dependencies are torch, transformers, and datasets. Since the core runtime is mostly torch-native, SlideFormer can also run on AMD/ROCm GPUs with a compatible PyTorch/ROCm stack. Optional dependencies for acceleration/offload are flash-attn, liger-kernel, and tensornvme; the code falls back when they are not available. We recommend installing all optional dependencies to match the reported performance.
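
To see which optional dependencies are present in the active environment, a small check like the following can help. It is a sketch that assumes the usual import names for these packages (`flash_attn`, `liger_kernel`, `tensornvme`); `check_optional_dependencies` is a hypothetical helper and is not part of the repository.

```python
import importlib.util

# Assumed mapping from package name to import name.
OPTIONAL_MODULES = {
    "flash-attn": "flash_attn",
    "liger-kernel": "liger_kernel",
    "tensornvme": "tensornvme",
}


def check_optional_dependencies(modules=OPTIONAL_MODULES):
    """Return {package_name: True/False} without importing the packages."""
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in modules.items()
    }


if __name__ == "__main__":
    for package, available in check_optional_dependencies().items():
        status = "found" if available else "missing (fallback path will be used)"
        print(f"{package}: {status}")
```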

Quick Start

Run a dummy-data end-to-end sanity check:

python scripts/main_dummy.py \
  --model_path /path/to/Llama-3.1-8B-Instruct \
  --seq_len 1024 \
  --batch_size 64

On multi-NUMA machines, binding the process to the GPU's NUMA node can improve CPU-GPU bandwidth:

numactl --cpunodebind=0 --membind=0 python scripts/main_dummy.py
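
The node index 0 in the command above is only correct if the GPU is attached to NUMA node 0. On Linux, the node can be read from sysfs given the GPU's PCI bus ID (for example, as reported by `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`). The helpers below are a hypothetical sketch, not part of SlideFormer, and assume the PCI domain fits in the four hex digits that sysfs normally uses.

```python
from pathlib import Path


def to_sysfs_pci_address(bus_id):
    """Normalize an nvidia-smi style ID (00000000:65:00.0) to sysfs form (0000:65:00.0)."""
    domain, bus, dev_fn = bus_id.strip().lower().split(":")
    return f"{domain[-4:]}:{bus}:{dev_fn}"


def gpu_numa_node(bus_id, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node of a PCI device, or None if it is unknown."""
    path = Path(sysfs_root) / to_sysfs_pci_address(bus_id) / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    # The kernel reports -1 when no NUMA affinity is known.
    return node if node >= 0 else None


def numactl_prefix(node):
    """Build the numactl arguments that bind CPU and memory to one node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}"]


if __name__ == "__main__":
    node = gpu_numa_node("00000000:65:00.0")  # example bus ID, adjust for your machine
    if node is None:
        print("NUMA node unknown; run without numactl")
    else:
        print(" ".join(numactl_prefix(node) + ["python", "scripts/main_dummy.py"]))
```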

Useful options:

Option                    Description
--model_path              Local path or HuggingFace model ID
--seq_len                 Sequence length
--batch_size              Per-step batch size
--epochs                  Number of training epochs
--attn_implementation     flash_attention_2 or sdpa
--ac_offload_nvme         Offload saved activations to NVMe
--nvme_offload_fraction   Optimizer-state NVMe offload fraction
--offload_dir             Directory used when NVMe offload is enabled

Real Data Example

scripts/main_real.py provides a real fine-tuning example on QizhiPei/MathFusionQA.

python scripts/main_real.py \
  --model_path /path/to/Llama-3.1-8B-Instruct \
  --seq_len 4096 \
  --batch_size 16 \
  --epochs 1 \
  --output_dir ./mathfusion-ft-results

The script saves the trained model, tokenizer, loss curve, and loss CSV to --output_dir.

Correctness Validation

SlideFormer's training loss closely tracks a DeepSpeed ZeRO-3 CPU-offload reference on both 40 distinct real-data batches and a repeated-batch overfit probe.

[Figure: correctness loss curves]

Benchmarking

Single benchmark run:

python scripts/main_bench.py \
  --model_path /path/to/Llama-3.1-8B-Instruct \
  --seq_len 1024 \
  --batch_size 64 \
  --warm_step 3 \
  --test_step 10 \
  --result_file ./outputs/bench.csv

Sweep over configured models and batch sizes:

bash scripts/bench.sh

For reproducibility, baseline implementations and adapted configs are collected under bench/, including DeepSpeed, ColossalAI, MegaTrain, and LoHan. The supplemental figures below add a MegaTrain comparison alongside the paper baselines.

(a) Batch-size scaling and CPU memory for Llama-3.1-8B on RTX 4090.
(b) Model-size scaling and CPU memory for Qwen3 models on RTX 4090.
(c) GPU allocated memory vs. batch size for Llama-3.1-8B.
(d) Long-context training on RTX 4090. Left: sequence-length scaling for Llama-3.1-8B at batch size 1. Right: maximum trainable sequence length for Qwen3 models at batch size 1; labels above bars report TFLOPS at the corresponding maximum.

With FlashAttention, as used here, activation-related memory scales approximately with batch_size * seq_len. For example, SlideFormer shows nearly identical CPU memory and allocated GPU memory at bs=64, seq=1K and at bs=1, seq=64K. Compute scales differently: self-attention costs O(B * S^2), so increasing the sequence length adds substantially more attention work even at a fixed token budget and makes training more compute-intensive. As a result, long-context TFLOPS can exceed the 1K no-offload peak, which is a workload-specific reference rather than a hardware peak.
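
The arithmetic behind this comparison can be checked directly. The sketch below drops all constant factors (hidden size, head count, number of layers) and only compares the two configurations mentioned above, taking 1K as 1024 and 64K as 65536 tokens; the function names are illustrative.

```python
def token_count(batch_size, seq_len):
    """Tokens per step; activation memory scales roughly with this value."""
    return batch_size * seq_len


def attention_work(batch_size, seq_len):
    """Relative self-attention work, proportional to B * S^2 (constants dropped)."""
    return batch_size * seq_len ** 2


short_cfg = (64, 1024)      # bs=64, seq=1K
long_cfg = (1, 64 * 1024)   # bs=1, seq=64K

# Same token budget, so similar activation-related memory.
print(token_count(*short_cfg), token_count(*long_cfg))  # 65536 65536

# Attention work differs by a factor of 64 at the same token budget.
print(attention_work(*long_cfg) // attention_work(*short_cfg))  # 64
```

In general, at a fixed token budget B * S, the attention term grows linearly with S, which is why the long-context run is more compute-intensive.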

Note: For fairness, all evaluated systems use the same workload and enable the same fused kernels unless a framework already provides an equivalent implementation. SlideFormer and all baselines except MegaTrain follow mixed-precision training semantics with BF16 compute, FP32 master parameters, and FP32 optimizer states. In the official MegaTrain single-GPU path, CPU-resident model parameters are loaded and updated in BF16. Its CPU-memory footprints are therefore shown for reference only.

Technical Updates

Fine-grained Chunked Overlap

The 2026.06 release adds a chunked asynchronous transfer/update pipeline that splits long FP32-to-BF16 conversion, H2D parameter movement, D2H gradient return, and CPU Adam update into smaller overlapped segments. This reduces exposed transfer/update time under pipeline contention and improves Qwen3-8B single-GPU throughput, especially at small batch sizes.
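
To see why chunking reduces exposed time, consider a simplified pipeline model. This sketch is not SlideFormer's scheduler: it assumes each stage runs on its own resource, that work splits evenly across chunks, and that chunking adds no overhead. The stage durations are made-up numbers for illustration.

```python
def pipelined_makespan(stage_times, num_chunks):
    """Total time for work that flows through sequential stages in equal chunks.

    With identical chunks, the first chunk passes through every stage and
    each remaining chunk adds one slot of the slowest stage.
    """
    per_chunk = [t / num_chunks for t in stage_times]
    return sum(per_chunk) + (num_chunks - 1) * max(per_chunk)


if __name__ == "__main__":
    # Hypothetical per-layer stage times in milliseconds:
    # D2H gradient return, CPU Adam update, FP32-to-BF16 conversion, H2D transfer.
    stages_ms = [4.0, 8.0, 2.0, 4.0]
    for chunks in (1, 4, 8):
        print(f"{chunks} chunk(s): {pipelined_makespan(stages_ms, chunks)} ms")
    # 1 chunk(s): 18.0 ms
    # 4 chunk(s): 10.5 ms
    # 8 chunk(s): 9.25 ms
```

Under this model the total approaches the duration of the slowest stage as the number of chunks grows. In practice, per-chunk launch and synchronization costs limit how fine the chunks can usefully be.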

[Figure: chunked pipeline efficiency]

Supported Models

Some model families may require light adapter changes for full compatibility. We will continue expanding tested model support.

Repository Layout

SlideFormer/
├── offload_transformer.py      # Runtime engine and scheduling
├── transformer_layer.py        # Layer wrappers and CPU-GPU transfers
├── sliding_checkpoint.py       # Activation offload and prefetch
├── optimizer/                  # Layer-wise CPU Adam optimizer
├── utils/                      # Datasets, metrics, and monitor helpers
├── scripts/                    # Train, bench, and profile entries
├── bench/                      # Baselines and reproducibility artifacts
└── environment.yml             # Conda environment

Citation

If you use SlideFormer, its code, or its design ideas in your research, please cite our DAC 2026 paper.

@misc{yang2026efficientheterogeneouscodesignfinetuning,
      title={An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU},
      author={Ruijia Yang and Zeyi Wen},
      year={2026},
      eprint={2603.16428},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2603.16428},
}

License

This project is released under the Apache-2.0 License. See NOTICE for attribution and bundled third-party software information.
