El Barto Serve

"I didn't write it. Nobody saw me write it. You can't prove anything."

- Bart Simpson, on how diffusion models generate code

Watch with captions on the landing page

OpenAI-compatible API server for Stable-DiffCoder, ByteDance's mask-diffusion code LLM that spray-paints code through iterative denoising instead of boring left-to-right token generation.

Built for the NVIDIA DGX Spark (Grace Blackwell GB10, 128GB unified memory), but runs on any CUDA GPU.

Why This Exists

Stable-DiffCoder-8B-Instruct tops the benchmarks for 8B code models, beating Qwen2.5-Coder, CodeLlama, and every other diffusion LLM. But it uses a non-standard diffusion inference pipeline that no existing serving framework supports (not vLLM, not Ollama, not TensorRT-LLM).

El Barto wraps the custom diffusion generation in a standard /v1/chat/completions endpoint so you can use it with:

[Your Mac / VS Code]              [DGX Spark]
       |                                |
  Continue.dev  ----HTTP:8000---->  El Barto Serve
  Open WebUI                        (Stable-DiffCoder)
  curl

Quick Start

Option A: Native Install (DGX Spark)

git clone https://github.com/NathanMaine/el-barto-serve.git
cd el-barto-serve

# Automated setup (creates venv, installs CUDA 13.0 PyTorch, deps)
./setup-spark.sh

# Activate and run
source .venv/bin/activate
python server.py

Option B: Docker (NGC Container)

docker build -t el-barto-serve .
docker run -it --gpus all \
  -p 8000:8000 \
  -e ELBARTO_MODEL_PATH=/models/Stable-DiffCoder-8B-Instruct \
  -v /path/to/your/model:/models/Stable-DiffCoder-8B-Instruct \
  el-barto-serve

Option C: Other CUDA GPUs

git clone https://github.com/NathanMaine/el-barto-serve.git
cd el-barto-serve
python -m venv .venv && source .venv/bin/activate
pip install torch  # Standard PyTorch for your GPU
pip install -r requirements.txt
python server.py

Usage

Test with curl

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stable-diffcoder",
    "messages": [{"role": "user", "content": "Write a binary search in Python"}],
    "temperature": 0.0
  }'
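The same request can be sent from Python with only the standard library. This is a sketch: `build_chat_request` is a helper name of our own, and the response access in `send` assumes the standard OpenAI chat-completions schema the server advertises.

```python
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the same POST the curl example above sends."""
    body = {
        "model": "stable-diffcoder",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def send(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the request and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# With the server running:
# print(send("Write a binary search in Python"))
```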

Connect from VS Code (Continue.dev)

  1. Install the Continue extension
  2. Edit ~/.continue/config.yaml
  3. Add El Barto as a model:
models:
  - name: El Barto (Stable-DiffCoder)
    provider: openai
    model: stable-diffcoder
    apiBase: http://YOUR_SPARK_IP:8000/v1
    apiKey: not-needed
    requestOptions:
      extraBodyProperties:
        max_tokens: 384

Tip: Cap max_tokens at ~384. Diffusion models fill their entire token budget; without a cap, you'll get gibberish noise after the actual code ends.

See examples/continue-config.yaml for the full config.

Configuration

All settings are set via environment variables (or a .env file; copy from .env.example):

| Variable | Default | Description |
| --- | --- | --- |
| `ELBARTO_MODEL_PATH` | `ByteDance-Seed/Stable-DiffCoder-8B-Instruct` | Local path or HuggingFace model ID |
| `ELBARTO_MODEL_REVISION` | `None` | Pin to a specific model revision (commit hash) for reproducibility |
| `ELBARTO_HOST` | `0.0.0.0` | Bind address |
| `ELBARTO_PORT` | `8000` | Server port |
| `ELBARTO_API_KEY` | (empty) | API key for auth; clients send it as `Authorization: Bearer <key>` |
| `ELBARTO_STEPS` | `256` | Diffusion denoising steps (more = higher quality, slower) |
| `ELBARTO_GEN_LENGTH` | `512` | Max output tokens |
| `ELBARTO_BLOCK_LENGTH` | `4` | Block diffusion granularity |
| `ELBARTO_THRESHOLD` | `None` | Early-stopping confidence (0.0-1.0); lower = faster |
| `ELBARTO_REMASKING` | `low_confidence` | Remasking strategy (`low_confidence` or `random`) |
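Putting a few of these together, a .env for a locally downloaded model running in fast mode might look like this (paths illustrative):

```shell
# .env - every value is optional; unset variables fall back to the defaults above
ELBARTO_MODEL_PATH=/home/YOUR_USER/models/Stable-DiffCoder-8B-Instruct
ELBARTO_PORT=8000
ELBARTO_STEPS=128
ELBARTO_THRESHOLD=0.5
ELBARTO_REMASKING=low_confidence
```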

Tuning for Speed vs Quality

# Maximum quality (slow: 512 steps, no early stopping)
ELBARTO_STEPS=512 ELBARTO_THRESHOLD= python server.py

# Balanced (default)
ELBARTO_STEPS=256 python server.py

# Fast mode (fewer steps + early stopping)
ELBARTO_STEPS=128 ELBARTO_THRESHOLD=0.5 python server.py

# Fastest (aggressive early stopping: "eat my shorts" mode)
ELBARTO_STEPS=64 ELBARTO_THRESHOLD=0.3 python server.py

Running in Production

Background Mode

nohup python server.py > /tmp/elbarto.log 2>&1 &
tail -f /tmp/elbarto.log  # Watch logs

Download Model Locally

With the default ELBARTO_MODEL_PATH, the model is auto-downloaded from HuggingFace on first run. To pre-download or use a local copy:

# Option A: Pre-download from HuggingFace
pip install huggingface-hub
huggingface-cli download ByteDance-Seed/Stable-DiffCoder-8B-Instruct \
  --local-dir ~/models/Stable-DiffCoder-8B-Instruct

# Option B: Copy from network/NAS storage
rsync -ah --progress /path/to/Stable-DiffCoder-8B-Instruct/ \
  ~/models/Stable-DiffCoder-8B-Instruct/

Then set the path in .env:

ELBARTO_MODEL_PATH=/home/YOUR_USER/models/Stable-DiffCoder-8B-Instruct

Tip: On DGX Spark, always copy models to local NVMe. NFS is ~120 MB/s for sequential reads but much slower for the random access patterns during inference.

DGX Spark Notes

For a full step-by-step setup guide (SSH, NAS mounts, LM Studio, Continue.dev), see docs/dgx-spark-setup-guide.md.

Things we learned so the Spark doesn't have a cow:

  • Flash Attention is broken on SM 12.1 - El Barto auto-patches to use PyTorch's native SDPA, which is actually ~2% faster on Blackwell with cuDNN 9.13+. No action needed.
  • PyTorch CUDA capability warning - PyTorch may warn that the GB10 (SM 12.1) exceeds its officially supported range (8.0-12.0). This is harmless; the CUDA 13.0 wheels work correctly. The server suppresses this warning automatically.
  • CUDA 13.0 required - The setup script handles this. Don't use standard PyTorch pip wheels.
  • Python 3.12.x recommended - 3.13.x has known issues on Spark.
  • Unified memory is your friend - The 15GB model leaves ~113GB free. No CPU-to-GPU transfer overhead.
  • Static memory footprint - Unlike autoregressive models with growing KV caches, diffusion operates on fixed-size tensors. No OOM surprises mid-generation.
  • Performance bug workaround - If throughput suddenly drops 50% with the GPU stuck at ~14W, do a full AC power cycle (unplug from the wall for 60 seconds). This is a known firmware issue.

How Diffusion Code Generation Works

Traditional LLMs generate code left-to-right, one token at a time. Stable-DiffCoder works differently:

  1. Mask - Start with the full output length filled with [MASK] tokens
  2. Denoise - Iteratively predict and unmask the most confident tokens
  3. Refine - Each step reveals more of the code, like graffiti appearing on a wall

This "any-order" generation means the model can consider the full structure simultaneously, making it naturally better at maintaining syntax, matching brackets, and reasoning about code structure.

Step 0:   [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] ...
Step 32:  def    [MASK] sort   [MASK] [MASK] :      ...
Step 64:  def    quick  sort   (      arr    :      ...
Step 128: def    quick  sort   (      arr    :    list) -> list: ...

API Reference

POST /v1/chat/completions

Standard OpenAI chat completions format. Supports both streaming and non-streaming.

Extra fields for diffusion control (pass via request body):

{
  "steps": 256,
  "gen_length": 512,
  "block_length": 4,
  "threshold": null,
  "remasking": "low_confidence"
}
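Per-request overrides can be built like this. A sketch: the field names are the ones listed above, and `diffusion_payload` is a helper name of our own.

```python
import json

def diffusion_payload(prompt, steps=128, threshold=0.5, remasking="low_confidence"):
    """Chat-completions body with per-request diffusion controls."""
    return {
        "model": "stable-diffcoder",
        "messages": [{"role": "user", "content": prompt}],
        "steps": steps,          # overrides ELBARTO_STEPS for this request
        "gen_length": 512,
        "block_length": 4,
        "threshold": threshold,  # overrides ELBARTO_THRESHOLD
        "remasking": remasking,
    }

print(json.dumps(diffusion_payload("Write FizzBuzz in Python"), indent=2))
```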

GET /v1/models

List available models.

GET /health

Health check with model status and device info.
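Since the first model load can take a while, a readiness poll against /health is handy, e.g. right after starting the server in background mode. A sketch using only the standard library; nothing about the response body is assumed beyond it being valid JSON.

```python
import json
import time
import urllib.request

def wait_for_healthy(base_url="http://localhost:8000", tries=30, delay=2.0):
    """Poll GET /health until the server answers; return the parsed response."""
    for _ in range(tries):
        try:
            with urllib.request.urlopen(f"{base_url}/health") as resp:
                return json.loads(resp.read())
        except OSError:
            time.sleep(delay)  # server not up yet (or model still loading)
    raise RuntimeError(f"no healthy response from {base_url} after {tries} tries")
```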

Benchmarks

Stable-DiffCoder-8B-Instruct vs other ~8B models:

| Model | HumanEval | MBPP | MHPP | BigCodeBench |
| --- | --- | --- | --- | --- |
| Qwen2.5-Coder-7B-Instruct | 88.4 | 83.5 | 26.7 | 48.8 |
| Seed-Coder-8B-Instruct | 84.8 | 85.2 | 36.2 | 53.3 |
| Stable-DiffCoder-8B-Instruct | 86.6 | 85.7 | 42.4 | 54.8 |

License

MIT


El Barto was here.
