El Barto Serve

"I didn't write it. Nobody saw me write it. You can't prove anything."

- Bart Simpson, on how diffusion models generate code

Watch with captions on the landing page

OpenAI-compatible API server for Stable-DiffCoder, ByteDance's mask-diffusion code LLM that spray-paints code through iterative denoising instead of boring left-to-right token generation.

Built for the NVIDIA DGX Spark (Grace Blackwell GB10, 128GB unified memory), but runs on any CUDA GPU.

Why This Exists

Stable-DiffCoder-8B-Instruct tops the benchmarks for 8B code models, beating Qwen2.5-Coder, CodeLlama, and every other diffusion LLM. But it uses a non-standard diffusion inference pipeline that no existing serving framework supports (not vLLM, not Ollama, not TensorRT-LLM).

El Barto wraps the custom diffusion generation in a standard /v1/chat/completions endpoint so you can use it with:

[Your Mac / VS Code]              [DGX Spark]
       |                                |
  Continue.dev  ----HTTP:8000---->  El Barto Serve
  Open WebUI                        (Stable-DiffCoder)
  curl

Quick Start

Option A: Native Install (DGX Spark)

git clone https://github.com/NathanMaine/el-barto-serve.git
cd el-barto-serve

# Automated setup (creates venv, installs CUDA 13.0 PyTorch, deps)
./setup-spark.sh

# Activate and run
source .venv/bin/activate
python server.py

Option B: Docker (NGC Container)

docker build -t el-barto-serve .
docker run -it --gpus all \
  -p 8000:8000 \
  -e ELBARTO_MODEL_PATH=/models/Stable-DiffCoder-8B-Instruct \
  -v /path/to/your/model:/models/Stable-DiffCoder-8B-Instruct \
  el-barto-serve

Option C: Other CUDA GPUs

git clone https://github.com/NathanMaine/el-barto-serve.git
cd el-barto-serve
python -m venv .venv && source .venv/bin/activate
pip install torch  # Standard PyTorch for your GPU
pip install -r requirements.txt
python server.py

Usage

Test with curl

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stable-diffcoder",
    "messages": [{"role": "user", "content": "Write a binary search in Python"}],
    "temperature": 0.0
  }'
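The same request can be sent from Python with only the standard library. This is a sketch: `build_chat_request` is a helper name of our own, and the response access in `send` assumes the standard OpenAI chat-completions schema the server advertises.

```python
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the same POST the curl example above sends."""
    body = {
        "model": "stable-diffcoder",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def send(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the request and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# With the server running:
# print(send("Write a binary search in Python"))
```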

Connect from VS Code (Continue.dev)

  1. Install the Continue extension
  2. Edit ~/.continue/config.yaml
  3. Add El Barto as a model:
models:
  - name: El Barto (Stable-DiffCoder)
    provider: openai
    model: stable-diffcoder
    apiBase: http://YOUR_SPARK_IP:8000/v1
    apiKey: not-needed
    requestOptions:
      extraBodyProperties:
        max_tokens: 384

Tip: Cap max_tokens at ~384. Diffusion models fill their entire token budget; without a cap, you'll get gibberish noise after the actual code ends.

See examples/continue-config.yaml for the full config.

Configuration

All settings are set via environment variables (or a .env file; copy from .env.example):

| Variable | Default | Description |
| --- | --- | --- |
| `ELBARTO_MODEL_PATH` | `ByteDance-Seed/Stable-DiffCoder-8B-Instruct` | Local path or HuggingFace model ID |
| `ELBARTO_MODEL_REVISION` | `None` | Pin to a specific model revision (commit hash) for reproducibility |
| `ELBARTO_HOST` | `0.0.0.0` | Bind address |
| `ELBARTO_PORT` | `8000` | Server port |
| `ELBARTO_API_KEY` | (empty) | API key for auth; clients send it as `Authorization: Bearer <key>` |
| `ELBARTO_STEPS` | `256` | Diffusion denoising steps (more = higher quality, slower) |
| `ELBARTO_GEN_LENGTH` | `512` | Max output tokens |
| `ELBARTO_BLOCK_LENGTH` | `4` | Block diffusion granularity |
| `ELBARTO_THRESHOLD` | `None` | Early-stopping confidence (0.0-1.0); lower = faster |
| `ELBARTO_REMASKING` | `low_confidence` | Remasking strategy (`low_confidence` or `random`) |
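Putting a few of these together, a .env for a locally downloaded model running in fast mode might look like this (paths illustrative):

```shell
# .env - every value is optional; unset variables fall back to the defaults above
ELBARTO_MODEL_PATH=/home/YOUR_USER/models/Stable-DiffCoder-8B-Instruct
ELBARTO_PORT=8000
ELBARTO_STEPS=128
ELBARTO_THRESHOLD=0.5
ELBARTO_REMASKING=low_confidence
```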

Tuning for Speed vs Quality

# Maximum quality (slow: 512 steps, no early stopping)
ELBARTO_STEPS=512 ELBARTO_THRESHOLD= python server.py

# Balanced (default)
ELBARTO_STEPS=256 python server.py

# Fast mode (fewer steps + early stopping)
ELBARTO_STEPS=128 ELBARTO_THRESHOLD=0.5 python server.py

# Fastest (aggressive early stopping: "eat my shorts" mode)
ELBARTO_STEPS=64 ELBARTO_THRESHOLD=0.3 python server.py

Running in Production

Background Mode

nohup python server.py > /tmp/elbarto.log 2>&1 &
tail -f /tmp/elbarto.log  # Watch logs

Download Model Locally

With the default ELBARTO_MODEL_PATH, the model is auto-downloaded from HuggingFace on first run. To pre-download or use a local copy:

# Option A: Pre-download from HuggingFace
pip install huggingface-hub
huggingface-cli download ByteDance-Seed/Stable-DiffCoder-8B-Instruct \
  --local-dir ~/models/Stable-DiffCoder-8B-Instruct

# Option B: Copy from network/NAS storage
rsync -ah --progress /path/to/Stable-DiffCoder-8B-Instruct/ \
  ~/models/Stable-DiffCoder-8B-Instruct/

Then set the path in .env:

ELBARTO_MODEL_PATH=/home/YOUR_USER/models/Stable-DiffCoder-8B-Instruct

Tip: On DGX Spark, always copy models to local NVMe. NFS is ~120 MB/s for sequential reads but much slower for the random access patterns during inference.

DGX Spark Notes

For a full step-by-step setup guide (SSH, NAS mounts, LM Studio, Continue.dev), see docs/dgx-spark-setup-guide.md.

Things we learned so the Spark doesn't have a cow:

  • Flash Attention is broken on SM 12.1 - El Barto auto-patches to use PyTorch's native SDPA, which is actually ~2% faster on Blackwell with cuDNN 9.13+. No action needed.
  • PyTorch CUDA capability warning - PyTorch may warn that the GB10 (SM 12.1) exceeds its officially supported range (8.0-12.0). This is harmless; the CUDA 13.0 wheels work correctly. The server suppresses this warning automatically.
  • CUDA 13.0 required - The setup script handles this. Don't use standard PyTorch pip wheels.
  • Python 3.12.x recommended - 3.13.x has known issues on Spark.
  • Unified memory is your friend - The 15GB model leaves ~113GB free. No CPU-to-GPU transfer overhead.
  • Static memory footprint - Unlike autoregressive models with growing KV caches, diffusion operates on fixed-size tensors. No OOM surprises mid-generation.
  • Performance bug workaround - If throughput suddenly drops 50% with the GPU stuck at ~14W, do a full AC power cycle (unplug from the wall for 60 seconds). This is a known firmware issue.

How Diffusion Code Generation Works

Traditional LLMs generate code left-to-right, one token at a time. Stable-DiffCoder works differently:

  1. Mask - Start with the full output length filled with [MASK] tokens
  2. Denoise - Iteratively predict and unmask the most confident tokens
  3. Refine - Each step reveals more of the code, like graffiti appearing on a wall

This "any-order" generation means the model can consider the full structure simultaneously, making it naturally better at maintaining syntax, matching brackets, and reasoning about code structure.

Step 0:   [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] ...
Step 32:  def    [MASK] sort   [MASK] [MASK] :      ...
Step 64:  def    quick  sort   (      arr    :      ...
Step 128: def    quick  sort   (      arr    :    list) -> list: ...

API Reference

POST /v1/chat/completions

Standard OpenAI chat completions format. Supports both streaming and non-streaming.

Extra fields for diffusion control (pass via request body):

{
  "steps": 256,
  "gen_length": 512,
  "block_length": 4,
  "threshold": null,
  "remasking": "low_confidence"
}
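Per-request overrides can be built like this. A sketch: the field names are the ones listed above, and `diffusion_payload` is a helper name of our own.

```python
import json

def diffusion_payload(prompt, steps=128, threshold=0.5, remasking="low_confidence"):
    """Chat-completions body with per-request diffusion controls."""
    return {
        "model": "stable-diffcoder",
        "messages": [{"role": "user", "content": prompt}],
        "steps": steps,          # overrides ELBARTO_STEPS for this request
        "gen_length": 512,
        "block_length": 4,
        "threshold": threshold,  # overrides ELBARTO_THRESHOLD
        "remasking": remasking,
    }

print(json.dumps(diffusion_payload("Write FizzBuzz in Python"), indent=2))
```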

GET /v1/models

List available models.

GET /health

Health check with model status and device info.
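Since the first model load can take a while, a readiness poll against /health is handy, e.g. right after starting the server in background mode. A sketch using only the standard library; nothing about the response body is assumed beyond it being valid JSON.

```python
import json
import time
import urllib.request

def wait_for_healthy(base_url="http://localhost:8000", tries=30, delay=2.0):
    """Poll GET /health until the server answers; return the parsed response."""
    for _ in range(tries):
        try:
            with urllib.request.urlopen(f"{base_url}/health") as resp:
                return json.loads(resp.read())
        except OSError:
            time.sleep(delay)  # server not up yet (or model still loading)
    raise RuntimeError(f"no healthy response from {base_url} after {tries} tries")
```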

Benchmarks

Stable-DiffCoder-8B-Instruct vs other ~8B models:

| Model | HumanEval | MBPP | MHPP | BigCodeBench |
| --- | --- | --- | --- | --- |
| Qwen2.5-Coder-7B-Instruct | 88.4 | 83.5 | 26.7 | 48.8 |
| Seed-Coder-8B-Instruct | 84.8 | 85.2 | 36.2 | 53.3 |
| Stable-DiffCoder-8B-Instruct | 86.6 | 85.7 | 42.4 | 54.8 |

License

MIT


El Barto was here.
