skyllm

Cheap, on-demand cloud GPU running an OpenAI-compatible vLLM or llama.cpp endpoint, reachable from any tool via a stable public URL.

One skyllm up spins up a 24 GB+ NVIDIA GPU on RunPod, starts the engine selected by the model you launched (vLLM for safetensors/AWQ/GPTQ, llama.cpp for GGUF), and exposes it through a Cloudflare Tunnel at a hostname you control. Clients point at https://llm.yourdomain.com/v1 forever — the actual GPU comes and goes, the URL stays.

Why

Run bigger models than your local GPU can handle, without paying for a 24/7 cloud instance. Designed for occasional home use: spin up, poke at a model for an hour, tear down, pay cents.

Single-user design. This project is optimized for one person using one model at a time, especially the llama.cpp variants, which are tuned for serving a single request stream. You can configure it for concurrent users, but that's not currently the goal. If you need multi-user serving with request queuing, consider a managed API (Together, Fireworks, Modal).

Stack

Piece	What it does
SkyPilot	Provisions the GPU on RunPod, handles autostop/teardown
vLLM or llama.cpp	Serves the model with an OpenAI-compatible API. Engine is selected per catalog entry — vLLM for safetensors/AWQ/GPTQ, llama.cpp for GGUF (incl. CPU-offloaded MoE). vLLM runs `vllm/vllm-openai:latest` directly; llama.cpp installs via pixi on the pod. No custom Docker image to maintain either way.
Cloudflare Tunnel	Gives you a stable public URL without opening ports

Safeguards against surprise bills

Belt, suspenders, and a third belt:

Idle auto-shutdown. scripts/idle-watch.sh watches vLLM's Prometheus metrics (vllm:generation_tokens_total); when no tokens have been generated for $IDLE_MINUTES (default 15), it exits the SkyPilot run block. Combined with sky launch --down, this terminates the cluster.
Wall-clock cap. sudo shutdown -h +$MAX_RUNTIME_MINUTES runs at launch (4 h default on sky.yaml, 1 h on the 80 GB preset since hourly rates are several × higher). Even if the idle-watcher wedges, the box powers off.
SkyPilot autostop. --idle-minutes-to-autostop 30 --down tells SkyPilot itself to terminate the cluster if the whole job finishes and nothing takes its place.
Monthly budget check. scripts/budget-check.sh is cron-able on your laptop; it reads sky cost-report and runs sky down if you've spent over $MONTHLY_BUDGET_USD this month.
Provider-side spend limit (the real backstop). Set a hard monthly limit at https://www.runpod.io/console/user/billing. The other safeguards protect against mistakes; this one protects against bugs in the other safeguards.
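
The idle auto-shutdown safeguard above boils down to watching one Prometheus counter against a clock. The sketch below is written in Python for illustration and is not the contents of scripts/idle-watch.sh; `read_counter` and `IdleTracker` are hypothetical names. It assumes the /metrics text carries no per-sample timestamps.

```python
import time


def read_counter(metrics_text: str, name: str = "vllm:generation_tokens_total") -> float:
    """Sum every sample of one counter in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumption: no trailing timestamp, so the last field is the value.
        metric, _, value = line.rpartition(" ")
        # A sample is either `name value` or `name{labels} value`.
        if metric == name or metric.startswith(name + "{"):
            total += float(value)
    return total


class IdleTracker:
    """Tracks how long the counter has gone without changing."""

    def __init__(self, idle_minutes: float = 15, now=time.monotonic):
        self.idle_seconds = idle_minutes * 60
        self.now = now
        self.last_total = None
        self.last_change = now()

    def is_idle(self, total: float) -> bool:
        # Any change in the counter (or the first reading) resets the clock.
        if self.last_total is None or total != self.last_total:
            self.last_total = total
            self.last_change = self.now()
        return self.now() - self.last_change >= self.idle_seconds
```

A watcher loop would fetch the metrics text every minute or so, feed `read_counter(...)` into `is_idle(...)`, and exit once it returns True, which is the point where sky launch --down takes over.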

Setup

This section walks you through everything from zero to a working endpoint. If you're already familiar with Cloudflare and RunPod, you can skim.

Prerequisites

Requirement	Why you need it	How to get it
Cloudflare account	Provides a stable public URL via Tunnel (no port forwarding needed)	Free at https://dash.cloudflare.com/sign-up
A domain on Cloudflare	The tunnel routes `llm.yourdomain.com` to your pod	~$10/yr domain registration, or use a free subdomain on a domain you already manage. The domain must be managed by Cloudflare (DNS settings → nameservers point to CF).
RunPod account	Spins up the GPU on demand	Sign up at https://www.runpod.io/ and add a payment method
pixi	Manages the local CLI environment (single static binary, no Python install needed)	https://pixi.sh/latest/ — one-liner install on Linux/macOS
SkyPilot CLI	Provisions the GPU on RunPod	Installed automatically by `pixi install` (step 5) — listed as a dependency in `pyproject.toml`
Docker (optional)	Only needed if SkyPilot asks for it	Most setups work without it

Step 1 — Install pixi

# Install pixi (Linux/macOS one-liner):
curl -fsSL https://pixi.sh/install.sh | sh
# Then restart your shell or run: source ~/.bashrc (or ~/.zshrc)

SkyPilot (with RunPod support) is declared in pyproject.toml, so pixi install in step 5 will pull it into the local env automatically — no separate pip install needed.

Step 2 — Configure RunPod

1. Go to https://www.runpod.io/console/user/settings → API Keys → Create New Key.
2. Copy the key; you'll paste it into .env (step 4).
3. Set a monthly spend limit (non-optional; it protects you from surprise bills): go to https://www.runpod.io/console/user/billing and cap monthly spend at whatever you're willing to lose. $20/mo is plenty for occasional home use.

Step 3 — Create a Cloudflare Tunnel

This gives you a stable public URL (llm.yourdomain.com) that always points to your pod, even though the pod itself comes and goes.

1. Go to https://one.dash.cloudflare.com/ → Networks → Tunnels → Create a tunnel.
2. Choose connector type: Cloudflared.
3. Name it something like llm-gpu.
4. Under Public Hostname, add a route:
   - Subdomain: llm (or whatever you like; this becomes llm.yourdomain.com)
   - Domain: your Cloudflare-managed domain
   - Service type: HTTP
   - URL: localhost:8080
5. Click Save tunnel.
6. Go to the Tunnels page, click your tunnel name, then Public Hostname → Edit → scroll to Token.
7. Copy the token (a long base64 string); you'll paste it into .env (step 4).

💡 Cloudflare auto-creates the DNS record for you. The hostname is now permanently pointed at whichever machine runs cloudflared with that token. You don't need to do anything with DNS manually.

Step 4 — Fill in `.env`

cp .env.example .env

Edit .env and set these four values:

Variable	Where to get it	Example
`LLM_HOSTNAME`	Your chosen hostname	`llm.yourdomain.com`
`CF_TUNNEL_TOKEN`	Cloudflare Tunnel page (step 3, item 7)	`abc123+longbase64string==`
`LLM_API_KEY`	Generate with `openssl rand -hex 32`	`a1b2c3d4...` (64 hex chars)
`RUNPOD_API_KEY`	RunPod settings → API Keys (step 2, item 2)	`pod-abc123...`

Generate a strong API key:

openssl rand -hex 32

⚠️ LLM_API_KEY is the only thing gating your endpoint from the public internet. The Cloudflare Tunnel routes traffic to your pod but does not authenticate clients — anyone who resolves the hostname can probe it. A strong random key (the openssl command produces 256 bits of entropy) is what keeps scanners out. Do not use a short or memorable string. If you want edge-level auth (Cloudflare Access, etc.), see docs/roadmap/edge-auth.md.
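
If openssl is not at hand, Python's standard secrets module can produce an equivalent key. This is a minimal sketch; `new_api_key` is a hypothetical helper and not part of the skyllm CLI.

```python
import secrets


def new_api_key(num_bytes: int = 32) -> str:
    """Return num_bytes of OS randomness as hex (32 bytes = 256 bits = 64 hex chars)."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Paste the printed value into .env as LLM_API_KEY.
    print(new_api_key())
```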

Optional: if you plan to use gated HuggingFace models (Llama, Gemma, Mistral-Instruct, etc.), add your HF token:

HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

Step 5 — Create the local environment and launch

# Create and install the local CLI environment
pixi install

# Drop into the environment (optional — you can also prefix every command with `pixi run`)
pixi shell

# Launch the default model (qwen-0.5b, vLLM, 24 GB tier) — fast stack-test
skyllm up

# Or pick any model from the catalog:
skyllm list          # see all available models
skyllm up qwen3.6-27b   # 27B dense VLM on 24 GB, ~40 tok/s

First launch takes ~5 minutes (provisioning + image pull + model download). The vllm/vllm-openai image is ~10 GB — the first pull is slow, but it's cached by RunPod thereafter.

Step 6 — Use it

Because the endpoint speaks the OpenAI API format, it plugs into virtually any consumer tool that accepts an OpenAI-compatible base URL. Just point the tool at https://llm.yourdomain.com/v1 and supply your LLM_API_KEY.

Popular options:

Tool	What it is	How to connect
Open WebUI	Full-featured browser chat UI (Ollama-compatible)	Add a new OpenAI-compatible provider with your hostname + API key
Cherry Studio	Desktop chat client with multi-provider support	Add OpenAI provider, set base URL and key
AnythingLLM	RAG chat with document upload	Add OpenAI endpoint in settings
FastChat	Web UI for chatting with LLMs	Set `--server-base-url` to your hostname
Any OpenAI SDK client	Your own scripts, bots, automations	`base_url="https://llm.yourdomain.com/v1"`, `api_key="..."`

Every tool needs the same two values:

Base URL: https://llm.yourdomain.com/v1
API key: the LLM_API_KEY you set in .env

Quick curl test

From anywhere on the internet:

curl https://llm.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer $LLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "any", "messages": [{"role": "user", "content": "hi"}]}'

Usage with the OpenAI Python SDK

from openai import OpenAI
client = OpenAI(base_url="https://llm.yourdomain.com/v1", api_key="<your LLM_API_KEY>")
response = client.chat.completions.create(model="any", messages=[{"role": "user", "content": "hi"}])
print(response.choices[0].message.content)

Step 7 — Tear down when done

skyllm down

If you forget, the safeguards (idle auto-shutdown, wall-clock cap, budget check) will eventually shut it down. But skyllm down is instant and saves you pennies per minute.

That's it! You now have a stable public URL that spins up a GPU on demand. Read on for daily-use commands, bigger models, and cost-saving tips.

Daily use

All commands below are pixi run skyllm <cmd> (drop the pixi run prefix inside pixi shell). The cli env is the default pixi environment at the repo root, so -e <name> is never needed.

Command	What it does
`skyllm --help`	List all commands
`skyllm list`	List available models (name / engine / tier / HF repo)
`skyllm up [<model>]`	Launch GPU + start serving. Default model: `qwen-0.5b`. `--dry-run` prints the resolved `sky launch` command
`skyllm down`	Terminate cluster
`skyllm status`	Is it running?
`skyllm logs`	Tail engine + cloudflared logs
`skyllm health`	Hit the public URL and confirm it responds
`skyllm cost`	SkyPilot's running cost report
`skyllm budget`	Run the budget guard once (also cron-able)

Each model has its own directory, models/<name>/, holding a model.yaml; together these form the "catalog". Run skyllm list to see what's available, or add a new model by dropping in another directory with a model.yaml conforming to skyllm/schema.py (auto-discovered, no registration step). Model identity is not set in .env.

Bigger models

The default qwen-0.5b model is a 0.5B toy — fine for testing the pipeline, useless for real work. To launch something bigger, either pick an existing model (skyllm list):

pixi run skyllm up qwen3.6-27b   # 27B dense VLM on 24 GB, ~40 tok/s

…or add a new model by dropping a models/<name>/model.yaml (pixi run validate checks the schema). Gated HF models (Llama, Gemma, Mistral-Instruct, etc.) need HF_TOKEN=... in .env. Run skyllm down && skyllm up <name> to apply the change.

If the model is larger than a few GB, re-downloading it on every launch gets annoying. Add an HF cache bucket:

Pick any S3/GCS-compatible bucket you control (Cloudflare R2 is cheap and you already have a CF account).

Add to whichever preset YAML the catalog entry resolves to (one of the files in sky/):

file_mounts:
  ~/.cache/huggingface:
    name: <your-bucket-name>
    store: r2   # or s3, gcs
    mode: MOUNT

First launch caches the download into the bucket; subsequent launches mount the bucket and skip the download.

Engine presets

Four sibling YAMLs cover the (engine, tier) matrix. skyllm up <model> picks the right one from each catalog entry's engine + tier fields:

YAML	Engine	GPU tier	Used when the catalog entry has…
`sky.yaml`	vLLM	24 GB	`engine: vllm`, `tier: 24gb` (default for stack-test)
`sky-llamacpp.yaml`	llama.cpp	24 GB	`engine: llamacpp`, `tier: 24gb`
`sky-llamacpp-cpumoe.yaml`	llama.cpp	24 GB + CPU-offloaded MoE	`engine: llamacpp`, `tier: 24gb-cpumoe`
`sky-llamacpp-80gb.yaml`	llama.cpp	80 GB pure-GPU	`engine: llamacpp`, `tier: 80gb`
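
The table above amounts to a lookup keyed on (engine, tier). A minimal sketch of that resolution follows; `PRESETS` and `resolve_preset` are hypothetical names rather than the code in skyllm/cli.py, and the sky/ directory default is taken from the repo layout.

```python
from pathlib import Path

# (engine, tier) -> preset file, copied from the table above.
PRESETS = {
    ("vllm", "24gb"): "sky.yaml",
    ("llamacpp", "24gb"): "sky-llamacpp.yaml",
    ("llamacpp", "24gb-cpumoe"): "sky-llamacpp-cpumoe.yaml",
    ("llamacpp", "80gb"): "sky-llamacpp-80gb.yaml",
}


def resolve_preset(engine: str, tier: str, sky_dir: str = "sky") -> Path:
    """Return the preset path for a catalog entry, or raise on an unsupported pair."""
    try:
        return Path(sky_dir) / PRESETS[(engine, tier)]
    except KeyError:
        supported = ", ".join(f"{e}/{t}" for e, t in sorted(PRESETS))
        raise ValueError(f"no preset for {engine}/{tier}; supported: {supported}") from None
```

Failing loudly on an unknown pair (for example vllm with tier 80gb, which has no preset in the table) is a design choice of this sketch; it keeps a typo in model.yaml from silently launching the wrong GPU tier.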

See docs/alternatives.md for why we don't pin a custom Docker image on RunPod.

Scaling up to bigger GPUs

The 24 GB tier (RTX 3090/4090/A5000/A6000/L40S) comfortably fits models up to ~14B at Q4 or ~7B at Q8, and a dense 27B at Q4_K_M (the catalog's qwen3.6-27b) still fits. For bigger MoE models, two paths are wired up; pick based on cost vs. speed:

tier: 24gb-cpumoe — cheap 24 GB card + ~96 GB system RAM, expert weights offloaded to CPU. Order-of-magnitude slower than pure-GPU but 3–5× cheaper per hour and far better availability. Good for correctness smoke tests.
tier: 80gb — A100-80GB or H100, everything in VRAM. Fast (~100 tok/s gen on Qwen3-Coder-Next MXFP4) but several × more expensive and availability-constrained.

pixi run skyllm up qwen3.6-27b        # 24 GB dense (fits on 3090/4090)

The 80 GB preset ships with a shorter MAX_RUNTIME_MINUTES default (60 vs 240) because hourly costs are several × higher: at the upper end of the price table below (~$4.50/hr), an H100 that wedges overnight burns roughly $50, and one forgotten over a weekend passes $200. Everything else (tunnel, auth, idle-watch, budget-check) is identical.

Rough fit table. All prices are for RunPod Secure Cloud (SkyPilot's RunPod catalog is Secure-Cloud-only by design, so there's no "random host with root" in the data path — just RunPod itself):

Tier	GPU options	Models that fit	~$/hr
`24gb`	3090/4090/A5000/A6000/L40S	≤8B FP16, ≤13B FP8/AWQ/GPTQ (vLLM); small GGUFs (llama.cpp)	0.50–1.20
`24gb-cpumoe`	same, + 96 GB RAM floor	Big MoE GGUFs (e.g. 80B/3B-active at MXFP4) with experts in CPU RAM	0.80–1.20
`80gb`	A100-80GB / H100	Large GGUFs up to ~50 GB pure-GPU	1.40–4.50
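
To see how the wall-clock caps bound the damage, multiply an hourly rate by the preset's MAX_RUNTIME_MINUTES. The sketch below uses the upper ends of the price ranges in the table with the defaults stated earlier (240 minutes on sky.yaml, 60 on the 80 GB preset); `worst_case_cost` is a hypothetical helper, and real prices vary by GPU and availability.

```python
def worst_case_cost(rate_per_hour: float, max_runtime_minutes: int) -> float:
    """Most a single launch can cost if only the wall-clock cap stops it."""
    return rate_per_hour * max_runtime_minutes / 60


# Upper ends of the $/hr ranges in the table, with each preset's default cap.
print(f"24gb: ${worst_case_cost(1.20, 240):.2f}")
print(f"80gb: ${worst_case_cost(4.50, 60):.2f}")
```

With these inputs both tiers top out at under $5 per launch, which is the effect the shorter 80 GB cap is meant to achieve.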

Multi-node (8+ GPUs across boxes) is out of scope — rarely needed since even 405B models fit on a single 4× or 8× H100 box.

Multi-provider (unlock if you want)

v1 targets RunPod because it's simplest. To have SkyPilot pick the cheapest GPU across providers:

Run sky check for each provider you want (aws, gcp, lambda, vast, etc.) — fill in creds as prompted.

Edit the preset YAML your catalog entry resolves to (e.g. sky.yaml):

resources:
  # remove: cloud: runpod
  accelerators: {RTX4090:1, RTX3090:1, L4:1, A10:1, A10G:1, L40S:1}

SkyPilot will try providers in cheapest-first order.

Migrating to FRP (v2)

The Cloudflare Tunnel in v1 terminates TLS at Cloudflare's edge — CF has the plaintext of every request. For an LLM API where the prompts are the sensitive content, that may not be what you want long-term.

The migration path is intentionally small:

Stand up a $5/mo VPS (Hetzner, Vultr, Oracle Free Tier) with a public IP.
Install frps on the VPS and caddy in front of it. Use caddy/Caddyfile.placeholder as a starting point.
In sky.yaml, swap the cloudflared docker block for frpc pointing at your VPS.
In Cloudflare DNS, change llm.yourdomain.com from the tunnel CNAME to an A-record pointing at your VPS's IP.
Clients change nothing. Same URL, same API key, same everything.

This is the reason we used a stable hostname from day one.

Privacy note

Even with FRP, your VPS provider can see plaintext traffic unless you also arrange end-to-end TLS (e.g. by having frpc speak HTTPS to a self-signed cert on the origin and letting Caddy act as a pure TCP pass-through). For the threat model "I don't want Cloudflare Inc. reading my prompts" the FRP swap is sufficient. For the threat model "I don't want my VPS provider reading my prompts either," pick a VPS provider you trust and/or do E2E.

Tailscale (built on WireGuard) is the only configuration in this repo's design space that's end-to-end encrypted by architecture, but it requires every client device to run the Tailscale daemon, which is why it wasn't picked here.

Layout

Two pixi workspaces, kept deliberately separate:

Root (pixi.toml + pixi.lock) — the cli env, used locally. No CUDA. This is what pixi install / pixi run skyllm / pixi run validate use.
pod/pixi.toml + pod/pixi.lock — the vllm + llamacpp envs that run on RunPod. Nothing else from this repo is ever uploaded to the pod; each sky YAML's file_mounts: allowlist rsyncs only pod/pixi.toml, pod/pixi.lock, and scripts/idle-watch.sh. This prevents accidental secret leakage (stray files, .env, scratch work) from ever riding up with the workdir.

skyllm/
├── .env.example              # secrets + infra knobs (no model identity)
├── .gitignore
├── README.md                 # you are here
├── pyproject.toml            # skyllm package + `skyllm` entry point
├── pixi.toml / pixi.lock     # LOCAL — cli env (default)
├── pod/
│   ├── pixi.toml             # POD — vllm + llamacpp envs
│   └── pixi.lock
├── sky/                          # SkyPilot preset YAMLs (one per (engine, tier))
│   ├── sky.yaml                  # vLLM, 24 GB tier
│   ├── sky-llamacpp.yaml         # llama.cpp, 24 GB tier (small GGUFs)
│   ├── sky-llamacpp-cpumoe.yaml  # llama.cpp, 24 GB + CPU-offloaded MoE experts
│   └── sky-llamacpp-80gb.yaml    # llama.cpp, 80 GB pure-GPU (A100-80GB / H100)
├── skyllm/                       # CLI + catalog schema
│   ├── cli.py                    # list / up / down / status / logs / health / cost / budget
│   ├── schema.py                 # pydantic ModelSpec
│   └── validate.py               # `pixi run validate`
├── models/                       # model catalog — one dir per entry
│   ├── qwen-0.5b/model.yaml                 # vLLM, 24gb (default stack-test)
│   ├── qwen3.6-27b/model.yaml               # llama.cpp, 24gb (dense 27B Q4_K_M)
│   ├── qwen3-coder-next/model.yaml          # llama.cpp, 24gb-cpumoe route
│   └── qwen3-coder-next-80gb/model.yaml     # llama.cpp, 80gb pure-GPU route
├── docs/
│   ├── alternatives.md       # why not SkyServe / dstack
│   ├── landscape.md          # commercial / open-source competitors
│   ├── pixi.md               # pixi env shape + RunPod lessons
│   ├── roadmap/              # phased plan (pixi → catalog → CLI → multi-provider)
│   └── toc.md                # repo tour
├── scripts/
│   ├── idle-watch.sh         # exits the run block when the engine is idle
│   └── budget-check.sh       # cron-able spend guard
└── caddy/
    └── Caddyfile.placeholder # v2 FRP migration stub

Alternatives considered

Before writing this scaffold I evaluated SkyPilot SkyServe, dstack, and an existing reference implementation (Borjagodoy/gpt-oss-runpod-on-demand). None fit cleanly — write-up at docs/alternatives.md. TL;DR: SkyServe has a $6/mo controller floor and cold-start 503s; dstack doesn't support RunPod and needs ~$11–20/mo of always-on infra. Revisit if dstack adds RunPod, or if you start needing real concurrent-user burst handling (SkyServe becomes attractive then).

For the commercial landscape — Ollama Cloud, HuggingFace Inference Endpoints, Modal, Baseten, Together / Fireworks / Groq, etc. — see docs/landscape.md. Short version: category 1 (managed APIs) genuinely wins for low-volume hobbyist use; category 2 (HF Endpoints, Modal) is the closest peer and wins for ops polish; this repo wins when you care about reproducibility, region transparency, and not being locked into a vendor's control plane.

License

MIT.
