A deployment turns an optimized pipeline into a live HTTP endpoint. You get a URL, a scoped API key, autoscaling replicas, Instant Start cold-start caching, and OpenAI-compatible semantics.
Deployment requires the Pro plan ($49/mo) or higher. Starter lets you build, optimize, and test in the playground without deploying.
Request lifecycle
Every request hits the edge, authenticates against your API key, routes to a replica with headroom, and streams tokens back. Cold starts pull from Instant Start; warm requests skip it.
Deploy
- From chat
- From the Deploy tab
Deployment modes
- Flex (Pro+)
- Active (Team+)
Flex is scale-to-zero: replicas shed after 5 idle minutes and spin back up in under 2 seconds on the next request.
| Property | Value |
|---|---|
| Cost | Per-token only, nothing when idle |
| Cold start | Under 2 s on RunInfra Cloud |
| Idle timeout | 5 minutes of no traffic |
| Best for | Development, bursty traffic, cost-sensitive workloads |
Calling the endpoint
OpenAI-compatible. Drop in any OpenAI SDK and change two lines: the base URL and the API key.
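A minimal sketch using the official OpenAI Python SDK; the base URL, key prefix, and model name below are illustrative placeholders, not confirmed RunInfra values:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",  # change 1: your endpoint URL (placeholder)
    api_key="ri_XXXX",                       # change 2: your scoped RunInfra API key (placeholder)
)

resp = client.chat.completions.create(
    model="my-pipeline",  # the deployed pipeline's name (placeholder)
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```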
Capacity
Each replica serves up to 30 concurrent requests. Replica budgets by plan (max concurrent requests = max replicas × 30):
| Plan | Max replicas | Max concurrent requests |
|---|---|---|
| Pro | 8 | 240 |
| Team | 32 | 960 |
| Enterprise | Custom | Custom |
Over-capacity requests receive a Retry-After header; back off and retry. See Autoscaling to raise caps and tune concurrency.
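A minimal backoff sketch, assuming over-capacity requests return HTTP 429 with a Retry-After header in seconds; the status code and helper are assumptions, so check your endpoint’s actual behavior:

```python
import time
import requests

def post_with_retry(url: str, headers: dict, body: dict, attempts: int = 5):
    """POST with backoff, honoring the server's Retry-After hint."""
    for _ in range(attempts):
        resp = requests.post(url, headers=headers, json=body, timeout=120)
        if resp.status_code != 429:  # assumption: 429 signals over capacity
            resp.raise_for_status()
            return resp.json()
        # Sleep for the server-suggested interval; fall back to 1 s if absent.
        time.sleep(float(resp.headers.get("Retry-After", 1)))
    raise RuntimeError("still throttled after retries")
```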
Playground
Test before or after deploying from the Deploy tab. Send prompts, inspect token counts, compare variants, and measure latency without writing a line of code. After deployment, the playground targets the selected endpoint row, so latency and output come from the endpoint you are inspecting.
| Plan | Playground requests |
|---|---|
| Starter | 100/day |
| Pro+ | Unlimited |
The first playground request after an idle period pays a cold start (up to 2 s on Flex deployments). Subsequent requests are fast.
Manage endpoints
From chat or Deployments:
Stop
Pause the endpoint. No charges while stopped.
Start
Resume from stopped. Cached weights make restarts fast.
Change GPU
Switch tier. The agent warns if re-optimization is recommended.
Known limitations
- Request timeout is long but finite. For large `max_tokens`, stream the response so you don’t hit the timeout (see the sketch after this list).
- First deploy of a pipeline pays the full weight warm-up (minutes). Subsequent cold starts reuse Instant Start’s weight cache and are much faster.
- Active mode requires Team plan or higher.
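To keep long generations inside the timeout, request a stream and consume tokens as they arrive. A sketch with the OpenAI Python SDK; the URL, key, and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.runinfra.ai/v1", api_key="ri_XXXX")  # placeholders

# Streaming returns tokens incrementally, so a large max_tokens run
# produces output long before the request timeout would be reached.
stream = client.chat.completions.create(
    model="my-pipeline",  # placeholder model name
    messages=[{"role": "user", "content": "Write a long report."}],
    max_tokens=8192,      # large budget: stream instead of waiting for the full body
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```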
Common questions
How do I know when the deployment is ready?
The Deployments dashboard shows provisioning and transitions to active when the replica is serving. You can watch logs from the deployment’s detail page to confirm the model finished loading.
Is there a dry-run or preview mode?
The playground under the Deploy tab is the preview. Send real prompts, inspect quality and latency, and compare variants before you commit to deploying an endpoint that serves external traffic.
How do I roll back to a previous variant?
From the pipeline page, pick any prior optimization variant and redeploy it. Weights are usually still cached from the earlier run, so the rollback is fast.
Next steps
Autoscaling
Replica budget, concurrency, Flex vs Active knobs.
Instant Start
Cold-start weight caching explained.
Speculation
Draft-model speculative decoding for throughput.
Monitoring
Observe latency, queue depth, and cost.