Deployments

A deployment turns an optimized pipeline into a live HTTP endpoint. You get a URL, a scoped API key, autoscaling replicas, Instant Start cold-start caching, and OpenAI-compatible semantics.
Deployment requires the Pro plan ($49/mo) or higher. Starter lets you build, optimize, and test in the playground without deploying.

Request lifecycle

Every request hits the edge, authenticates against your API key, routes to a replica with headroom, and streams tokens back. Cold starts pull from Instant Start; warm requests skip it.
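
Concretely, the whole lifecycle sits behind one authenticated HTTPS call. A minimal sketch with requests, assuming the Bearer-token scheme standard for OpenAI-compatible APIs:

import requests

# One end-to-end request: TLS terminates at the edge, the key is checked,
# the call is routed to a replica with headroom, and the body comes back.
resp = requests.post(
    "https://api.runinfra.ai/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_RUNINFRA_API_KEY"},  # checked at the edge
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])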

Deploy

Say "Deploy this pipeline" in chat. The agent picks the winning variant, provisions a GPU, and returns the endpoint URL and key.

Deployment modes

Scale-to-zero. Replicas shed after 5 idle minutes and spin back up in under 2 seconds on the next request.
Property       Value
Cost           Per-token only; nothing when idle
Cold start     Under 2 s on RunInfra Cloud
Idle timeout   5 minutes of no traffic
Best for       Development, bursty traffic, cost-sensitive workloads

Calling the endpoint

OpenAI-compatible. Drop in any OpenAI SDK, change two lines:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",  # point the SDK at RunInfra
    api_key="YOUR_RUNINFRA_API_KEY",        # scoped key returned at deploy time
)

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
Works with LangChain, LlamaIndex, Vercel AI SDK, Instructor, and every other OpenAI-compatible library. See OpenAI compatibility for the full contract.
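
As an illustration, a minimal LangChain sketch via the langchain-openai package (the base URL, key placeholder, and model name mirror the snippet above):

from langchain_openai import ChatOpenAI

# Same endpoint and scoped key as the raw-SDK example.
llm = ChatOpenAI(
    model="default",
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)
print(llm.invoke("Hello").content)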

Capacity

Each replica serves up to 30 concurrent requests. Replica budgets by plan:
Plan         Max replicas   Max concurrent requests
Pro          8              240
Team         32             960
Enterprise   Custom         Custom
Exceeding the budget returns HTTP 429 with a Retry-After header. See Autoscaling to raise caps and tune concurrency.
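
A minimal client-side sketch that honors Retry-After on 429s; the retry policy here is an assumption for illustration, not part of the contract:

import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

def create_with_backoff(messages, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model="default", messages=messages)
        except RateLimitError as exc:
            # Prefer the server's Retry-After hint; fall back to exponential backoff.
            retry_after = exc.response.headers.get("Retry-After")
            time.sleep(float(retry_after) if retry_after else 2 ** attempt)
    raise RuntimeError("replica budget still exhausted after retries")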

Playground

Test from the Deploy tab before or after deploying. Send prompts, inspect token counts, compare variants, and measure latency without writing a line of code. After deployment, the playground targets the selected endpoint row, so latency and output come from the endpoint you are inspecting.
Plan      Playground requests
Starter   100/day
Pro+      Unlimited
The first playground request after an idle period pays a cold start (up to 2 s on Flex deployments). Subsequent requests are fast.
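
You can observe the same cold-vs-warm gap from code by timing two back-to-back requests against the endpoint; the helper below is illustrative, not part of the API:

import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

def timed_request():
    start = time.perf_counter()
    client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": "ping"}],
    )
    return time.perf_counter() - start

print(f"cold: {timed_request():.2f} s")  # first request after idle pays the cold start
print(f"warm: {timed_request():.2f} s")  # follow-up hits a warm replica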

Manage endpoints

From chat or Deployments:

Stop

Pause the endpoint. No charges while stopped.

Start

Resume from stopped. Cached weights make restarts fast.

Change GPU

Switch tier. The agent warns if re-optimization is recommended.

Known limitations

  • Request timeout is long but finite. For large max_tokens, stream the response so you don’t hit the timeout (see the streaming sketch after this list).
  • First deploy of a pipeline pays the full weight warm-up (minutes). Subsequent cold starts reuse Instant Start’s weight cache and are much faster.
  • Active mode requires Team plan or higher.
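
For the streaming case, a minimal sketch with the same client setup as above (the prompt and max_tokens value are illustrative):

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

# Tokens arrive as they are generated, so a long completion never has to
# fit inside the request timeout in one shot.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a detailed summary"}],
    max_tokens=4096,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)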

Common questions

Is my deployment live?
The Deployments dashboard shows provisioning, then transitions to active when the replica is serving. Watch logs from the deployment’s detail page to confirm the model finished loading.

Can I preview before deploying?
The playground under the Deploy tab is the preview. Send real prompts, inspect quality and latency, and compare variants before you commit to deploying an endpoint that serves external traffic.

How do I roll back to a previous variant?
From the pipeline page, pick any prior optimization variant and redeploy it. Weights are usually still cached from the earlier run, so the rollback is fast.

Next steps

Autoscaling

Replica budget, concurrency, Flex vs Active knobs.

Instant Start

Cold-start weight caching explained.

Speculation

Draft-model speculative decoding for throughput.

Monitoring

Observe latency, queue depth, and cost.