Deployments

A deployment turns an optimized pipeline into a live HTTP endpoint. You get a URL, a scoped API key, autoscaling replicas, Instant Start cold-start caching, and OpenAI-compatible semantics.
Deployment requires the Pro plan ($49/mo) or higher. Starter lets you build, optimize, and test in the playground without deploying.

Request lifecycle

Every request hits the edge, authenticates against your API key, routes to a replica with headroom, and streams tokens back. Cold starts pull from Instant Start; warm requests skip it.
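
Concretely, the whole lifecycle sits behind one authenticated HTTPS call. A minimal sketch with requests, assuming the Bearer-token scheme standard for OpenAI-compatible APIs:

import requests

# One end-to-end request: TLS terminates at the edge, the key is checked,
# the call is routed to a replica with headroom, and the body comes back.
resp = requests.post(
    "https://api.runinfra.ai/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_RUNINFRA_API_KEY"},  # checked at the edge
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])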

Deploy

Say "Deploy this pipeline" in chat. The agent picks the winning variant, provisions a GPU, and returns the endpoint URL and key.

Deployment modes

Scale-to-zero. Replicas shed after 5 idle minutes and spin back up in under 2 seconds on the next request.
Property       Value
Cost           Per-token only; nothing when idle
Cold start     Under 2 s on RunInfra Cloud
Idle timeout   5 minutes of no traffic
Best for       Development, bursty traffic, cost-sensitive workloads

Calling the endpoint

OpenAI-compatible. Drop in any OpenAI SDK, change two lines:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",  # point the SDK at RunInfra
    api_key="YOUR_RUNINFRA_API_KEY",        # scoped key returned at deploy time
)

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
Works with LangChain, LlamaIndex, Vercel AI SDK, Instructor, and every other OpenAI-compatible library. See OpenAI compatibility for the full contract.
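
As an illustration, a minimal LangChain sketch via the langchain-openai package (the base URL, key placeholder, and model name mirror the snippet above):

from langchain_openai import ChatOpenAI

# Same endpoint and scoped key as the raw-SDK example.
llm = ChatOpenAI(
    model="default",
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)
print(llm.invoke("Hello").content)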

Capacity

Each replica serves up to 30 concurrent requests. Replica budgets by plan:
Plan         Max replicas   Max concurrent requests
Pro          8              240
Team         32             960
Enterprise   Custom         Custom
Exceeding the budget returns HTTP 429 with a Retry-After header. See Autoscaling to raise caps and tune concurrency.
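
A minimal client-side sketch that honors Retry-After on 429s; the retry policy here is an assumption for illustration, not part of the contract:

import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

def create_with_backoff(messages, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model="default", messages=messages)
        except RateLimitError as exc:
            # Prefer the server's Retry-After hint; fall back to exponential backoff.
            retry_after = exc.response.headers.get("Retry-After")
            time.sleep(float(retry_after) if retry_after else 2 ** attempt)
    raise RuntimeError("replica budget still exhausted after retries")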

Playground

Test from the Deploy tab before or after deploying. Send prompts, inspect token counts, compare variants, and measure latency without writing a line of code. After deployment, the playground targets the selected endpoint row, so latency and output come from the endpoint you are inspecting.
Plan      Playground requests
Starter   100/day
Pro+      Unlimited
The first playground request after an idle period pays a cold start (up to 2 s on Flex deployments). Subsequent requests are fast.
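
You can observe the same cold-vs-warm gap from code by timing two back-to-back requests against the endpoint; the helper below is illustrative, not part of the API:

import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

def timed_request():
    start = time.perf_counter()
    client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": "ping"}],
    )
    return time.perf_counter() - start

print(f"cold: {timed_request():.2f} s")  # first request after idle pays the cold start
print(f"warm: {timed_request():.2f} s")  # follow-up hits a warm replica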

Manage endpoints

From chat or Deployments:

Stop

Pause the endpoint. No charges while stopped.

Start

Resume from stopped. Cached weights make restarts fast.

Change GPU

Switch tier. The agent warns if re-optimization is recommended.

Known limitations

  • Request timeout is long but finite. For large max_tokens, stream the response so you don’t hit the timeout (see the streaming sketch after this list).
  • First deploy of a pipeline pays the full weight warm-up (minutes). Subsequent cold starts reuse Instant Start’s weight cache and are much faster.
  • Active mode requires Team plan or higher.
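
For the streaming case, a minimal sketch with the same client setup as above (the prompt and max_tokens value are illustrative):

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

# Tokens arrive as they are generated, so a long completion never has to
# fit inside the request timeout in one shot.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a detailed summary"}],
    max_tokens=4096,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)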

Common questions

Is my deployment live?
The Deployments dashboard shows provisioning, then transitions to active when the replica is serving. Watch logs from the deployment’s detail page to confirm the model finished loading.

Can I preview before deploying?
The playground under the Deploy tab is the preview. Send real prompts, inspect quality and latency, and compare variants before you commit to deploying an endpoint that serves external traffic.

How do I roll back to a previous variant?
From the pipeline page, pick any prior optimization variant and redeploy it. Weights are usually still cached from the earlier run, so the rollback is fast.

Next steps

Autoscaling

Replica budget, concurrency, Flex vs Active knobs.

Instant Start

Cold-start weight caching explained.

Speculation

Draft-model speculative decoding for throughput.

Monitoring

Observe latency, queue depth, and cost.