
Documentation Index

Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

RunInfra is the fastest way to ship open-source AI models as production APIs. Describe the endpoint you need in plain English and RunInfra picks the model, benchmarks real GPUs, applies kernel optimizations, and deploys an OpenAI-compatible HTTP endpoint.

Get started in minutes

Start fast with a chat prompt

Describe your use case in Pipes. The agent builds, optimizes, and deploys in one flow.

Optimize on real GPUs

Profile across GPUs from L4 to B200. Search AWQ, GPTQ, and FP8 quantization variants. Apply Forge kernels.

Deploy with one click

Choose Flex (scale-to-zero) or Active (always-on). Cold starts complete in under 2 seconds.
Not sure where to start? Pick a model from the model catalog, then choose Flex to prototype and move to Active for production traffic. Need help tuning a workload? Talk to our team.

What you can build

Low-latency chatbots

Sub-200 ms P99 latency on Llama, Qwen, Mistral, and Phi.

Migrate from OpenAI

Drop-in replacement. Just change the base URL.
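Because the API is OpenAI-compatible, only the base URL (and usually the model name) changes in your client code. A minimal sketch of the idea, with a hypothetical base URL, key, and model name (check your dashboard and the model catalog for the real values):

```python
import json

# Hypothetical base URLs -- your real RunInfra URL comes from your dashboard.
OPENAI_BASE_URL = "https://api.openai.com/v1"
RUNINFRA_BASE_URL = "https://api.runinfra.ai/v1"


def build_chat_request(base_url: str, api_key: str, model: str, messages: list):
    """Build an OpenAI-compatible chat completion request.

    Returns (url, headers, body). The request shape is identical for both
    providers; migrating means swapping the base URL and model name.
    """
    url = f"{base_url}/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages})
    return url, headers, body


# Same request, pointed at RunInfra instead of OpenAI:
msgs = [{"role": "user", "content": "Hello"}]
url, headers, body = build_chat_request(
    RUNINFRA_BASE_URL, "ri-demo-key", "llama-3.1-8b-instruct", msgs
)
```

The same swap works with any OpenAI SDK that accepts a custom base URL: point it at your RunInfra endpoint and keep the rest of your code unchanged.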

Multi-model routing

Route easy queries to a cheap small model and hard ones to a large one.
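One simple way to implement this split is a client-side heuristic that picks a model per request. A sketch with hypothetical model names and thresholds (tune both against your own traffic):

```python
# Hypothetical model names -- substitute deployments from your catalog.
SMALL_MODEL = "qwen2.5-7b-instruct"
LARGE_MODEL = "llama-3.1-70b-instruct"

# Keywords that suggest a reasoning-heavy request.
HARD_HINTS = ("prove", "derive", "step by step", "analyze", "refactor")


def route(prompt: str, max_easy_len: int = 280) -> str:
    """Pick a model per request: short, simple prompts go to the cheap
    small model; long or reasoning-heavy prompts go to the large one."""
    text = prompt.lower()
    if len(prompt) > max_easy_len or any(hint in text for hint in HARD_HINTS):
        return LARGE_MODEL
    return SMALL_MODEL
```

Since both endpoints speak the same API, the router only has to return a model name; the rest of the request code stays shared.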

Speech to text

Whisper large, turbo, and distilled variants.

Text to speech

XTTS and Bark for expressive, multilingual voice.

Batch summarizers

Throughput-tuned pipelines with per-token cost control.
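Per-token cost control starts with estimating spend from token counts before a batch runs. A sketch with entirely hypothetical tiers and rates (see the pricing page for real numbers):

```python
# Hypothetical USD rates per 1M tokens -- real rates are on the pricing page.
RATES = {
    "small": {"input": 0.10, "output": 0.30},
    "large": {"input": 1.00, "output": 3.00},
}


def batch_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a batch run from token counts and tier rates."""
    rate = RATES[tier]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000


# Example: summarizing 2M input tokens into 500k output tokens on the small tier.
cost = batch_cost("small", input_tokens=2_000_000, output_tokens=500_000)
```

An estimate like this can gate batch jobs (skip or downsize runs that exceed a budget) before any GPU time is spent.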

Resources and help

Which model should I use?

Pick the right model for your use case.

Example prompts

Copy-ready prompts for every pipeline shape.

API reference

Complete OpenAI-compatible HTTP API.

Plans and pricing

Compare Starter, Pro, Team, and Enterprise.

Troubleshooting

Fix 4xx, 5xx, cold starts, and deploy failures.

Talk to sales

Volume pricing, SLAs, and SOC 2 or HIPAA compliance.