RunInfra is the fastest way to ship open-source AI models as production APIs. Describe the endpoint you need in plain English and RunInfra picks the model, benchmarks real GPUs, applies kernel optimizations, and deploys an OpenAI-compatible HTTP endpoint.

Documentation Index
Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
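As a minimal sketch, the index can be fetched and scanned for page links programmatically. This assumes the file follows the common llms.txt convention of markdown-style links ("[Title](url)"); the `fetch_index` and `parse_index` helper names are illustrative, not part of any RunInfra SDK.

```python
import re
import urllib.request

# Matches markdown links of the form [Title](https://...).
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)]+)\)")

def parse_index(text: str) -> list[tuple[str, str]]:
    """Return (title, url) pairs for every markdown link in the index."""
    return LINK_RE.findall(text)

def fetch_index(url: str = "https://runinfra.ai/docs/llms.txt") -> str:
    """Download the raw llms.txt index."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# Example (requires network access):
#   for title, url in parse_index(fetch_index()):
#       print(f"{title}: {url}")
```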
Get started in minutes
Start fast with a chat prompt
Describe your use case in Pipes. The agent builds, optimizes, and deploys in one flow.
Optimize on real GPUs
Profile across L4 to B200. Search AWQ, GPTQ, FP8 variants. Apply Forge kernels.
Deploy with one click
Choose Flex (scale-to-zero) or Active (always-on). Cold starts under 2 seconds.
What you can build
Low-latency chatbots
Sub-200 ms P99 latency on Llama, Qwen, Mistral, and Phi.
Migrate from OpenAI
Drop-in replacement. Just change the base URL.
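A minimal sketch of the migration: the request body and headers are the standard OpenAI chat completions format, and only the base URL changes. The endpoint URL, API key, and model name below are placeholders for the values shown on your RunInfra deploy page.

```python
import json
import urllib.request

def chat_request(base_url: str, api_key: str, model: str,
                 messages: list[dict]) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completions request against any base URL."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request(
    "https://YOUR-ENDPOINT.runinfra.ai/v1",   # placeholder endpoint URL
    "YOUR_RUNINFRA_API_KEY",                  # placeholder key
    "llama-3.1-8b-instruct",                  # placeholder model name
    [{"role": "user", "content": "Hello!"}],
)
# urllib.request.urlopen(req) would send it; omitted here.
```

The same swap works with the official OpenAI SDKs, which accept a custom base URL at client construction.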
Multi-model routing
Route easy queries to a cheap small model and hard ones to a large one.
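The routing idea can be sketched as a toy cost-based dispatcher: short, simple queries go to a small model and everything else to a large one. The model names and the length heuristic are illustrative assumptions, not RunInfra APIs.

```python
SMALL_MODEL = "qwen-2.5-3b-instruct"    # placeholder cheap model
LARGE_MODEL = "llama-3.1-70b-instruct"  # placeholder strong model

def route(query: str, max_easy_words: int = 20) -> str:
    """Pick a model: short single-line queries go to the small model."""
    is_easy = len(query.split()) <= max_easy_words and "\n" not in query
    return SMALL_MODEL if is_easy else LARGE_MODEL
```

In production the heuristic would typically be a classifier or an embedding-distance check rather than word count, but the dispatch shape is the same.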
Speech to text
Whisper large, turbo, and distilled variants.
Text to speech
XTTS and Bark for expressive, multilingual voice.
Batch summarizers
Throughput-tuned pipelines with per-token cost control.
Resources and help
Which model should I use?
Pick the right model for your use case.
Example prompts
Copy-ready prompts for every pipeline shape.
API reference
Complete OpenAI-compatible HTTP API.
Plans and pricing
Compare Starter, Pro, Team, and Enterprise.
Troubleshooting
Fix 4xx, 5xx, cold starts, and deploy failures.
Talk to sales
Volume pricing, SLAs, and SOC 2 or HIPAA compliance.