# RunInfra

> Plain English to production AI inference endpoints. RunInfra selects models, benchmarks GPUs, applies kernel optimizations, and deploys OpenAI-compatible APIs.

## Docs

- [AI onboarding prompt](https://runinfra.ai/docs/ai-onboarding/prompt-block.md): Copy-paste prompt for any LLM. Teaches your AI assistant how to ship code against RunInfra correctly.
- [Audio](https://runinfra.ai/docs/api-reference/audio.md): POST /v1/audio/speech and /v1/audio/transcriptions — text-to-speech and speech-to-text.
- [Authentication](https://runinfra.ai/docs/api-reference/authentication.md): API key scopes, creation, rotation, and expiration for the RunInfra inference API.
- [Chat completions](https://runinfra.ai/docs/api-reference/chat-completions.md): POST /v1/chat/completions — OpenAI-compatible chat with streaming, tools, and structured output.
- [Embeddings](https://runinfra.ai/docs/api-reference/embeddings.md): POST /v1/embeddings — vector embeddings for semantic search, RAG, and clustering.
- [Error codes](https://runinfra.ai/docs/api-reference/errors.md): HTTP status codes RunInfra returns, what causes them, and how to recover.
- [API reference](https://runinfra.ai/docs/api-reference/introduction.md): OpenAI-compatible inference API. Set base_url once, reach any deployed model with a workspace-scoped key.
- [Models](https://runinfra.ai/docs/api-reference/models.md): GET /v1/models — list every model deployed in your workspace.
- [Rate limits](https://runinfra.ai/docs/api-reference/rate-limits.md): Per-key request budgets, response headers, 429 behavior, and how to raise limits.
- [Changelog](https://runinfra.ai/docs/changelog.md): A full record of RunInfra releases, feature launches, and platform updates, newest entries first, with a look at what is coming next.
- [Cookbook](https://runinfra.ai/docs/cookbook/overview.md): Copy-paste recipes for the most common RunInfra inference patterns. Every recipe runs out of the box with a free Starter key.
- [Retrieval-augmented generation](https://runinfra.ai/docs/cookbook/rag.md): Embed, retrieve, generate. A complete RAG loop in 30 lines using two RunInfra pipelines.
- [Streaming responses](https://runinfra.ai/docs/cookbook/streaming.md): Token-by-token responses with the OpenAI SDK. Server-sent events over HTTPS, same format as OpenAI.
- [Structured output](https://runinfra.ai/docs/cookbook/structured-output.md): Guaranteed-parseable JSON responses via JSON Schema. Works with every RunInfra model that supports tool calling.
- [Tool calling](https://runinfra.ai/docs/cookbook/tool-calling.md): Function calling with typed arguments. Model picks a tool, you run it, feed the result back. Multi-turn loop.
- [Autoscaling](https://runinfra.ai/docs/deployments/autoscaling.md): How RunInfra replicas scale up and down in response to traffic. Flex scale-to-zero or Active always-on.
- [Instant Start](https://runinfra.ai/docs/deployments/instant-start.md): RunInfra's weight-caching layer that keeps cold starts fast even on scale-to-zero deployments.
- [Deployments overview](https://runinfra.ai/docs/deployments/overview.md): Deploy any optimized RunInfra pipeline as an OpenAI-compatible production API. Two modes, sub-2s cold starts, per-token billing.
- [Speculative decoding](https://runinfra.ai/docs/deployments/speculation.md): A small draft model proposes tokens, the target model verifies them in a single pass. Higher throughput with no quality change.
- [Account and access](https://runinfra.ai/docs/faq/account.md): FAQ about sign up, API keys, workspaces, seats, and dashboard access.
- [Billing](https://runinfra.ai/docs/faq/billing.md): FAQ about token pricing, optimization sessions, invoices, overage, and credits.
- [Infrastructure](https://runinfra.ai/docs/faq/infrastructure.md): FAQ about GPUs, regions, data residency, uptime, and security.
- [Models and inference](https://runinfra.ai/docs/faq/models-inference.md): FAQ about supported models, quantization, context windows, streaming, tool calling, and fine-tuning.
- [GPUs and pricing](https://runinfra.ai/docs/features/gpu-pricing.md): RunInfra bills per million tokens, not per GPU hour. Understand how GPU selection, deployment mode, and model size affect your inference cost.
- [Image generation](https://runinfra.ai/docs/features/image-generation.md): Text-to-image inference on RunInfra: FLUX, SDXL, and Stable Diffusion 3.5 served through a Diffusers FastAPI runtime with torchao FP8 + torch.compile on Ada / Hopper / Blackwell GPUs.
- [Models](https://runinfra.ai/docs/features/models.md): RunInfra supports thousands of LLMs, embeddings, vision-language, speech-to-text, and text-to-speech models from Hugging Face, with custom model upload available on Team plan.
- [Monitoring](https://runinfra.ai/docs/features/monitoring.md): Track requests, latency percentiles, throughput, token usage, and cost across all your RunInfra endpoints from a single real-time dashboard.
- [Optimization](https://runinfra.ai/docs/features/optimization.md): GPU profiling, quantized-variant search, Forge kernels, and speculation. The RunInfra optimizer picks the right configuration so you don't have to.
- [Build with RunInfra](https://runinfra.ai/docs/index.md): Plain English to production AI endpoints.
- [LangChain](https://runinfra.ai/docs/integrations/langchain.md): Use RunInfra as the LLM provider for any LangChain application. One-line config change.
- [LlamaIndex](https://runinfra.ai/docs/integrations/llamaindex.md): Use RunInfra as the LLM and embedding provider in any LlamaIndex pipeline.
- [Using RunInfra with other libraries](https://runinfra.ai/docs/integrations/overview.md): RunInfra's OpenAI-compatible HTTP API works with any library that speaks OpenAI. The fastest path for the most common frameworks.
- [Vercel AI SDK](https://runinfra.ai/docs/integrations/vercel-ai-sdk.md): Use RunInfra with the Vercel AI SDK. Works with Next.js, SvelteKit, Nuxt, and Remix.
- [Which model should I use?](https://runinfra.ai/docs/introduction/model-picker.md): Pick the right model for your use case. Decision table by task, size, and performance priority.
- [Plans and pricing](https://runinfra.ai/docs/introduction/plans.md): Compare Starter, Pro, Team, and Enterprise plans, including optimization sessions, token pricing, rollover rules, and overage costs.
- [Quickstart](https://runinfra.ai/docs/introduction/quickstart.md): Create an account, describe your pipeline, optimize it, and deploy a live OpenAI-compatible inference endpoint, all without writing infrastructure code.
- [What is RunInfra?](https://runinfra.ai/docs/introduction/welcome.md): RunInfra turns plain English into production AI inference endpoints. Describe your use case and the AI agent builds, optimizes, and deploys it for you.
- [Prompting best practices](https://runinfra.ai/docs/prompting/best-practices.md): Learn what to include in every RunInfra prompt so the agent builds the right pipeline the first time, without back-and-forth clarification.
- [Debugging](https://runinfra.ai/docs/prompting/debugging.md): Fix common RunInfra issues: wrong model selection, poor optimization results, slow cold starts, and failed deployments, with direct corrective prompts.
- [Example prompts](https://runinfra.ai/docs/prompting/example-prompts.md): Copy-ready prompts for chatbots, summarizers, code generation, multilingual APIs, and more, with notes on what the RunInfra agent builds for each.
- [Glossary](https://runinfra.ai/docs/reference/glossary.md): RunInfra domain terms in one page. GPU, quantization, serving, and agent vocabulary.
- [Idea to pipeline](https://runinfra.ai/docs/tips/from-idea-to-pipeline.md): Walk through every step of building, optimizing, deploying, and integrating a RunInfra AI pipeline, from blank page to production endpoint.
- [Troubleshooting](https://runinfra.ai/docs/tips/troubleshooting.md): Fix common issues with RunInfra pipeline building, optimization, deployment, and API integration, organized by category for fast diagnosis.
- [OpenAI compatibility](https://runinfra.ai/docs/tools-sdks/openai-compatibility.md): RunInfra exposes an OpenAI-shaped HTTP API for the endpoints it supports today. Point any OpenAI SDK at a RunInfra deployment and it works.

## OpenAPI Specs

- [openapi](https://runinfra.ai/docs/api-reference/openapi.json)
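The docs above describe an OpenAI-shaped HTTP API: set `base_url` once and call `POST /v1/chat/completions` with a workspace-scoped key. As a minimal sketch of that request shape, using only the Python standard library. The host `api.runinfra.ai` and the model name are placeholders assumed for illustration, not values stated in this index; check the API reference for the real ones.

```python
# Build (but do not send) a chat-completions request against a RunInfra
# deployment's OpenAI-shaped API, using only the standard library.
import json
import urllib.request

BASE_URL = "https://api.runinfra.ai/v1"  # assumed host; see the API reference
API_KEY = "YOUR_RUNINFRA_KEY"

# Same payload schema as OpenAI's chat completions endpoint.
payload = {
    "model": "your-deployed-model",  # placeholder deployment name
    "messages": [{"role": "user", "content": "Hello"}],
}

request = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)

# urllib.request.urlopen(request) would send it. Because the response follows
# the OpenAI schema, the reply text sits at choices[0].message.content, which
# is also why any OpenAI SDK works when pointed at the same base_url.
```

Because the wire format matches OpenAI's, the same call works through the official OpenAI SDKs by passing the RunInfra `base_url` and key to the client constructor instead of building the request by hand.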