A curated list of tools, frameworks, platforms, and resources for Large Language Model Operations (LLMOps) — enabling production-ready, scalable, and reliable LLM applications.
LLMOps is the emerging practice of managing the lifecycle of large language models, including fine-tuning, deployment, monitoring, evaluation, versioning, and observability — similar to MLOps but optimized for LLMs and generative AI systems.
## Contents

- [Overview & Learning](#overview--learning)
- [Model Training & Fine-Tuning](#model-training--fine-tuning)
- [Evaluation & Benchmarking](#evaluation--benchmarking)
- [Serving & Inference](#serving--inference)
- [Monitoring & Observability](#monitoring--observability)
- [Prompt Engineering & Management](#prompt-engineering--management)
- [Data Management](#data-management)
- [Security & Safety](#security--safety)
- [Platforms & Frameworks](#platforms--frameworks)
- [Tooling Ecosystem](#tooling-ecosystem)
- [Related Awesome Lists](#related-awesome-lists)
## Overview & Learning

- LLMOps Guide (Weights & Biases) – High-level overview of LLMOps concepts and tools.
- LLMOps Field Guide (Fiddler) – A breakdown of the infrastructure stack for LLMOps.
- LangChain Cookbook – Recipes for building with LangChain and LLMs.
- Full Stack Deep Learning – Practical course covering the LLM lifecycle, from training to deployment.
## Model Training & Fine-Tuning

- Hugging Face Transformers – Leading library for pre-trained and fine-tunable LLMs.
- PEFT – Parameter-Efficient Fine-Tuning methods for LLMs.
- LoRA – Low-Rank Adaptation, a lightweight fine-tuning technique for large models (see the sketch after this list).
- Colossal-AI – Framework for efficient distributed LLM training.
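PEFT and LoRA pair naturally: LoRA injects small trainable low-rank matrices into a frozen base model, cutting the number of trainable parameters by orders of magnitude. A minimal sketch, assuming `transformers` and `peft` are installed; the model name and hyperparameters are illustrative only:

```python
# Minimal parameter-efficient fine-tuning setup with PEFT + LoRA.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # small model, for illustration

# Wrap the frozen base model with LoRA adapters: only the low-rank matrices train.
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```

The wrapped model drops into a standard `transformers` training loop or `Trainer`; only the adapter weights need to be saved and versioned, which keeps fine-tuned variants cheap to store and deploy.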
## Evaluation & Benchmarking

- Open LLM Leaderboard – Benchmarking open LLMs.
- HELM – Stanford's Holistic Evaluation of Language Models framework for evaluating LLMs across scenarios.
- LM Evaluation Harness – EleutherAI's test harness for benchmarking LLMs on standard tasks (see the sketch after this list).
- TruLens – LLM observability and feedback tracking.
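A minimal sketch of running a benchmark programmatically with lm-evaluation-harness, assuming the `lm_eval` package (v0.4+ API) is installed; the model, task, and sample limit are placeholders for a quick smoke test:

```python
# Run a registered benchmark task against a Hugging Face model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face model backend
    model_args="pretrained=gpt2",  # small model, illustrative only
    tasks=["hellaswag"],           # any registered task name works here
    limit=100,                     # subsample for a fast sanity check
)
print(results["results"]["hellaswag"])  # per-task metrics, e.g. accuracy
```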
## Serving & Inference

- vLLM – Fast, memory-efficient LLM inference built on PagedAttention and continuous batching (see the sketch after this list).
- TGI (Text Generation Inference) – High-performance inference server by Hugging Face.
- DeepSpeed MII – Low-latency inference for Hugging Face models.
- Ray Serve – Scalable model serving via Ray.
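A minimal offline-inference sketch with vLLM; the model name is illustrative and any Hugging Face-compatible causal LM works:

```python
# Batched text generation with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small model, for illustration
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# vLLM schedules these prompts together via continuous batching.
prompts = ["The capital of France is", "LLMOps is the practice of"]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

For production serving, the same engine is exposed as an OpenAI-compatible HTTP server, so existing client code can usually point at a vLLM deployment unchanged.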
## Monitoring & Observability

- PromptLayer – Log, monitor, and manage prompts across LLM providers (a hand-rolled telemetry sketch follows this list).
- Arize AI – LLM monitoring, evaluation, and prompt tracing.
- Opik – Open-source LLM observability, evaluation, and tracing platform.
- WhyLabs – Observability for ML and LLM deployments.
- TruLens – Feedback loop framework for evaluating and improving LLM apps.
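To make concrete what these tools capture, here is a hypothetical, hand-rolled sketch (not any vendor's API) of wrapping an LLM call to record the kind of telemetry that platforms like PromptLayer or Opik collect automatically:

```python
# Hand-rolled observability wrapper: logs latency, inputs, and outputs per call.
import functools
import time

def traced(fn):
    """Capture basic telemetry for each LLM call."""
    @functools.wraps(fn)
    def wrapper(prompt: str, **kwargs):
        start = time.perf_counter()
        response = fn(prompt, **kwargs)
        latency_ms = (time.perf_counter() - start) * 1000
        # In production, ship this record to an observability backend
        # instead of printing it.
        print({"prompt": prompt, "response": response,
               "latency_ms": round(latency_ms, 1)})
        return response
    return wrapper

@traced
def call_llm(prompt: str) -> str:
    return "stubbed model output"  # replace with a real provider call

call_llm("Summarize LLMOps in one sentence.")
```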
## Prompt Engineering & Management

- LangChain – Modular framework for chaining LLM calls and prompts (see the templating sketch after this list).
- Prompt Engineering Guide – Structured guide to writing effective prompts.
- promptfoo – Compare, test, and evaluate LLM prompts easily.
- Guidance – Prompt programming with structured control over model output.
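A minimal prompt-templating sketch using LangChain's core prompt API, assuming `langchain-core` is installed; the prompt text and variables are made up for illustration:

```python
# Templates keep prompts versionable and testable, separate from app code.
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise technical assistant."),
    ("user", "Summarize the following text in {n_sentences} sentences:\n{text}"),
])

# Render the template into concrete chat messages for any provider.
messages = prompt.format_messages(n_sentences=2, text="LLMOps covers ...")
for m in messages:
    print(m.type, ":", m.content)
```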
## Data Management

- Label Studio – Open-source data labeling for fine-tuning and RAG pipelines.
- Weaviate – Vector database for semantic search and hybrid retrieval.
- Pinecone – Managed vector DB for similarity search and retrieval-augmented generation.
- ChromaDB – Open-source embedding database built for LLM apps (see the sketch after this list).
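A minimal sketch of storing and querying documents with ChromaDB, using its default bundled embedding function; the collection name and documents are made up:

```python
# Store documents and run a semantic search with ChromaDB.
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient for disk storage
collection = client.create_collection("docs")

collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "vLLM serves models with continuous batching.",
        "LoRA fine-tunes large models with low-rank adapters.",
    ],
)

# The query text is embedded and matched against stored documents.
results = collection.query(query_texts=["How do I serve an LLM?"], n_results=1)
print(results["documents"])
```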
## Security & Safety

- Guardrails AI – Framework for validating and correcting LLM outputs.
- Rebuff – Open-source framework for prompt injection defense.
- Giskard – Testing, debugging, and securing LLM applications.
- OpenAI Moderation API – API for detecting harmful or unsafe content (see the sketch after this list).
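A minimal sketch of screening text with the OpenAI Moderation API, assuming the `openai` Python SDK (v1+) is installed and `OPENAI_API_KEY` is set; the model name reflects current documentation and may change:

```python
# Flag unsafe content before it reaches (or leaves) your LLM.
from openai import OpenAI

client = OpenAI()
result = client.moderations.create(
    model="omni-moderation-latest",  # current default per OpenAI docs
    input="User-supplied text to screen.",
)

record = result.results[0]
print("flagged:", record.flagged)  # True if any safety category triggered
print(record.categories)           # per-category booleans
```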
## Platforms & Frameworks

- LangChain – Framework for building end-to-end LLM-powered applications.
- LlamaIndex – Data framework for connecting external data sources to LLMs via indexing and retrieval (see the sketch after this list).
- Haystack – Open-source framework for building retrieval-augmented generation (RAG) pipelines.
- FastChat – Open platform for serving and fine-tuning chat LLMs.
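A minimal RAG sketch with LlamaIndex, assuming `llama-index` (v0.10+) is installed and an embedding/LLM provider (by default, an OpenAI API key) is configured; the data directory and query are placeholders:

```python
# Index a directory of documents and answer questions over them.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data/").load_data()  # your source files
index = VectorStoreIndex.from_documents(documents)      # embed and index

query_engine = index.as_query_engine()  # retrieval + synthesis pipeline
print(query_engine.query("What does this corpus say about deployment?"))
```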
## Tooling Ecosystem

- Weights & Biases – Track and visualize model training and performance.
- MLflow – Platform for managing the ML lifecycle, including experiment tracking and a model registry (see the sketch after this list).
- PromptLayer – Middleware for logging and versioning prompt inputs and outputs.
- OpenLLM – Open-source platform to deploy and manage LLMs in production.
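A minimal experiment-tracking sketch with MLflow; the experiment name, parameters, and metric values are illustrative only:

```python
# Log fine-tuning parameters and metrics to a local MLflow tracking store.
import mlflow

mlflow.set_experiment("llm-finetune-demo")

with mlflow.start_run():
    mlflow.log_param("base_model", "gpt2")
    mlflow.log_param("lora_rank", 8)
    mlflow.log_metric("eval_loss", 2.31)
    mlflow.log_metric("eval_loss", 2.07, step=1)  # metrics can be logged per step
```

Run `mlflow ui` afterwards to browse runs, compare hyperparameters, and promote models through the registry.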
## Contributing

Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement. Pull requests that do not adhere to the contribution guidelines may be closed.