Infera is a production-grade model inference engine designed to deliver high throughput, low latency, and efficient resource utilization for large-scale language model serving. It combines optimized execution, intelligent scheduling, and memory-efficient inference to make deployment of modern LLMs scalable and cost-effective.
## High-Performance Inference
- Optimized autoregressive decoding
- Efficient batching and request coalescing (see the sketch after this list)
- Designed for both low-latency and high-throughput workloads
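Infera's batching internals aren't spelled out here, but as a rough illustration of what request coalescing can look like, the sketch below collects incoming requests into a batch until either a size cap or a short wait deadline is hit. All names here (`CoalescingQueue`, `max_batch_size`, `max_wait_ms`) are hypothetical, not Infera's API:

```python
import queue
import time
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    arrival: float = field(default_factory=time.monotonic)

class CoalescingQueue:
    """Collects requests and releases them as batches.

    A batch is released when it reaches `max_batch_size`, or when the
    oldest waiting request has been queued longer than `max_wait_ms`.
    """

    def __init__(self, max_batch_size: int = 8, max_wait_ms: float = 5.0):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000.0
        self._q: "queue.Queue[Request]" = queue.Queue()

    def submit(self, request: Request) -> None:
        self._q.put(request)

    def next_batch(self) -> list[Request]:
        batch = [self._q.get()]  # block until at least one request exists
        deadline = batch[0].arrival + self.max_wait_s
        while len(batch) < self.max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(self._q.get(timeout=remaining))
            except queue.Empty:
                break
        return batch
```

Each released batch can then be run through a single forward pass, amortizing weight loads and kernel launches across every request in it.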
## Intelligent Scheduling
- Fine-grained request scheduling
- Dynamic batching and token-level orchestration (illustrated below)
- Optimized for high-concurrency serving
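One common way token-level orchestration is realized is continuous batching: the running batch is recomposed at every decode step rather than once per request. The sketch below illustrates that idea under that assumption; `TokenLevelScheduler` and its fields are hypothetical, not Infera's actual scheduler:

```python
from dataclasses import dataclass, field

@dataclass
class Sequence:
    seq_id: int
    remaining_tokens: int                       # decode steps still to run
    generated: list[int] = field(default_factory=list)

class TokenLevelScheduler:
    """Continuous-batching sketch: the running set is rebuilt every
    decode step, so finished sequences free their slot immediately and
    waiting sequences join without the whole batch having to drain."""

    def __init__(self, max_running: int = 4):
        self.max_running = max_running
        self.waiting: list[Sequence] = []
        self.running: list[Sequence] = []

    def add(self, seq: Sequence) -> None:
        self.waiting.append(seq)

    def step(self) -> list[Sequence]:
        # Admit waiting sequences into any free slots.
        while self.waiting and len(self.running) < self.max_running:
            self.running.append(self.waiting.pop(0))
        # One decode step for every running sequence (stand-in for the
        # real batched forward pass).
        for seq in self.running:
            seq.generated.append(0)             # placeholder token id
            seq.remaining_tokens -= 1
        finished = [s for s in self.running if s.remaining_tokens == 0]
        self.running = [s for s in self.running if s.remaining_tokens > 0]
        return finished
```

Because slots are reclaimed per token, a finished sequence frees capacity immediately, which is what keeps utilization high under heavy concurrency.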
## Memory-Efficient Execution
- Advanced KV cache management (see the block-allocation sketch below)
- Reduced memory fragmentation
- Maximized GPU utilization
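The section doesn't specify Infera's cache layout, but block-based ("paged") allocation is a common way to curb KV cache fragmentation, so a hedged sketch of that technique follows; `BlockKVCache` and its parameters are illustrative only:

```python
class BlockKVCache:
    """Fixed-size block allocator for KV cache memory.

    Allocating the cache in uniform blocks, rather than one contiguous
    region per sequence, means freed blocks are immediately reusable
    by any sequence, which keeps fragmentation bounded.
    """

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                 # tokens per block
        self.free_blocks = list(range(num_blocks))
        self.block_table: dict[int, list[int]] = {}  # seq_id -> block ids
        self.seq_len: dict[int, int] = {}            # seq_id -> token count

    def append_token(self, seq_id: int) -> None:
        """Reserve cache space for one new token of `seq_id`."""
        length = self.seq_len.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt a sequence")
            self.block_table.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_len[seq_id] = length + 1

    def release(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)
```

Because every block is the same size, per-sequence over-allocation is bounded by a single partially filled block.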
## Production-Ready Architecture
- Modular and extensible engine design
- Clean separation of scheduler, executor, and runtime (sketched below)
- Designed for long-running services
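As a sketch of what that separation could look like at the interface level (the `Scheduler` and `Executor` protocols and the `Runtime` loop below are hypothetical, not Infera's actual classes):

```python
from typing import Protocol

class Scheduler(Protocol):
    """Decides which requests run in the next step."""
    def next_batch(self) -> list[str]: ...

class Executor(Protocol):
    """Runs one model step for a batch of requests."""
    def execute(self, batch: list[str]) -> list[str]: ...

class Runtime:
    """Long-running serving loop wiring the pieces together.

    Because the scheduler and executor are coupled only through these
    narrow interfaces, either can be swapped independently, e.g. a
    different batching policy or a different hardware backend, without
    touching the loop itself.
    """

    def __init__(self, scheduler: Scheduler, executor: Executor):
        self.scheduler = scheduler
        self.executor = executor

    def serve_forever(self) -> None:
        while True:
            batch = self.scheduler.next_batch()
            if batch:
                self.executor.execute(batch)
```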
## Flexible Model Support
- Native support for modern Transformer-based LLMs
- Easy integration with existing model ecosystems (see the adapter sketch below)
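How that integration surface looks in Infera isn't detailed here; as one hedged illustration, a thin adapter over a Hugging Face `transformers` causal LM might look like the following, where the `InferaModel` protocol and `HFCausalLM` adapter are assumptions made for the example:

```python
from typing import Protocol
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class InferaModel(Protocol):
    """Hypothetical minimal contract an engine could serve against."""
    def next_token(self, token_ids: list[int]) -> int: ...

class HFCausalLM:
    """Adapter exposing a transformers causal LM through that contract."""

    def __init__(self, name: str = "gpt2"):
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.model = AutoModelForCausalLM.from_pretrained(name).eval()

    @torch.no_grad()
    def next_token(self, token_ids: list[int]) -> int:
        ids = torch.tensor([token_ids])
        logits = self.model(input_ids=ids).logits    # (1, len, vocab)
        return int(logits[0, -1].argmax())           # greedy decode
```

An engine written against the narrow `next_token` contract can then serve any model family for which such an adapter exists.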