LLM Inference Load Balancer

A production-grade API service that routes AI inference requests across multiple providers, with load balancing, automatic failover, and Redis-based rate limiting. It was originally built for this project, but you can adapt it to your own application.

Load Balancer Overview

The system features two specialized load balancers:

1. Roleplay Load Balancer

Purpose: Handles conversational AI requests with multiple message exchanges
Providers: 13+ providers including Groq, Together, Fireworks, Replicate, Hyperbolic, DeepInfra, OpenRouter, and more
Features:
- Smart provider selection based on real-time capacity
- Redis-based rate limiting with per-minute and per-second quotas
- Concurrent request tracking to prevent overloading
- Automatic failover to healthy providers
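
The failover behaviour listed above can be pictured as trying providers in order until one succeeds. The following is a minimal sketch under that assumption; `ProviderCall` and `callWithFailover` are hypothetical names for illustration and are not taken from the project's source.

```typescript
// Hypothetical shape: a provider name plus a function that performs the request.
type ProviderCall<T> = { name: string; call: () => Promise<T> };

// Try each provider in the given order and return the first successful result.
// If every provider fails, throw one error that summarises all failures.
async function callWithFailover<T>(
  providers: ProviderCall<T>[],
): Promise<{ provider: string; result: T }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, result: await p.call() };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${p.name}: ${message}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}
```

In practice the order of `providers` would come from the selection step, so that the provider with the most remaining capacity is tried first.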

2. Content Usage Load Balancer

Purpose: Handles content moderation and usage verification requests
Providers: OpenAI and Anthropic
Features:
- Simplified provider selection for high-reliability use cases
- Focused on content safety and compliance

Load Balancing Algorithm

The system uses a Redis-based selection algorithm that considers:

Rate Limits: Per-minute and per-second quotas for each API key
Concurrent Requests: Active request tracking to prevent overloading
Provider Health: Automatic exclusion of unhealthy providers
Weighted Selection: Prioritizes providers with higher availability
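
The four factors above could be combined roughly as follows. This is a sketch, not the project's actual algorithm: the `KeyState` fields are illustrative, the state is passed in as plain objects rather than read from Redis, and "weighted" is assumed to mean a random pick proportional to remaining per-minute capacity.

```typescript
// Hypothetical snapshot of one API key's state (in the real service this
// would presumably be read from Redis).
interface KeyState {
  provider: string;
  healthy: boolean;
  maxRequestsPerMinute: number;
  usedThisMinute: number;
  maxConcurrentRequests?: number; // undefined means no concurrency cap
  activeRequests: number;
}

// Remaining per-minute capacity, or 0 if the key cannot take a request now.
function availableCapacity(k: KeyState): number {
  if (!k.healthy) return 0;
  if (k.maxConcurrentRequests !== undefined && k.activeRequests >= k.maxConcurrentRequests) {
    return 0;
  }
  return Math.max(0, k.maxRequestsPerMinute - k.usedThisMinute);
}

// Weighted random pick: probability proportional to remaining capacity.
// `rand` is injectable so the choice can be made deterministic in tests.
function selectKey(keys: KeyState[], rand: () => number = Math.random): KeyState | null {
  const candidates = keys
    .map((key) => ({ key, weight: availableCapacity(key) }))
    .filter((c) => c.weight > 0);
  const total = candidates.reduce((sum, c) => sum + c.weight, 0);
  if (total === 0) return null; // nothing usable: caller should back off or fail
  let r = rand() * total;
  for (const c of candidates) {
    r -= c.weight;
    if (r < 0) return c.key;
  }
  // Only reachable if rand() returns exactly 1; fall back to the last candidate.
  return candidates[candidates.length - 1]?.key ?? null;
}
```

Unhealthy keys and keys at their concurrency cap get weight 0, so they are never chosen; among the rest, a key with three times the headroom is picked about three times as often.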

Environment Setup

Prerequisites

Node.js 18+ and TypeScript
Docker and Docker Compose
Redis instance (for rate limiting and coordination)
SSH access to bare metal servers
SSL certificates for HTTPS endpoints

Project Structure

llm-roleplay-inference-api/
├── environment/          # Server-specific configurations
│   ├── server-1/         # Configuration for first bare metal server
│   │   └── .env          # Server-1 specific environment variables
│   ├── server-2/         # Configuration for second bare metal server
│   │   └── .env          # Server-2 specific environment variables
│   └── server-N/         # Additional servers as needed
├── src/                  # Source code
├── scripts/              # Deployment and maintenance scripts
├── .env                  # Root configuration (server IPs)
└── README.md

Setting Up Environment Directories

Create Root Environment File: create .env in the root directory with your server IPs:

# Server IP addresses for deployment
SERVER_1_IP=192.168.1.100
SERVER_2_IP=192.168.1.101
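
For tooling written in TypeScript, entries in this file could be read with a small parser like the one below. It is a sketch that assumes only the `SERVER_N_IP=value` pattern shown above; `parseServerIps` is a hypothetical helper, and the project's own scripts may read the file differently.

```typescript
// Extract SERVER_<N>_IP entries from the text of the root .env file.
// Comments, blank lines, and unrelated variables are ignored.
function parseServerIps(envText: string): Map<number, string> {
  const servers = new Map<number, string>();
  for (const line of envText.split(/\r?\n/)) {
    const m = /^\s*SERVER_(\d+)_IP\s*=\s*(\S+)\s*$/.exec(line);
    if (m === null) continue;
    const index = m[1];
    const ip = m[2];
    if (index === undefined || ip === undefined) continue;
    servers.set(Number(index), ip);
  }
  return servers;
}
```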

Create Server-Specific Directories

mkdir -p environment/server-1
mkdir -p environment/server-2
# Add more servers as needed

Configure Each Server's Environment: create a .env file in each server directory with the configuration specific to that server.

Deployment Guide

Step 1: Configure Environment Variables

For each server, create an .env file in environment/server-N/ with the following structure:

Redis Configuration (Required)

# Redis connection for rate limiting and coordination
REDIS_REST_URL=your-redis-url.com
REDIS_REST_PORT=6379
REDIS_PASSWORD=your-redis-password

API Security (Required)

# API authentication token
API_TOKEN=your-api-token-here

Provider Configurations (Required)

Configure at least two providers. Together and Fireworks are the main ones, but you can choose others. Each provider config is a JSON array of API keys with their rate limits:

# OpenAI Configuration (required for content usage)
OPENAI_CONFIGS='[
  {
    "apiKey": "sk-your-openai-key-1",
    "maxRequestsPerMinute": 500
  },
  {
    "apiKey": "sk-your-openai-key-2", 
    "maxRequestsPerMinute": 1000
  }
]'

# Together AI Configuration
TOGETHER_CONFIGS='[
  {
    "apiKey": "your-together-key-1",
    "maxRequestsPerMinute": 600
  }
]'

# Fireworks AI Configuration
FIREWORKS_CONFIGS='[
  {
    "apiKey": "your-fireworks-key-1",
    "maxRequestsPerMinute": 400,
    "maxConcurrentRequests": 8
  }
]'
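
Because each `*_CONFIGS` value is a JSON string, it is worth validating at startup. The sketch below shows one way to do that; `parseProviderConfigs` is a hypothetical helper, and the only fields it knows about are the three that appear in this README.

```typescript
interface ProviderKeyConfig {
  apiKey: string;
  maxRequestsPerMinute: number;
  maxConcurrentRequests?: number;
}

// Parse and validate one provider's config string, e.g. the value of FIREWORKS_CONFIGS.
// An unset or empty value means the provider is not configured.
function parseProviderConfigs(name: string, raw: string | undefined): ProviderKeyConfig[] {
  if (raw === undefined || raw.trim() === "") return [];
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed)) throw new Error(`${name} must be a JSON array`);
  return parsed.map((entry: unknown, i: number) => {
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`${name}[${i}] must be an object`);
    }
    const e = entry as Record<string, unknown>;
    const apiKey = e["apiKey"];
    const rpm = e["maxRequestsPerMinute"];
    const concurrent = e["maxConcurrentRequests"];
    if (typeof apiKey !== "string" || apiKey.length === 0) {
      throw new Error(`${name}[${i}].apiKey must be a non-empty string`);
    }
    if (typeof rpm !== "number" || !(rpm > 0)) {
      throw new Error(`${name}[${i}].maxRequestsPerMinute must be a positive number`);
    }
    const cfg: ProviderKeyConfig = { apiKey, maxRequestsPerMinute: rpm };
    if (concurrent !== undefined) {
      if (typeof concurrent !== "number" || !(concurrent > 0)) {
        throw new Error(`${name}[${i}].maxConcurrentRequests must be a positive number`);
      }
      cfg.maxConcurrentRequests = concurrent;
    }
    return cfg;
  });
}
```

A call such as `parseProviderConfigs("FIREWORKS_CONFIGS", process.env.FIREWORKS_CONFIGS)` would then fail fast on a malformed value instead of surfacing the problem on the first request.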

Optional Provider Configurations

Add any of these providers based on your access:

# Groq Configuration
GROQ_CONFIGS='[{"apiKey": "gsk_your-groq-key", "maxRequestsPerMinute": 30}]'

# Anthropic Configuration  
ANTHROPIC_CONFIGS='[{"apiKey": "sk-ant-your-key", "maxRequestsPerMinute": 50}]'

# Replicate Configuration
REPLICATE_CONFIGS='[{"apiKey": "r8_your-replicate-key", "maxRequestsPerMinute": 100}]'

# Hyperbolic Configuration
HYPERBOLIC_CONFIGS='[{"apiKey": "your-hyperbolic-key", "maxRequestsPerMinute": 200}]'

# DeepInfra Configuration
DEEPINFRA_CONFIGS='[{"apiKey": "your-deepinfra-key", "maxRequestsPerMinute": 300}]'

# OpenRouter Configuration
OPENROUTER_CONFIGS='[{"apiKey": "sk-or-your-key", "maxRequestsPerMinute": 200}]'

# AwanLLM Configuration
AWAN_CONFIGS='[{"apiKey": "your-awan-key", "maxRequestsPerMinute": 100}]'

# KlusterAI Configuration
KLUSTER_AI_CONFIGS='[{"apiKey": "your-kluster-key", "maxRequestsPerMinute": 150}]'

# AvianIO Configuration
AVIAN_IO_CONFIGS='[{"apiKey": "your-avian-key", "maxRequestsPerMinute": 120}]'

# Lambda Labs Configuration
LAMBDA_LABS_CONFIGS='[{"apiKey": "your-lambda-key", "maxRequestsPerMinute": 80}]'

# Novita AI Configuration
NOVITA_AI_CONFIGS='[{"apiKey": "your-novita-key", "maxRequestsPerMinute": 200}]'

# Inference.net Configuration
INFERENCE_NET_CONFIGS='[{"apiKey": "your-inference-key", "maxRequestsPerMinute": 100}]'

Step 2: Deploy to Bare Metal Servers

Prepare SSH Access

eval $(ssh-agent -s)
ssh-add ~/.ssh/your_private_key

Deploy to All Servers
```
./scripts/deploy-to-server.sh
```
This script will:
- Verify all environment files exist and are valid
- Copy source code to each server
- Apply server-specific configurations
- Create deployment backups
- Set proper permissions

Step 3: Start API Services

Start Services on All Servers
```
./scripts/start-api.sh
```
Verify Individual Server Health
```
./scripts/manual-health-check.sh
```

Step 4: Configure Load Balancer

Set up your external load balancer (e.g., DigitalOcean Load Balancer) to distribute traffic between your servers:

Health Check: GET /health
Sticky Sessions: Not required
Protocol: HTTPS with SSL termination
Ports: 443 → 3000 (or your configured port)

Configuration

Rate Limiting Configuration

Each provider supports these rate limiting parameters:

{
  "apiKey": "your-api-key",
  "maxRequestsPerMinute": 500,
  "maxConcurrentRequests": 10
}

maxRequestsPerMinute: Requests-per-minute (RPM) limit for this key
maxConcurrentRequests: Optional limit on concurrent requests (for providers like DeepInfra)
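
To make the two limits concrete, here is a sketch of how they might be enforced per API key. It uses a fixed one-minute window and in-memory maps as a stand-in for Redis, so it only works within a single process; the class and method names are hypothetical, and the real service coordinates these counters through Redis.

```typescript
interface KeyLimits {
  apiKey: string;
  maxRequestsPerMinute: number;
  maxConcurrentRequests?: number;
}

// In-memory stand-in for the Redis counters (an assumption for illustration).
// Old window entries are never purged here; with Redis they would simply expire.
class InMemoryLimiter {
  private windowCounts = new Map<string, number>();
  private active = new Map<string, number>();

  // Reserve a slot if the key is under both its RPM and concurrency limits.
  tryAcquire(cfg: KeyLimits, nowMs: number): boolean {
    const windowKey = `${cfg.apiKey}:${Math.floor(nowMs / 60000)}`;
    const used = this.windowCounts.get(windowKey) ?? 0;
    const inFlight = this.active.get(cfg.apiKey) ?? 0;
    if (used >= cfg.maxRequestsPerMinute) return false;
    if (cfg.maxConcurrentRequests !== undefined && inFlight >= cfg.maxConcurrentRequests) {
      return false;
    }
    this.windowCounts.set(windowKey, used + 1);
    this.active.set(cfg.apiKey, inFlight + 1);
    return true;
  }

  // Call when a request finishes, whether it succeeded or failed.
  release(cfg: KeyLimits): void {
    const inFlight = this.active.get(cfg.apiKey) ?? 0;
    this.active.set(cfg.apiKey, Math.max(0, inFlight - 1));
  }
}
```

A per-second quota could follow the same pattern with a one-second window. A fixed window allows short bursts at window boundaries; a sliding window is smoother but needs more bookkeeping.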

Docker Configuration

The service runs in Docker containers with:

Image: Node.js 18 Alpine
Port: 3000 (configurable)
Health Check: Built-in endpoint
Auto-restart: On failure

API Endpoints

Roleplay Chat

POST /roleplay-balancer
Authorization: Bearer <API_TOKEN>
Content-Type: application/json

{
  "messages": [
    {
      "role": "user",
      "content": [{"type": "text", "text": "Hello!"}]
    }
  ],
  "systemPrompt": "You are a helpful assistant.",
  "maxTokens": 1000,
  "stream": false
}
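
A client for this endpoint could assemble the request as sketched below. The body fields mirror the example above; `buildRoleplayRequest` and the base URL are hypothetical, and the response format is not documented here, so the sketch stops at building the request.

```typescript
interface TextPart { type: "text"; text: string }
interface ChatMessage { role: string; content: TextPart[] }

interface RoleplayRequestBody {
  messages: ChatMessage[];
  systemPrompt: string;
  maxTokens: number;
  stream: boolean;
}

interface HttpRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build the URL and request options for POST /roleplay-balancer.
function buildRoleplayRequest(
  baseUrl: string,
  apiToken: string,
  body: RoleplayRequestBody,
): HttpRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/roleplay-balancer`, // avoid a double slash
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    },
  };
}
```

On Node.js 18+, the result can be passed straight to the built-in `fetch`: `const res = await fetch(req.url, req.init)`.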

Content Usage Verification

POST /content-usage-balancer  
Authorization: Bearer <API_TOKEN>
Content-Type: application/json

{
  "prompt": "Content to analyze",
  "maxTokens": 500,
  "stream": false
}

Health Check

GET /health

Response: {
  "status": "healthy",
  "timestamp": "2024-01-01T00:00:00.000Z",
  "uptime": 3600,
  "providers": {
    "roleplay": 13,
    "content": 2
  }
}
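
A monitoring script might interpret this payload as sketched below. The field names come from the example response; treating a server as usable only when `status` is `"healthy"` and both provider counts are above zero is an assumption, since other status values are not documented here.

```typescript
interface HealthResponse {
  status: string;
  timestamp: string;
  uptime: number;
  providers: { roleplay: number; content: number };
}

// Validate an untyped JSON payload against the documented shape.
function parseHealth(payload: unknown): HealthResponse | null {
  if (typeof payload !== "object" || payload === null) return null;
  const p = payload as Record<string, unknown>;
  const providers = p["providers"];
  if (typeof providers !== "object" || providers === null) return null;
  const counts = providers as Record<string, unknown>;
  const status = p["status"];
  const timestamp = p["timestamp"];
  const uptime = p["uptime"];
  const roleplay = counts["roleplay"];
  const content = counts["content"];
  if (
    typeof status !== "string" || typeof timestamp !== "string" ||
    typeof uptime !== "number" || typeof roleplay !== "number" ||
    typeof content !== "number"
  ) {
    return null;
  }
  return { status, timestamp, uptime, providers: { roleplay, content } };
}

// Assumed policy: healthy status and at least one provider for each balancer.
function isServing(payload: unknown): boolean {
  const h = parseHealth(payload);
  return h !== null && h.status === "healthy" &&
    h.providers.roleplay > 0 && h.providers.content > 0;
}
```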

Maintenance

Available Scripts

| Script | Purpose |
| --- | --- |
| `deploy-to-server.sh` | Deploy code to all configured servers |
| `start-api.sh` | Start API services on all servers |
| `manual-health-check.sh` | Verify deployment health |
| `check-logs.sh` | Review application logs |
| `check-port-access.sh` | Validate port configurations |
| `delete-server-resources.sh` | Clean up server resources |
| `install-deps.sh` | Install dependencies |
| `reboot-server.sh` | Restart servers |
| `check-dist-exists.sh` | Verify build artifacts |

Monitoring and Logs

Log Location: /opt/llm-roleplay-inference-api/logs/
Log Format: JSON with timestamps and request IDs
Health Monitoring: Built-in health check endpoint
Metrics: Request counts, response times, error rates

Scaling

To add more servers:

Add Server IP to Root .env
```
SERVER_3_IP=192.168.1.102
```
Create Environment Directory
```
mkdir -p environment/server-3
```

Configure Server Environment

cp environment/server-1/.env environment/server-3/.env
# Edit server-3/.env with appropriate configurations

Update Deployment Scripts: the scripts automatically detect new servers based on the environment directories.

Deploy
```
./scripts/deploy-to-server.sh
```
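
Since a new server needs both an IP entry and an environment directory, a quick consistency check before deploying can catch a forgotten step. The sketch below is a hypothetical pre-deploy check, not part of the project's scripts; it assumes directories are named `server-N` and that the caller has already collected the server numbers from the root .env.

```typescript
// Compare the server numbers that have an IP entry with the directories
// found under environment/, and report what is missing on either side.
function checkServerLayout(
  serverNumbersWithIp: number[],
  environmentDirNames: string[],
): { missingDir: number[]; missingIp: number[] } {
  const withIp = new Set<number>(serverNumbersWithIp);
  const withDir = new Set<number>();
  for (const name of environmentDirNames) {
    const n = /^server-(\d+)$/.exec(name)?.[1];
    if (n !== undefined) withDir.add(Number(n)); // ignore unrelated entries
  }
  const ascending = (a: number, b: number) => a - b;
  return {
    missingDir: Array.from(withIp).filter((n) => !withDir.has(n)).sort(ascending),
    missingIp: Array.from(withDir).filter((n) => !withIp.has(n)).sort(ascending),
  };
}
```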

Troubleshooting

Provider Selection Issues

Check Redis connectivity
Verify provider API keys are valid
Review rate limit configurations

Deployment Failures

Ensure SSH keys are properly configured
Verify server connectivity
Check disk space on target servers

Performance Issues

Monitor Redis performance
Check concurrent request limits
Review provider response times
