
SimpleInference

A lightweight, Fiber-friendly Ruby client for OpenAI-compatible LLM APIs. Works seamlessly with OpenAI, Azure OpenAI, Volcengine (火山引擎), DeepSeek, Groq, Together AI, and any other provider that implements the OpenAI API specification.

Designed for simplicity and compatibility – no heavy dependencies, just pure Ruby with Net::HTTP.

Features

  • 🔌 Universal compatibility – Works with any OpenAI-compatible API provider
  • 🌊 Streaming support – Native SSE streaming for chat completions
  • 🧵 Fiber-friendly – Compatible with the Ruby 3 Fiber scheduler; works great with Falcon
  • 🔧 Flexible configuration – Customizable API prefix for non-standard endpoints
  • 🎯 Simple interface – Receive-an-Object / Return-an-Object style API
  • 📦 Zero runtime dependencies – Uses only Ruby standard library

Installation

Add to your Gemfile:

gem "simple_inference"

Then run:

bundle install

Quick Start

require "simple_inference"

# Connect to OpenAI
client = SimpleInference::Client.new(
  base_url: "https://api.openai.com",
  api_key: ENV["OPENAI_API_KEY"]
)

result = client.chat(
  model: "gpt-4o-mini",
  messages: [{ "role" => "user", "content" => "Hello!" }]
)

puts result.content
p result.usage

Configuration

Options

Option          Env Variable                       Default                Description
base_url        SIMPLE_INFERENCE_BASE_URL          http://localhost:8000  API base URL
api_key         SIMPLE_INFERENCE_API_KEY           nil                    API key (sent as Authorization: Bearer <token>)
api_prefix      SIMPLE_INFERENCE_API_PREFIX        /v1                    API path prefix (e.g. /v1; empty string for some providers)
timeout         SIMPLE_INFERENCE_TIMEOUT           nil                    Request timeout in seconds
open_timeout    SIMPLE_INFERENCE_OPEN_TIMEOUT      nil                    Connection open timeout in seconds
read_timeout    SIMPLE_INFERENCE_READ_TIMEOUT      nil                    Read timeout in seconds
raise_on_error  SIMPLE_INFERENCE_RAISE_ON_ERROR    true                   Raise exceptions on HTTP errors
headers         (none)                             {}                     Additional headers to send with each request
adapter         (none)                             Net::HTTP adapter      HTTP adapter (see HTTP Adapters)
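
Since each option falls back to its environment variable, a client can be built with no arguments at all when the relevant variables are exported:

# Assumes SIMPLE_INFERENCE_BASE_URL and SIMPLE_INFERENCE_API_KEY
# are set in the environment.
client = SimpleInference::Client.new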

Provider Examples

OpenAI

client = SimpleInference::Client.new(
  base_url: "https://api.openai.com",
  api_key: ENV["OPENAI_API_KEY"]
)

Volcengine (火山引擎 / ByteDance)

Volcengine's API paths do not include the /v1 prefix, so api_prefix must be set to an empty string:

client = SimpleInference::Client.new(
  base_url: "https://ark.cn-beijing.volces.com/api/v3",
  api_key: ENV["ARK_API_KEY"],
  api_prefix: ""  # 重要:火山引擎不使用 /v1 前缀
)

result = client.chat(
  model: "deepseek-v3-250324",
  messages: [
    { "role" => "system", "content" => "你是人工智能助手" },
    { "role" => "user", "content" => "你好" }
  ]
)

puts result.content

DeepSeek

client = SimpleInference::Client.new(
  base_url: "https://api.deepseek.com",
  api_key: ENV["DEEPSEEK_API_KEY"]
)

Groq

client = SimpleInference::Client.new(
  base_url: "https://api.groq.com/openai",
  api_key: ENV["GROQ_API_KEY"]
)

Together AI

client = SimpleInference::Client.new(
  base_url: "https://api.together.xyz",
  api_key: ENV["TOGETHER_API_KEY"]
)

Local inference servers (Ollama, vLLM, etc.)

# Ollama
client = SimpleInference::Client.new(
  base_url: "http://localhost:11434"
)

# vLLM
client = SimpleInference::Client.new(
  base_url: "http://localhost:8000"
)

Custom authentication header

Some providers use non-standard authentication headers:

client = SimpleInference::Client.new(
  base_url: "https://my-service.example.com",
  api_prefix: "/v1",
  headers: {
    "x-api-key" => ENV["MY_SERVICE_KEY"]
  }
)

API Methods

Chat

result = client.chat(
  model: "gpt-4o-mini",
  messages: [
    { "role" => "system", "content" => "You are a helpful assistant." },
    { "role" => "user", "content" => "Hello!" }
  ],
  temperature: 0.7,
  max_tokens: 1000
)

puts result.content
p result.usage

Streaming Chat

result = client.chat(
  model: "gpt-4o-mini",
  messages: [{ "role" => "user", "content" => "Tell me a story" }],
  stream: true,
  include_usage: true
) do |delta|
  print delta
end
puts

p result.usage

Low-level event streaming is also available and can be consumed as an Enumerator:

stream = client.chat_completions_stream(
  model: "gpt-4o-mini",
  messages: [{ "role" => "user", "content" => "Hello" }]
)

stream.each do |event|
  # Each event is one parsed SSE payload. With an OpenAI-compatible
  # server this is typically a "chat.completion.chunk" hash whose text
  # delta, when present, is at event.dig("choices", 0, "delta", "content").
end

Or as an Enumerable of delta strings:

stream = client.chat_stream(
  model: "gpt-4o-mini",
  messages: [{ "role" => "user", "content" => "Hello" }],
  include_usage: true
)

stream.each { |delta| print delta }
puts
p stream.result&.usage

Embeddings

response = client.embeddings(
  model: "text-embedding-3-small",
  input: "Hello, world!"
)

vector = response.body["data"][0]["embedding"]
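
OpenAI-compatible embedding endpoints also accept an array of inputs; assuming the server supports batching, the vectors come back in the same order:

response = client.embeddings(
  model: "text-embedding-3-small",
  input: ["first document", "second document"]
)

vectors = response.body["data"].map { |item| item["embedding"] }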

Rerank

response = client.rerank(
  model: "bge-reranker-v2-m3",
  query: "What is machine learning?",
  documents: [
    "Machine learning is a subset of AI...",
    "The weather today is sunny...",
    "Deep learning uses neural networks..."
  ]
)
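
The rerank response schema varies by provider. Rerankers that follow the common Cohere/Jina convention return a results array of index/relevance_score pairs; assuming that shape:

# Print documents from most to least relevant (assumed response shape).
response.body["results"]
  .sort_by { |r| -r["relevance_score"] }
  .each { |r| puts "#{r["relevance_score"].round(3)}: document ##{r["index"]}" }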

Audio Transcription

response = client.audio_transcriptions(
  model: "whisper-1",
  file: File.open("audio.mp3", "rb")
)

puts response.body["text"]

Audio Translation

response = client.audio_translations(
  model: "whisper-1",
  file: File.open("audio.mp3", "rb")
)
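
As with transcription, the translated text is in the response body:

puts response.body["text"]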

List Models

model_ids = client.models
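
The result is the array of model ids reported by the provider:

p model_ids  # e.g. ["gpt-4o-mini", ...] (the ids depend on the provider)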

Health Check

# Returns full response
response = client.health

# Returns boolean
if client.healthy?
  puts "Service is up!"
end

Response Format

All HTTP methods return a SimpleInference::Response with:

response.status   # Integer HTTP status code
response.headers  # Hash with downcased String keys
response.body     # Parsed JSON (Hash/Array), raw String, or nil (SSE success)
response.success? # true for 2xx

Error Handling

By default, non-2xx responses raise exceptions:

begin
  client.chat_completions(model: "invalid", messages: [])
rescue SimpleInference::Errors::HTTPError => e
  puts "HTTP #{e.status}: #{e.message}"
  p e.body      # parsed body (Hash/Array/String)
  puts e.raw_body # raw response body string (if available)
end

Other exception types:

  • SimpleInference::Errors::TimeoutError – Request timed out
  • SimpleInference::Errors::ConnectionError – Network error
  • SimpleInference::Errors::DecodeError – JSON parsing failed
  • SimpleInference::Errors::ConfigurationError – Invalid configuration
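
Timeouts and connection failures are often transient, so a bounded retry is a common pattern; a minimal sketch:

retries = 0
begin
  result = client.chat(
    model: "gpt-4o-mini",
    messages: [{ "role" => "user", "content" => "Hello!" }]
  )
rescue SimpleInference::Errors::TimeoutError, SimpleInference::Errors::ConnectionError
  retries += 1
  retry if retries <= 2
  raise
end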

To handle errors manually:

client = SimpleInference::Client.new(
  base_url: "https://api.openai.com",
  api_key: ENV["OPENAI_API_KEY"],
  raise_on_error: false
)

response = client.chat_completions(model: "gpt-4o-mini", messages: [...])

if response.success?
  # success
else
  puts "Error: #{response.status} - #{response.body}"
end

HTTP Adapters

Default (Net::HTTP)

The default adapter uses Ruby's built-in Net::HTTP. It's thread-safe and compatible with the Ruby 3 Fiber scheduler.

HTTPX Adapter

For better performance or async environments, use the optional HTTPX adapter:

# Gemfile
gem "httpx"

adapter = SimpleInference::HTTPAdapters::HTTPX.new(timeout: 30.0)

client = SimpleInference::Client.new(
  base_url: "https://api.openai.com",
  api_key: ENV["OPENAI_API_KEY"],
  adapter: adapter
)

Custom Adapter

Implement your own adapter by subclassing SimpleInference::HTTPAdapter:

class MyAdapter < SimpleInference::HTTPAdapter
  def call(request)
    # request keys: :method, :url, :headers, :body, :timeout, :open_timeout, :read_timeout
    # Must return: { status: Integer, headers: Hash, body: String }
  end

  def call_stream(request, &block)
    # For streaming support (optional)
    # Yield raw chunks to block for SSE responses
  end
end
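
For reference, a minimal (non-streaming) adapter built on Net::HTTP might look like the sketch below. The class name and error handling are illustrative; only the request/response contract above is fixed.

require "net/http"
require "uri"

class MinimalNetHTTPAdapter < SimpleInference::HTTPAdapter
  def call(request)
    uri = URI(request[:url])
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = (uri.scheme == "https")
    http.open_timeout = request[:open_timeout] if request[:open_timeout]
    http.read_timeout = request[:read_timeout] if request[:read_timeout]

    # Map :get / :post / ... onto Net::HTTP::Get / Net::HTTP::Post / ...
    klass = Net::HTTP.const_get(request[:method].to_s.capitalize)
    req = klass.new(uri, request[:headers] || {})
    req.body = request[:body] if request[:body]

    res = http.request(req)
    { status: res.code.to_i, headers: res.each_header.to_h, body: res.body.to_s }
  end
end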

Rails Integration

Create an initializer at config/initializers/simple_inference.rb:

INFERENCE_CLIENT = SimpleInference::Client.new(
  base_url: ENV.fetch("INFERENCE_BASE_URL", "https://api.openai.com"),
  api_key: ENV["INFERENCE_API_KEY"]
)

Use in controllers:

class ChatsController < ApplicationController
  def create
    response = INFERENCE_CLIENT.chat_completions(
      model: "gpt-4o-mini",
      messages: [{ "role" => "user", "content" => params[:prompt] }]
    )

    render json: response.body
  end
end

Use in background jobs:

class EmbedJob < ApplicationJob
  def perform(text)
    response = INFERENCE_CLIENT.embeddings(
      model: "text-embedding-3-small",
      input: text
    )

    vector = response.body["data"][0]["embedding"]
    # Store vector...
  end
end

Thread Safety

The client is thread-safe:

  • No global mutable state
  • Per-client configuration only
  • Each request uses its own HTTP connection
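
So a single client instance can be shared freely, whether across threads or across fibers under a Fiber scheduler. A minimal sketch with threads:

client = SimpleInference::Client.new(
  base_url: "https://api.openai.com",
  api_key: ENV["OPENAI_API_KEY"]
)

threads = 3.times.map do |i|
  Thread.new do
    client.chat(
      model: "gpt-4o-mini",
      messages: [{ "role" => "user", "content" => "Question #{i}" }]
    ).content
  end
end

p threads.map(&:value)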

License

MIT License. See LICENSE for details.
