Inspiration
What's the derivative of arccos(x)? I asked Claude Opus 4.6. Too often we overkill like this, feeding simple tasks into state-of-the-art models. Because inference providers offer only a handful of model options, every task ends up either underserved or overserved: there's no way to precisely match compute to problem difficulty, so you either spin up a sledgehammer or settle for something too weak. We wanted inference to behave less like a fixed menu and more like a tunable dial.
What it does
We introduce three core technical contributions:
1) On-demand speculative decoding (via vLLM)
We implement speculative decoding as a first-class, runtime-configurable primitive using vLLM, dynamically adjusting draft model selection and tokens-ahead per request. Instead of statically enabling acceleration, speculative decoding becomes an adaptive service layer that optimizes acceptance rate, latency, and compute efficiency in real time.
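As a sketch of what "runtime-configurable" means here: each request gets its own draft-model choice and tokens-ahead value, expressed as the kind of speculative config vLLM consumes at engine construction. The function name, draft model, and thresholds below are illustrative placeholders, not our exact production logic.

```python
# Hypothetical per-request speculative-decoding selector. The resulting dict
# is the shape vLLM accepts as a speculative config ("model" = draft model,
# "num_speculative_tokens" = tokens drafted ahead per step).
def pick_speculative_config(prompt: str, latency_budget_ms: float) -> dict:
    hard = len(prompt) > 2000 or "prove" in prompt.lower()
    if hard:
        # Hard prompts reject drafts often, so speculate conservatively.
        return {"model": "meta-llama/Llama-3.2-1B-Instruct",
                "num_speculative_tokens": 2}
    if latency_budget_ms < 500:
        # Easy prompt under a tight budget: speculate aggressively.
        return {"model": "meta-llama/Llama-3.2-1B-Instruct",
                "num_speculative_tokens": 8}
    # Moderate default for everything else.
    return {"model": "meta-llama/Llama-3.2-1B-Instruct",
            "num_speculative_tokens": 4}

cfg = pick_speculative_config("What's the derivative of arccos(x)?", 400.0)
print(cfg["num_speculative_tokens"])  # → 8
```

The point is that this decision happens per request at the service layer, not once at deployment.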
2) Hardware flexibility (via Modal GPU orchestration)
We decouple inference from fixed infrastructure by programmatically orchestrating Modal GPU profiles per request. The system can select between hardware configurations based on latency targets, utilization, and energy constraints, transforming GPU selection into a tunable parameter rather than a deployment-time decision.
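Because Modal takes the GPU type as an ordinary parameter (e.g. `gpu="H100"` on a function), hardware choice can be computed per request. A minimal sketch of such a selector, with hypothetical tiers and cutoffs:

```python
# Illustrative GPU picker: maps request constraints to a Modal GPU string.
# The tiers and thresholds are placeholders, not measured crossover points.
def pick_gpu(model_params_b: float, latency_target_ms: float) -> str:
    if model_params_b >= 30:
        return "H100"   # large models need the memory and bandwidth
    if latency_target_ms < 300:
        return "A100"   # tight latency budget: pay for a faster card
    if model_params_b >= 7:
        return "A10G"   # mid-size model, relaxed latency
    return "L4"         # small models run cheap

print(pick_gpu(8, 1000.0))  # → A10G
```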
3) Agentic routing (via Claude Agent SDK)
We implement a Router Agent using the Claude Agent SDK that observes hardware state, user-defined performance constraints, and prior run metrics to plan and apply optimal inference configurations. The agent closes the loop between intent (“faster,” “cheaper,” “more efficient”) and execution by autonomously selecting hardware and speculative decoding hyperparameters before each inference call.
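The shape of the plan the Router Agent produces can be sketched without the SDK: intent plus prior-run metrics in, a concrete hardware-and-speculation config out. In our system the Claude agent makes this call; the hand-rolled rules below are a stand-in to show the loop, not the agent's actual policy.

```python
# Stand-in for the Router Agent's decision: translate a plain-language
# intent and the last run's draft acceptance rate into an inference plan.
def plan_inference(intent: str, last_accept_rate: float) -> dict:
    plan = {"gpu": "A10G", "num_speculative_tokens": 4}  # illustrative default
    if intent == "faster":
        plan["gpu"] = "H100"
        # Only speculate harder if drafts have been accepted often.
        plan["num_speculative_tokens"] = 8 if last_accept_rate > 0.7 else 4
    elif intent == "cheaper":
        plan["gpu"] = "L4"
        plan["num_speculative_tokens"] = 2
    return plan

print(plan_inference("faster", 0.9))  # → {'gpu': 'H100', 'num_speculative_tokens': 8}
```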
How we built it
Modal, vLLM, FastAPI, Claude Agent SDK, Vercel, Next.js
Accomplishments that we're proud of
- Serving models with less water waste: right-sizing compute per request means fewer GPU-hours, and less datacenter cooling, spent on simple queries.
What's next for Power Lever
- Use live telemetry (TTFT, throughput, early perplexity) to auto-escalate or downshift mid-generation.
- Improve speculative decoding instrumentation: report real per-token accept/reject rates and tune $k$ and $\tau$ automatically per request.
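The auto-escalation idea above can be sketched as a simple telemetry rule; the thresholds here are placeholders for values we would tune from live data.

```python
# Planned mid-generation controller (sketch): watch early telemetry and
# decide whether to downshift, hold, or escalate the current request.
def adjust(ttft_ms: float, tokens_per_s: float, accept_rate: float) -> str:
    if accept_rate > 0.85 and tokens_per_s > 60:
        return "downshift"  # drafts nearly always accepted: a smaller setup may suffice
    if ttft_ms > 1500 or tokens_per_s < 15:
        return "escalate"   # struggling: faster hardware, or drop drafting overhead
    return "hold"

print(adjust(200.0, 80.0, 0.9))  # → downshift
```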
Built With
- claude
- fastapi
- modal
- node.js
- openai
- vercel
- vllm