Inspiration
We were inspired by a simple problem: most LLM safety systems live in prompts, and prompts are fragile. If a model can be jailbroken, injected, or simply ignore instructions, then the safety layer is sitting in the weakest possible place. We wanted to build something deeper and more durable, a system that operates inside the model at inference time rather than just asking the model to behave.
That led us to representation engineering and steering vectors. Instead of retraining an entire model or layering on more prompt rules, we built Lobo as a real-time inference firewall that can suppress harmful concepts like deception, toxicity, and dangerousness directly in the model’s internal activations while it generates.
What it does
Lobo is a real-time inference firewall for LLMs. Instead of using system prompts or output filters (which jailbreaks can easily get around), Lobo intervenes inside the model during generation itself. Using representation engineering, we precompute a direction vector in activation space for each of seven harmful concepts: toxicity, deception, danger, stereotypes, legal risk, formality violations, and coldness. At inference time, a PyTorch forward hook fires at layer 14 of the residual stream and subtracts those weighted directions before each token is produced. The safety constraint is purely mathematical, so there's no prompt text to inject around.
The operator side is a live admin dashboard where per-concept multiplier sliders let you dial in exactly how aggressively each concept gets suppressed. Changes propagate instantly to the GPU inference worker through a shared Modal config store, no redeployment needed. Every conversation gets logged to Supabase with the exact multipliers that were active during that generation, and a Gemini background function automatically scores each concept as flagged or clean right after, giving you a second-opinion audit layer on top of the steering. The whole thing is demoed through Cowboy Cafe, a Western-themed restaurant chatbot that runs on the same steered model, showing what safe LLM deployment actually looks like in a real product.
How we built it
The backend is built entirely on Modal. A CPU class handles admin config through a shared persistent key-value store, and a GPU class (A10G) runs the 8B model and serves generation requests.
The steering vectors are the core technical piece. For each of the seven concepts, we ran contrasting toxic and safe prompt pairs through the model and used a forward pre-hook to capture the hidden state activations at layer 14 of the residual stream. The steering vector is the difference between the mean toxic activation and the mean safe activation across those prompts. This gives us a direction in the model's internal representation space that corresponds to each concept.
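The capture-and-average step can be sketched with a toy layer stack standing in for the 8B model. The hook mechanics (a forward pre-hook that grabs the layer's input hidden states) are the same as in Lobo, but the names, dimensions, and random "activations" below are purely illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer's layer stack; Lobo hooks layer 14 of
# an 8B model, but the capture mechanics are identical.
D_MODEL = 16
layers = nn.ModuleList([nn.Linear(D_MODEL, D_MODEL) for _ in range(4)])
HOOK_LAYER = 2

def run(hidden):
    for layer in layers:
        hidden = layer(hidden)
    return hidden

captured = []

def capture_pre_hook(module, args):
    # args[0] is the hooked layer's input, shape (seq, d_model);
    # average over token positions to get r(p) for this prompt.
    captured.append(args[0].detach().mean(dim=0))

handle = layers[HOOK_LAYER].register_forward_pre_hook(capture_pre_hook)

def mean_activation(prompt_batches):
    captured.clear()
    with torch.no_grad():
        for h in prompt_batches:
            run(h)
    return torch.stack(captured).mean(dim=0)

# Each "prompt" here is a (seq_len, d_model) tensor standing in for
# tokenized text run through the real model; toxic and safe sets are
# offset so they occupy different regions of activation space.
toxic = [torch.randn(5, D_MODEL) + 1.0 for _ in range(8)]
safe = [torch.randn(5, D_MODEL) - 1.0 for _ in range(8)]

# v_c = mu_toxic - mu_safe: the concept direction.
v_c = mean_activation(toxic) - mean_activation(safe)
handle.remove()
```

In the real pipeline the prompt batches come from the tokenizer and the hook sits on the actual decoder layer, but the difference-of-means arithmetic is exactly this.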
At startup, the GPU inference class loads all seven vectors, L2-normalizes each one so multiplier values are interpretable, and registers a forward pre-hook on layer 14. At inference time, it builds a weighted sum of the active concept vectors using the current admin multipliers, then subtracts that combined vector from the hidden states before the layer processes them. This happens on every forward pass during generation, so every token is affected.
The steering vector formula is:
$$v_c = \mu_c^{\text{toxic}} - \mu_c^{\text{safe}}$$
where $r(p)$ averages the layer-input residual activations over the $T$ token positions of prompt $p$, and each concept mean averages $r(p)$ over its prompt set:
$$r(p) = \frac{1}{T} \sum_{t=1}^{T} h_{l,t}(p), \quad \mu_c^{\text{toxic}} = \frac{1}{|P_c^{\text{toxic}}|} \sum_{p \in P_c^{\text{toxic}}} r(p), \quad \mu_c^{\text{safe}} = \frac{1}{|P_c^{\text{safe}}|} \sum_{p \in P_c^{\text{safe}}} r(p)$$
Then at inference time we apply:
$$\hat{v}_c = \frac{v_c}{\Vert v_c \Vert_2}$$
$$u = \sum_c m_c \hat{v}_c$$
$$h_{l,t} \leftarrow h_{l,t} - u$$
with optional global scaling and norm cap before subtraction.
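The inference-time side of those equations fits in one forward pre-hook. Below is a minimal sketch with a toy layer in place of layer 14; the knob names (`GLOBAL_SCALE`, `NORM_CAP`) and concept subset are assumptions for illustration, not Lobo's exact configuration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D_MODEL = 16

# Precomputed concept vectors (random stand-ins for Lobo's seven).
concepts = {c: torch.randn(D_MODEL) for c in ["toxicity", "deception", "danger"]}
unit = {c: v / v.norm(p=2) for c, v in concepts.items()}  # v_hat = v / ||v||_2

# Per-concept multipliers m_c, as set from the admin dashboard.
multipliers = {"toxicity": 4.0, "deception": 2.0, "danger": 0.0}
GLOBAL_SCALE = 1.0   # illustrative global scaling knob
NORM_CAP = 8.0       # illustrative cap on ||u||_2

def steering_pre_hook(module, args):
    hidden = args[0]  # (batch, seq, d_model)
    # u = sum_c m_c * v_hat_c over active concepts
    u = sum(m * unit[c] for c, m in multipliers.items() if m != 0.0)
    u = u * GLOBAL_SCALE
    if u.norm(p=2) > NORM_CAP:          # optional norm cap
        u = u * (NORM_CAP / u.norm(p=2))
    # Returning a tuple replaces the layer's inputs: h <- h - u
    return (hidden - u,) + args[1:]

layer = nn.Linear(D_MODEL, D_MODEL)     # toy stand-in for layer 14
handle = layer.register_forward_pre_hook(steering_pre_hook)
out = layer(torch.randn(1, 5, D_MODEL))
handle.remove()
```

Because the hook fires on every forward pass, the subtraction applies to every generated token without touching the prompt or the sampler.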
The multipliers can be updated at any time through the admin dashboard without restarting anything. The CPU admin class writes the new values to the shared key-value store, and the GPU inference class picks them up on the next request.
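The propagation pattern is a plain key-value read-through: writer sets, reader fetches per request. A minimal local stand-in for the shared store (the real deployment uses Modal's persisted key-value store; class and key names here are illustrative):

```python
import threading

class SharedConfig:
    """Local stand-in for the shared Modal key-value store.
    The CPU admin class writes; the GPU inference class reads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def __setitem__(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

config = SharedConfig()

# Admin dashboard side: push new multipliers, no redeploy.
config["multipliers"] = {"toxicity": 4.0, "deception": 2.0, "danger": 1.0}

# GPU inference side: read the latest values on each request.
active = config.get("multipliers", {})
```

Because the inference class re-reads the store per request rather than caching at startup, slider changes take effect on the very next generation.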
After each generation, the response and active multipliers are logged to Supabase, and a separate Gemini background function independently scores each concept as flagged or clean, writing results back to the same row.
Challenges we ran into
One of the biggest challenges was getting steering to behave reliably at inference time. In theory, subtracting a harmful concept direction should reduce that concept, but in practice the effect depends heavily on vector quality, layer alignment, and scale. If the steering vector is computed in one representation space and applied in a mismatched one, the output can become noisy or degrade outright.
We also had to solve practical engineering issues:
- Migrating from transformer-lens to plain PyTorch hooks, since our model was not supported in its model registry
- Reducing latency and cold-start pain when serving an 8B model on GPU infrastructure
- Designing a UI that made abstract steering controls understandable in real time
- Syncing the admin controls, shared config, and inference service cleanly
- Making our demo metrics and visualizations reflect the underlying values accurately
- Logging generations and attaching post-hoc Gemini concept evaluations without breaking the core chat flow
Accomplishments that we're proud of
We’re proud that Lobo is not just a mockup. It is a working end-to-end system that:
- Applies inference-time steering to a real open-source LLM
- Exposes those controls through a clean admin dashboard
- Powers a customer-facing chat experience using the same backend
- Logs generations and steering settings for traceability
- Adds automated concept-level evaluation and review tooling
We’re also proud of the core idea: instead of relying only on prompts, we built a safety layer that acts directly on model internals. That makes Lobo feel more like infrastructure than just another wrapper around an LLM.
What we learned
We learned that LLM safety is as much a systems problem as it is a model problem. It is not enough to say “make the model safer” unless you can control, measure, and inspect that behavior in production.
We also learned:
- Inference-time steering is powerful, but very sensitive to vector quality and calibration
- Visualization matters a lot when you are explaining invisible model behavior
- Operators need fast feedback loops, not just offline evaluations
- Building a trustworthy AI tool means connecting model control, observability, and user experience into one system
What's next for Lobo
Next, we want to make Lobo more robust, more measurable, and easier to deploy.
Our roadmap includes:
- Improving steering vector quality and alignment across models and layers
- Letting users add controllable concepts
- Adding stronger benchmarking and side-by-side evaluation workflows
- Improving latency and reducing cold starts for production settings
Longer term, we want Lobo to become a practical control plane for safe open-model inference.