Training for Behavior, Not Knowledge
Your eval scores went up. Your agent still does the wrong thing.
Standard benchmarks measure what a model knows. MMLU, HellaSwag, ARC-Challenge. Fine-tune a model on domain-specific text and those scores often nudge up.
The agent still does the wrong thing.
That’s not a contradiction. Knowledge and behavior are different properties of a model, and the thing you measure shapes what training optimizes for. Conflating the two is how you ship something that scores well on paper and embarrasses you in production. Train for the role.
It took me longer than I’d like to take this seriously. Then I trained for behavior and watched the difference show up where it mattered.
The two roles
The system has two agents with different jobs.
The first is a foreman. It receives a high-level objective, decomposes it into tasks, delegates to the worker and validates the results. Its failure modes: doing the worker’s job itself, delegating tasks that exceed the worker’s scope, committing to conclusions before verification.
The second is a worker. It receives delegated tasks, executes them using tools and reports evidence back. Its first failure mode is interpreting beyond what the evidence shows. The second, fabricated tool output, is the worst version of that: the model generates a plausible-sounding result when the tool fails, with no signal that it happened. The third failure mode is scope expansion: touching things it wasn’t asked to touch.
Both agents are built on Gemma 4. The foreman is the 31B dense model. The worker is the 26B MoE variant, which activates roughly 3.8B of its 26B parameters per token by routing each token through specialized sub-networks. That sparse activation pattern fits what a worker does: diverse tasks, each individually narrow.
Out of the box, neither model behaves correctly for its role. They’re general-purpose instruction-following models. They produce helpful, fluent, varied responses. “Varied” is exactly wrong when a role requires consistent behavioral patterns. The base models needed shaping, not augmentation.
What QLoRA is actually doing
QLoRA (Quantized Low-Rank Adaptation) is the standard memory-efficient approach for fine-tuning at this size. The base model loads in 4-bit NF4 quantization and stays frozen. (NF4 minimizes precision loss on normally distributed values, which is roughly how large-model weights are distributed.) Small adapter matrices, separate from the frozen base, train in higher precision on top of selected layers. Their product approximates the weight update you’d get from full fine-tuning at a fraction of the memory cost. After training, you merge the adapter into the full-precision base. That’s the deployed model.
The configuration choices that matter, offered as principles rather than a recipe: a moderate LoRA rank (enough capacity for behavioral shaping), an alpha-to-rank ratio that amplifies adapter influence without overpowering the base, gradient accumulation to simulate a usable batch size, and a cosine learning rate schedule with warmup. Three epochs. The defaults in most fine-tuning guides target domain knowledge transfer. Behavioral fine-tuning wants a lighter hand.
Rank 16 is enough for behavioral shaping. (Every tutorial I found recommended rank 32, which is too high for this goal.) Reaching for rank 64 usually means you’re teaching information, not behavior.
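To make the memory argument concrete, here is a minimal sketch of the low-rank arithmetic in plain Python. The layer dimensions are illustrative assumptions, not Gemma 4’s actual shapes.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix the adapter sits beside."""
    return d_in * d_out


# Hypothetical square projection; real layer shapes will differ.
d_in, d_out, rank = 4096, 4096, 16

adapter = lora_param_count(d_in, d_out, rank)
full = full_param_count(d_in, d_out)

print(f"adapter params: {adapter:,}")
print(f"full params:    {full:,}")
print(f"adapter / full: {adapter / full:.2%}")
```

For a layer of this size the adapter is well under one percent of the matrix it adapts, which is why only the adapter needs optimizer state.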
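As a rough sketch, those choices can be written down as a plain config plus the schedule it implies. Only rank 16 and three epochs come from the text; the alpha, batch sizes, learning rate and warmup length below are placeholder assumptions, and BehavioralTuneConfig is a hypothetical name rather than any library’s class.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BehavioralTuneConfig:
    # From the text: rank 16, three epochs.
    lora_rank: int = 16
    num_epochs: int = 3
    # Assumptions for illustration only; tune for your own setup.
    lora_alpha: int = 32
    per_device_batch_size: int = 2
    grad_accum_steps: int = 8
    peak_lr: float = 1e-4
    warmup_steps: int = 20

    @property
    def adapter_scale(self) -> float:
        """Standard LoRA scales the adapter update by alpha / rank."""
        return self.lora_alpha / self.lora_rank

    @property
    def effective_batch_size(self) -> int:
        """Gradient accumulation simulates a larger batch on one device."""
        return self.per_device_batch_size * self.grad_accum_steps


def cosine_lr(step: int, total_steps: int, cfg: BehavioralTuneConfig) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < cfg.warmup_steps:
        return cfg.peak_lr * (step + 1) / cfg.warmup_steps
    progress = (step - cfg.warmup_steps) / max(1, total_steps - cfg.warmup_steps)
    return 0.5 * cfg.peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


cfg = BehavioralTuneConfig()
total_steps = 200  # hypothetical run length
print("adapter scale:", cfg.adapter_scale)
print("effective batch size:", cfg.effective_batch_size)
for step in (0, 19, 20, 110, 200):
    print(step, f"{cosine_lr(step, total_steps, cfg):.2e}")
```

Keeping the alpha-to-rank ratio explicit makes it easy to see how hard the adapter pushes against the base when you change the rank.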
The Gemma 4 module gotcha
Every QLoRA implementation requires you to specify which layer types to adapt. Most architectures expose the standard attention projection layers under recognizable names: q_proj, k_proj, v_proj, o_proj (the matrices controlling how the model weighs different parts of its input).
Gemma 4 is different. Its attention layers wrap the standard linear projection inside a custom class, which means the actual weight matrix isn’t where you’d expect it from reading any other fine-tuning guide. If you target the names that work everywhere else, PEFT (the adapter training library) can’t find the modules. Training “succeeds.” The model barely changes.
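One way to catch this before training is to list the model’s modules and confirm that each target name lands on an adaptable layer. The sketch below mimics suffix-style name matching in plain Python. The module names and the WrappedProjection class are hypothetical stand-ins, not Gemma 4’s real module tree, and the matching rule is a simplification of what an adapter library does.

```python
from typing import Iterable


def unmatched_targets(
    modules: Iterable[tuple[str, str]],
    targets: Iterable[str],
    adaptable_types: frozenset[str] = frozenset({"Linear"}),
) -> list[str]:
    """Return target names that match no adaptable module.

    A target matches when the module name equals it or ends with '.<target>',
    and the module's class is one the adapter can wrap.
    """
    modules = list(modules)
    missing = []
    for target in targets:
        hit = any(
            (name == target or name.endswith("." + target))
            and type_name in adaptable_types
            for name, type_name in modules
        )
        if not hit:
            missing.append(target)
    return missing


# Hypothetical (name, class name) pairs. With a real PyTorch model they could
# be built as [(n, type(m).__name__) for n, m in model.named_modules()].
modules = [
    ("layers.0.self_attn.q_proj", "WrappedProjection"),
    ("layers.0.self_attn.q_proj.linear", "Linear"),
    ("layers.0.self_attn.k_proj", "WrappedProjection"),
    ("layers.0.self_attn.k_proj.linear", "Linear"),
    ("layers.0.mlp.up_proj", "Linear"),
]

print(unmatched_targets(modules, ["q_proj", "k_proj", "up_proj"]))
print(unmatched_targets(modules, ["q_proj.linear", "k_proj.linear", "up_proj"]))
```

In this toy tree the usual names miss the attention projections while the MLP name still matches, which is the kind of partial match that lets a run proceed without adapting what you intended.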
This is the worst kind of failure. Silent. Plausible. The loss curve improves, eval scores climb a little. The adapter learned almost nothing about attention patterns, and you don’t find out until production behavior tells you so.
I caught this by checking the trainable parameter count before training started. A rank-16 adapter over seven projection layers (the four attention projections plus the MLP projections) in a 31B model should produce roughly 80 to 100 million trainable parameters. If the count is dramatically lower, the target modules are wrong. That check takes 30 seconds and saves hours.
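A minimal sketch of that check, assuming you estimate the expected count from layer shapes and compare it with what the training framework reports. The shapes and layer count below are placeholders chosen for illustration, not Gemma 4’s real dimensions; read the real ones from the model config.

```python
def expected_lora_params(shapes: dict[str, tuple[int, int]], rank: int, num_layers: int) -> int:
    """Each adapted (d_in, d_out) projection adds rank * (d_in + d_out) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


def check_trainable(actual: int, expected: int, tolerance: float = 0.5) -> None:
    """Raise if the actual trainable count is far below the estimate."""
    if actual < expected * tolerance:
        raise RuntimeError(
            f"Only {actual:,} trainable parameters, expected about {expected:,}. "
            "Target modules are probably wrong."
        )


# Hypothetical shapes for seven projections in one transformer layer.
shapes = {
    "q_proj": (5120, 4096),
    "k_proj": (5120, 1024),
    "v_proj": (5120, 1024),
    "o_proj": (4096, 5120),
    "gate_proj": (5120, 20480),
    "up_proj": (5120, 20480),
    "down_proj": (20480, 5120),
}
expected = expected_lora_params(shapes, rank=16, num_layers=52)
print(f"expected trainable parameters: {expected:,}")

# With a PyTorch model, the actual count could be computed as
# sum(p.numel() for p in model.parameters() if p.requires_grad).
try:
    check_trainable(actual=5_000_000, expected=expected)  # made-up bad run
except RuntimeError as err:
    print(err)
```

Failing loudly before the first training step is the point: a wrong count costs seconds here and hours later.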
The Gemma 4 wrapping isn’t in any official fine-tuning guide. You find it by reading the model source, noticing the weights aren’t where you expected and adjusting. Or you find it the way I almost did: by training a model that looked fine and behaved unchanged. I won’t make that mistake twice. Trainable parameter count is now the first check in every training script I write.
What behavioral training data looks like
Knowledge training data is text. Documents, articles, conversations. The goal is changing what the model knows.
Behavioral training data is examples of correct behavior: inputs paired with outputs that correctly execute the role. The goal is changing how the model responds to the context signals of its role, not what it knows.
For the foreman, correct behavior looks like: receive an objective, break it into scoped tasks with clear boundaries, wait for results before drawing conclusions. The training examples demonstrate that pattern consistently. The model learns the shape of a correct foreman response, not new information.
The worker’s version is simpler in scope but harder to lock down. Take the task. Use the specified tools. Report exactly what the tool returned, not what the output implies. Flag anything outside scope. The training data needs enough variety that “use the tool” doesn’t quietly become “use the tool and interpret the result.”
The harder behavioral constraints to reinforce are the negative ones. Don’t do the other role’s job. Don’t fabricate. Don’t interpret beyond the evidence. A foreman trained only on good delegation examples will still do the work itself when the objective looks small enough. A foreman trained on examples that show it declining to do the work, even when it’s tempting, learns the boundary.
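To illustrate, here is a hypothetical shape for worker training examples, with a crude filter that rejects targets which fail to quote the tool output. The schema, field names and sample records are invented for this sketch and are not the actual dataset format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerExample:
    task: str
    tool_output: str  # exactly what the tool returned, errors included
    report: str       # target completion the worker should learn to produce


def report_is_grounded(ex: WorkerExample) -> bool:
    """True when the target report quotes the tool output verbatim."""
    return ex.tool_output in ex.report


candidates = [
    # Negative-constraint example: the tool failed, and the correct behavior
    # is to say so rather than invent a result.
    WorkerExample(
        task="Check whether port 8080 is open on host-a",
        tool_output="ERROR: connection timed out",
        report="Tool returned: ERROR: connection timed out. No result to report.",
    ),
    # A target that interprets instead of reporting; it should be filtered out.
    WorkerExample(
        task="Check whether port 8080 is open on host-a",
        tool_output="ERROR: connection timed out",
        report="Port 8080 appears to be closed.",
    ),
]

dataset = [ex for ex in candidates if report_is_grounded(ex)]
print(f"kept {len(dataset)} of {len(candidates)} examples")
```

A substring check is only a first pass, but it catches the most damaging case: a training target that teaches the worker to paraphrase or guess.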
The overfitting trap
Behavioral fine-tuning has a specific failure mode: the model learns the format of your training examples instead of the behavior.
If every foreman training example uses the same planning structure, the model learns to produce that structure and forces it even when it doesn’t fit. The output looks right. The behavior is brittle.
The symptom: strong performance on eval examples that resemble training, weak performance on novel inputs. The fix: vary the surface form while holding the behavioral pattern constant. The behavior should generalize. The phrasing shouldn’t.
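One way to vary surface form is to render the same behavioral content through several templates. The sketch below is illustrative only: the tasks and templates are made up, and a real dataset would need far more variety than three layouts.

```python
import random

# Hypothetical delegated tasks for one foreman example.
TASKS = ["collect the error logs", "summarize the failing tests"]

# Different surface forms, same behavior: scope the tasks, then wait.
TEMPLATES = [
    "Plan:\n{numbered}\nI will wait for results before concluding.",
    "Delegating {count} tasks to the worker: {inline}. Conclusions follow once results are in.",
    "{bulleted}\n(Holding any verdict until the worker reports back.)",
]


def render(tasks: list[str], template: str) -> str:
    """Fill one template; unused placeholders are simply ignored by format()."""
    return template.format(
        numbered="\n".join(f"{i}. {t}" for i, t in enumerate(tasks, 1)),
        bulleted="\n".join(f"- {t}" for t in tasks),
        inline="; ".join(tasks),
        count=len(tasks),
    )


def make_variants(tasks: list[str], n: int, seed: int = 0) -> list[str]:
    """Cycle through the templates in a seeded random order."""
    rng = random.Random(seed)
    order = rng.sample(TEMPLATES, len(TEMPLATES))
    return [render(tasks, order[i % len(order)]) for i in range(n)]


variants = make_variants(TASKS, n=3)
for v in variants:
    print(v, end="\n\n")
```

Every variant carries the same tasks and the same commitment to wait, so the only thing left for the model to latch onto across examples is the behavior.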
Three epochs (three full passes through the training data) is conservative for this reason. Validation loss is the primary metric; if it diverges from training loss after the first pass, stop early.
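A minimal sketch of that stopping rule, assuming losses are logged at regular evaluation points. The loss values are invented, and the patience of two evaluations is an assumption rather than a recommendation.

```python
from typing import Optional, Sequence


def should_stop_at(
    train_loss: Sequence[float],
    val_loss: Sequence[float],
    patience: int = 2,
) -> Optional[int]:
    """Return the index of the evaluation at which to stop, or None.

    Counts consecutive evaluations where validation loss fails to beat its
    best value while training loss is still falling. A new best, or a stall
    in training loss, resets the count.
    """
    best_val = float("inf")
    bad_evals = 0
    for i, (tr, va) in enumerate(zip(train_loss, val_loss)):
        if va < best_val:
            best_val = va
            bad_evals = 0
        else:
            train_still_falling = i > 0 and tr < train_loss[i - 1]
            bad_evals = bad_evals + 1 if train_still_falling else 0
        if bad_evals >= patience:
            return i
    return None


# Made-up curves: training keeps improving, validation turns around.
train = [1.20, 0.90, 0.70, 0.55, 0.45]
val = [1.25, 1.00, 1.02, 1.08, 1.15]

print("stop at evaluation:", should_stop_at(train, val))
```

When this fires, the checkpoint worth keeping is the one from the best validation loss, not the last one written.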
The model doesn’t learn what you intend. It learns what the training data rewards. Those are not always the same thing.
Next week: how you actually decide whether a fine-tuned agent is ready for production, and why standard benchmarks won’t tell you.



