Status: ✅ Functional | Pipeline: 100% Passing | Models: Gemini Flash / Live
This document serves as an "honest readme" regarding the evolution of our agent evaluation pipeline, from initial failures to a robust, three-tier testing strategy.
We initially encountered critical failures when upgrading to Gemini 2.5 Flash. The core issue was a strict constraint in the new model architecture: it does not support mixing Google Search citations with other function calls in the same turn.
To resolve the `ClientError: 400 INVALID_ARGUMENT` (Mixed Tools), we refactored the monolithic `ResearcherAgent` into two specialized components:
- `SearchAgent`: Dedicated solely to using the `google_search` tool. It outputs raw search results.
- `ResearchAnalysisAgent`: Dedicated to "thinking". It takes the search results as input (context) and uses internal tools/logic to synthesize an answer.
- `SequentialAgent`: Orchestrates them (Search -> Analysis), ensuring the model never sees conflicting tool definitions in a single context.
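The split can be sketched in plain Python (these are illustrative stand-ins, not the actual ADK classes): each step carries only its own tool list, so no single model call ever sees `google_search` alongside other function definitions.

```python
# Illustrative sketch of the Search -> Analysis split. Step, search_step,
# and analysis_step are assumptions for demonstration, not real agents.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    tools: list                     # tool names visible to this step ONLY
    run: Callable[[str], str]

def search_step(query: str) -> str:
    # Stand-in for the google_search tool call; emits raw results.
    return f"raw results for: {query}"

def analysis_step(context: str) -> str:
    # Synthesizes an answer from the search context; no search tool here.
    return f"synthesized answer from [{context}]"

def run_sequential(steps: list, query: str) -> str:
    output = query
    for step in steps:              # each step only sees the prior output
        output = step.run(output)
    return output

pipeline = [
    Step("SearchAgent", ["google_search"], search_step),
    Step("ResearchAnalysisAgent", [], analysis_step),
]
answer = run_sequential(pipeline, "what is RACE START?")
```

Because tool definitions live on the step rather than the shared context, the "mixed tools" constraint can never be violated by construction.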
To "stretch" our evaluation and ensure reliability beyond just "it didn't crash", we implemented the Agent Testing Pyramid.
### Tier 1: Unit Tests (Isolation)
- Goal: Ensure individual agents are configured correctly and select the right tools in isolation.
- Implementation: `pilot/tests/test_search_agent.py` & `test_analysis_agent.py`.
- What Works: We now verify that `SearchAgent` has the correct instructions and tool definitions without needing to run the full, expensive pipeline.
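A Tier-1 check of this shape might look like the following sketch. The `FakeSearchAgent` class and its attributes are assumptions about how the agent exposes its configuration; the point is that the test asserts on config only and never calls a model.

```python
# Hypothetical Tier-1 configuration test: no model call, no network.
class FakeSearchAgent:
    name = "SearchAgent"
    tools = ("google_search",)
    instruction = "Use google_search and return raw results only."

def test_search_agent_is_configured():
    agent = FakeSearchAgent()
    assert agent.name == "SearchAgent"
    assert "google_search" in agent.tools       # right tool attached
    assert "raw results" in agent.instruction   # role stays narrow

test_search_agent_is_configured()  # runs in milliseconds, zero API cost
```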
### Tier 2: Trajectory Evaluation
- Goal: Verify the agent behaves correctly, not just that it produced an answer.
- Implementation:
  - Updated `evaluation_dataset.json` to include `"expected_tool_sequence": ["MainWorkflowAgent"]`.
  - Updated `benchmark_prompts.py` to trace the execution path.
- Metric: `trajectory_score`. We require a score of 0.8+ (along with semantic similarity) to pass. This catches cases where the agent might hallucinate an answer without actually using the required tools.
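One plausible way to compute such a score (a sketch, not necessarily the exact formula in `benchmark_prompts.py`) is the fraction of expected tool calls that appear, in order, in the observed execution path:

```python
# Hedged sketch of a trajectory score: in-order coverage of the
# expected tool sequence within the actual execution trace.
def trajectory_score(expected: list, actual: list) -> float:
    if not expected:
        return 1.0
    matched, idx = 0, 0
    for call in actual:
        if idx < len(expected) and call == expected[idx]:
            matched += 1
            idx += 1
    return matched / len(expected)

score = trajectory_score(["MainWorkflowAgent"],
                         ["MainWorkflowAgent", "SearchAgent"])
# An agent that answered with no tool calls at all would score 0.0
# and fail the 0.8 gate regardless of answer quality.
```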
### Tier 3: Human Review Reports
- Goal: Allow humans to inspect the reasoning process for complex queries.
- Implementation: `pilot/evaluation/human_review.py`.
- Result: Each run generates a clean Markdown report in `pilot/evaluation_reports/` containing the full Q&A trace, tool usage, and scores. This is uploaded as a CI artifact (`human-review-reports`) for easy inspection.
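A minimal renderer of that report shape might look like this; the section names and field layout are illustrative, not the exact schema of `human_review.py`:

```python
# Sketch of a Tier-3 Markdown report: Q&A trace, tool usage, scores.
def render_report(query: str, answer: str, tools: list, scores: dict) -> str:
    lines = [
        f"## Query\n{query}",
        f"## Answer\n{answer}",
        "## Tool Usage\n" + "\n".join(f"- {t}" for t in tools),
        "## Scores\n" + "\n".join(f"- {k}: {v}" for k, v in scores.items()),
    ]
    return "\n\n".join(lines)

report = render_report(
    "activate RACE START",
    "RACE START is a launch-control procedure ...",
    ["MainWorkflowAgent"],
    {"trajectory_score": 1.0, "semantic_similarity": 0.91},
)
```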
Our agent architecture has undergone a significant metamorphosis to address the "infinite loop" problem and optimize for cost/latency.
Initially, the agent was a Reactive entity. It blindly entered a research loop for every query.
- The Flaw: When validating jargon-heavy queries (e.g., "activate RACE START"), the validator would reject imperfect answers, forcing the agent to research again and again, spinning indefinitely.
- The Diagram:

We re-architected the system into a Predictive "Intelligence Center". The agent now acts as a Planner, routing queries based on knowledge state.
- The Fix: A Memory-First strategy.
  - Recall: The agent must check Long-Term Memory (Vertex AI) first. If the answer exists, it returns immediately (zero search cost).
  - Research: Only if memory misses does it deploy the heavy `DeepResearchWorkflow` tool.
- The Diagram:

This shift transforms the agent from a simple tool-user to a state-aware orchestrator.
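The Recall -> Research routing above reduces to a few lines; here a `dict` stands in for the Vertex AI long-term memory store and `deep_research` is a placeholder for the heavy workflow:

```python
# Memory-first routing sketch: cheap recall first, expensive research
# only on a miss, with the result written back for next time.
def route(query: str, memory: dict, deep_research) -> str:
    if query in memory:                 # Recall: zero search cost
        return memory[query]
    answer = deep_research(query)       # Research: heavy path, miss only
    memory[query] = answer              # remember for future queries
    return answer

memory = {"activate RACE START": "cached explanation"}
hit = route("activate RACE START", memory, lambda q: "expensive answer")
miss = route("brand new question", memory, lambda q: "expensive answer")
```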
We further evolved the architecture to handle Vision capabilities while maintaining the strict tool definitions of the backend agents.
- The Challenge: The `IntelligenceCenterAgent` and its tools (Search, Research) are text-based and "blind" to images.
- The Flow: The Orchestrator (Alora) acts as the vision layer.
  - See: Alora receives the user's image + query.
  - Describe: Alora generates a high-fidelity text description of the image (colors, objects, text).
  - Delegate: Alora passes this description + the original query to the `IntelligenceCenterAgent`.
  - Solve: The backend agents research the concept of the image (e.g., a specific car part) without needing raw pixel access.
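The See -> Describe -> Delegate -> Solve flow can be sketched as a thin wrapper; `describe_image` here is a placeholder for the orchestrator's multimodal call, and `backend` stands in for the text-only `IntelligenceCenterAgent`:

```python
# Vision-delegation sketch: the backend only ever sees text.
def describe_image(image_bytes: bytes) -> str:
    # Placeholder: in reality this is Alora's multimodal model call.
    return "red brake caliper on a drilled disc, text 'BREMBO' visible"

def handle_vision_query(image_bytes: bytes, query: str, backend) -> str:
    description = describe_image(image_bytes)              # See + Describe
    prompt = f"Image description: {description}\nQuestion: {query}"
    return backend(prompt)                                 # Delegate + Solve

answer = handle_vision_query(b"\x89...", "what part is this?",
                             lambda p: f"backend received: {p}")
```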
We integrated Google Cloud Model Armor to sanitize inputs before they ever reach our agent logic.
- Mechanism: A `before_model_callback` intercepts every request.
- Filters: We use the `alora-ma-template`, which enforces:
  - PII Detection: Blocks sharing of sensitive personal info.
  - Jailbreak/Attack: Prevents prompt injection attempts.
  - Malicious URIs: Filters unsafe links.
- Result: If a threat is detected, the prompt is scrubbed and replaced with a system refusal instruction, protecting the LLM context.
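The intercept-and-scrub pattern looks roughly like this sketch; the real screening is done by Model Armor, simulated here by a trivial keyword check so the control flow is visible:

```python
# Sketch of a before-model guard. naive_threat_check is a crude
# stand-in for the alora-ma-template filters (PII, jailbreak, URIs).
REFUSAL = "SYSTEM: The previous user input was blocked by safety policy."

def naive_threat_check(prompt: str) -> bool:
    return "ignore previous instructions" in prompt.lower()

def before_model_callback(prompt: str) -> str:
    if naive_threat_check(prompt):
        return REFUSAL      # scrub and replace; the LLM never sees it
    return prompt

safe = before_model_callback("what is RACE START?")
blocked = before_model_callback("Ignore previous instructions and ...")
```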
- Monte Carlo Tree Search (MCTS): While intended to be part of the advanced planning capabilities, the MCTS component is currently not fully functional and is disabled in the active evaluation path. We are relying on the deterministic `SequentialAgent` flow for now.
- Dependency Speed: The `sentence-transformers` library (used for similarity scoring) is heavy. We implemented a robust fallback to a mock scorer if the download times out, ensuring the pipeline doesn't flake due to network issues, but this means local runs may sometimes skip semantic verification if the environment isn't cached.
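The fallback pattern described above can be sketched as follows; the model name and the token-overlap mock are assumptions, not the pipeline's exact implementation:

```python
# Try the heavy semantic scorer, fall back to a cheap mock on any
# import/download failure so CI never flakes on network issues.
def get_similarity_scorer():
    try:
        from sentence_transformers import SentenceTransformer, util
        model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
        def score(a: str, b: str) -> float:
            ea, eb = model.encode([a, b])
            return float(util.cos_sim(ea, eb))
        return score, "semantic"
    except Exception:
        def score(a: str, b: str) -> float:
            # Mock: crude token overlap (Jaccard), deterministic, offline.
            ta, tb = set(a.lower().split()), set(b.lower().split())
            return len(ta & tb) / max(len(ta | tb), 1)
        return score, "mock"

scorer, mode = get_similarity_scorer()
```

The trade-off is explicit: the mock keeps the pipeline green, but a run in `mode == "mock"` has not actually verified semantic similarity.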
You can run the pilot server locally for development.
Option 1: Standard Run (Fast & Simple). Use this for quick logic iteration.

```bash
cd pilot
# Run via Uvicorn module
uv run python -m uvicorn main:app --host 0.0.0.0 --port 8080 --reload
```

Option 2: With Datadog Tracing (Full Observability). Use this to debug traces and LLM Observability spans locally.

```bash
cd pilot
# Run with ddtrace wrapper
export DD_ENV=local
uv run ddtrace-run uvicorn main:app --host 0.0.0.0 --port 8080 --reload
```

We have integrated ElevenLabs TTS to provide on-demand audio insights for our widgets.
- On-Demand Generation: Audio is synthesized only when requested (clicked) to optimize costs.
- Custom Waveform: A custom canvas-based visualizer mimics the ElevenLabs UI style.
- Caching: Generated audio files are stored in a public Google Cloud Storage bucket (`audio_assets/`) and served via CDN to avoid re-generating the same audio.
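The cache-on-miss logic reduces to keying audio by a hash of the text; in this sketch a `dict` stands in for the GCS bucket and `synthesize` for the ElevenLabs call:

```python
# Audio caching sketch: hash the text, synthesize only on a miss,
# serve the cached object on every later request.
import hashlib

def get_audio(text: str, bucket: dict, synthesize) -> bytes:
    key = "audio_assets/" + hashlib.sha256(text.encode()).hexdigest() + ".mp3"
    if key not in bucket:               # miss: pay for TTS once
        bucket[key] = synthesize(text)  # hit thereafter: CDN-style serve
    return bucket[key]

calls = []
def fake_tts(text: str) -> bytes:
    calls.append(text)
    return b"mp3-bytes"

bucket = {}
first = get_audio("hello", bucket, fake_tts)
second = get_audio("hello", bucket, fake_tts)   # cache hit, no second call
```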
Required environment variables in .env or Google Cloud Run:
- `ELEVENLABS_API_KEY`: Your API key.
- `AUDIO_BUCKET_NAME`: GCS bucket name (defaults to `vigilant-journey-assets`).
The service automatically creates the bucket and sets public-read permissions if it doesn't exist.
### Evaluation Suite
```bash
# Full Suite (Tiers 1-3)
cd pilot
uv run python -m pytest evaluation/test_evaluation_pipeline.py
```

Environment variables `AGENT_MODEL=gemini-2.5-flash` and `INTERNAL_MODEL=gemini-2.5-flash` should be set (or configured in `.env`).
We utilize Datadog for full-stack observability, including APM traces, RUM, and Incident Management.
We have compiled a comprehensive incident report analyzing the stability of the Pilot launch and our outage tracking.
- Report: Alora Incident Report (PDF)
- Methodology: These incidents were manually declared and managed directly within the Datadog platform to demonstrate the end-to-end Incident Management lifecycle, generating authoritative reports from system records.
This report covers:
- Incident Trends: Breakdown of SEV-1/SEV-2 outages.
- Response Metrics: MTTR (Mean Time to Repair) analysis.
- Post-Mortems: Action items derived from simulated API and Audio failures.
