Inspiration
The inspiration for this project came from a significant, largely overlooked challenge in AI: when a complex, multi-agent system fails, it's incredibly difficult to figure out who failed and when.
This process, called failure attribution, is crucial for debugging, but it's typically a manual task. It requires significant labor, deep domain expertise, and is very time-consuming.
This project was inspired by the idea of solving this problem by harnessing the comprehensive judgment capabilities of LLMs. The goal was to create a system for automated failure attribution, bridging the substantial gap between evaluation results and failure attribution, which currently relies heavily on manual labor.
What it does
In this project, we first developed an Agent that uses an "inner monologue", producing a step-by-step trace of its thoughts and actions. We then built a second agent, the Judge, which searches this trace for the thoughts and actions most relevant to a user-given query about the Agent's output and behaviour. For example, the user may tell the Agent "I'm organising a party for the 19th of November. I expect 50 people will come." The Agent will then book a room matching those requirements and output its conclusion.
The Judge is then asked about the Agent's output. For example, the Agent's programmer may ask "Has the AI considered whether food is allowed?"
The purpose of the Judge is to look through the Agent's trace and check whether the Agent considered the room rules. For example, the Judge could highlight the part of the trace where the Agent called `check_room_rules`, saw that the rules mentioned "no food allowed", and still booked the room! The programmer could then use this insight to add a tool, change the system prompt, and so on.
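To make this concrete, here is a hypothetical sketch of what such a trace could look like and how a step mentioning the rules might be surfaced. The step schema, field names, and `steps_mentioning` helper are all illustrative, not the project's actual format:

```python
# Hypothetical shape of an Agent trace; the schema is illustrative only.
trace = [
    {"type": "thought", "content": "Need a room for 50 people on Nov 19."},
    {"type": "tool_call", "tool": "check_calendar", "args": {"date": "2025-11-19"}},
    {"type": "tool_result", "tool": "check_calendar", "content": "Room A free."},
    {"type": "tool_call", "tool": "check_room_rules", "args": {"room": "Room A"}},
    {"type": "tool_result", "tool": "check_room_rules", "content": "No food allowed."},
    {"type": "tool_call", "tool": "final_answer", "args": {"booking": "Room A"}},
]

def steps_mentioning(trace, keyword):
    """Return the indices of steps whose content mentions the keyword."""
    return [i for i, step in enumerate(trace)
            if keyword.lower() in step.get("content", "").lower()]

# The Judge could surface step 4: the rules said "no food allowed",
# yet the Agent booked the room anyway.
print(steps_mentioning(trace, "food"))  # → [4]
```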
Naturally, to leave room for error detection and output explanation, the Agent is deliberately not very good at its job, and only a few tools have been coded for it. The focus of this project is the traceability of the Agent and the fault-detection abilities of the Judge.
You can try it out by following the instructions on the GitHub repository!
How We Built It
As mentioned, this project was built in two major parts: an Agent that does the work and a Judge system that analyzes its failures.
The Agent:
- We built a helpful Town Hall agent using LangGraph's `create_react_agent` function.
- The agent is powered by the `claude-3-5-sonnet` model.
- We gave it a specific set of tools (its components) to interact with its environment: `check_calendar`, `check_room_rules`, `assign_task`, and `final_answer`.
- To capture its failure logs, we integrated the Langfuse client for tracing and observation.
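The Agent's wiring can be sketched roughly as follows. The tool bodies here are illustrative stand-ins (the project's real logic is in the GitHub repo), and the commented-out lines assume `langgraph`, `langfuse`, and an Anthropic API key are available:

```python
# Stand-in tool implementations; the real project's tool logic differs.
def check_calendar(day: str) -> str:
    """Check whether any room is free on the given ISO date."""
    return f"Room A is free on {day}."  # pretend Room A is always free

def check_room_rules(room: str) -> str:
    """Return the usage rules for a room."""
    rules = {"Room A": "No food allowed. Max capacity 60."}
    return rules.get(room, "No rules on file.")

def assign_task(task: str, assignee: str) -> str:
    """Assign a follow-up task to a staff member."""
    return f"Task '{task}' assigned to {assignee}."

def final_answer(answer: str) -> str:
    """Emit the Agent's final conclusion."""
    return answer

# The ReAct loop itself (requires langgraph, langfuse, and an API key):
# from langgraph.prebuilt import create_react_agent
# agent = create_react_agent(
#     "anthropic:claude-3-5-sonnet-latest",
#     tools=[check_calendar, check_room_rules, assign_task, final_answer],
# )
# agent.invoke({"messages": [("user", "Book a room for 50 on Nov 19")]})

print(check_room_rules("Room A"))
```

In a real run, the Langfuse callback handler would be passed in the `invoke` config so every thought and tool call lands in the trace.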
The Judge (Hybrid Failure Analysis):
- This is the core of the project. We implemented the Hybrid Method described in the research paper. This method combines the All-at-Once and Step-by-Step analysis methods to get the best results.
- We used `claude-3-5-sonnet` as the `judge_llm` for this analysis.
The two-step hybrid process works like this:
Step 1 (All-at-Once): The `find_responsible_component` function looks at the entire log. It identifies who failed, outputting the name of the responsible component (e.g., `Orchestrator` for the AI's reasoning, or `check_room_rules` for a tool).

Step 2 (Step-by-Step): The `find_decisive_error_step` function takes the component name from Step 1. It then scans only the messages from that component and asks the judge Yes/No whether the current step is the single decisive error. The moment the judge says Yes, the search stops.
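The two steps above can be sketched as follows. Here `ask_judge` stands in for a call to `claude-3-5-sonnet`: it is any callable taking a prompt and returning text, so the sketch runs with a stub; the prompts and step schema are illustrative, not the project's exact ones:

```python
def find_responsible_component(ask_judge, log: str) -> str:
    """All-at-once: show the judge the whole log, require a reason,
    and take the component named on the last line of the reply."""
    reply = ask_judge(
        "Read this full failure log. Explain your reasoning, then on the "
        "last line name the single responsible component.\n\n" + log
    )
    return reply.strip().splitlines()[-1]

def find_decisive_error_step(ask_judge, steps, component: str):
    """Step-by-step: scan only that component's steps; stop at the first
    step the judge marks Yes as the decisive error."""
    for i, step in enumerate(steps):
        if step["component"] != component:
            continue
        reply = ask_judge(
            f"Component: {component}\nStep: {step['content']}\n"
            "Give a reason, then answer Yes or No: is this the decisive error?"
        )
        if reply.strip().lower().endswith("yes"):
            return i
    return None

# Usage with a stub judge (a real run would wrap an LLM call):
steps = [
    {"component": "Orchestrator", "content": "Saw 'no food allowed'."},
    {"component": "Orchestrator", "content": "Booked the room anyway."},
]
stub = lambda prompt: (
    "The booking ignored the room rules.\nOrchestrator"
    if "full failure log" in prompt
    else ("Reason given. Yes" if "anyway" in prompt else "Reason given. No")
)
comp = find_responsible_component(stub, "...log text...")
print(comp, find_decisive_error_step(stub, steps, comp))  # → Orchestrator 1
```

Narrowing the search to one component's messages is what makes the Yes/No scan tractable on long logs.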
Challenges Faced
The research paper clearly outlines how difficult this task is, which presented several challenges:
- The Who vs. When Trade-off: The All-at-Once method (analyzing the whole log) is good at finding the responsible agent because it sees the broadest failure-log context. However, it is by far the worst at finding the exact step, even performing below random.
- The Needle in the Haystack Problem: This weakness stems from the needle-in-the-haystack problem, where LLMs struggle to retrieve specific information from long contexts.
- The Step-by-Step Weakness: Conversely, the Step-by-Step method is good at finding the step but is less accurate at finding the agent, because its final decision can be made with incomplete information.
- Context Length: A universal challenge is that as failure logs grow longer, the performance of all analysis methods declines. Step-level accuracy is especially sensitive to this.
What We Learned
The most important takeaway was how to overcome these challenges.
- The Hybrid Method is the Solution: We learned that the Hybrid Method is the best approach because it leverages the advantages of both methods.
- A Best of Both Worlds Process: The hybrid method (which we implemented) first uses the All-at-Once method's strength to find the agent. It then uses that agent's name to narrow the range of possible failure steps. This significantly reduces the difficulty of prediction for the Step-by-Step method, allowing it to accurately find the step.
- Reasoning is Critical: We learned that how you prompt the judge matters. The paper shows that explicitly requiring the LLM to provide a reason for its judgment greatly boosts its performance; removing these reasoning prompts causes a significant drop.
- There is a Trade-off: Finally, we learned that this superior accuracy comes at a cost. The hybrid method incurs higher computational costs because it requires running two algorithms sequentially.
Built With
- langfuse
- langsmith
- python