Summary
Agentic Contract Framework (ACF) monitors AI agents based on behavioral contracts, not just infrastructure metrics. Features 3 verifier types (semantic, deterministic, natural language inference) and contract coverage to find monitoring gaps. Demo: Morph (AI tutor using Gemini) with 7 Datadog monitors tracking pedagogical alignment. When agents violate commitments, Datadog alerts trigger with Runbooks for remediation.
Key Deliverables and Where to Find Them
The references to files are all within this repo: github.com/kavishsathia/usemorph
- Datadog Dashboard: Screenshot at /datadog/evidence/dashboard.png, JSON configuration at /datadog/dashboard.json
- Actionable Records: Screenshot at /datadog/evidence/emailalert2.png, Runbook at https://github.com/kavishsathia/usemorph/wiki
- Application using Gemini and Datadog: Repo at github.com/kavishsathia/usemorph, Hosted at usemorph.ai or usemorph.vercel.app (fallback)
- MIT License found in LICENSE
- Hosting Instructions in README.md
- Name of Datadog Organisation: Kavish (also in README.md)
- Traffic Generator at /python/traffic_generator.py, with configuration details in README.md (towards the end)
- Datadog Integration Code: /python/main.py Line 63
- 7 Detection Rules (Monitors): Screenshot at /datadog/monitor.png, JSON configuration at /datadog/monitors/*.json
- 3 SLOs: Screenshot at /datadog/slo.png, JSON configuration at /datadog/slos/*.json
- Rationale behind detection rules and the use case or incident at /datadog/OBSERVABILITY.md
- Evidence of working dashboard, monitors, SLOs and alerts shown at /datadog/evidence/*.png
- Video at https://www.youtube.com/watch?v=nvUgCIXq7s0
Inspiration
Typical deterministic software relies on tools like unit tests to ensure the code functions correctly. However, as we become more reliant on AI agents to do our work, we will need smarter and more efficient means of monitoring them and verifying that their output is correct. To solve this, I came up with the Agentic Contract Framework (ACF), a strategy for monitoring agents based on domain-specific behaviour.
In this submission, I use ACF to monitor agentic executions in an AI tutoring application called Morph (also built for this hackathon). As AI agents take on more sensitive tasks in healthcare, finance and learning, we need a way to monitor their execution based on deep domain knowledge. In my demo, I show how a contract-based verifier can observe whether the agent meets pedagogical best practices, adapts to the learning pace, and so on.
What it does
The Premise
It works by having the developer predefine a contract with multiple commitments. A contract is like a test suite and a commitment is like a test case; each commitment has a verifier (which can be deterministic or semantic). At the end of the agent's execution, the developer verifies the execution against the contract.
When I was developing this, I realised I could build a clean, framework-agnostic abstraction around it, so I turned it into a pip-installable library called Sworn: https://github.com/kavishsathia/sworn. To make it clear how a contract is defined, here is a concrete example in Python.
observer = DatadogObservability()
contract = Contract(
    observer=observer,
    commitments=[
        Commitment(
            name="no_harmful_content",
            terms="The agent must not produce harmful or offensive content",
            verifier=no_harmful_content,  # this is defined by the dev
            semantic_sampling_rate=0.5,  # only runs evals on half the executions
        )
    ],
)
The terms here are simple and not domain-specific, for brevity. In real-world usage, a contract can double as both a structured, domain-specific system prompt and the set of expectations for the verifier.
The Strategy
We want an end-to-end observability strategy, so it will serve us well to define the "ends" first. In an agent's execution, we start with an expectation. The expectation drives everything else, including the prompt we give to our agents. We must capture this expectation to have a better understanding of the agent's execution. The other end is when we check whether our expectation was satisfied, by examining the misalignments. Then we figure out how to adjust our expectations in the future (by writing better prompts, by lowering expectations, etc). It's very much a debugging process.

My proposal here is that the strategy should be decomposed into 4 parts:
- How to set expectations?
- How to evaluate expectation against reality?
- How to aggregate and find "gaps" in our expectations?
- How do we adjust our expectations?
How to set expectations?
We can start with a candidate set of expectations (we don't know if the agent can fulfil them yet), and as we go through the cycle of adjusting these expectations, we will find that the agent gains better alignment with us. For the first iteration, we can simply define a contract with commitments we think should be fulfilled.
How to evaluate expectation against reality?
This is where the verifier comes in. We could use an LLM-as-a-judge approach to use another LLM to evaluate the execution. But as we make our expectations more bounded and well-defined, we could potentially cut back on LLM usage and deterministically check whether the expectation was met.
For instance, in my pedagogy-based application, one of my commitments was "Do not bombard the user with information". It seemed pretty obvious to me what that meant, but the agent kept insisting on creating simulations (one of the app's features is letting agents create simulations) to illustrate its point. When I noticed this, I simply changed the commitment to allow at most 1 point, 2 elaborations, 1 example and 1 simulation per message. Now I can deterministically check whether the expectation was met.
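A bounded commitment like that needs no LLM at all. Here is a minimal sketch of such a deterministic verifier; the message structure and field names are assumptions for illustration, not Morph's actual format:

```python
# Hypothetical deterministic verifier for "at most 1 point, 2 elaborations,
# 1 example and 1 simulation per message". Block types are assumed names.
LIMITS = {"point": 1, "elaboration": 2, "example": 1, "simulation": 1}

def verify_message_density(message_blocks):
    """message_blocks: list of dicts like {"type": "point", ...}."""
    counts = {}
    for block in message_blocks:
        kind = block.get("type")
        counts[kind] = counts.get(kind, 0) + 1
    violations = [
        f"{kind}: {n} > {LIMITS[kind]}"
        for kind, n in counts.items()
        if kind in LIMITS and n > LIMITS[kind]
    ]
    return (len(violations) == 0, violations)
```

Because the check is pure code, it costs nothing per execution and can run on every message rather than a sample.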
I also created a few more substrategies for anyone beyond this hackathon to run evaluations with. Beyond the LLM-based verifier, there is a Natural Language Inference verifier that can detect contradictions for simpler tasks (it runs on CPU, and it's fast and cheap), and there is a pattern-based verifier, introducing a regex-like approach to ensure a pattern exists (or doesn't) within the agent's tool-call sequence (inspired by Manna and Pnueli's work on safety and liveness guarantees).
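To make the pattern-based idea concrete, here is a small sketch that encodes a tool-call sequence as a string and checks a safety-style property with a regex. The tool names and one-letter encodings are assumptions, not Sworn's actual API:

```python
import re

# Hypothetical pattern-based verifier: map each tool call to a one-letter
# symbol, then check properties over the resulting trace with a regex.
SYMBOLS = {"create_simulation": "c", "close_simulation": "x", "send_message": "m"}

def encode(tool_calls):
    return "".join(SYMBOLS.get(name, "?") for name in tool_calls)

def no_orphan_simulations(tool_calls):
    """Every create_simulation must eventually be followed by a close_simulation."""
    trace = encode(tool_calls)
    # A 'c' with no 'x' anywhere after it violates the property.
    return re.search(r"c[^x]*$", trace) is None
```

The same machinery can express liveness-flavoured checks ("eventually X happens") as well as safety-flavoured ones ("Y never happens"), which is where the Manna and Pnueli inspiration comes in.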
How to aggregate and find "gaps" in our expectations?
Once we're done verifying, we submit the trace and the evaluation to Datadog. Datadog is GOATED for allowing custom evaluations to be sent (something I've not really seen on other observability platforms). Here is where our monitors and SLOs come in, and I'll be briefly describing the ones I've configured here because it plays directly into the structure of the contract.
I implemented 7 detection rules on Datadog. The evaluation-based monitors trigger at a 70% pass rate (critical) and an 80% pass rate (warning):
1. Socratic Questioning Verification: Alerts when the AI provides direct answers instead of using guiding questions. Ensures core pedagogical principle adherence.
2. Challenge Level Verification: Monitors whether challenge difficulty appropriately matches student knowledge level. Prevents content that is too easy or too difficult.
3. Goal Commitment Verification: Ensures the AI guides students toward learning goals through active discovery rather than passive instruction.
4. Hint Frequency Verification: Tracks whether hints are provided at the right frequency - after reasonable student attempts, not immediately or too late.
5. Pacing Verification: Monitors conversation pacing to ensure balanced explanations and student input without overwhelming or under-engaging students.
6. High Response Latency: Alerts when response times exceed 15 seconds (warning) or 20 seconds (critical), indicating potential infrastructure or API issues.
7. Error Rate: Alerts when the error rate exceeds 5% (warning) or 10% (critical), indicating potential infrastructure or API issues.
Each of the first five is a commitment in the contract itself; Datadog calculates their failure rates and triggers alerts based on them.
Next, we can find gaps in this process. There are two main gaps here:
- The gap between the execution and the contract (this is the obvious one). Verifiers check for this and report it to Datadog.
- The gap between the contract and your expectation. In your mind, you have a model developed on how the agent should execute, but translating this model to a contract is no easy feat.
ACF attempts to detect both of these gaps. The first gap is detected mainly through Datadog and the pass rates of the different commitments. For the second gap, I developed a concept called contract coverage to identify which parts of the agent's execution were "covered" by a commitment. The parts that are not covered are the parts you haven't touched on in your contract. The coverage is also submitted as an evaluation to Datadog.
To illustrate: when I wrote my contract for Morph (the demo), I missed covering how the agent should use the tool that closes simulations. When I used it, it never closed simulations. On the frontend, iframes stacked up and caused a really bad user experience. That's what drove me to develop this concept, so I could catch the fact that I hadn't "covered" a rule. In my mind, I knew the ideal experience was for the agent to close simulations that are no longer needed, but for whatever reason, I didn't encode it in the contract.

How do we adjust our expectations?
Adjusting our expectations is equivalent to mutating the contract. There are a few ways to do this:
- Add a new commitment when you realise you haven't enforced a rule
- Make a commitment (and its verifier) stricter when you realize it's too lax
- Relax a commitment (and its verifier) when you realize it's way too strict
The first is discovered through contract coverage, the second when agents can take shortcuts to reach the goal, and the third when the agent constantly fails to meet the mark.
The Crystallized Strategy
Now, it should be clear that getting an agent to work is very much an iterative process of adjusting your expectations, and hence your contract, in a way that is accurate, clear and easy to follow.
What's the difference from LLM-as-a-judge?
LLM-as-a-judge is good because it abstracts away evaluation from the developer. ACF on the other hand, requires the developer to define their own evaluations in their own environment so that they can be imbued with the same deep domain knowledge that the agents themselves have. LLM-as-a-judge is good for situations where domain knowledge isn't needed (detecting hallucinations etc), ACF is meant for situations where deep domain knowledge is required.
How we built it
I started off by building Morph, and in the middle of it all I decided to switch from the Vercel AI SDK to Google ADK. I hit a massive roadblock because ACF was so closely coupled to the Vercel AI SDK and its callbacks. Since I was about to switch away anyway, I took the opportunity to design a framework-agnostic adaptation of ACF called Sworn.
I will first explain how Sworn was built, and then I'll explain Morph.
How I built Sworn
The Abstraction
It took me a few days to find the right level of abstraction. How do you intercept agent tool calls without plugging into framework level callbacks (like Google ADK's after_tool_callback)? It seems obvious in hindsight, but the solution was to wrap the tools given to the agents with a decorator that keeps track of the tool calls.
This approach has a few merits. It gives the library really clear boundaries: it is used only before and after the agent runs, not while it is running. It also works not only for LLM-based agents but for any agent whose actuators and sensors are Python functions (robotics, simple reflex agents, even hardcoded function calls).
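A minimal sketch of the wrapping idea (not Sworn's actual API): a decorator records every call to the wrapped tool, so no framework-level callback like ADK's after_tool_callback is needed.

```python
import functools

def observed(tool, log):
    """Wrap a tool function so every invocation is recorded in `log`."""
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        result = tool(*args, **kwargs)
        log.append({"tool": tool.__name__, "args": args,
                    "kwargs": kwargs, "result": result})
        return result
    return wrapper

calls = []

def create_simulation(topic: str) -> str:
    # Stand-in for a real agent tool.
    return f"simulation:{topic}"

# The agent receives the wrapped tool; it behaves identically from its side.
create_simulation = observed(create_simulation, calls)
```

Calling `create_simulation("gravity")` returns the normal result while appending one record to `calls`, which the contract can later verify against.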
The Execution Context
A single contract can have many executions, and they can run at the same time. How then do we make sure the contract decorators are appending tool calls to the right execution? The solution I came up with was a ContextVar that ties the execution to the current context; the contract decorators add tool calls to the execution accessible through that ContextVar.
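The idea can be sketched with the standard library's contextvars module (this is illustrative, not Sworn's internals): each concurrent execution sees its own call log through the same ContextVar.

```python
import contextvars

# One ContextVar shared by all decorators; each context gets its own value.
current_execution = contextvars.ContextVar("current_execution")

def record_tool_call(name):
    # What a tool decorator would do: append to the *current* execution's log.
    current_execution.get().append(name)

def run_execution(tool_names):
    """Run a simulated execution in its own context with a fresh log."""
    ctx = contextvars.copy_context()
    def body():
        current_execution.set([])  # fresh log, isolated to this context
        for name in tool_names:
            record_tool_call(name)
        return current_execution.get()
    return ctx.run(body)
```

Because each execution runs in its own copied context, concurrent executions never write into each other's logs.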
The Cost of Verifying
This is probably the single biggest problem for the adoption of any LLM based verification. Why pay twice the price? I adopted a few strategies to help reduce costs.
The first is to not rely solely on the semantic verifier; some commitments really can be verified with plain code (deterministically). A semantic verifier should usually come in only when we are dealing with language (like my demo's Socratic tutor). I also adopted Natural Language Inference so that people don't have to use LLMs for a simple text-entailment task.
The second strategy is a sampling rate. Usually we don't want to verify every single agent execution (that is resource intensive), so we pick a random sample and verify those instead. The user sets this at the commitment level, so each commitment only runs on {sampling_rate} of the total executions.
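The sampling decision itself is one line; this sketch (names are my own, not Sworn's) shows the per-execution coin flip, with an injectable random source so it can be tested deterministically:

```python
import random

def should_verify(sampling_rate: float, rng=random.random) -> bool:
    """Decide whether this execution pays the semantic-verification cost.

    sampling_rate of 1.0 verifies everything, 0.0 verifies nothing, and
    values in between verify roughly that fraction of executions.
    """
    return rng() < sampling_rate
```

A commitment with `semantic_sampling_rate=0.5`, as in the earlier contract example, would therefore run its eval on about half the executions.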
The third strategy is leveraging prompt caching. Typically in verification, you'd start the instruction to the verifier agent with the commitment terms, then the agent's execution, and then ask it to verify. However, since multiple verifiers for the same execution (with different commitment terms) run at around the same time, we can introduce an optimisation: move the execution details before the commitment terms. That way, all the LLM calls in that short window share exactly the same prefix, and we save resources because the prefix is cached.
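The ordering trick can be shown with a tiny prompt builder (the prompt wording is illustrative): with the trace first, every verifier prompt for the same execution shares one long, cacheable prefix, and only the commitment terms vary at the end.

```python
def build_verifier_prompt(execution_trace: str, commitment_terms: str) -> str:
    return (
        "Here is the agent's execution:\n"
        f"{execution_trace}\n\n"             # shared, cacheable prefix
        "Verify it against this commitment:\n"
        f"{commitment_terms}"                # the only varying suffix
    )

p1 = build_verifier_prompt("trace...", "Use Socratic questioning")
p2 = build_verifier_prompt("trace...", "Match the student's pace")
# p1 and p2 are identical up to the commitment terms.
```

Had the commitment terms come first, the prompts would diverge at the very first token and provider-side prefix caching would never kick in.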
Contract Coverage
As I explained earlier, this is something I didn't actually have on the roadmap; I created it only to satisfy my need to find the "complement" of my contract: out of all possible behaviours, what has my contract not enforced yet?
For the time being, the mechanism is to have verifiers self-report which interactions they cover; the complement of the union of all verifiers' covers is the uncovered part. A percentage is reported to Datadog along with the names of the interactions that are not covered.
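That mechanism is essentially set arithmetic, sketched below (function and variable names are my own, not Sworn's):

```python
def contract_coverage(all_interactions, covers_by_verifier):
    """Coverage percentage plus the interactions no verifier covers.

    all_interactions: iterable of interaction names in the execution.
    covers_by_verifier: list of sets, one per verifier's self-reported cover.
    """
    universe = set(all_interactions)
    covered = set().union(*covers_by_verifier) if covers_by_verifier else set()
    uncovered = universe - covered
    pct = 100.0 * (len(universe) - len(uncovered)) / max(len(universe), 1)
    return pct, sorted(uncovered)
```

In the Morph incident described above, the close-simulations tool call would have shown up in the uncovered list, flagging the missing commitment before the iframes piled up.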
How I built Morph
A good framework needs a good demo. I needed a demo where success is not just the absence of tool-call or rate-limit failures: it's deeply domain-specific, and there is a real challenge for the agent to overcome (and failure to do so is caught).
Morph's Features
Morph allows both a conversational and an illustrative way of learning. The agent can teach the student by creating a simulation on the side, where students can explore topics like "gravity". These are like windows in an operating system, which the agent controls using tool calls. This imposes a sufficient challenge for agents: they cannot see what the student is seeing, yet they have to manage what the student is seeing. The challenge arises naturally from an asymmetry of information.
The student can also specify their prior experience, learning pace and hint frequency, giving the agent more areas to align on.
The domain specificity comes from the pedagogical best practices that are encoded within the agent's contract. Here is a snippet:
The goal is not just to provide answers, but to help students develop robust understanding through guided exploration and practice. Follow these principles. You do not need to use all of them! Use your judgement on when it makes sense to apply one of the principles.
For advanced technical questions (PhD-level, research, graduate topics with sophisticated terminology), recognize the expertise level and provide direct, technical responses without excessive pedagogical scaffolding. Skip principles 1-3 below for such queries.
1. **Use leading questions rather than direct answers.** Ask targeted questions that guide students toward understanding while providing gentle nudges when they're headed in the wrong direction. Balance between pure Socratic dialogue and direct instruction.
This acts as both the system prompt and the expectation for the semantic verifier, so the barrier to adopting this strategy is almost negligible: developers don't have to write an entirely new prompt for the verifier (the cost of verification aside).
Morph's Integration Points into Sworn and Datadog and Gemini
Morph's contract has 5 commitments. They are stated above and correspond directly to the monitors set on Datadog. When an alert is triggered on Datadog, a link to the Runbook is attached to the incident.
An example of a Runbook is here: Challenge Level Verifier Runbook
The Runbook guides the developer in finding the gaps in the contract and suggests methods to alleviate this gap. The Runbook for each commitment is specific to that commitment.
Morph also uses the Google ADK with Gemini 3.0 Flash as its model. It has access to tools to create simulations and facilitate the student's learning process.
Challenges we ran into
All the challenges came from developing the core abstractions. The implementation that followed was extremely smooth, thanks to the solid abstractions underneath.
Early Days
One challenge worth mentioning was coming up with the contract abstraction itself. In the very first iteration of this concept, I used the idea of confessions: the agent would confess its assumptions and report them to Datadog. About a week later, OpenAI published work on a similar abstraction (https://openai.com/index/how-confessions-can-keep-language-models-honest/), which validated my approach, so I started actually testing it.
Then I realised that even when agents don't want to lie, they get so stuck in their own incentives that they successfully convince themselves they are correct. It's like putting a delusional criminal through a lie-detector test. Even my tests with models as good as Sonnet 4.5 didn't produce reliable results. My own coding agents would claim to have perfect sandboxing for Morph when they had simply used innerHTML as the "sandbox". When asked, they'd insist it was completely perfect. That's not right.
So, I started modelling how the real world works, notably how freelancers accept a set of expectations and execute on them. For verification, I understood that incentive was a huge problem, so we needed to flip the verifier's incentive relative to the implementor's. While the implementor wants to defend their decisions, the verifier should critique and break them down.
So, I came up with the Agentic Contract Framework, which allows a separate LLM with a different incentive to check agent outputs.
The Implementation
A few challenges came up here too, again tied to the fundamental nature of transformer models. In my initial concept, a contract was just a free-form string with no structure at all. The verifier would read it and try to find as many violations as possible.
This is akin to having a single unit test for your entire application, except it's not a unit anymore because it tries to test literally everything. The verifier would catch violations, but it was sometimes nitpicky about one very specific part of the contract: instead of weighing all commitments, it fixated on the hyperspecific one (counting, specific words, etc.).
Deriving from my unit-test analogy, I rewired the abstraction so that the contract is a test suite and the commitments are the unit tests. This allows more fine-grained knowledge of what failed and lets every commitment actually be verified. It does increase the cost, but it is the more natural design: each commitment can now get a different verifier, with the verifier type chosen by domain-specific needs.
Accomplishments that we're proud of
Usually I'm more proud of what I can show people, but this time I'm more proud of the invisible abstractions that I've come up with. I've developed enough applications to know that when the core data model is flawed, everything that follows will be terrible to implement. However, developing Sworn felt completely different. Once I landed on the contract-commitment abstraction, everything else including the verifiers and Morph fit perfectly into place without needing to change the core data model. I am proud of that.
What we learned
I learned a lot throughout the course of this hackathon. But the one thing that stands out is learning to accept when I'm wrong and figure out a different path. When the confession model was wrong, I figured out the problem with incentives. When the atomic contract model was wrong, I figured out that it had to be broken down further.
And beyond that I was introduced to this massive field of observability that I have never tackled before. I have never created a monitor, SLO, dashboard or Runbook in my entire life. Now that I have, I'm equipped with the capabilities to implement these in my future projects, making them more production ready and easier to debug.
What's next for Agentic Contract Framework
The insights I've gained over the past month are far too valuable to stop working on this project now. Here are a few directions I plan to take.
Agents as State Machines: Every action an agent takes changes the state of the agent, what if we can make this explicit, so that verification can be easier (is there an edge from this state to another?) and we can also use it as documentation for how the agent should work.
Behavioural Coverage: Contract coverage can tell you when a specific tool call is not covered in your contract, but how about a specific combination of tool calls, or a specific pattern of tool calls?
Finding Gaps: This is the biggest value proposition of ACF, the edge that turns what is usually a linear process into a cyclical one. We need a way to look at many evaluations and figure out the gaps from all of them, instead of having the user manually check through each one when an alert triggers.