
llm: static tool scheme #10366

Merged
vito merged 79 commits into main from llm-static on May 14, 2025
Conversation

Contributor

@vito vito commented May 9, 2025

Switch mostly from dynamic tools to static

Instead of listing available objects + methods in tool descriptions, there are new tools:

  • list_objects - list all known objects and their descriptions
  • list_methods(type) - list known methods, optionally for a specific type
    • shows method name (http, Container.withExec), required args, and return type
    • does NOT show descriptions - the required args are enough

Instead of representing APIs as tools that dynamically become available, there are tools for calling methods:

  • call_method - call a single method against a self
  • chain_methods - call a chain of methods starting from a self
    • I experimented with this separately in llm: chaining tool calls #10229 but this PR's implementation seems to work much more consistently (maybe "method chaining" is a strong metaphor)

So what does "mostly" mean?

  • The save tool still has a dynamic description. I experimented with making this static (by adding a list_outputs and having save save one arbitrary name+value output at a time) but it seemed to really degrade model behavior - it's tough to beat a single function with a schema for the required outputs.
  • The user_provided_values tool still works the same as before (a description derived from the inputs).
  • I think these both are fine for all intents and purposes. We don't need MCP clients to save values, and inputs are static anyway.

Remove the think tool

It's nifty, but not fully proven - evals pass just fine without it. Some clients (like Zed) already have their own thinking tool, some models have an explicit "thinking" mode, so let's wait until we're sure we need it and add it more thoughtfully (opt-in?).

Evals Report (analysis)

| Model | Eval | Success Rate (attempts) | Input / Output Tokens | Traces |
| --- | --- | --- | --- | --- |
| claude-3-5-sonnet-latest | Basic | 100% (15 → 30 (+15)) | 2268.1 → 708.0 (-1560.1) / 99.1 → 4.0 (-95.1) | [1][2][3][4][5][6][7][8][9][10] |
| claude-3-5-sonnet-latest | BuildMulti | 100% (15 → 30 (+15)) | 8213.5 → 7128.4 (-1085.1) / 883.3 → 625.7 (-257.6) | [1][2][3][4][5][6][7][8][9][10] |
| claude-3-5-sonnet-latest | BuildMultiNoVar | 100% (15 → 30 (+15)) | 6713.9 → 8931.2 (+2217.3) / 918.5 → 632.5 (-286.0) | [1][2][3][4][5][6][7][8][9][10] |
| claude-3-5-sonnet-latest | ReadImplicitVars | 100% (15 → 30 (+15)) | 5709.5 → 4315.9 (-1393.6) / 398.8 → 394.1 (-4.7) | [1][2][3][4][5][6][7][8][9][10] |
| claude-3-5-sonnet-latest | UndoChanges | 100% (15 → 30 (+15)) | 7912.5 → 9496.1 (+1583.6) / 939.1 → 892.0 (-47.1) | [1][2][3][4][5][6][7][8][9][10] |
| claude-3-5-sonnet-latest | WorkspacePattern | 93% → 100% (+7%) (15 → 30 (+15)) | 6377.1 → 8242.3 (+1865.2) / 696.3 → 852.1 (+155.8) | [1][2][3][4][5][6][7][8][9][10] |
| gemini-2.0-flash | Basic | 100% (50 → 100 (+50)) | 682.0 → 296.0 (-386.0) / 2.0 | [1][2][3][4][5][6][7][8][9][10] |
| gemini-2.0-flash | BuildMulti | 100% → 96% (-4%) (50 → 100 (+50)) | 33393.5 → 29076.7 (-4316.7) / 364.1 → 356.8 (-7.3) | [1][2][3][4][5][6][7][8][9][10] |
| gemini-2.0-flash | BuildMultiNoVar | 98% → 97% (-1%) (50 → 100 (+50)) | 38555.9 → 24493.3 (-14062.6) / 462.0 → 420.0 (-42.0) | [1][2][3][4][5][6][7][8][9][10] |
| gemini-2.0-flash | ReadImplicitVars | 100% → 98% (-2%) (50 → 100 (+50)) | 5000.6 → 5792.9 (+792.4) / 180.9 → 174.4 (-6.5) | [1][2][3][4][5][6][7][8][9][10] |
| gemini-2.0-flash | UndoChanges | 100% → 99% (-1%) (50 → 100 (+50)) | 15152.1 → 14817.4 (-334.7) / 393.1 → 422.8 (+29.7) | [1][2][3][4][5][6][7][8][9][10] |
| gemini-2.0-flash | WorkspacePattern | 70% → 100% (+30%) (50 → 100 (+50)) | 8758.4 → 9985.9 (+1227.5) / 274.6 → 391.9 (+117.3) | [1][2][3][4][5][6][7][8][9][10] |
| gpt-4.1 | Basic | 100% (25 → 50 (+25)) | 1442.6 → 316.0 (-1126.6) / 47.6 → 3.9 (-43.7) | [1][2][3][4][5][6][7][8][9][10] |
| gpt-4.1 | BuildMulti | 100% (25 → 50 (+25)) | 33408.3 → 29501.2 (-3907.0) / 260.0 → 253.6 (-6.4) | [1][2][3][4][5][6][7][8][9][10] |
| gpt-4.1 | BuildMultiNoVar | 100% (25 → 50 (+25)) | 33678.1 → 29210.7 (-4467.4) / 208.5 → 250.4 (+41.9) | [1][2][3][4][5][6][7][8][9][10] |
| gpt-4.1 | ReadImplicitVars | 100% (25 → 50 (+25)) | 3879.3 → 4968.0 (+1088.7) / 68.7 → 86.1 (+17.4) | [1][2][3][4][5][6][7][8][9][10] |
| gpt-4.1 | UndoChanges | 100% (25 → 50 (+25)) | 9430.6 → 10914.3 (+1483.7) / 149.2 → 183.9 (+34.7) | [1][2][3][4][5][6][7][8][9][10] |
| gpt-4.1 | WorkspacePattern | 92% → 100% (+8%) (25 → 50 (+25)) | 10423.2 → 9712.6 (-710.6) / 184.0 → 232.3 (+48.3) | [1][2][3][4][5][6][7][8][9][10] |

@vito vito force-pushed the llm-static branch 4 times, most recently from 8e14034 to 739303f on May 12, 2025 14:26
@vito vito mentioned this pull request May 12, 2025
@vito vito marked this pull request as ready for review May 12, 2025 16:12
@vito vito requested review from a team as code owners May 12, 2025 16:12
@vito vito force-pushed the llm-static branch 3 times, most recently from c0d7874 to 7374d3c on May 12, 2025 21:50
Comment thread core/integration/llm_test.go
Comment thread core/mcp.go
Comment thread core/llm_dagger_prompt.md
Contributor Author

This scheme is actually pretty simple to explain now. From empirical testing and a bit of vibes, the "object + method" metaphor seems to carry us pretty far. For example, the model struggled with chained_tools (using it at times when it shouldn't) but seems adept with chain_methods.

@vito vito requested review from cwlbraa and shykes May 12, 2025 22:38
@vito vito added this to the v0.18.7 milestone May 12, 2025
vito added 8 commits May 13, 2025 09:46
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
* add describe_context, run_tool
* disable dynamic tool selection, to stress test

for compatibility with clients that only support static tools

evals (not passing):

    https://v3.dagger.cloud/dagger/traces/c76d5bcbfa579abfe442f02687b58fca

Signed-off-by: Alex Suraci <alex@dagger.io>
This reverts commit be25f2e75813457f0ce816c2e34333addec4d724.

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
this busts caches all the time and won't work generally with all clients

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 7 commits May 13, 2025 09:47
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
previously Env would be stuck with the `Object` at Env construction
time, which won't have module dependencies. instead we use the Root of
the *dagql.Server at runtime, which was added for this very kind of
behavior.

Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 2 commits May 13, 2025 11:09
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 9 commits May 13, 2025 11:13
seeing this misunderstanding occasionally across all models,
where they go all the way back to Directory#1

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
This almost works, but call_method and chain_methods currently need a
non-strict schema since they intentionally use additionalProperties for
the args schema. I don't think we want to sacrifice that, so I'll just
stop short of actually enabling it, since explicitly requiring every
param is probably a good idea - sometimes the model seems to make
assumptions about what omitting it means (sorry to vibesplain).

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
seems to hinder more than it helps - the model frequently hallucinates a
bogus value, it's better to just show it everything

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
see googleapis/go-genai#310

Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 2 commits May 14, 2025 12:33
started using these for strict: true compliance

Signed-off-by: Alex Suraci <alex@dagger.io>
we don't necessarily want to over-tune for this use case,
so just be more explicit in the test

Signed-off-by: Alex Suraci <alex@dagger.io>
Contributor

@cwlbraa cwlbraa left a comment

lgtm, and noting for posterity i did a bunch of manual testing of this on d8727ca yesterday.

i don't think any of my comments are blocking.

Comment thread core/integration/llm_test.go
Comment thread core/llm.go

```go
func (r *LLMRouter) LoadConfig(ctx context.Context, getenv func(context.Context, string) (string, error)) error {
	if getenv == nil {
		getenv = func(ctx context.Context, key string) (string, error) {
```
Contributor

what's the full intent of these changes? parallelize, obviously, refactor a little, but does this change error output at all?

sidenote: why do you pass ctx to roundly ignore it?

Contributor Author

The goal is just to parallelize, since I saw CI was taking a very long time to chew through all of these one by one. There should be no behavior difference (besides speed).

ctx is only ignored by the getenv fallback function (if getenv == nil) - afaik it's not ignored by the "real" getter that's passed in.

Contributor

ah, i read right past the getenv func param, now this makes sense

Comment thread core/llm.go
Comment on lines +655 to +657

```diff
 llm.err = llm.loop(ctx, dag)
 })
-return err
+return llm.err
```
Contributor

@cwlbraa cwlbraa May 14, 2025

this is a bugfix, isn't it? is there a test? (not requesting one rn, just noting for posterity that there's a case here where you can lose the error by calling sync twice)

Contributor Author

Yeah, there's a missing test here, and I don't think there's a very obvious place to put one at the moment. 😕

Comment thread core/llm_dagger_prompt.md
You will be given a task described through the combination of tool descriptions and user messages. The `select_tools` tool describes the available tools and objects. The `save` tool, if present, describes the desired outputs.
The Dagger tool system operates as a chain of transformations where:
1. Objects are referenced by IDs (e.g., Container#1, File#2)
2. All objects are immutable - methods return new objects rather than modifying existing ones
Contributor

lol except host directories with my reloading thing ... i am kinda worried going down this path of mounts and live reloading and whatnot is confusing both for us and for the LLMs

Contributor Author

@vito vito May 14, 2025

Yeah I've been thinking of walking back these 'immutable' assertions and trying to let the model just trust return values (like "if you call X against Container#1 and get Container#2, trust that Container#1 remains unmodified").

There are already places in the API where 'immutable' doesn't hold true (like starting/stopping a service).

We can burn that bridge when we get to it.

Comment thread core/llm_docs.md
```go
return withLLMReport(ctx,
	m.llm(dagger.LLMOpts{MaxAPICalls: 20}).
		WithEnv(dag.Env(dagger.EnvOpts{Privileged: true}).
			WithStringOutput("methods", "The list of methods that you can see.")).
```
Contributor

so the LLM can see the tools, obviously, but can i inspect them on an LLM object still? there's still a tools method and i don't think the schema changed, so why'd the $agent | tools test break?

Contributor Author

Methods aren't exposed as tools anymore - there's a static set of tools for calling methods, so $agent | tools is only going to show you list_methods, call_method, etc.

So the only way to see the set of methods available is to get the LLM to check for you.

Contributor

ah, duh... in retrospect very obvious.

with this scheme you can't "break in" to the black box of methods yourself. that's good for now, especially with us changing the internals on the regular, but i wouldn't be surprised if we want humans to be able to inspect the result of list_methods eventually.

Comment thread core/mcp.go
Comment thread core/mcp.go
@vito vito merged commit cc5e55e into main May 14, 2025
58 checks passed
@vito vito deleted the llm-static branch May 14, 2025 18:53
@cwlbraa cwlbraa modified the milestones: v0.18.7, v0.18.8 May 14, 2025