
llm: static tool scheme #10366

Merged
vito merged 79 commits into main from llm-static on May 14, 2025
Conversation

Contributor

@vito vito commented May 9, 2025

Switch mostly from dynamic tools to static

Instead of listing available objects + methods in tool descriptions, there are new tools:

  • list_objects - list all known objects and their descriptions
  • list_methods(type) - list known methods, optionally for a specific type
    • shows method name (http, Container.withExec), required args, and return type
    • does NOT show descriptions - the required args are enough

Instead of representing APIs as tools that dynamically become available, there are tools for calling methods:

  • call_method - call a single method against a self
  • chain_methods - call a chain of methods starting from a self
    • I experimented with this separately in llm: chaining tool calls #10229 but this PR's implementation seems to work much more consistently (maybe "method chaining" is a strong metaphor)

So what does "mostly" mean?

  • The save tool still has a dynamic description. I experimented with making this static (by adding a list_outputs and having save save one arbitrary name+value output at a time) but it seemed to really degrade model behavior - it's tough to beat a single function with a schema for the required outputs.
  • The user_provided_values tool still works the same as before (a description derived from the inputs).
  • I think these both are fine for all intents and purposes. We don't need MCP clients to save values, and inputs are static anyway.

Remove the think tool

It's nifty, but not fully proven - evals pass just fine without it. Some clients (like Zed) already have their own thinking tool, some models have an explicit "thinking" mode, so let's wait until we're sure we need it and add it more thoughtfully (opt-in?).

Evals Report (analysis)

| Model | Eval | Success Rate (attempts) | Input / Output Tokens | Traces |
| --- | --- | --- | --- | --- |
| claude-3-5-sonnet-latest | Basic | 100% (15 → 30 (+15)) | 2268.1 → 708.0 (-1560.1) / 99.1 → 4.0 (-95.1) | [1][2][3][4][5][6][7][8][9][10] |
| claude-3-5-sonnet-latest | BuildMulti | 100% (15 → 30 (+15)) | 8213.5 → 7128.4 (-1085.1) / 883.3 → 625.7 (-257.6) | [1][2][3][4][5][6][7][8][9][10] |
| claude-3-5-sonnet-latest | BuildMultiNoVar | 100% (15 → 30 (+15)) | 6713.9 → 8931.2 (+2217.3) / 918.5 → 632.5 (-286.0) | [1][2][3][4][5][6][7][8][9][10] |
| claude-3-5-sonnet-latest | ReadImplicitVars | 100% (15 → 30 (+15)) | 5709.5 → 4315.9 (-1393.6) / 398.8 → 394.1 (-4.7) | [1][2][3][4][5][6][7][8][9][10] |
| claude-3-5-sonnet-latest | UndoChanges | 100% (15 → 30 (+15)) | 7912.5 → 9496.1 (+1583.6) / 939.1 → 892.0 (-47.1) | [1][2][3][4][5][6][7][8][9][10] |
| claude-3-5-sonnet-latest | WorkspacePattern | 93% → 100% (+7%) (15 → 30 (+15)) | 6377.1 → 8242.3 (+1865.2) / 696.3 → 852.1 (+155.8) | [1][2][3][4][5][6][7][8][9][10] |
| gemini-2.0-flash | Basic | 100% (50 → 100 (+50)) | 682.0 → 296.0 (-386.0) / 2.0 | [1][2][3][4][5][6][7][8][9][10] |
| gemini-2.0-flash | BuildMulti | 100% → 96% (-4%) (50 → 100 (+50)) | 33393.5 → 29076.7 (-4316.7) / 364.1 → 356.8 (-7.3) | [1][2][3][4][5][6][7][8][9][10] |
| gemini-2.0-flash | BuildMultiNoVar | 98% → 97% (-1%) (50 → 100 (+50)) | 38555.9 → 24493.3 (-14062.6) / 462.0 → 420.0 (-42.0) | [1][2][3][4][5][6][7][8][9][10] |
| gemini-2.0-flash | ReadImplicitVars | 100% → 98% (-2%) (50 → 100 (+50)) | 5000.6 → 5792.9 (+792.4) / 180.9 → 174.4 (-6.5) | [1][2][3][4][5][6][7][8][9][10] |
| gemini-2.0-flash | UndoChanges | 100% → 99% (-1%) (50 → 100 (+50)) | 15152.1 → 14817.4 (-334.7) / 393.1 → 422.8 (+29.7) | [1][2][3][4][5][6][7][8][9][10] |
| gemini-2.0-flash | WorkspacePattern | 70% → 100% (+30%) (50 → 100 (+50)) | 8758.4 → 9985.9 (+1227.5) / 274.6 → 391.9 (+117.3) | [1][2][3][4][5][6][7][8][9][10] |
| gpt-4.1 | Basic | 100% (25 → 50 (+25)) | 1442.6 → 316.0 (-1126.6) / 47.6 → 3.9 (-43.7) | [1][2][3][4][5][6][7][8][9][10] |
| gpt-4.1 | BuildMulti | 100% (25 → 50 (+25)) | 33408.3 → 29501.2 (-3907.0) / 260.0 → 253.6 (-6.4) | [1][2][3][4][5][6][7][8][9][10] |
| gpt-4.1 | BuildMultiNoVar | 100% (25 → 50 (+25)) | 33678.1 → 29210.7 (-4467.4) / 208.5 → 250.4 (+41.9) | [1][2][3][4][5][6][7][8][9][10] |
| gpt-4.1 | ReadImplicitVars | 100% (25 → 50 (+25)) | 3879.3 → 4968.0 (+1088.7) / 68.7 → 86.1 (+17.4) | [1][2][3][4][5][6][7][8][9][10] |
| gpt-4.1 | UndoChanges | 100% (25 → 50 (+25)) | 9430.6 → 10914.3 (+1483.7) / 149.2 → 183.9 (+34.7) | [1][2][3][4][5][6][7][8][9][10] |
| gpt-4.1 | WorkspacePattern | 92% → 100% (+8%) (25 → 50 (+25)) | 10423.2 → 9712.6 (-710.6) / 184.0 → 232.3 (+48.3) | [1][2][3][4][5][6][7][8][9][10] |

@vito vito force-pushed the llm-static branch 4 times, most recently from 8e14034 to 739303f on May 12, 2025 14:26
@vito vito mentioned this pull request May 12, 2025
@vito vito marked this pull request as ready for review May 12, 2025 16:12
@vito vito requested review from a team as code owners May 12, 2025 16:12
@vito vito force-pushed the llm-static branch 3 times, most recently from c0d7874 to 7374d3c on May 12, 2025 21:50
Comment thread core/integration/llm_test.go
Comment thread core/mcp.go
Comment thread core/llm_dagger_prompt.md
Contributor Author

This scheme is actually pretty simple to explain now. From empirical testing and a bit of vibes, the "object + method" metaphor seems to carry us pretty far. For example, the model struggled with chained_tools (using it at times when it shouldn't) but seems adept with chain_methods.

@vito vito requested review from cwlbraa and shykes May 12, 2025 22:38
@vito vito added this to the v0.18.7 milestone May 12, 2025
vito added 8 commits May 13, 2025 09:46
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
* add describe_context, run_tool
* disable dynamic tool selection, to stress test

for compatibility with clients that only support static tools

evals (not passing):

    https://v3.dagger.cloud/dagger/traces/c76d5bcbfa579abfe442f02687b58fca

Signed-off-by: Alex Suraci <alex@dagger.io>
This reverts commit be25f2e75813457f0ce816c2e34333addec4d724.

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
this busts caches all the time and won't work generally with all clients

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 7 commits May 13, 2025 09:47
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
previously Env would be stuck with the `Object` at Env construction
time, which won't have module dependencies. instead we use the Root of
the *dagql.Server at runtime, which was added for this very kind of
behavior.

Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 2 commits May 13, 2025 11:09
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 9 commits May 13, 2025 11:13
seeing this misunderstanding occasionally across all models,
where they go all the way back to Directory#1

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
This almost works, but call_method and chain_methods currently need a
non-strict schema since they intentionally use additionalProperties for
the args schema. I don't think we want to sacrifice that, so I'll just
stop short of actually enabling it, since explicitly requiring every
param is probably a good idea - sometimes the model seems to make
assumptions about what omitting it means (sorry to vibesplain).

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
seems to hinder more than it helps - the model frequently hallucinates a
bogus value, it's better to just show it everything

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
see googleapis/go-genai#310

Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 2 commits May 14, 2025 12:33
started using these for strict: true compliance

Signed-off-by: Alex Suraci <alex@dagger.io>
we don't necessarily want to over-tune for this use case,
so just be more explicit in the test

Signed-off-by: Alex Suraci <alex@dagger.io>
Contributor

@cwlbraa cwlbraa left a comment

lgtm, and noting for posterity i did a bunch of manual testing of this on d8727ca yesterday.

i don't think any of my comments are blocking.

Comment thread core/integration/llm_test.go
Comment thread core/llm.go

```go
func (r *LLMRouter) LoadConfig(ctx context.Context, getenv func(context.Context, string) (string, error)) error {
	if getenv == nil {
		getenv = func(ctx context.Context, key string) (string, error) {
```
Contributor

what's the full intent of these changes? parallelize, obviously, refactor a little, but does this change error output at all?

sidenote: why do you pass ctx to roundly ignore it?

Contributor Author

The goal is just to parallelize, since I saw CI was taking a very long time to chew through all of these one by one. There should be no behavior difference (besides speed).

ctx is only ignored by the getenv fallback function (if getenv == nil) - afaik it's not ignored by the "real" getter that's passed in.

Contributor

ah, i read right past the getenv func param, now this makes sense

Comment thread core/llm.go
Comment on lines +655 to +657

```diff
 llm.err = llm.loop(ctx, dag)
 })
-return err
+return llm.err
```
Contributor

@cwlbraa cwlbraa May 14, 2025

this is a bugfix, isn't it? is there a test? (not requesting one rn, just noting for posterity that there's a case here where you can lose the error by calling sync twice)

Contributor Author

Yeah, there's a missing test here, and I don't think there's a very obvious place to put one at the moment. 😕

Comment thread core/llm_dagger_prompt.md
You will be given a task described through the combination of tool descriptions and user messages. The `select_tools` tool describes the available tools and objects. The `save` tool, if present, describes the desired outputs.
The Dagger tool system operates as a chain of transformations where:
1. Objects are referenced by IDs (e.g., Container#1, File#2)
2. All objects are immutable - methods return new objects rather than modifying existing ones
Contributor

lol except host directories with my reloading thing ... i am kinda worried going down this path of mounts and live reloading and whatnot is confusing both for us and for the LLMs

Contributor Author

@vito vito May 14, 2025

Yeah I've been thinking of walking back these 'immutable' assertions and trying to let the model just trust return values (like "if you call X against Container#1 and get Container#2, trust that Container#1 remains unmodified").

There are already places in the API where 'immutable' doesn't hold true (like starting/stopping a service).

We can burn that bridge when we get to it.

Comment thread core/llm_docs.md
```go
return withLLMReport(ctx,
	m.llm(dagger.LLMOpts{MaxAPICalls: 20}).
		WithEnv(dag.Env(dagger.EnvOpts{Privileged: true}).
			WithStringOutput("methods", "The list of methods that you can see.")).
```
Contributor

so the LLM can see the tools, obviously, but can i inspect them on an LLM object still? there's still a tools method and i don't think the schema changed, so why'd the $agent | tools test break?

Contributor Author

Methods aren't exposed as tools anymore - there's a static set of tools for calling methods, so $agent | tools is only going to show you list_methods, call_method, etc.

So the only way to see the set of methods available is to get the LLM to check for you.

Contributor

ah, duh... in retrospect very obvious.

with this scheme you can't "break in" to the black box of methods yourself. that's good for now, especially with us changing the internals on the regular, but i wouldn't be surprised if we want humans to be able to inspect the result of list_methods eventually.

Comment thread core/mcp.go
Comment thread core/mcp.go
@vito vito merged commit cc5e55e into main May 14, 2025
58 checks passed
@vito vito deleted the llm-static branch May 14, 2025 18:53
@cwlbraa cwlbraa modified the milestones: v0.18.7, v0.18.8 May 14, 2025