Conversation
This scheme is actually pretty simple to explain now. From empirical testing and a bit of vibes, the "object + method" metaphor seems to carry us pretty far. For example, the model struggled with chained_tools (using it at times when it shouldn't) but seems adept with chain_methods.
Signed-off-by: Alex Suraci <alex@dagger.io>
* add describe_context, run_tool
* disable dynamic tool selection, to stress-test compatibility with clients that only support static tools
evals (not passing):
https://v3.dagger.cloud/dagger/traces/c76d5bcbfa579abfe442f02687b58fca
This reverts commit be25f2e75813457f0ce816c2e34333addec4d724.
this busts caches all the time and won't work generally with all clients
previously, Env would be stuck with the `Object` captured at Env construction time, which won't have module dependencies. instead we now use the Root of the *dagql.Server at runtime, which was added for this very kind of behavior.
seeing this misunderstanding occasionally across all models, where they go all the way back to Directory#1
This almost works, but `call_method` and `chain_methods` currently need a non-strict schema, since they intentionally use `additionalProperties` for the args schema. I don't think we want to sacrifice that, so I'll stop short of actually enabling it. Explicitly requiring every param is probably a good idea anyway - sometimes the model seems to make assumptions about what omitting a param means (sorry to vibesplain).
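For context on why the two conflict: strict function schemas (e.g. OpenAI's `strict: true`) require `additionalProperties: false` and every property listed in `required`, which is exactly what an open-ended args object can't satisfy. A sketch of the shape involved - field names here are illustrative, not the actual tool schema:

```json
{
  "name": "call_method",
  "strict": false,
  "parameters": {
    "type": "object",
    "properties": {
      "self":   { "type": "string", "description": "Object ID, e.g. Container#1" },
      "method": { "type": "string" },
      "args": {
        "type": "object",
        "additionalProperties": true
      }
    },
    "required": ["self", "method", "args"]
  }
}
```

Flipping `strict` to `true` would force `args` to enumerate its properties up front, which defeats the point of a generic method-call tool.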
seems to hinder more than it helps - the model frequently hallucinates a bogus value, so it's better to just show it everything
see googleapis/go-genai#310
started using these for strict: true compliance
we don't necessarily want to over-tune for this use case, so just be more explicit in the test
```go
func (r *LLMRouter) LoadConfig(ctx context.Context, getenv func(context.Context, string) (string, error)) error {
	if getenv == nil {
		getenv = func(ctx context.Context, key string) (string, error) {
```
what's the full intent of these changes? parallelize, obviously, refactor a little, but does this change error output at all?
sidenote: why do you pass ctx to roundly ignore it?
The goal is just to parallelize, since I saw CI was taking a very long time to chew through all of these one by one. There should be no behavior difference (besides speed).
ctx is only ignored by the getenv fallback function (if getenv == nil) - afaik it's not ignored by the "real" getter that's passed in.
ah, i read right past the getenv func param, now this makes sense
```diff
 	llm.err = llm.loop(ctx, dag)
 })
-return err
+return llm.err
```
this is a bugfix, isn't it? is there a test? (not requesting one rn, just noting for posterity that there's a case here where you can lose the error by calling sync twice)
Yeah, there's a missing test here, and I don't think there's a very obvious place to put one at the moment. 😕
> You will be given a task described through the combination of tool descriptions and user messages. The `select_tools` tool describes the available tools and objects. The `save` tool, if present, describes the desired outputs.
> The Dagger tool system operates as a chain of transformations where:
> 1. Objects are referenced by IDs (e.g., Container#1, File#2)
> 2. All objects are immutable - methods return new objects rather than modifying existing ones
lol except host directories with my reloading thing ... i am kinda worried going down this path of mounts and live reloading and whatnot is confusing both for us and for the LLMs
Yeah I've been thinking of walking back these 'immutable' assertions and trying to let the model just trust return values (like "if you call X against Container#1 and get Container#2, trust that Container#1 remains unmodified").
There are already places in the API where 'immutable' doesn't hold true (like starting/stopping a service).
We can burn that bridge when we get to it.
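To make the chain-of-transformations framing concrete, a hypothetical `chain_methods` call might look like this (the actual argument schema isn't shown in this thread, so the field names are illustrative):

```json
{
  "self": "Container#1",
  "chain": [
    { "method": "withExec", "args": { "args": ["go", "test", "./..."] } },
    { "method": "stdout", "args": {} }
  ]
}
```

Each step runs against the result of the previous one, and per the discussion above the model should trust that `Container#1` itself remains usable afterwards.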
```go
return withLLMReport(ctx,
	m.llm(dagger.LLMOpts{MaxAPICalls: 20}).
		WithEnv(dag.Env(dagger.EnvOpts{Privileged: true}).
			WithStringOutput("methods", "The list of methods that you can see.")).
```
so the LLM can see the tools, obviously, but can i inspect them on an LLM object still? there's still a tools method and i don't think the schema changed, so why'd the $agent | tools test break?
Methods aren't exposed as tools anymore - there's a static set of tools for calling methods, so $agent | tools is only going to show you list_methods, call_method, etc.
So the only way to see the set of methods available is to get the LLM to check for you.
ah, duh... in retrospect very obvious.
with this scheme you can't "break in" to the black box of methods yourself. that's good for now, especially with us changing the internals on the regular, but i wouldn't be surprised if we want humans to be able to inspect the result of list_methods eventually.
Switch mostly from dynamic tools to static
Instead of listing available objects + methods in tool descriptions, there are new tools:
- `list_objects` - list all known objects and their descriptions
- `list_methods(type)` - list known methods, optionally for a specific type
- a tool to describe given methods (e.g. `http`, `Container.withExec`), their required args, and return type

Instead of representing APIs as tools that dynamically become available, there are tools for calling methods:
- `call_method` - call a single method against a `self`
- `chain_methods` - call a chain of methods starting from a `self`

So what does "mostly" mean?
- The `save` tool still has a dynamic description. I experimented with making this static (by adding a `list_outputs` and having `save` save one arbitrary name+value output at a time) but it seemed to really degrade model behavior - it's tough to beat a single function with a schema for the required outputs.
- The `user_provided_values` tool still works the same as before (a description derived from the inputs).

Remove the `think` tool
It's nifty, but not fully proven - evals pass just fine without it. Some clients (like Zed) already have their own `thinking` tool, and some models have an explicit "thinking" mode, so let's wait until we're sure we need it and add it more thoughtfully (opt-in?).

Evals Report (analysis)
| Model | Attempts | Result (before → after) |
| --- | --- | --- |
| claude-3-5-sonnet-latest | 15 → 30 (+15) | 99.1 → 4.0 (-95.1) |
| claude-3-5-sonnet-latest | 15 → 30 (+15) | 883.3 → 625.7 (-257.6) |
| claude-3-5-sonnet-latest | 15 → 30 (+15) | 918.5 → 632.5 (-286.0) |
| claude-3-5-sonnet-latest | 15 → 30 (+15) | 398.8 → 394.1 (-4.7) |
| claude-3-5-sonnet-latest | 15 → 30 (+15) | 939.1 → 892.0 (-47.1) |
| claude-3-5-sonnet-latest | 15 → 30 (+15) | 696.3 → 852.1 (+155.8) |
| gemini-2.0-flash | 50 → 100 (+50) | 2.0 |
| gemini-2.0-flash | 50 → 100 (+50) | 364.1 → 356.8 (-7.3) |
| gemini-2.0-flash | 50 → 100 (+50) | 462.0 → 420.0 (-42.0) |
| gemini-2.0-flash | 50 → 100 (+50) | 180.9 → 174.4 (-6.5) |
| gemini-2.0-flash | 50 → 100 (+50) | 393.1 → 422.8 (+29.7) |
| gemini-2.0-flash | 50 → 100 (+50) | 274.6 → 391.9 (+117.3) |
| gpt-4.1 | 25 → 50 (+25) | 47.6 → 3.9 (-43.7) |
| gpt-4.1 | 25 → 50 (+25) | 260.0 → 253.6 (-6.4) |
| gpt-4.1 | 25 → 50 (+25) | 208.5 → 250.4 (+41.9) |
| gpt-4.1 | 25 → 50 (+25) | 68.7 → 86.1 (+17.4) |
| gpt-4.1 | 25 → 50 (+25) | 149.2 → 183.9 (+34.7) |
| gpt-4.1 | 25 → 50 (+25) | 184.0 → 232.3 (+48.3) |