eval: preserve input eval JSON "id" as session ID in results output #2853

@hamza-jeddad

Summary

docker agent eval does not preserve the "id" field from input eval JSON files in the results output. Instead, it generates a new UUID for each session, discarding any "id" value supplied by the caller. The "title" field is preserved correctly — only "id" is dropped.


Steps to Reproduce

mkdir -p /tmp/probe-eval /tmp/probe-out

cat > /tmp/probe-eval/test.json <<'EOF'
{
  "id": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
  "title": "Probe eval",
  "evals": { "relevance": ["Answers the question"] },
  "messages": [{ "message": { "agentName": "", "message": { "role": "user", "content": "What is Docker?" } } }]
}
EOF

docker agent eval ~/Workspace/gordon/assistant/gordon_dev.yaml /tmp/probe-eval --output /tmp/probe-out --concurrency 1

jq '.sessions[0] | {id, title}' /tmp/probe-out/*.json

Current Behaviour

The output session receives a freshly generated UUID, ignoring the "id" present in the input file:

{
  "id": "91907fe1-cd72-4e88-b1a5-1b439675f7c5",
  "title": "Probe eval"
}

Expected Behaviour

When an eval JSON file contains an "id" field, docker agent eval should carry that value through to the corresponding session entry in the results output:

{
  "id": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
  "title": "Probe eval"
}

If no "id" is present in the input file, the current behaviour (auto-generating a UUID) is acceptable.
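The rule described above can be sketched in a few lines. This is only an illustration of the requested behaviour, not the actual docker agent eval implementation; the function name `resolve_session_id` is hypothetical, and treating an empty or non-string "id" as absent is an assumption that the issue itself does not specify.

```python
import uuid


def resolve_session_id(eval_doc: dict) -> str:
    """Return the caller-supplied "id" if present, else a fresh UUID."""
    supplied = eval_doc.get("id")
    # Assumption: a missing, empty, or non-string "id" counts as absent.
    if isinstance(supplied, str) and supplied.strip():
        return supplied
    # No usable "id" in the input file: keep the current auto-generation.
    return str(uuid.uuid4())


probe = {"id": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee", "title": "Probe eval"}
print(resolve_session_id(probe))                    # caller-supplied ID is kept
print(resolve_session_id({"title": "No id here"}))  # falls back to a new UUID
```

Under this rule, input files without an "id" behave exactly as they do today.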


Motivation / Impact

Any downstream system that writes eval JSON files with a known "id" — for example to correlate results back to a record in a source database — is broken by this behaviour. Because the output session ID is an unrelated UUID, there is no reliable way to map a result back to the originating eval record without resorting to fragile heuristics (e.g. matching on "title", which is not guaranteed to be unique).

Preserving the caller-supplied "id" is a minimal, non-breaking change: it only affects sessions whose input file already carries an "id", and leaves auto-generation in place for all other cases.
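To make the impact concrete, here is a rough sketch of the lookup a downstream system might perform. It assumes the results file has a top-level "sessions" array whose entries carry "id" and "title", as the jq command in the reproduction steps implies; `index_sessions_by_id` and `source_records` are hypothetical names used only for illustration.

```python
import json


def index_sessions_by_id(results: dict) -> dict:
    """Map each session's "id" to the session object."""
    return {s["id"]: s for s in results.get("sessions", []) if "id" in s}


# Hypothetical source records, keyed by the ID written into the eval JSON.
source_records = {"aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee": {"db_row": 42}}

# Stand-in for one results file, shaped like the expected output in this issue.
results = json.loads("""
{"sessions": [{"id": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee", "title": "Probe eval"}]}
""")

sessions = index_sessions_by_id(results)
# Records whose ID appears in the results can be correlated directly.
matched = {sid: rec for sid, rec in source_records.items() if sid in sessions}
# Records with no matching session ID cannot be mapped back.
unmatched = sorted(set(source_records) - set(sessions))
print(matched, unmatched)
```

With the current behaviour the session ID is a freshly generated UUID, so every source record would land in `unmatched`.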

Labels: area/testing (test infrastructure, CI/CD, test runners, evaluation), kind/fix (PR fixes a bug; maps to the fix: commit prefix)
