fix: diff grader reads post-execution workspace files#196
Conversation
The diff grader was reading workspace files from the filesystem after the Copilot SDK session had disconnected. Since session.Disconnect() may restore workspace files to their pre-execution state, the grader would see the original file content instead of the agent's modifications. Fix: capture all workspace file contents into memory before the session disconnects, and have the diff grader prefer these captured files over the on-disk workspace. This guarantees graders always see the true post-execution state regardless of SDK disconnect behavior. Changes: - Add WorkspaceFiles field to ExecutionResponse and graders.Context - Add captureWorkspaceFiles() that snapshots workspace before disconnect - Add readWorkspaceFile() to diff grader that prefers captured files over filesystem reads, with forward-slash normalization for cross-platform consistency - Add tests for the capture function and grader behavior Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
d20ab87 to
5045760
Compare
There was a problem hiding this comment.
Pull request overview
Fixes diff-grader false negatives by snapshotting post-execution workspace files before the Copilot SDK session disconnects (which may revert files), then threading that snapshot through execution → runner → grader context so the diff grader can prefer in-memory content over on-disk reads. The PR also adds YAML-time validation for grader configs and updates the demo script accordingly.
Changes:
- Capture post-execution workspace files in-memory and pass them through
ExecutionResponse→graders.Context. - Update diff grader to prefer captured workspace content (with path normalization) and add targeted tests.
- Add grader-config validation during YAML unmarshalling (+ tests) and refresh
DEMO-SCRIPT.md.
Show a summary per file
| File | Description |
|---|---|
| internal/execution/engine.go | Adds WorkspaceFiles snapshot field to the execution response. |
| internal/execution/copilot.go | Captures workspace files pre-disconnect via captureWorkspaceFiles(). |
| internal/execution/mock.go | Mirrors workspace snapshot capture in the mock engine. |
| internal/execution/workspace_test.go | Adds unit tests for workspace snapshot capture behavior. |
| internal/orchestration/runner.go | Threads WorkspaceFiles into graders.Context. |
| internal/graders/grader.go | Extends grading context with WorkspaceFiles. |
| internal/graders/diff_grader.go | Prefers captured workspace files over filesystem reads. |
| internal/graders/diff_grader_test.go | Adds tests for captured-vs-filesystem precedence and fallback. |
| internal/models/spec.go | Adds grader-type required-field validation during YAML unmarshal. |
| internal/models/testcase.go | Adds validator required-field validation during YAML unmarshal. |
| internal/models/grader_validation_test.go | New tests covering validation failures/success cases. |
| DEMO-SCRIPT.md | Updates demo flow and example YAML. |
Copilot's findings
- Files reviewed: 8/8 changed files
- Comments generated: 2
| // captureWorkspaceFiles reads all files from the workspace directory into memory, | ||
| // preserving the post-execution state. This is called before session.Disconnect() | ||
| // to ensure graders see the agent's modifications even if the SDK restores the | ||
| // workspace to its pre-execution state during disconnect. | ||
| // Keys are forward-slash-separated paths relative to dir for cross-platform consistency. | ||
| func captureWorkspaceFiles(dir string) map[string][]byte { | ||
| if dir == "" { | ||
| return nil | ||
| } | ||
|
|
||
| files := make(map[string][]byte) | ||
| _ = filepath.WalkDir(dir, func(path string, d os.DirEntry, err error) error { | ||
| if err != nil || d.IsDir() { | ||
| return nil | ||
| } | ||
| rel, relErr := filepath.Rel(dir, path) | ||
| if relErr != nil { | ||
| return nil | ||
| } | ||
| content, readErr := os.ReadFile(path) | ||
| if readErr != nil { | ||
| return nil | ||
| } | ||
| // Normalize to forward slashes so map keys match eval YAML paths on all platforms. | ||
| files[filepath.ToSlash(rel)] = content | ||
| return nil | ||
| }) | ||
| return files | ||
| } |
There was a problem hiding this comment.
captureWorkspaceFiles silently ignores WalkDir/Rel/ReadFile errors and reads every non-directory entry into memory with no size limits (and it will also read through symlinks). This can cause graders to see incomplete snapshots without explanation, and can also lead to large memory spikes / potential OOM if the workspace contains big trees (e.g., node_modules) or a symlink to a large file outside the workspace. Consider (1) returning an error (or at least logging when a file can’t be captured), (2) skipping symlinks/non-regular files, and (3) enforcing per-file and total snapshot byte limits (or capturing only files referenced by graders).
| // readWorkspaceFile returns the content of a workspace file, preferring captured | ||
| // post-execution files over the filesystem. Returns (content, true) on success or | ||
| // (nil, false) when the file is not found. | ||
| func (dg *diffGrader) readWorkspaceFile(workspaceDir string, filePath string) ([]byte, bool) { | ||
| normalizedPath := filepath.ToSlash(filePath) | ||
|
|
||
| // Prefer captured workspace files — they reflect post-execution state | ||
| // even if session disconnect modified the on-disk workspace. | ||
| if dg.workspaceFiles != nil { | ||
| if content, ok := dg.workspaceFiles[normalizedPath]; ok { | ||
| return content, true | ||
| } | ||
| } | ||
|
|
||
| // Fall back to reading from the workspace directory on disk. | ||
| if workspaceDir != "" { | ||
| fullPath := filepath.Join(workspaceDir, filePath) | ||
| content, err := os.ReadFile(fullPath) | ||
| if err == nil { | ||
| return content, true | ||
| } | ||
| } | ||
|
|
||
| return nil, false | ||
| } |
There was a problem hiding this comment.
readWorkspaceFile collapses all filesystem read failures into “not found” (it drops the underlying error), which makes troubleshooting harder and can mislead users when the file exists but is unreadable (permissions, transient IO errors, etc.). Consider returning a third error value (or a sentinel) so checkExpectedFile can distinguish not-exist vs read-failed and preserve the more specific failure message.
Summary
Closes #165
The diff grader was reading workspace files from the filesystem after the Copilot SDK session had disconnected. Since
session.Disconnect()may restore workspace files to their pre-execution state, the grader would see original file content instead of the agent's modifications — causingcontainsfragment checks to fail even though the agent made the correct changes.Root Cause
In
CopilotEngine.Execute(),session.Disconnect()is deferred and fires when the function returns. By the time the runner callsrunGraders(), the disconnect has already occurred. If the SDK restores/modifies workspace files during disconnect, the diff grader reads pre-execution content from the filesystem.Fix
captureWorkspaceFiles()— walks the workspace dir and stores all file contents in amap[string][]byteExecutionResponse.WorkspaceFiles→graders.Context.WorkspaceFiles→diffGrader.workspaceFilesreadWorkspaceFile(), with forward-slash path normalization for cross-platform consistencyFiles Changed
internal/execution/engine.goWorkspaceFilesfield toExecutionResponseinternal/execution/copilot.gocaptureWorkspaceFiles(), call before responseinternal/execution/mock.gointernal/graders/grader.goWorkspaceFilesfield toContextinternal/orchestration/runner.goWorkspaceFilesinbuildGraderContext()internal/graders/diff_grader.goreadWorkspaceFile()preferring captured filesinternal/execution/workspace_test.gocaptureWorkspaceFiles()internal/graders/diff_grader_test.goTesting
go test ./internal/graders/... ./internal/execution/... ./internal/orchestration/...)go vet ./...cleanWorking as Linus (Backend Developer)