Skip to content

fix: diff grader reads post-execution workspace files#196

Merged
spboyer merged 1 commit into
mainfrom
squad/165-diff-grader-workspace
Apr 21, 2026
Merged

fix: diff grader reads post-execution workspace files#196
spboyer merged 1 commit into
mainfrom
squad/165-diff-grader-workspace

Conversation

@spboyer

@spboyer spboyer commented Apr 21, 2026

Copy link
Copy Markdown
Member

Summary

Closes #165

The diff grader was reading workspace files from the filesystem after the Copilot SDK session had disconnected. Since session.Disconnect() may restore workspace files to their pre-execution state, the grader would see original file content instead of the agent's modifications — causing contains fragment checks to fail even though the agent made the correct changes.

Root Cause

In CopilotEngine.Execute(), session.Disconnect() is deferred and fires when the function returns. By the time the runner calls runGraders(), the disconnect has already occurred. If the SDK restores/modifies workspace files during disconnect, the diff grader reads pre-execution content from the filesystem.

Fix

  1. Capture workspace files in memory before session disconnect via captureWorkspaceFiles() — walks the workspace dir and stores all file contents in a map[string][]byte
  2. Thread captured files through ExecutionResponse.WorkspaceFilesgraders.Context.WorkspaceFilesdiffGrader.workspaceFiles
  3. Diff grader prefers captured files over filesystem reads via readWorkspaceFile(), with forward-slash path normalization for cross-platform consistency
  4. Falls back to filesystem when no captured files are available (backward compatible)

Files Changed

File Change
internal/execution/engine.go Add WorkspaceFiles field to ExecutionResponse
internal/execution/copilot.go Add captureWorkspaceFiles(), call before response
internal/execution/mock.go Capture files in MockEngine for consistency
internal/graders/grader.go Add WorkspaceFiles field to Context
internal/orchestration/runner.go Pass WorkspaceFiles in buildGraderContext()
internal/graders/diff_grader.go Add readWorkspaceFile() preferring captured files
internal/execution/workspace_test.go Tests for captureWorkspaceFiles()
internal/graders/diff_grader_test.go 4 new tests for workspace file preference behavior

Testing

  • All existing tests pass (go test ./internal/graders/... ./internal/execution/... ./internal/orchestration/...)
  • go vet ./... clean
  • New tests verify: captured files preferred over filesystem, filesystem fallback, missing fragment detection, file-not-found handling

Working as Linus (Backend Developer)

Copilot AI review requested due to automatic review settings April 21, 2026 15:03
@spboyer spboyer added the squad:linus Assigned to Linus (Backend Developer) label Apr 21, 2026
@github-actions github-actions Bot enabled auto-merge (squash) April 21, 2026 15:06
The diff grader was reading workspace files from the filesystem after the
Copilot SDK session had disconnected. Since session.Disconnect() may
restore workspace files to their pre-execution state, the grader would
see the original file content instead of the agent's modifications.

Fix: capture all workspace file contents into memory before the session
disconnects, and have the diff grader prefer these captured files over
the on-disk workspace. This guarantees graders always see the true
post-execution state regardless of SDK disconnect behavior.

Changes:
- Add WorkspaceFiles field to ExecutionResponse and graders.Context
- Add captureWorkspaceFiles() that snapshots workspace before disconnect
- Add readWorkspaceFile() to diff grader that prefers captured files
  over filesystem reads, with forward-slash normalization for
  cross-platform consistency
- Add tests for the capture function and grader behavior

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer force-pushed the squad/165-diff-grader-workspace branch from d20ab87 to 5045760 Compare April 21, 2026 15:10
@spboyer spboyer merged commit f1a0fe6 into main Apr 21, 2026
3 of 5 checks passed
@spboyer spboyer deleted the squad/165-diff-grader-workspace branch April 21, 2026 15:10

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes diff-grader false negatives by snapshotting post-execution workspace files before the Copilot SDK session disconnects (which may revert files), then threading that snapshot through execution → runner → grader context so the diff grader can prefer in-memory content over on-disk reads. The PR also adds YAML-time validation for grader configs and updates the demo script accordingly.

Changes:

  • Capture post-execution workspace files in-memory and pass them through ExecutionResponsegraders.Context.
  • Update diff grader to prefer captured workspace content (with path normalization) and add targeted tests.
  • Add grader-config validation during YAML unmarshalling (+ tests) and refresh DEMO-SCRIPT.md.
Show a summary per file
File Description
internal/execution/engine.go Adds WorkspaceFiles snapshot field to the execution response.
internal/execution/copilot.go Captures workspace files pre-disconnect via captureWorkspaceFiles().
internal/execution/mock.go Mirrors workspace snapshot capture in the mock engine.
internal/execution/workspace_test.go Adds unit tests for workspace snapshot capture behavior.
internal/orchestration/runner.go Threads WorkspaceFiles into graders.Context.
internal/graders/grader.go Extends grading context with WorkspaceFiles.
internal/graders/diff_grader.go Prefers captured workspace files over filesystem reads.
internal/graders/diff_grader_test.go Adds tests for captured-vs-filesystem precedence and fallback.
internal/models/spec.go Adds grader-type required-field validation during YAML unmarshal.
internal/models/testcase.go Adds validator required-field validation during YAML unmarshal.
internal/models/grader_validation_test.go New tests covering validation failures/success cases.
DEMO-SCRIPT.md Updates demo flow and example YAML.

Copilot's findings

  • Files reviewed: 8/8 changed files
  • Comments generated: 2

Comment on lines +418 to +446
// captureWorkspaceFiles reads all files from the workspace directory into memory,
// preserving the post-execution state. This is called before session.Disconnect()
// to ensure graders see the agent's modifications even if the SDK restores the
// workspace to its pre-execution state during disconnect.
// Keys are forward-slash-separated paths relative to dir for cross-platform consistency.
func captureWorkspaceFiles(dir string) map[string][]byte {
if dir == "" {
return nil
}

files := make(map[string][]byte)
_ = filepath.WalkDir(dir, func(path string, d os.DirEntry, err error) error {
if err != nil || d.IsDir() {
return nil
}
rel, relErr := filepath.Rel(dir, path)
if relErr != nil {
return nil
}
content, readErr := os.ReadFile(path)
if readErr != nil {
return nil
}
// Normalize to forward slashes so map keys match eval YAML paths on all platforms.
files[filepath.ToSlash(rel)] = content
return nil
})
return files
}

Copilot AI Apr 21, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

captureWorkspaceFiles silently ignores WalkDir/Rel/ReadFile errors and reads every non-directory entry into memory with no size limits (and it will also read through symlinks). This can cause graders to see incomplete snapshots without explanation, and can also lead to large memory spikes / potential OOM if the workspace contains big trees (e.g., node_modules) or a symlink to a large file outside the workspace. Consider (1) returning an error (or at least logging when a file can’t be captured), (2) skipping symlinks/non-regular files, and (3) enforcing per-file and total snapshot byte limits (or capturing only files referenced by graders).

Copilot uses AI. Check for mistakes.
Comment on lines +135 to +159
// readWorkspaceFile returns the content of a workspace file, preferring captured
// post-execution files over the filesystem. Returns (content, true) on success or
// (nil, false) when the file is not found.
func (dg *diffGrader) readWorkspaceFile(workspaceDir string, filePath string) ([]byte, bool) {
normalizedPath := filepath.ToSlash(filePath)

// Prefer captured workspace files — they reflect post-execution state
// even if session disconnect modified the on-disk workspace.
if dg.workspaceFiles != nil {
if content, ok := dg.workspaceFiles[normalizedPath]; ok {
return content, true
}
}

// Fall back to reading from the workspace directory on disk.
if workspaceDir != "" {
fullPath := filepath.Join(workspaceDir, filePath)
content, err := os.ReadFile(fullPath)
if err == nil {
return content, true
}
}

return nil, false
}

Copilot AI Apr 21, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readWorkspaceFile collapses all filesystem read failures into “not found” (it drops the underlying error), which makes troubleshooting harder and can mislead users when the file exists but is unreadable (permissions, transient IO errors, etc.). Consider returning a third error value (or a sentinel) so checkExpectedFile can distinguish not-exist vs read-failed and preserve the more specific failure message.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

squad:linus Assigned to Linus (Backend Developer)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Diff grader reports fragments as missing when prompt grader confirms they exist in workspace

3 participants