Diff grader reports fragments as missing when prompt grader confirms they exist in workspace #165

Description

@haolingdong-msft

Bug: Diff grader reports fragments as missing when prompt grader confirms they exist

Summary

When running an evaluation task, the diff grader reports expected fragments as missing from a file, even though the prompt grader (LLM-as-judge) confirms the exact same changes are present and correct in the workspace.

Environment

  • waza version: latest
  • OS: Windows
  • Model: claude-opus-4.6-1m

Task Definition

graders:
  - type: diff
    name: version_enum_check
    config:
      expected_files:
        - path: "Microsoft.Widget/Widget/main.tsp"
          contains:
            - "+@previewVersion"
            - '+v2025-05-04-preview: "2025-05-04-preview"'
  - type: prompt
    name: setup_check
    config:
      prompt: |
        A new preview version `2025-05-04-preview` has been added to the Versions enum in main.tsp.
        The new version is decorated with @previewVersion and has the correct version string.
        If all criteria are met, call set_waza_grade_pass.
        Otherwise, call set_waza_grade_fail with your reasoning.
      model: "gpt-4o-mini"

Grader Results

setup_check (prompt grader) — ✅ PASSED (score: 1.0)

feedback: "All prompts passed"
response: "✅ Pass — 2025-05-04-preview is properly added to the Versions enum 
with @Azure.Core.previewVersion and the correct version string."

version_enum_check (diff grader) — ❌ FAILED (score: 0.33)

feedback: "File Microsoft.Widget/Widget/main.tsp missing expected fragment: @previewVersion; 
File Microsoft.Widget/Widget/main.tsp missing expected fragment: v2025-05-04-preview: \"2025-05-04-preview\""

Expected Behavior

Since the prompt grader confirms @previewVersion and the version string are present in the file, the diff grader should also detect these fragments.

Possible Causes

I suspect the diff grader validates the pre-execution workspace files rather than the post-execution workspace files.
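To help separate a stale-workspace problem from a genuine text mismatch, here is a minimal sketch of a literal substring check that could be run by hand against the post-execution file. It assumes the diff grader strips the leading `+` and matches fragments literally; that is a guess based on the feedback text, not something verified against the waza source. The `missing_fragments` helper and the sample text are hypothetical. The prompt grader's response mentions `@Azure.Core.previewVersion`, and a fully qualified decorator like that would not contain `@previewVersion` as a literal substring.

```python
def missing_fragments(text: str, fragments: list[str]) -> list[str]:
    """Return the fragments that do not occur in text as literal substrings.

    A leading '+' is stripped, mirroring how the feedback reports the
    fragments without it. How the real grader treats '+' is an assumption.
    """
    missing = []
    for fragment in fragments:
        needle = fragment[1:] if fragment.startswith("+") else fragment
        if needle not in text:
            missing.append(needle)
    return missing


# Hypothetical file content for illustration, not copied from the workspace.
sample = '@Azure.Core.previewVersion\nv2025-05-04-preview: "2025-05-04-preview",\n'
expected = ["+@previewVersion", '+v2025-05-04-preview: "2025-05-04-preview"']

print(missing_fragments(sample, expected))  # ['@previewVersion']
```

If the same check against the real post-execution `main.tsp` reports nothing missing, that would point back to the grader reading the wrong snapshot of the workspace.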

Full Test Result (relevant sections)

{
  "version_enum_check": {
    "type": "diff",
    "score": 0.3333333333333333,
    "passed": false,
    "feedback": "File Microsoft.Widget/Widget/main.tsp missing expected fragment: @previewVersion; File Microsoft.Widget/Widget/main.tsp missing expected fragment: v2025-05-04-preview: \"2025-05-04-preview\"",
    "details": {
      "expected_files": [
        {
          "contains": ["+@previewVersion", "+v2025-05-04-preview: \"2025-05-04-preview\""],
          "path": "Microsoft.Widget/Widget/main.tsp"
        }
      ],
      "failures": [
        "File Microsoft.Widget/Widget/main.tsp missing expected fragment: @previewVersion",
        "File Microsoft.Widget/Widget/main.tsp missing expected fragment: v2025-05-04-preview: \"2025-05-04-preview\""
      ],
      "workspace_dir": "C:\\Users\\HAOLIN~1\\AppData\\Local\\Temp\\waza-3490592053"
    }
  },
  "setup_check": {
    "type": "prompt",
    "score": 1,
    "passed": true,
    "feedback": "All prompts passed",
    "details": {
      "response": "✅ Pass — 2025-05-04-preview is properly added to the Versions enum with @Azure.Core.previewVersion and the correct version string."
    }
  }
}
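For anyone reproducing this, a small sketch like the following could flag runs where graders disagree. It assumes the result layout shown above (a mapping from grader name to an object with a `passed` field); the helper names are hypothetical and the dictionary is a trimmed copy of the result, not the full output.json.

```python
def summarize(results: dict) -> dict:
    """Map each grader name to its pass/fail outcome."""
    return {name: entry["passed"] for name, entry in results.items()}


def graders_disagree(results: dict) -> bool:
    """True when at least one grader passed and at least one failed."""
    return len({entry["passed"] for entry in results.values()}) > 1


# Trimmed copy of the grader results from this issue.
results = {
    "version_enum_check": {"type": "diff", "passed": False},
    "setup_check": {"type": "prompt", "passed": True},
}

print(summarize(results))
print(graders_disagree(results))
```

When the graders disagree, the `workspace_dir` recorded in the diff grader's `details` is the directory to inspect by hand.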

Full output.json

claude-opus-4.6-1m.json

Labels

bug (Something isn't working)
