feat: implement quality scoring Layers 2+3 -- LLM judge and human override #230

@Aureliolo

Summary

Implement quality scoring Layers 2 and 3. Layer 1 (CI signals) is already implemented. Consolidates #230 (LLM judge) and #231 (human override).

Design Spec Reference

  • §8.3 Performance Tracking -- D2

Layer 2: LLM judge (formerly #230)

  • Small-model LLM judge from a different model family than the agent being scored
  • Evaluates task output against acceptance criteria
  • Integration with QualityScoringStrategy protocol
  • Cost target: ~1 EUR/day
  • Specific model to be evaluated at implementation time
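A minimal sketch of how the Layer 2 judge could plug into a scoring protocol. Everything here is an assumption for illustration: the `QualityScoringStrategy` signature, the `QualityScore` type, the `complete` callable (prompt in, raw model reply out), and the per-call cost, which is a placeholder rather than a real price. It shows the two constraints from the list above: the judge must come from a different model family than the scored agent, and spending stays under a daily budget.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class QualityScore:
    value: float  # normalized to [0.0, 1.0]
    source: str   # layer that produced the score


class QualityScoringStrategy(Protocol):
    # Hypothetical shape; the project's real protocol may differ.
    def score(
        self, task_output: str, criteria: list[str], agent_family: str
    ) -> QualityScore: ...


class BudgetExceeded(RuntimeError):
    """Raised when another judge call would exceed the daily budget."""


class LLMJudgeStrategy:
    def __init__(
        self,
        judge_family: str,
        complete: Callable[[str], str],  # assumed: prompt -> raw model reply
        daily_budget_eur: float = 1.0,
        est_cost_per_call_eur: float = 0.002,  # placeholder, not a real price
    ) -> None:
        self.judge_family = judge_family
        self.complete = complete
        self.daily_budget_eur = daily_budget_eur
        self.est_cost_per_call_eur = est_cost_per_call_eur
        self.spent_today_eur = 0.0  # caller resets this at the day boundary

    def score(
        self, task_output: str, criteria: list[str], agent_family: str
    ) -> QualityScore:
        # The judge must not share a model family with the scored agent.
        if agent_family == self.judge_family:
            raise ValueError("judge must come from a different model family")
        # Refuse the call if it would push spending past the daily budget.
        if self.spent_today_eur + self.est_cost_per_call_eur > self.daily_budget_eur:
            raise BudgetExceeded("daily judge budget reached")
        prompt = (
            "Rate from 0 to 1 how well the output meets the acceptance "
            "criteria. Reply with the number only.\n\nCriteria:\n"
            + "\n".join(f"- {c}" for c in criteria)
            + f"\n\nOutput:\n{task_output}"
        )
        reply = self.complete(prompt)
        self.spent_today_eur += self.est_cost_per_call_eur
        # float() raises ValueError on a malformed reply; clamp valid ones.
        value = min(1.0, max(0.0, float(reply.strip())))
        return QualityScore(value=value, source="llm_judge")
```

Passing the model call in as a plain callable keeps the sketch independent of any provider SDK, which fits the note that the specific model is still to be chosen.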

Layer 3: Human override via API (formerly #231)

  • API endpoint for human quality score override
  • Highest weight in the scoring composite
  • Integration with QualityScoringStrategy protocol and PerformanceTracker
  • Dashboard UI for submitting overrides
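One way the three layers could be combined is sketched below, assuming a weighted mean in which the human override carries the largest weight. The weight values, the layer keys, and both function names are hypothetical; the issue only states that the human layer has the highest weight in the composite.

```python
# Hypothetical weights; the issue only says the human layer weighs most.
LAYER_WEIGHTS = {"ci": 0.2, "llm_judge": 0.3, "human": 0.5}


def validate_override(score: float) -> float:
    """Reject human override scores outside the normalized range."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("override score must be within [0.0, 1.0]")
    return score


def composite_score(layer_scores: dict[str, float]) -> float:
    """Weighted mean over the known layers that reported a score.

    Weights are renormalized over the layers present, so a task without a
    human override is still scored from the CI and judge layers alone.
    """
    present = {k: v for k, v in layer_scores.items() if k in LAYER_WEIGHTS}
    if not present:
        raise ValueError("no known layer scores supplied")
    total_weight = sum(LAYER_WEIGHTS[k] for k in present)
    weighted_sum = sum(LAYER_WEIGHTS[k] * v for k, v in present.items())
    return weighted_sum / total_weight
```

Under these assumed weights, a CI score of 1.0 and a judge score of 0.5 give a composite of 0.7; adding a human override of 0.0 pulls it down to 0.35, which illustrates the override dominating the other two layers.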

Dependencies

Metadata

    Labels

    • prio:low: Nice to have, can defer
    • scope:medium: 1-3 days of work
    • spec:agent-system: DESIGN_SPEC Section 3 - Agent System
    • spec:budget: DESIGN_SPEC Section 10 - Cost & Budget Management
    • spec:hr: DESIGN_SPEC Section 8 - HR & Workforce Management
    • spec:providers: DESIGN_SPEC Section 9 - Model Provider Layer
    • type:feature: New feature implementation
    • v0.6: Minor version v0.6
    • v0.6.4: Patch release v0.6.4
