
feat: five-pillar evaluation framework for HR performance tracking #1017

Merged
Aureliolo merged 7 commits into main from feat/hr-evaluation-framework
Apr 3, 2026

Conversation

@Aureliolo
Owner

Summary

Implements a structured five-pillar agent evaluation framework based on the InfoQ evaluation framework, with fully pluggable pillars and metrics that can be independently enabled/disabled via EvaluationConfig.

Pillars

| Pillar | Strategy | Data sources |
|---|---|---|
| Intelligence/Accuracy | QualityBlendIntelligenceStrategy (70% CI / 30% LLM calibration) | QualityScoreResult, LlmCalibrationRecord |
| Performance/Efficiency | Inline from WindowMetrics (40% cost, 30% time, 30% tokens) | WindowMetrics averages |
| Reliability/Resilience | TaskBasedResilienceStrategy (success rate, recovery, consistency, streaks) | TaskMetricRecord sequences |
| Responsibility/Governance | AuditBasedGovernanceStrategy (audit compliance, trust, autonomy) | Audit log, trust system, autonomy system |
| User Experience | FeedbackBasedUxStrategy (clarity, tone, helpfulness, trust, satisfaction) | InteractionFeedback records |

Key design decisions (D24)

  • Single PillarScoringStrategy protocol with EvaluationContext bag
  • Per-pillar sub-configs with metric-level enable/disable toggles and configurable weights
  • Disabled pillars/metrics have weight redistributed proportionally (redistribute_weights utility)
  • Pull-based evaluation (no background daemon) -- EvaluationService.evaluate() called on demand
  • Human-calibrated LLM labeling reuses existing LlmCalibrationSampler -- drift above threshold reduces intelligence pillar confidence
  • All pillars ship enabled with recommended defaults
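The proportional redistribution rule can be sketched as a standalone function. This is a hypothetical reconstruction; the real `redistribute_weights` utility in the PR may have a different signature:

```python
def redistribute_weights(metrics: list[tuple[str, float, bool]]) -> dict[str, float]:
    """Drop disabled metrics and rescale the remaining weights to sum to 1.

    Each entry is (name, configured_weight, enabled). Disabled entries are
    removed and their weight flows proportionally to the enabled ones.
    """
    enabled = [(name, weight) for name, weight, on in metrics if on]
    if not enabled:
        raise ValueError("at least one metric must be enabled")
    total = sum(weight for _, weight in enabled)
    return {name: weight / total for name, weight in enabled}


# With llm_calibration disabled, its 0.3 weight flows back to ci_quality.
weights = redistribute_weights(
    [("ci_quality", 0.7, True), ("llm_calibration", 0.3, False)]
)
print(weights)  # {'ci_quality': 1.0}
```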

New files

  • src/synthorg/hr/evaluation/ (10 source files)
  • src/synthorg/observability/events/evaluation.py
  • tests/unit/hr/evaluation/ (10 test files, 126 tests)

Also

  • Design spec D24 in docs/design/agents.md + docs/architecture/decisions.md
  • Updated CLAUDE.md Package Structure and logging examples
  • Updated DESIGN_SPEC.md and design/index.md descriptions

Test plan

  • 126 unit tests covering all models, configs, strategies, evaluator, and edge cases
  • All tests marked @pytest.mark.unit
  • mypy strict clean, ruff clean, 0 lint warnings
  • Pre-reviewed by 12 agents, 25 findings addressed

Review coverage

12 agents: code-reviewer, python-reviewer, test-analyzer, silent-failure-hunter, type-design-analyzer, logging-audit, conventions-enforcer, async-reviewer, issue-verifier, docs-consistency, resilience-audit, comment-analyzer. All 25 valid findings implemented.

Closes #699

Implement structured five-pillar agent evaluation based on the InfoQ
evaluation framework. Each pillar and its individual metrics can be
independently enabled/disabled via EvaluationConfig.

Pillars:
- Intelligence/Accuracy: blends CI quality score with LLM calibration
- Performance/Efficiency: normalized cost, time, token metrics
- Reliability/Resilience: success rate, recovery, consistency, streaks
- Responsibility/Governance: audit compliance, trust, autonomy
- User Experience: clarity, tone, helpfulness, trust, satisfaction

New hr/evaluation/ subpackage (10 files):
- EvaluationPillar enum, PillarScore/EvaluationReport/InteractionFeedback/
  ResilienceMetrics/EvaluationContext models
- EvaluationConfig with per-pillar sub-configs and metric toggles
- PillarScoringStrategy protocol (single protocol, single context bag)
- Four default strategies + inline efficiency computation
- EvaluationService orchestrator with concurrent pillar scoring
- redistribute_weights() utility for weight redistribution

Also:
- Observability events (eval.* namespace)
- Design spec D16 decision in docs/design/agents.md
- 118 unit tests, mypy clean, ruff clean

Closes #699
Pre-reviewed by 12 agents, 25 findings addressed:

Source fixes:
- Extract evaluate() into 4 helper methods (was 139 lines, now <50 each)
- Extract _score_efficiency into sub-score + builder helpers
- Extract _compute_resilience_metrics into module-level helpers
- Add EVAL_CALIBRATION_DRIFT_HIGH log on drift detection
- Add EVAL_PILLAR_INSUFFICIENT_DATA logs on efficiency early returns
- Add EVAL_WEIGHTS_REDISTRIBUTED log on weight redistribution
- Add confidence kwarg to efficiency pillar log (consistency)
- Change record_feedback to sync def (no await needed)
- Use setdefault pattern for feedback dict
- Fix data_points = len(...) or 1 -> len(...) in intelligence
- Add _FULL_CONFIDENCE_DATA_POINTS named constant (replaces magic 10.0)
- Remove unreachable max(1, total_audits) guards in governance
- Add warning log for unknown trust levels in governance
- Add at-least-one-metric-enabled validators to all 5 sub-configs
- Add agent_id consistency validator to EvaluationContext
- Fix docstrings: PillarScore 'mirrors' -> 'extends', CI spelled out,
  resilience 'inverse' -> 'linear penalty', config module docstring
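The at-least-one-metric-enabled rule can be sketched as a Pydantic v2 `model_validator`. Field names mirror the PR description; the actual sub-config class is an assumption:

```python
from pydantic import BaseModel, ConfigDict, model_validator


class IntelligenceConfig(BaseModel):
    """Hypothetical per-pillar sub-config with metric toggles and weights."""

    model_config = ConfigDict(frozen=True, allow_inf_nan=False)

    ci_quality_enabled: bool = True
    ci_quality_weight: float = 0.7
    llm_calibration_enabled: bool = True
    llm_calibration_weight: float = 0.3

    @model_validator(mode="after")
    def _at_least_one_metric_enabled(self) -> "IntelligenceConfig":
        # Reject configs where every metric toggle is off.
        if not (self.ci_quality_enabled or self.llm_calibration_enabled):
            raise ValueError("at least one intelligence metric must be enabled")
        return self
```

Pydantic's `ValidationError` subclasses `ValueError`, so disabling both metrics fails at construction time rather than surfacing later as a zero-weight score.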

Docs fixes:
- Renumber D16 -> D24 (collision with Docker sandbox decision)
- Add D24 row to docs/architecture/decisions.md
- Update CLAUDE.md Package Structure with evaluation/
- Add evaluation event example to CLAUDE.md logging section
- Add 'evaluation' to DESIGN_SPEC.md and design/index.md descriptions

Test fixes:
- Add pytestmark = pytest.mark.unit to all 8 test files
- Add tests: shuffled records, failure-ending pattern, explicit now,
  CI disabled + LLM only, unknown trust level, feedback-to-evaluation
  end-to-end pipeline
- Fix tests for sync record_feedback and new config validators
Copilot AI review requested due to automatic review settings April 2, 2026 20:57
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA 221af46.
Ensure that dependencies are being submitted on PR branches. Re-running this action after a short time may resolve the issue. See the documentation for more information and troubleshooting advice.

Scanned Files

None

@coderabbitai

coderabbitai bot commented Apr 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: bffe7901-ebbd-42c7-9a72-b22e103f7ae8

📥 Commits

Reviewing files that changed from the base of the PR and between fcff01f and 221af46.

📒 Files selected for processing (1)
  • tests/unit/hr/evaluation/test_models.py
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Build Sandbox
  • GitHub Check: Build Backend
  • GitHub Check: Build Web
  • GitHub Check: Dependency Review
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: No from __future__ import annotations in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use except A, B: (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use @computed_field for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)
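The frozen-model and `@computed_field` guidelines above can be illustrated with a minimal sketch. The `TokenUsage` shape is inferred from the guideline's own example, not copied from the repository:

```python
from pydantic import BaseModel, ConfigDict, computed_field


class TokenUsage(BaseModel):
    """Frozen model with a derived field instead of a stored redundant one."""

    model_config = ConfigDict(frozen=True, allow_inf_nan=False)

    prompt_tokens: int
    completion_tokens: int

    @computed_field  # derived on access, so it can never drift out of sync
    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


usage = TokenUsage(prompt_tokens=120, completion_tokens=30)
print(usage.total_tokens)  # 150
```

Because `total_tokens` is a `@computed_field`, it also appears in `model_dump()` output without being stored or validated as a separate field.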

Files:

  • tests/unit/hr/evaluation/test_models.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: All Python test files must use @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer @pytest.mark.parametrize for testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (@given + @settings); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions
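The `asyncio.Event().wait()` guideline can be demonstrated with a small self-contained test double (hypothetical names, not from the repo): the coroutine blocks until cancelled, with no timing assumption to go flaky:

```python
import asyncio


async def stubborn_provider() -> str:
    """Simulates a provider that never responds; cancellation-safe."""
    await asyncio.Event().wait()  # blocks forever; no sleep(large_number)
    return "unreachable"


async def main() -> bool:
    task = asyncio.create_task(stubborn_provider())
    await asyncio.sleep(0)  # yield once so the task reaches its await point
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return True  # cancellation propagated cleanly
    return False


print(asyncio.run(main()))  # True
```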

Files:

  • tests/unit/hr/evaluation/test_models.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...
🔇 Additional comments (6)
tests/unit/hr/evaluation/test_models.py (6)

21-21: Unit test marker is correctly applied.

pytestmark = pytest.mark.unit is present at module scope and aligns with test-marker requirements.


26-74: redistribute_weights coverage is strong and well-balanced.

This block checks happy paths, redistribution math, empty/all-disabled failures, and sum-to-one invariants.


79-199: InteractionFeedback tests cover key validation and immutability paths well.

Good coverage for bounds, max length, required-content rule, ID generation, and frozen-model behavior.


204-271: ResilienceMetrics invariant testing is thorough.

The cases validate cross-field constraints and frozen behavior clearly.


327-338: Agent/snapshot consistency validation is correctly exercised.

The mismatch test directly targets an important model boundary condition.


344-481: EvaluationReport tests provide solid guardrail coverage.

Duplicate detection, bounds checks, ID generation, frozen enforcement, and weights/scores alignment are all covered.


Walkthrough

Adds a five‑pillar HR evaluation framework under src/synthorg/hr/evaluation/: new Pydantic configs (EvaluationConfig and per‑pillar configs), domain models (PillarScore, EvaluationContext, EvaluationReport, InteractionFeedback, ResilienceMetrics), constants and enums, a PillarScoringStrategy protocol, weight‑redistribution helpers, and observability event constants. Implements EvaluationService (orchestrator with inline efficiency scoring) and four pillar strategies (intelligence, resilience, governance, experience). Adds extensive unit tests and fixtures, test coverage updates, and corresponding documentation and design/decision entries (including D24 and CLAUDE.md edits).
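The single-protocol design the walkthrough describes can be sketched roughly as follows. Names follow the PR summary; exact signatures and fields are assumptions:

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class EvaluationContext:
    """Shared context 'bag' handed to every pillar strategy."""

    agent_id: str
    task_records: tuple = ()
    calibration_records: tuple = ()


@dataclass(frozen=True)
class PillarScore:
    pillar: str
    score: float
    confidence: float


class PillarScoringStrategy(Protocol):
    """Single protocol for all pillars; each strategy reads only the
    context fields it needs and ignores the rest."""

    @property
    def pillar(self) -> str: ...

    async def score(self, context: EvaluationContext) -> PillarScore: ...


class ConstantStrategy:
    """Toy strategy returning a fixed score, just to exercise the protocol."""

    pillar = "intelligence"

    async def score(self, context: EvaluationContext) -> PillarScore:
        return PillarScore(pillar=self.pillar, score=0.8, confidence=1.0)


strategy: PillarScoringStrategy = ConstantStrategy()
result = asyncio.run(strategy.score(EvaluationContext(agent_id="agent-1")))
print(result.score)  # 0.8
```

With one protocol and one context type, an orchestrator can fan out over all enabled strategies uniformly instead of wiring bespoke inputs per pillar.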

Suggested labels

autorelease: tagged

🚥 Pre-merge checks: ✅ 5 passed

  • Title check: ✅ Passed. The title 'feat: five-pillar evaluation framework for HR performance tracking' clearly and directly summarizes the main change: implementing a five-pillar evaluation framework, which aligns with all file additions and documentation updates.
  • Description check: ✅ Passed. The description comprehensively documents the five-pillar framework implementation, pillar strategies, design decisions, files added, and test coverage, all directly related to the changeset.
  • Linked Issues check: ✅ Passed. The PR fulfills all coding objectives from issue #699: implements all five pillars with pluggable strategies, maps the framework to HR tracking, designs structured UX measurement with InteractionFeedback, and uses a pull-based EvaluationService with human-calibrated LLM integration via LlmCalibrationSampler.
  • Out of Scope Changes check: ✅ Passed. All changes are in scope: 10 evaluation framework modules, strategies, configs, models, tests, observability events, and documentation updates directly support the five-pillar framework objective from issue #699. No unrelated changes detected.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 49.23%, above the required threshold of 40.00%.




Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive "Five-Pillar Evaluation Framework" for tracking agent performance within the HR module. The framework assesses agents across Intelligence, Efficiency, Resilience, Governance, and User Experience using pluggable scoring strategies and a centralized EvaluationService. Key features include configurable metric toggles with automatic weight redistribution, structured logging for evaluation events, and detailed documentation of the design and architectural decisions. Feedback was provided regarding the QualityBlendIntelligenceStrategy to ensure that weight redistribution logic remains consistent and robust when calibration data is unavailable.

Comment on lines +77 to +125
        # Build enabled metrics list.
        metrics: list[tuple[str, float, bool]] = []
        if cfg.ci_quality_enabled:
            metrics.append(("ci_quality", cfg.ci_quality_weight, True))
        if cfg.llm_calibration_enabled:
            metrics.append(("llm_calibration", cfg.llm_calibration_weight, True))

        if not metrics:
            return PillarScore(
                pillar=self.pillar,
                score=_NEUTRAL_SCORE,
                confidence=0.0,
                strategy_name=NotBlankStr(self.name),
                data_point_count=0,
                evaluated_at=context.now,
            )

        weights = redistribute_weights(metrics)

        # Compute CI quality component.
        breakdown: list[tuple[str, float]] = []
        weighted_sum = 0.0
        data_points = len(context.task_records)

        if "ci_quality" in weights:
            breakdown.append(("ci_quality", round(ci_score, 4)))
            weighted_sum += ci_score * weights["ci_quality"]

        # Compute LLM calibration component.
        calibration_drift = 0.0
        if "llm_calibration" in weights:
            records = context.calibration_records
            if records:
                avg_llm = sum(r.llm_score for r in records) / len(records)
                breakdown.append(("llm_calibration", round(avg_llm, 4)))
                weighted_sum += avg_llm * weights["llm_calibration"]
                calibration_drift = sum(r.drift for r in records) / len(records)
                data_points += len(records)
            else:
                logger.debug(
                    EVAL_METRIC_SKIPPED,
                    agent_id=context.agent_id,
                    pillar=self.pillar.value,
                    metric="llm_calibration",
                    reason="no_calibration_records",
                )
                # Redistribute to CI quality only.
                weighted_sum = ci_score
                breakdown = [("ci_quality", round(ci_score, 4))]
Contributor


Severity: high

The logic for handling missing calibration records appears to be incorrect. When llm_calibration is enabled but no data is available, the code at line 124 (weighted_sum = ci_score) overwrites the previously calculated weighted ci_score. This results in the final score being the raw ci_score, rather than a correctly weighted score where ci_quality receives 100% of the weight.

This can be fixed by refactoring to follow the pattern used in other strategies (e.g., FeedbackBasedUxStrategy): first, determine which metrics have available data, then redistribute weights among only those metrics, and finally compute the weighted sum. This makes the logic more robust and consistent across strategies.

        # Build a list of metrics that are enabled and have data.
        available: list[tuple[str, float, float]] = []  # (name, weight, score)
        data_points = 0
        calibration_drift = 0.0

        if cfg.ci_quality_enabled:
            available.append(("ci_quality", cfg.ci_quality_weight, ci_score))
            data_points += len(context.task_records)

        if cfg.llm_calibration_enabled:
            records = context.calibration_records
            if records:
                avg_llm = sum(r.llm_score for r in records) / len(records)
                available.append(("llm_calibration", cfg.llm_calibration_weight, avg_llm))
                calibration_drift = sum(r.drift for r in records) / len(records)
                data_points += len(records)
            else:
                logger.debug(
                    EVAL_METRIC_SKIPPED,
                    agent_id=context.agent_id,
                    pillar=self.pillar.value,
                    metric="llm_calibration",
                    reason="no_calibration_records",
                )

        if not available:
            # This case is already handled by the initial ci_score check,
            # but it's a good safeguard.
            return PillarScore(
                pillar=self.pillar,
                score=_NEUTRAL_SCORE,
                confidence=0.0,
                strategy_name=NotBlankStr(self.name),
                data_point_count=0,
                evaluated_at=context.now,
            )

        # Redistribute weights among metrics with data.
        weights = redistribute_weights([(name, w, True) for name, w, _ in available])
        scores = {name: s for name, _, s in available}

        weighted_sum = sum(scores[k] * weights[k] for k in weights)
        breakdown = list(sorted(scores.items()))

Contributor

Copilot AI left a comment


Pull request overview

Adds a new five-pillar evaluation subsystem under hr/ to compute on-demand, configurable evaluation reports (intelligence, efficiency, resilience, governance, UX) and integrates it with structured observability events, tests, and design docs.

Changes:

  • Introduces EvaluationService orchestrator plus pluggable pillar strategies and frozen Pydantic models/configs under src/synthorg/hr/evaluation/.
  • Adds evaluation-specific structured logging event constants and updates event-module discovery tests.
  • Documents the new D24 decision and five-pillar framework in the design/spec docs; adds comprehensive unit tests.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
|---|---|
| tests/unit/observability/test_events.py | Adds evaluation to expected observability event domain modules. |
| tests/unit/hr/evaluation/test_resilience_strategy.py | Unit tests for resilience strategy behavior and toggles. |
| tests/unit/hr/evaluation/test_models.py | Unit tests for evaluation models + redistribute_weights. |
| tests/unit/hr/evaluation/test_intelligence_strategy.py | Unit tests for intelligence strategy blending + drift behavior. |
| tests/unit/hr/evaluation/test_governance_strategy.py | Unit tests for governance strategy scoring + toggles. |
| tests/unit/hr/evaluation/test_experience_strategy.py | Unit tests for UX feedback-based scoring + redistribution. |
| tests/unit/hr/evaluation/test_evaluator.py | Unit tests for EvaluationService orchestration and pipelines. |
| tests/unit/hr/evaluation/test_enums.py | Unit tests for EvaluationPillar enum. |
| tests/unit/hr/evaluation/test_config.py | Unit tests for per-pillar configs and validation rules. |
| tests/unit/hr/evaluation/conftest.py | Shared fixtures/builders for evaluation tests. |
| tests/unit/hr/evaluation/__init__.py | Test package marker for evaluation tests. |
| src/synthorg/observability/events/evaluation.py | New structured logging event constants for evaluation domain. |
| src/synthorg/hr/evaluation/resilience_strategy.py | TaskBasedResilienceStrategy implementation. |
| src/synthorg/hr/evaluation/pillar_protocol.py | PillarScoringStrategy protocol for pluggable pillars. |
| src/synthorg/hr/evaluation/models.py | Frozen Pydantic models for context, scores, reports, feedback, metrics. |
| src/synthorg/hr/evaluation/intelligence_strategy.py | QualityBlendIntelligenceStrategy implementation. |
| src/synthorg/hr/evaluation/governance_strategy.py | AuditBasedGovernanceStrategy implementation. |
| src/synthorg/hr/evaluation/experience_strategy.py | FeedbackBasedUxStrategy implementation. |
| src/synthorg/hr/evaluation/evaluator.py | EvaluationService orchestrator + inline efficiency scoring and resilience derivations. |
| src/synthorg/hr/evaluation/enums.py | EvaluationPillar enum (five pillars). |
| src/synthorg/hr/evaluation/config.py | EvaluationConfig and per-pillar sub-configs with toggles/weights. |
| src/synthorg/hr/evaluation/__init__.py | Package docstring for the evaluation framework. |
| docs/design/index.md | Updates design index summary to include evaluation under Agents & HR. |
| docs/design/agents.md | Adds the five-pillar evaluation framework section + D24 note. |
| docs/DESIGN_SPEC.md | Updates design spec index to include evaluation in Agents & HR description. |
| docs/architecture/decisions.md | Adds decision D24 entry describing evaluation framework design choices. |
| CLAUDE.md | Updates package structure and logging examples to include evaluation domain/events. |


Comment on lines +107 to +125
        if "llm_calibration" in weights:
            records = context.calibration_records
            if records:
                avg_llm = sum(r.llm_score for r in records) / len(records)
                breakdown.append(("llm_calibration", round(avg_llm, 4)))
                weighted_sum += avg_llm * weights["llm_calibration"]
                calibration_drift = sum(r.drift for r in records) / len(records)
                data_points += len(records)
            else:
                logger.debug(
                    EVAL_METRIC_SKIPPED,
                    agent_id=context.agent_id,
                    pillar=self.pillar.value,
                    metric="llm_calibration",
                    reason="no_calibration_records",
                )
                # Redistribute to CI quality only.
                weighted_sum = ci_score
                breakdown = [("ci_quality", round(ci_score, 4))]

Copilot AI Apr 2, 2026


When llm_calibration is enabled but there are no calibration records, the fallback unconditionally sets weighted_sum = ci_score and breakdown = [("ci_quality", ...)]. This breaks the metric toggles: if ci_quality_enabled is false (LLM-only mode), this code still uses CI quality and emits a ci_quality breakdown. Suggestion: only fall back to CI quality if ci_quality is actually enabled; otherwise treat this as insufficient data (neutral score + 0 confidence) or skip the LLM metric and return neutral/insufficient-data for the pillar.


Blends existing CI (continuous integration) signal quality score with
LLM calibration data. When LLM calibration is disabled or unavailable,
falls back to CI quality alone with reduced confidence.

Copilot AI Apr 2, 2026


Module docstring says the CI-only fallback happens “with reduced confidence”, but the implementation computes confidence solely from data_points and does not reduce it when LLM calibration is disabled/unavailable. Either update the docstring to match the behavior, or explicitly down-weight confidence when the LLM component is disabled or skipped due to missing calibration records.

Suggested change:
- falls back to CI quality alone with reduced confidence.
+ falls back to CI quality alone.

Comment on lines +109 to +118
        base_trust = _TRUST_LEVEL_SCORES.get(trust_key, _NEUTRAL_SCORE)
        if trust_key not in _TRUST_LEVEL_SCORES:
            logger.warning(
                EVAL_PILLAR_SCORED,
                agent_id=context.agent_id,
                pillar=self.pillar.value,
                warning="unknown_trust_level",
                trust_level=trust_key,
                fallback_score=_NEUTRAL_SCORE,
            )

Copilot AI Apr 2, 2026


The warning for an unknown trust_level logs with EVAL_PILLAR_SCORED ("eval.pillar.scored"), which makes it hard to distinguish normal scoring events from exceptional/diagnostic conditions in log queries and metrics. Suggest introducing a dedicated event constant for this condition (e.g., eval.governance.unknown_trust_level) or reusing an existing “skipped/insufficient” event if appropriate, while keeping eval.pillar.scored for the successful final score debug log.

@codecov

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 96.67195% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.77%. Comparing base (bb5c9a4) to head (221af46).
⚠️ Report is 9 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/synthorg/hr/evaluation/pillar_protocol.py | 0.00% | 10 Missing ⚠️ |
| src/synthorg/hr/evaluation/governance_strategy.py | 95.52% | 1 Missing and 2 partials ⚠️ |
| src/synthorg/hr/evaluation/models.py | 97.61% | 2 Missing and 1 partial ⚠️ |
| src/synthorg/hr/evaluation/resilience_strategy.py | 94.91% | 1 Missing and 2 partials ⚠️ |
| src/synthorg/hr/evaluation/evaluator.py | 98.71% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1017      +/-   ##
==========================================
+ Coverage   91.69%   91.77%   +0.08%     
==========================================
  Files         658      669      +11     
  Lines       36108    36739     +631     
  Branches     3568     3625      +57     
==========================================
+ Hits        33109    33719     +610     
- Misses       2374     2389      +15     
- Partials      625      631       +6     


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/synthorg/hr/evaluation/evaluator.py`:
- Around line 188-252: The _resolve_enabled_pillars method is long due to the
large inline pillar_map; extract the pillar configuration to a separate helper
or constant to reduce method length. Create a new function or module-level
constant (e.g., _pillar_config or _build_pillar_map) that returns the list of
tuples currently assigned to pillar_map (using EvaluationPillar entries and
wiring in self._intelligence, self._resilience, self._governance, self._ux where
needed), then update _resolve_enabled_pillars to call that helper, keep the same
logic around enabled collection, redistribute_weights, and returns, and ensure
references to pillar_map, redistribute_weights, EvaluationPillar, and the
strategy attributes (_intelligence, _resilience, _governance, _ux) match the
existing names so behavior is unchanged.

In `@src/synthorg/hr/evaluation/experience_strategy.py`:
- Line 145: The confidence formula in evaluate_experience (or the surrounding
function in src/synthorg/hr/evaluation/experience_strategy.py) uses a magic
multiplier `3`; extract this into a module-level constant named
_FULL_CONFIDENCE_FEEDBACK_MULTIPLIER and replace the literal with that constant
in the line computing confidence (confidence = min(1.0, len(feedback) /
(cfg.min_feedback_count * _FULL_CONFIDENCE_FEEDBACK_MULTIPLIER))). Add the
constant near other strategy constants (e.g., alongside
_FULL_CONFIDENCE_DATA_POINTS) and update any imports or references accordingly.

In `@src/synthorg/hr/evaluation/governance_strategy.py`:
- Around line 110-118: Replace the misleading EVAL_PILLAR_SCORED event used when
logging an unknown trust level in governance_strategy.py: add a new event
constant (e.g., EVAL_METRIC_FALLBACK or EVAL_UNKNOWN_TRUST_LEVEL) to
synthorg/observability/events/evaluation.py and update the warning call in the
method that contains the trust_key check (the block using logger.warning with
agent_id=context.agent_id and pillar=self.pillar.value) to use that new
constant; alternatively, if you prefer not to add a constant, change the
logger.warning call to a generic structured warning event name (e.g.,
"eval_metric_fallback") so the log semantically matches the fallback case.

In `@src/synthorg/hr/evaluation/intelligence_strategy.py`:
- Around lines 115-125: The fallback branch for when llm_calibration is enabled
but has no records overwrites the previously computed weighted_sum and
discards the weight-redistribution logic. Instead, build the final score from
the already-determined components: keep the redistributed weight applied to
the CI component, adjust the breakdown to contain only
("ci_quality", round(ci_score, 4)), and compute weighted_sum once from those
components (or recompute it from the redistribution logic) rather than
assigning weighted_sum = ci_score. Search for llm_calibration,
EVAL_METRIC_SKIPPED, weighted_sum, breakdown, and ci_score to locate the
assignment, and move the final score computation after all components are
finalized.

In `@src/synthorg/hr/evaluation/resilience_strategy.py`:
- Around lines 48-156: The score method in resilience_strategy.py is too
large; split it into small helpers to meet the <50-line rule by extracting the
logical blocks:
  1. input/early-return checks into
     validate_and_handle_insufficient_data(context), which returns an optional
     PillarScore;
  2. metric derivation into build_enabled_metrics_and_scores(context, rm, cfg),
     which returns enabled_metrics and scores;
  3. weighting and aggregation into compute_final_score(scores,
     enabled_metrics), which calls redistribute_weights;
  4. result assembly into assemble_pillar_score(context, final_score, scores,
     rm), which builds the PillarScore and logs.
  Keep the public async score(...) as a thin orchestrator that calls these
  helpers, preserving the names score, redistribute_weights, PillarScore,
  EvaluationContext, EVAL_PILLAR_INSUFFICIENT_DATA, and EVAL_PILLAR_SCORED so
  callers and tests remain valid.
- Around lines 120-128: Before returning the neutral PillarScore when
enabled_metrics is empty, emit an INFO-level observability event describing
the state transition. Add a call to the available logger (preferably
context.logger.info(...), falling back to the module logger if no context
logger) immediately before the existing return in the branch that checks
enabled_metrics, and include key fields (self.pillar, self.name,
rm.total_tasks, and context.now) so the neutral outcome is traceable. Leave
the returned PillarScore construction (PillarScore(...)) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c7beee32-35a7-46a6-9757-79d9a3709504

📥 Commits

Reviewing files that changed from the base of the PR and between bb5c9a4 and 10f0772.

📒 Files selected for processing (27)
  • CLAUDE.md
  • docs/DESIGN_SPEC.md
  • docs/architecture/decisions.md
  • docs/design/agents.md
  • docs/design/index.md
  • src/synthorg/hr/evaluation/__init__.py
  • src/synthorg/hr/evaluation/config.py
  • src/synthorg/hr/evaluation/enums.py
  • src/synthorg/hr/evaluation/evaluator.py
  • src/synthorg/hr/evaluation/experience_strategy.py
  • src/synthorg/hr/evaluation/governance_strategy.py
  • src/synthorg/hr/evaluation/intelligence_strategy.py
  • src/synthorg/hr/evaluation/models.py
  • src/synthorg/hr/evaluation/pillar_protocol.py
  • src/synthorg/hr/evaluation/resilience_strategy.py
  • src/synthorg/observability/events/evaluation.py
  • tests/unit/hr/evaluation/__init__.py
  • tests/unit/hr/evaluation/conftest.py
  • tests/unit/hr/evaluation/test_config.py
  • tests/unit/hr/evaluation/test_enums.py
  • tests/unit/hr/evaluation/test_evaluator.py
  • tests/unit/hr/evaluation/test_experience_strategy.py
  • tests/unit/hr/evaluation/test_governance_strategy.py
  • tests/unit/hr/evaluation/test_intelligence_strategy.py
  • tests/unit/hr/evaluation/test_models.py
  • tests/unit/hr/evaluation/test_resilience_strategy.py
  • tests/unit/observability/test_events.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build Backend
  • GitHub Check: Test (Python 3.14)
🧰 Additional context used
📓 Path-based instructions (4)
docs/**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

Documentation files in docs/ are Markdown, built with Zensical, configured in mkdocs.yml; design spec in docs/design/ (12 pages), Architecture in docs/architecture/, Roadmap in docs/roadmap/

Files:

  • docs/design/index.md
  • docs/DESIGN_SPEC.md
  • docs/design/agents.md
  • docs/architecture/decisions.md
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: No from __future__ import annotations in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use except A, B: (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use @computed_field for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)

Files:

  • tests/unit/observability/test_events.py
  • src/synthorg/hr/evaluation/__init__.py
  • tests/unit/hr/evaluation/test_enums.py
  • src/synthorg/hr/evaluation/enums.py
  • src/synthorg/hr/evaluation/pillar_protocol.py
  • tests/unit/hr/evaluation/test_config.py
  • tests/unit/hr/evaluation/test_intelligence_strategy.py
  • src/synthorg/observability/events/evaluation.py
  • tests/unit/hr/evaluation/test_governance_strategy.py
  • tests/unit/hr/evaluation/test_evaluator.py
  • src/synthorg/hr/evaluation/intelligence_strategy.py
  • tests/unit/hr/evaluation/test_experience_strategy.py
  • tests/unit/hr/evaluation/test_resilience_strategy.py
  • src/synthorg/hr/evaluation/governance_strategy.py
  • src/synthorg/hr/evaluation/experience_strategy.py
  • src/synthorg/hr/evaluation/models.py
  • src/synthorg/hr/evaluation/resilience_strategy.py
  • tests/unit/hr/evaluation/conftest.py
  • src/synthorg/hr/evaluation/evaluator.py
  • tests/unit/hr/evaluation/test_models.py
  • src/synthorg/hr/evaluation/config.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: All Python test files must use @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer @pytest.mark.parametrize for testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (@given + @settings); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions

Files:

  • tests/unit/observability/test_events.py
  • tests/unit/hr/evaluation/test_enums.py
  • tests/unit/hr/evaluation/test_config.py
  • tests/unit/hr/evaluation/test_intelligence_strategy.py
  • tests/unit/hr/evaluation/test_governance_strategy.py
  • tests/unit/hr/evaluation/test_evaluator.py
  • tests/unit/hr/evaluation/test_experience_strategy.py
  • tests/unit/hr/evaluation/test_resilience_strategy.py
  • tests/unit/hr/evaluation/conftest.py
  • tests/unit/hr/evaluation/test_models.py
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/synthorg/**/*.py: Every Python module with business logic must have: from synthorg.observability import get_logger then logger = get_logger(__name__)
Never use import logging / logging.getLogger() / print() in Python application code; exceptions are observability/setup.py, observability/sinks.py, observability/syslog_handler.py, and observability/http_handler.py
Python logger variable name must always be logger (not _logger, not log)
Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT
Use structured logging with kwargs in Python: always logger.info(EVENT, key=value) -- never logger.info('msg %s', val)
All error paths in Python must log at WARNING or ERROR with context before raising
All state transitions in Python must log at INFO level
Use DEBUG logging level in Python for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports in Python do NOT need logging
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned Python code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases

Files:

  • src/synthorg/hr/evaluation/__init__.py
  • src/synthorg/hr/evaluation/enums.py
  • src/synthorg/hr/evaluation/pillar_protocol.py
  • src/synthorg/observability/events/evaluation.py
  • src/synthorg/hr/evaluation/intelligence_strategy.py
  • src/synthorg/hr/evaluation/governance_strategy.py
  • src/synthorg/hr/evaluation/experience_strategy.py
  • src/synthorg/hr/evaluation/models.py
  • src/synthorg/hr/evaluation/resilience_strategy.py
  • src/synthorg/hr/evaluation/evaluator.py
  • src/synthorg/hr/evaluation/config.py
🧠 Learnings (50)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to docs/design/*.md : Design spec pages: 7 pages in `docs/design/` — index, agents, organization, communication, engine, memory, operations

Applied to files:

  • docs/design/index.md
  • docs/DESIGN_SPEC.md
  • docs/design/agents.md
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to docs/design/**/*.md : Design specification pages in `docs/design/` must be consulted before implementing features (7 pages: index, agents, organization, communication, engine, memory, operations)

Applied to files:

  • docs/design/index.md
  • docs/DESIGN_SPEC.md
  • docs/design/agents.md
  • docs/architecture/decisions.md
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to docs/design/*.md : Update the relevant `docs/design/` page when approved deviations occur to reflect the new reality

Applied to files:

  • docs/design/index.md
  • docs/DESIGN_SPEC.md
📚 Learning: 2026-03-14T15:43:05.601Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T15:43:05.601Z
Learning: Applies to docs/** : Docs source in docs/ (Markdown, built with Zensical); design spec in docs/design/ (7 pages: index, agents, organization, communication, engine, memory, operations)

Applied to files:

  • docs/design/index.md
  • docs/DESIGN_SPEC.md
  • CLAUDE.md
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue — DESIGN_SPEC.md is a pointer file linking to 7 design pages (Agents, Organization, Communication, Engine, Memory, Operations)

Applied to files:

  • docs/design/index.md
  • docs/DESIGN_SPEC.md
  • docs/design/agents.md
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Documentation source in `docs/` (Markdown, built with Zensical). Design spec in `docs/design/` (7 pages: index, agents, organization, communication, engine, memory, operations). Architecture in `docs/architecture/` (overview, tech-stack, decision log). Roadmap in `docs/roadmap/`. Security in `docs/security.md`. Licensing in `docs/licensing.md`. Reference in `docs/reference/`. REST API reference in `docs/rest-api.md`. Library reference in `docs/api/` (auto-generated from docstrings). Custom templates in `docs/overrides/`. Config in `mkdocs.yml`.

Applied to files:

  • docs/design/index.md
  • docs/DESIGN_SPEC.md
  • CLAUDE.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...

Applied to files:

  • docs/DESIGN_SPEC.md
  • CLAUDE.md
  • src/synthorg/hr/evaluation/pillar_protocol.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/engine/**/*.py : Engine package (engine/): agent orchestration, parallel execution, task decomposition, routing, TaskEngine (centralized single-writer), task lifecycle/recovery/shutdown, workspace isolation, coordination (4 dispatchers: SAS/centralized/decentralized/context-dependent, wave execution), approval gates (escalation detection, context parking/resume), stagnation detection (ToolRepetitionDetector, corrective prompt injection), AgentRuntimeState (execution status), context budget management, conversation compaction (oldest-turns summarizer)

Applied to files:

  • docs/DESIGN_SPEC.md
  • CLAUDE.md
  • src/synthorg/hr/evaluation/evaluator.py
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`

Applied to files:

  • tests/unit/observability/test_events.py
  • src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Engine: Agent orchestration, execution loops, parallel execution, task decomposition, routing, task assignment, centralized single-writer task state engine (TaskEngine), task lifecycle, recovery, shutdown, workspace isolation, coordination (multi-agent pipeline: TopologyDispatcher protocol, 4 dispatchers — SAS/centralized/decentralized/context-dependent, wave execution, workspace lifecycle integration, CoordinationSectionConfig company config bridge, build_coordinator factory), coordination error classification, prompt policy validation, checkpoint recovery (checkpoint/, per-turn persistence, heartbeat detection, CheckpointRecoveryStrategy), approval gate (escalation detection, context parking/resume, EscalationInfo/ResumePayload models), stagnation detection (stagnation/, StagnationDetector protocol, ToolRepetitionDetector, dual-signal analysis, corrective prompt injection), agent runtime state (AgentRuntimeState, lightweight per-agent execution status for dashboard queries and recove...

Applied to files:

  • CLAUDE.md
  • docs/design/agents.md
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/core/**/*.py : Core module must contain shared domain models, base classes, resilience config (RetryConfig, RateLimiterConfig)

Applied to files:

  • CLAUDE.md
  • src/synthorg/hr/evaluation/models.py
  • src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Settings: Runtime-editable settings persistence (DB > env > YAML > code defaults), typed definitions (9 namespaces), Fernet encryption for sensitive values, config bridge, ConfigResolver (typed composed reads for controllers), validation, registry, change notifications via message bus. Per-namespace setting definitions in definitions/ submodule (api, company, providers, memory, budget, security, coordination, observability, backup).

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Security: SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies: disabled/weighted/per-category/milestone), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume).

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`. Never use `import logging` / `logging.getLogger()` / `print()` in application code. Variable name: always `logger` (not `_logger`, not `log`).

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (via `model_copy(update=...)`) for runtime state that evolves

Applied to files:

  • tests/unit/hr/evaluation/test_config.py
  • src/synthorg/hr/evaluation/models.py
  • src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 BaseModel, model_validator, computed_field, ConfigDict.

Applied to files:

  • tests/unit/hr/evaluation/test_config.py
  • src/synthorg/hr/evaluation/models.py
  • src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability package (observability/): structured logging, correlation tracking, log sinks; event constants organized by domain under observability/events/ (e.g., events.api, events.tool, events.git, events.context_budget, events.backup)

Applied to files:

  • src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from `synthorg.observability.events.<domain>` modules (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly and use in structured logging

Applied to files:

  • src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Use event constants from `synthorg.observability.events.<domain>` (e.g., `API_REQUEST_STARTED` from `events.api`); import directly and log with structured kwargs: `logger.info(EVENT, key=value)`, never interpolated strings

Applied to files:

  • src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to tests/**/*.py : Fix flaky tests completely and fundamentally; for timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins

Applied to files:

  • tests/unit/hr/evaluation/test_resilience_strategy.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to tests/**/*.py : Test markers: pytest.mark.unit, pytest.mark.integration, pytest.mark.e2e, pytest.mark.slow. Coverage: 80% minimum (enforced in CI).

Applied to files:

  • tests/unit/hr/evaluation/test_resilience_strategy.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to tests/**/*.py : Test markers: `pytest.mark.unit`, `pytest.mark.integration`, `pytest.mark.e2e`, `pytest.mark.slow`. Coverage: 80% minimum. Async: `asyncio_mode = 'auto'` — no manual `pytest.mark.asyncio` needed. Timeout: 30 seconds per test. Parallelism: `pytest-xdist` via `-n auto` — ALWAYS include `-n auto` when running pytest, never run tests sequentially.

Applied to files:

  • tests/unit/hr/evaluation/test_resilience_strategy.py
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/security/**/*.py : Security module includes SecOps agent, rule engine (soft-allow/hard-deny), audit log, output scanner, risk classifier, autonomy levels (4 strategies), timeout policies.

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/security/**/*.py : Security package (security/): SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume)

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 conventions: `BaseModel`, `model_validator`, `computed_field`, `ConfigDict`. For derived values use `computed_field` instead of storing + validating redundant fields. Use `NotBlankStr` (from `core.types`) for all identifier/name fields — including optional (`NotBlankStr | None`) and tuple (`tuple[NotBlankStr, ...]`) variants — instead of manual whitespace validators.

Applied to files:

  • src/synthorg/hr/evaluation/models.py
  • src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-15T18:42:17.990Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:42:17.990Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 conventions: `BaseModel`, `model_validator`, `computed_field`, `ConfigDict`

Applied to files:

  • src/synthorg/hr/evaluation/models.py
  • src/synthorg/hr/evaluation/config.py
🔇 Additional comments (38)
tests/unit/observability/test_events.py (1)

200-200: LGTM!

The addition of "evaluation" to the expected domain modules correctly reflects the new evaluation events domain introduced in this PR.

docs/design/index.md (1)

161-161: LGTM!

The documentation update accurately reflects the new evaluation framework scope in the HR package.

docs/DESIGN_SPEC.md (1)

14-14: LGTM!

The documentation update is consistent with the corresponding change in docs/design/index.md and accurately reflects the new evaluation framework.

CLAUDE.md (1)

147-147: LGTM!

The addition of EVAL_REPORT_COMPUTED from events.evaluation correctly documents the new evaluation observability domain and follows the established event constant pattern.

src/synthorg/hr/evaluation/__init__.py (1)

1-8: LGTM!

The module docstring clearly describes the five-pillar evaluation framework and its configuration-driven nature. As a pure docstring module, no logging is needed per coding guidelines.

docs/architecture/decisions.md (1)

82-82: LGTM!

The D24 decision entry thoroughly documents the five-pillar evaluation design, including the pluggable protocol pattern, context bag approach, and configuration-driven enablement. The decision aligns with the framework's protocol-driven architecture philosophy.

tests/unit/hr/evaluation/test_enums.py (1)

1-41: LGTM!

Comprehensive test coverage for the EvaluationPillar enum. The tests verify member count, values, StrEnum behavior, value-based lookup, and invalid value handling. Good use of @pytest.mark.parametrize for testing all members.

src/synthorg/hr/evaluation/enums.py (1)

1-17: LGTM!

Clean and well-documented enum definition for the five evaluation pillars. As a pure data model, no logging is needed per coding guidelines. The string values follow a clear convention and align with the InfoQ five-pillar framework.

src/synthorg/hr/evaluation/pillar_protocol.py (1)

16-43: Protocol contract is clean and implementation-ready.

Typed async interface and explicit pillar/name properties are clear and consistent for strategy injection.

src/synthorg/observability/events/evaluation.py (1)

9-16: Event constant set looks consistent and complete for the evaluation domain.

tests/unit/hr/evaluation/test_experience_strategy.py (1)

28-161: Coverage is strong for UX scoring behavior and neutral-path handling.

docs/design/agents.md (1)

411-455: The new five-pillar design section is clear and well-aligned with the implemented architecture.

tests/unit/hr/evaluation/test_intelligence_strategy.py (1)

31-161: Intelligence strategy tests exercise the critical scoring branches and drift-confidence behavior well.

tests/unit/hr/evaluation/test_config.py (1)

18-236: Config model test coverage is comprehensive and validates key invariants effectively.

tests/unit/hr/evaluation/test_models.py (1)

24-409: Model and utility tests are thorough, especially around validation boundaries and frozen behavior.

src/synthorg/hr/evaluation/intelligence_strategy.py (1)

1-163: Well-structured strategy implementation.

The QualityBlendIntelligenceStrategy correctly implements the PillarScoringStrategy protocol with proper logging, event emission, and configuration-driven behavior. The neutral score fallback for missing data and confidence reduction for calibration drift are well-considered design choices.

src/synthorg/hr/evaluation/experience_strategy.py (1)

41-164: Clean UX scoring implementation.

The strategy correctly handles partial feedback (None ratings), metric toggles, and weight redistribution. The early return for insufficient feedback with appropriate logging is good defensive design.

tests/unit/hr/evaluation/test_resilience_strategy.py (1)

1-140: Comprehensive resilience strategy test coverage.

Tests cover protocol properties, neutral scoring fallbacks, metric enable/disable behavior, edge cases (zero tasks, all failures), and score range expectations. The use of factory functions from conftest promotes maintainability.

tests/unit/hr/evaluation/test_governance_strategy.py (1)

1-188: Thorough governance strategy test suite.

Tests cover all key scenarios: neutral fallback, score ranges for different verdict distributions, metric toggles, penalty behaviors, and the unknown trust level fallback. The comparative assertions (lines 144-146, 168-170) effectively validate penalty mechanics.

src/synthorg/hr/evaluation/governance_strategy.py (1)

39-176: Solid governance strategy implementation.

The strategy correctly handles the three governance metrics (audit compliance, trust level, autonomy compliance) with proper fallbacks for missing data and configuration-driven behavior. The trust level mapping with unknown-level fallback is well-designed.

tests/unit/hr/evaluation/test_evaluator.py (1)

1-370: Comprehensive evaluator test coverage.

The test suite covers orchestration (pillar enablement, weight redistribution), individual metric computation (efficiency, resilience), feedback lifecycle, and end-to-end evaluation flow. The TestComputeResilienceMetrics class thoroughly validates streak tracking, recovery detection, and quality stddev computation.

src/synthorg/hr/evaluation/config.py (1)

1-280: Well-designed evaluation configuration schema.

All pillar configs consistently enforce at least one metric enabled when the pillar is active. The use of frozen=True and allow_inf_nan=False aligns with coding guidelines. Default weights within each pillar sum to 1.0, ensuring proper normalization before redistribution.

src/synthorg/hr/evaluation/models.py (4)

33-62: Clean weight redistribution utility.

The redistribute_weights function correctly handles the edge cases: raises when all items are disabled, and uses equal distribution when all enabled items have zero weight. The implementation is concise and well-documented.
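For illustration, the redistribution behavior described above can be sketched as follows. This is a hypothetical standalone version, not the project's actual `redistribute_weights`; the signature and error message are assumptions:

```python
def redistribute_weights(
    weights: dict[str, float], enabled: set[str]
) -> dict[str, float]:
    """Renormalize the weights of enabled items so they sum to 1.0.

    Illustrative sketch only; the real utility lives in
    src/synthorg/hr/evaluation/models.py.
    """
    if not enabled & weights.keys():
        # Edge case 1: every item disabled -- nothing to redistribute.
        raise ValueError("all items are disabled")
    total = sum(weights[k] for k in enabled if k in weights)
    if total == 0.0:
        # Edge case 2: all enabled items carry zero weight -- equal shares.
        share = 1.0 / len(enabled)
        return {k: share for k in enabled}
    # Normal case: scale enabled weights proportionally to sum to 1.0.
    return {k: weights[k] / total for k in enabled if k in weights}
```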


141-201: Robust resilience metrics validation.

The ResilienceMetrics model includes comprehensive cross-field validation ensuring logical consistency: failed_tasks <= total_tasks, recovered_tasks <= failed_tasks, and longest_success_streak >= current_success_streak. This prevents invalid states from propagating through the evaluation pipeline.
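The three cross-field invariants can be demonstrated with a plain-Python sketch (the real model uses a frozen Pydantic v2 model with a `model_validator`; the dataclass form and error messages here are assumptions for the demo):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceMetricsSketch:
    """Illustrative stand-in for the validated ResilienceMetrics model."""

    total_tasks: int
    failed_tasks: int
    recovered_tasks: int
    longest_success_streak: int
    current_success_streak: int

    def __post_init__(self) -> None:
        # Same cross-field checks the review describes.
        if self.failed_tasks > self.total_tasks:
            raise ValueError("failed_tasks must not exceed total_tasks")
        if self.recovered_tasks > self.failed_tasks:
            raise ValueError("recovered_tasks must not exceed failed_tasks")
        if self.current_success_streak > self.longest_success_streak:
            raise ValueError(
                "longest_success_streak must be >= current_success_streak"
            )
```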


331-389: Well-structured evaluation report model.

The EvaluationReport correctly enforces unique pillar scores via the validator and provides a complete structure for reporting evaluation results. The use of uuid4 for default IDs ensures uniqueness across evaluations.


12-14: No action required—TYPE_CHECKING import pattern is correct for Python 3.14.

Sequence belongs in the TYPE_CHECKING block because Python 3.14's PEP 649 provides native lazy annotations. Annotations are not evaluated at runtime, so conditional imports are safe and recommended. The current code is correct and requires no changes.

> Likely an incorrect or invalid review comment.
tests/unit/hr/evaluation/conftest.py (1)

1-124: Well-structured test fixture module with consistent defaults.

The factory helpers correctly satisfy the model validators (e.g., ResilienceMetrics constraints in snippet 2, EvaluationContext agent_id consistency in snippet 1). Defaults are sensible and the NotBlankStr wrapping is applied appropriately for identifier fields.

src/synthorg/hr/evaluation/evaluator.py (11)

1-52: Clean module setup with well-named constants.

Good use of TYPE_CHECKING for forward references and constants for magic numbers.


55-121: Good dependency injection pattern with lazy default strategy loading.

The lazy imports in _default_* methods avoid circular import issues while still providing sensible defaults.


123-155: Clean orchestration with clear step decomposition.

The evaluate method follows a clear flow: build context → resolve pillars → score → assemble report.


157-186: Efficient context building with centralized data gathering.


254-272: Proper use of asyncio.TaskGroup for concurrent pillar scoring.

Per coding guidelines, TaskGroup is the preferred pattern for fan-out/fan-in parallel operations.


274-314: Correct weighted aggregation with proper clamping and logging.

The INFO-level log for report computation follows the guideline for state transitions.


316-360: Clean feedback storage with immutable query results.

The get_feedback method returns a tuple to ensure immutability of query results.


489-522: Defensive constraint enforcement with min(recovered, failed).

Line 518's min(recovered, failed) ensures the recovered_tasks <= failed_tasks constraint is always satisfied, matching the validator in ResilienceMetrics.


525-568: Correct streak and standard deviation computations.

The streak logic properly tracks recoveries (success following failure), and the standard deviation uses population variance (dividing by n), which is appropriate for evaluating all observed data points rather than estimating from a sample.
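Both computations are easy to show in miniature; the helper names below are assumptions, not the evaluator's actual functions:

```python
import statistics


def count_recoveries(outcomes: list[bool]) -> int:
    """Count successes that immediately follow a failure."""
    return sum(
        1 for prev, cur in zip(outcomes, outcomes[1:]) if not prev and cur
    )


def quality_stddev(scores: list[float]) -> float:
    """Population standard deviation (divides by n, not n-1),
    matching the review's note that all observed points are the
    population rather than a sample."""
    return statistics.pstdev(scores) if scores else 0.0
```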


406-431: No division by zero risk; this concern is unfounded.

EfficiencyConfig validates all reference fields (reference_cost_usd, reference_time_seconds, reference_tokens) with Pydantic's gt=0 constraint, which rejects zero and negative values at validation time. The unit tests confirm this validation is enforced. Division operations at lines 410, 420, and 428 are safe.

> Likely an incorrect or invalid review comment.

373-374: No issue here. NotBlankStr is Annotated[str, ...], so plain string literals will correctly match NotBlankStr keys in dict lookups.
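A quick demonstration of that point: a value annotated as `Annotated[str, ...]` is a plain `str` at runtime, so it hashes and compares like any other string in a dict lookup. The `NotBlankStr` defined here is a stand-in for the demo; the project's actual alias lives in `core.types`:

```python
from typing import Annotated

# Stand-in alias; the marker metadata is illustrative only.
NotBlankStr = Annotated[str, "must not be blank"]


def lookup(scores: dict[str, float], key: str) -> float:
    """Plain-str lookup against NotBlankStr-typed keys."""
    return scores[key]


scores: dict[NotBlankStr, float] = {"agent-1": 0.9}
```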

…nd CodeRabbit

- Fix intelligence strategy fallback bug: CI-disabled + no calibration
  records now returns neutral instead of using disabled CI quality score
- Fix wrong event constant EVAL_PILLAR_SCORED for unknown trust level
  warning in governance strategy; add EVAL_TRUST_LEVEL_UNKNOWN constant
- Move Sequence out of TYPE_CHECKING in models.py for PEP 649 safety
- Extract shared scoring constants to evaluation/constants.py, replace
  duplicated _MAX_SCORE/_NEUTRAL_SCORE/_FULL_CONFIDENCE_DATA_POINTS
  across 5 modules
- Decompose all strategy score() methods into <50-line helpers:
  _collect_metrics, _build_result, _neutral, _compute_confidence
- Extract _get_pillar_configs from _resolve_enabled_pillars in evaluator
- Add EvaluationReport validators: agent_id/snapshot consistency,
  pillar_weights/pillar_scores correspondence
- Extract magic number 3 in UX confidence to named constant
- Add logging for silent neutral returns in governance/resilience/
  experience strategy no-enabled-metrics paths
- Fix docstrings: intelligence strategy LLM calibration origin note,
  efficiency docstring second neutral path, evaluator Args pillar names,
  resilience metrics capping behavior, add __init__ docstring
- Fix agents.md frontmatter description to include 'evaluation'
- Add 10 tests: CI-disabled+no-calibration, EvaluationContext agent_id
  mismatch, EvaluationReport agent_id/weights validators, efficiency
  7d fallback/neutral/clamping, parametrized all-metrics-disabled for
  all 5 configs

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/synthorg/hr/evaluation/evaluator.py`:
- Around line 75-91: In __init__, validate any non-None injected strategy
(intelligence_strategy, resilience_strategy, governance_strategy, ux_strategy)
by checking its strategy.pillar equals the expected EvaluationPillar for that
slot (e.g., intelligence -> EvaluationPillar.INTELLIGENCE, resilience ->
RESILIENCE, governance -> GOVERNANCE, ux -> UX); if a mismatch is found raise a
ValueError with a clear message naming the slot and actual strategy.pillar so
the failure occurs at construction time; keep using the existing _default_*()
for None inputs but still assert their .pillar if you want extra safety.

In `@src/synthorg/hr/evaluation/experience_strategy.py`:
- Around line 84-103: The sufficiency check and downstream
confidence/data_point_count must count only feedback entries that contributed at
least one enabled metric; change the flow so you first identify/filter
contributing entries (e.g., compute contributing_feedback = [f for f in feedback
if it has at least one enabled metric according to cfg] or update
_collect_metrics to return both available metrics and the per-feedback
contribution set), then use len(contributing_feedback) instead of len(feedback)
when comparing to cfg.min_feedback_count and when computing
data_point_count/confidence; finally pass the filtered contributing_feedback (or
use the contributed-count returned by _collect_metrics) into _build_result and
call _neutral when contributing count < cfg.min_feedback_count (using the same
reason keys), keeping calls to _neutral and symbols _collect_metrics,
_build_result, _neutral, and cfg.min_feedback_count consistent.

In `@src/synthorg/hr/evaluation/governance_strategy.py`:
- Around line 29-35: The trust-score map _TRUST_LEVEL_SCORES currently omits the
legitimate TrustLevel.CUSTOM value, causing agents with "custom" to be treated
as unknown (EVAL_TRUST_LEVEL_UNKNOWN) and receive the neutral fallback; update
the logic to explicitly handle "custom" by either adding a "custom" key to
_TRUST_LEVEL_SCORES or—preferably—resolve the custom trust policy and compute a
score from that policy before falling back, by updating the code paths that
reference _TRUST_LEVEL_SCORES and the evaluator that emits
EVAL_TRUST_LEVEL_UNKNOWN (use TrustLevel.CUSTOM as the discriminant and call the
custom-policy resolution routine to derive the numeric score).
- Around line 75-89: Remove the early neutral-return that blocks scoring when
total_audits == 0 and context.trust_level is None; instead call
self._collect_metrics(context, total_audits) unconditionally so that the
collector can evaluate enabled metrics (including autonomy_compliance) and
decide if there is data. After calling _collect_metrics use its returned
enabled/data_points to decide whether to return self._neutral(...) or to call
self._build_result(scores, enabled, data_points, context). Keep references to
the same methods/variables: _collect_metrics, _neutral, _build_result,
total_audits, and context.trust_level (do not add new gating logic before
calling _collect_metrics).

In `@src/synthorg/hr/evaluation/intelligence_strategy.py`:
- Around line 64-67: The current logic returns neutral when
context.snapshot.overall_quality_score is None even if CI quality is disabled or
calibration data exists; update the flow in intelligence_strategy.py so
overall_quality_score is only treated as a CI data source when
ci_quality_enabled is true and only include CI-derived points in
_collect_metrics() when ci_quality_enabled is true (i.e., stop preloading
data_points from task_records unless ci_quality_enabled), change the
early-return that calls self._neutral(reason="no_quality_score") to check that
no enabled metric has usable data before returning neutral, and add a regression
test that sets ci_quality_enabled=False with overall_quality_score=None but with
calibration records present to ensure scoring proceeds using calibration only.

In `@src/synthorg/hr/evaluation/models.py`:
- Around line 372-413: The current _validate_weights_match_scores only compares
sets and misses duplicate pillar names and invalid floats; update validation for
the pillar_weights field (and/or _validate_weights_match_scores) to (1) detect
and reject duplicate pillar names in pillar_weights (collect seen names and
raise ValueError listing duplicates), (2) ensure each weight is a real number
within [0.0, 1.0] (reject negatives or >1), and (3) ensure the weights are
normalized (sum(weights) ≈ 1.0 within a small epsilon) and raise descriptive
ValueError messages if any check fails; keep these checks in the model_validator
decorated method(s) for EvaluationReport so invalid/ambiguous weighting schemes
cannot be constructed.
- Around line 262-328: Add an additional after-model validator (e.g. def
_validate_agent_scoped_records_consistency(self) -> Self) that iterates
task_records, calibration_records, and feedback and ensures each record.agent_id
equals self.agent_id; if any mismatch is found raise ValueError with a clear
message identifying the collection and offending record (index or repr). Keep
the existing _validate_agent_id_consistency but implement this new validator to
enforce agent_id consistency across TaskMetricRecord, LlmCalibrationRecord, and
InteractionFeedback collections.

In `@tests/unit/hr/evaluation/test_evaluator.py`:
- Around line 167-210: Update the two tests to force the snapshot shapes so the
fallback and neutral branches in EvaluationService._score_efficiency() are
actually exercised: in test_efficiency_7d_window_fallback() monkeypatch
PerformanceTracker.get_snapshot (or the EvaluationService.get_snapshot helper)
to return a snapshot containing only the 7d window (no 30d data), call
svc.evaluate(agent_id) and assert the efficiency pillar's score and confidence
match the known 7d-fallback expected values; in
test_efficiency_no_window_returns_neutral() patch get_snapshot to return no
windows (empty snapshot), call svc.evaluate(agent_id) and assert the efficiency
pillar's score and confidence equal the neutral values returned by
_score_efficiency() for no-data cases. Ensure you reference
EvaluationService._score_efficiency, PerformanceTracker.get_snapshot (or the
concrete get_snapshot you use), and the test functions
test_efficiency_7d_window_fallback and test_efficiency_no_window_returns_neutral
when making the changes.
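As a self-contained illustration of the three weight checks requested for `models.py` above (duplicate names, per-weight range, normalization), assuming `pillar_weights` arrives as `(name, weight)` pairs -- the function name and messages are hypothetical, not the model's actual validator:

```python
import math


def validate_pillar_weights(pairs: list[tuple[str, float]]) -> None:
    """Reject duplicate pillar names, out-of-range weights,
    and non-normalized weight sets."""
    names = [name for name, _ in pairs]
    dupes = sorted({n for n in names if names.count(n) > 1})
    if dupes:
        raise ValueError(f"duplicate pillar names: {dupes}")
    for name, weight in pairs:
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {name} outside [0.0, 1.0]: {weight}")
    total = sum(weight for _, weight in pairs)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1.0, got {total}")
```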

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c3ec6139-1a8d-493e-bbf9-107516e8da5a

📥 Commits

Reviewing files that changed from the base of the PR and between 10f0772 and d5c53cc.

📒 Files selected for processing (14)
  • docs/design/agents.md
  • src/synthorg/hr/evaluation/constants.py
  • src/synthorg/hr/evaluation/evaluator.py
  • src/synthorg/hr/evaluation/experience_strategy.py
  • src/synthorg/hr/evaluation/governance_strategy.py
  • src/synthorg/hr/evaluation/intelligence_strategy.py
  • src/synthorg/hr/evaluation/models.py
  • src/synthorg/hr/evaluation/resilience_strategy.py
  • src/synthorg/observability/events/evaluation.py
  • tests/unit/hr/evaluation/conftest.py
  • tests/unit/hr/evaluation/test_config.py
  • tests/unit/hr/evaluation/test_evaluator.py
  • tests/unit/hr/evaluation/test_intelligence_strategy.py
  • tests/unit/hr/evaluation/test_models.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Build Backend
  • GitHub Check: Build Sandbox
  • GitHub Check: Build Web
  • GitHub Check: Dependency Review
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: No from __future__ import annotations in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use except A, B: (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use @computed_field for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)

Files:

  • src/synthorg/hr/evaluation/constants.py
  • src/synthorg/observability/events/evaluation.py
  • tests/unit/hr/evaluation/test_intelligence_strategy.py
  • tests/unit/hr/evaluation/test_config.py
  • src/synthorg/hr/evaluation/resilience_strategy.py
  • src/synthorg/hr/evaluation/governance_strategy.py
  • src/synthorg/hr/evaluation/intelligence_strategy.py
  • src/synthorg/hr/evaluation/experience_strategy.py
  • tests/unit/hr/evaluation/test_models.py
  • src/synthorg/hr/evaluation/models.py
  • src/synthorg/hr/evaluation/evaluator.py
  • tests/unit/hr/evaluation/conftest.py
  • tests/unit/hr/evaluation/test_evaluator.py
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/synthorg/**/*.py: Every Python module with business logic must have: from synthorg.observability import get_logger then logger = get_logger(__name__)
Never use import logging / logging.getLogger() / print() in Python application code; exceptions are observability/setup.py, observability/sinks.py, observability/syslog_handler.py, and observability/http_handler.py
Python logger variable name must always be logger (not _logger, not log)
Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT
Use structured logging with kwargs in Python: always logger.info(EVENT, key=value) -- never logger.info('msg %s', val)
All error paths in Python must log at WARNING or ERROR with context before raising
All state transitions in Python must log at INFO level
Use DEBUG logging level in Python for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports in Python do NOT need logging
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned Python code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases

Files:

  • src/synthorg/hr/evaluation/constants.py
  • src/synthorg/observability/events/evaluation.py
  • src/synthorg/hr/evaluation/resilience_strategy.py
  • src/synthorg/hr/evaluation/governance_strategy.py
  • src/synthorg/hr/evaluation/intelligence_strategy.py
  • src/synthorg/hr/evaluation/experience_strategy.py
  • src/synthorg/hr/evaluation/models.py
  • src/synthorg/hr/evaluation/evaluator.py
docs/**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

Documentation files in docs/ are Markdown, built with Zensical, configured in mkdocs.yml; design spec in docs/design/ (12 pages), Architecture in docs/architecture/, Roadmap in docs/roadmap/

Files:

  • docs/design/agents.md
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: All Python test files must use @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer @pytest.mark.parametrize for testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (@given + @settings); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions

Files:

  • tests/unit/hr/evaluation/test_intelligence_strategy.py
  • tests/unit/hr/evaluation/test_config.py
  • tests/unit/hr/evaluation/test_models.py
  • tests/unit/hr/evaluation/conftest.py
  • tests/unit/hr/evaluation/test_evaluator.py
🧠 Learnings (34)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to docs/design/**/*.md : Design specification pages in `docs/design/` must be consulted before implementing features (7 pages: index, agents, organization, communication, engine, memory, operations)

Applied to files:

  • docs/design/agents.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to docs/design/*.md : Design spec pages: 7 pages in `docs/design/` — index, agents, organization, communication, engine, memory, operations

Applied to files:

  • docs/design/agents.md
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue — DESIGN_SPEC.md is a pointer file linking to the 7 design pages (index, agents, organization, communication, engine, memory, operations)

Applied to files:

  • docs/design/agents.md
📚 Learning: 2026-03-14T15:43:05.601Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T15:43:05.601Z
Learning: Applies to docs/** : Docs source in docs/ (Markdown, built with Zensical); design spec in docs/design/ (7 pages: index, agents, organization, communication, engine, memory, operations)

Applied to files:

  • docs/design/agents.md
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue. DESIGN_SPEC.md is a pointer file linking to the 7 design pages (index, agents, organization, communication, engine, memory, operations).

Applied to files:

  • docs/design/agents.md
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)

Applied to files:

  • docs/design/agents.md
  • tests/unit/hr/evaluation/test_models.py
  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion

Applied to files:

  • docs/design/agents.md
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/security/**/*.py : Security module includes SecOps agent, rule engine (soft-allow/hard-deny), audit log, output scanner, risk classifier, autonomy levels (4 strategies), timeout policies.

Applied to files:

  • docs/design/agents.md
  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/security/**/*.py : Security package (security/): SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume)

Applied to files:

  • docs/design/agents.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...

Applied to files:

  • docs/design/agents.md
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Engine: Agent orchestration, execution loops, parallel execution, task decomposition, routing, task assignment, centralized single-writer task state engine (TaskEngine), task lifecycle, recovery, shutdown, workspace isolation, coordination (multi-agent pipeline: TopologyDispatcher protocol, 4 dispatchers — SAS/centralized/decentralized/context-dependent, wave execution, workspace lifecycle integration, CoordinationSectionConfig company config bridge, build_coordinator factory), coordination error classification, prompt policy validation, checkpoint recovery (checkpoint/, per-turn persistence, heartbeat detection, CheckpointRecoveryStrategy), approval gate (escalation detection, context parking/resume, EscalationInfo/ResumePayload models), stagnation detection (stagnation/, StagnationDetector protocol, ToolRepetitionDetector, dual-signal analysis, corrective prompt injection), agent runtime state (AgentRuntimeState, lightweight per-agent execution status for dashboard queries and recove...

Applied to files:

  • docs/design/agents.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability package (observability/): structured logging, correlation tracking, log sinks; event constants organized by domain under observability/events/ (e.g., events.api, events.tool, events.git, events.context_budget, events.backup)

Applied to files:

  • src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from `synthorg.observability.events.<domain>` modules (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly and use in structured logging

Applied to files:

  • src/synthorg/observability/events/evaluation.py
  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`

Applied to files:

  • src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from the domain-specific module under `synthorg.observability.events` in logging calls

Applied to files:

  • src/synthorg/observability/events/evaluation.py
  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-20T11:18:48.128Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T11:18:48.128Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.

Applied to files:

  • src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-18T21:23:23.586Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-18T21:23:23.586Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from the domain-specific module under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool). Import directly from synthorg.observability.events.<domain>.

Applied to files:

  • src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from synthorg.observability.events domain-specific modules (e.g., PROVIDER_CALL_START from events.provider). Import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT.

Applied to files:

  • src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.

Applied to files:

  • src/synthorg/observability/events/evaluation.py
  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly rather than using string literals

Applied to files:

  • src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`

Applied to files:

  • src/synthorg/observability/events/evaluation.py
  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (via `model_copy(update=...)`) for runtime state that evolves

Applied to files:

  • tests/unit/hr/evaluation/test_config.py
  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state

Applied to files:

  • tests/unit/hr/evaluation/test_config.py
  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.

Applied to files:

  • tests/unit/hr/evaluation/test_config.py
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves

Applied to files:

  • tests/unit/hr/evaluation/test_config.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models with `model_copy(update=...)` for runtime state that evolves

Applied to files:

  • tests/unit/hr/evaluation/test_config.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 BaseModel, model_validator, computed_field, ConfigDict.

Applied to files:

  • tests/unit/hr/evaluation/test_config.py
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Use event constants from `synthorg.observability.events.<domain>` (e.g., `API_REQUEST_STARTED` from `events.api`); import directly and log with structured kwargs: `logger.info(EVENT, key=value)`, never interpolated strings

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions.

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, and key function entry/exit

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-17T06:43:14.114Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:43:14.114Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions. Pure data models, enums, and re-exports do NOT need logging.

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising; all state transitions must log at INFO; DEBUG for object creation, internal flow, entry/exit of key functions

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py

Comment on lines +75 to +91
    def __init__(  # noqa: PLR0913
        self,
        *,
        tracker: PerformanceTracker,
        intelligence_strategy: PillarScoringStrategy | None = None,
        resilience_strategy: PillarScoringStrategy | None = None,
        governance_strategy: PillarScoringStrategy | None = None,
        ux_strategy: PillarScoringStrategy | None = None,
        config: EvaluationConfig | None = None,
    ) -> None:
        """Initialize the evaluation service."""
        self._tracker = tracker
        self._config = config or EvaluationConfig()
        self._intelligence = intelligence_strategy or self._default_intelligence()
        self._resilience = resilience_strategy or self._default_resilience()
        self._governance = governance_strategy or self._default_governance()
        self._ux = ux_strategy or self._default_ux()

⚠️ Potential issue | 🟠 Major

Validate injected strategies against their pillar slots.

The service accepts pluggable strategies but stores them without checking strategy.pillar. A miswired dependency passed into the wrong constructor slot will fail much later during evaluate() with duplicate or mismatched pillar data instead of at construction time. Fail fast in __init__ by validating each injected strategy against the expected EvaluationPillar.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/evaluation/evaluator.py` around lines 75 - 91, In __init__,
validate any non-None injected strategy (intelligence_strategy,
resilience_strategy, governance_strategy, ux_strategy) by checking its
strategy.pillar equals the expected EvaluationPillar for that slot (e.g.,
intelligence -> EvaluationPillar.INTELLIGENCE, resilience -> RESILIENCE,
governance -> GOVERNANCE, ux -> UX); if a mismatch is found raise a ValueError
with a clear message naming the slot and actual strategy.pillar so the failure
occurs at construction time; keep using the existing _default_*() for None
inputs but still assert their .pillar if you want extra safety.

Comment on lines +84 to +103
        cfg = context.config.experience
        feedback = context.feedback

        if len(feedback) < cfg.min_feedback_count:
            return self._neutral(
                context,
                reason="insufficient_feedback",
                count=len(feedback),
                min_required=cfg.min_feedback_count,
            )

        available = self._collect_metrics(cfg, feedback)

        if not available:
            return self._neutral(
                context,
                reason="no_enabled_metrics_with_data",
            )

        return self._build_result(available, feedback, context)

⚠️ Potential issue | 🟠 Major

Count only contributing feedback toward UX sufficiency and confidence.

len(feedback) includes records where every enabled rating is None. With one real rating and many empty submissions, this path clears min_feedback_count, inflates data_point_count, and can push confidence close to 1.0 even though almost no UX signal was used. Base the sufficiency check, confidence, and data_point_count on feedback entries that contributed at least one enabled metric.

Also applies to: 155-167

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/evaluation/experience_strategy.py` around lines 84 - 103, The
sufficiency check and downstream confidence/data_point_count must count only
feedback entries that contributed at least one enabled metric; change the flow
so you first identify/filter contributing entries (e.g., compute
contributing_feedback = [f for f in feedback if it has at least one enabled
metric according to cfg] or update _collect_metrics to return both available
metrics and the per-feedback contribution set), then use
len(contributing_feedback) instead of len(feedback) when comparing to
cfg.min_feedback_count and when computing data_point_count/confidence; finally
pass the filtered contributing_feedback (or use the contributed-count returned
by _collect_metrics) into _build_result and call _neutral when contributing
count < cfg.min_feedback_count (using the same reason keys), keeping calls to
_neutral and symbols _collect_metrics, _build_result, _neutral, and
cfg.min_feedback_count consistent.

Comment on lines +29 to +35
# Trust level to score mapping.
_TRUST_LEVEL_SCORES: dict[str, float] = {
    "sandboxed": 2.5,
    "restricted": 5.0,
    "standard": 7.5,
    "elevated": 10.0,
}

⚠️ Potential issue | 🟠 Major

Handle the valid custom trust level explicitly.

src/synthorg/core/enums.py defines TrustLevel.CUSTOM = "custom", but this table does not. Agents with that legitimate value will emit EVAL_TRUST_LEVEL_UNKNOWN and get the neutral fallback instead of a trust score. Add a dedicated custom path, or derive the score from the resolved custom trust policy instead of routing it through the unknown-level fallback.

Also applies to: 148-163

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/evaluation/governance_strategy.py` around lines 29 - 35, The
trust-score map _TRUST_LEVEL_SCORES currently omits the legitimate
TrustLevel.CUSTOM value, causing agents with "custom" to be treated as unknown
(EVAL_TRUST_LEVEL_UNKNOWN) and receive the neutral fallback; update the logic to
explicitly handle "custom" by either adding a "custom" key to
_TRUST_LEVEL_SCORES or—preferably—resolve the custom trust policy and compute a
score from that policy before falling back, by updating the code paths that
reference _TRUST_LEVEL_SCORES and the evaluator that emits
EVAL_TRUST_LEVEL_UNKNOWN (use TrustLevel.CUSTOM as the discriminant and call the
custom-policy resolution routine to derive the numeric score).

Comment on lines +75 to +89
        if total_audits == 0 and context.trust_level is None:
            return self._neutral(context, reason="no_governance_data")

        scores, enabled, data_points = self._collect_metrics(
            context,
            total_audits,
        )

        if not enabled:
            return self._neutral(
                context,
                reason="no_enabled_metrics_with_data",
            )

        return self._build_result(scores, enabled, data_points, context)

⚠️ Potential issue | 🔴 Critical

Autonomy-only governance scoring is blocked by the early neutral return.

This precheck short-circuits before _collect_metrics() can score autonomy_compliance, so a configuration that enables only autonomy can never produce a real governance score. Let the collector decide whether any enabled metric has data instead of requiring audits or trust up front.

Suggested fix
-        if total_audits == 0 and context.trust_level is None:
-            return self._neutral(context, reason="no_governance_data")
-
         scores, enabled, data_points = self._collect_metrics(
             context,
             total_audits,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/evaluation/governance_strategy.py` around lines 75 - 89,
Remove the early neutral-return that blocks scoring when total_audits == 0 and
context.trust_level is None; instead call self._collect_metrics(context,
total_audits) unconditionally so that the collector can evaluate enabled metrics
(including autonomy_compliance) and decide if there is data. After calling
_collect_metrics use its returned enabled/data_points to decide whether to
return self._neutral(...) or to call self._build_result(scores, enabled,
data_points, context). Keep references to the same methods/variables:
_collect_metrics, _neutral, _build_result, total_audits, and context.trust_level
(do not add new gating logic before calling _collect_metrics).

Comment on lines +64 to +67
        ci_score = context.snapshot.overall_quality_score

        if ci_score is None:
            return self._neutral(context, reason="no_quality_score")

⚠️ Potential issue | 🔴 Critical

Don't make CI quality a hard prerequisite—or confidence source—when it isn't used.

This path returns neutral before calibration is considered, so a calibration-only setup cannot score if overall_quality_score is missing. _collect_metrics() also preloads data_points from task_records even when ci_quality is disabled or skipped, which inflates confidence for LLM-only results. Only count CI data when the CI metric is actually included, and return neutral only when no enabled metric has usable data. Please add a regression test for ci_quality_enabled=False with overall_quality_score=None and calibration records present.

Suggested fix
-        ci_score = context.snapshot.overall_quality_score
-
-        if ci_score is None:
-            return self._neutral(context, reason="no_quality_score")
-
         available, data_points, drift = self._collect_metrics(
-            ci_score,
+            context.snapshot.overall_quality_score,
             context,
         )
         if not available:
             return self._neutral(context, reason="no_enabled_metrics")
@@
-        ci_score: float,
+        ci_score: float | None,
         context: EvaluationContext,
     ) -> tuple[list[tuple[str, float, float]], int, float]:
@@
-        data_points = len(context.task_records)
+        data_points = 0
         calibration_drift = 0.0
 
-        if context.config.intelligence.ci_quality_enabled:
+        if context.config.intelligence.ci_quality_enabled and ci_score is not None:
             available.append(
                 (
                     "ci_quality",
                     context.config.intelligence.ci_quality_weight,
                     ci_score,
                 )
             )
+            data_points += len(context.task_records)
+        elif context.config.intelligence.ci_quality_enabled:
+            logger.debug(
+                EVAL_METRIC_SKIPPED,
+                agent_id=context.agent_id,
+                pillar=self.pillar.value,
+                metric="ci_quality",
+                reason="no_quality_score",
+            )

Also applies to: 78-123

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/evaluation/intelligence_strategy.py` around lines 64 - 67,
The current logic returns neutral when context.snapshot.overall_quality_score is
None even if CI quality is disabled or calibration data exists; update the flow
in intelligence_strategy.py so overall_quality_score is only treated as a CI
data source when ci_quality_enabled is true and only include CI-derived points
in _collect_metrics() when ci_quality_enabled is true (i.e., stop preloading
data_points from task_records unless ci_quality_enabled), change the
early-return that calls self._neutral(reason="no_quality_score") to check that
no enabled metric has usable data before returning neutral, and add a regression
test that sets ci_quality_enabled=False with overall_quality_score=None but with
calibration records present to ensure scoring proceeds using calibration only.

Comment on lines +262 to +328
    agent_id: NotBlankStr = Field(description="Agent being evaluated")
    now: AwareDatetime = Field(description="Reference timestamp")
    config: EvaluationConfig = Field(description="Evaluation configuration")
    snapshot: AgentPerformanceSnapshot = Field(
        description="Performance snapshot from the tracker",
    )
    task_records: tuple[TaskMetricRecord, ...] = Field(
        default=(),
        description="Raw task metric records",
    )
    calibration_records: tuple[LlmCalibrationRecord, ...] = Field(
        default=(),
        description="LLM calibration records",
    )
    feedback: tuple[InteractionFeedback, ...] = Field(
        default=(),
        description="Interaction feedback records",
    )
    resilience_metrics: ResilienceMetrics | None = Field(
        default=None,
        description="Derived resilience metrics",
    )
    audit_allow_count: int = Field(
        ge=0,
        default=0,
        description="Allowed audit entries in the window",
    )
    audit_deny_count: int = Field(
        ge=0,
        default=0,
        description="Denied audit entries in the window",
    )
    audit_escalate_count: int = Field(
        ge=0,
        default=0,
        description="Escalated audit entries in the window",
    )
    audit_high_risk_count: int = Field(
        ge=0,
        default=0,
        description="High-risk audit entries in the window",
    )
    trust_level: NotBlankStr | None = Field(
        default=None,
        description="Current trust level name",
    )
    trust_demotions_in_window: int = Field(
        ge=0,
        default=0,
        description="Trust demotions in the window",
    )
    autonomy_downgrades_in_window: int = Field(
        ge=0,
        default=0,
        description="Autonomy downgrades in the window",
    )

    @model_validator(mode="after")
    def _validate_agent_id_consistency(self) -> Self:
        """Ensure context agent_id matches snapshot agent_id."""
        if self.agent_id != self.snapshot.agent_id:
            msg = (
                f"Context agent_id ({self.agent_id}) does not match "
                f"snapshot agent_id ({self.snapshot.agent_id})"
            )
            raise ValueError(msg)
        return self

⚠️ Potential issue | 🟠 Major

Reject mixed-agent records in EvaluationContext.

The model only checks snapshot.agent_id. A caller can still build a context for one agent that carries agent-scoped records from another agent, and the strategies will score that foreign data as if it belonged to the current agent. Add an after-validator that enforces agent_id consistency across the agent-scoped collections in this model.

As per coding guidelines, "Validate at system boundaries in Python (user input, external APIs, config files)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/evaluation/models.py` around lines 262 - 328, Add an
additional after-model validator (e.g. def
_validate_agent_scoped_records_consistency(self) -> Self) that iterates
task_records, calibration_records, and feedback and ensures each record.agent_id
equals self.agent_id; if any mismatch is found raise ValueError with a clear
message identifying the collection and offending record (index or repr). Keep
the existing _validate_agent_id_consistency but implement this new validator to
enforce agent_id consistency across TaskMetricRecord, LlmCalibrationRecord, and
InteractionFeedback collections.
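The body of such a validator might look like this helper (record shapes are assumed; in the real model it would run inside a `model_validator(mode="after")` method):

```python
from collections.abc import Sequence
from typing import Any


def check_agent_scoped_records(
    agent_id: str,
    collections: dict[str, Sequence[Any]],
) -> None:
    """Reject any record whose agent_id differs from the context's agent_id.

    `collections` maps a field name (e.g. "task_records") to its records;
    each record is assumed to expose an `agent_id` attribute.
    """
    for name, records in collections.items():
        for idx, record in enumerate(records):
            if record.agent_id != agent_id:
                msg = (
                    f"{name}[{idx}] belongs to agent {record.agent_id!r}, "
                    f"expected {agent_id!r}"
                )
                raise ValueError(msg)
```

Naming the offending collection and index in the error keeps the boundary failure easy to trace back to the caller that built the context.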

Comment on lines +372 to +413
    pillar_weights: tuple[tuple[NotBlankStr, float], ...] = Field(
        description="Applied weights as (pillar_name, weight) pairs",
    )

    @model_validator(mode="after")
    def _validate_unique_pillars(self) -> Self:
        """Ensure pillar scores have unique pillar names."""
        names = [ps.pillar for ps in self.pillar_scores]
        if len(names) != len(set(names)):
            seen: set[EvaluationPillar] = set()
            dupes: list[str] = []
            for n in names:
                if n in seen:
                    dupes.append(n.value)
                seen.add(n)
            msg = f"Duplicate pillar scores: {', '.join(dupes)}"
            raise ValueError(msg)
        return self

    @model_validator(mode="after")
    def _validate_agent_id_consistency(self) -> Self:
        """Ensure report agent_id matches snapshot agent_id."""
        if self.agent_id != self.snapshot.agent_id:
            msg = (
                f"Report agent_id ({self.agent_id}) does not match "
                f"snapshot agent_id ({self.snapshot.agent_id})"
            )
            raise ValueError(msg)
        return self

    @model_validator(mode="after")
    def _validate_weights_match_scores(self) -> Self:
        """Ensure pillar_weights entries correspond to pillar_scores."""
        score_pillars = {ps.pillar.value for ps in self.pillar_scores}
        weight_pillars = {name for name, _ in self.pillar_weights}
        if score_pillars != weight_pillars:
            msg = (
                f"Pillar weight names {sorted(weight_pillars)} do not match "
                f"pillar score names {sorted(score_pillars)}"
            )
            raise ValueError(msg)
        return self

⚠️ Potential issue | 🟠 Major

pillar_weights validation is too weak for a public report model.

_validate_weights_match_scores() compares sets only, so duplicate entries like (("intelligence", 0.5), ("intelligence", 0.5)) still validate as long as the score set is {"intelligence"}. The field also accepts unconstrained floats, so negative or >1 weights can slip through. Reject duplicate names and enforce bounded, normalized weights here so EvaluationReport cannot represent an ambiguous weighting scheme.

As per coding guidelines, "Validate at system boundaries in Python (user input, external APIs, config files)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/evaluation/models.py` around lines 372 - 413: The current
_validate_weights_match_scores only compares sets and misses duplicate pillar
names and invalid floats; update validation for the pillar_weights field (and/or
_validate_weights_match_scores) to (1) detect and reject duplicate pillar names
in pillar_weights (collect seen names and raise ValueError listing duplicates),
(2) ensure each weight is a real number within [0.0, 1.0] (reject negatives or
>1), and (3) ensure the weights are normalized (sum(weights) ≈ 1.0 within a
small epsilon) and raise descriptive ValueError messages if any check fails;
keep these checks in the model_validator decorated method(s) for
EvaluationReport so invalid/ambiguous weighting schemes cannot be constructed.
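The three checks the prompt above describes can be sketched as one plain function. This is a hypothetical helper for illustration, not the project's validator; in the real model the same checks would run inside the `@model_validator(mode="after")` method on `EvaluationReport`.

```python
def validate_pillar_weights(pillar_weights: tuple[tuple[str, float], ...]) -> None:
    """Reject duplicate names, out-of-range weights, and non-normalized totals."""
    names = [name for name, _ in pillar_weights]
    if len(names) != len(set(names)):
        raise ValueError("Duplicate entries in pillar_weights")
    out_of_range = [
        name for name, weight in pillar_weights if not 0.0 <= weight <= 1.0
    ]
    if out_of_range:
        msg = f"pillar_weights must be within [0.0, 1.0] for: {', '.join(out_of_range)}"
        raise ValueError(msg)
    total = sum(weight for _, weight in pillar_weights)
    if abs(total - 1.0) > 1e-9:  # small epsilon absorbs float rounding
        raise ValueError(f"pillar_weights must sum to 1.0, got {total}")
```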

- Fix intelligence strategy: CI quality is no longer a hard prerequisite;
  calibration-only mode works when overall_quality_score is None
- Fix governance strategy: autonomy-only scoring no longer blocked by
  the early neutral return (total_audits==0 && trust_level==None)
- Strengthen EvaluationReport pillar_weights validator: reject duplicate
  weight entries before set comparison
- Fix efficiency tests to actually test 7d fallback and neutral paths
  using direct _score_efficiency calls with custom snapshots
- Update governance no-data test to disable autonomy for true neutral

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
src/synthorg/hr/evaluation/models.py (2)

268-328: ⚠️ Potential issue | 🟠 Major

Reject foreign-agent records in EvaluationContext.

The current validator only ties agent_id to snapshot.agent_id. A caller can still pass task_records, calibration_records, or feedback belonging to another agent, and the strategies will score that foreign data as if it were local.

Proposed fix
     @model_validator(mode="after")
     def _validate_agent_id_consistency(self) -> Self:
         """Ensure context agent_id matches snapshot agent_id."""
         if self.agent_id != self.snapshot.agent_id:
             msg = (
                 f"Context agent_id ({self.agent_id}) does not match "
                 f"snapshot agent_id ({self.snapshot.agent_id})"
             )
             raise ValueError(msg)
         return self
+
+    @model_validator(mode="after")
+    def _validate_agent_scoped_records(self) -> Self:
+        """Ensure agent-scoped collections match the context agent."""
+        collections = (
+            ("task_records", self.task_records),
+            ("calibration_records", self.calibration_records),
+            ("feedback", self.feedback),
+        )
+        for collection_name, records in collections:
+            for index, record in enumerate(records):
+                if record.agent_id != self.agent_id:
+                    msg = (
+                        f"{collection_name}[{index}] agent_id "
+                        f"({record.agent_id}) does not match "
+                        f"context agent_id ({self.agent_id})"
+                    )
+                    raise ValueError(msg)
+        return self

As per coding guidelines, "Validate at system boundaries in Python (user input, external APIs, config files)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/evaluation/models.py` around lines 268 - 328: The current
_validate_agent_id_consistency only compares self.agent_id to
self.snapshot.agent_id but does not reject task_records, calibration_records, or
feedback that belong to a different agent; update _validate_agent_id_consistency
in EvaluationContext to iterate over task_records (TaskMetricRecord.agent_id),
calibration_records (LlmCalibrationRecord.agent_id), and feedback
(InteractionFeedback.agent_id) and raise a ValueError if any record.agent_id !=
self.agent_id (include which record type and offending id in the message); keep
the existing snapshot check and return self at the end.

372-374: ⚠️ Potential issue | 🟠 Major

Finish hardening pillar_weights on the report model.

_validate_weights_match_scores() now rejects duplicate names, but it still accepts negative weights, weights above 1.0, or totals that do not sum to 1.0. That leaves EvaluationReport open to ambiguous weighting schemes even though overall_score is defined as weighted output.

Proposed fix
     @model_validator(mode="after")
     def _validate_weights_match_scores(self) -> Self:
         """Ensure pillar_weights entries correspond to pillar_scores."""
         weight_names = [name for name, _ in self.pillar_weights]
         if len(weight_names) != len(set(weight_names)):
             msg = "Duplicate entries in pillar_weights"
             raise ValueError(msg)
+        invalid_weights = [
+            str(name)
+            for name, weight in self.pillar_weights
+            if weight < 0.0 or weight > 1.0
+        ]
+        if invalid_weights:
+            msg = (
+                "pillar_weights must be within [0.0, 1.0] for: "
+                f"{', '.join(invalid_weights)}"
+            )
+            raise ValueError(msg)
+        total_weight = sum(weight for _, weight in self.pillar_weights)
+        if abs(total_weight - 1.0) > 1e-9:
+            msg = f"pillar_weights must sum to 1.0, got {total_weight}"
+            raise ValueError(msg)
         score_pillars = {ps.pillar.value for ps in self.pillar_scores}
         weight_pillars = set(weight_names)
         if score_pillars != weight_pillars:
             msg = (
                 f"Pillar weight names {sorted(weight_pillars)} do not match "

As per coding guidelines, "Validate at system boundaries in Python (user input, external APIs, config files)."

Also applies to: 402-417

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/evaluation/models.py` around lines 372 - 374: The
pillar_weights field must be hardened: update the _validate_weights_match_scores
validator (used by EvaluationReport and the related scores validator) to reject
weights < 0 or > 1, enforce that the sum of all weights equals 1.0 within a
small epsilon (e.g., 1e-6), and keep the existing duplicate-name check; raise
clear ValueError messages identifying the offending pillar name or the total sum
mismatch. Ensure the validator is applied to the pillar_weights
tuple[tuple[NotBlankStr, float], ...] field (and reused for the other
weights-validated field handled by _validate_weights_match_scores) so all weight
inputs are normalized and validated at the model boundary.
src/synthorg/hr/evaluation/intelligence_strategy.py (1)

79-123: ⚠️ Potential issue | 🟠 Major

Confidence is still inflated in calibration-only runs.

data_points starts at len(context.task_records) before CI quality is proven usable. When ci_quality is disabled or overall_quality_score is missing, calibration-only scoring still gains confidence from unrelated task counts. Start from 0 and only add task records when the CI component is actually appended.

Proposed fix
         available: list[tuple[str, float, float]] = []
-        data_points = len(context.task_records)
+        data_points = 0
         calibration_drift = 0.0
         ci_score = context.snapshot.overall_quality_score
 
         if context.config.intelligence.ci_quality_enabled and ci_score is not None:
             available.append(
@@
                     context.config.intelligence.ci_quality_weight,
                     ci_score,
                 )
             )
+            data_points += len(context.task_records)
         elif context.config.intelligence.ci_quality_enabled:
             logger.debug(
                 EVAL_METRIC_SKIPPED,
                 agent_id=context.agent_id,
                 pillar=self.pillar.value,

Please add a regression test for calibration-only scoring with task records present so confidence stays tied to calibration_records.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/evaluation/intelligence_strategy.py` around lines 79 - 123,
data_points is initialized from len(context.task_records) so calibration-only
runs get inflated confidence; change initialization to data_points = 0 and only
add len(context.task_records) when you append the "ci_quality" tuple (i.e.,
inside the block where you call available.append for "ci_quality") and keep
adding len(records) for calibration_records as already done; also add a
regression test (e.g.,
test_calibration_only_confidence_tied_to_calibration_records) that creates
context with task_records present but ci_quality disabled or no
overall_quality_score and asserts returned data_points equals number of
calibration_records only.
src/synthorg/hr/evaluation/governance_strategy.py (1)

29-35: ⚠️ Potential issue | 🟠 Major

Verify the supported custom trust level doesn't fall through the unknown path.

If TrustLevel.CUSTOM is still a valid value in src/synthorg/core/enums.py, this table will log a legitimate trust state as unknown and score it with the neutral fallback. Add an explicit "custom" branch, or derive the score from the resolved custom policy instead of routing it through EVAL_TRUST_LEVEL_UNKNOWN.

Run this read-only check to confirm the upstream enum still exposes CUSTOM:

#!/bin/bash
rg -n -C2 'class TrustLevel|CUSTOM|custom' src/synthorg/core/enums.py

If that enum member is still present, please add a regression test for the legitimate custom path as well.

Also applies to: 149-169

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/evaluation/governance_strategy.py` around lines 29 - 35: The
trust-score map _TRUST_LEVEL_SCORES currently omits the "custom" key which
causes legitimate TrustLevel.CUSTOM values to hit EVAL_TRUST_LEVEL_UNKNOWN and
use the neutral fallback; update the mapping in governance_strategy.py to handle
"custom" explicitly (or compute the score from the resolved custom policy) and
update any code-path that maps TrustLevel -> score to use that branch instead of
falling back to EVAL_TRUST_LEVEL_UNKNOWN; reference symbols:
_TRUST_LEVEL_SCORES, TrustLevel.CUSTOM, EVAL_TRUST_LEVEL_UNKNOWN, and ensure you
add a regression test that constructs a TrustLevel.CUSTOM case and asserts the
expected non-neutral score/path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/synthorg/hr/evaluation/models.py`:
- Around line 65-139: InteractionFeedback currently allows records with all
ratings None and free_text blank; add an after-model validator on
InteractionFeedback (use Pydantic V2 `@model_validator(mode="after")` or
equivalent) that inspects clarity_rating, tone_rating, helpfulness_rating,
trust_rating, satisfaction_rating and free_text and raises a ValueError when
every rating is None and free_text is None or free_text.strip() == "" so at
least one numeric rating or a non-blank comment is required.

---

Duplicate comments:
In `@src/synthorg/hr/evaluation/governance_strategy.py`:
- Around line 29-35: The trust-score map _TRUST_LEVEL_SCORES currently omits the
"custom" key which causes legitimate TrustLevel.CUSTOM values to hit
EVAL_TRUST_LEVEL_UNKNOWN and use the neutral fallback; update the mapping in
governance_strategy.py to handle "custom" explicitly (or compute the score from
the resolved custom policy) and update any code-path that maps TrustLevel ->
score to use that branch instead of falling back to EVAL_TRUST_LEVEL_UNKNOWN;
reference symbols: _TRUST_LEVEL_SCORES, TrustLevel.CUSTOM,
EVAL_TRUST_LEVEL_UNKNOWN, and ensure you add a regression test that constructs a
TrustLevel.CUSTOM case and asserts the expected non-neutral score/path.

In `@src/synthorg/hr/evaluation/intelligence_strategy.py`:
- Around line 79-123: data_points is initialized from len(context.task_records)
so calibration-only runs get inflated confidence; change initialization to
data_points = 0 and only add len(context.task_records) when you append the
"ci_quality" tuple (i.e., inside the block where you call available.append for
"ci_quality") and keep adding len(records) for calibration_records as already
done; also add a regression test (e.g.,
test_calibration_only_confidence_tied_to_calibration_records) that creates
context with task_records present but ci_quality disabled or no
overall_quality_score and asserts returned data_points equals number of
calibration_records only.

In `@src/synthorg/hr/evaluation/models.py`:
- Around line 268-328: The current _validate_agent_id_consistency only compares
self.agent_id to self.snapshot.agent_id but does not reject task_records,
calibration_records, or feedback that belong to a different agent; update
_validate_agent_id_consistency in EvaluationContext to iterate over task_records
(TaskMetricRecord.agent_id), calibration_records
(LlmCalibrationRecord.agent_id), and feedback (InteractionFeedback.agent_id) and
raise a ValueError if any record.agent_id != self.agent_id (include which record
type and offending id in the message); keep the existing snapshot check and
return self at the end.
- Around line 372-374: The pillar_weights field must be hardened: update the
_validate_weights_match_scores validator (used by EvaluationReport and the
related scores validator) to reject weights < 0 or > 1, enforce that the sum of
all weights equals 1.0 within a small epsilon (e.g., 1e-6), and keep the
existing duplicate-name check; raise clear ValueError messages identifying the
offending pillar name or the total sum mismatch. Ensure the validator is applied
to the pillar_weights tuple[tuple[NotBlankStr, float], ...] field (and reused
for the other weights-validated field handled by _validate_weights_match_scores)
so all weight inputs are normalized and validated at the model boundary.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fae58238-9e61-404e-8c41-b0e7e5860cd6

📥 Commits

Reviewing files that changed from the base of the PR and between d5c53cc and 0cddc0f.

📒 Files selected for processing (5)
  • src/synthorg/hr/evaluation/governance_strategy.py
  • src/synthorg/hr/evaluation/intelligence_strategy.py
  • src/synthorg/hr/evaluation/models.py
  • tests/unit/hr/evaluation/test_evaluator.py
  • tests/unit/hr/evaluation/test_governance_strategy.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Build Backend
  • GitHub Check: Build Web
  • GitHub Check: Build Sandbox
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Dependency Review
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: No from __future__ import annotations in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use except A, B: (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use @computed_field for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)

Files:

  • src/synthorg/hr/evaluation/governance_strategy.py
  • src/synthorg/hr/evaluation/models.py
  • tests/unit/hr/evaluation/test_evaluator.py
  • tests/unit/hr/evaluation/test_governance_strategy.py
  • src/synthorg/hr/evaluation/intelligence_strategy.py
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/synthorg/**/*.py: Every Python module with business logic must have: from synthorg.observability import get_logger then logger = get_logger(__name__)
Never use import logging / logging.getLogger() / print() in Python application code; exceptions are observability/setup.py, observability/sinks.py, observability/syslog_handler.py, and observability/http_handler.py
Python logger variable name must always be logger (not _logger, not log)
Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT
Use structured logging with kwargs in Python: always logger.info(EVENT, key=value) -- never logger.info('msg %s', val)
All error paths in Python must log at WARNING or ERROR with context before raising
All state transitions in Python must log at INFO level
Use DEBUG logging level in Python for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports in Python do NOT need logging
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned Python code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases

Files:

  • src/synthorg/hr/evaluation/governance_strategy.py
  • src/synthorg/hr/evaluation/models.py
  • src/synthorg/hr/evaluation/intelligence_strategy.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: All Python test files must use @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer @pytest.mark.parametrize for testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (@given + @settings); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions

Files:

  • tests/unit/hr/evaluation/test_evaluator.py
  • tests/unit/hr/evaluation/test_governance_strategy.py
🧠 Learnings (17)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/security/**/*.py : Security module includes SecOps agent, rule engine (soft-allow/hard-deny), audit log, output scanner, risk classifier, autonomy levels (4 strategies), timeout policies.

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Use event constants from `synthorg.observability.events.<domain>` (e.g., `API_REQUEST_STARTED` from `events.api`); import directly and log with structured kwargs: `logger.info(EVENT, key=value)`, never interpolated strings

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from the domain-specific module under `synthorg.observability.events` in logging calls

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from `synthorg.observability.events.<domain>` modules (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly and use in structured logging

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions.

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, and key function entry/exit

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/security/**/*.py : Security package (security/): SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume)

Applied to files:

  • src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (via `model_copy(update=...)`) for runtime state that evolves

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/core/**/*.py : Core module must contain shared domain models, base classes, resilience config (RetryConfig, RateLimiterConfig)

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Fix all valid issues found by review agents (including pre-existing issues in surrounding code, suggestions, and adjacent findings) — never skip or defer

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to tests/**/*.py : Fix flaky tests completely and fundamentally; for timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins

Applied to files:

  • tests/unit/hr/evaluation/test_evaluator.py

Add model_validator requiring at least one rating or non-blank free_text
in InteractionFeedback. Prevents empty feedback records with no signal
from being stored. Add tests for empty feedback rejection and
free-text-only feedback acceptance.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unit/hr/evaluation/test_models.py`:
- Around line 329-339: Remove the redundant in-function import of
EvaluationContext inside test_agent_id_mismatch_raises; instead import
EvaluationContext at the module level with the other model imports so the test
uses the top-level import. Specifically, delete the local "from
synthorg.hr.evaluation.models import EvaluationContext" inside
test_agent_id_mismatch_raises and add EvaluationContext to the existing model
imports at the top of the test file (where other model classes are imported).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d5b307c2-1f81-440c-bbde-adeeb2c836d2

📥 Commits

Reviewing files that changed from the base of the PR and between 0cddc0f and fcff01f.

📒 Files selected for processing (2)
  • src/synthorg/hr/evaluation/models.py
  • tests/unit/hr/evaluation/test_models.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Build Backend
  • GitHub Check: Build Web
  • GitHub Check: Build Sandbox
  • GitHub Check: Dependency Review
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: No from __future__ import annotations in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use except A, B: (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use @computed_field for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)

Files:

  • tests/unit/hr/evaluation/test_models.py
  • src/synthorg/hr/evaluation/models.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: All Python test files must use @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer @pytest.mark.parametrize for testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (@given + @settings); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions

Files:

  • tests/unit/hr/evaluation/test_models.py
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/synthorg/**/*.py: Every Python module with business logic must have: from synthorg.observability import get_logger then logger = get_logger(__name__)
Never use import logging / logging.getLogger() / print() in Python application code; exceptions are observability/setup.py, observability/sinks.py, observability/syslog_handler.py, and observability/http_handler.py
Python logger variable name must always be logger (not _logger, not log)
Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT
Use structured logging with kwargs in Python: always logger.info(EVENT, key=value) -- never logger.info('msg %s', val)
All error paths in Python must log at WARNING or ERROR with context before raising
All state transitions in Python must log at INFO level
Use DEBUG logging level in Python for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports in Python do NOT need logging
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned Python code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases

Files:

  • src/synthorg/hr/evaluation/models.py
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (via `model_copy(update=...)`) for runtime state that evolves

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 conventions: `BaseModel`, `model_validator`, `computed_field`, `ConfigDict`. For derived values use `computed_field` instead of storing + validating redundant fields. Use `NotBlankStr` (from `core.types`) for all identifier/name fields — including optional (`NotBlankStr | None`) and tuple (`tuple[NotBlankStr, ...]`) variants — instead of manual whitespace validators.

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 BaseModel, model_validator, computed_field, ConfigDict.

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T18:42:17.990Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:42:17.990Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 conventions: `BaseModel`, `model_validator`, `computed_field`, `ConfigDict`

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/core/**/*.py : Core module must contain shared domain models, base classes, resilience config (RetryConfig, RateLimiterConfig)

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence).

Applied to files:

  • src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Fix all valid issues found by review agents (including pre-existing issues in surrounding code, suggestions, and adjacent findings) — never skip or defer

Applied to files:

  • src/synthorg/hr/evaluation/models.py
🔇 Additional comments (13)
src/synthorg/hr/evaluation/models.py (7)

338-347: Add validation for agent-scoped record collections.

The current validator only checks agent_id == snapshot.agent_id, but does not validate that task_records, calibration_records, and feedback entries all belong to the same agent. While the _build_context method in evaluator.py fetches data using consistent agent_id, the model itself does not enforce this invariant, allowing callers (e.g., tests or future code paths) to construct inconsistent contexts.
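The suggested invariant could look like the sketch below. The record types and field names are assumptions for illustration; the real `EvaluationContext` holds several record collections, of which only one is shown:

```python
# Hedged sketch of agent-scoped record validation; TaskRecord and the
# field names are illustrative assumptions, not the real models.
from pydantic import BaseModel, ConfigDict, model_validator


class TaskRecord(BaseModel):
    agent_id: str


class ContextSketch(BaseModel):
    model_config = ConfigDict(frozen=True)

    agent_id: str
    task_records: tuple[TaskRecord, ...] = ()

    @model_validator(mode="after")
    def _validate_records_scoped(self) -> "ContextSketch":
        """Every attached record must belong to the context's agent."""
        for record in self.task_records:
            if record.agent_id != self.agent_id:
                raise ValueError(
                    f"task record for agent {record.agent_id!r} does not "
                    f"match context agent {self.agent_id!r}"
                )
        return self
```

The same loop would extend to calibration and feedback collections, enforcing the invariant in the model rather than trusting every caller of `_build_context`.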


421-436: Weight bounds and normalization still not validated.

The duplicate weight names check was added (lines 424-427), but pillar_weights still accepts negative weights and weights that don't sum to 1.0. While the _build_report method uses redistribute_weights which guarantees proper bounds and normalization, the model itself permits invalid states.
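One way to close this gap, sketched under the assumption that `pillar_weights` is a plain `dict[str, float]` on the config model:

```python
# Illustrative validator for weight bounds and normalization; the model
# and field names are assumptions, not the real EvaluationReport config.
import math

from pydantic import BaseModel, ConfigDict, model_validator


class WeightsSketch(BaseModel):
    model_config = ConfigDict(frozen=True, allow_inf_nan=False)

    pillar_weights: dict[str, float]

    @model_validator(mode="after")
    def _validate_weights(self) -> "WeightsSketch":
        """Reject negative weights and weights that do not sum to 1.0."""
        if any(w < 0 for w in self.pillar_weights.values()):
            raise ValueError("pillar weights must be non-negative")
        total = sum(self.pillar_weights.values())
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"pillar weights must sum to 1.0, got {total}")
        return self
```

The `math.isclose` tolerance avoids rejecting weights that are valid up to float rounding, which matters when they come out of `redistribute_weights`.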


1-31: LGTM: Module setup and imports are correct.

The module docstring is clear, imports are appropriate, and the pattern of using ConfigDict(frozen=True, allow_inf_nan=False) aligns with coding guidelines for frozen Pydantic models. The # noqa: TC003 and # noqa: TC001 comments appropriately suppress type-checking-only import warnings for runtime-required types.


33-62: LGTM: redistribute_weights utility is well-designed.

The function correctly handles:

  • Filtering disabled items
  • Proportional redistribution
  • Zero-weight fallback to equal distribution
  • Error case when all items are disabled or input is empty

The docstring is complete with Args, Returns, and Raises sections.
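The behavior described above can be condensed into a short sketch; the actual utility in `models.py` may differ in signature, but the case analysis is the same:

```python
# Minimal sketch of the redistribute_weights behavior described above;
# the real function's signature may differ.
def redistribute_weights(
    weights: dict[str, float], enabled: set[str]
) -> dict[str, float]:
    """Drop disabled items and rescale the rest to sum to 1.0.

    Falls back to equal distribution when all enabled weights are zero;
    raises when nothing is enabled or the input is empty.
    """
    kept = {name: w for name, w in weights.items() if name in enabled}
    if not kept:
        raise ValueError("at least one weighted item must be enabled")
    total = sum(kept.values())
    if total == 0.0:
        # Zero-weight fallback: distribute equally across enabled items.
        return {name: 1.0 / len(kept) for name in kept}
    # Proportional redistribution preserves the enabled items' ratios.
    return {name: w / total for name, w in kept.items()}
```

For example, disabling `c` in `{"a": 0.5, "b": 0.3, "c": 0.2}` rescales `a` to 0.625 and `b` to 0.375, preserving their 5:3 ratio.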


140-157: LGTM: Empty feedback rejection properly implemented.

The _validate_has_signal validator correctly ensures at least one rating or non-blank free_text is present, addressing the previous review feedback about rejecting feedback records with no usable signal.


160-220: LGTM: ResilienceMetrics has comprehensive cross-field validation.

The validator correctly enforces all relational invariants:

  • failed_tasks <= total_tasks
  • recovered_tasks <= failed_tasks
  • longest_success_streak >= current_success_streak
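The three invariants listed above reduce to one `model_validator`; this sketch assumes plain integer counters and is not the real `ResilienceMetrics` definition:

```python
# Illustrative cross-field validation for the three invariants above;
# field set is an assumption based on the review comment.
from pydantic import BaseModel, ConfigDict, model_validator


class ResilienceMetricsSketch(BaseModel):
    model_config = ConfigDict(frozen=True)

    total_tasks: int
    failed_tasks: int
    recovered_tasks: int
    current_success_streak: int
    longest_success_streak: int

    @model_validator(mode="after")
    def _validate_relations(self) -> "ResilienceMetricsSketch":
        """Enforce the relational invariants between task counters."""
        if self.failed_tasks > self.total_tasks:
            raise ValueError("failed_tasks cannot exceed total_tasks")
        if self.recovered_tasks > self.failed_tasks:
            raise ValueError("recovered_tasks cannot exceed failed_tasks")
        if self.current_success_streak > self.longest_success_streak:
            raise ValueError(
                "current_success_streak cannot exceed longest_success_streak"
            )
        return self
```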

223-251: LGTM: PillarScore model is correctly constrained.

The score (0.0-10.0) and confidence (0.0-1.0) bounds are properly enforced. The breakdown field appropriately stores component scores without rigid bounds since these are informational and may have varying scales depending on the strategy.
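Bounds like these are typically declared with `Field` constraints rather than a custom validator; a minimal sketch, assuming the two numeric fields named in the comment:

```python
# Sketch of score/confidence bounds via Field constraints; the real
# PillarScore model has additional fields (e.g. breakdown) not shown.
from pydantic import BaseModel, ConfigDict, Field


class PillarScoreSketch(BaseModel):
    model_config = ConfigDict(frozen=True, allow_inf_nan=False)

    score: float = Field(ge=0.0, le=10.0)
    confidence: float = Field(ge=0.0, le=1.0)
```

`allow_inf_nan=False` means a `NaN` score is rejected at validation time rather than silently satisfying neither bound check.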

tests/unit/hr/evaluation/test_models.py (6)

1-20: LGTM: Test file setup is correct.

The pytestmark = pytest.mark.unit properly marks all tests, and imports are appropriate for testing Pydantic model validation behavior.


25-73: LGTM: Comprehensive tests for redistribute_weights.

The test suite covers all important cases:

  • Proportional preservation
  • Redistribution when items are disabled
  • Error cases (all disabled, empty)
  • Zero-weight equal distribution fallback
  • Single enabled item
  • Sum-to-one invariant

Good use of epsilon comparisons for float assertions.


78-198: LGTM: Thorough InteractionFeedback test coverage.

The tests comprehensively cover:

  • Valid construction with all/partial ratings
  • Frozen immutability
  • Parametrized bounds checking for all rating fields
  • free_text max length
  • Auto-generated unique IDs
  • Empty feedback rejection
  • Free-text-only acceptance

Good use of @pytest.mark.parametrize to avoid test duplication.


203-271: LGTM: ResilienceMetrics tests cover all validation invariants.

All cross-field validation rules are tested:

  • failed_tasks > total_tasks rejection
  • recovered_tasks > failed_tasks rejection
  • current_success_streak > longest_success_streak rejection
  • Frozen immutability

276-321: LGTM: PillarScore tests verify bounds and structure.

Good coverage of score/confidence bounds at boundary values (0.0, 10.0/1.0) and beyond, plus breakdown tuple structure verification.


345-482: LGTM: EvaluationReport tests cover key validation paths.

The tests verify:

  • Valid construction
  • Duplicate pillar score rejection
  • Unique ID generation
  • Score/confidence bounds
  • Frozen immutability
  • Agent ID consistency
  • Weight/score name mismatch

Good coverage of the model's validators.

@Aureliolo Aureliolo temporarily deployed to cloudflare-preview April 3, 2026 07:28 — with GitHub Actions Inactive
@Aureliolo Aureliolo merged commit 5e66cbd into main Apr 3, 2026
34 checks passed
@Aureliolo Aureliolo deleted the feat/hr-evaluation-framework branch April 3, 2026 07:33
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview April 3, 2026 07:33 — with GitHub Actions Inactive
Aureliolo added a commit that referenced this pull request Apr 3, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.5.8](v0.5.7...v0.5.8)
(2026-04-03)


### Features

* auto-select embedding model + fine-tuning pipeline wiring
([#999](#999))
([a4cbc4e](a4cbc4e)),
closes [#965](#965)
[#966](#966)
* ceremony scheduling batch 3 -- milestone strategy, template defaults,
department overrides
([#1019](#1019))
([321d245](321d245))
* five-pillar evaluation framework for HR performance tracking
([#1017](#1017))
([5e66cbd](5e66cbd)),
closes [#699](#699)
* populate comparison page with 53 competitor entries
([#1000](#1000))
([5cb232d](5cb232d)),
closes [#993](#993)
* throughput-adaptive and external-trigger ceremony scheduling
strategies ([#1003](#1003))
([bb5c9a4](bb5c9a4)),
closes [#973](#973)
[#974](#974)


### Bug Fixes

* eliminate backup service I/O from API test lifecycle
([#1015](#1015))
([08d9183](08d9183))
* update run_affected_tests.py to use -n 8
([#1014](#1014))
([3ee9fa7](3ee9fa7))


### Performance

* reduce pytest parallelism from -n auto to -n 8
([#1013](#1013))
([43e0707](43e0707))


### CI/CD

* bump docker/login-action from 4.0.0 to 4.1.0 in the all group
([#1027](#1027))
([e7e28ec](e7e28ec))
* bump wrangler from 4.79.0 to 4.80.0 in /.github in the all group
([#1023](#1023))
([1322a0d](1322a0d))


### Maintenance

* bump github.com/mattn/go-runewidth from 0.0.21 to 0.0.22 in /cli in
the all group
([#1024](#1024))
([b311694](b311694))
* bump https://github.com/astral-sh/ruff-pre-commit from v0.15.8 to
0.15.9 in the all group
([#1022](#1022))
([1650087](1650087))
* bump node from `71be405` to `387eebd` in /docker/sandbox in the all
group ([#1021](#1021))
([40bd2f6](40bd2f6))
* bump node from `cf38e1f` to `ad82eca` in /docker/web in the all group
([#1020](#1020))
([f05ab9f](f05ab9f))
* bump the all group in /web with 3 updates
([#1025](#1025))
([21d40d3](21d40d3))
* bump the all group with 2 updates
([#1026](#1026))
([36778de](36778de))
* enable additional eslint-react rules and fix violations
([#1028](#1028))
([80423be](80423be))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>


Development

Successfully merging this pull request may close these issues.

research: five-pillar evaluation framework for HR performance tracking

2 participants