feat: five-pillar evaluation framework for HR performance tracking #1017
Conversation
Implement structured five-pillar agent evaluation based on the InfoQ evaluation framework. Each pillar and its individual metrics can be independently enabled/disabled via EvaluationConfig.

Pillars:
- Intelligence/Accuracy: blends CI quality score with LLM calibration
- Performance/Efficiency: normalized cost, time, token metrics
- Reliability/Resilience: success rate, recovery, consistency, streaks
- Responsibility/Governance: audit compliance, trust, autonomy
- User Experience: clarity, tone, helpfulness, trust, satisfaction

New hr/evaluation/ subpackage (10 files):
- EvaluationPillar enum, PillarScore/EvaluationReport/InteractionFeedback/ResilienceMetrics/EvaluationContext models
- EvaluationConfig with per-pillar sub-configs and metric toggles
- PillarScoringStrategy protocol (single protocol, single context bag)
- Four default strategies + inline efficiency computation
- EvaluationService orchestrator with concurrent pillar scoring
- redistribute_weights() utility for weight redistribution

Also:
- Observability events (eval.* namespace)
- Design spec D16 decision in docs/design/agents.md
- 118 unit tests, mypy clean, ruff clean

Closes #699
Pre-reviewed by 12 agents, 25 findings addressed.

Source fixes:
- Extract evaluate() into 4 helper methods (was 139 lines, now <50 each)
- Extract _score_efficiency into sub-score + builder helpers
- Extract _compute_resilience_metrics into module-level helpers
- Add EVAL_CALIBRATION_DRIFT_HIGH log on drift detection
- Add EVAL_PILLAR_INSUFFICIENT_DATA logs on efficiency early returns
- Add EVAL_WEIGHTS_REDISTRIBUTED log on weight redistribution
- Add confidence kwarg to efficiency pillar log (consistency)
- Change record_feedback to sync def (no await needed)
- Use setdefault pattern for feedback dict
- Fix data_points = len(...) or 1 -> len(...) in intelligence
- Add _FULL_CONFIDENCE_DATA_POINTS named constant (replaces magic 10.0)
- Remove unreachable max(1, total_audits) guards in governance
- Add warning log for unknown trust levels in governance
- Add at-least-one-metric-enabled validators to all 5 sub-configs
- Add agent_id consistency validator to EvaluationContext
- Fix docstrings: PillarScore 'mirrors' -> 'extends', CI spelled out, resilience 'inverse' -> 'linear penalty', config module docstring

Docs fixes:
- Renumber D16 -> D24 (collision with Docker sandbox decision)
- Add D24 row to docs/architecture/decisions.md
- Update CLAUDE.md Package Structure with evaluation/
- Add evaluation event example to CLAUDE.md logging section
- Add 'evaluation' to DESIGN_SPEC.md and design/index.md descriptions

Test fixes:
- Add pytestmark = pytest.mark.unit to all 8 test files
- Add tests: shuffled records, failure-ending pattern, explicit now, CI disabled + LLM only, unknown trust level, feedback-to-evaluation end-to-end pipeline
- Fix tests for sync record_feedback and new config validators
Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Snapshot warnings: Ensure that dependencies are being submitted on PR branches. Re-running this action after a short time may resolve the issue. See the documentation for more information and troubleshooting advice.

Scanned files: None
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

- Configuration used: Repository UI
- Review profile: ASSERTIVE
- Plan: Pro
- Files selected for processing: 1

⏰ Context from checks skipped due to timeout of 90000ms (6). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).

Walkthrough: Adds a five-pillar HR evaluation framework under `hr/evaluation/`.

🚥 Pre-merge checks: ✅ 5 passed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Code Review
This pull request introduces a comprehensive "Five-Pillar Evaluation Framework" for tracking agent performance within the HR module. The framework assesses agents across Intelligence, Efficiency, Resilience, Governance, and User Experience using pluggable scoring strategies and a centralized EvaluationService. Key features include configurable metric toggles with automatic weight redistribution, structured logging for evaluation events, and detailed documentation of the design and architectural decisions. Feedback was provided regarding the QualityBlendIntelligenceStrategy to ensure that weight redistribution logic remains consistent and robust when calibration data is unavailable.
```python
# Build enabled metrics list.
metrics: list[tuple[str, float, bool]] = []
if cfg.ci_quality_enabled:
    metrics.append(("ci_quality", cfg.ci_quality_weight, True))
if cfg.llm_calibration_enabled:
    metrics.append(("llm_calibration", cfg.llm_calibration_weight, True))

if not metrics:
    return PillarScore(
        pillar=self.pillar,
        score=_NEUTRAL_SCORE,
        confidence=0.0,
        strategy_name=NotBlankStr(self.name),
        data_point_count=0,
        evaluated_at=context.now,
    )

weights = redistribute_weights(metrics)

# Compute CI quality component.
breakdown: list[tuple[str, float]] = []
weighted_sum = 0.0
data_points = len(context.task_records)

if "ci_quality" in weights:
    breakdown.append(("ci_quality", round(ci_score, 4)))
    weighted_sum += ci_score * weights["ci_quality"]

# Compute LLM calibration component.
calibration_drift = 0.0
if "llm_calibration" in weights:
    records = context.calibration_records
    if records:
        avg_llm = sum(r.llm_score for r in records) / len(records)
        breakdown.append(("llm_calibration", round(avg_llm, 4)))
        weighted_sum += avg_llm * weights["llm_calibration"]
        calibration_drift = sum(r.drift for r in records) / len(records)
        data_points += len(records)
    else:
        logger.debug(
            EVAL_METRIC_SKIPPED,
            agent_id=context.agent_id,
            pillar=self.pillar.value,
            metric="llm_calibration",
            reason="no_calibration_records",
        )
        # Redistribute to CI quality only.
        weighted_sum = ci_score
        breakdown = [("ci_quality", round(ci_score, 4))]
```
The logic for handling missing calibration records appears to be incorrect. When llm_calibration is enabled but no data is available, the code at line 124 (weighted_sum = ci_score) overwrites the previously calculated weighted ci_score. This results in the final score being the raw ci_score, rather than a correctly weighted score where ci_quality receives 100% of the weight.
This can be fixed by refactoring to follow the pattern used in other strategies (e.g., FeedbackBasedUxStrategy): first, determine which metrics have available data, then redistribute weights among only those metrics, and finally compute the weighted sum. This makes the logic more robust and consistent across strategies.
```python
# Build a list of metrics that are enabled and have data.
available: list[tuple[str, float, float]] = []  # (name, weight, score)
data_points = 0
calibration_drift = 0.0
if cfg.ci_quality_enabled:
    available.append(("ci_quality", cfg.ci_quality_weight, ci_score))
    data_points += len(context.task_records)
if cfg.llm_calibration_enabled:
    records = context.calibration_records
    if records:
        avg_llm = sum(r.llm_score for r in records) / len(records)
        available.append(("llm_calibration", cfg.llm_calibration_weight, avg_llm))
        calibration_drift = sum(r.drift for r in records) / len(records)
        data_points += len(records)
    else:
        logger.debug(
            EVAL_METRIC_SKIPPED,
            agent_id=context.agent_id,
            pillar=self.pillar.value,
            metric="llm_calibration",
            reason="no_calibration_records",
        )
if not available:
    # This case is already handled by the initial ci_score check,
    # but it's a good safeguard.
    return PillarScore(
        pillar=self.pillar,
        score=_NEUTRAL_SCORE,
        confidence=0.0,
        strategy_name=NotBlankStr(self.name),
        data_point_count=0,
        evaluated_at=context.now,
    )
# Redistribute weights among metrics with data.
weights = redistribute_weights([(name, w, True) for name, w, _ in available])
scores = {name: s for name, _, s in available}
weighted_sum = sum(scores[k] * weights[k] for k in weights)
breakdown = sorted(scores.items())
```
Pull request overview
Adds a new five-pillar evaluation subsystem under hr/ to compute on-demand, configurable evaluation reports (intelligence, efficiency, resilience, governance, UX) and integrates it with structured observability events, tests, and design docs.
Changes:
- Introduces `EvaluationService` orchestrator plus pluggable pillar strategies and frozen Pydantic models/configs under `src/synthorg/hr/evaluation/`.
- Adds evaluation-specific structured logging event constants and updates event-module discovery tests.
- Documents the new D24 decision and five-pillar framework in the design/spec docs; adds comprehensive unit tests.
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/unit/observability/test_events.py | Adds evaluation to expected observability event domain modules. |
| tests/unit/hr/evaluation/test_resilience_strategy.py | Unit tests for resilience strategy behavior and toggles. |
| tests/unit/hr/evaluation/test_models.py | Unit tests for evaluation models + redistribute_weights. |
| tests/unit/hr/evaluation/test_intelligence_strategy.py | Unit tests for intelligence strategy blending + drift behavior. |
| tests/unit/hr/evaluation/test_governance_strategy.py | Unit tests for governance strategy scoring + toggles. |
| tests/unit/hr/evaluation/test_experience_strategy.py | Unit tests for UX feedback-based scoring + redistribution. |
| tests/unit/hr/evaluation/test_evaluator.py | Unit tests for EvaluationService orchestration and pipelines. |
| tests/unit/hr/evaluation/test_enums.py | Unit tests for EvaluationPillar enum. |
| tests/unit/hr/evaluation/test_config.py | Unit tests for per-pillar configs and validation rules. |
| tests/unit/hr/evaluation/conftest.py | Shared fixtures/builders for evaluation tests. |
| tests/unit/hr/evaluation/__init__.py | Test package marker for evaluation tests. |
| src/synthorg/observability/events/evaluation.py | New structured logging event constants for evaluation domain. |
| src/synthorg/hr/evaluation/resilience_strategy.py | TaskBasedResilienceStrategy implementation. |
| src/synthorg/hr/evaluation/pillar_protocol.py | PillarScoringStrategy protocol for pluggable pillars. |
| src/synthorg/hr/evaluation/models.py | Frozen Pydantic models for context, scores, reports, feedback, metrics. |
| src/synthorg/hr/evaluation/intelligence_strategy.py | QualityBlendIntelligenceStrategy implementation. |
| src/synthorg/hr/evaluation/governance_strategy.py | AuditBasedGovernanceStrategy implementation. |
| src/synthorg/hr/evaluation/experience_strategy.py | FeedbackBasedUxStrategy implementation. |
| src/synthorg/hr/evaluation/evaluator.py | EvaluationService orchestrator + inline efficiency scoring and resilience derivations. |
| src/synthorg/hr/evaluation/enums.py | EvaluationPillar enum (five pillars). |
| src/synthorg/hr/evaluation/config.py | EvaluationConfig and per-pillar sub-configs with toggles/weights. |
| src/synthorg/hr/evaluation/__init__.py | Package docstring for the evaluation framework. |
| docs/design/index.md | Updates design index summary to include evaluation under Agents & HR. |
| docs/design/agents.md | Adds the five-pillar evaluation framework section + D24 note. |
| docs/DESIGN_SPEC.md | Updates design spec index to include evaluation in Agents & HR description. |
| docs/architecture/decisions.md | Adds decision D24 entry describing evaluation framework design choices. |
| CLAUDE.md | Updates package structure and logging examples to include evaluation domain/events. |
| if "llm_calibration" in weights: | ||
| records = context.calibration_records | ||
| if records: | ||
| avg_llm = sum(r.llm_score for r in records) / len(records) | ||
| breakdown.append(("llm_calibration", round(avg_llm, 4))) | ||
| weighted_sum += avg_llm * weights["llm_calibration"] | ||
| calibration_drift = sum(r.drift for r in records) / len(records) | ||
| data_points += len(records) | ||
| else: | ||
| logger.debug( | ||
| EVAL_METRIC_SKIPPED, | ||
| agent_id=context.agent_id, | ||
| pillar=self.pillar.value, | ||
| metric="llm_calibration", | ||
| reason="no_calibration_records", | ||
| ) | ||
| # Redistribute to CI quality only. | ||
| weighted_sum = ci_score | ||
| breakdown = [("ci_quality", round(ci_score, 4))] |
When llm_calibration is enabled but there are no calibration records, the fallback unconditionally sets weighted_sum = ci_score and breakdown = [("ci_quality", ...)]. This breaks the metric toggles: if ci_quality_enabled is false (LLM-only mode), this code still uses CI quality and emits a ci_quality breakdown. Suggestion: only fall back to CI quality if ci_quality is actually enabled; otherwise treat this as insufficient data (neutral score + 0 confidence) or skip the LLM metric and return neutral/insufficient-data for the pillar.
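The guarded fallback this comment asks for can be sketched as a standalone function; this is illustrative only -- the equal 50/50 redistribution and the fixed `_NEUTRAL_SCORE` of 0.5 are assumptions standing in for the project's `redistribute_weights` and constants.

```python
# Illustrative sketch of the toggle-respecting fallback; not the project's API.
_NEUTRAL_SCORE = 0.5  # assumed neutral value, mirroring the snippet above


def blend_fallback(
    ci_enabled: bool, ci_score: float, llm_scores: list[float]
) -> tuple[float, float]:
    """Return (score, confidence) while honoring the metric toggles."""
    if llm_scores:
        avg_llm = sum(llm_scores) / len(llm_scores)
        if ci_enabled:
            # Both metrics have data: blend with equal redistributed weights.
            return (ci_score + avg_llm) / 2, 1.0
        return avg_llm, 1.0
    if ci_enabled:
        # CI quality receives 100% of the weight after redistribution.
        return ci_score, 1.0
    # LLM-only mode with no calibration records: insufficient data.
    return _NEUTRAL_SCORE, 0.0
```

The key difference from the reviewed code is the final branch: with `ci_quality` disabled and no calibration records, the pillar reports neutral score and zero confidence instead of silently using CI quality.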
> Blends existing CI (continuous integration) signal quality score with
> LLM calibration data. When LLM calibration is disabled or unavailable,
> falls back to CI quality alone with reduced confidence.
Module docstring says the CI-only fallback happens “with reduced confidence”, but the implementation computes confidence solely from data_points and does not reduce it when LLM calibration is disabled/unavailable. Either update the docstring to match the behavior, or explicitly down-weight confidence when the LLM component is disabled or skipped due to missing calibration records.
```diff
-falls back to CI quality alone with reduced confidence.
+falls back to CI quality alone.
```
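The alternative resolution -- making the code match the docstring by explicitly down-weighting confidence -- could look like the following sketch. The 10.0 full-data threshold mirrors the `_FULL_CONFIDENCE_DATA_POINTS` constant mentioned in the PR description; the 0.5 down-weighting factor is an arbitrary illustration, not a project constant.

```python
# Illustrative only: one reading of "reduced confidence" for the CI-only path.
_FULL_CONFIDENCE_DATA_POINTS = 10.0  # from the PR description
_CI_ONLY_CONFIDENCE_FACTOR = 0.5  # assumed factor for illustration


def ci_only_confidence(data_points: int) -> float:
    """Down-weight confidence when only the CI component contributed."""
    base = min(1.0, data_points / _FULL_CONFIDENCE_DATA_POINTS)
    return base * _CI_ONLY_CONFIDENCE_FACTOR
```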
```python
base_trust = _TRUST_LEVEL_SCORES.get(trust_key, _NEUTRAL_SCORE)
if trust_key not in _TRUST_LEVEL_SCORES:
    logger.warning(
        EVAL_PILLAR_SCORED,
        agent_id=context.agent_id,
        pillar=self.pillar.value,
        warning="unknown_trust_level",
        trust_level=trust_key,
        fallback_score=_NEUTRAL_SCORE,
    )
```
The warning for an unknown trust_level logs with EVAL_PILLAR_SCORED ("eval.pillar.scored"), which makes it hard to distinguish normal scoring events from exceptional/diagnostic conditions in log queries and metrics. Suggest introducing a dedicated event constant for this condition (e.g., eval.governance.unknown_trust_level) or reusing an existing “skipped/insufficient” event if appropriate, while keeping eval.pillar.scored for the successful final score debug log.
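A dedicated event constant along the lines proposed could look like this sketch; the constant name and the stub logger are illustrative assumptions, not the project's real observability API.

```python
# Hypothetical dedicated event constant for the unknown-trust-level condition.
EVAL_GOVERNANCE_UNKNOWN_TRUST_LEVEL = "eval.governance.unknown_trust_level"


class StubLogger:
    """Minimal stand-in for the structured logger used in the snippet above."""

    def __init__(self) -> None:
        self.events: list[tuple[str, dict]] = []

    def warning(self, event: str, **kwargs: object) -> None:
        self.events.append((event, kwargs))


def warn_unknown_trust(logger: StubLogger, agent_id: str, trust_key: str) -> None:
    """Log the diagnostic condition under its own event name."""
    logger.warning(
        EVAL_GOVERNANCE_UNKNOWN_TRUST_LEVEL,
        agent_id=agent_id,
        trust_level=trust_key,
    )
```

With a distinct event name, log queries can filter on `eval.pillar.scored` for normal scoring and on the dedicated event for the exceptional case.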
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1017      +/-   ##
==========================================
+ Coverage   91.69%   91.77%   +0.08%
==========================================
  Files         658      669      +11
  Lines       36108    36739     +631
  Branches     3568     3625      +57
==========================================
+ Hits        33109    33719     +610
- Misses       2374     2389      +15
- Partials      625      631       +6
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/synthorg/hr/evaluation/evaluator.py`:
- Around line 188-252: The _resolve_enabled_pillars method is long due to the
large inline pillar_map; extract the pillar configuration to a separate helper
or constant to reduce method length. Create a new function or module-level
constant (e.g., _pillar_config or _build_pillar_map) that returns the list of
tuples currently assigned to pillar_map (using EvaluationPillar entries and
wiring in self._intelligence, self._resilience, self._governance, self._ux where
needed), then update _resolve_enabled_pillars to call that helper, keep the same
logic around enabled collection, redistribute_weights, and returns, and ensure
references to pillar_map, redistribute_weights, EvaluationPillar, and the
strategy attributes (_intelligence, _resilience, _governance, _ux) match the
existing names so behavior is unchanged.
In `@src/synthorg/hr/evaluation/experience_strategy.py`:
- Line 145: The confidence formula in evaluate_experience (or the surrounding
function in src/synthorg/hr/evaluation/experience_strategy.py) uses a magic
multiplier `3`; extract this into a module-level constant named
_FULL_CONFIDENCE_FEEDBACK_MULTIPLIER and replace the literal with that constant
in the line computing confidence (confidence = min(1.0, len(feedback) /
(cfg.min_feedback_count * _FULL_CONFIDENCE_FEEDBACK_MULTIPLIER))). Add the
constant near other strategy constants (e.g., alongside
_FULL_CONFIDENCE_DATA_POINTS) and update any imports or references accordingly.
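The constant extraction in this prompt amounts to the following sketch; the surrounding helper function is hypothetical, while the constant name and formula follow the prompt's own wording.

```python
# Replaces the magic multiplier `3` in the confidence formula, per the prompt.
_FULL_CONFIDENCE_FEEDBACK_MULTIPLIER = 3


def feedback_confidence(feedback_count: int, min_feedback_count: int) -> float:
    """Confidence ramps to 1.0 at min_feedback_count times the multiplier."""
    return min(
        1.0,
        feedback_count
        / (min_feedback_count * _FULL_CONFIDENCE_FEEDBACK_MULTIPLIER),
    )
```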
In `@src/synthorg/hr/evaluation/governance_strategy.py`:
- Around line 110-118: Replace the misleading EVAL_PILLAR_SCORED event used when
logging an unknown trust level in governance_strategy.py: add a new event
constant (e.g., EVAL_METRIC_FALLBACK or EVAL_UNKNOWN_TRUST_LEVEL) to
synthorg/observability/events/evaluation.py and update the warning call in the
method that contains the trust_key check (the block using logger.warning with
agent_id=context.agent_id and pillar=self.pillar.value) to use that new
constant; alternatively, if you prefer not to add a constant, change the
logger.warning call to a generic structured warning event name (e.g.,
"eval_metric_fallback") so the log semantically matches the fallback case.
In `@src/synthorg/hr/evaluation/intelligence_strategy.py`:
- Around line 115-125: The fallback branch for when llm_calibration is enabled
but has no records currently overwrites the previously computed weighted_sum
(and discards the redistributed weight logic); instead, update the fallback to
build the final score from the already-determined components: keep the
redistributed weight applied to the CI component, adjust the breakdown to
reflect only ("ci_quality", round(ci_score,4)) and then compute weighted_sum
once from those components (or recompute weighted_sum from the redistribution
logic) rather than assigning weighted_sum = ci_score; reference llm_calibration,
EVAL_METRIC_SKIPPED, weighted_sum, breakdown, and ci_score to locate and change
the assignment so the final score computation happens after all components are
finalized.
In `@src/synthorg/hr/evaluation/resilience_strategy.py`:
- Around line 48-156: The score method in resilience_strategy.py is too large;
split it into small helper functions to meet the <50-line rule by extracting the
logical blocks: (1) input/early-return checks into a helper
validate_and_handle_insufficient_data(context) that returns an optional
PillarScore, (2) metric derivation into
build_enabled_metrics_and_scores(context, rm, cfg) which returns enabled_metrics
and scores, (3) weighting and aggregation into compute_final_score(scores,
enabled_metrics) that calls redistribute_weights, and (4) result assembly into
assemble_pillar_score(context, final_score, scores, rm) which builds the
PillarScore and logs; keep the public async score(...) as a thin orchestrator
that calls these helpers (preserve names used: score, redistribute_weights,
PillarScore, EvaluationContext, EVAL_PILLAR_INSUFFICIENT_DATA,
EVAL_PILLAR_SCORED) so callers and tests remain valid.
- Around line 120-128: Before returning the neutral PillarScore when
enabled_metrics is empty, emit an INFO-level observability event describing the
state transition; add a call to the available logger (preferably
context.logger.info(...), falling back to module logger.info if no context
logger) immediately before the existing return in the branch that checks
enabled_metrics and include key fields (self.pillar, NotBlankStr(self.name) or
self.name, rm.total_tasks, and context.now) so the neutral outcome is traceable;
leave the returned PillarScore construction (PillarScore(...)) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c7beee32-35a7-46a6-9757-79d9a3709504
📒 Files selected for processing (27)
- CLAUDE.md
- docs/DESIGN_SPEC.md
- docs/architecture/decisions.md
- docs/design/agents.md
- docs/design/index.md
- src/synthorg/hr/evaluation/__init__.py
- src/synthorg/hr/evaluation/config.py
- src/synthorg/hr/evaluation/enums.py
- src/synthorg/hr/evaluation/evaluator.py
- src/synthorg/hr/evaluation/experience_strategy.py
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- src/synthorg/hr/evaluation/models.py
- src/synthorg/hr/evaluation/pillar_protocol.py
- src/synthorg/hr/evaluation/resilience_strategy.py
- src/synthorg/observability/events/evaluation.py
- tests/unit/hr/evaluation/__init__.py
- tests/unit/hr/evaluation/conftest.py
- tests/unit/hr/evaluation/test_config.py
- tests/unit/hr/evaluation/test_enums.py
- tests/unit/hr/evaluation/test_evaluator.py
- tests/unit/hr/evaluation/test_experience_strategy.py
- tests/unit/hr/evaluation/test_governance_strategy.py
- tests/unit/hr/evaluation/test_intelligence_strategy.py
- tests/unit/hr/evaluation/test_models.py
- tests/unit/hr/evaluation/test_resilience_strategy.py
- tests/unit/observability/test_events.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build Backend
- GitHub Check: Test (Python 3.14)
🧰 Additional context used
📓 Path-based instructions (4)
docs/**/*.md
📄 CodeRabbit inference engine (CLAUDE.md)
Documentation files in docs/ are Markdown, built with Zensical, configured in mkdocs.yml; design spec in docs/design/ (12 pages), Architecture in docs/architecture/, Roadmap in docs/roadmap/
Files:
- docs/design/index.md
- docs/DESIGN_SPEC.md
- docs/design/agents.md
- docs/architecture/decisions.md
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: No `from __future__ import annotations` in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use `except A, B:` (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use `@computed_field` for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)
Files:
- tests/unit/observability/test_events.py
- src/synthorg/hr/evaluation/__init__.py
- tests/unit/hr/evaluation/test_enums.py
- src/synthorg/hr/evaluation/enums.py
- src/synthorg/hr/evaluation/pillar_protocol.py
- tests/unit/hr/evaluation/test_config.py
- tests/unit/hr/evaluation/test_intelligence_strategy.py
- src/synthorg/observability/events/evaluation.py
- tests/unit/hr/evaluation/test_governance_strategy.py
- tests/unit/hr/evaluation/test_evaluator.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- tests/unit/hr/evaluation/test_experience_strategy.py
- tests/unit/hr/evaluation/test_resilience_strategy.py
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/experience_strategy.py
- src/synthorg/hr/evaluation/models.py
- src/synthorg/hr/evaluation/resilience_strategy.py
- tests/unit/hr/evaluation/conftest.py
- src/synthorg/hr/evaluation/evaluator.py
- tests/unit/hr/evaluation/test_models.py
- src/synthorg/hr/evaluation/config.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: All Python test files must use `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow` markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer@pytest.mark.parametrizefor testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (`@given` + `@settings`); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions
Files:
- tests/unit/observability/test_events.py
- tests/unit/hr/evaluation/test_enums.py
- tests/unit/hr/evaluation/test_config.py
- tests/unit/hr/evaluation/test_intelligence_strategy.py
- tests/unit/hr/evaluation/test_governance_strategy.py
- tests/unit/hr/evaluation/test_evaluator.py
- tests/unit/hr/evaluation/test_experience_strategy.py
- tests/unit/hr/evaluation/test_resilience_strategy.py
- tests/unit/hr/evaluation/conftest.py
- tests/unit/hr/evaluation/test_models.py
src/synthorg/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/synthorg/**/*.py: Every Python module with business logic must have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
Never use `import logging` / `logging.getLogger()` / `print()` in Python application code; exceptions are observability/setup.py, observability/sinks.py, observability/syslog_handler.py, and observability/http_handler.py
Python logger variable name must always be `logger` (not `_logger`, not `log`)
Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Use structured logging with kwargs in Python: always `logger.info(EVENT, key=value)` -- never `logger.info('msg %s', val)`
All error paths in Python must log at WARNING or ERROR with context before raising
All state transitions in Python must log at INFO level
Use DEBUG logging level in Python for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports in Python do NOT need logging
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned Python code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases
Files:
- src/synthorg/hr/evaluation/__init__.py
- src/synthorg/hr/evaluation/enums.py
- src/synthorg/hr/evaluation/pillar_protocol.py
- src/synthorg/observability/events/evaluation.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/experience_strategy.py
- src/synthorg/hr/evaluation/models.py
- src/synthorg/hr/evaluation/resilience_strategy.py
- src/synthorg/hr/evaluation/evaluator.py
- src/synthorg/hr/evaluation/config.py
🧠 Learnings (50)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to docs/design/*.md : Design spec pages: 7 pages in `docs/design/` — index, agents, organization, communication, engine, memory, operations
Applied to files:
- docs/design/index.md
- docs/DESIGN_SPEC.md
- docs/design/agents.md
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to docs/design/**/*.md : Design specification pages in `docs/design/` must be consulted before implementing features (7 pages: index, agents, organization, communication, engine, memory, operations)
Applied to files:
docs/design/index.mddocs/DESIGN_SPEC.mddocs/design/agents.mddocs/architecture/decisions.md
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to docs/design/*.md : Update the relevant `docs/design/` page when approved deviations occur to reflect the new reality
Applied to files:
- docs/design/index.md
- docs/DESIGN_SPEC.md
📚 Learning: 2026-03-14T15:43:05.601Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T15:43:05.601Z
Learning: Applies to docs/** : Docs source in docs/ (Markdown, built with Zensical); design spec in docs/design/ (7 pages: index, agents, organization, communication, engine, memory, operations)
Applied to files:
- docs/design/index.md
- docs/DESIGN_SPEC.md
- CLAUDE.md
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue — DESIGN_SPEC.md is a pointer file linking to 7 design pages (Agents, Organization, Communication, Engine, Memory, Operations)
Applied to files:
- docs/design/index.md
- docs/DESIGN_SPEC.md
- docs/design/agents.md
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue. DESIGN_SPEC.md is a pointer file linking to the 7 design pages (index, agents, organization, communication, engine, memory, operations).
Applied to files:
docs/design/index.md, docs/DESIGN_SPEC.md, docs/design/agents.md
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Applied to files:
docs/design/index.md, docs/DESIGN_SPEC.md, tests/unit/observability/test_events.py, src/synthorg/hr/evaluation/__init__.py, CLAUDE.md, src/synthorg/hr/evaluation/enums.py, src/synthorg/hr/evaluation/pillar_protocol.py, tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/intelligence_strategy.py, src/synthorg/hr/evaluation/governance_strategy.py, src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/evaluator.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
Applied to files:
docs/design/index.md, docs/DESIGN_SPEC.md, src/synthorg/hr/evaluation/__init__.py, CLAUDE.md, src/synthorg/hr/evaluation/enums.py, src/synthorg/hr/evaluation/pillar_protocol.py, src/synthorg/hr/evaluation/governance_strategy.py, src/synthorg/hr/evaluation/evaluator.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Documentation source in `docs/` (Markdown, built with Zensical). Design spec in `docs/design/` (7 pages: index, agents, organization, communication, engine, memory, operations). Architecture in `docs/architecture/` (overview, tech-stack, decision log). Roadmap in `docs/roadmap/`. Security in `docs/security.md`. Licensing in `docs/licensing.md`. Reference in `docs/reference/`. REST API reference in `docs/rest-api.md`. Library reference in `docs/api/` (auto-generated from docstrings). Custom templates in `docs/overrides/`. Config in `mkdocs.yml`.
Applied to files:
docs/design/index.md, docs/DESIGN_SPEC.md, CLAUDE.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...
Applied to files:
docs/DESIGN_SPEC.md, CLAUDE.md, src/synthorg/hr/evaluation/pillar_protocol.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/engine/**/*.py : Engine package (engine/): agent orchestration, parallel execution, task decomposition, routing, TaskEngine (centralized single-writer), task lifecycle/recovery/shutdown, workspace isolation, coordination (4 dispatchers: SAS/centralized/decentralized/context-dependent, wave execution), approval gates (escalation detection, context parking/resume), stagnation detection (ToolRepetitionDetector, corrective prompt injection), AgentRuntimeState (execution status), context budget management, conversation compaction (oldest-turns summarizer)
Applied to files:
docs/DESIGN_SPEC.md, CLAUDE.md, src/synthorg/hr/evaluation/evaluator.py
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-20T11:18:48.128Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T11:18:48.128Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : Use event name constants from domain-specific modules under `ai_company.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`.
Applied to files:
tests/unit/observability/test_events.py
📚 Learning: 2026-03-18T21:23:23.586Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-18T21:23:23.586Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from the domain-specific module under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool). Import directly from synthorg.observability.events.<domain>.
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from synthorg.observability.events domain-specific modules (e.g., PROVIDER_CALL_START from events.provider). Import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT.
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
tests/unit/observability/test_events.py
📚 Learning: 2026-03-14T15:43:05.601Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T15:43:05.601Z
Learning: Applies to src/**/*.py : Use event name constants from domain-specific modules under ai_company.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.) — import directly
Applied to files:
tests/unit/observability/test_events.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly rather than using string literals
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from the domain-specific module under `synthorg.observability.events` in logging calls
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
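The event-constant convention these learnings repeat can be sketched as follows. Only the module path `synthorg.observability.events.evaluation` and the constant name `EVAL_REPORT_COMPUTED` come from this PR; the constant's string value and the call-site wiring shown here are illustrative assumptions.

```python
# Illustrative sketch of the event-constant pattern; the "eval.report.computed"
# value is an assumption based on the PR's "eval.* namespace" description.

# --- hypothetical contents of synthorg/observability/events/evaluation.py ---
EVAL_REPORT_COMPUTED = "eval.report.computed"
EVAL_PILLAR_SCORED = "eval.pillar.scored"

# --- a call site imports the constant directly, never a string literal ---
# from synthorg.observability.events.evaluation import EVAL_REPORT_COMPUTED
# logger.info(EVAL_REPORT_COMPUTED, agent_id=agent_id, overall=overall_score)
```

The point of the rule is greppability: every emission site of an event shares one constant, so renaming an event is a one-line change.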
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Engine: Agent orchestration, execution loops, parallel execution, task decomposition, routing, task assignment, centralized single-writer task state engine (TaskEngine), task lifecycle, recovery, shutdown, workspace isolation, coordination (multi-agent pipeline: TopologyDispatcher protocol, 4 dispatchers — SAS/centralized/decentralized/context-dependent, wave execution, workspace lifecycle integration, CoordinationSectionConfig company config bridge, build_coordinator factory), coordination error classification, prompt policy validation, checkpoint recovery (checkpoint/, per-turn persistence, heartbeat detection, CheckpointRecoveryStrategy), approval gate (escalation detection, context parking/resume, EscalationInfo/ResumePayload models), stagnation detection (stagnation/, StagnationDetector protocol, ToolRepetitionDetector, dual-signal analysis, corrective prompt injection), agent runtime state (AgentRuntimeState, lightweight per-agent execution status for dashboard queries and recove...
Applied to files:
CLAUDE.md, docs/design/agents.md
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/core/**/*.py : Core module must contain shared domain models, base classes, resilience config (RetryConfig, RateLimiterConfig)
Applied to files:
CLAUDE.md, src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Settings: Runtime-editable settings persistence (DB > env > YAML > code defaults), typed definitions (9 namespaces), Fernet encryption for sensitive values, config bridge, ConfigResolver (typed composed reads for controllers), validation, registry, change notifications via message bus. Per-namespace setting definitions in definitions/ submodule (api, company, providers, memory, budget, security, coordination, observability, backup).
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Security: SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies: disabled/weighted/per-category/milestone), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume).
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`. Never use `import logging` / `logging.getLogger()` / `print()` in application code. Variable name: always `logger` (not `_logger`, not `log`).
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-17T06:43:14.114Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:43:14.114Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`. Never use `import logging` / `logging.getLogger()` / `print()` in application code. Variable name: always `logger`.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`. Never use import logging / logging.getLogger() / print() in application code.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-20T11:18:48.128Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T11:18:48.128Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have `from synthorg.observability import get_logger` followed by `logger = get_logger(__name__)`.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic must import logger via `from synthorg.observability import get_logger` and initialize with `logger = get_logger(__name__)`
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic must import `from synthorg.observability import get_logger` and define `logger = get_logger(__name__)`
Applied to files:
CLAUDE.md
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to src/synthorg/**/*.py : Every Python module with business logic must have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
Applied to files:
CLAUDE.md
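The mandatory logger preamble these learnings describe looks like the sketch below. The real `get_logger` lives in `synthorg.observability`; the shim here is a hypothetical stand-in so the two required lines can be shown in isolation.

```python
# Stand-in for synthorg.observability.get_logger (assumed to return a
# structured logger); only the two-line preamble at the bottom is the rule.
import logging


def get_logger(name: str) -> logging.LoggerAdapter:
    # Hypothetical shim: the repo's helper returns its own structured logger.
    return logging.LoggerAdapter(logging.getLogger(name), {})


# The mandated preamble at the top of every business-logic module:
logger = get_logger(__name__)
```

The variable must be named `logger` exactly, so reviewers and lint rules can rely on one spelling across the codebase.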
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (via `model_copy(update=...)`) for runtime state that evolves
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 BaseModel, model_validator, computed_field, ConfigDict.
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models with `model_copy(update=...)` for runtime state that evolves
Applied to files:
tests/unit/hr/evaluation/test_config.py
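The frozen-config / mutable-via-copy split these learnings mandate can be sketched like this; the model names are illustrative, not the repo's actual classes.

```python
# Sketch, assuming Pydantic v2: frozen models for config/identity, and a
# separate runtime-state model that evolves only via model_copy(update=...).
from pydantic import BaseModel, ConfigDict


class PillarConfig(BaseModel):
    """Immutable configuration (hypothetical example)."""
    model_config = ConfigDict(frozen=True)
    enabled: bool = True
    weight: float = 0.2


class AgentRuntimeState(BaseModel):
    """Runtime state that evolves by copying, never in-place mutation."""
    tasks_completed: int = 0


cfg = PillarConfig()
state = AgentRuntimeState()
# Advance runtime state by producing a new instance:
state = state.model_copy(update={"tasks_completed": state.tasks_completed + 1})
```

Keeping the two concerns in separate models avoids accidentally mutating configuration while updating runtime counters.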
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability package (observability/): structured logging, correlation tracking, log sinks; event constants organized by domain under observability/events/ (e.g., events.api, events.tool, events.git, events.context_budget, events.backup)
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from `synthorg.observability.events.<domain>` modules (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly and use in structured logging
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Use event constants from `synthorg.observability.events.<domain>` (e.g., `API_REQUEST_STARTED` from `events.api`); import directly and log with structured kwargs: `logger.info(EVENT, key=value)`, never interpolated strings
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to tests/**/*.py : Fix flaky tests completely and fundamentally; for timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins
Applied to files:
tests/unit/hr/evaluation/test_resilience_strategy.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to tests/**/*.py : Test markers: pytest.mark.unit, pytest.mark.integration, pytest.mark.e2e, pytest.mark.slow. Coverage: 80% minimum (enforced in CI).
Applied to files:
tests/unit/hr/evaluation/test_resilience_strategy.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to tests/**/*.py : Test markers: `pytest.mark.unit`, `pytest.mark.integration`, `pytest.mark.e2e`, `pytest.mark.slow`. Coverage: 80% minimum. Async: `asyncio_mode = 'auto'` — no manual `pytest.mark.asyncio` needed. Timeout: 30 seconds per test. Parallelism: `pytest-xdist` via `-n auto` — ALWAYS include `-n auto` when running pytest, never run tests sequentially.
Applied to files:
tests/unit/hr/evaluation/test_resilience_strategy.py
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/security/**/*.py : Security module includes SecOps agent, rule engine (soft-allow/hard-deny), audit log, output scanner, risk classifier, autonomy levels (4 strategies), timeout policies.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/security/**/*.py : Security package (security/): SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume)
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 conventions: `BaseModel`, `model_validator`, `computed_field`, `ConfigDict`. For derived values use `computed_field` instead of storing + validating redundant fields. Use `NotBlankStr` (from `core.types`) for all identifier/name fields — including optional (`NotBlankStr | None`) and tuple (`tuple[NotBlankStr, ...]`) variants — instead of manual whitespace validators.
Applied to files:
src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-15T18:42:17.990Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:42:17.990Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 conventions: `BaseModel`, `model_validator`, `computed_field`, `ConfigDict`
Applied to files:
src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
🔇 Additional comments (38)
tests/unit/observability/test_events.py (1)
200-200: LGTM! The addition of "evaluation" to the expected domain modules correctly reflects the new evaluation events domain introduced in this PR.
docs/design/index.md (1)
161-161: LGTM! The documentation update accurately reflects the new evaluation framework scope in the HR package.
docs/DESIGN_SPEC.md (1)
14-14: LGTM! The documentation update is consistent with the corresponding change in docs/design/index.md and accurately reflects the new evaluation framework.
CLAUDE.md (1)
147-147: LGTM! The addition of EVAL_REPORT_COMPUTED from events.evaluation correctly documents the new evaluation observability domain and follows the established event constant pattern.
src/synthorg/hr/evaluation/__init__.py (1)
1-8: LGTM! The module docstring clearly describes the five-pillar evaluation framework and its configuration-driven nature. As a pure docstring module, no logging is needed per coding guidelines.
docs/architecture/decisions.md (1)
82-82: LGTM! The D24 decision entry thoroughly documents the five-pillar evaluation design, including the pluggable protocol pattern, context bag approach, and configuration-driven enablement. The decision aligns with the framework's protocol-driven architecture philosophy.
tests/unit/hr/evaluation/test_enums.py (1)
1-41: LGTM! Comprehensive test coverage for the EvaluationPillar enum. The tests verify member count, values, StrEnum behavior, value-based lookup, and invalid value handling. Good use of @pytest.mark.parametrize for testing all members.
src/synthorg/hr/evaluation/enums.py (1)
1-17: LGTM! Clean and well-documented enum definition for the five evaluation pillars. As a pure data model, no logging is needed per coding guidelines. The string values follow a clear convention and align with the InfoQ five-pillar framework.
src/synthorg/hr/evaluation/pillar_protocol.py (1)
16-43: Protocol contract is clean and implementation-ready. Typed async interface and explicit pillar/name properties are clear and consistent for strategy injection.
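A strategy protocol of the kind this comment describes can be approximated as below. The attribute and method names are assumptions modeled on the review's wording (pillar/name properties, an async scoring method taking a single context bag), not the repo's exact signatures.

```python
# Hedged sketch of a PillarScoringStrategy-style protocol with one concrete
# strategy; names and the float return type are illustrative assumptions.
import asyncio
from typing import Any, Protocol


class PillarScoringStrategy(Protocol):
    pillar: str  # which pillar this strategy scores
    name: str    # human-readable strategy name

    async def score(self, context: dict[str, Any]) -> float: ...


class NeutralStrategy:
    """Trivial strategy returning a neutral score regardless of context."""
    pillar = "user_experience"
    name = "neutral"

    async def score(self, context: dict[str, Any]) -> float:
        return 0.5  # neutral fallback when no data is available


result = asyncio.run(NeutralStrategy().score({}))
```

Because `Protocol` uses structural typing, any class with matching attributes satisfies the contract without inheriting from it, which is what makes strategies pluggable via dependency injection.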
src/synthorg/observability/events/evaluation.py (1)
9-16: Event constant set looks consistent and complete for the evaluation domain.
tests/unit/hr/evaluation/test_experience_strategy.py (1)
28-161: Coverage is strong for UX scoring behavior and neutral-path handling.
docs/design/agents.md (1)
411-455: The new five-pillar design section is clear and well-aligned with the implemented architecture.
tests/unit/hr/evaluation/test_intelligence_strategy.py (1)
31-161: Intelligence strategy tests exercise the critical scoring branches and drift-confidence behavior well.
tests/unit/hr/evaluation/test_config.py (1)
18-236: Config model test coverage is comprehensive and validates key invariants effectively.
tests/unit/hr/evaluation/test_models.py (1)
24-409: Model and utility tests are thorough, especially around validation boundaries and frozen behavior.
src/synthorg/hr/evaluation/intelligence_strategy.py (1)
1-163: Well-structured strategy implementation. The QualityBlendIntelligenceStrategy correctly implements the PillarScoringStrategy protocol with proper logging, event emission, and configuration-driven behavior. The neutral score fallback for missing data and confidence reduction for calibration drift are well-considered design choices.
src/synthorg/hr/evaluation/experience_strategy.py (1)
41-164: Clean UX scoring implementation. The strategy correctly handles partial feedback (None ratings), metric toggles, and weight redistribution. The early return for insufficient feedback with appropriate logging is good defensive design.
tests/unit/hr/evaluation/test_resilience_strategy.py (1)
1-140: Comprehensive resilience strategy test coverage. Tests cover protocol properties, neutral scoring fallbacks, metric enable/disable behavior, edge cases (zero tasks, all failures), and score range expectations. The use of factory functions from conftest promotes maintainability.
tests/unit/hr/evaluation/test_governance_strategy.py (1)
1-188: Thorough governance strategy test suite. Tests cover all key scenarios: neutral fallback, score ranges for different verdict distributions, metric toggles, penalty behaviors, and the unknown trust level fallback. The comparative assertions (lines 144-146, 168-170) effectively validate penalty mechanics.
src/synthorg/hr/evaluation/governance_strategy.py (1)
39-176: Solid governance strategy implementation. The strategy correctly handles the three governance metrics (audit compliance, trust level, autonomy compliance) with proper fallbacks for missing data and configuration-driven behavior. The trust level mapping with unknown-level fallback is well-designed.
tests/unit/hr/evaluation/test_evaluator.py (1)
1-370: Comprehensive evaluator test coverage. The test suite covers orchestration (pillar enablement, weight redistribution), individual metric computation (efficiency, resilience), feedback lifecycle, and end-to-end evaluation flow. The TestComputeResilienceMetrics class thoroughly validates streak tracking, recovery detection, and quality stddev computation.
src/synthorg/hr/evaluation/config.py (1)
1-280: Well-designed evaluation configuration schema. All pillar configs consistently enforce at least one metric enabled when the pillar is active. The use of frozen=True and allow_inf_nan=False aligns with coding guidelines. Default weights within each pillar sum to 1.0, ensuring proper normalization before redistribution.
src/synthorg/hr/evaluation/models.py (4)
33-62: Clean weight redistribution utility. The redistribute_weights function correctly handles the edge cases: raises when all items are disabled, and uses equal distribution when all enabled items have zero weight. The implementation is concise and well-documented.
141-201: Robust resilience metrics validation. The ResilienceMetrics model includes comprehensive cross-field validation ensuring logical consistency: failed_tasks <= total_tasks, recovered_tasks <= failed_tasks, and longest_success_streak >= current_success_streak. This prevents invalid states from propagating through the evaluation pipeline.
331-389: Well-structured evaluation report model. The EvaluationReport correctly enforces unique pillar scores via the validator and provides a complete structure for reporting evaluation results. The use of uuid4 for default IDs ensures uniqueness across evaluations.
12-14: No action required: the TYPE_CHECKING import pattern is correct for Python 3.14. Sequence belongs in the TYPE_CHECKING block because Python 3.14's PEP 649 provides native lazy annotations. Annotations are not evaluated at runtime, so conditional imports are safe and recommended. The current code is correct and requires no changes.
> Likely an incorrect or invalid review comment.
tests/unit/hr/evaluation/conftest.py (1)
1-124: Well-structured test fixture module with consistent defaults. The factory helpers correctly satisfy the model validators (e.g., ResilienceMetrics constraints in snippet 2, EvaluationContext agent_id consistency in snippet 1). Defaults are sensible and the NotBlankStr wrapping is applied appropriately for identifier fields.
src/synthorg/hr/evaluation/evaluator.py (11)
1-52: Clean module setup with well-named constants. Good use of TYPE_CHECKING for forward references and constants for magic numbers.
55-121: Good dependency injection pattern with lazy default strategy loading. The lazy imports in _default_* methods avoid circular import issues while still providing sensible defaults.
123-155: Clean orchestration with clear step decomposition. The evaluate method follows a clear flow: build context → resolve pillars → score → assemble report.
157-186: Efficient context building with centralized data gathering.
254-272: Proper use of asyncio.TaskGroup for concurrent pillar scoring. Per coding guidelines, TaskGroup is the preferred pattern for fan-out/fan-in parallel operations.
274-314: Correct weighted aggregation with proper clamping and logging. The INFO-level log for report computation follows the guideline for state transitions.
316-360: Clean feedback storage with immutable query results. The get_feedback method returns a tuple to ensure immutability of query results.
489-522: Defensive constraint enforcement with min(recovered, failed). Line 518's min(recovered, failed) ensures the recovered_tasks <= failed_tasks constraint is always satisfied, matching the validator in ResilienceMetrics.
525-568: Correct streak and standard deviation computations. The streak logic properly tracks recoveries (success following failure), and the standard deviation uses population variance (dividing by n), which is appropriate for evaluating all observed data points rather than estimating from a sample.
406-431: No division-by-zero risk; this concern is unfounded. `EfficiencyConfig` validates all reference fields (`reference_cost_usd`, `reference_time_seconds`, `reference_tokens`) with Pydantic's `gt=0` constraint, which rejects zero and negative values at validation time. The unit tests confirm this validation is enforced. Division operations at lines 410, 420, and 428 are safe.
> Likely an incorrect or invalid review comment.
373-374: No issue here. `NotBlankStr` is `Annotated[str, ...]`, so plain string literals will correctly match `NotBlankStr` keys in dict lookups.
…nd CodeRabbit
- Fix intelligence strategy fallback bug: CI-disabled + no calibration records now returns neutral instead of using disabled CI quality score
- Fix wrong event constant EVAL_PILLAR_SCORED for unknown trust level warning in governance strategy; add EVAL_TRUST_LEVEL_UNKNOWN constant
- Move Sequence out of TYPE_CHECKING in models.py for PEP 649 safety
- Extract shared scoring constants to evaluation/constants.py, replace duplicated _MAX_SCORE/_NEUTRAL_SCORE/_FULL_CONFIDENCE_DATA_POINTS across 5 modules
- Decompose all strategy score() methods into <50-line helpers: _collect_metrics, _build_result, _neutral, _compute_confidence
- Extract _get_pillar_configs from _resolve_enabled_pillars in evaluator
- Add EvaluationReport validators: agent_id/snapshot consistency, pillar_weights/pillar_scores correspondence
- Extract magic number 3 in UX confidence to named constant
- Add logging for silent neutral returns in governance/resilience/experience strategy no-enabled-metrics paths
- Fix docstrings: intelligence strategy LLM calibration origin note, efficiency docstring second neutral path, evaluator Args pillar names, resilience metrics capping behavior, add __init__ docstring
- Fix agents.md frontmatter description to include 'evaluation'
- Add 10 tests: CI-disabled+no-calibration, EvaluationContext agent_id mismatch, EvaluationReport agent_id/weights validators, efficiency 7d fallback/neutral/clamping, parametrized all-metrics-disabled for all 5 configs
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/synthorg/hr/evaluation/evaluator.py`:
- Around line 75-91: In __init__, validate any non-None injected strategy
(intelligence_strategy, resilience_strategy, governance_strategy, ux_strategy)
by checking its strategy.pillar equals the expected EvaluationPillar for that
slot (e.g., intelligence -> EvaluationPillar.INTELLIGENCE, resilience ->
RESILIENCE, governance -> GOVERNANCE, ux -> UX); if a mismatch is found raise a
ValueError with a clear message naming the slot and actual strategy.pillar so
the failure occurs at construction time; keep using the existing _default_*()
for None inputs but still assert their .pillar if you want extra safety.
In `@src/synthorg/hr/evaluation/experience_strategy.py`:
- Around line 84-103: The sufficiency check and downstream
confidence/data_point_count must count only feedback entries that contributed at
least one enabled metric; change the flow so you first identify/filter
contributing entries (e.g., compute contributing_feedback = [f for f in feedback
if it has at least one enabled metric according to cfg] or update
_collect_metrics to return both available metrics and the per-feedback
contribution set), then use len(contributing_feedback) instead of len(feedback)
when comparing to cfg.min_feedback_count and when computing
data_point_count/confidence; finally pass the filtered contributing_feedback (or
use the contributed-count returned by _collect_metrics) into _build_result and
call _neutral when contributing count < cfg.min_feedback_count (using the same
reason keys), keeping calls to _neutral and symbols _collect_metrics,
_build_result, _neutral, and cfg.min_feedback_count consistent.
In `@src/synthorg/hr/evaluation/governance_strategy.py`:
- Around line 29-35: The trust-score map _TRUST_LEVEL_SCORES currently omits the
legitimate TrustLevel.CUSTOM value, causing agents with "custom" to be treated
as unknown (EVAL_TRUST_LEVEL_UNKNOWN) and receive the neutral fallback; update
the logic to explicitly handle "custom" by either adding a "custom" key to
_TRUST_LEVEL_SCORES or—preferably—resolve the custom trust policy and compute a
score from that policy before falling back, by updating the code paths that
reference _TRUST_LEVEL_SCORES and the evaluator that emits
EVAL_TRUST_LEVEL_UNKNOWN (use TrustLevel.CUSTOM as the discriminant and call the
custom-policy resolution routine to derive the numeric score).
- Around line 75-89: Remove the early neutral-return that blocks scoring when
total_audits == 0 and context.trust_level is None; instead call
self._collect_metrics(context, total_audits) unconditionally so that the
collector can evaluate enabled metrics (including autonomy_compliance) and
decide if there is data. After calling _collect_metrics use its returned
enabled/data_points to decide whether to return self._neutral(...) or to call
self._build_result(scores, enabled, data_points, context). Keep references to
the same methods/variables: _collect_metrics, _neutral, _build_result,
total_audits, and context.trust_level (do not add new gating logic before
calling _collect_metrics).
In `@src/synthorg/hr/evaluation/intelligence_strategy.py`:
- Around line 64-67: The current logic returns neutral when
context.snapshot.overall_quality_score is None even if CI quality is disabled or
calibration data exists; update the flow in intelligence_strategy.py so
overall_quality_score is only treated as a CI data source when
ci_quality_enabled is true and only include CI-derived points in
_collect_metrics() when ci_quality_enabled is true (i.e., stop preloading
data_points from task_records unless ci_quality_enabled), change the
early-return that calls self._neutral(reason="no_quality_score") to check that
no enabled metric has usable data before returning neutral, and add a regression
test that sets ci_quality_enabled=False with overall_quality_score=None but with
calibration records present to ensure scoring proceeds using calibration only.
In `@src/synthorg/hr/evaluation/models.py`:
- Around line 372-413: The current _validate_weights_match_scores only compares
sets and misses duplicate pillar names and invalid floats; update validation for
the pillar_weights field (and/or _validate_weights_match_scores) to (1) detect
and reject duplicate pillar names in pillar_weights (collect seen names and
raise ValueError listing duplicates), (2) ensure each weight is a real number
within [0.0, 1.0] (reject negatives or >1), and (3) ensure the weights are
normalized (sum(weights) ≈ 1.0 within a small epsilon) and raise descriptive
ValueError messages if any check fails; keep these checks in the model_validator
decorated method(s) for EvaluationReport so invalid/ambiguous weighting schemes
cannot be constructed.
- Around line 262-328: Add an additional after-model validator (e.g. def
_validate_agent_scoped_records_consistency(self) -> Self) that iterates
task_records, calibration_records, and feedback and ensures each record.agent_id
equals self.agent_id; if any mismatch is found raise ValueError with a clear
message identifying the collection and offending record (index or repr). Keep
the existing _validate_agent_id_consistency but implement this new validator to
enforce agent_id consistency across TaskMetricRecord, LlmCalibrationRecord, and
InteractionFeedback collections.
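A standalone sketch of the requested consistency check, using plain dicts in place of the record models (a hypothetical simplification of the after-model validator):

```python
def check_agent_scoped(
    agent_id: str, collections: dict[str, list[dict[str, str]]]
) -> None:
    """Raise if any record's agent_id disagrees with the context's agent_id."""
    for name, records in collections.items():
        for i, record in enumerate(records):
            if record["agent_id"] != agent_id:
                raise ValueError(
                    f"{name}[{i}] has agent_id {record['agent_id']!r}, "
                    f"expected {agent_id!r}"
                )
```

Naming both the offending collection and the record index in the error message makes a miswired context debuggable at validation time.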
In `@tests/unit/hr/evaluation/test_evaluator.py`:
- Around line 167-210: Update the two tests to force the snapshot shapes so the
fallback and neutral branches in EvaluationService._score_efficiency() are
actually exercised: in test_efficiency_7d_window_fallback() monkeypatch
PerformanceTracker.get_snapshot (or the EvaluationService.get_snapshot helper)
to return a snapshot containing only the 7d window (no 30d data), call
svc.evaluate(agent_id) and assert the efficiency pillar's score and confidence
match the known 7d-fallback expected values; in
test_efficiency_no_window_returns_neutral() patch get_snapshot to return no
windows (empty snapshot), call svc.evaluate(agent_id) and assert the efficiency
pillar's score and confidence equal the neutral values returned by
_score_efficiency() for no-data cases. Ensure you reference
EvaluationService._score_efficiency, PerformanceTracker.get_snapshot (or the
concrete get_snapshot you use), and the test functions
test_efficiency_7d_window_fallback and test_efficiency_no_window_returns_neutral
when making the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c3ec6139-1a8d-493e-bbf9-107516e8da5a
📒 Files selected for processing (14)
- docs/design/agents.md
- src/synthorg/hr/evaluation/constants.py
- src/synthorg/hr/evaluation/evaluator.py
- src/synthorg/hr/evaluation/experience_strategy.py
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- src/synthorg/hr/evaluation/models.py
- src/synthorg/hr/evaluation/resilience_strategy.py
- src/synthorg/observability/events/evaluation.py
- tests/unit/hr/evaluation/conftest.py
- tests/unit/hr/evaluation/test_config.py
- tests/unit/hr/evaluation/test_evaluator.py
- tests/unit/hr/evaluation/test_intelligence_strategy.py
- tests/unit/hr/evaluation/test_models.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Test (Python 3.14)
- GitHub Check: Build Backend
- GitHub Check: Build Sandbox
- GitHub Check: Build Web
- GitHub Check: Dependency Review
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: No `from __future__ import annotations` in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use `except A, B:` (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use `@computed_field` for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)
Files:
- src/synthorg/hr/evaluation/constants.py
- src/synthorg/observability/events/evaluation.py
- tests/unit/hr/evaluation/test_intelligence_strategy.py
- tests/unit/hr/evaluation/test_config.py
- src/synthorg/hr/evaluation/resilience_strategy.py
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- src/synthorg/hr/evaluation/experience_strategy.py
- tests/unit/hr/evaluation/test_models.py
- src/synthorg/hr/evaluation/models.py
- src/synthorg/hr/evaluation/evaluator.py
- tests/unit/hr/evaluation/conftest.py
- tests/unit/hr/evaluation/test_evaluator.py
src/synthorg/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/synthorg/**/*.py: Every Python module with business logic must have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
Never use `import logging`/`logging.getLogger()`/`print()` in Python application code; exceptions are observability/setup.py, observability/sinks.py, observability/syslog_handler.py, and observability/http_handler.py
Python logger variable name must always be `logger` (not `_logger`, not `log`)
Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Use structured logging with kwargs in Python: always `logger.info(EVENT, key=value)` -- never `logger.info('msg %s', val)`
All error paths in Python must log at WARNING or ERROR with context before raising
All state transitions in Python must log at INFO level
Use DEBUG logging level in Python for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports in Python do NOT need logging
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned Python code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases
Files:
- src/synthorg/hr/evaluation/constants.py
- src/synthorg/observability/events/evaluation.py
- src/synthorg/hr/evaluation/resilience_strategy.py
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- src/synthorg/hr/evaluation/experience_strategy.py
- src/synthorg/hr/evaluation/models.py
- src/synthorg/hr/evaluation/evaluator.py
docs/**/*.md
📄 CodeRabbit inference engine (CLAUDE.md)
Documentation files in docs/ are Markdown, built with Zensical, configured in mkdocs.yml; design spec in docs/design/ (12 pages), Architecture in docs/architecture/, Roadmap in docs/roadmap/
Files:
docs/design/agents.md
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: All Python test files must use `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow` markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer `@pytest.mark.parametrize` for testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (`@given` + `@settings`); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions
Files:
- tests/unit/hr/evaluation/test_intelligence_strategy.py
- tests/unit/hr/evaluation/test_config.py
- tests/unit/hr/evaluation/test_models.py
- tests/unit/hr/evaluation/conftest.py
- tests/unit/hr/evaluation/test_evaluator.py
🧠 Learnings (34)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to docs/design/**/*.md : Design specification pages in `docs/design/` must be consulted before implementing features (7 pages: index, agents, organization, communication, engine, memory, operations)
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to docs/design/*.md : Design spec pages: 7 pages in `docs/design/` — index, agents, organization, communication, engine, memory, operations
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue — DESIGN_SPEC.md is a pointer file linking to 7 design pages (Agents, Organization, Communication, Engine, Memory, Operations)
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-14T15:43:05.601Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T15:43:05.601Z
Learning: Applies to docs/** : Docs source in docs/ (Markdown, built with Zensical); design spec in docs/design/ (7 pages: index, agents, organization, communication, engine, memory, operations)
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue. DESIGN_SPEC.md is a pointer file linking to the 7 design pages (index, agents, organization, communication, engine, memory, operations).
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Applied to files:
docs/design/agents.md, tests/unit/hr/evaluation/test_models.py, src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/security/**/*.py : Security module includes SecOps agent, rule engine (soft-allow/hard-deny), audit log, output scanner, risk classifier, autonomy levels (4 strategies), timeout policies.
Applied to files:
docs/design/agents.md, src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/security/**/*.py : Security package (security/): SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume)
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Engine: Agent orchestration, execution loops, parallel execution, task decomposition, routing, task assignment, centralized single-writer task state engine (TaskEngine), task lifecycle, recovery, shutdown, workspace isolation, coordination (multi-agent pipeline: TopologyDispatcher protocol, 4 dispatchers — SAS/centralized/decentralized/context-dependent, wave execution, workspace lifecycle integration, CoordinationSectionConfig company config bridge, build_coordinator factory), coordination error classification, prompt policy validation, checkpoint recovery (checkpoint/, per-turn persistence, heartbeat detection, CheckpointRecoveryStrategy), approval gate (escalation detection, context parking/resume, EscalationInfo/ResumePayload models), stagnation detection (stagnation/, StagnationDetector protocol, ToolRepetitionDetector, dual-signal analysis, corrective prompt injection), agent runtime state (AgentRuntimeState, lightweight per-agent execution status for dashboard queries and recove...
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability package (observability/): structured logging, correlation tracking, log sinks; event constants organized by domain under observability/events/ (e.g., events.api, events.tool, events.git, events.context_budget, events.backup)
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from `synthorg.observability.events.<domain>` modules (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly and use in structured logging
Applied to files:
src/synthorg/observability/events/evaluation.py, src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from the domain-specific module under `synthorg.observability.events` in logging calls
Applied to files:
src/synthorg/observability/events/evaluation.py, src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-20T11:18:48.128Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T11:18:48.128Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-18T21:23:23.586Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-18T21:23:23.586Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from the domain-specific module under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool). Import directly from synthorg.observability.events.<domain>.
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from synthorg.observability.events domain-specific modules (e.g., PROVIDER_CALL_START from events.provider). Import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT.
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.
Applied to files:
src/synthorg/observability/events/evaluation.py, src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly rather than using string literals
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/synthorg/observability/events/evaluation.py, src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (via `model_copy(update=...)`) for runtime state that evolves
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Applied to files:
tests/unit/hr/evaluation/test_config.py
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Applied to files:
tests/unit/hr/evaluation/test_config.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models with `model_copy(update=...)` for runtime state that evolves
Applied to files:
tests/unit/hr/evaluation/test_config.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 BaseModel, model_validator, computed_field, ConfigDict.
Applied to files:
tests/unit/hr/evaluation/test_config.py
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Use event constants from `synthorg.observability.events.<domain>` (e.g., `API_REQUEST_STARTED` from `events.api`); import directly and log with structured kwargs: `logger.info(EVENT, key=value)`, never interpolated strings
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : All error paths must log at WARNING or ERROR with context before raising.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-17T06:43:14.114Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:43:14.114Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions. Pure data models, enums, and re-exports do NOT need logging.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
```python
def __init__(  # noqa: PLR0913
    self,
    *,
    tracker: PerformanceTracker,
    intelligence_strategy: PillarScoringStrategy | None = None,
    resilience_strategy: PillarScoringStrategy | None = None,
    governance_strategy: PillarScoringStrategy | None = None,
    ux_strategy: PillarScoringStrategy | None = None,
    config: EvaluationConfig | None = None,
) -> None:
    """Initialize the evaluation service."""
    self._tracker = tracker
    self._config = config or EvaluationConfig()
    self._intelligence = intelligence_strategy or self._default_intelligence()
    self._resilience = resilience_strategy or self._default_resilience()
    self._governance = governance_strategy or self._default_governance()
    self._ux = ux_strategy or self._default_ux()
```
Validate injected strategies against their pillar slots.
The service accepts pluggable strategies but stores them without checking strategy.pillar. A miswired dependency passed into the wrong constructor slot will fail much later during evaluate() with duplicate or mismatched pillar data instead of at construction time. Fail fast in __init__ by validating each injected strategy against the expected EvaluationPillar.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/evaluator.py` around lines 75 - 91, In __init__,
validate any non-None injected strategy (intelligence_strategy,
resilience_strategy, governance_strategy, ux_strategy) by checking its
strategy.pillar equals the expected EvaluationPillar for that slot (e.g.,
intelligence -> EvaluationPillar.INTELLIGENCE, resilience -> RESILIENCE,
governance -> GOVERNANCE, ux -> UX); if a mismatch is found raise a ValueError
with a clear message naming the slot and actual strategy.pillar so the failure
occurs at construction time; keep using the existing _default_*() for None
inputs but still assert their .pillar if you want extra safety.
```python
cfg = context.config.experience
feedback = context.feedback

if len(feedback) < cfg.min_feedback_count:
    return self._neutral(
        context,
        reason="insufficient_feedback",
        count=len(feedback),
        min_required=cfg.min_feedback_count,
    )

available = self._collect_metrics(cfg, feedback)

if not available:
    return self._neutral(
        context,
        reason="no_enabled_metrics_with_data",
    )

return self._build_result(available, feedback, context)
```
Count only contributing feedback toward UX sufficiency and confidence.
len(feedback) includes records where every enabled rating is None. With one real rating and many empty submissions, this path clears min_feedback_count, inflates data_point_count, and can push confidence close to 1.0 even though almost no UX signal was used. Base the sufficiency check, confidence, and data_point_count on feedback entries that contributed at least one enabled metric.
Also applies to: 155-167
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/experience_strategy.py` around lines 84 - 103, The
sufficiency check and downstream confidence/data_point_count must count only
feedback entries that contributed at least one enabled metric; change the flow
so you first identify/filter contributing entries (e.g., compute
contributing_feedback = [f for f in feedback if it has at least one enabled
metric according to cfg] or update _collect_metrics to return both available
metrics and the per-feedback contribution set), then use
len(contributing_feedback) instead of len(feedback) when comparing to
cfg.min_feedback_count and when computing data_point_count/confidence; finally
pass the filtered contributing_feedback (or use the contributed-count returned
by _collect_metrics) into _build_result and call _neutral when contributing
count < cfg.min_feedback_count (using the same reason keys), keeping calls to
_neutral and symbols _collect_metrics, _build_result, _neutral, and
cfg.min_feedback_count consistent.
```python
# Trust level to score mapping.
_TRUST_LEVEL_SCORES: dict[str, float] = {
    "sandboxed": 2.5,
    "restricted": 5.0,
    "standard": 7.5,
    "elevated": 10.0,
}
```
Handle the valid custom trust level explicitly.
src/synthorg/core/enums.py defines TrustLevel.CUSTOM = "custom", but this table does not. Agents with that legitimate value will emit EVAL_TRUST_LEVEL_UNKNOWN and get the neutral fallback instead of a trust score. Add a dedicated custom path, or derive the score from the resolved custom trust policy instead of routing it through the unknown-level fallback.
Also applies to: 148-163
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/governance_strategy.py` around lines 29 - 35, The
trust-score map _TRUST_LEVEL_SCORES currently omits the legitimate
TrustLevel.CUSTOM value, causing agents with "custom" to be treated as unknown
(EVAL_TRUST_LEVEL_UNKNOWN) and receive the neutral fallback; update the logic to
explicitly handle "custom" by either adding a "custom" key to
_TRUST_LEVEL_SCORES or—preferably—resolve the custom trust policy and compute a
score from that policy before falling back, by updating the code paths that
reference _TRUST_LEVEL_SCORES and the evaluator that emits
EVAL_TRUST_LEVEL_UNKNOWN (use TrustLevel.CUSTOM as the discriminant and call the
custom-policy resolution routine to derive the numeric score).
```python
if total_audits == 0 and context.trust_level is None:
    return self._neutral(context, reason="no_governance_data")

scores, enabled, data_points = self._collect_metrics(
    context,
    total_audits,
)

if not enabled:
    return self._neutral(
        context,
        reason="no_enabled_metrics_with_data",
    )

return self._build_result(scores, enabled, data_points, context)
```
Autonomy-only governance scoring is blocked by the early neutral return.
This precheck short-circuits before _collect_metrics() can score autonomy_compliance, so a configuration that enables only autonomy can never produce a real governance score. Let the collector decide whether any enabled metric has data instead of requiring audits or trust up front.
Suggested fix

```diff
-    if total_audits == 0 and context.trust_level is None:
-        return self._neutral(context, reason="no_governance_data")
-
     scores, enabled, data_points = self._collect_metrics(
         context,
         total_audits,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/governance_strategy.py` around lines 75 - 89,
Remove the early neutral-return that blocks scoring when total_audits == 0 and
context.trust_level is None; instead call self._collect_metrics(context,
total_audits) unconditionally so that the collector can evaluate enabled metrics
(including autonomy_compliance) and decide if there is data. After calling
_collect_metrics use its returned enabled/data_points to decide whether to
return self._neutral(...) or to call self._build_result(scores, enabled,
data_points, context). Keep references to the same methods/variables:
_collect_metrics, _neutral, _build_result, total_audits, and context.trust_level
(do not add new gating logic before calling _collect_metrics).
```python
ci_score = context.snapshot.overall_quality_score

if ci_score is None:
    return self._neutral(context, reason="no_quality_score")
```
Don't make CI quality a hard prerequisite—or confidence source—when it isn't used.
This path returns neutral before calibration is considered, so a calibration-only setup cannot score if overall_quality_score is missing. _collect_metrics() also preloads data_points from task_records even when ci_quality is disabled or skipped, which inflates confidence for LLM-only results. Only count CI data when the CI metric is actually included, and return neutral only when no enabled metric has usable data. Please add a regression test for ci_quality_enabled=False with overall_quality_score=None and calibration records present.
Suggested fix

```diff
-    ci_score = context.snapshot.overall_quality_score
-
-    if ci_score is None:
-        return self._neutral(context, reason="no_quality_score")
-
     available, data_points, drift = self._collect_metrics(
-        ci_score,
+        context.snapshot.overall_quality_score,
         context,
     )
     if not available:
         return self._neutral(context, reason="no_enabled_metrics")
@@
-        ci_score: float,
+        ci_score: float | None,
         context: EvaluationContext,
     ) -> tuple[list[tuple[str, float, float]], int, float]:
@@
-        data_points = len(context.task_records)
+        data_points = 0
         calibration_drift = 0.0
-        if context.config.intelligence.ci_quality_enabled:
+        if context.config.intelligence.ci_quality_enabled and ci_score is not None:
             available.append(
                 (
                     "ci_quality",
                     context.config.intelligence.ci_quality_weight,
                     ci_score,
                 )
             )
+            data_points += len(context.task_records)
+        elif context.config.intelligence.ci_quality_enabled:
+            logger.debug(
+                EVAL_METRIC_SKIPPED,
+                agent_id=context.agent_id,
+                pillar=self.pillar.value,
+                metric="ci_quality",
+                reason="no_quality_score",
+            )
```

Also applies to: 78-123
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/intelligence_strategy.py` around lines 64 - 67,
The current logic returns neutral when context.snapshot.overall_quality_score is
None even if CI quality is disabled or calibration data exists; update the flow
in intelligence_strategy.py so overall_quality_score is only treated as a CI
data source when ci_quality_enabled is true and only include CI-derived points
in _collect_metrics() when ci_quality_enabled is true (i.e., stop preloading
data_points from task_records unless ci_quality_enabled), change the
early-return that calls self._neutral(reason="no_quality_score") to check that
no enabled metric has usable data before returning neutral, and add a regression
test that sets ci_quality_enabled=False with overall_quality_score=None but with
calibration records present to ensure scoring proceeds using calibration only.
```python
agent_id: NotBlankStr = Field(description="Agent being evaluated")
now: AwareDatetime = Field(description="Reference timestamp")
config: EvaluationConfig = Field(description="Evaluation configuration")
snapshot: AgentPerformanceSnapshot = Field(
    description="Performance snapshot from the tracker",
)
task_records: tuple[TaskMetricRecord, ...] = Field(
    default=(),
    description="Raw task metric records",
)
calibration_records: tuple[LlmCalibrationRecord, ...] = Field(
    default=(),
    description="LLM calibration records",
)
feedback: tuple[InteractionFeedback, ...] = Field(
    default=(),
    description="Interaction feedback records",
)
resilience_metrics: ResilienceMetrics | None = Field(
    default=None,
    description="Derived resilience metrics",
)
audit_allow_count: int = Field(
    ge=0,
    default=0,
    description="Allowed audit entries in the window",
)
audit_deny_count: int = Field(
    ge=0,
    default=0,
    description="Denied audit entries in the window",
)
audit_escalate_count: int = Field(
    ge=0,
    default=0,
    description="Escalated audit entries in the window",
)
audit_high_risk_count: int = Field(
    ge=0,
    default=0,
    description="High-risk audit entries in the window",
)
trust_level: NotBlankStr | None = Field(
    default=None,
    description="Current trust level name",
)
trust_demotions_in_window: int = Field(
    ge=0,
    default=0,
    description="Trust demotions in the window",
)
autonomy_downgrades_in_window: int = Field(
    ge=0,
    default=0,
    description="Autonomy downgrades in the window",
)

@model_validator(mode="after")
def _validate_agent_id_consistency(self) -> Self:
    """Ensure context agent_id matches snapshot agent_id."""
    if self.agent_id != self.snapshot.agent_id:
        msg = (
            f"Context agent_id ({self.agent_id}) does not match "
            f"snapshot agent_id ({self.snapshot.agent_id})"
        )
        raise ValueError(msg)
    return self
```
Reject mixed-agent records in EvaluationContext.
The model only checks snapshot.agent_id. A caller can still build a context for one agent that carries agent-scoped records from another agent, and the strategies will score that foreign data as if it belonged to the current agent. Add an after-validator that enforces agent_id consistency across the agent-scoped collections in this model.
As per coding guidelines, "Validate at system boundaries in Python (user input, external APIs, config files)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/models.py` around lines 262 - 328, Add an
additional after-model validator (e.g. def
_validate_agent_scoped_records_consistency(self) -> Self) that iterates
task_records, calibration_records, and feedback and ensures each record.agent_id
equals self.agent_id; if any mismatch is found raise ValueError with a clear
message identifying the collection and offending record (index or repr). Keep
the existing _validate_agent_id_consistency but implement this new validator to
enforce agent_id consistency across TaskMetricRecord, LlmCalibrationRecord, and
InteractionFeedback collections.
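The cross-collection check can be sketched as a plain function. In the real model it would live in a Pydantic `@model_validator(mode="after")`; the dict-based record shape here is a simplification:

```python
def validate_agent_scoped_records(
    agent_id: str,
    collections: dict[str, list[dict[str, str]]],
) -> None:
    """Raise if any agent-scoped record belongs to a different agent."""
    for name, records in collections.items():
        for index, record in enumerate(records):
            if record["agent_id"] != agent_id:
                msg = (
                    f"{name}[{index}] agent_id ({record['agent_id']!r}) "
                    f"does not match context agent_id ({agent_id!r})"
                )
                raise ValueError(msg)


# Matching records validate silently.
validate_agent_scoped_records(
    "agent-1",
    {"task_records": [{"agent_id": "agent-1"}]},
)

# A foreign record is rejected, naming the collection and index.
try:
    validate_agent_scoped_records(
        "agent-1",
        {"feedback": [{"agent_id": "agent-2"}]},
    )
except ValueError as exc:
    error_message = str(exc)
```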
```python
pillar_weights: tuple[tuple[NotBlankStr, float], ...] = Field(
    description="Applied weights as (pillar_name, weight) pairs",
)

@model_validator(mode="after")
def _validate_unique_pillars(self) -> Self:
    """Ensure pillar scores have unique pillar names."""
    names = [ps.pillar for ps in self.pillar_scores]
    if len(names) != len(set(names)):
        seen: set[EvaluationPillar] = set()
        dupes: list[str] = []
        for n in names:
            if n in seen:
                dupes.append(n.value)
            seen.add(n)
        msg = f"Duplicate pillar scores: {', '.join(dupes)}"
        raise ValueError(msg)
    return self

@model_validator(mode="after")
def _validate_agent_id_consistency(self) -> Self:
    """Ensure report agent_id matches snapshot agent_id."""
    if self.agent_id != self.snapshot.agent_id:
        msg = (
            f"Report agent_id ({self.agent_id}) does not match "
            f"snapshot agent_id ({self.snapshot.agent_id})"
        )
        raise ValueError(msg)
    return self

@model_validator(mode="after")
def _validate_weights_match_scores(self) -> Self:
    """Ensure pillar_weights entries correspond to pillar_scores."""
    score_pillars = {ps.pillar.value for ps in self.pillar_scores}
    weight_pillars = {name for name, _ in self.pillar_weights}
    if score_pillars != weight_pillars:
        msg = (
            f"Pillar weight names {sorted(weight_pillars)} do not match "
            f"pillar score names {sorted(score_pillars)}"
        )
        raise ValueError(msg)
    return self
```
pillar_weights validation is too weak for a public report model.
_validate_weights_match_scores() compares sets only, so duplicate entries like (("intelligence", 0.5), ("intelligence", 0.5)) still validate as long as the score set is {"intelligence"}. The field also accepts unconstrained floats, so negative or >1 weights can slip through. Reject duplicate names and enforce bounded, normalized weights here so EvaluationReport cannot represent an ambiguous weighting scheme.
As per coding guidelines, "Validate at system boundaries in Python (user input, external APIs, config files)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/models.py` around lines 372 - 413, The current
_validate_weights_match_scores only compares sets and misses duplicate pillar
names and invalid floats; update validation for the pillar_weights field (and/or
_validate_weights_match_scores) to (1) detect and reject duplicate pillar names
in pillar_weights (collect seen names and raise ValueError listing duplicates),
(2) ensure each weight is a real number within [0.0, 1.0] (reject negatives or
>1), and (3) ensure the weights are normalized (sum(weights) ≈ 1.0 within a
small epsilon) and raise descriptive ValueError messages if any check fails;
keep these checks in the model_validator decorated method(s) for
EvaluationReport so invalid/ambiguous weighting schemes cannot be constructed.
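The three checks can be sketched as one standalone function; the epsilon and message wording are illustrative, and in the model they would sit inside the existing validator:

```python
def validate_pillar_weights(
    pillar_weights: tuple[tuple[str, float], ...],
    epsilon: float = 1e-9,
) -> None:
    """Reject duplicates, out-of-range weights, and non-normalized totals."""
    names = [name for name, _ in pillar_weights]
    dupes = {n for n in names if names.count(n) > 1}
    if dupes:
        raise ValueError(f"Duplicate pillar_weights entries: {sorted(dupes)}")
    bad = [name for name, w in pillar_weights if not 0.0 <= w <= 1.0]
    if bad:
        raise ValueError(f"pillar_weights must be within [0.0, 1.0] for: {bad}")
    total = sum(w for _, w in pillar_weights)
    if abs(total - 1.0) > epsilon:
        raise ValueError(f"pillar_weights must sum to 1.0, got {total}")


# Well-formed weights validate silently.
validate_pillar_weights((("intelligence", 0.6), ("experience", 0.4)))

# Duplicate names are rejected even though the name *set* matches.
try:
    validate_pillar_weights((("intelligence", 0.5), ("intelligence", 0.5)))
except ValueError as exc:
    duplicate_error = str(exc)
```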
- Fix intelligence strategy: CI quality is no longer a hard prerequisite; calibration-only mode works when overall_quality_score is None
- Fix governance strategy: autonomy-only scoring no longer blocked by the early neutral return (total_audits==0 && trust_level==None)
- Strengthen EvaluationReport pillar_weights validator: reject duplicate weight entries before set comparison
- Fix efficiency tests to actually test 7d fallback and neutral paths using direct _score_efficiency calls with custom snapshots
- Update governance no-data test to disable autonomy for true neutral
Actionable comments posted: 1
♻️ Duplicate comments (4)
src/synthorg/hr/evaluation/models.py (2)
268-328: ⚠️ Potential issue | 🟠 Major

Reject foreign-agent records in `EvaluationContext`.

The current validator only ties `agent_id` to `snapshot.agent_id`. A caller can still pass `task_records`, `calibration_records`, or `feedback` belonging to another agent, and the strategies will score that foreign data as if it were local.

Proposed fix
```diff
     @model_validator(mode="after")
     def _validate_agent_id_consistency(self) -> Self:
         """Ensure context agent_id matches snapshot agent_id."""
         if self.agent_id != self.snapshot.agent_id:
             msg = (
                 f"Context agent_id ({self.agent_id}) does not match "
                 f"snapshot agent_id ({self.snapshot.agent_id})"
             )
             raise ValueError(msg)
         return self
+
+    @model_validator(mode="after")
+    def _validate_agent_scoped_records(self) -> Self:
+        """Ensure agent-scoped collections match the context agent."""
+        collections = (
+            ("task_records", self.task_records),
+            ("calibration_records", self.calibration_records),
+            ("feedback", self.feedback),
+        )
+        for collection_name, records in collections:
+            for index, record in enumerate(records):
+                if record.agent_id != self.agent_id:
+                    msg = (
+                        f"{collection_name}[{index}] agent_id "
+                        f"({record.agent_id}) does not match "
+                        f"context agent_id ({self.agent_id})"
+                    )
+                    raise ValueError(msg)
+        return self
```

As per coding guidelines, "Validate at system boundaries in Python (user input, external APIs, config files)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/synthorg/hr/evaluation/models.py` around lines 268 - 328, The current _validate_agent_id_consistency only compares self.agent_id to self.snapshot.agent_id but does not reject task_records, calibration_records, or feedback that belong to a different agent; update _validate_agent_id_consistency in EvaluationContext to iterate over task_records (TaskMetricRecord.agent_id), calibration_records (LlmCalibrationRecord.agent_id), and feedback (InteractionFeedback.agent_id) and raise a ValueError if any record.agent_id != self.agent_id (include which record type and offending id in the message); keep the existing snapshot check and return self at the end.
372-374: ⚠️ Potential issue | 🟠 Major

Finish hardening `pillar_weights` on the report model.

`_validate_weights_match_scores()` now rejects duplicate names, but it still accepts negative weights, weights above 1.0, or totals that do not sum to 1.0. That leaves `EvaluationReport` open to ambiguous weighting schemes even though `overall_score` is defined as weighted output.

Proposed fix
```diff
 @model_validator(mode="after")
 def _validate_weights_match_scores(self) -> Self:
     """Ensure pillar_weights entries correspond to pillar_scores."""
     weight_names = [name for name, _ in self.pillar_weights]
     if len(weight_names) != len(set(weight_names)):
         msg = "Duplicate entries in pillar_weights"
         raise ValueError(msg)
+    invalid_weights = [
+        str(name)
+        for name, weight in self.pillar_weights
+        if weight < 0.0 or weight > 1.0
+    ]
+    if invalid_weights:
+        msg = (
+            "pillar_weights must be within [0.0, 1.0] for: "
+            f"{', '.join(invalid_weights)}"
+        )
+        raise ValueError(msg)
+    total_weight = sum(weight for _, weight in self.pillar_weights)
+    if abs(total_weight - 1.0) > 1e-9:
+        msg = f"pillar_weights must sum to 1.0, got {total_weight}"
+        raise ValueError(msg)
     score_pillars = {ps.pillar.value for ps in self.pillar_scores}
     weight_pillars = set(weight_names)
     if score_pillars != weight_pillars:
         msg = (
             f"Pillar weight names {sorted(weight_pillars)} do not match "
```

As per coding guidelines, "Validate at system boundaries in Python (user input, external APIs, config files)."
Also applies to: 402-417
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/synthorg/hr/evaluation/models.py` around lines 372 - 374, The pillar_weights field must be hardened: update the _validate_weights_match_scores validator (used by EvaluationReport and the related scores validator) to reject weights < 0 or > 1, enforce that the sum of all weights equals 1.0 within a small epsilon (e.g., 1e-6), and keep the existing duplicate-name check; raise clear ValueError messages identifying the offending pillar name or the total sum mismatch. Ensure the validator is applied to the pillar_weights tuple[tuple[NotBlankStr, float], ...] field (and reused for the other weights-validated field handled by _validate_weights_match_scores) so all weight inputs are normalized and validated at the model boundary.src/synthorg/hr/evaluation/intelligence_strategy.py (1)
79-123: ⚠️ Potential issue | 🟠 Major

Confidence is still inflated in calibration-only runs.

`data_points` starts at `len(context.task_records)` before CI quality is proven usable. When `ci_quality` is disabled or `overall_quality_score` is missing, calibration-only scoring still gains confidence from unrelated task counts. Start from 0 and only add task records when the CI component is actually appended.

Proposed fix
```diff
     available: list[tuple[str, float, float]] = []
-    data_points = len(context.task_records)
+    data_points = 0
     calibration_drift = 0.0
     ci_score = context.snapshot.overall_quality_score
     if context.config.intelligence.ci_quality_enabled and ci_score is not None:
         available.append(
@@
                 context.config.intelligence.ci_quality_weight,
                 ci_score,
             )
         )
+        data_points += len(context.task_records)
     elif context.config.intelligence.ci_quality_enabled:
         logger.debug(
             EVAL_METRIC_SKIPPED,
             agent_id=context.agent_id,
             pillar=self.pillar.value,
```
calibration_records.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/synthorg/hr/evaluation/intelligence_strategy.py` around lines 79 - 123, data_points is initialized from len(context.task_records) so calibration-only runs get inflated confidence; change initialization to data_points = 0 and only add len(context.task_records) when you append the "ci_quality" tuple (i.e., inside the block where you call available.append for "ci_quality") and keep adding len(records) for calibration_records as already done; also add a regression test (e.g., test_calibration_only_confidence_tied_to_calibration_records) that creates context with task_records present but ci_quality disabled or no overall_quality_score and asserts returned data_points equals number of calibration_records only.src/synthorg/hr/evaluation/governance_strategy.py (1)
29-35: ⚠️ Potential issue | 🟠 Major

Verify the supported `custom` trust level doesn't fall through the unknown path.

If `TrustLevel.CUSTOM` is still a valid value in `src/synthorg/core/enums.py`, this table will log a legitimate trust state as unknown and score it with the neutral fallback. Add an explicit `"custom"` branch, or derive the score from the resolved custom policy instead of routing it through `EVAL_TRUST_LEVEL_UNKNOWN`.

Run this read-only check to confirm the upstream enum still exposes `CUSTOM`:

```bash
#!/bin/bash
rg -n -C2 'class TrustLevel|CUSTOM|custom' src/synthorg/core/enums.py
```

If that enum member is still present, please add a regression test for the legitimate `custom` path as well.

Also applies to: 149-169
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/synthorg/hr/evaluation/governance_strategy.py` around lines 29 - 35, The trust-score map _TRUST_LEVEL_SCORES currently omits the "custom" key which causes legitimate TrustLevel.CUSTOM values to hit EVAL_TRUST_LEVEL_UNKNOWN and use the neutral fallback; update the mapping in governance_strategy.py to handle "custom" explicitly (or compute the score from the resolved custom policy) and update any code-path that maps TrustLevel -> score to use that branch instead of falling back to EVAL_TRUST_LEVEL_UNKNOWN; reference symbols: _TRUST_LEVEL_SCORES, TrustLevel.CUSTOM, EVAL_TRUST_LEVEL_UNKNOWN, and ensure you add a regression test that constructs a TrustLevel.CUSTOM case and asserts the expected non-neutral score/path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/synthorg/hr/evaluation/models.py`:
- Around line 65-139: InteractionFeedback currently allows records with all
ratings None and free_text blank; add an after-model validator on
InteractionFeedback (use Pydantic V2 `@model_validator`(mode="after") or
equivalent) that inspects clarity_rating, tone_rating, helpfulness_rating,
trust_rating, satisfaction_rating and free_text and raises a ValueError when
every rating is None and free_text is None or free_text.strip() == "" so at
least one numeric rating or a non-blank comment is required.
---
Duplicate comments:
In `@src/synthorg/hr/evaluation/governance_strategy.py`:
- Around line 29-35: The trust-score map _TRUST_LEVEL_SCORES currently omits the
"custom" key which causes legitimate TrustLevel.CUSTOM values to hit
EVAL_TRUST_LEVEL_UNKNOWN and use the neutral fallback; update the mapping in
governance_strategy.py to handle "custom" explicitly (or compute the score from
the resolved custom policy) and update any code-path that maps TrustLevel ->
score to use that branch instead of falling back to EVAL_TRUST_LEVEL_UNKNOWN;
reference symbols: _TRUST_LEVEL_SCORES, TrustLevel.CUSTOM,
EVAL_TRUST_LEVEL_UNKNOWN, and ensure you add a regression test that constructs a
TrustLevel.CUSTOM case and asserts the expected non-neutral score/path.
In `@src/synthorg/hr/evaluation/intelligence_strategy.py`:
- Around line 79-123: data_points is initialized from len(context.task_records)
so calibration-only runs get inflated confidence; change initialization to
data_points = 0 and only add len(context.task_records) when you append the
"ci_quality" tuple (i.e., inside the block where you call available.append for
"ci_quality") and keep adding len(records) for calibration_records as already
done; also add a regression test (e.g.,
test_calibration_only_confidence_tied_to_calibration_records) that creates
context with task_records present but ci_quality disabled or no
overall_quality_score and asserts returned data_points equals number of
calibration_records only.
In `@src/synthorg/hr/evaluation/models.py`:
- Around line 268-328: The current _validate_agent_id_consistency only compares
self.agent_id to self.snapshot.agent_id but does not reject task_records,
calibration_records, or feedback that belong to a different agent; update
_validate_agent_id_consistency in EvaluationContext to iterate over task_records
(TaskMetricRecord.agent_id), calibration_records
(LlmCalibrationRecord.agent_id), and feedback (InteractionFeedback.agent_id) and
raise a ValueError if any record.agent_id != self.agent_id (include which record
type and offending id in the message); keep the existing snapshot check and
return self at the end.
- Around line 372-374: The pillar_weights field must be hardened: update the
_validate_weights_match_scores validator (used by EvaluationReport and the
related scores validator) to reject weights < 0 or > 1, enforce that the sum of
all weights equals 1.0 within a small epsilon (e.g., 1e-6), and keep the
existing duplicate-name check; raise clear ValueError messages identifying the
offending pillar name or the total sum mismatch. Ensure the validator is applied
to the pillar_weights tuple[tuple[NotBlankStr, float], ...] field (and reused
for the other weights-validated field handled by _validate_weights_match_scores)
so all weight inputs are normalized and validated at the model boundary.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: fae58238-9e61-404e-8c41-b0e7e5860cd6
📒 Files selected for processing (5)
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- src/synthorg/hr/evaluation/models.py
- tests/unit/hr/evaluation/test_evaluator.py
- tests/unit/hr/evaluation/test_governance_strategy.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Build Backend
- GitHub Check: Build Web
- GitHub Check: Build Sandbox
- GitHub Check: Test (Python 3.14)
- GitHub Check: Dependency Review
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: No `from __future__ import annotations` in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use `except A, B:` (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use `@computed_field` for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)
Files:
src/synthorg/hr/evaluation/governance_strategy.py
src/synthorg/hr/evaluation/models.py
tests/unit/hr/evaluation/test_evaluator.py
tests/unit/hr/evaluation/test_governance_strategy.py
src/synthorg/hr/evaluation/intelligence_strategy.py
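As an illustration of the `@computed_field` guideline above, a minimal sketch of the derived-total pattern; the field names are assumed from the `TokenUsage.total_tokens` example, not taken from the actual source:

```python
from pydantic import BaseModel, ConfigDict, computed_field


class TokenUsage(BaseModel):
    """Derive total_tokens on access instead of storing a redundant field."""

    model_config = ConfigDict(frozen=True, allow_inf_nan=False)

    prompt_tokens: int
    completion_tokens: int

    @computed_field
    @property
    def total_tokens(self) -> int:
        # Computed on access and included in model_dump()/serialization.
        return self.prompt_tokens + self.completion_tokens
```

Because the total is derived, there is no stored field that can drift out of sync with the two inputs.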
src/synthorg/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/synthorg/**/*.py: Every Python module with business logic must have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
Never use `import logging`/`logging.getLogger()`/`print()` in Python application code; exceptions are observability/setup.py, observability/sinks.py, observability/syslog_handler.py, and observability/http_handler.py
Python logger variable name must always be `logger` (not `_logger`, not `log`)
Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Use structured logging with kwargs in Python: always `logger.info(EVENT, key=value)` -- never `logger.info('msg %s', val)`
All error paths in Python must log at WARNING or ERROR with context before raising
All state transitions in Python must log at INFO level
Use DEBUG logging level in Python for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports in Python do NOT need logging
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned Python code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases
Files:
src/synthorg/hr/evaluation/governance_strategy.py
src/synthorg/hr/evaluation/models.py
src/synthorg/hr/evaluation/intelligence_strategy.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: All Python test files must use `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow` markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer `@pytest.mark.parametrize` for testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (`@given` + `@settings`); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions
Files:
tests/unit/hr/evaluation/test_evaluator.py
tests/unit/hr/evaluation/test_governance_strategy.py
🧠 Learnings (17)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/security/**/*.py : Security module includes SecOps agent, rule engine (soft-allow/hard-deny), audit log, output scanner, risk classifier, autonomy levels (4 strategies), timeout policies.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Use event constants from `synthorg.observability.events.<domain>` (e.g., `API_REQUEST_STARTED` from `events.api`); import directly and log with structured kwargs: `logger.info(EVENT, key=value)`, never interpolated strings
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from the domain-specific module under `synthorg.observability.events` in logging calls
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from `synthorg.observability.events.<domain>` modules (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly and use in structured logging
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : All error paths must log at WARNING or ERROR with context before raising.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, and key function entry/exit
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/security/**/*.py : Security package (security/): SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume)
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (via `model_copy(update=...)`) for runtime state that evolves
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/core/**/*.py : Core module must contain shared domain models, base classes, resilience config (RetryConfig, RateLimiterConfig)
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Fix all valid issues found by review agents (including pre-existing issues in surrounding code, suggestions, and adjacent findings) — never skip or defer
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to tests/**/*.py : Fix flaky tests completely and fundamentally; for timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins
Applied to files:
tests/unit/hr/evaluation/test_evaluator.py
Add model_validator requiring at least one rating or non-blank free_text in InteractionFeedback. Prevents empty feedback records with no signal from being stored. Add tests for empty feedback rejection and free-text-only feedback acceptance.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/unit/hr/evaluation/test_models.py`:
- Around line 329-339: Remove the redundant in-function import of
EvaluationContext inside test_agent_id_mismatch_raises; instead import
EvaluationContext at the module level with the other model imports so the test
uses the top-level import. Specifically, delete the local "from
synthorg.hr.evaluation.models import EvaluationContext" inside
test_agent_id_mismatch_raises and add EvaluationContext to the existing model
imports at the top of the test file (where other model classes are imported).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d5b307c2-1f81-440c-bbde-adeeb2c836d2
📒 Files selected for processing (2)
src/synthorg/hr/evaluation/models.py
tests/unit/hr/evaluation/test_models.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Test (Python 3.14)
- GitHub Check: Build Backend
- GitHub Check: Build Web
- GitHub Check: Build Sandbox
- GitHub Check: Dependency Review
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: No `from __future__ import annotations` in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use `except A, B:` (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use `@computed_field` for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)
Files:
tests/unit/hr/evaluation/test_models.py
src/synthorg/hr/evaluation/models.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: All Python test files must use `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow` markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer `@pytest.mark.parametrize` for testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (`@given` + `@settings`); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions
Files:
tests/unit/hr/evaluation/test_models.py
src/synthorg/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/synthorg/**/*.py: Every Python module with business logic must have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
Never use `import logging`/`logging.getLogger()`/`print()` in Python application code; exceptions are observability/setup.py, observability/sinks.py, observability/syslog_handler.py, and observability/http_handler.py
Python logger variable name must always be `logger` (not `_logger`, not `log`)
Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Use structured logging with kwargs in Python: always `logger.info(EVENT, key=value)` -- never `logger.info('msg %s', val)`
All error paths in Python must log at WARNING or ERROR with context before raising
All state transitions in Python must log at INFO level
Use DEBUG logging level in Python for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports in Python do NOT need logging
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned Python code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases
Files:
src/synthorg/hr/evaluation/models.py
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (via `model_copy(update=...)`) for runtime state that evolves
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 conventions: `BaseModel`, `model_validator`, `computed_field`, `ConfigDict`. For derived values use `computed_field` instead of storing + validating redundant fields. Use `NotBlankStr` (from `core.types`) for all identifier/name fields — including optional (`NotBlankStr | None`) and tuple (`tuple[NotBlankStr, ...]`) variants — instead of manual whitespace validators.
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 BaseModel, model_validator, computed_field, ConfigDict.
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T18:42:17.990Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:42:17.990Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 conventions: `BaseModel`, `model_validator`, `computed_field`, `ConfigDict`
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/core/**/*.py : Core module must contain shared domain models, base classes, resilience config (RetryConfig, RateLimiterConfig)
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence).
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Fix all valid issues found by review agents (including pre-existing issues in surrounding code, suggestions, and adjacent findings) — never skip or defer
Applied to files:
src/synthorg/hr/evaluation/models.py
🔇 Additional comments (13)
src/synthorg/hr/evaluation/models.py (7)
338-347: Add validation for agent-scoped record collections. The current validator only checks
`agent_id == snapshot.agent_id`, but does not validate that `task_records`, `calibration_records`, and `feedback` entries all belong to the same agent. While the `_build_context` method in `evaluator.py` fetches data using a consistent `agent_id`, the model itself does not enforce this invariant, allowing callers (e.g., tests or future code paths) to construct inconsistent contexts.
421-436: Weight bounds and normalization still not validated. The duplicate weight names check was added (lines 424-427), but
`pillar_weights` still accepts negative weights and weights that don't sum to 1.0. While the `_build_report` method uses `redistribute_weights`, which guarantees proper bounds and normalization, the model itself permits invalid states.
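A minimal sketch of the validation this comment is asking for, assuming `pillar_weights` is a plain `dict[str, float]` field; the model is reduced to that one field here, so this is a shape illustration rather than the actual `EvaluationReport`:

```python
from pydantic import BaseModel, ConfigDict, model_validator


class EvaluationReport(BaseModel):
    """Sketch: reject negative weights and weights not summing to 1.0."""

    model_config = ConfigDict(frozen=True, allow_inf_nan=False)

    pillar_weights: dict[str, float]

    @model_validator(mode="after")
    def _validate_weight_bounds(self) -> "EvaluationReport":
        if any(w < 0.0 for w in self.pillar_weights.values()):
            raise ValueError("pillar weights must be non-negative")
        total = sum(self.pillar_weights.values())
        # Tolerate float rounding while still catching real normalization bugs.
        if self.pillar_weights and abs(total - 1.0) > 1e-6:
            raise ValueError("pillar weights must sum to 1.0")
        return self
```

With this in place, the model rejects invalid states at the boundary even when callers bypass `redistribute_weights`.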
1-31: LGTM: Module setup and imports are correct. The module docstring is clear, imports are appropriate, and the pattern of using
`ConfigDict(frozen=True, allow_inf_nan=False)` aligns with coding guidelines for frozen Pydantic models. The `# noqa: TC003` and `# noqa: TC001` comments appropriately suppress type-checking-only import warnings for runtime-required types.
33-62: LGTM: `redistribute_weights` utility is well-designed. The function correctly handles:
- Filtering disabled items
- Proportional redistribution
- Zero-weight fallback to equal distribution
- Error case when all items are disabled or input is empty
The docstring is complete with Args, Returns, and Raises sections.
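Based only on the behaviors listed above, the utility's shape can be sketched as follows; the actual signature in `models.py` may differ (taking a weights mapping plus an enabled-name set is an assumption):

```python
from collections.abc import Mapping


def redistribute_weights(
    weights: Mapping[str, float], enabled: set[str]
) -> dict[str, float]:
    """Drop disabled items and renormalize remaining weights to sum to 1.0."""
    kept = {name: w for name, w in weights.items() if name in enabled}
    if not kept:
        # All items disabled (or empty input): no valid distribution exists.
        raise ValueError("at least one weighted item must be enabled")
    total = sum(kept.values())
    if total == 0.0:
        # Every enabled item carries zero weight: fall back to equal shares.
        return {name: 1.0 / len(kept) for name in kept}
    # Proportional redistribution preserves the relative weight ratios.
    return {name: w / total for name, w in kept.items()}
```

For example, disabling `c` in `{"a": 0.5, "b": 0.3, "c": 0.2}` renormalizes to `{"a": 0.625, "b": 0.375}`, preserving the 5:3 ratio.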
140-157: LGTM: Empty feedback rejection properly implemented. The
`_validate_has_signal` validator correctly ensures at least one rating or non-blank `free_text` is present, addressing the previous review feedback about rejecting feedback records with no usable signal.
160-220: LGTM: `ResilienceMetrics` has comprehensive cross-field validation. The validator correctly enforces all relational invariants:
- failed_tasks <= total_tasks
- recovered_tasks <= failed_tasks
- longest_success_streak >= current_success_streak
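These invariants can be sketched with an `after`-mode `model_validator`; the field defaults are assumptions, and only the relational checks mirror what the review describes:

```python
from pydantic import BaseModel, ConfigDict, model_validator


class ResilienceMetrics(BaseModel):
    """Sketch of the relational invariants between task counters."""

    model_config = ConfigDict(frozen=True, allow_inf_nan=False)

    total_tasks: int = 0
    failed_tasks: int = 0
    recovered_tasks: int = 0
    current_success_streak: int = 0
    longest_success_streak: int = 0

    @model_validator(mode="after")
    def _validate_relations(self) -> "ResilienceMetrics":
        # Failures are a subset of all tasks.
        if self.failed_tasks > self.total_tasks:
            raise ValueError("failed_tasks cannot exceed total_tasks")
        # Only failed tasks can be recovered.
        if self.recovered_tasks > self.failed_tasks:
            raise ValueError("recovered_tasks cannot exceed failed_tasks")
        # The running streak can never be longer than the best streak seen.
        if self.current_success_streak > self.longest_success_streak:
            raise ValueError(
                "current_success_streak cannot exceed longest_success_streak"
            )
        return self
```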
223-251: LGTM: `PillarScore` model is correctly constrained. The score (0.0-10.0) and confidence (0.0-1.0) bounds are properly enforced. The
`breakdown` field appropriately stores component scores without rigid bounds, since these are informational and may have varying scales depending on the strategy.
tests/unit/hr/evaluation/test_models.py (6)
1-20: LGTM: Test file setup is correct. The
`pytestmark = pytest.mark.unit` marker properly marks all tests, and imports are appropriate for testing Pydantic model validation behavior.
25-73: LGTM: Comprehensive tests for `redistribute_weights`. The test suite covers all important cases:
- Proportional preservation
- Redistribution when items are disabled
- Error cases (all disabled, empty)
- Zero-weight equal distribution fallback
- Single enabled item
- Sum-to-one invariant
Good use of epsilon comparisons for float assertions.
78-198: LGTM: Thorough `InteractionFeedback` test coverage. The tests comprehensively cover:
- Valid construction with all/partial ratings
- Frozen immutability
- Parametrized bounds checking for all rating fields
- `free_text` max length
- Auto-generated unique IDs
- Empty feedback rejection
- Free-text-only acceptance
Good use of `@pytest.mark.parametrize` to avoid test duplication.
203-271: LGTM: `ResilienceMetrics` tests cover all validation invariants. All cross-field validation rules are tested:
- failed_tasks > total_tasks rejection
- recovered_tasks > failed_tasks rejection
- current_success_streak > longest_success_streak rejection
- Frozen immutability
276-321: LGTM: `PillarScore` tests verify bounds and structure. Good coverage of score/confidence bounds at boundary values (0.0, 10.0/1.0) and beyond, plus breakdown tuple structure verification.
345-482: LGTM: `EvaluationReport` tests cover key validation paths. The tests verify:
- Valid construction
- Duplicate pillar score rejection
- Unique ID generation
- Score/confidence bounds
- Frozen immutability
- Agent ID consistency
- Weight/score name mismatch
Good coverage of the model's validators.
🤖 I have created a release *beep* *boop*

---

## [0.5.8](v0.5.7...v0.5.8) (2026-04-03)

### Features

* auto-select embedding model + fine-tuning pipeline wiring ([#999](#999)) ([a4cbc4e](a4cbc4e)), closes [#965](#965) [#966](#966)
* ceremony scheduling batch 3 -- milestone strategy, template defaults, department overrides ([#1019](#1019)) ([321d245](321d245))
* five-pillar evaluation framework for HR performance tracking ([#1017](#1017)) ([5e66cbd](5e66cbd)), closes [#699](#699)
* populate comparison page with 53 competitor entries ([#1000](#1000)) ([5cb232d](5cb232d)), closes [#993](#993)
* throughput-adaptive and external-trigger ceremony scheduling strategies ([#1003](#1003)) ([bb5c9a4](bb5c9a4)), closes [#973](#973) [#974](#974)

### Bug Fixes

* eliminate backup service I/O from API test lifecycle ([#1015](#1015)) ([08d9183](08d9183))
* update run_affected_tests.py to use -n 8 ([#1014](#1014)) ([3ee9fa7](3ee9fa7))

### Performance

* reduce pytest parallelism from -n auto to -n 8 ([#1013](#1013)) ([43e0707](43e0707))

### CI/CD

* bump docker/login-action from 4.0.0 to 4.1.0 in the all group ([#1027](#1027)) ([e7e28ec](e7e28ec))
* bump wrangler from 4.79.0 to 4.80.0 in /.github in the all group ([#1023](#1023)) ([1322a0d](1322a0d))

### Maintenance

* bump github.com/mattn/go-runewidth from 0.0.21 to 0.0.22 in /cli in the all group ([#1024](#1024)) ([b311694](b311694))
* bump https://github.com/astral-sh/ruff-pre-commit from v0.15.8 to 0.15.9 in the all group ([#1022](#1022)) ([1650087](1650087))
* bump node from `71be405` to `387eebd` in /docker/sandbox in the all group ([#1021](#1021)) ([40bd2f6](40bd2f6))
* bump node from `cf38e1f` to `ad82eca` in /docker/web in the all group ([#1020](#1020)) ([f05ab9f](f05ab9f))
* bump the all group in /web with 3 updates ([#1025](#1025)) ([21d40d3](21d40d3))
* bump the all group with 2 updates ([#1026](#1026)) ([36778de](36778de))
* enable additional eslint-react rules and fix violations ([#1028](#1028)) ([80423be](80423be))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Summary
Implements a structured five-pillar agent evaluation framework based on the InfoQ evaluation framework, with fully pluggable pillars and metrics that can be independently enabled/disabled via `EvaluationConfig`.

Pillars

- Intelligence/Accuracy: `QualityBlendIntelligenceStrategy` (70% CI / 30% LLM calibration), fed by `QualityScoreResult` and `LlmCalibrationRecord` data
- Performance/Efficiency: inline computation over `WindowMetrics` (40% cost, 30% time, 30% tokens), fed by `WindowMetrics` averages
- Reliability/Resilience: `TaskBasedResilienceStrategy` (success rate, recovery, consistency, streaks), fed by `TaskMetricRecord` sequences
- Responsibility/Governance: `AuditBasedGovernanceStrategy` (audit compliance, trust, autonomy)
- User Experience: `FeedbackBasedUxStrategy` (clarity, tone, helpfulness, trust, satisfaction), fed by `InteractionFeedback` records

Key design decisions (D24)

- Single `PillarScoringStrategy` protocol with an `EvaluationContext` bag
- Per-pillar and per-metric toggles with weight redistribution (`redistribute_weights` utility)
- `EvaluationService.evaluate()` called on demand
- `LlmCalibrationSampler` -- drift above threshold reduces intelligence pillar confidence

New files

- `src/synthorg/hr/evaluation/` (10 source files)
- `src/synthorg/observability/events/evaluation.py`
- `tests/unit/hr/evaluation/` (10 test files, 126 tests)

Also

- `docs/design/agents.md` + `docs/architecture/decisions.md`

Test plan

- All tests marked `@pytest.mark.unit`

Review coverage
12 agents: code-reviewer, python-reviewer, test-analyzer, silent-failure-hunter, type-design-analyzer, logging-audit, conventions-enforcer, async-reviewer, issue-verifier, docs-consistency, resilience-audit, comment-analyzer. All 25 valid findings implemented.
Closes #699