feat: five-pillar evaluation framework for HR performance tracking #1017
Conversation
Implement structured five-pillar agent evaluation based on the InfoQ evaluation framework. Each pillar and its individual metrics can be independently enabled/disabled via EvaluationConfig.

Pillars:
- Intelligence/Accuracy: blends CI quality score with LLM calibration
- Performance/Efficiency: normalized cost, time, token metrics
- Reliability/Resilience: success rate, recovery, consistency, streaks
- Responsibility/Governance: audit compliance, trust, autonomy
- User Experience: clarity, tone, helpfulness, trust, satisfaction

New hr/evaluation/ subpackage (10 files):
- EvaluationPillar enum, PillarScore/EvaluationReport/InteractionFeedback/ResilienceMetrics/EvaluationContext models
- EvaluationConfig with per-pillar sub-configs and metric toggles
- PillarScoringStrategy protocol (single protocol, single context bag)
- Four default strategies + inline efficiency computation
- EvaluationService orchestrator with concurrent pillar scoring
- redistribute_weights() utility for weight redistribution

Also:
- Observability events (eval.* namespace)
- Design spec D16 decision in docs/design/agents.md
- 118 unit tests, mypy clean, ruff clean

Closes #699
Pre-reviewed by 12 agents, 25 findings addressed.

Source fixes:
- Extract evaluate() into 4 helper methods (was 139 lines, now <50 each)
- Extract _score_efficiency into sub-score + builder helpers
- Extract _compute_resilience_metrics into module-level helpers
- Add EVAL_CALIBRATION_DRIFT_HIGH log on drift detection
- Add EVAL_PILLAR_INSUFFICIENT_DATA logs on efficiency early returns
- Add EVAL_WEIGHTS_REDISTRIBUTED log on weight redistribution
- Add confidence kwarg to efficiency pillar log (consistency)
- Change record_feedback to sync def (no await needed)
- Use setdefault pattern for feedback dict
- Fix data_points = len(...) or 1 -> len(...) in intelligence
- Add _FULL_CONFIDENCE_DATA_POINTS named constant (replaces magic 10.0)
- Remove unreachable max(1, total_audits) guards in governance
- Add warning log for unknown trust levels in governance
- Add at-least-one-metric-enabled validators to all 5 sub-configs
- Add agent_id consistency validator to EvaluationContext
- Fix docstrings: PillarScore 'mirrors' -> 'extends', CI spelled out, resilience 'inverse' -> 'linear penalty', config module docstring

Docs fixes:
- Renumber D16 -> D24 (collision with Docker sandbox decision)
- Add D24 row to docs/architecture/decisions.md
- Update CLAUDE.md Package Structure with evaluation/
- Add evaluation event example to CLAUDE.md logging section
- Add 'evaluation' to DESIGN_SPEC.md and design/index.md descriptions

Test fixes:
- Add pytestmark = pytest.mark.unit to all 8 test files
- Add tests: shuffled records, failure-ending pattern, explicit now, CI disabled + LLM only, unknown trust level, feedback-to-evaluation end-to-end pipeline
- Fix tests for sync record_feedback and new config validators
Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Snapshot warnings: Ensure that dependencies are being submitted on PR branches. Re-running this action after a short time may resolve the issue. See the documentation for more information and troubleshooting advice.

Scanned files: None
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

- Configuration used: Repository UI
- Review profile: ASSERTIVE
- Plan: Pro
- Files selected for processing: 1

⏰ Context from checks skipped due to timeout of 90000ms (6). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).

Walkthrough: Adds a five-pillar HR evaluation framework under `hr/evaluation/`.

🚥 Pre-merge checks: ✅ 5 passed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Code Review
This pull request introduces a comprehensive "Five-Pillar Evaluation Framework" for tracking agent performance within the HR module. The framework assesses agents across Intelligence, Efficiency, Resilience, Governance, and User Experience using pluggable scoring strategies and a centralized EvaluationService. Key features include configurable metric toggles with automatic weight redistribution, structured logging for evaluation events, and detailed documentation of the design and architectural decisions. Feedback was provided regarding the QualityBlendIntelligenceStrategy to ensure that weight redistribution logic remains consistent and robust when calibration data is unavailable.
```python
# Build enabled metrics list.
metrics: list[tuple[str, float, bool]] = []
if cfg.ci_quality_enabled:
    metrics.append(("ci_quality", cfg.ci_quality_weight, True))
if cfg.llm_calibration_enabled:
    metrics.append(("llm_calibration", cfg.llm_calibration_weight, True))

if not metrics:
    return PillarScore(
        pillar=self.pillar,
        score=_NEUTRAL_SCORE,
        confidence=0.0,
        strategy_name=NotBlankStr(self.name),
        data_point_count=0,
        evaluated_at=context.now,
    )

weights = redistribute_weights(metrics)

# Compute CI quality component.
breakdown: list[tuple[str, float]] = []
weighted_sum = 0.0
data_points = len(context.task_records)

if "ci_quality" in weights:
    breakdown.append(("ci_quality", round(ci_score, 4)))
    weighted_sum += ci_score * weights["ci_quality"]

# Compute LLM calibration component.
calibration_drift = 0.0
if "llm_calibration" in weights:
    records = context.calibration_records
    if records:
        avg_llm = sum(r.llm_score for r in records) / len(records)
        breakdown.append(("llm_calibration", round(avg_llm, 4)))
        weighted_sum += avg_llm * weights["llm_calibration"]
        calibration_drift = sum(r.drift for r in records) / len(records)
        data_points += len(records)
    else:
        logger.debug(
            EVAL_METRIC_SKIPPED,
            agent_id=context.agent_id,
            pillar=self.pillar.value,
            metric="llm_calibration",
            reason="no_calibration_records",
        )
        # Redistribute to CI quality only.
        weighted_sum = ci_score
        breakdown = [("ci_quality", round(ci_score, 4))]
```
The logic for handling missing calibration records appears to be incorrect. When llm_calibration is enabled but no data is available, the code at line 124 (weighted_sum = ci_score) overwrites the previously calculated weighted ci_score. This results in the final score being the raw ci_score, rather than a correctly weighted score where ci_quality receives 100% of the weight.
This can be fixed by refactoring to follow the pattern used in other strategies (e.g., FeedbackBasedUxStrategy): first, determine which metrics have available data, then redistribute weights among only those metrics, and finally compute the weighted sum. This makes the logic more robust and consistent across strategies.
```python
# Build a list of metrics that are enabled and have data.
available: list[tuple[str, float, float]] = []  # (name, weight, score)
data_points = 0
calibration_drift = 0.0
if cfg.ci_quality_enabled:
    available.append(("ci_quality", cfg.ci_quality_weight, ci_score))
    data_points += len(context.task_records)
if cfg.llm_calibration_enabled:
    records = context.calibration_records
    if records:
        avg_llm = sum(r.llm_score for r in records) / len(records)
        available.append(("llm_calibration", cfg.llm_calibration_weight, avg_llm))
        calibration_drift = sum(r.drift for r in records) / len(records)
        data_points += len(records)
    else:
        logger.debug(
            EVAL_METRIC_SKIPPED,
            agent_id=context.agent_id,
            pillar=self.pillar.value,
            metric="llm_calibration",
            reason="no_calibration_records",
        )
if not available:
    # This case is already handled by the initial ci_score check,
    # but it's a good safeguard.
    return PillarScore(
        pillar=self.pillar,
        score=_NEUTRAL_SCORE,
        confidence=0.0,
        strategy_name=NotBlankStr(self.name),
        data_point_count=0,
        evaluated_at=context.now,
    )
# Redistribute weights among metrics with data.
weights = redistribute_weights([(name, w, True) for name, w, _ in available])
scores = {name: s for name, _, s in available}
weighted_sum = sum(scores[k] * weights[k] for k in weights)
breakdown = sorted(scores.items())
```
Pull request overview
Adds a new five-pillar evaluation subsystem under hr/ to compute on-demand, configurable evaluation reports (intelligence, efficiency, resilience, governance, UX) and integrates it with structured observability events, tests, and design docs.
Changes:
- Introduces `EvaluationService` orchestrator plus pluggable pillar strategies and frozen Pydantic models/configs under `src/synthorg/hr/evaluation/`.
- Adds evaluation-specific structured logging event constants and updates event-module discovery tests.
- Documents the new D24 decision and five-pillar framework in the design/spec docs; adds comprehensive unit tests.
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/unit/observability/test_events.py | Adds evaluation to expected observability event domain modules. |
| tests/unit/hr/evaluation/test_resilience_strategy.py | Unit tests for resilience strategy behavior and toggles. |
| tests/unit/hr/evaluation/test_models.py | Unit tests for evaluation models + redistribute_weights. |
| tests/unit/hr/evaluation/test_intelligence_strategy.py | Unit tests for intelligence strategy blending + drift behavior. |
| tests/unit/hr/evaluation/test_governance_strategy.py | Unit tests for governance strategy scoring + toggles. |
| tests/unit/hr/evaluation/test_experience_strategy.py | Unit tests for UX feedback-based scoring + redistribution. |
| tests/unit/hr/evaluation/test_evaluator.py | Unit tests for EvaluationService orchestration and pipelines. |
| tests/unit/hr/evaluation/test_enums.py | Unit tests for EvaluationPillar enum. |
| tests/unit/hr/evaluation/test_config.py | Unit tests for per-pillar configs and validation rules. |
| tests/unit/hr/evaluation/conftest.py | Shared fixtures/builders for evaluation tests. |
| tests/unit/hr/evaluation/__init__.py | Test package marker for evaluation tests. |
| src/synthorg/observability/events/evaluation.py | New structured logging event constants for evaluation domain. |
| src/synthorg/hr/evaluation/resilience_strategy.py | TaskBasedResilienceStrategy implementation. |
| src/synthorg/hr/evaluation/pillar_protocol.py | PillarScoringStrategy protocol for pluggable pillars. |
| src/synthorg/hr/evaluation/models.py | Frozen Pydantic models for context, scores, reports, feedback, metrics. |
| src/synthorg/hr/evaluation/intelligence_strategy.py | QualityBlendIntelligenceStrategy implementation. |
| src/synthorg/hr/evaluation/governance_strategy.py | AuditBasedGovernanceStrategy implementation. |
| src/synthorg/hr/evaluation/experience_strategy.py | FeedbackBasedUxStrategy implementation. |
| src/synthorg/hr/evaluation/evaluator.py | EvaluationService orchestrator + inline efficiency scoring and resilience derivations. |
| src/synthorg/hr/evaluation/enums.py | EvaluationPillar enum (five pillars). |
| src/synthorg/hr/evaluation/config.py | EvaluationConfig and per-pillar sub-configs with toggles/weights. |
| src/synthorg/hr/evaluation/__init__.py | Package docstring for the evaluation framework. |
| docs/design/index.md | Updates design index summary to include evaluation under Agents & HR. |
| docs/design/agents.md | Adds the five-pillar evaluation framework section + D24 note. |
| docs/DESIGN_SPEC.md | Updates design spec index to include evaluation in Agents & HR description. |
| docs/architecture/decisions.md | Adds decision D24 entry describing evaluation framework design choices. |
| CLAUDE.md | Updates package structure and logging examples to include evaluation domain/events. |
| if "llm_calibration" in weights: | ||
| records = context.calibration_records | ||
| if records: | ||
| avg_llm = sum(r.llm_score for r in records) / len(records) | ||
| breakdown.append(("llm_calibration", round(avg_llm, 4))) | ||
| weighted_sum += avg_llm * weights["llm_calibration"] | ||
| calibration_drift = sum(r.drift for r in records) / len(records) | ||
| data_points += len(records) | ||
| else: | ||
| logger.debug( | ||
| EVAL_METRIC_SKIPPED, | ||
| agent_id=context.agent_id, | ||
| pillar=self.pillar.value, | ||
| metric="llm_calibration", | ||
| reason="no_calibration_records", | ||
| ) | ||
| # Redistribute to CI quality only. | ||
| weighted_sum = ci_score | ||
| breakdown = [("ci_quality", round(ci_score, 4))] |
When llm_calibration is enabled but there are no calibration records, the fallback unconditionally sets weighted_sum = ci_score and breakdown = [("ci_quality", ...)]. This breaks the metric toggles: if ci_quality_enabled is false (LLM-only mode), this code still uses CI quality and emits a ci_quality breakdown. Suggestion: only fall back to CI quality if ci_quality is actually enabled; otherwise treat this as insufficient data (neutral score + 0 confidence) or skip the LLM metric and return neutral/insufficient-data for the pillar.
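The guarded fallback this comment asks for can be sketched as a standalone function; this is illustrative only -- the equal 50/50 redistribution and the fixed `_NEUTRAL_SCORE` of 0.5 are assumptions standing in for the project's `redistribute_weights` and constants.

```python
# Illustrative sketch of the toggle-respecting fallback; not the project's API.
_NEUTRAL_SCORE = 0.5  # assumed neutral value, mirroring the snippet above


def blend_fallback(
    ci_enabled: bool, ci_score: float, llm_scores: list[float]
) -> tuple[float, float]:
    """Return (score, confidence) while honoring the metric toggles."""
    if llm_scores:
        avg_llm = sum(llm_scores) / len(llm_scores)
        if ci_enabled:
            # Both metrics have data: blend with equal redistributed weights.
            return (ci_score + avg_llm) / 2, 1.0
        return avg_llm, 1.0
    if ci_enabled:
        # CI quality receives 100% of the weight after redistribution.
        return ci_score, 1.0
    # LLM-only mode with no calibration records: insufficient data.
    return _NEUTRAL_SCORE, 0.0
```

The key difference from the reviewed code is the final branch: with `ci_quality` disabled and no calibration records, the pillar reports neutral score and zero confidence instead of silently using CI quality.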
> Blends existing CI (continuous integration) signal quality score with
> LLM calibration data. When LLM calibration is disabled or unavailable,
> falls back to CI quality alone with reduced confidence.
Module docstring says the CI-only fallback happens “with reduced confidence”, but the implementation computes confidence solely from data_points and does not reduce it when LLM calibration is disabled/unavailable. Either update the docstring to match the behavior, or explicitly down-weight confidence when the LLM component is disabled or skipped due to missing calibration records.
```diff
-falls back to CI quality alone with reduced confidence.
+falls back to CI quality alone.
```
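The alternative resolution -- making the code match the docstring by explicitly down-weighting confidence -- could look like the following sketch. The 10.0 full-data threshold mirrors the `_FULL_CONFIDENCE_DATA_POINTS` constant mentioned in the PR description; the 0.5 down-weighting factor is an arbitrary illustration, not a project constant.

```python
# Illustrative only: one reading of "reduced confidence" for the CI-only path.
_FULL_CONFIDENCE_DATA_POINTS = 10.0  # from the PR description
_CI_ONLY_CONFIDENCE_FACTOR = 0.5  # assumed factor for illustration


def ci_only_confidence(data_points: int) -> float:
    """Down-weight confidence when only the CI component contributed."""
    base = min(1.0, data_points / _FULL_CONFIDENCE_DATA_POINTS)
    return base * _CI_ONLY_CONFIDENCE_FACTOR
```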
```python
base_trust = _TRUST_LEVEL_SCORES.get(trust_key, _NEUTRAL_SCORE)
if trust_key not in _TRUST_LEVEL_SCORES:
    logger.warning(
        EVAL_PILLAR_SCORED,
        agent_id=context.agent_id,
        pillar=self.pillar.value,
        warning="unknown_trust_level",
        trust_level=trust_key,
        fallback_score=_NEUTRAL_SCORE,
    )
```
The warning for an unknown trust_level logs with EVAL_PILLAR_SCORED ("eval.pillar.scored"), which makes it hard to distinguish normal scoring events from exceptional/diagnostic conditions in log queries and metrics. Suggest introducing a dedicated event constant for this condition (e.g., eval.governance.unknown_trust_level) or reusing an existing “skipped/insufficient” event if appropriate, while keeping eval.pillar.scored for the successful final score debug log.
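A dedicated event constant along the lines proposed could look like this sketch; the constant name and the stub logger are illustrative assumptions, not the project's real observability API.

```python
# Hypothetical dedicated event constant for the unknown-trust-level condition.
EVAL_GOVERNANCE_UNKNOWN_TRUST_LEVEL = "eval.governance.unknown_trust_level"


class StubLogger:
    """Minimal stand-in for the structured logger used in the snippet above."""

    def __init__(self) -> None:
        self.events: list[tuple[str, dict]] = []

    def warning(self, event: str, **kwargs: object) -> None:
        self.events.append((event, kwargs))


def warn_unknown_trust(logger: StubLogger, agent_id: str, trust_key: str) -> None:
    """Log the diagnostic condition under its own event name."""
    logger.warning(
        EVAL_GOVERNANCE_UNKNOWN_TRUST_LEVEL,
        agent_id=agent_id,
        trust_level=trust_key,
    )
```

With a distinct event name, log queries can filter on `eval.pillar.scored` for normal scoring and on the dedicated event for the exceptional case.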
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1017      +/-   ##
==========================================
+ Coverage   91.69%   91.77%   +0.08%
==========================================
  Files         658      669      +11
  Lines       36108    36739     +631
  Branches     3568     3625      +57
==========================================
+ Hits        33109    33719     +610
- Misses       2374     2389      +15
- Partials      625      631       +6
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/synthorg/hr/evaluation/evaluator.py`:
- Around line 188-252: The _resolve_enabled_pillars method is long due to the
large inline pillar_map; extract the pillar configuration to a separate helper
or constant to reduce method length. Create a new function or module-level
constant (e.g., _pillar_config or _build_pillar_map) that returns the list of
tuples currently assigned to pillar_map (using EvaluationPillar entries and
wiring in self._intelligence, self._resilience, self._governance, self._ux where
needed), then update _resolve_enabled_pillars to call that helper, keep the same
logic around enabled collection, redistribute_weights, and returns, and ensure
references to pillar_map, redistribute_weights, EvaluationPillar, and the
strategy attributes (_intelligence, _resilience, _governance, _ux) match the
existing names so behavior is unchanged.
In `@src/synthorg/hr/evaluation/experience_strategy.py`:
- Line 145: The confidence formula in evaluate_experience (or the surrounding
function in src/synthorg/hr/evaluation/experience_strategy.py) uses a magic
multiplier `3`; extract this into a module-level constant named
_FULL_CONFIDENCE_FEEDBACK_MULTIPLIER and replace the literal with that constant
in the line computing confidence (confidence = min(1.0, len(feedback) /
(cfg.min_feedback_count * _FULL_CONFIDENCE_FEEDBACK_MULTIPLIER))). Add the
constant near other strategy constants (e.g., alongside
_FULL_CONFIDENCE_DATA_POINTS) and update any imports or references accordingly.
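The constant extraction in this prompt amounts to the following sketch; the surrounding helper function is hypothetical, while the constant name and formula follow the prompt's own wording.

```python
# Replaces the magic multiplier `3` in the confidence formula, per the prompt.
_FULL_CONFIDENCE_FEEDBACK_MULTIPLIER = 3


def feedback_confidence(feedback_count: int, min_feedback_count: int) -> float:
    """Confidence ramps to 1.0 at min_feedback_count times the multiplier."""
    return min(
        1.0,
        feedback_count
        / (min_feedback_count * _FULL_CONFIDENCE_FEEDBACK_MULTIPLIER),
    )
```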
In `@src/synthorg/hr/evaluation/governance_strategy.py`:
- Around line 110-118: Replace the misleading EVAL_PILLAR_SCORED event used when
logging an unknown trust level in governance_strategy.py: add a new event
constant (e.g., EVAL_METRIC_FALLBACK or EVAL_UNKNOWN_TRUST_LEVEL) to
synthorg/observability/events/evaluation.py and update the warning call in the
method that contains the trust_key check (the block using logger.warning with
agent_id=context.agent_id and pillar=self.pillar.value) to use that new
constant; alternatively, if you prefer not to add a constant, change the
logger.warning call to a generic structured warning event name (e.g.,
"eval_metric_fallback") so the log semantically matches the fallback case.
In `@src/synthorg/hr/evaluation/intelligence_strategy.py`:
- Around line 115-125: The fallback branch for when llm_calibration is enabled
but has no records currently overwrites the previously computed weighted_sum
(and discards the redistributed weight logic); instead, update the fallback to
build the final score from the already-determined components: keep the
redistributed weight applied to the CI component, adjust the breakdown to
reflect only ("ci_quality", round(ci_score,4)) and then compute weighted_sum
once from those components (or recompute weighted_sum from the redistribution
logic) rather than assigning weighted_sum = ci_score; reference llm_calibration,
EVAL_METRIC_SKIPPED, weighted_sum, breakdown, and ci_score to locate and change
the assignment so the final score computation happens after all components are
finalized.
In `@src/synthorg/hr/evaluation/resilience_strategy.py`:
- Around line 48-156: The score method in resilience_strategy.py is too large;
split it into small helper functions to meet the <50-line rule by extracting the
logical blocks: (1) input/early-return checks into a helper
validate_and_handle_insufficient_data(context) that returns an optional
PillarScore, (2) metric derivation into
build_enabled_metrics_and_scores(context, rm, cfg) which returns enabled_metrics
and scores, (3) weighting and aggregation into compute_final_score(scores,
enabled_metrics) that calls redistribute_weights, and (4) result assembly into
assemble_pillar_score(context, final_score, scores, rm) which builds the
PillarScore and logs; keep the public async score(...) as a thin orchestrator
that calls these helpers (preserve names used: score, redistribute_weights,
PillarScore, EvaluationContext, EVAL_PILLAR_INSUFFICIENT_DATA,
EVAL_PILLAR_SCORED) so callers and tests remain valid.
- Around line 120-128: Before returning the neutral PillarScore when
enabled_metrics is empty, emit an INFO-level observability event describing the
state transition; add a call to the available logger (preferably
context.logger.info(...), falling back to module logger.info if no context
logger) immediately before the existing return in the branch that checks
enabled_metrics and include key fields (self.pillar, NotBlankStr(self.name) or
self.name, rm.total_tasks, and context.now) so the neutral outcome is traceable;
leave the returned PillarScore construction (PillarScore(...)) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c7beee32-35a7-46a6-9757-79d9a3709504
📒 Files selected for processing (27)
- CLAUDE.md
- docs/DESIGN_SPEC.md
- docs/architecture/decisions.md
- docs/design/agents.md
- docs/design/index.md
- src/synthorg/hr/evaluation/__init__.py
- src/synthorg/hr/evaluation/config.py
- src/synthorg/hr/evaluation/enums.py
- src/synthorg/hr/evaluation/evaluator.py
- src/synthorg/hr/evaluation/experience_strategy.py
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- src/synthorg/hr/evaluation/models.py
- src/synthorg/hr/evaluation/pillar_protocol.py
- src/synthorg/hr/evaluation/resilience_strategy.py
- src/synthorg/observability/events/evaluation.py
- tests/unit/hr/evaluation/__init__.py
- tests/unit/hr/evaluation/conftest.py
- tests/unit/hr/evaluation/test_config.py
- tests/unit/hr/evaluation/test_enums.py
- tests/unit/hr/evaluation/test_evaluator.py
- tests/unit/hr/evaluation/test_experience_strategy.py
- tests/unit/hr/evaluation/test_governance_strategy.py
- tests/unit/hr/evaluation/test_intelligence_strategy.py
- tests/unit/hr/evaluation/test_models.py
- tests/unit/hr/evaluation/test_resilience_strategy.py
- tests/unit/observability/test_events.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build Backend
- GitHub Check: Test (Python 3.14)
🧰 Additional context used
📓 Path-based instructions (4)
docs/**/*.md
📄 CodeRabbit inference engine (CLAUDE.md)
Documentation files in docs/ are Markdown, built with Zensical, configured in mkdocs.yml; design spec in docs/design/ (12 pages), Architecture in docs/architecture/, Roadmap in docs/roadmap/
Files:
- docs/design/index.md
- docs/DESIGN_SPEC.md
- docs/design/agents.md
- docs/architecture/decisions.md
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: No `from __future__ import annotations` in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use `except A, B:` (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use `@computed_field` for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)
Files:
- tests/unit/observability/test_events.py
- src/synthorg/hr/evaluation/__init__.py
- tests/unit/hr/evaluation/test_enums.py
- src/synthorg/hr/evaluation/enums.py
- src/synthorg/hr/evaluation/pillar_protocol.py
- tests/unit/hr/evaluation/test_config.py
- tests/unit/hr/evaluation/test_intelligence_strategy.py
- src/synthorg/observability/events/evaluation.py
- tests/unit/hr/evaluation/test_governance_strategy.py
- tests/unit/hr/evaluation/test_evaluator.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- tests/unit/hr/evaluation/test_experience_strategy.py
- tests/unit/hr/evaluation/test_resilience_strategy.py
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/experience_strategy.py
- src/synthorg/hr/evaluation/models.py
- src/synthorg/hr/evaluation/resilience_strategy.py
- tests/unit/hr/evaluation/conftest.py
- src/synthorg/hr/evaluation/evaluator.py
- tests/unit/hr/evaluation/test_models.py
- src/synthorg/hr/evaluation/config.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: All Python test files must use `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow` markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer@pytest.mark.parametrizefor testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (`@given` + `@settings`); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions
Files:
- tests/unit/observability/test_events.py
- tests/unit/hr/evaluation/test_enums.py
- tests/unit/hr/evaluation/test_config.py
- tests/unit/hr/evaluation/test_intelligence_strategy.py
- tests/unit/hr/evaluation/test_governance_strategy.py
- tests/unit/hr/evaluation/test_evaluator.py
- tests/unit/hr/evaluation/test_experience_strategy.py
- tests/unit/hr/evaluation/test_resilience_strategy.py
- tests/unit/hr/evaluation/conftest.py
- tests/unit/hr/evaluation/test_models.py
src/synthorg/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/synthorg/**/*.py: Every Python module with business logic must have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
Never use `import logging` / `logging.getLogger()` / `print()` in Python application code; exceptions are observability/setup.py, observability/sinks.py, observability/syslog_handler.py, and observability/http_handler.py
Python logger variable name must always be `logger` (not `_logger`, not `log`)
Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Use structured logging with kwargs in Python: always `logger.info(EVENT, key=value)` -- never `logger.info('msg %s', val)`
All error paths in Python must log at WARNING or ERROR with context before raising
All state transitions in Python must log at INFO level
Use DEBUG logging level in Python for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports in Python do NOT need logging
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned Python code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases
Files:
- src/synthorg/hr/evaluation/__init__.py
- src/synthorg/hr/evaluation/enums.py
- src/synthorg/hr/evaluation/pillar_protocol.py
- src/synthorg/observability/events/evaluation.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/experience_strategy.py
- src/synthorg/hr/evaluation/models.py
- src/synthorg/hr/evaluation/resilience_strategy.py
- src/synthorg/hr/evaluation/evaluator.py
- src/synthorg/hr/evaluation/config.py
🧠 Learnings (50)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to docs/design/*.md : Design spec pages: 7 pages in `docs/design/` — index, agents, organization, communication, engine, memory, operations
Applied to files:
- docs/design/index.md
- docs/DESIGN_SPEC.md
- docs/design/agents.md
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to docs/design/**/*.md : Design specification pages in `docs/design/` must be consulted before implementing features (7 pages: index, agents, organization, communication, engine, memory, operations)
Applied to files:
docs/design/index.mddocs/DESIGN_SPEC.mddocs/design/agents.mddocs/architecture/decisions.md
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to docs/design/*.md : Update the relevant `docs/design/` page when approved deviations occur to reflect the new reality
Applied to files:
- docs/design/index.md
- docs/DESIGN_SPEC.md
📚 Learning: 2026-03-14T15:43:05.601Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T15:43:05.601Z
Learning: Applies to docs/** : Docs source in docs/ (Markdown, built with Zensical); design spec in docs/design/ (7 pages: index, agents, organization, communication, engine, memory, operations)
Applied to files:
- docs/design/index.md
- docs/DESIGN_SPEC.md
- CLAUDE.md
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue — DESIGN_SPEC.md is a pointer file linking to 7 design pages (Agents, Organization, Communication, Engine, Memory, Operations)
Applied to files:
- docs/design/index.md
- docs/DESIGN_SPEC.md
- docs/design/agents.md
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue. DESIGN_SPEC.md is a pointer file linking to the 7 design pages (index, agents, organization, communication, engine, memory, operations).
Applied to files:
docs/design/index.md, docs/DESIGN_SPEC.md, docs/design/agents.md
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Applied to files:
docs/design/index.md, docs/DESIGN_SPEC.md, tests/unit/observability/test_events.py, src/synthorg/hr/evaluation/__init__.py, CLAUDE.md, src/synthorg/hr/evaluation/enums.py, src/synthorg/hr/evaluation/pillar_protocol.py, tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/intelligence_strategy.py, src/synthorg/hr/evaluation/governance_strategy.py, src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/evaluator.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
Applied to files:
docs/design/index.md, docs/DESIGN_SPEC.md, src/synthorg/hr/evaluation/__init__.py, CLAUDE.md, src/synthorg/hr/evaluation/enums.py, src/synthorg/hr/evaluation/pillar_protocol.py, src/synthorg/hr/evaluation/governance_strategy.py, src/synthorg/hr/evaluation/evaluator.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Documentation source in `docs/` (Markdown, built with Zensical). Design spec in `docs/design/` (7 pages: index, agents, organization, communication, engine, memory, operations). Architecture in `docs/architecture/` (overview, tech-stack, decision log). Roadmap in `docs/roadmap/`. Security in `docs/security.md`. Licensing in `docs/licensing.md`. Reference in `docs/reference/`. REST API reference in `docs/rest-api.md`. Library reference in `docs/api/` (auto-generated from docstrings). Custom templates in `docs/overrides/`. Config in `mkdocs.yml`.
Applied to files:
docs/design/index.md, docs/DESIGN_SPEC.md, CLAUDE.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...
Applied to files:
docs/DESIGN_SPEC.md, CLAUDE.md, src/synthorg/hr/evaluation/pillar_protocol.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/engine/**/*.py : Engine package (engine/): agent orchestration, parallel execution, task decomposition, routing, TaskEngine (centralized single-writer), task lifecycle/recovery/shutdown, workspace isolation, coordination (4 dispatchers: SAS/centralized/decentralized/context-dependent, wave execution), approval gates (escalation detection, context parking/resume), stagnation detection (ToolRepetitionDetector, corrective prompt injection), AgentRuntimeState (execution status), context budget management, conversation compaction (oldest-turns summarizer)
Applied to files:
docs/DESIGN_SPEC.md, CLAUDE.md, src/synthorg/hr/evaluation/evaluator.py
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-20T11:18:48.128Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T11:18:48.128Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : Use event name constants from domain-specific modules under `ai_company.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`.
Applied to files:
tests/unit/observability/test_events.py
📚 Learning: 2026-03-18T21:23:23.586Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-18T21:23:23.586Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from the domain-specific module under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool). Import directly from synthorg.observability.events.<domain>.
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from synthorg.observability.events domain-specific modules (e.g., PROVIDER_CALL_START from events.provider). Import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT.
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
tests/unit/observability/test_events.py
📚 Learning: 2026-03-14T15:43:05.601Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T15:43:05.601Z
Learning: Applies to src/**/*.py : Use event name constants from domain-specific modules under ai_company.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.) — import directly
Applied to files:
tests/unit/observability/test_events.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly rather than using string literals
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from the domain-specific module under `synthorg.observability.events` in logging calls
Applied to files:
tests/unit/observability/test_events.py, src/synthorg/observability/events/evaluation.py
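The event-constant convention these learnings repeat can be sketched as follows. Only the module path `synthorg.observability.events.evaluation` and the constant name `EVAL_REPORT_COMPUTED` come from this PR; the constant's string value and the call-site wiring shown here are illustrative assumptions.

```python
# Illustrative sketch of the event-constant pattern; the "eval.report.computed"
# value is an assumption based on the PR's "eval.* namespace" description.

# --- hypothetical contents of synthorg/observability/events/evaluation.py ---
EVAL_REPORT_COMPUTED = "eval.report.computed"
EVAL_PILLAR_SCORED = "eval.pillar.scored"

# --- a call site imports the constant directly, never a string literal ---
# from synthorg.observability.events.evaluation import EVAL_REPORT_COMPUTED
# logger.info(EVAL_REPORT_COMPUTED, agent_id=agent_id, overall=overall_score)
```

The point of the rule is greppability: every emission site of an event shares one constant, so renaming an event is a one-line change.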
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Engine: Agent orchestration, execution loops, parallel execution, task decomposition, routing, task assignment, centralized single-writer task state engine (TaskEngine), task lifecycle, recovery, shutdown, workspace isolation, coordination (multi-agent pipeline: TopologyDispatcher protocol, 4 dispatchers — SAS/centralized/decentralized/context-dependent, wave execution, workspace lifecycle integration, CoordinationSectionConfig company config bridge, build_coordinator factory), coordination error classification, prompt policy validation, checkpoint recovery (checkpoint/, per-turn persistence, heartbeat detection, CheckpointRecoveryStrategy), approval gate (escalation detection, context parking/resume, EscalationInfo/ResumePayload models), stagnation detection (stagnation/, StagnationDetector protocol, ToolRepetitionDetector, dual-signal analysis, corrective prompt injection), agent runtime state (AgentRuntimeState, lightweight per-agent execution status for dashboard queries and recove...
Applied to files:
CLAUDE.md, docs/design/agents.md
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/core/**/*.py : Core module must contain shared domain models, base classes, resilience config (RetryConfig, RateLimiterConfig)
Applied to files:
CLAUDE.md, src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Settings: Runtime-editable settings persistence (DB > env > YAML > code defaults), typed definitions (9 namespaces), Fernet encryption for sensitive values, config bridge, ConfigResolver (typed composed reads for controllers), validation, registry, change notifications via message bus. Per-namespace setting definitions in definitions/ submodule (api, company, providers, memory, budget, security, coordination, observability, backup).
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Security: SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies: disabled/weighted/per-category/milestone), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume).
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`. Never use `import logging` / `logging.getLogger()` / `print()` in application code. Variable name: always `logger` (not `_logger`, not `log`).
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-17T06:43:14.114Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:43:14.114Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`. Never use `import logging` / `logging.getLogger()` / `print()` in application code. Variable name: always `logger`.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`. Never use import logging / logging.getLogger() / print() in application code.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-20T11:18:48.128Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T11:18:48.128Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have `from synthorg.observability import get_logger` followed by `logger = get_logger(__name__)`.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic must import logger via `from synthorg.observability import get_logger` and initialize with `logger = get_logger(__name__)`
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic must import `from synthorg.observability import get_logger` and define `logger = get_logger(__name__)`
Applied to files:
CLAUDE.md
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to src/synthorg/**/*.py : Every Python module with business logic must have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
Applied to files:
CLAUDE.md
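The mandatory logger preamble these learnings describe looks like the sketch below. The real `get_logger` lives in `synthorg.observability`; the shim here is a hypothetical stand-in so the two required lines can be shown in isolation.

```python
# Stand-in for synthorg.observability.get_logger (assumed to return a
# structured logger); only the two-line preamble at the bottom is the rule.
import logging


def get_logger(name: str) -> logging.LoggerAdapter:
    # Hypothetical shim: the repo's helper returns its own structured logger.
    return logging.LoggerAdapter(logging.getLogger(name), {})


# The mandated preamble at the top of every business-logic module:
logger = get_logger(__name__)
```

The variable must be named `logger` exactly, so reviewers and lint rules can rely on one spelling across the codebase.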
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (via `model_copy(update=...)`) for runtime state that evolves
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 BaseModel, model_validator, computed_field, ConfigDict.
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models with `model_copy(update=...)` for runtime state that evolves
Applied to files:
tests/unit/hr/evaluation/test_config.py
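The frozen-config / mutable-via-copy split these learnings mandate can be sketched like this; the model names are illustrative, not the repo's actual classes.

```python
# Sketch, assuming Pydantic v2: frozen models for config/identity, and a
# separate runtime-state model that evolves only via model_copy(update=...).
from pydantic import BaseModel, ConfigDict


class PillarConfig(BaseModel):
    """Immutable configuration (hypothetical example)."""
    model_config = ConfigDict(frozen=True)
    enabled: bool = True
    weight: float = 0.2


class AgentRuntimeState(BaseModel):
    """Runtime state that evolves by copying, never in-place mutation."""
    tasks_completed: int = 0


cfg = PillarConfig()
state = AgentRuntimeState()
# Advance runtime state by producing a new instance:
state = state.model_copy(update={"tasks_completed": state.tasks_completed + 1})
```

Keeping the two concerns in separate models avoids accidentally mutating configuration while updating runtime counters.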
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability package (observability/): structured logging, correlation tracking, log sinks; event constants organized by domain under observability/events/ (e.g., events.api, events.tool, events.git, events.context_budget, events.backup)
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from `synthorg.observability.events.<domain>` modules (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly and use in structured logging
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Use event constants from `synthorg.observability.events.<domain>` (e.g., `API_REQUEST_STARTED` from `events.api`); import directly and log with structured kwargs: `logger.info(EVENT, key=value)`, never interpolated strings
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to tests/**/*.py : Fix flaky tests completely and fundamentally; for timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins
Applied to files:
tests/unit/hr/evaluation/test_resilience_strategy.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to tests/**/*.py : Test markers: pytest.mark.unit, pytest.mark.integration, pytest.mark.e2e, pytest.mark.slow. Coverage: 80% minimum (enforced in CI).
Applied to files:
tests/unit/hr/evaluation/test_resilience_strategy.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to tests/**/*.py : Test markers: `pytest.mark.unit`, `pytest.mark.integration`, `pytest.mark.e2e`, `pytest.mark.slow`. Coverage: 80% minimum. Async: `asyncio_mode = 'auto'` — no manual `pytest.mark.asyncio` needed. Timeout: 30 seconds per test. Parallelism: `pytest-xdist` via `-n auto` — ALWAYS include `-n auto` when running pytest, never run tests sequentially.
Applied to files:
tests/unit/hr/evaluation/test_resilience_strategy.py
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/security/**/*.py : Security module includes SecOps agent, rule engine (soft-allow/hard-deny), audit log, output scanner, risk classifier, autonomy levels (4 strategies), timeout policies.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/security/**/*.py : Security package (security/): SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume)
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 conventions: `BaseModel`, `model_validator`, `computed_field`, `ConfigDict`. For derived values use `computed_field` instead of storing + validating redundant fields. Use `NotBlankStr` (from `core.types`) for all identifier/name fields — including optional (`NotBlankStr | None`) and tuple (`tuple[NotBlankStr, ...]`) variants — instead of manual whitespace validators.
Applied to files:
src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
📚 Learning: 2026-03-15T18:42:17.990Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:42:17.990Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 conventions: `BaseModel`, `model_validator`, `computed_field`, `ConfigDict`
Applied to files:
src/synthorg/hr/evaluation/models.py, src/synthorg/hr/evaluation/config.py
🔇 Additional comments (38)
tests/unit/observability/test_events.py (1)
200-200: LGTM! The addition of "evaluation" to the expected domain modules correctly reflects the new evaluation events domain introduced in this PR.
docs/design/index.md (1)
161-161: LGTM! The documentation update accurately reflects the new evaluation framework scope in the HR package.
docs/DESIGN_SPEC.md (1)
14-14: LGTM! The documentation update is consistent with the corresponding change in docs/design/index.md and accurately reflects the new evaluation framework.
CLAUDE.md (1)
147-147: LGTM! The addition of EVAL_REPORT_COMPUTED from events.evaluation correctly documents the new evaluation observability domain and follows the established event constant pattern.
src/synthorg/hr/evaluation/__init__.py (1)
1-8: LGTM! The module docstring clearly describes the five-pillar evaluation framework and its configuration-driven nature. As a pure docstring module, no logging is needed per coding guidelines.
docs/architecture/decisions.md (1)
82-82: LGTM! The D24 decision entry thoroughly documents the five-pillar evaluation design, including the pluggable protocol pattern, context bag approach, and configuration-driven enablement. The decision aligns with the framework's protocol-driven architecture philosophy.
tests/unit/hr/evaluation/test_enums.py (1)
1-41: LGTM! Comprehensive test coverage for the EvaluationPillar enum. The tests verify member count, values, StrEnum behavior, value-based lookup, and invalid value handling. Good use of @pytest.mark.parametrize for testing all members.
src/synthorg/hr/evaluation/enums.py (1)
1-17: LGTM! Clean and well-documented enum definition for the five evaluation pillars. As a pure data model, no logging is needed per coding guidelines. The string values follow a clear convention and align with the InfoQ five-pillar framework.
src/synthorg/hr/evaluation/pillar_protocol.py (1)
16-43: Protocol contract is clean and implementation-ready. Typed async interface and explicit pillar/name properties are clear and consistent for strategy injection.
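A strategy protocol of the kind this comment describes can be approximated as below. The attribute and method names are assumptions modeled on the review's wording (pillar/name properties, an async scoring method taking a single context bag), not the repo's exact signatures.

```python
# Hedged sketch of a PillarScoringStrategy-style protocol with one concrete
# strategy; names and the float return type are illustrative assumptions.
import asyncio
from typing import Any, Protocol


class PillarScoringStrategy(Protocol):
    pillar: str  # which pillar this strategy scores
    name: str    # human-readable strategy name

    async def score(self, context: dict[str, Any]) -> float: ...


class NeutralStrategy:
    """Trivial strategy returning a neutral score regardless of context."""
    pillar = "user_experience"
    name = "neutral"

    async def score(self, context: dict[str, Any]) -> float:
        return 0.5  # neutral fallback when no data is available


result = asyncio.run(NeutralStrategy().score({}))
```

Because `Protocol` uses structural typing, any class with matching attributes satisfies the contract without inheriting from it, which is what makes strategies pluggable via dependency injection.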
src/synthorg/observability/events/evaluation.py (1)
9-16: Event constant set looks consistent and complete for the evaluation domain.
tests/unit/hr/evaluation/test_experience_strategy.py (1)
28-161: Coverage is strong for UX scoring behavior and neutral-path handling.
docs/design/agents.md (1)
411-455: The new five-pillar design section is clear and well-aligned with the implemented architecture.
tests/unit/hr/evaluation/test_intelligence_strategy.py (1)
31-161: Intelligence strategy tests exercise the critical scoring branches and drift-confidence behavior well.
tests/unit/hr/evaluation/test_config.py (1)
18-236: Config model test coverage is comprehensive and validates key invariants effectively.
tests/unit/hr/evaluation/test_models.py (1)
24-409: Model and utility tests are thorough, especially around validation boundaries and frozen behavior.
src/synthorg/hr/evaluation/intelligence_strategy.py (1)
1-163: Well-structured strategy implementation. The QualityBlendIntelligenceStrategy correctly implements the PillarScoringStrategy protocol with proper logging, event emission, and configuration-driven behavior. The neutral score fallback for missing data and confidence reduction for calibration drift are well-considered design choices.
src/synthorg/hr/evaluation/experience_strategy.py (1)
41-164: Clean UX scoring implementation. The strategy correctly handles partial feedback (None ratings), metric toggles, and weight redistribution. The early return for insufficient feedback with appropriate logging is good defensive design.
tests/unit/hr/evaluation/test_resilience_strategy.py (1)
1-140: Comprehensive resilience strategy test coverage. Tests cover protocol properties, neutral scoring fallbacks, metric enable/disable behavior, edge cases (zero tasks, all failures), and score range expectations. The use of factory functions from conftest promotes maintainability.
tests/unit/hr/evaluation/test_governance_strategy.py (1)
1-188: Thorough governance strategy test suite. Tests cover all key scenarios: neutral fallback, score ranges for different verdict distributions, metric toggles, penalty behaviors, and the unknown trust level fallback. The comparative assertions (lines 144-146, 168-170) effectively validate penalty mechanics.
src/synthorg/hr/evaluation/governance_strategy.py (1)
39-176: Solid governance strategy implementation. The strategy correctly handles the three governance metrics (audit compliance, trust level, autonomy compliance) with proper fallbacks for missing data and configuration-driven behavior. The trust level mapping with unknown-level fallback is well-designed.
tests/unit/hr/evaluation/test_evaluator.py (1)
1-370: Comprehensive evaluator test coverage. The test suite covers orchestration (pillar enablement, weight redistribution), individual metric computation (efficiency, resilience), feedback lifecycle, and end-to-end evaluation flow. The TestComputeResilienceMetrics class thoroughly validates streak tracking, recovery detection, and quality stddev computation.
src/synthorg/hr/evaluation/config.py (1)
1-280: Well-designed evaluation configuration schema. All pillar configs consistently enforce at least one metric enabled when the pillar is active. The use of frozen=True and allow_inf_nan=False aligns with coding guidelines. Default weights within each pillar sum to 1.0, ensuring proper normalization before redistribution.
src/synthorg/hr/evaluation/models.py (4)
33-62: Clean weight redistribution utility. The redistribute_weights function correctly handles the edge cases: raises when all items are disabled, and uses equal distribution when all enabled items have zero weight. The implementation is concise and well-documented.
141-201: Robust resilience metrics validation. The ResilienceMetrics model includes comprehensive cross-field validation ensuring logical consistency: failed_tasks <= total_tasks, recovered_tasks <= failed_tasks, and longest_success_streak >= current_success_streak. This prevents invalid states from propagating through the evaluation pipeline.
331-389: Well-structured evaluation report model. The EvaluationReport correctly enforces unique pillar scores via the validator and provides a complete structure for reporting evaluation results. The use of uuid4 for default IDs ensures uniqueness across evaluations.
12-14: No action required: the TYPE_CHECKING import pattern is correct for Python 3.14. Sequence belongs in the TYPE_CHECKING block because Python 3.14's PEP 649 provides native lazy annotations. Annotations are not evaluated at runtime, so conditional imports are safe and recommended. The current code is correct and requires no changes.
> Likely an incorrect or invalid review comment.
tests/unit/hr/evaluation/conftest.py (1)
1-124: Well-structured test fixture module with consistent defaults. The factory helpers correctly satisfy the model validators (e.g., ResilienceMetrics constraints in snippet 2, EvaluationContext agent_id consistency in snippet 1). Defaults are sensible and the NotBlankStr wrapping is applied appropriately for identifier fields.
src/synthorg/hr/evaluation/evaluator.py (11)
1-52: Clean module setup with well-named constants. Good use of TYPE_CHECKING for forward references and constants for magic numbers.
55-121: Good dependency injection pattern with lazy default strategy loading. The lazy imports in _default_* methods avoid circular import issues while still providing sensible defaults.
123-155: Clean orchestration with clear step decomposition. The evaluate method follows a clear flow: build context → resolve pillars → score → assemble report.
157-186: Efficient context building with centralized data gathering.
254-272: Proper use of asyncio.TaskGroup for concurrent pillar scoring. Per coding guidelines, TaskGroup is the preferred pattern for fan-out/fan-in parallel operations.
274-314: Correct weighted aggregation with proper clamping and logging. The INFO-level log for report computation follows the guideline for state transitions.
316-360: Clean feedback storage with immutable query results. The get_feedback method returns a tuple to ensure immutability of query results.
489-522: Defensive constraint enforcement with min(recovered, failed). Line 518's min(recovered, failed) ensures the recovered_tasks <= failed_tasks constraint is always satisfied, matching the validator in ResilienceMetrics.
525-568: Correct streak and standard deviation computations. The streak logic properly tracks recoveries (success following failure), and the standard deviation uses population variance (dividing by n), which is appropriate for evaluating all observed data points rather than estimating from a sample.
406-431: No division-by-zero risk; this concern is unfounded. `EfficiencyConfig` validates all reference fields (`reference_cost_usd`, `reference_time_seconds`, `reference_tokens`) with Pydantic's `gt=0` constraint, which rejects zero and negative values at validation time. The unit tests confirm this validation is enforced. Division operations at lines 410, 420, and 428 are safe.
> Likely an incorrect or invalid review comment.
373-374: No issue here. `NotBlankStr` is `Annotated[str, ...]`, so plain string literals will correctly match `NotBlankStr` keys in dict lookups.
…nd CodeRabbit
- Fix intelligence strategy fallback bug: CI-disabled + no calibration records now returns neutral instead of using disabled CI quality score
- Fix wrong event constant EVAL_PILLAR_SCORED for unknown trust level warning in governance strategy; add EVAL_TRUST_LEVEL_UNKNOWN constant
- Move Sequence out of TYPE_CHECKING in models.py for PEP 649 safety
- Extract shared scoring constants to evaluation/constants.py, replace duplicated _MAX_SCORE/_NEUTRAL_SCORE/_FULL_CONFIDENCE_DATA_POINTS across 5 modules
- Decompose all strategy score() methods into <50-line helpers: _collect_metrics, _build_result, _neutral, _compute_confidence
- Extract _get_pillar_configs from _resolve_enabled_pillars in evaluator
- Add EvaluationReport validators: agent_id/snapshot consistency, pillar_weights/pillar_scores correspondence
- Extract magic number 3 in UX confidence to named constant
- Add logging for silent neutral returns in governance/resilience/experience strategy no-enabled-metrics paths
- Fix docstrings: intelligence strategy LLM calibration origin note, efficiency docstring second neutral path, evaluator Args pillar names, resilience metrics capping behavior, add __init__ docstring
- Fix agents.md frontmatter description to include 'evaluation'
- Add 10 tests: CI-disabled+no-calibration, EvaluationContext agent_id mismatch, EvaluationReport agent_id/weights validators, efficiency 7d fallback/neutral/clamping, parametrized all-metrics-disabled for all 5 configs
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/synthorg/hr/evaluation/evaluator.py`:
- Around line 75-91: In __init__, validate any non-None injected strategy
(intelligence_strategy, resilience_strategy, governance_strategy, ux_strategy)
by checking its strategy.pillar equals the expected EvaluationPillar for that
slot (e.g., intelligence -> EvaluationPillar.INTELLIGENCE, resilience ->
RESILIENCE, governance -> GOVERNANCE, ux -> UX); if a mismatch is found raise a
ValueError with a clear message naming the slot and actual strategy.pillar so
the failure occurs at construction time; keep using the existing _default_*()
for None inputs but still assert their .pillar if you want extra safety.
In `@src/synthorg/hr/evaluation/experience_strategy.py`:
- Around line 84-103: The sufficiency check and downstream
confidence/data_point_count must count only feedback entries that contributed at
least one enabled metric; change the flow so you first identify/filter
contributing entries (e.g., compute contributing_feedback = [f for f in feedback
if it has at least one enabled metric according to cfg] or update
_collect_metrics to return both available metrics and the per-feedback
contribution set), then use len(contributing_feedback) instead of len(feedback)
when comparing to cfg.min_feedback_count and when computing
data_point_count/confidence; finally pass the filtered contributing_feedback (or
use the contributed-count returned by _collect_metrics) into _build_result and
call _neutral when contributing count < cfg.min_feedback_count (using the same
reason keys), keeping calls to _neutral and symbols _collect_metrics,
_build_result, _neutral, and cfg.min_feedback_count consistent.
In `@src/synthorg/hr/evaluation/governance_strategy.py`:
- Around line 29-35: The trust-score map _TRUST_LEVEL_SCORES currently omits the
legitimate TrustLevel.CUSTOM value, causing agents with "custom" to be treated
as unknown (EVAL_TRUST_LEVEL_UNKNOWN) and receive the neutral fallback; update
the logic to explicitly handle "custom" by either adding a "custom" key to
_TRUST_LEVEL_SCORES or—preferably—resolve the custom trust policy and compute a
score from that policy before falling back, by updating the code paths that
reference _TRUST_LEVEL_SCORES and the evaluator that emits
EVAL_TRUST_LEVEL_UNKNOWN (use TrustLevel.CUSTOM as the discriminant and call the
custom-policy resolution routine to derive the numeric score).
- Around line 75-89: Remove the early neutral-return that blocks scoring when
total_audits == 0 and context.trust_level is None; instead call
self._collect_metrics(context, total_audits) unconditionally so that the
collector can evaluate enabled metrics (including autonomy_compliance) and
decide if there is data. After calling _collect_metrics use its returned
enabled/data_points to decide whether to return self._neutral(...) or to call
self._build_result(scores, enabled, data_points, context). Keep references to
the same methods/variables: _collect_metrics, _neutral, _build_result,
total_audits, and context.trust_level (do not add new gating logic before
calling _collect_metrics).
In `@src/synthorg/hr/evaluation/intelligence_strategy.py`:
- Around line 64-67: The current logic returns neutral when
context.snapshot.overall_quality_score is None even if CI quality is disabled or
calibration data exists; update the flow in intelligence_strategy.py so
overall_quality_score is only treated as a CI data source when
ci_quality_enabled is true and only include CI-derived points in
_collect_metrics() when ci_quality_enabled is true (i.e., stop preloading
data_points from task_records unless ci_quality_enabled), change the
early-return that calls self._neutral(reason="no_quality_score") to check that
no enabled metric has usable data before returning neutral, and add a regression
test that sets ci_quality_enabled=False with overall_quality_score=None but with
calibration records present to ensure scoring proceeds using calibration only.
In `@src/synthorg/hr/evaluation/models.py`:
- Around line 372-413: The current _validate_weights_match_scores only compares
sets and misses duplicate pillar names and invalid floats; update validation for
the pillar_weights field (and/or _validate_weights_match_scores) to (1) detect
and reject duplicate pillar names in pillar_weights (collect seen names and
raise ValueError listing duplicates), (2) ensure each weight is a real number
within [0.0, 1.0] (reject negatives or >1), and (3) ensure the weights are
normalized (sum(weights) ≈ 1.0 within a small epsilon) and raise descriptive
ValueError messages if any check fails; keep these checks in the model_validator
decorated method(s) for EvaluationReport so invalid/ambiguous weighting schemes
cannot be constructed.
- Around line 262-328: Add an additional after-model validator (e.g. def
_validate_agent_scoped_records_consistency(self) -> Self) that iterates
task_records, calibration_records, and feedback and ensures each record.agent_id
equals self.agent_id; if any mismatch is found raise ValueError with a clear
message identifying the collection and offending record (index or repr). Keep
the existing _validate_agent_id_consistency but implement this new validator to
enforce agent_id consistency across TaskMetricRecord, LlmCalibrationRecord, and
InteractionFeedback collections.
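A standalone sketch of the requested consistency check, using plain dicts in place of the record models (a hypothetical simplification of the after-model validator):

```python
def check_agent_scoped(
    agent_id: str, collections: dict[str, list[dict[str, str]]]
) -> None:
    """Raise if any record's agent_id disagrees with the context's agent_id."""
    for name, records in collections.items():
        for i, record in enumerate(records):
            if record["agent_id"] != agent_id:
                raise ValueError(
                    f"{name}[{i}] has agent_id {record['agent_id']!r}, "
                    f"expected {agent_id!r}"
                )
```

Naming both the offending collection and the record index in the error message makes a miswired context debuggable at validation time.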
In `@tests/unit/hr/evaluation/test_evaluator.py`:
- Around line 167-210: Update the two tests to force the snapshot shapes so the
fallback and neutral branches in EvaluationService._score_efficiency() are
actually exercised: in test_efficiency_7d_window_fallback() monkeypatch
PerformanceTracker.get_snapshot (or the EvaluationService.get_snapshot helper)
to return a snapshot containing only the 7d window (no 30d data), call
svc.evaluate(agent_id) and assert the efficiency pillar's score and confidence
match the known 7d-fallback expected values; in
test_efficiency_no_window_returns_neutral() patch get_snapshot to return no
windows (empty snapshot), call svc.evaluate(agent_id) and assert the efficiency
pillar's score and confidence equal the neutral values returned by
_score_efficiency() for no-data cases. Ensure you reference
EvaluationService._score_efficiency, PerformanceTracker.get_snapshot (or the
concrete get_snapshot you use), and the test functions
test_efficiency_7d_window_fallback and test_efficiency_no_window_returns_neutral
when making the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c3ec6139-1a8d-493e-bbf9-107516e8da5a
📒 Files selected for processing (14)
- docs/design/agents.md
- src/synthorg/hr/evaluation/constants.py
- src/synthorg/hr/evaluation/evaluator.py
- src/synthorg/hr/evaluation/experience_strategy.py
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- src/synthorg/hr/evaluation/models.py
- src/synthorg/hr/evaluation/resilience_strategy.py
- src/synthorg/observability/events/evaluation.py
- tests/unit/hr/evaluation/conftest.py
- tests/unit/hr/evaluation/test_config.py
- tests/unit/hr/evaluation/test_evaluator.py
- tests/unit/hr/evaluation/test_intelligence_strategy.py
- tests/unit/hr/evaluation/test_models.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Test (Python 3.14)
- GitHub Check: Build Backend
- GitHub Check: Build Sandbox
- GitHub Check: Build Web
- GitHub Check: Dependency Review
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: No `from __future__ import annotations` in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use `except A, B:` (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use `@computed_field` for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)
Files:
- src/synthorg/hr/evaluation/constants.py
- src/synthorg/observability/events/evaluation.py
- tests/unit/hr/evaluation/test_intelligence_strategy.py
- tests/unit/hr/evaluation/test_config.py
- src/synthorg/hr/evaluation/resilience_strategy.py
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- src/synthorg/hr/evaluation/experience_strategy.py
- tests/unit/hr/evaluation/test_models.py
- src/synthorg/hr/evaluation/models.py
- src/synthorg/hr/evaluation/evaluator.py
- tests/unit/hr/evaluation/conftest.py
- tests/unit/hr/evaluation/test_evaluator.py
src/synthorg/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/synthorg/**/*.py: Every Python module with business logic must have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
Never use `import logging`/`logging.getLogger()`/`print()` in Python application code; exceptions are observability/setup.py, observability/sinks.py, observability/syslog_handler.py, and observability/http_handler.py
Python logger variable name must always be `logger` (not `_logger`, not `log`)
Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Use structured logging with kwargs in Python: always `logger.info(EVENT, key=value)` -- never `logger.info('msg %s', val)`
All error paths in Python must log at WARNING or ERROR with context before raising
All state transitions in Python must log at INFO level
Use DEBUG logging level in Python for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports in Python do NOT need logging
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned Python code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases
Files:
- src/synthorg/hr/evaluation/constants.py
- src/synthorg/observability/events/evaluation.py
- src/synthorg/hr/evaluation/resilience_strategy.py
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- src/synthorg/hr/evaluation/experience_strategy.py
- src/synthorg/hr/evaluation/models.py
- src/synthorg/hr/evaluation/evaluator.py
docs/**/*.md
📄 CodeRabbit inference engine (CLAUDE.md)
Documentation files in docs/ are Markdown, built with Zensical, configured in mkdocs.yml; design spec in docs/design/ (12 pages), Architecture in docs/architecture/, Roadmap in docs/roadmap/
Files:
docs/design/agents.md
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: All Python test files must use `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow` markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer `@pytest.mark.parametrize` for testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (`@given` + `@settings`); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions
Files:
- tests/unit/hr/evaluation/test_intelligence_strategy.py
- tests/unit/hr/evaluation/test_config.py
- tests/unit/hr/evaluation/test_models.py
- tests/unit/hr/evaluation/conftest.py
- tests/unit/hr/evaluation/test_evaluator.py
🧠 Learnings (34)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to docs/design/**/*.md : Design specification pages in `docs/design/` must be consulted before implementing features (7 pages: index, agents, organization, communication, engine, memory, operations)
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to docs/design/*.md : Design spec pages: 7 pages in `docs/design/` — index, agents, organization, communication, engine, memory, operations
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue — DESIGN_SPEC.md is a pointer file linking to 7 design pages (Agents, Organization, Communication, Engine, Memory, Operations)
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-14T15:43:05.601Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T15:43:05.601Z
Learning: Applies to docs/** : Docs source in docs/ (Markdown, built with Zensical); design spec in docs/design/ (7 pages: index, agents, organization, communication, engine, memory, operations)
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue. DESIGN_SPEC.md is a pointer file linking to the 7 design pages (index, agents, organization, communication, engine, memory, operations).
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Applied to files:
docs/design/agents.md, tests/unit/hr/evaluation/test_models.py, src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/security/**/*.py : Security module includes SecOps agent, rule engine (soft-allow/hard-deny), audit log, output scanner, risk classifier, autonomy levels (4 strategies), timeout policies.
Applied to files:
docs/design/agents.md, src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/security/**/*.py : Security package (security/): SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume)
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Engine: Agent orchestration, execution loops, parallel execution, task decomposition, routing, task assignment, centralized single-writer task state engine (TaskEngine), task lifecycle, recovery, shutdown, workspace isolation, coordination (multi-agent pipeline: TopologyDispatcher protocol, 4 dispatchers — SAS/centralized/decentralized/context-dependent, wave execution, workspace lifecycle integration, CoordinationSectionConfig company config bridge, build_coordinator factory), coordination error classification, prompt policy validation, checkpoint recovery (checkpoint/, per-turn persistence, heartbeat detection, CheckpointRecoveryStrategy), approval gate (escalation detection, context parking/resume, EscalationInfo/ResumePayload models), stagnation detection (stagnation/, StagnationDetector protocol, ToolRepetitionDetector, dual-signal analysis, corrective prompt injection), agent runtime state (AgentRuntimeState, lightweight per-agent execution status for dashboard queries and recove...
Applied to files:
docs/design/agents.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability package (observability/): structured logging, correlation tracking, log sinks; event constants organized by domain under observability/events/ (e.g., events.api, events.tool, events.git, events.context_budget, events.backup)
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from `synthorg.observability.events.<domain>` modules (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly and use in structured logging
Applied to files:
src/synthorg/observability/events/evaluation.py, src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from the domain-specific module under `synthorg.observability.events` in logging calls
Applied to files:
src/synthorg/observability/events/evaluation.py, src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-20T11:18:48.128Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T11:18:48.128Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-18T21:23:23.586Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-18T21:23:23.586Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from the domain-specific module under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool). Import directly from synthorg.observability.events.<domain>.
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from synthorg.observability.events domain-specific modules (e.g., PROVIDER_CALL_START from events.provider). Import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT.
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.
Applied to files:
src/synthorg/observability/events/evaluation.py, src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly rather than using string literals
Applied to files:
src/synthorg/observability/events/evaluation.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/synthorg/observability/events/evaluation.py, src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (via `model_copy(update=...)`) for runtime state that evolves
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state
Applied to files:
tests/unit/hr/evaluation/test_config.py, src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Applied to files:
tests/unit/hr/evaluation/test_config.py
📚 Learning: 2026-04-02T18:54:07.757Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:54:07.757Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Applied to files:
tests/unit/hr/evaluation/test_config.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models with `model_copy(update=...)` for runtime state that evolves
Applied to files:
tests/unit/hr/evaluation/test_config.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 BaseModel, model_validator, computed_field, ConfigDict.
Applied to files:
tests/unit/hr/evaluation/test_config.py
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Use event constants from `synthorg.observability.events.<domain>` (e.g., `API_REQUEST_STARTED` from `events.api`); import directly and log with structured kwargs: `logger.info(EVENT, key=value)`, never interpolated strings
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : All error paths must log at WARNING or ERROR with context before raising.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-17T06:43:14.114Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:43:14.114Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions. Pure data models, enums, and re-exports do NOT need logging.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
```python
def __init__(  # noqa: PLR0913
    self,
    *,
    tracker: PerformanceTracker,
    intelligence_strategy: PillarScoringStrategy | None = None,
    resilience_strategy: PillarScoringStrategy | None = None,
    governance_strategy: PillarScoringStrategy | None = None,
    ux_strategy: PillarScoringStrategy | None = None,
    config: EvaluationConfig | None = None,
) -> None:
    """Initialize the evaluation service."""
    self._tracker = tracker
    self._config = config or EvaluationConfig()
    self._intelligence = intelligence_strategy or self._default_intelligence()
    self._resilience = resilience_strategy or self._default_resilience()
    self._governance = governance_strategy or self._default_governance()
    self._ux = ux_strategy or self._default_ux()
```
Validate injected strategies against their pillar slots.
The service accepts pluggable strategies but stores them without checking strategy.pillar. A miswired dependency passed into the wrong constructor slot will fail much later during evaluate() with duplicate or mismatched pillar data instead of at construction time. Fail fast in __init__ by validating each injected strategy against the expected EvaluationPillar.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/evaluator.py` around lines 75 - 91, In __init__,
validate any non-None injected strategy (intelligence_strategy,
resilience_strategy, governance_strategy, ux_strategy) by checking its
strategy.pillar equals the expected EvaluationPillar for that slot (e.g.,
intelligence -> EvaluationPillar.INTELLIGENCE, resilience -> RESILIENCE,
governance -> GOVERNANCE, ux -> UX); if a mismatch is found raise a ValueError
with a clear message naming the slot and actual strategy.pillar so the failure
occurs at construction time; keep using the existing _default_*() for None
inputs but still assert their .pillar if you want extra safety.
```python
cfg = context.config.experience
feedback = context.feedback

if len(feedback) < cfg.min_feedback_count:
    return self._neutral(
        context,
        reason="insufficient_feedback",
        count=len(feedback),
        min_required=cfg.min_feedback_count,
    )

available = self._collect_metrics(cfg, feedback)

if not available:
    return self._neutral(
        context,
        reason="no_enabled_metrics_with_data",
    )

return self._build_result(available, feedback, context)
```
Count only contributing feedback toward UX sufficiency and confidence.
len(feedback) includes records where every enabled rating is None. With one real rating and many empty submissions, this path clears min_feedback_count, inflates data_point_count, and can push confidence close to 1.0 even though almost no UX signal was used. Base the sufficiency check, confidence, and data_point_count on feedback entries that contributed at least one enabled metric.
Also applies to: 155-167
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/experience_strategy.py` around lines 84 - 103, The
sufficiency check and downstream confidence/data_point_count must count only
feedback entries that contributed at least one enabled metric; change the flow
so you first identify/filter contributing entries (e.g., compute
contributing_feedback = [f for f in feedback if it has at least one enabled
metric according to cfg] or update _collect_metrics to return both available
metrics and the per-feedback contribution set), then use
len(contributing_feedback) instead of len(feedback) when comparing to
cfg.min_feedback_count and when computing data_point_count/confidence; finally
pass the filtered contributing_feedback (or use the contributed-count returned
by _collect_metrics) into _build_result and call _neutral when contributing
count < cfg.min_feedback_count (using the same reason keys), keeping calls to
_neutral and symbols _collect_metrics, _build_result, _neutral, and
cfg.min_feedback_count consistent.
```python
# Trust level to score mapping.
_TRUST_LEVEL_SCORES: dict[str, float] = {
    "sandboxed": 2.5,
    "restricted": 5.0,
    "standard": 7.5,
    "elevated": 10.0,
}
```
Handle the valid custom trust level explicitly.
src/synthorg/core/enums.py defines TrustLevel.CUSTOM = "custom", but this table does not. Agents with that legitimate value will emit EVAL_TRUST_LEVEL_UNKNOWN and get the neutral fallback instead of a trust score. Add a dedicated custom path, or derive the score from the resolved custom trust policy instead of routing it through the unknown-level fallback.
Also applies to: 148-163
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/governance_strategy.py` around lines 29 - 35, The
trust-score map _TRUST_LEVEL_SCORES currently omits the legitimate
TrustLevel.CUSTOM value, causing agents with "custom" to be treated as unknown
(EVAL_TRUST_LEVEL_UNKNOWN) and receive the neutral fallback; update the logic to
explicitly handle "custom" by either adding a "custom" key to
_TRUST_LEVEL_SCORES or—preferably—resolve the custom trust policy and compute a
score from that policy before falling back, by updating the code paths that
reference _TRUST_LEVEL_SCORES and the evaluator that emits
EVAL_TRUST_LEVEL_UNKNOWN (use TrustLevel.CUSTOM as the discriminant and call the
custom-policy resolution routine to derive the numeric score).
```python
if total_audits == 0 and context.trust_level is None:
    return self._neutral(context, reason="no_governance_data")

scores, enabled, data_points = self._collect_metrics(
    context,
    total_audits,
)

if not enabled:
    return self._neutral(
        context,
        reason="no_enabled_metrics_with_data",
    )

return self._build_result(scores, enabled, data_points, context)
```
Autonomy-only governance scoring is blocked by the early neutral return.
This precheck short-circuits before _collect_metrics() can score autonomy_compliance, so a configuration that enables only autonomy can never produce a real governance score. Let the collector decide whether any enabled metric has data instead of requiring audits or trust up front.
Suggested fix

```diff
-    if total_audits == 0 and context.trust_level is None:
-        return self._neutral(context, reason="no_governance_data")
-
     scores, enabled, data_points = self._collect_metrics(
         context,
         total_audits,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/governance_strategy.py` around lines 75 - 89,
Remove the early neutral-return that blocks scoring when total_audits == 0 and
context.trust_level is None; instead call self._collect_metrics(context,
total_audits) unconditionally so that the collector can evaluate enabled metrics
(including autonomy_compliance) and decide if there is data. After calling
_collect_metrics use its returned enabled/data_points to decide whether to
return self._neutral(...) or to call self._build_result(scores, enabled,
data_points, context). Keep references to the same methods/variables:
_collect_metrics, _neutral, _build_result, total_audits, and context.trust_level
(do not add new gating logic before calling _collect_metrics).
```python
ci_score = context.snapshot.overall_quality_score

if ci_score is None:
    return self._neutral(context, reason="no_quality_score")
```
Don't make CI quality a hard prerequisite—or confidence source—when it isn't used.
This path returns neutral before calibration is considered, so a calibration-only setup cannot score if overall_quality_score is missing. _collect_metrics() also preloads data_points from task_records even when ci_quality is disabled or skipped, which inflates confidence for LLM-only results. Only count CI data when the CI metric is actually included, and return neutral only when no enabled metric has usable data. Please add a regression test for ci_quality_enabled=False with overall_quality_score=None and calibration records present.
Suggested fix

```diff
-    ci_score = context.snapshot.overall_quality_score
-
-    if ci_score is None:
-        return self._neutral(context, reason="no_quality_score")
-
     available, data_points, drift = self._collect_metrics(
-        ci_score,
+        context.snapshot.overall_quality_score,
         context,
     )
     if not available:
         return self._neutral(context, reason="no_enabled_metrics")
@@
-        ci_score: float,
+        ci_score: float | None,
         context: EvaluationContext,
     ) -> tuple[list[tuple[str, float, float]], int, float]:
@@
-        data_points = len(context.task_records)
+        data_points = 0
         calibration_drift = 0.0
-        if context.config.intelligence.ci_quality_enabled:
+        if context.config.intelligence.ci_quality_enabled and ci_score is not None:
             available.append(
                 (
                     "ci_quality",
                     context.config.intelligence.ci_quality_weight,
                     ci_score,
                 )
             )
+            data_points += len(context.task_records)
+        elif context.config.intelligence.ci_quality_enabled:
+            logger.debug(
+                EVAL_METRIC_SKIPPED,
+                agent_id=context.agent_id,
+                pillar=self.pillar.value,
+                metric="ci_quality",
+                reason="no_quality_score",
+            )
```

Also applies to: 78-123
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/intelligence_strategy.py` around lines 64 - 67,
The current logic returns neutral when context.snapshot.overall_quality_score is
None even if CI quality is disabled or calibration data exists; update the flow
in intelligence_strategy.py so overall_quality_score is only treated as a CI
data source when ci_quality_enabled is true and only include CI-derived points
in _collect_metrics() when ci_quality_enabled is true (i.e., stop preloading
data_points from task_records unless ci_quality_enabled), change the
early-return that calls self._neutral(reason="no_quality_score") to check that
no enabled metric has usable data before returning neutral, and add a regression
test that sets ci_quality_enabled=False with overall_quality_score=None but with
calibration records present to ensure scoring proceeds using calibration only.
```python
agent_id: NotBlankStr = Field(description="Agent being evaluated")
now: AwareDatetime = Field(description="Reference timestamp")
config: EvaluationConfig = Field(description="Evaluation configuration")
snapshot: AgentPerformanceSnapshot = Field(
    description="Performance snapshot from the tracker",
)
task_records: tuple[TaskMetricRecord, ...] = Field(
    default=(),
    description="Raw task metric records",
)
calibration_records: tuple[LlmCalibrationRecord, ...] = Field(
    default=(),
    description="LLM calibration records",
)
feedback: tuple[InteractionFeedback, ...] = Field(
    default=(),
    description="Interaction feedback records",
)
resilience_metrics: ResilienceMetrics | None = Field(
    default=None,
    description="Derived resilience metrics",
)
audit_allow_count: int = Field(
    ge=0,
    default=0,
    description="Allowed audit entries in the window",
)
audit_deny_count: int = Field(
    ge=0,
    default=0,
    description="Denied audit entries in the window",
)
audit_escalate_count: int = Field(
    ge=0,
    default=0,
    description="Escalated audit entries in the window",
)
audit_high_risk_count: int = Field(
    ge=0,
    default=0,
    description="High-risk audit entries in the window",
)
trust_level: NotBlankStr | None = Field(
    default=None,
    description="Current trust level name",
)
trust_demotions_in_window: int = Field(
    ge=0,
    default=0,
    description="Trust demotions in the window",
)
autonomy_downgrades_in_window: int = Field(
    ge=0,
    default=0,
    description="Autonomy downgrades in the window",
)

@model_validator(mode="after")
def _validate_agent_id_consistency(self) -> Self:
    """Ensure context agent_id matches snapshot agent_id."""
    if self.agent_id != self.snapshot.agent_id:
        msg = (
            f"Context agent_id ({self.agent_id}) does not match "
            f"snapshot agent_id ({self.snapshot.agent_id})"
        )
        raise ValueError(msg)
    return self
```
Reject mixed-agent records in EvaluationContext.
The model only checks snapshot.agent_id. A caller can still build a context for one agent that carries agent-scoped records from another agent, and the strategies will score that foreign data as if it belonged to the current agent. Add an after-validator that enforces agent_id consistency across the agent-scoped collections in this model.
As per coding guidelines, "Validate at system boundaries in Python (user input, external APIs, config files)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/models.py` around lines 262 - 328, Add an
additional after-model validator (e.g. def
_validate_agent_scoped_records_consistency(self) -> Self) that iterates
task_records, calibration_records, and feedback and ensures each record.agent_id
equals self.agent_id; if any mismatch is found raise ValueError with a clear
message identifying the collection and offending record (index or repr). Keep
the existing _validate_agent_id_consistency but implement this new validator to
enforce agent_id consistency across TaskMetricRecord, LlmCalibrationRecord, and
InteractionFeedback collections.
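The cross-collection check can be sketched as a plain function. In the real model it would live in a Pydantic `@model_validator(mode="after")`; the dict-based record shape here is a simplification:

```python
def validate_agent_scoped_records(
    agent_id: str,
    collections: dict[str, list[dict[str, str]]],
) -> None:
    """Raise if any agent-scoped record belongs to a different agent."""
    for name, records in collections.items():
        for index, record in enumerate(records):
            if record["agent_id"] != agent_id:
                msg = (
                    f"{name}[{index}] agent_id ({record['agent_id']!r}) "
                    f"does not match context agent_id ({agent_id!r})"
                )
                raise ValueError(msg)


# Matching records validate silently.
validate_agent_scoped_records(
    "agent-1",
    {"task_records": [{"agent_id": "agent-1"}]},
)

# A foreign record is rejected, naming the collection and index.
try:
    validate_agent_scoped_records(
        "agent-1",
        {"feedback": [{"agent_id": "agent-2"}]},
    )
except ValueError as exc:
    error_message = str(exc)
```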
```python
pillar_weights: tuple[tuple[NotBlankStr, float], ...] = Field(
    description="Applied weights as (pillar_name, weight) pairs",
)

@model_validator(mode="after")
def _validate_unique_pillars(self) -> Self:
    """Ensure pillar scores have unique pillar names."""
    names = [ps.pillar for ps in self.pillar_scores]
    if len(names) != len(set(names)):
        seen: set[EvaluationPillar] = set()
        dupes: list[str] = []
        for n in names:
            if n in seen:
                dupes.append(n.value)
            seen.add(n)
        msg = f"Duplicate pillar scores: {', '.join(dupes)}"
        raise ValueError(msg)
    return self

@model_validator(mode="after")
def _validate_agent_id_consistency(self) -> Self:
    """Ensure report agent_id matches snapshot agent_id."""
    if self.agent_id != self.snapshot.agent_id:
        msg = (
            f"Report agent_id ({self.agent_id}) does not match "
            f"snapshot agent_id ({self.snapshot.agent_id})"
        )
        raise ValueError(msg)
    return self

@model_validator(mode="after")
def _validate_weights_match_scores(self) -> Self:
    """Ensure pillar_weights entries correspond to pillar_scores."""
    score_pillars = {ps.pillar.value for ps in self.pillar_scores}
    weight_pillars = {name for name, _ in self.pillar_weights}
    if score_pillars != weight_pillars:
        msg = (
            f"Pillar weight names {sorted(weight_pillars)} do not match "
            f"pillar score names {sorted(score_pillars)}"
        )
        raise ValueError(msg)
    return self
```
pillar_weights validation is too weak for a public report model.
_validate_weights_match_scores() compares sets only, so duplicate entries like (("intelligence", 0.5), ("intelligence", 0.5)) still validate as long as the score set is {"intelligence"}. The field also accepts unconstrained floats, so negative or >1 weights can slip through. Reject duplicate names and enforce bounded, normalized weights here so EvaluationReport cannot represent an ambiguous weighting scheme.
As per coding guidelines, "Validate at system boundaries in Python (user input, external APIs, config files)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/hr/evaluation/models.py` around lines 372 - 413, The current
_validate_weights_match_scores only compares sets and misses duplicate pillar
names and invalid floats; update validation for the pillar_weights field (and/or
_validate_weights_match_scores) to (1) detect and reject duplicate pillar names
in pillar_weights (collect seen names and raise ValueError listing duplicates),
(2) ensure each weight is a real number within [0.0, 1.0] (reject negatives or
>1), and (3) ensure the weights are normalized (sum(weights) ≈ 1.0 within a
small epsilon) and raise descriptive ValueError messages if any check fails;
keep these checks in the model_validator decorated method(s) for
EvaluationReport so invalid/ambiguous weighting schemes cannot be constructed.
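The three checks can be sketched as one standalone function; the epsilon and message wording are illustrative, and in the model they would sit inside the existing validator:

```python
def validate_pillar_weights(
    pillar_weights: tuple[tuple[str, float], ...],
    epsilon: float = 1e-9,
) -> None:
    """Reject duplicates, out-of-range weights, and non-normalized totals."""
    names = [name for name, _ in pillar_weights]
    dupes = {n for n in names if names.count(n) > 1}
    if dupes:
        raise ValueError(f"Duplicate pillar_weights entries: {sorted(dupes)}")
    bad = [name for name, w in pillar_weights if not 0.0 <= w <= 1.0]
    if bad:
        raise ValueError(f"pillar_weights must be within [0.0, 1.0] for: {bad}")
    total = sum(w for _, w in pillar_weights)
    if abs(total - 1.0) > epsilon:
        raise ValueError(f"pillar_weights must sum to 1.0, got {total}")


# Well-formed weights validate silently.
validate_pillar_weights((("intelligence", 0.6), ("experience", 0.4)))

# Duplicate names are rejected even though the name *set* matches.
try:
    validate_pillar_weights((("intelligence", 0.5), ("intelligence", 0.5)))
except ValueError as exc:
    duplicate_error = str(exc)
```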
- Fix intelligence strategy: CI quality is no longer a hard prerequisite; calibration-only mode works when overall_quality_score is None
- Fix governance strategy: autonomy-only scoring no longer blocked by the early neutral return (total_audits==0 && trust_level==None)
- Strengthen EvaluationReport pillar_weights validator: reject duplicate weight entries before set comparison
- Fix efficiency tests to actually test 7d fallback and neutral paths using direct _score_efficiency calls with custom snapshots
- Update governance no-data test to disable autonomy for true neutral
Actionable comments posted: 1
♻️ Duplicate comments (4)
src/synthorg/hr/evaluation/models.py (2)
268-328: ⚠️ Potential issue | 🟠 Major

Reject foreign-agent records in `EvaluationContext`.

The current validator only ties `agent_id` to `snapshot.agent_id`. A caller can still pass `task_records`, `calibration_records`, or `feedback` belonging to another agent, and the strategies will score that foreign data as if it were local.

Proposed fix
```diff
     @model_validator(mode="after")
     def _validate_agent_id_consistency(self) -> Self:
         """Ensure context agent_id matches snapshot agent_id."""
         if self.agent_id != self.snapshot.agent_id:
             msg = (
                 f"Context agent_id ({self.agent_id}) does not match "
                 f"snapshot agent_id ({self.snapshot.agent_id})"
             )
             raise ValueError(msg)
         return self
+
+    @model_validator(mode="after")
+    def _validate_agent_scoped_records(self) -> Self:
+        """Ensure agent-scoped collections match the context agent."""
+        collections = (
+            ("task_records", self.task_records),
+            ("calibration_records", self.calibration_records),
+            ("feedback", self.feedback),
+        )
+        for collection_name, records in collections:
+            for index, record in enumerate(records):
+                if record.agent_id != self.agent_id:
+                    msg = (
+                        f"{collection_name}[{index}] agent_id "
+                        f"({record.agent_id}) does not match "
+                        f"context agent_id ({self.agent_id})"
+                    )
+                    raise ValueError(msg)
+        return self
```

As per coding guidelines, "Validate at system boundaries in Python (user input, external APIs, config files)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/synthorg/hr/evaluation/models.py` around lines 268 - 328, The current _validate_agent_id_consistency only compares self.agent_id to self.snapshot.agent_id but does not reject task_records, calibration_records, or feedback that belong to a different agent; update _validate_agent_id_consistency in EvaluationContext to iterate over task_records (TaskMetricRecord.agent_id), calibration_records (LlmCalibrationRecord.agent_id), and feedback (InteractionFeedback.agent_id) and raise a ValueError if any record.agent_id != self.agent_id (include which record type and offending id in the message); keep the existing snapshot check and return self at the end.
372-374: ⚠️ Potential issue | 🟠 Major

Finish hardening `pillar_weights` on the report model.

`_validate_weights_match_scores()` now rejects duplicate names, but it still accepts negative weights, weights above 1.0, or totals that do not sum to 1.0. That leaves `EvaluationReport` open to ambiguous weighting schemes even though `overall_score` is defined as weighted output.

Proposed fix
```diff
 @model_validator(mode="after")
 def _validate_weights_match_scores(self) -> Self:
     """Ensure pillar_weights entries correspond to pillar_scores."""
     weight_names = [name for name, _ in self.pillar_weights]
     if len(weight_names) != len(set(weight_names)):
         msg = "Duplicate entries in pillar_weights"
         raise ValueError(msg)
+    invalid_weights = [
+        str(name)
+        for name, weight in self.pillar_weights
+        if weight < 0.0 or weight > 1.0
+    ]
+    if invalid_weights:
+        msg = (
+            "pillar_weights must be within [0.0, 1.0] for: "
+            f"{', '.join(invalid_weights)}"
+        )
+        raise ValueError(msg)
+    total_weight = sum(weight for _, weight in self.pillar_weights)
+    if abs(total_weight - 1.0) > 1e-9:
+        msg = f"pillar_weights must sum to 1.0, got {total_weight}"
+        raise ValueError(msg)
     score_pillars = {ps.pillar.value for ps in self.pillar_scores}
     weight_pillars = set(weight_names)
     if score_pillars != weight_pillars:
         msg = (
             f"Pillar weight names {sorted(weight_pillars)} do not match "
```

As per coding guidelines, "Validate at system boundaries in Python (user input, external APIs, config files)."
Also applies to: 402-417
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/synthorg/hr/evaluation/models.py` around lines 372 - 374, The pillar_weights field must be hardened: update the _validate_weights_match_scores validator (used by EvaluationReport and the related scores validator) to reject weights < 0 or > 1, enforce that the sum of all weights equals 1.0 within a small epsilon (e.g., 1e-6), and keep the existing duplicate-name check; raise clear ValueError messages identifying the offending pillar name or the total sum mismatch. Ensure the validator is applied to the pillar_weights tuple[tuple[NotBlankStr, float], ...] field (and reused for the other weights-validated field handled by _validate_weights_match_scores) so all weight inputs are normalized and validated at the model boundary.src/synthorg/hr/evaluation/intelligence_strategy.py (1)
79-123: ⚠️ Potential issue | 🟠 Major

Confidence is still inflated in calibration-only runs.

`data_points` starts at `len(context.task_records)` before CI quality is proven usable. When `ci_quality` is disabled or `overall_quality_score` is missing, calibration-only scoring still gains confidence from unrelated task counts. Start from 0 and only add task records when the CI component is actually appended.

Proposed fix
```diff
     available: list[tuple[str, float, float]] = []
-    data_points = len(context.task_records)
+    data_points = 0
     calibration_drift = 0.0
     ci_score = context.snapshot.overall_quality_score
     if context.config.intelligence.ci_quality_enabled and ci_score is not None:
         available.append(
@@
                 context.config.intelligence.ci_quality_weight,
                 ci_score,
             )
         )
+        data_points += len(context.task_records)
     elif context.config.intelligence.ci_quality_enabled:
         logger.debug(
             EVAL_METRIC_SKIPPED,
             agent_id=context.agent_id,
             pillar=self.pillar.value,
```
calibration_records.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/synthorg/hr/evaluation/intelligence_strategy.py` around lines 79 - 123, data_points is initialized from len(context.task_records) so calibration-only runs get inflated confidence; change initialization to data_points = 0 and only add len(context.task_records) when you append the "ci_quality" tuple (i.e., inside the block where you call available.append for "ci_quality") and keep adding len(records) for calibration_records as already done; also add a regression test (e.g., test_calibration_only_confidence_tied_to_calibration_records) that creates context with task_records present but ci_quality disabled or no overall_quality_score and asserts returned data_points equals number of calibration_records only.src/synthorg/hr/evaluation/governance_strategy.py (1)
29-35: ⚠️ Potential issue | 🟠 Major

Verify the supported `custom` trust level doesn't fall through the unknown path.

If `TrustLevel.CUSTOM` is still a valid value in `src/synthorg/core/enums.py`, this table will log a legitimate trust state as unknown and score it with the neutral fallback. Add an explicit `"custom"` branch, or derive the score from the resolved custom policy instead of routing it through `EVAL_TRUST_LEVEL_UNKNOWN`.

Run this read-only check to confirm the upstream enum still exposes `CUSTOM`:

```bash
#!/bin/bash
rg -n -C2 'class TrustLevel|CUSTOM|custom' src/synthorg/core/enums.py
```

If that enum member is still present, please add a regression test for the legitimate `custom` path as well.

Also applies to: 149-169
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/synthorg/hr/evaluation/governance_strategy.py` around lines 29 - 35, The trust-score map _TRUST_LEVEL_SCORES currently omits the "custom" key which causes legitimate TrustLevel.CUSTOM values to hit EVAL_TRUST_LEVEL_UNKNOWN and use the neutral fallback; update the mapping in governance_strategy.py to handle "custom" explicitly (or compute the score from the resolved custom policy) and update any code-path that maps TrustLevel -> score to use that branch instead of falling back to EVAL_TRUST_LEVEL_UNKNOWN; reference symbols: _TRUST_LEVEL_SCORES, TrustLevel.CUSTOM, EVAL_TRUST_LEVEL_UNKNOWN, and ensure you add a regression test that constructs a TrustLevel.CUSTOM case and asserts the expected non-neutral score/path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/synthorg/hr/evaluation/models.py`:
- Around line 65-139: InteractionFeedback currently allows records with all
ratings None and free_text blank; add an after-model validator on
InteractionFeedback (use Pydantic V2 `@model_validator`(mode="after") or
equivalent) that inspects clarity_rating, tone_rating, helpfulness_rating,
trust_rating, satisfaction_rating and free_text and raises a ValueError when
every rating is None and free_text is None or free_text.strip() == "" so at
least one numeric rating or a non-blank comment is required.
---
Duplicate comments:
In `@src/synthorg/hr/evaluation/governance_strategy.py`:
- Around line 29-35: The trust-score map _TRUST_LEVEL_SCORES currently omits the
"custom" key which causes legitimate TrustLevel.CUSTOM values to hit
EVAL_TRUST_LEVEL_UNKNOWN and use the neutral fallback; update the mapping in
governance_strategy.py to handle "custom" explicitly (or compute the score from
the resolved custom policy) and update any code-path that maps TrustLevel ->
score to use that branch instead of falling back to EVAL_TRUST_LEVEL_UNKNOWN;
reference symbols: _TRUST_LEVEL_SCORES, TrustLevel.CUSTOM,
EVAL_TRUST_LEVEL_UNKNOWN, and ensure you add a regression test that constructs a
TrustLevel.CUSTOM case and asserts the expected non-neutral score/path.
In `@src/synthorg/hr/evaluation/intelligence_strategy.py`:
- Around line 79-123: data_points is initialized from len(context.task_records)
so calibration-only runs get inflated confidence; change initialization to
data_points = 0 and only add len(context.task_records) when you append the
"ci_quality" tuple (i.e., inside the block where you call available.append for
"ci_quality") and keep adding len(records) for calibration_records as already
done; also add a regression test (e.g.,
test_calibration_only_confidence_tied_to_calibration_records) that creates
context with task_records present but ci_quality disabled or no
overall_quality_score and asserts returned data_points equals number of
calibration_records only.
In `@src/synthorg/hr/evaluation/models.py`:
- Around line 268-328: The current _validate_agent_id_consistency only compares
self.agent_id to self.snapshot.agent_id but does not reject task_records,
calibration_records, or feedback that belong to a different agent; update
_validate_agent_id_consistency in EvaluationContext to iterate over task_records
(TaskMetricRecord.agent_id), calibration_records
(LlmCalibrationRecord.agent_id), and feedback (InteractionFeedback.agent_id) and
raise a ValueError if any record.agent_id != self.agent_id (include which record
type and offending id in the message); keep the existing snapshot check and
return self at the end.
- Around line 372-374: The pillar_weights field must be hardened: update the
_validate_weights_match_scores validator (used by EvaluationReport and the
related scores validator) to reject weights < 0 or > 1, enforce that the sum of
all weights equals 1.0 within a small epsilon (e.g., 1e-6), and keep the
existing duplicate-name check; raise clear ValueError messages identifying the
offending pillar name or the total sum mismatch. Ensure the validator is applied
to the pillar_weights tuple[tuple[NotBlankStr, float], ...] field (and reused
for the other weights-validated field handled by _validate_weights_match_scores)
so all weight inputs are normalized and validated at the model boundary.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: fae58238-9e61-404e-8c41-b0e7e5860cd6
📒 Files selected for processing (5)
- src/synthorg/hr/evaluation/governance_strategy.py
- src/synthorg/hr/evaluation/intelligence_strategy.py
- src/synthorg/hr/evaluation/models.py
- tests/unit/hr/evaluation/test_evaluator.py
- tests/unit/hr/evaluation/test_governance_strategy.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Build Backend
- GitHub Check: Build Web
- GitHub Check: Build Sandbox
- GitHub Check: Test (Python 3.14)
- GitHub Check: Dependency Review
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: No `from __future__ import annotations` in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use `except A, B:` (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use `@computed_field` for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)
Files:
src/synthorg/hr/evaluation/governance_strategy.py
src/synthorg/hr/evaluation/models.py
tests/unit/hr/evaluation/test_evaluator.py
tests/unit/hr/evaluation/test_governance_strategy.py
src/synthorg/hr/evaluation/intelligence_strategy.py
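As an illustration of the `@computed_field` guideline above, a minimal sketch of the derived-total pattern; the field names are assumed from the `TokenUsage.total_tokens` example, not taken from the actual source:

```python
from pydantic import BaseModel, ConfigDict, computed_field


class TokenUsage(BaseModel):
    """Derive total_tokens on access instead of storing a redundant field."""

    model_config = ConfigDict(frozen=True, allow_inf_nan=False)

    prompt_tokens: int
    completion_tokens: int

    @computed_field
    @property
    def total_tokens(self) -> int:
        # Computed on access and included in model_dump()/serialization.
        return self.prompt_tokens + self.completion_tokens
```

Because the total is derived, there is no stored field that can drift out of sync with the two inputs.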
src/synthorg/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/synthorg/**/*.py: Every Python module with business logic must have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
Never use `import logging`/`logging.getLogger()`/`print()` in Python application code; exceptions are observability/setup.py, observability/sinks.py, observability/syslog_handler.py, and observability/http_handler.py
Python logger variable name must always be `logger` (not `_logger`, not `log`)
Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Use structured logging with kwargs in Python: always `logger.info(EVENT, key=value)` -- never `logger.info('msg %s', val)`
All error paths in Python must log at WARNING or ERROR with context before raising
All state transitions in Python must log at INFO level
Use DEBUG logging level in Python for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports in Python do NOT need logging
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned Python code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases
Files:
src/synthorg/hr/evaluation/governance_strategy.py
src/synthorg/hr/evaluation/models.py
src/synthorg/hr/evaluation/intelligence_strategy.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: All Python test files must use `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow` markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer `@pytest.mark.parametrize` for testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (`@given` + `@settings`); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions
Files:
tests/unit/hr/evaluation/test_evaluator.py
tests/unit/hr/evaluation/test_governance_strategy.py
🧠 Learnings (17)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/security/**/*.py : Security module includes SecOps agent, rule engine (soft-allow/hard-deny), audit log, output scanner, risk classifier, autonomy levels (4 strategies), timeout policies.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Use event constants from `synthorg.observability.events.<domain>` (e.g., `API_REQUEST_STARTED` from `events.api`); import directly and log with structured kwargs: `logger.info(EVENT, key=value)`, never interpolated strings
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from the domain-specific module under `synthorg.observability.events` in logging calls
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from `synthorg.observability.events.<domain>` modules (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly and use in structured logging
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : All error paths must log at WARNING or ERROR with context before raising.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions.
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, and key function entry/exit
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/security/**/*.py : Security package (security/): SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume)
Applied to files:
src/synthorg/hr/evaluation/governance_strategy.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (via `model_copy(update=...)`) for runtime state that evolves
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/core/**/*.py : Core module must contain shared domain models, base classes, resilience config (RetryConfig, RateLimiterConfig)
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Fix all valid issues found by review agents (including pre-existing issues in surrounding code, suggestions, and adjacent findings) — never skip or defer
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to tests/**/*.py : Fix flaky tests completely and fundamentally; for timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins
Applied to files:
tests/unit/hr/evaluation/test_evaluator.py
Add model_validator requiring at least one rating or non-blank free_text in InteractionFeedback. Prevents empty feedback records with no signal from being stored. Add tests for empty feedback rejection and free-text-only feedback acceptance.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/unit/hr/evaluation/test_models.py`:
- Around line 329-339: Remove the redundant in-function import of
EvaluationContext inside test_agent_id_mismatch_raises; instead import
EvaluationContext at the module level with the other model imports so the test
uses the top-level import. Specifically, delete the local "from
synthorg.hr.evaluation.models import EvaluationContext" inside
test_agent_id_mismatch_raises and add EvaluationContext to the existing model
imports at the top of the test file (where other model classes are imported).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d5b307c2-1f81-440c-bbde-adeeb2c836d2
📒 Files selected for processing (2)
src/synthorg/hr/evaluation/models.py
tests/unit/hr/evaluation/test_models.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Test (Python 3.14)
- GitHub Check: Build Backend
- GitHub Check: Build Web
- GitHub Check: Build Sandbox
- GitHub Check: Dependency Review
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: No `from __future__ import annotations` in Python code; Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use `except A, B:` (no parentheses) in Python 3.14; ruff enforces this
All public functions in Python must have type hints; mypy strict mode enforced
Use Google-style docstrings on public classes and functions in Python; enforced by ruff D rules
Create new objects and never mutate existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time
Use `@computed_field` for derived values instead of storing + validating redundant fields in Pydantic models (e.g. TokenUsage.total_tokens)
Use NotBlankStr from core.types for all identifier/name fields in Python (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in Python (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Python line length must not exceed 88 characters; enforced by ruff
Python functions must be under 50 lines; files must be under 800 lines
Handle errors explicitly in Python; never silently swallow exceptions
Validate at system boundaries in Python (user input, external APIs, config files)
Files:
tests/unit/hr/evaluation/test_models.py
src/synthorg/hr/evaluation/models.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: All Python test files must use `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow` markers
Python tests must maintain 80% minimum code coverage (enforced in CI)
Prefer `@pytest.mark.parametrize` for testing similar cases in Python
Use test-provider, test-small-001, etc. in Python tests instead of real vendor names
Property-based testing in Python uses Hypothesis (`@given` + `@settings`); profiles: ci (50 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var
Never skip, dismiss, or ignore flaky Python tests; always fix them fully and fundamentally; for timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins
For Python tasks that must block indefinitely until cancelled (e.g. simulating a slow provider or stubborn coroutine), use asyncio.Event().wait() instead of asyncio.sleep(large_number) -- it is cancellation-safe and carries no timing assumptions
Files:
tests/unit/hr/evaluation/test_models.py
src/synthorg/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/synthorg/**/*.py: Every Python module with business logic must have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
Never use `import logging`/`logging.getLogger()`/`print()` in Python application code; exceptions are observability/setup.py, observability/sinks.py, observability/syslog_handler.py, and observability/http_handler.py
Python logger variable name must always be `logger` (not `_logger`, not `log`)
Use event name constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
Use structured logging with kwargs in Python: always `logger.info(EVENT, key=value)` -- never `logger.info('msg %s', val)`
All error paths in Python must log at WARNING or ERROR with context before raising
All state transitions in Python must log at INFO level
Use DEBUG logging level in Python for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports in Python do NOT need logging
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned Python code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases
Files:
src/synthorg/hr/evaluation/models.py
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (via `model_copy(update=...)`) for runtime state that evolves
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/hr/**/*.py : HR package (hr/): hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, LLM calibration, collaboration overrides, trend detection), promotion/demotion (criteria evaluation, approval strategies, model mapping)
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 conventions: `BaseModel`, `model_validator`, `computed_field`, `ConfigDict`. For derived values use `computed_field` instead of storing + validating redundant fields. Use `NotBlankStr` (from `core.types`) for all identifier/name fields — including optional (`NotBlankStr | None`) and tuple (`tuple[NotBlankStr, ...]`) variants — instead of manual whitespace validators.
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 BaseModel, model_validator, computed_field, ConfigDict.
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T18:42:17.990Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:42:17.990Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 conventions: `BaseModel`, `model_validator`, `computed_field`, `ConfigDict`
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/core/**/*.py : Core module must contain shared domain models, base classes, resilience config (RetryConfig, RateLimiterConfig)
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence).
Applied to files:
src/synthorg/hr/evaluation/models.py
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Fix all valid issues found by review agents (including pre-existing issues in surrounding code, suggestions, and adjacent findings) — never skip or defer
Applied to files:
src/synthorg/hr/evaluation/models.py
🔇 Additional comments (13)
src/synthorg/hr/evaluation/models.py (7)
338-347: Add validation for agent-scoped record collections. The current validator only checks
`agent_id == snapshot.agent_id`, but does not validate that `task_records`, `calibration_records`, and `feedback` entries all belong to the same agent. While the `_build_context` method in `evaluator.py` fetches data using a consistent `agent_id`, the model itself does not enforce this invariant, allowing callers (e.g., tests or future code paths) to construct inconsistent contexts.
421-436: Weight bounds and normalization still not validated. The duplicate weight names check was added (lines 424-427), but
`pillar_weights` still accepts negative weights and weights that don't sum to 1.0. While the `_build_report` method uses `redistribute_weights`, which guarantees proper bounds and normalization, the model itself permits invalid states.
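A minimal sketch of the validation this comment is asking for, assuming `pillar_weights` is a plain `dict[str, float]` field; the model is reduced to that one field here, so this is a shape illustration rather than the actual `EvaluationReport`:

```python
from pydantic import BaseModel, ConfigDict, model_validator


class EvaluationReport(BaseModel):
    """Sketch: reject negative weights and weights not summing to 1.0."""

    model_config = ConfigDict(frozen=True, allow_inf_nan=False)

    pillar_weights: dict[str, float]

    @model_validator(mode="after")
    def _validate_weight_bounds(self) -> "EvaluationReport":
        if any(w < 0.0 for w in self.pillar_weights.values()):
            raise ValueError("pillar weights must be non-negative")
        total = sum(self.pillar_weights.values())
        # Tolerate float rounding while still catching real normalization bugs.
        if self.pillar_weights and abs(total - 1.0) > 1e-6:
            raise ValueError("pillar weights must sum to 1.0")
        return self
```

With this in place, the model rejects invalid states at the boundary even when callers bypass `redistribute_weights`.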
1-31: LGTM: Module setup and imports are correct. The module docstring is clear, imports are appropriate, and the pattern of using
`ConfigDict(frozen=True, allow_inf_nan=False)` aligns with coding guidelines for frozen Pydantic models. The `# noqa: TC003` and `# noqa: TC001` comments appropriately suppress type-checking-only import warnings for runtime-required types.
33-62: LGTM: `redistribute_weights` utility is well-designed. The function correctly handles:
- Filtering disabled items
- Proportional redistribution
- Zero-weight fallback to equal distribution
- Error case when all items are disabled or input is empty
The docstring is complete with Args, Returns, and Raises sections.
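Based only on the behaviors listed above, the utility's shape can be sketched as follows; the actual signature in `models.py` may differ (taking a weights mapping plus an enabled-name set is an assumption):

```python
from collections.abc import Mapping


def redistribute_weights(
    weights: Mapping[str, float], enabled: set[str]
) -> dict[str, float]:
    """Drop disabled items and renormalize remaining weights to sum to 1.0."""
    kept = {name: w for name, w in weights.items() if name in enabled}
    if not kept:
        # All items disabled (or empty input): no valid distribution exists.
        raise ValueError("at least one weighted item must be enabled")
    total = sum(kept.values())
    if total == 0.0:
        # Every enabled item carries zero weight: fall back to equal shares.
        return {name: 1.0 / len(kept) for name in kept}
    # Proportional redistribution preserves the relative weight ratios.
    return {name: w / total for name, w in kept.items()}
```

For example, disabling `c` in `{"a": 0.5, "b": 0.3, "c": 0.2}` renormalizes to `{"a": 0.625, "b": 0.375}`, preserving the 5:3 ratio.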
140-157: LGTM: Empty feedback rejection properly implemented. The
`_validate_has_signal` validator correctly ensures at least one rating or non-blank `free_text` is present, addressing the previous review feedback about rejecting feedback records with no usable signal.
160-220: LGTM: `ResilienceMetrics` has comprehensive cross-field validation. The validator correctly enforces all relational invariants:
- failed_tasks <= total_tasks
- recovered_tasks <= failed_tasks
- longest_success_streak >= current_success_streak
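These invariants can be sketched with an `after`-mode `model_validator`; the field defaults are assumptions, and only the relational checks mirror what the review describes:

```python
from pydantic import BaseModel, ConfigDict, model_validator


class ResilienceMetrics(BaseModel):
    """Sketch of the relational invariants between task counters."""

    model_config = ConfigDict(frozen=True, allow_inf_nan=False)

    total_tasks: int = 0
    failed_tasks: int = 0
    recovered_tasks: int = 0
    current_success_streak: int = 0
    longest_success_streak: int = 0

    @model_validator(mode="after")
    def _validate_relations(self) -> "ResilienceMetrics":
        # Failures are a subset of all tasks.
        if self.failed_tasks > self.total_tasks:
            raise ValueError("failed_tasks cannot exceed total_tasks")
        # Only failed tasks can be recovered.
        if self.recovered_tasks > self.failed_tasks:
            raise ValueError("recovered_tasks cannot exceed failed_tasks")
        # The running streak can never be longer than the best streak seen.
        if self.current_success_streak > self.longest_success_streak:
            raise ValueError(
                "current_success_streak cannot exceed longest_success_streak"
            )
        return self
```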
223-251: LGTM: `PillarScore` model is correctly constrained. The score (0.0-10.0) and confidence (0.0-1.0) bounds are properly enforced. The
`breakdown` field appropriately stores component scores without rigid bounds, since these are informational and may have varying scales depending on the strategy.
tests/unit/hr/evaluation/test_models.py (6)
1-20: LGTM: Test file setup is correct. The
`pytestmark = pytest.mark.unit` marker properly marks all tests, and imports are appropriate for testing Pydantic model validation behavior.
25-73: LGTM: Comprehensive tests for `redistribute_weights`. The test suite covers all important cases:
- Proportional preservation
- Redistribution when items are disabled
- Error cases (all disabled, empty)
- Zero-weight equal distribution fallback
- Single enabled item
- Sum-to-one invariant
Good use of epsilon comparisons for float assertions.
78-198: LGTM: Thorough `InteractionFeedback` test coverage. The tests comprehensively cover:
- Valid construction with all/partial ratings
- Frozen immutability
- Parametrized bounds checking for all rating fields
- `free_text` max length
- Auto-generated unique IDs
- Empty feedback rejection
- Free-text-only acceptance
Good use of `@pytest.mark.parametrize` to avoid test duplication.
203-271: LGTM: `ResilienceMetrics` tests cover all validation invariants. All cross-field validation rules are tested:
- failed_tasks > total_tasks rejection
- recovered_tasks > failed_tasks rejection
- current_success_streak > longest_success_streak rejection
- Frozen immutability
276-321: LGTM: `PillarScore` tests verify bounds and structure. Good coverage of score/confidence bounds at boundary values (0.0, 10.0/1.0) and beyond, plus breakdown tuple structure verification.
345-482: LGTM: `EvaluationReport` tests cover key validation paths. The tests verify:
- Valid construction
- Duplicate pillar score rejection
- Unique ID generation
- Score/confidence bounds
- Frozen immutability
- Agent ID consistency
- Weight/score name mismatch
Good coverage of the model's validators.
🤖 I have created a release *beep* *boop*

---

## [0.5.8](v0.5.7...v0.5.8) (2026-04-03)

### Features

* auto-select embedding model + fine-tuning pipeline wiring ([#999](#999)) ([a4cbc4e](a4cbc4e)), closes [#965](#965) [#966](#966)
* ceremony scheduling batch 3 -- milestone strategy, template defaults, department overrides ([#1019](#1019)) ([321d245](321d245))
* five-pillar evaluation framework for HR performance tracking ([#1017](#1017)) ([5e66cbd](5e66cbd)), closes [#699](#699)
* populate comparison page with 53 competitor entries ([#1000](#1000)) ([5cb232d](5cb232d)), closes [#993](#993)
* throughput-adaptive and external-trigger ceremony scheduling strategies ([#1003](#1003)) ([bb5c9a4](bb5c9a4)), closes [#973](#973) [#974](#974)

### Bug Fixes

* eliminate backup service I/O from API test lifecycle ([#1015](#1015)) ([08d9183](08d9183))
* update run_affected_tests.py to use -n 8 ([#1014](#1014)) ([3ee9fa7](3ee9fa7))

### Performance

* reduce pytest parallelism from -n auto to -n 8 ([#1013](#1013)) ([43e0707](43e0707))

### CI/CD

* bump docker/login-action from 4.0.0 to 4.1.0 in the all group ([#1027](#1027)) ([e7e28ec](e7e28ec))
* bump wrangler from 4.79.0 to 4.80.0 in /.github in the all group ([#1023](#1023)) ([1322a0d](1322a0d))

### Maintenance

* bump github.com/mattn/go-runewidth from 0.0.21 to 0.0.22 in /cli in the all group ([#1024](#1024)) ([b311694](b311694))
* bump https://github.com/astral-sh/ruff-pre-commit from v0.15.8 to 0.15.9 in the all group ([#1022](#1022)) ([1650087](1650087))
* bump node from `71be405` to `387eebd` in /docker/sandbox in the all group ([#1021](#1021)) ([40bd2f6](40bd2f6))
* bump node from `cf38e1f` to `ad82eca` in /docker/web in the all group ([#1020](#1020)) ([f05ab9f](f05ab9f))
* bump the all group in /web with 3 updates ([#1025](#1025)) ([21d40d3](21d40d3))
* bump the all group with 2 updates ([#1026](#1026)) ([36778de](36778de))
* enable additional eslint-react rules and fix violations ([#1028](#1028)) ([80423be](80423be))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Summary
Implements a structured five-pillar agent evaluation framework based on the InfoQ evaluation framework, with fully pluggable pillars and metrics that can be independently enabled/disabled via `EvaluationConfig`.

Pillars

- Intelligence/Accuracy: `QualityBlendIntelligenceStrategy` (70% CI / 30% LLM calibration), fed by `QualityScoreResult` and `LlmCalibrationRecord` data
- Performance/Efficiency: inline computation over `WindowMetrics` (40% cost, 30% time, 30% tokens), fed by `WindowMetrics` averages
- Reliability/Resilience: `TaskBasedResilienceStrategy` (success rate, recovery, consistency, streaks), fed by `TaskMetricRecord` sequences
- Responsibility/Governance: `AuditBasedGovernanceStrategy` (audit compliance, trust, autonomy)
- User Experience: `FeedbackBasedUxStrategy` (clarity, tone, helpfulness, trust, satisfaction), fed by `InteractionFeedback` records

Key design decisions (D24)

- Single `PillarScoringStrategy` protocol with an `EvaluationContext` bag
- Per-pillar and per-metric toggles with weight redistribution (`redistribute_weights` utility)
- `EvaluationService.evaluate()` called on demand
- `LlmCalibrationSampler` -- drift above threshold reduces intelligence pillar confidence

New files

- `src/synthorg/hr/evaluation/` (10 source files)
- `src/synthorg/observability/events/evaluation.py`
- `tests/unit/hr/evaluation/` (10 test files, 126 tests)

Also

- `docs/design/agents.md` + `docs/architecture/decisions.md`

Test plan

- All tests marked `@pytest.mark.unit`

Review coverage
12 agents: code-reviewer, python-reviewer, test-analyzer, silent-failure-hunter, type-design-analyzer, logging-audit, conventions-enforcer, async-reviewer, issue-verifier, docs-consistency, resilience-audit, comment-analyzer. All 25 valid findings implemented.
Closes #699