
feat: collaboration scoring enhancements — LLM sampling and human override#477

Merged
Aureliolo merged 8 commits into main from feat/collaboration-scoring-enhancements on Mar 16, 2026

Conversation

@Aureliolo
Owner

Summary

  • LLM calibration sampling: New LlmCalibrationSampler service that probabilistically samples 1% (configurable) of collaboration interactions, sends them to an LLM for independent evaluation, and stores calibration records for drift analysis against the behavioral strategy. Opt-in via llm_sampling_model config.
  • Human override via API: New CollaborationOverrideStore and CollaborationController at /agents/{agent_id}/collaboration with GET/POST/DELETE endpoints for managing score overrides with optional expiration. Overrides take precedence over computed scores.
  • Protocol unchanged: CollaborationScoringStrategy protocol is untouched — both features are layered around it non-intrusively via PerformanceTracker.

New files

  • src/synthorg/hr/performance/llm_calibration_sampler.py — LLM sampling service
  • src/synthorg/hr/performance/collaboration_override_store.py — override store
  • src/synthorg/api/controllers/collaboration.py — API controller (5 endpoints)

Models

  • LlmCalibrationRecord — calibration sample with computed drift
  • CollaborationOverride — human override with optional expiration
  • CollaborationMetricRecord.interaction_summary — text context for LLM
  • CollaborationScoreResult.override_active — signals active override

Test plan

  • 13 override store tests (set/get/clear/list, expiration)
  • 16 LLM sampler tests (sampling, provider failure, drift, pruning, edge cases)
  • 10 tracker integration tests (override precedence, sampler wiring)
  • 11 API controller tests (all 5 endpoints, auth guards, validation)
  • All 8266 existing tests pass
  • 94.52% coverage (>80% required)
  • mypy strict, ruff clean, ruff format clean

Review coverage

Pre-reviewed by 8 agents: code-reviewer, conventions-enforcer, logging-audit, silent-failure-hunter, type-design-analyzer, test-analyzer, docs-consistency, issue-resolution-verifier. 17 findings addressed (4 critical, 5 major, 8 medium).

Closes #232

🤖 Generated with Claude Code

@github-actions
Contributor

github-actions bot commented Mar 15, 2026

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Scanned Files

None

@coderabbitai

coderabbitai bot commented Mar 15, 2026

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between 7a2f9b4 and ca9a861.

📒 Files selected for processing (22)
  • CLAUDE.md
  • docs/architecture/decisions.md
  • docs/design/agents.md
  • src/synthorg/api/app.py
  • src/synthorg/api/controllers/__init__.py
  • src/synthorg/api/controllers/collaboration.py
  • src/synthorg/api/state.py
  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/hr/performance/config.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • src/synthorg/hr/performance/models.py
  • src/synthorg/hr/performance/tracker.py
  • src/synthorg/observability/events/performance.py
  • tests/unit/api/controllers/test_collaboration.py
  • tests/unit/hr/performance/conftest.py
  • tests/unit/hr/performance/test_collaboration_override_store.py
  • tests/unit/hr/performance/test_llm_calibration_sampler.py
  • tests/unit/hr/performance/test_models.py
  • tests/unit/hr/performance/test_tracker_enhancements.py
  • web/src/api/client.ts
  • web/src/api/endpoints/collaboration.ts
  • web/src/api/types.ts

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Human-managed collaboration overrides and calibration API with client helpers and typed responses (score, override CRUD, calibration summary).
    • LLM-based calibration sampling with per-agent calibration records and drift metrics; in-memory override store and sampler integration.
  • Configuration

    • New settings for LLM sampling rate, sampling model, and calibration retention.
    • App accepts an optional performance-tracking component at startup.
  • Documentation

    • Updated observability and HR/performance docs covering calibration, overrides, and structured logging.
  • Tests

    • Comprehensive unit tests for sampler, overrides, tracker, controller, and models.

Walkthrough

Adds opt-in LLM calibration sampling and a human override API for collaboration scoring: new LlmCalibrationSampler, CollaborationOverrideStore, CollaborationController and DTOs, config and models, PerformanceTracker/app-state wiring, observability events, web client endpoints/types, and extensive unit tests.

Changes

  • API wiring & state — src/synthorg/api/app.py, src/synthorg/api/state.py
    Thread optional PerformanceTracker through create_app and AppState; add _performance_tracker slot and performance_tracker accessor that raises ServiceUnavailable if not configured.
  • API controllers — src/synthorg/api/controllers/__init__.py, src/synthorg/api/controllers/collaboration.py
    Add CollaborationController and register it; new endpoints at /agents/{agent_id}/collaboration (GET score, GET/POST/DELETE override, GET calibration) with DTOs, auth guards, and error handling.
  • HR performance models & config — src/synthorg/hr/performance/models.py, src/synthorg/hr/performance/config.py
    Add LlmCalibrationRecord and CollaborationOverride; add interaction_summary to CollaborationMetricRecord and override_active to CollaborationScoreResult; add config fields llm_sampling_rate, llm_sampling_model, calibration_retention_days.
  • Sampler & override store — src/synthorg/hr/performance/llm_calibration_sampler.py, src/synthorg/hr/performance/collaboration_override_store.py
    Implement LlmCalibrationSampler (sampling decision, LLM call, record storage/pruning, drift summary) and CollaborationOverrideStore (single active override per agent, expiration handling, event logging).
  • PerformanceTracker integration — src/synthorg/hr/performance/tracker.py
    Inject sampler and override_store into the PerformanceTracker constructor; add background sampling scheduling, _maybe_sample, override precedence in get_collaboration_score, and expose sampler/override_store properties.
  • Observability events — src/synthorg/observability/events/performance.py
    Add PERF_LLM_SAMPLE_* and PERF_OVERRIDE_* event constants.
  • Web client types & endpoints — web/src/api/types.ts, web/src/api/endpoints/collaboration.ts, web/src/api/client.ts
    Add collaboration-related TypeScript interfaces, typed client wrappers for score/override/calibration endpoints, and a new unwrapVoid helper for ApiResponse.
  • Docs & design — CLAUDE.md, docs/design/agents.md, docs/architecture/decisions.md
    Document the implemented LLM calibration sampler, collaboration override endpoint, and logging guidance; update decision D3 to reflect implemented mechanisms.
  • Tests & fixtures — tests/unit/api/controllers/test_collaboration.py, tests/unit/hr/performance/*, tests/unit/hr/performance/conftest.py
    Add comprehensive unit tests and fixtures for controller, override store, sampler, tracker integration, and new models (sampling behavior, override expiry, API error paths).

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as CollaborationController
    participant Tracker as PerformanceTracker
    participant Store as CollaborationOverrideStore
    participant Strategy as CollaborationScoringStrategy

    Client->>API: GET /agents/{id}/collaboration/score
    API->>Tracker: get_collaboration_score(agent_id)
    Tracker->>Store: get_active_override(agent_id)
    alt Override active
        Store-->>Tracker: CollaborationOverride (active)
        Tracker-->>API: CollaborationScoreResult (override_active=true)
    else No active override
        Tracker->>Strategy: evaluate collaboration metrics
        Strategy-->>Tracker: score, confidence, strategy_name
        Tracker-->>API: CollaborationScoreResult (override_active=false)
    end
    API-->>Client: 200 CollaborationScoreResult
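The override-precedence branch in the diagram above reduces to a simple control flow; the store and strategy objects here are stand-ins, only the branching mirrors the PR:

```python
def get_collaboration_score(agent_id: str, store, strategy) -> dict:
    """Illustrative sketch: an active human override wins over the computed score."""
    override = store.get_active_override(agent_id)
    if override is not None:
        # Override active: return it directly, flagged via override_active
        return {"score": override["score"], "override_active": True}
    # No active override: delegate to the behavioral scoring strategy
    return {"score": strategy.evaluate(agent_id), "override_active": False}
```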
sequenceDiagram
    participant App as Application
    participant Tracker as PerformanceTracker
    participant Sampler as LlmCalibrationSampler
    participant Provider as CompletionProvider
    participant Storage as InMemoryRecords

    App->>Tracker: record_collaboration_event(metric_record)
    Tracker->>Tracker: compute behavioral_score
    Tracker->>Sampler: should_sample()
    alt Sampled
        Sampler-->>Tracker: true
        Tracker->>Sampler: sample(record, behavioral_score)
        Sampler->>Provider: completion(system_prompt, model)
        Provider-->>Sampler: response (llm_score, rationale, cost)
        Sampler->>Storage: store LlmCalibrationRecord
        Sampler-->>Tracker: LlmCalibrationRecord
        Tracker->>Tracker: log PERF_LLM_SAMPLE_COMPLETED
    else Not sampled / error
        Tracker-->>App: record stored (no calibration) / log PERF_LLM_SAMPLE_FAILED
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

  • #232: feat: implement collaboration scoring enhancements — LLM sampling and human override — This PR implements the sampler, override store, controller endpoints, and tracker wiring described by the issue.

Possibly related PRs

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check — Passed: The title directly reflects the main changes: LLM sampling and human override for collaboration scoring are the primary features added in this PR.
  • Description check — Passed: The description comprehensively explains the main features (LLM calibration sampling, human override via API), lists new files and models, and provides detailed test coverage and quality metrics.
  • Linked Issues check — Passed: The PR fully addresses issue #232: it implements periodic LLM sampling (1%, opt-in via llm_sampling_model config) with LlmCalibrationSampler, adds a human override mechanism via CollaborationOverrideStore and CollaborationController at /agents/{agent_id}/collaboration endpoints, and integrates both non-intrusively via PerformanceTracker without changing the CollaborationScoringStrategy protocol.
  • Out of Scope Changes check — Passed: All changes relate directly to issue #232's objectives: LLM sampling implementation, human override mechanism, integration with performance tracking, and supporting model/API additions. No unrelated scope creep detected.
  • Docstring Coverage — Passed: Docstring coverage is 88.15%, which exceeds the required 80.00% threshold.



@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the collaboration scoring system by introducing two key features: LLM-based calibration sampling and a human override mechanism accessible via a new API. These additions provide greater flexibility and accuracy in evaluating agent collaboration, allowing for continuous monitoring of the scoring model's performance and manual adjustments when necessary, all while maintaining the integrity of the existing scoring protocol.

Highlights

  • LLM Calibration Sampling: Introduced a new LlmCalibrationSampler service that probabilistically samples 1% (configurable) of collaboration interactions. These samples are sent to an LLM for independent evaluation, and the results are stored as calibration records to analyze drift against the behavioral scoring strategy. This feature is opt-in via the llm_sampling_model configuration.
  • Human Override via API: Implemented a CollaborationOverrideStore and a CollaborationController with GET/POST/DELETE API endpoints at /agents/{agent_id}/collaboration. This allows for human-initiated score overrides, which take precedence over computed scores and can be set with an optional expiration.
  • Non-Intrusive Integration: Both the LLM sampling and human override features are layered around the existing CollaborationScoringStrategy protocol via the PerformanceTracker, ensuring the core protocol remains untouched.
  • New Models and Configuration: Added new data models (LlmCalibrationRecord, CollaborationOverride) and enhanced CollaborationMetricRecord with an interaction_summary field. New configuration options (llm_sampling_rate, llm_sampling_model, calibration_retention_days) were added to PerformanceConfig.
  • Comprehensive Testing: New unit tests cover the override store (13 tests), LLM sampler (16 tests), tracker integration (10 tests), and API controller (11 tests), ensuring robust functionality and integration.
Changelog
  • CLAUDE.md
    • Updated directory descriptions for api/ to include 'collaboration endpoint' and for hr/ to include 'LLM calibration sampling, collaboration overrides'.
  • docs/design/agents.md
    • Updated the description of collaboration scoring to explicitly mention the implementation of LLM sampling via LlmCalibrationSampler and human override via CollaborationOverrideStore and CollaborationController.
  • src/synthorg/api/app.py
    • Imported PerformanceTracker.
    • Added performance_tracker as an optional parameter to the create_app function.
    • Passed performance_tracker to the AppState initialization.
  • src/synthorg/api/controllers/__init__.py
    • Imported CollaborationController.
    • Added CollaborationController to the list of exported controllers.
  • src/synthorg/api/controllers/collaboration.py
    • Added new file implementing CollaborationController.
    • Defined API endpoints for /agents/{agent_id}/collaboration/score (GET), /override (GET, POST, DELETE), and /calibration (GET).
    • Implemented request/response DTOs (SetOverrideRequest, OverrideResponse, CalibrationSummaryResponse) for collaboration score management.
  • src/synthorg/api/state.py
    • Imported PerformanceTracker.
    • Added _performance_tracker to the AppState class attributes.
    • Included performance_tracker in the AppState constructor.
    • Added a property performance_tracker to AppState to retrieve the performance tracking service.
  • src/synthorg/hr/performance/collaboration_override_store.py
    • Added new file implementing CollaborationOverrideStore.
    • Provided methods for setting, getting, clearing, and listing human collaboration score overrides, including expiration handling.
  • src/synthorg/hr/performance/config.py
    • Added llm_sampling_rate field to configure the fraction of collaboration events sampled by LLM.
    • Added llm_sampling_model field to specify the model ID for LLM calibration sampling.
    • Added calibration_retention_days field to define the retention period for LLM calibration records.
  • src/synthorg/hr/performance/llm_calibration_sampler.py
    • Added new file implementing LlmCalibrationSampler.
    • Provided functionality for probabilistic LLM evaluation of collaboration interactions, storing calibration records, and computing drift summaries.
    • Defined a system prompt and completion configuration for LLM calls.
  • src/synthorg/hr/performance/models.py
    • Added interaction_summary field to CollaborationMetricRecord for LLM calibration context.
    • Added override_active field to CollaborationScoreResult to indicate if a human override is active.
    • Defined new LlmCalibrationRecord model for storing LLM calibration sample data, including LLM score, behavioral score, drift, rationale, model used, and cost.
    • Defined new CollaborationOverride model for human-applied score overrides, including agent ID, score, reason, applied by, and expiration.
  • src/synthorg/hr/performance/tracker.py
    • Imported CollaborationOverrideStore and LlmCalibrationSampler.
    • Modified the PerformanceTracker constructor to accept optional sampler and override_store instances.
    • Added _maybe_sample asynchronous method to probabilistically invoke the LLM sampler during collaboration event recording.
    • Updated record_collaboration_event to call _maybe_sample.
    • Modified get_collaboration_score to check for and apply active human overrides before delegating to the behavioral strategy.
    • Removed collab_records from get_snapshot as collaboration score is now retrieved via get_collaboration_score.
    • Added properties override_store and sampler to PerformanceTracker.
  • src/synthorg/observability/events/performance.py
    • Added new event constants for LLM calibration sampling: PERF_LLM_SAMPLE_STARTED, PERF_LLM_SAMPLE_COMPLETED, PERF_LLM_SAMPLE_FAILED.
    • Added new event constants for collaboration score overrides: PERF_OVERRIDE_SET, PERF_OVERRIDE_CLEARED, PERF_OVERRIDE_APPLIED, PERF_OVERRIDE_EXPIRED.
  • tests/unit/api/controllers/test_collaboration.py
    • Added new file containing unit tests for the CollaborationController.
    • Tested GET /score for neutral scores and active overrides.
    • Tested GET /override for active overrides and 404 for non-existent ones.
    • Tested POST /override for setting overrides with and without expiration, and access control.
    • Tested DELETE /override for clearing overrides and 404 for non-existent ones.
    • Tested GET /calibration for returning empty data when no sampler is configured.
  • tests/unit/hr/performance/conftest.py
    • Added interaction_summary parameter to make_collab_metric helper function.
    • Added make_calibration_record helper function for creating LlmCalibrationRecord instances.
    • Added make_collaboration_override helper function for creating CollaborationOverride instances.
  • tests/unit/hr/performance/test_collaboration_override_store.py
    • Added new file containing unit tests for CollaborationOverrideStore.
    • Tested setting, replacing, and retrieving overrides for single and multiple agents.
    • Tested active override retrieval with various expiration scenarios (expired, not expired, no expiration).
    • Tested clearing existing and non-existent overrides.
    • Tested listing overrides, including filtering by expiration.
  • tests/unit/hr/performance/test_llm_calibration_sampler.py
    • Added new file containing unit tests for LlmCalibrationSampler.
    • Tested probabilistic sampling logic (should_sample).
    • Tested successful LLM sampling, skipping records without summaries, and handling provider failures or malformed LLM responses.
    • Verified drift calculation and record storage.
    • Tested filtering calibration records by agent and time, and average drift computation.
    • Tested retention pruning of old calibration records.
  • tests/unit/hr/performance/test_tracker_enhancements.py
    • Added new file containing unit tests for PerformanceTracker's new enhancements.
    • Tested that active human overrides take precedence over computed scores in get_collaboration_score and get_snapshot.
    • Verified that expired overrides fall through to the behavioral strategy.
    • Tested the invocation of the LLM sampler during record_collaboration_event under various conditions (summary present, sampling rate, sampler failure).
Activity
  • The pull request was pre-reviewed by 8 automated agents (code-reviewer, conventions-enforcer, logging-audit, silent-failure-hunter, type-design-analyzer, test-analyzer, docs-consistency, issue-resolution-verifier).
  • 17 findings were identified and addressed during the pre-review process (4 critical, 5 major, 8 medium).
  • All 8266 existing tests passed, and new tests were added for the introduced features.
  • Code coverage is reported at 94.52%, exceeding the 80% requirement.
  • The code adheres to mypy strict, ruff clean, and ruff format clean standards.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant and valuable enhancements to collaboration scoring, adding LLM-based calibration sampling and a mechanism for human overrides via a new API. The implementation is well-structured, with new services and endpoints that are cleanly integrated into the existing system. The addition of comprehensive tests for the new functionality is also a great strength. My review has identified a few critical syntax errors in exception handling that will prevent the code from running in a Python 3 environment, and a minor opportunity for refactoring in the new API controller to improve maintainability. Once these issues are addressed, this will be a solid feature addition.

Comment on lines +131 to +132
except MemoryError, RecursionError:
raise
critical

This except syntax is for Python 2. In Python 3, which this project uses, this will raise a SyntaxError. To catch multiple exceptions, they should be enclosed in a tuple.

Suggested change
except MemoryError, RecursionError:
raise
except (MemoryError, RecursionError):
raise

Comment on lines +448 to +449
except MemoryError, RecursionError:
raise

critical

This except syntax is for Python 2 and will cause a SyntaxError in Python 3. To catch multiple exceptions, they should be enclosed in a tuple.

        except (MemoryError, RecursionError):
            raise

Comment on lines +465 to +466
except MemoryError, RecursionError:
raise

critical

This except block uses Python 2 syntax for catching multiple exceptions. In Python 3, this will raise a SyntaxError. The exceptions should be enclosed in a tuple.

Suggested change
except MemoryError, RecursionError:
raise
except (MemoryError, RecursionError):
raise

Comment on lines +152 to +160
store = tracker.override_store
if store is None:
logger.warning(
API_REQUEST_ERROR,
path="collaboration/override",
reason="override_store_not_configured",
)
msg = "Override store not configured"
raise ServiceUnavailableError(msg)

medium

The logic to retrieve the override_store and handle the case where it's not configured is repeated across get_override, set_override, and clear_override. To improve maintainability and reduce code duplication, consider extracting this logic into a private helper method within the CollaborationController class.

For example, you could create a _get_override_store method that encapsulates this check and returns the store or raises ServiceUnavailableError.

@codecov

codecov bot commented Mar 15, 2026

Codecov Report

❌ Patch coverage is 97.76358% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.88%. Comparing base (42242b5) to head (ca9a861).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/synthorg/api/controllers/collaboration.py 95.65% 3 Missing and 1 partial ⚠️
src/synthorg/hr/performance/tracker.py 95.34% 2 Missing ⚠️
...synthorg/hr/performance/llm_calibration_sampler.py 98.91% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #477      +/-   ##
==========================================
+ Coverage   93.82%   93.88%   +0.06%     
==========================================
  Files         476      479       +3     
  Lines       22667    22977     +310     
  Branches     2186     2213      +27     
==========================================
+ Hits        21268    21573     +305     
- Misses       1091     1095       +4     
- Partials      308      309       +1     


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/synthorg/api/controllers/collaboration.py`:
- Around line 217-237: The code currently substitutes "unknown" for applied_by
when request.scope.get("user") is not an AuthenticatedUser and proceeds to
persist a CollaborationOverride; instead, change the flow in the request handler
so that if auth_user is not an instance of AuthenticatedUser you log the
API_REQUEST_ERROR (with path, reason="user_identity_extraction_failed",
agent_id) and reject the request (return an HTTP 4xx or raise an appropriate
HTTPException) rather than creating a CollaborationOverride with
applied_by="unknown"; ensure this check is performed before constructing
CollaborationOverride and reference the existing symbols request.scope,
AuthenticatedUser, applied_by, CollaborationOverride and API_REQUEST_ERROR when
locating the code to modify.
- Around line 163-165: Before raising NotFoundError in the handler where
"override is None" (and the similar branch around lines 283-285), emit a
structured log with level WARNING or ERROR using the API_REQUEST_ERROR event,
including the agent_id and a specific reason string (e.g.,
"no_active_override_for_agent") so the missing override is visible in structured
logs; do this immediately before raising NotFoundError and use the same logging
pattern/logger used elsewhere in collaboration.py to keep context consistent.

In `@src/synthorg/hr/performance/llm_calibration_sampler.py`:
- Around line 252-254: The JSON parsing of the LLM response
(json.loads(response.content) and subsequent access to parsed["score"] and
parsed["rationale"]) should be wrapped in its own try/except so that on
JSONDecodeError or KeyError you log the raw response.content (raw_content) along
with the exception (exc_info) — mirroring the out-of-range logging at line 264 —
before re-raising or handling the error; update the block around parsed, score,
and rationale to catch parse/key errors, log raw_content and the exception, and
then propagate or handle the failure.

In `@src/synthorg/hr/performance/tracker.py`:
- Around line 209-225: get_collaboration_score currently computes a snapshot
using the caller-supplied reference time (via get_snapshot(now=...)) but still
calls _override_store.get_active_override(agent_id) with the real current time,
causing mismatches for historical/replayed snapshots; modify
get_collaboration_score (and the other override lookup at the later site around
lines 281-283) to pass the same reference time used for get_snapshot into
get_active_override (e.g., get_active_override(agent_id, now=reference_time)) so
the override lookup is evaluated against the same snapshot timestamp; update any
calls to _override_store.get_active_override in this module to accept and
forward the snapshot reference time consistently.

In `@tests/unit/api/controllers/test_collaboration.py`:
- Around line 256-271: Add a complementary unit test to TestGetCalibration that
sets up a sampler with calibration records and asserts the GET
/api/v1/agents/agent-001/collaboration/calibration endpoint returns record_count
> 0 and a computed average_drift; create a new test (e.g.,
test_returns_calibration_when_sampler_configured) that arranges a sampler and
inserts known calibration entries into the test datastore or fixture, then call
collab_client.get(...) for the same path, assert status_code == 200, assert
body["data"]["record_count"] equals the number of inserted records and
body["data"]["average_drift"] equals the expected average (or within tolerance)
to validate the endpoint's aggregation logic.

In `@tests/unit/hr/performance/test_collaboration_override_store.py`:
- Around line 143-152: The test uses a fixed NOW (NOW = 2026-03-15) to set the
override expiry which will become expired in future runs; update
test_default_now_uses_current_time to set the override expiry relative to the
runtime clock (e.g., use datetime.now(timezone.utc) or the same clock helper
used by CollaborationOverrideStore) when calling _make_override so the stored
override expires now+365 days, or alternatively freeze/mock the module clock
used by get_active_override; make the change around the call to
store.set_override(...) and references to _make_override, get_active_override,
and the NOW constant so the test remains stable.
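One way to apply the sampler-parsing fix described in the comment above is to isolate the JSON decode so malformed LLM output is logged with the raw content before the error propagates; the function name and log message here are illustrative, not the PR's code:

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_llm_response(raw_content: str) -> tuple[float, str]:
    """Parse an LLM calibration reply, logging the raw payload on failure."""
    try:
        parsed = json.loads(raw_content)
        return float(parsed["score"]), str(parsed["rationale"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Log the raw response alongside the exception before re-raising,
        # so unparseable LLM output is visible in structured logs.
        logger.error("LLM calibration response unparseable",
                     extra={"raw_content": raw_content}, exc_info=True)
        raise
```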
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 681236a6-2f47-4c3f-b0cd-0740db18145d

📥 Commits

Reviewing files that changed from the base of the PR and between 30e96d5 and e25c56f.

📒 Files selected for processing (17)
  • CLAUDE.md
  • docs/design/agents.md
  • src/synthorg/api/app.py
  • src/synthorg/api/controllers/__init__.py
  • src/synthorg/api/controllers/collaboration.py
  • src/synthorg/api/state.py
  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/hr/performance/config.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • src/synthorg/hr/performance/models.py
  • src/synthorg/hr/performance/tracker.py
  • src/synthorg/observability/events/performance.py
  • tests/unit/api/controllers/test_collaboration.py
  • tests/unit/hr/performance/conftest.py
  • tests/unit/hr/performance/test_collaboration_override_store.py
  • tests/unit/hr/performance/test_llm_calibration_sampler.py
  • tests/unit/hr/performance/test_tracker_enhancements.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Build Sandbox
  • GitHub Check: Build Backend
  • GitHub Check: Build Web
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (9)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: No from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use except A, B: (no parentheses) — ruff enforces this on Python 3.14
Type hints required on all public functions; enforce with mypy strict mode
Google style docstrings required on public classes and functions; enforced by ruff D rules
Create new objects instead of mutating existing ones; use copy.deepcopy() at construction and MappingProxyType wrapping for read-only enforcement in non-Pydantic internal collections
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (e.g., multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Keep functions under 50 lines and files under 800 lines
Handle errors explicitly, never silently swallow exceptions
Validate at system boundaries (user input, external APIs, config files)
Line length: 88 characters (ruff); enforced by linter

Files:

  • tests/unit/api/controllers/test_collaboration.py
  • tests/unit/hr/performance/test_llm_calibration_sampler.py
  • src/synthorg/api/controllers/__init__.py
  • src/synthorg/observability/events/performance.py
  • src/synthorg/api/app.py
  • src/synthorg/api/controllers/collaboration.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • tests/unit/hr/performance/test_tracker_enhancements.py
  • src/synthorg/hr/performance/tracker.py
  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/api/state.py
  • src/synthorg/hr/performance/config.py
  • tests/unit/hr/performance/conftest.py
  • tests/unit/hr/performance/test_collaboration_override_store.py
  • src/synthorg/hr/performance/models.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow markers for test categorization
Maintain 80% minimum test coverage; enforced in CI
Use asyncio_mode = 'auto' in pytest config — no manual @pytest.mark.asyncio needed on individual tests
Set 30-second timeout per test
Always include -n auto when running pytest with pytest-xdist; never run tests sequentially
Prefer @pytest.mark.parametrize for testing similar cases
Use test-provider, test-small-001, etc. in test files
Use Hypothesis property-based testing with @given and @settings; control profile via HYPOTHESIS_PROFILE env var (ci: 200 examples, dev: 1000 examples)
Never skip, dismiss, or ignore flaky tests — always fix them fully and fundamentally; mock time.monotonic() and asyncio.sleep() for timing-sensitive tests

Files:

  • tests/unit/api/controllers/test_collaboration.py
  • tests/unit/hr/performance/test_llm_calibration_sampler.py
  • tests/unit/hr/performance/test_tracker_enhancements.py
  • tests/unit/hr/performance/conftest.py
  • tests/unit/hr/performance/test_collaboration_override_store.py
{src/synthorg/**/*.py,tests/**/*.py,web/src/**/*.{ts,tsx,vue},cli/**/*.go,**/*.md,**/*.yml,**/*.yaml,**/*.json,pyproject.toml}

📄 CodeRabbit inference engine (CLAUDE.md)

Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples; use generic names: example-provider, example-large-001, example-medium-001, example-small-001, or size aliases

Files:

  • tests/unit/api/controllers/test_collaboration.py
  • docs/design/agents.md
  • tests/unit/hr/performance/test_llm_calibration_sampler.py
  • src/synthorg/api/controllers/__init__.py
  • src/synthorg/observability/events/performance.py
  • CLAUDE.md
  • src/synthorg/api/app.py
  • src/synthorg/api/controllers/collaboration.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • tests/unit/hr/performance/test_tracker_enhancements.py
  • src/synthorg/hr/performance/tracker.py
  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/api/state.py
  • src/synthorg/hr/performance/config.py
  • tests/unit/hr/performance/conftest.py
  • tests/unit/hr/performance/test_collaboration_override_store.py
  • src/synthorg/hr/performance/models.py
docs/**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

Markdown documentation files should follow the documented structure: docs/design/, docs/architecture/, docs/roadmap/, docs/security.md, docs/licensing.md, docs/reference/, docs/rest-api.md, docs/api/

Files:

  • docs/design/agents.md
docs/design/**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

Design specification pages in docs/design/ must be consulted before implementing features (7 pages: index, agents, organization, communication, engine, memory, operations)

Files:

  • docs/design/agents.md
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/synthorg/**/*.py: Use frozen Pydantic models for config/identity; use separate mutable-via-copy models with model_copy(update=...) for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict)
Use @computed_field for derived values instead of storing and validating redundant fields (e.g., TokenUsage.total_tokens)
Use NotBlankStr (from core.types) for all identifier/name fields — including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants — instead of manual whitespace validators
Every module with business logic must have: from synthorg.observability import get_logger then logger = get_logger(__name__)
Never use import logging / logging.getLogger() / print() in application code; use synthorg's get_logger() instead
Always use logger as the variable name, not _logger or log
Always use event name constants from the domain-specific module under synthorg.observability.events in logging calls
Log with structured kwargs: always logger.info(EVENT, key=value) — never logger.info('msg %s', val)
All error paths must log at WARNING or ERROR with context before raising
All state transitions must log at INFO level
Log at DEBUG level for object creation, internal flow, entry/exit of key functions
Use Google-style docstrings in all public Python code for mkdocstrings auto-generation

Files:

  • src/synthorg/api/controllers/__init__.py
  • src/synthorg/observability/events/performance.py
  • src/synthorg/api/app.py
  • src/synthorg/api/controllers/collaboration.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • src/synthorg/hr/performance/tracker.py
  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/api/state.py
  • src/synthorg/hr/performance/config.py
  • src/synthorg/hr/performance/models.py
src/synthorg/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

All Litestar API routes must use RFC 9457 structured errors with ErrorCategory, ErrorCode, ErrorDetail

Files:

  • src/synthorg/api/controllers/__init__.py
  • src/synthorg/api/app.py
  • src/synthorg/api/controllers/collaboration.py
  • src/synthorg/api/state.py
src/synthorg/observability/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Observability must use structured logging with correlation tracking and log sinks

Files:

  • src/synthorg/observability/events/performance.py
src/synthorg/hr/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion

Files:

  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • src/synthorg/hr/performance/tracker.py
  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/hr/performance/config.py
  • src/synthorg/hr/performance/models.py
🧠 Learnings (23)
📚 Learning: 2026-03-15T11:48:14.867Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T11:48:14.867Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.

Applied to files:

  • src/synthorg/observability/events/performance.py
  • CLAUDE.md
  • src/synthorg/hr/performance/tracker.py
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from the domain-specific module under `synthorg.observability.events` in logging calls

Applied to files:

  • CLAUDE.md
  • src/synthorg/hr/performance/tracker.py
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/hr/**/*.py : HR engine must provide: hiring, firing, onboarding, offboarding, agent registry, performance tracking (task metrics, collaboration scoring, trend detection), promotion/demotion

Applied to files:

  • CLAUDE.md
  • src/synthorg/api/app.py
  • src/synthorg/hr/performance/tracker.py
  • src/synthorg/api/state.py
  • tests/unit/hr/performance/conftest.py
  • src/synthorg/hr/performance/models.py
📚 Learning: 2026-03-15T11:48:14.867Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T11:48:14.867Z
Learning: Applies to src/synthorg/**/*.py : All state transitions must log at INFO.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-15T22:42:34.900Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.900Z
Learning: Applies to src/synthorg/core/**/*.py : Core module must contain shared domain models, base classes, resilience config (RetryConfig, RateLimiterConfig)

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/api/**/*.py : All Litestar API routes must use RFC 9457 structured errors with ErrorCategory, ErrorCode, ErrorDetail

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/communication/**/*.py : Communication system must implement: message bus, dispatcher, messenger, channels, delegation, loop prevention, conflict resolution

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to docs/design/**/*.md : Design specification pages in `docs/design/` must be consulted before implementing features (7 pages: index, agents, organization, communication, engine, memory, operations)

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/budget/**/*.py : Budget system must enforce: pre-flight/in-flight cost checks, auto-downgrade, billing periods, cost tiers, quota/subscription tracking, spending reports

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-15T11:48:14.867Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T11:48:14.867Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`. Never use `import logging` / `logging.getLogger()` / `print()` in application code. Variable name: always `logger` (not `_logger`, not `log`).

Applied to files:

  • CLAUDE.md
  • src/synthorg/api/app.py
📚 Learning: 2026-03-15T11:48:14.867Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T11:48:14.867Z
Learning: Applies to src/synthorg/**/*.py : Structured kwargs in logging: always `logger.info(EVENT, key=value)` — never `logger.info('msg %s', val)`.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/**/*.py : Log with structured kwargs: always `logger.info(EVENT, key=value)` — never `logger.info('msg %s', val)`

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic must have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`

Applied to files:

  • CLAUDE.md
  • src/synthorg/api/app.py
📚 Learning: 2026-03-15T11:48:14.867Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T11:48:14.867Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-15T22:42:34.900Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.900Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability must use structured logging with correlation tracking and log sinks

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/**/*.py : Always use `logger` as the variable name, not `_logger` or `log`

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 (`BaseModel`, `model_validator`, `computed_field`, `ConfigDict`)

Applied to files:

  • src/synthorg/hr/performance/tracker.py
  • src/synthorg/hr/performance/models.py
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/engine/approval_gate/**/*.py : Approval gate must detect escalations and use EscalationInfo/ResumePayload models for context parking/resume

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-15T22:42:34.900Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.900Z
Learning: Applies to src/synthorg/tools/**/*.py : Tool registry must support role-based access control; built-in tools in file_system/, git, sandbox/, code_runner with approval gate integration

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/communication/meeting/**/*.py : Meeting protocol must support: round-robin, position papers, structured phases; scheduler must handle frequency, participant resolver; implement orchestrator

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/**/*.py : Use `NotBlankStr` (from `core.types`) for all identifier/name fields — including optional (`NotBlankStr | None`) and tuple (`tuple[NotBlankStr, ...]`) variants — instead of manual whitespace validators

Applied to files:

  • tests/unit/hr/performance/conftest.py
📚 Learning: 2026-03-15T22:42:34.899Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T22:42:34.899Z
Learning: Applies to src/synthorg/**/*.py : Use `computed_field` for derived values instead of storing and validating redundant fields (e.g., `TokenUsage.total_tokens`)

Applied to files:

  • src/synthorg/hr/performance/models.py
📚 Learning: 2026-03-15T11:48:14.867Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T11:48:14.867Z
Learning: Applies to **/*.py : Models: Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict). Use computed_field for derived values instead of storing + validating redundant fields. Use NotBlankStr (from core.types) for all identifier/name fields — including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants — instead of manual whitespace validators.

Applied to files:

  • src/synthorg/hr/performance/models.py
🧬 Code graph analysis (9)
tests/unit/api/controllers/test_collaboration.py (3)
src/synthorg/hr/performance/collaboration_override_store.py (2)
  • CollaborationOverrideStore (26-128)
  • get_active_override (53-83)
src/synthorg/api/controllers/collaboration.py (1)
  • set_override (179-249)
tests/unit/hr/performance/test_collaboration_override_store.py (2)
  • TestSetOverride (36-80)
  • TestClearOverride (156-181)
tests/unit/hr/performance/test_llm_calibration_sampler.py (2)
tests/unit/hr/performance/conftest.py (2)
  • make_calibration_record (72-93)
  • make_collab_metric (46-69)
src/synthorg/hr/performance/models.py (1)
  • drift (229-231)
src/synthorg/api/controllers/__init__.py (1)
src/synthorg/api/controllers/collaboration.py (1)
  • CollaborationController (102-323)
src/synthorg/api/app.py (3)
tests/unit/hr/promotion/conftest.py (1)
  • tracker (104-106)
src/synthorg/hr/performance/tracker.py (1)
  • PerformanceTracker (50-474)
src/synthorg/api/state.py (1)
  • performance_tracker (193-198)
src/synthorg/hr/performance/llm_calibration_sampler.py (1)
src/synthorg/hr/performance/models.py (3)
  • LlmCalibrationRecord (187-242)
  • CollaborationMetricRecord (70-130)
  • drift (229-231)
src/synthorg/hr/performance/tracker.py (3)
src/synthorg/hr/performance/collaboration_override_store.py (2)
  • CollaborationOverrideStore (26-128)
  • get_active_override (53-83)
src/synthorg/hr/performance/llm_calibration_sampler.py (3)
  • LlmCalibrationSampler (55-279)
  • should_sample (92-98)
  • sample (100-165)
src/synthorg/hr/performance/models.py (2)
  • CollaborationScoreResult (158-184)
  • CollaborationMetricRecord (70-130)
src/synthorg/hr/performance/collaboration_override_store.py (2)
src/synthorg/hr/performance/models.py (1)
  • CollaborationOverride (245-296)
src/synthorg/api/controllers/collaboration.py (2)
  • set_override (179-249)
  • clear_override (252-287)
tests/unit/hr/performance/conftest.py (1)
src/synthorg/hr/performance/models.py (2)
  • CollaborationOverride (245-296)
  • LlmCalibrationRecord (187-242)
tests/unit/hr/performance/test_collaboration_override_store.py (2)
src/synthorg/hr/performance/collaboration_override_store.py (5)
  • CollaborationOverrideStore (26-128)
  • set_override (37-51)
  • get_active_override (53-83)
  • clear_override (85-101)
  • list_overrides (103-128)
src/synthorg/hr/performance/models.py (1)
  • CollaborationOverride (245-296)
🔇 Additional comments (20)
CLAUDE.md (1)

116-116: LGTM!

Documentation updates correctly reflect the new collaboration scoring features:

  • API module description now includes the collaboration endpoint
  • HR engine description includes LLM calibration sampling and collaboration overrides
  • Event names follow the established PERF_* prefix pattern for performance domain events

Also applies to: 124-124, 194-194

src/synthorg/hr/performance/config.py (1)

49-63: LGTM!

New configuration fields are well-designed:

  • llm_sampling_rate correctly bounded [0.0, 1.0] with sensible 1% default
  • llm_sampling_model uses NotBlankStr | None pattern as per guidelines
  • calibration_retention_days has minimum of 1 day
  • Model inherits allow_inf_nan=False from ConfigDict, preventing invalid float values
tests/unit/hr/performance/conftest.py (2)

72-93: LGTM!

The make_calibration_record fixture correctly:

  • Uses test-small-001 for model_used (compliant with vendor-agnostic naming guideline)
  • Wraps all identifier fields with NotBlankStr
  • Provides sensible defaults for testing LLM calibration sampling

96-113: LGTM!

The make_collaboration_override fixture properly constructs CollaborationOverride with all required fields wrapped in NotBlankStr and sensible test defaults.

src/synthorg/api/state.py (1)

192-198: LGTM!

The performance_tracker property follows the established pattern used by other services in AppState:

  • Uses _require_service for consistent 503 handling when not configured
  • Properly typed return value
  • Docstring follows established convention
src/synthorg/api/controllers/__init__.py (1)

12-12: LGTM!

CollaborationController is properly registered:

  • Imported from the collaboration module
  • Added to ALL_CONTROLLERS tuple for Litestar routing
  • Exported via __all__ for public API surface

Also applies to: 40-40, 53-53

docs/design/agents.md (1)

340-345: LGTM!

Design spec accurately documents the implemented features:

  • LLM calibration sampling via LlmCalibrationSampler (opt-in with config requirement)
  • Human override via CollaborationOverrideStore + CollaborationController
  • API endpoint path /agents/{agent_id}/collaboration
  • Preserves mention of future strategies
tests/unit/api/controllers/test_collaboration.py (3)

28-80: LGTM!

Test fixtures properly set up the test environment:

  • override_store and perf_tracker fixtures correctly wired
  • collab_client uses create_app with all necessary dependencies
  • Test JWT secret meets minimum length requirement
  • Auth headers set up for CEO role by default

83-118: LGTM!

Score endpoint tests verify both scenarios:

  • Neutral 5.0 score when no collaboration data exists
  • Override score (9.0) reflected when active, with override_active: true

159-213: LGTM!

Override setting tests are comprehensive:

  • Verifies POST creates and persists override
  • Validates expiration handling with expires_in_days
  • Confirms observer role is denied write access (403)
src/synthorg/api/app.py (1)

440-440: LGTM!

The performance_tracker integration follows the established pattern:

  • Optional parameter in create_app signature
  • Properly documented in docstring
  • Passed through to AppState constructor
  • Consistent with how other optional services (coordinator, meeting_scheduler, etc.) are handled

Also applies to: 459-459, 504-504

src/synthorg/hr/performance/llm_calibration_sampler.py (9)

1-30: LGTM!

Imports are well-organized with proper TYPE_CHECKING separation. Logger setup follows project conventions.


32-52: LGTM!

System prompt is clear and structured for JSON output. Completion config uses sensible defaults for calibration tasks.


55-90: LGTM!

Constructor properly validates boundary inputs with clear error messages. Keyword-only parameters and proper type hints follow best practices.


92-98: LGTM!

Probabilistic sampling implementation is correct. The S311 suppression is appropriate since this is non-cryptographic usage.


100-165: LGTM!

Exception handling is well-designed: re-raising critical errors (MemoryError, RecursionError) while gracefully handling recoverable failures with proper logging. The PEP 758 except syntax follows project conventions. Structured logging at appropriate levels throughout.


167-190: LGTM!

Query method properly returns immutable tuples, preventing external mutation of internal state. Filter logic is straightforward and correct.


192-207: LGTM!

Clean implementation with proper handling of the no-records case. Using the computed drift field from the model is consistent with the codebase design.


271-279: LGTM!

Pruning logic correctly uses list(self._records) to safely iterate while modifying. Clean removal of empty agent entries prevents unbounded memory growth.


269-269: No action needed — response.usage is a required field.

The CompletionResponse.usage field is defined as usage: TokenUsage = Field(...) (non-optional, required), so accessing response.usage.cost_usd is always safe. There is no risk of AttributeError.

> Likely an incorrect or invalid review comment.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/synthorg/hr/performance/collaboration_override_store.py`:
- Around line 75-82: The expiry-driven eviction currently logs with logger.debug
but performs a state transition by deleting from self._overrides; change the log
level to INFO so the removal is recorded as a state change. In the branch that
checks override.expires_at (the block using PERF_OVERRIDE_EXPIRED, agent_id,
expired_at=str(override.expires_at)), replace logger.debug(...) with
logger.info(...) and keep the same structured fields (PERF_OVERRIDE_EXPIRED,
agent_id, expired_at) so the eviction is visible in normal operational logs.
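A sketch of the adjusted eviction path. Stdlib `logging` stands in for synthorg's `get_logger` (which the real module must use instead), and the `PERF_OVERRIDE_EXPIRED` constant value is hypothetical:

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger("collaboration_override_store")  # stand-in

PERF_OVERRIDE_EXPIRED = "perf.override.expired"  # hypothetical constant value

def evict_if_expired(overrides, agent_id, now):
    override = overrides.get(agent_id)
    if override is None:
        return False
    expires_at = override["expires_at"]
    if expires_at is not None and expires_at <= now:
        # Deleting the override is a state transition, so log at INFO
        # (not DEBUG), keeping the same structured fields as before.
        logger.info(
            PERF_OVERRIDE_EXPIRED,
            extra={"agent_id": agent_id, "expired_at": str(expires_at)},
        )
        del overrides[agent_id]
        return True
    return False

now = datetime(2026, 3, 15, tzinfo=timezone.utc)
overrides = {"agent-001": {"score": 9.0, "expires_at": now - timedelta(days=1)}}
assert evict_if_expired(overrides, "agent-001", now)
assert "agent-001" not in overrides
```

The only behavioral change versus the current code is the log level; fields and eviction logic stay identical.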

In `@src/synthorg/hr/performance/llm_calibration_sampler.py`:
- Around line 170-210: Reads currently return records directly from
self._records so stale entries older than retention_days can persist until the
next sample(); update get_calibration_records and get_drift_summary to apply the
same retention pruning as sample() before returning/aggregating: compute the
cutoff (now - retention_days), filter each agent's records to exclude records
with sampled_at < cutoff (or call the existing internal prune method if one
exists), and then proceed with the existing filtering by agent/since and the
average drift calculation; reference functions: get_calibration_records,
get_drift_summary, sample(), and the in-memory store self._records (use the
existing retention_days attribute or the same retention logic used by sample()).
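A minimal sketch of pruning on the read path, using a stand-in `Sampler` with the same `retention_days`/`_records` shape (the real class also aggregates `get_drift_summary`, omitted here for brevity):

```python
from datetime import datetime, timedelta, timezone

class Sampler:
    """Stand-in for LlmCalibrationSampler's read-path pruning."""

    def __init__(self, retention_days):
        self.retention_days = retention_days
        self._records = {}  # agent_id -> list of {"sampled_at": ..., "drift": ...}

    def _prune(self, now):
        cutoff = now - timedelta(days=self.retention_days)
        for agent_id in list(self._records):
            kept = [r for r in self._records[agent_id] if r["sampled_at"] >= cutoff]
            if kept:
                self._records[agent_id] = kept
            else:
                del self._records[agent_id]  # avoid unbounded empty entries

    def get_calibration_records(self, agent_id, now=None):
        # Apply the same retention cutoff as sample() before returning,
        # so stale entries cannot leak out between sampling runs.
        self._prune(now or datetime.now(timezone.utc))
        return tuple(self._records.get(agent_id, ()))

now = datetime(2026, 3, 15, tzinfo=timezone.utc)
s = Sampler(retention_days=30)
s._records["agent-001"] = [
    {"sampled_at": now - timedelta(days=5), "drift": 0.5},   # within retention
    {"sampled_at": now - timedelta(days=45), "drift": 2.0},  # past retention
]
assert len(s.get_calibration_records("agent-001", now=now)) == 1
```

`get_drift_summary` would call the same `_prune` before averaging, so both readers and `sample()` share one retention rule.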

In `@src/synthorg/hr/performance/tracker.py`:
- Line 209: record_collaboration_event() currently awaits _maybe_sample(), which
can await the full provider round-trip via self._sampler.sample(...) and thus
blocks the ingestion path; change the behavior so sampling is fire-and-forget:
modify the call sites that await _maybe_sample (including
record_collaboration_event and the other location calling _maybe_sample) to
schedule _maybe_sample/_sampler.sample as an asynchronous background task (e.g.,
asyncio.create_task or equivalent) and do not await it, ensure _maybe_sample and
any sampler calls swallow/log exceptions so background failures don't affect
recording, and keep the sampling code purely best-effort and non-blocking to the
core ingestion flow.
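A sketch of the fire-and-forget pattern with `asyncio.create_task`. `Tracker` and `FakeSampler` are simplified stand-ins; the strong-reference set guards against the task being garbage-collected mid-flight (the project's TaskGroup preference applies to fan-out/fan-in work, whereas this is deliberately detached):

```python
import asyncio

class Tracker:
    """Stand-in for PerformanceTracker's sampling wiring."""

    def __init__(self, sampler):
        self._sampler = sampler
        self._sampling_tasks = set()

    def _maybe_sample(self, record, score):
        # Schedule sampling in the background instead of awaiting the
        # provider round-trip on the ingestion path.
        task = asyncio.create_task(self._sample_safely(record, score))
        # Hold a strong reference so the task is not garbage-collected early.
        self._sampling_tasks.add(task)
        task.add_done_callback(self._sampling_tasks.discard)

    async def _sample_safely(self, record, score):
        try:
            await self._sampler.sample(record, score)
        except Exception:
            # Best-effort: real code should log here; failures must never
            # propagate into the recording flow.
            pass

    async def record_collaboration_event(self, record):
        score = 5.0  # stand-in for the computed collaboration score
        self._maybe_sample(record, score)  # not awaited: ingestion returns fast
        return score

class FakeSampler:
    def __init__(self):
        self.calls = []

    async def sample(self, record, score):
        self.calls.append((record, score))

async def main():
    tracker = Tracker(sampler := FakeSampler())
    await tracker.record_collaboration_event({"interaction": "review"})
    await asyncio.sleep(0)  # yield once so the background task runs
    return sampler.calls

calls = asyncio.run(main())
assert calls == [({"interaction": "review"}, 5.0)]
```

The caller observes only the fast score computation; the sampler's LLM call completes (or fails silently, with logging) off the ingestion path.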

In `@tests/unit/api/controllers/test_collaboration.py`:
- Around line 290-319: Collapse the three duplicate tests test_get_override_503,
test_post_override_503, and test_delete_override_503 into a single parametrized
test using pytest.mark.parametrize: parameterize the HTTP verb
("GET","POST","DELETE") and the optional json payload (None for GET/DELETE,
{"score":5.0,"reason":"Test"} for POST), use the existing no_store_client
fixture, call no_store_client.request(method,
"/api/v1/agents/agent-001/collaboration/override", json=payload) and assert
resp.status_code == 503; name the new test something like
test_override_503_parametrized and remove the three originals.
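A sketch of the collapsed test. `FakeClient` stands in for the `no_store_client` fixture (the real fixture builds an app without the override store, so every verb returns 503):

```python
import pytest

OVERRIDE_PATH = "/api/v1/agents/agent-001/collaboration/override"

class FakeClient:
    """Stand-in for no_store_client: the store is not configured."""

    def request(self, method, path, json=None):
        class Resp:
            status_code = 503
        return Resp()

@pytest.mark.parametrize(
    ("method", "payload"),
    [
        ("GET", None),
        ("POST", {"score": 5.0, "reason": "Test"}),
        ("DELETE", None),
    ],
)
def test_override_503_parametrized(method, payload):
    # One parametrized test replaces the three duplicated 503 tests.
    resp = FakeClient().request(method, OVERRIDE_PATH, json=payload)
    assert resp.status_code == 503
```

In the real file, swap `FakeClient()` for the `no_store_client` fixture argument and delete the three originals.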

In `@tests/unit/hr/performance/test_llm_calibration_sampler.py`:
- Around line 355-370: The test uses a fixed NOW for make_calibration_record but
sampler.sample/pruning uses datetime.now(UTC), causing flakiness; capture a
single reference time (e.g., reference_now = NOW) at test start and use it for
both the seeded calibration record's sampled_at and for the sampler's cutoff by
either patching the sampler's clock/provider or monkeypatching datetime.now(UTC)
used by the sampler so that sampler.sample(...) and pruning use the same
reference time; update references to make_calibration_record(...
sampled_at=reference_now - timedelta(...)) and ensure sampler.sample(record=...,
behavioral_score=...) runs with the patched clock so get_calibration_records()
consistently reflects pruning.
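One way to share a single reference time is an injectable clock, sketched below with a stand-in `Sampler` (the real class reads `datetime.now(UTC)` directly, so this assumes either a constructor-injected clock or a monkeypatched module clock):

```python
from datetime import datetime, timedelta, timezone

class Sampler:
    """Stand-in sampler with an injectable clock so tests can pin 'now'."""

    def __init__(self, retention_days, clock=None):
        self.retention_days = retention_days
        self._clock = clock or (lambda: datetime.now(timezone.utc))
        self._records = []

    def sample(self, sampled_at, drift):
        self._records.append({"sampled_at": sampled_at, "drift": drift})
        # Pruning uses the injected clock, so seeded records and the
        # cutoff share one reference time and the test cannot flake.
        cutoff = self._clock() - timedelta(days=self.retention_days)
        self._records = [r for r in self._records if r["sampled_at"] >= cutoff]

    def get_calibration_records(self):
        return tuple(self._records)

# Single reference time for both the seeded records and the cutoff:
reference_now = datetime(2026, 3, 15, tzinfo=timezone.utc)
sampler = Sampler(retention_days=30, clock=lambda: reference_now)
sampler.sample(sampled_at=reference_now - timedelta(days=45), drift=1.0)  # pruned
sampler.sample(sampled_at=reference_now - timedelta(days=1), drift=0.2)   # kept
assert len(sampler.get_calibration_records()) == 1
```

With the clock pinned, the same seeded records are pruned (or kept) identically on every run, regardless of when the suite executes.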

In `@tests/unit/hr/performance/test_tracker_enhancements.py`:
- Around line 65-67: The test calls tracker.get_collaboration_score() without a
deterministic timestamp; update the call to pass the fixed NOW value (i.e.,
tracker.get_collaboration_score(NotBlankStr("agent-001"), now=NOW)) so the
expired-override assertion is deterministic; also ensure other timing-sensitive
behavior in this test uses the same NOW or mocks
time.monotonic()/asyncio.sleep() per guidelines to keep the test stable.

In `@web/src/api/endpoints/collaboration.ts`:
- Around line 38-41: The clearOverride function currently ignores the DELETE response body and thus may treat application-level errors as success. Update clearOverride to call apiClient.delete<ApiResponse<null>> and inspect the returned ApiResponse, either via a new unwrapVoid() helper (analogous to unwrap() but allowing null bodies) or by checking response.success explicitly and throwing (or returning an error) when success === false. See the existing unwrap() behavior in client.ts when implementing the explicit success check or the new unwrapVoid helper.

In `@web/src/api/types.ts`:
- Around line 691-695: The TypeScript DTO SetOverrideRequest currently requires
expires_in_days, but the backend default is None so it should be optional;
update the SetOverrideRequest interface (symbol: SetOverrideRequest, property:
expires_in_days) to make that property optional (allow omission) while keeping
its type number | null, and then update any callers that currently pass explicit
null to omit the property where appropriate so the client mirrors the backend
Pydantic model.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2bd92141-1513-4ce5-bd5e-4228b5cbed34

📥 Commits

Reviewing files that changed from the base of the PR and between e25c56f and 91a2cb9.

📒 Files selected for processing (15)
  • docs/architecture/decisions.md
  • src/synthorg/api/controllers/collaboration.py
  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/hr/performance/config.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • src/synthorg/hr/performance/models.py
  • src/synthorg/hr/performance/tracker.py
  • tests/unit/api/controllers/test_collaboration.py
  • tests/unit/hr/performance/conftest.py
  • tests/unit/hr/performance/test_collaboration_override_store.py
  • tests/unit/hr/performance/test_llm_calibration_sampler.py
  • tests/unit/hr/performance/test_models.py
  • tests/unit/hr/performance/test_tracker_enhancements.py
  • web/src/api/endpoints/collaboration.ts
  • web/src/api/types.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Web
  • GitHub Check: Build Backend
  • GitHub Check: Build Sandbox
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Do NOT use from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations.
Use except A, B: syntax (no parentheses) for exception handling — ruff enforces this on Python 3.14 (PEP 758 except syntax).
Include type hints on all public functions and classes. Strict mypy type-checking is enforced.
Use Google-style docstrings on all public classes and functions. Docstrings are enforced by ruff D rules.
Line length is 88 characters (enforced by ruff).

Files:

  • tests/unit/hr/performance/test_llm_calibration_sampler.py
  • tests/unit/hr/performance/test_models.py
  • tests/unit/hr/performance/test_tracker_enhancements.py
  • tests/unit/api/controllers/test_collaboration.py
  • src/synthorg/hr/performance/config.py
  • tests/unit/hr/performance/conftest.py
  • src/synthorg/api/controllers/collaboration.py
  • src/synthorg/hr/performance/collaboration_override_store.py
  • tests/unit/hr/performance/test_collaboration_override_store.py
  • src/synthorg/hr/performance/models.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • src/synthorg/hr/performance/tracker.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Mark tests with @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow. Maintain 80% coverage minimum. Use asyncio_mode = "auto" (no manual @pytest.mark.asyncio needed). Set 30-second timeout per test. Always include -n auto when running pytest (never sequential). Use @pytest.mark.parametrize for testing similar cases.
Use Python Hypothesis for property-based testing with @given + @settings. Profiles: ci (200 examples, default) and dev (1000 examples), controlled via HYPOTHESIS_PROFILE env var. Run dev profile: HYPOTHESIS_PROFILE=dev uv run python -m pytest tests/ -m unit -n auto -k properties. .hypothesis/ is gitignored.
NEVER skip, dismiss, or ignore flaky tests — always fix them fully and fundamentally. For timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins.

Files:

  • tests/unit/hr/performance/test_llm_calibration_sampler.py
  • tests/unit/hr/performance/test_models.py
  • tests/unit/hr/performance/test_tracker_enhancements.py
  • tests/unit/api/controllers/test_collaboration.py
  • tests/unit/hr/performance/conftest.py
  • tests/unit/hr/performance/test_collaboration_override_store.py
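The marking and parametrization conventions quoted above can be sketched as follows; `clamp_score` is a hypothetical helper invented purely for illustration, not part of the project.

```python
import pytest

def clamp_score(value: float) -> float:
    """Hypothetical helper: clamp a raw collaboration score into the 0-10 range."""
    return max(0.0, min(10.0, value))

@pytest.mark.unit  # every test carries exactly one suite marker
@pytest.mark.parametrize(
    ("raw", "expected"),
    [(-1.0, 0.0), (0.0, 0.0), (5.0, 5.0), (10.0, 10.0), (99.0, 10.0)],
)
def test_clamp_score(raw: float, expected: float) -> None:
    # One parametrized test replaces five near-identical test functions.
    assert clamp_score(raw) == expected
```

With `asyncio_mode = "auto"`, an `async def` test in the same file would need no extra decorator.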
docs/**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

Documentation source is in docs/ and built with Zensical. Design spec is in docs/design/ (7 pages: index, agents, organization, communication, engine, memory, operations). All Markdown documentation must follow Zensical syntax.

Files:

  • docs/architecture/decisions.md
web/src/**/*.{vue,ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Use Vue 3 + PrimeVue + Tailwind CSS in the web dashboard. Organize components by feature (agents/, approvals/, budget/, common/, dashboard/, layout/, messages/, org-chart/, tasks/). Use Pinia stores for state management (auth, agents, tasks, budget, messages, meetings, approvals, websocket, analytics, company, providers).

Files:

  • web/src/api/endpoints/collaboration.ts
  • web/src/api/types.ts
web/src/**/*.ts

📄 CodeRabbit inference engine (CLAUDE.md)

Use TypeScript in the web dashboard. Mirror backend Pydantic models in web/src/api/ with TypeScript types.

Files:

  • web/src/api/endpoints/collaboration.ts
  • web/src/api/types.ts
web/src/**/*.{ts,tsx,vue}

📄 CodeRabbit inference engine (CLAUDE.md)

Use camelCase for variable and function names in TypeScript/Vue code.

Files:

  • web/src/api/endpoints/collaboration.ts
  • web/src/api/types.ts
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/synthorg/**/*.py: Create new objects rather than mutating existing ones. For non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction and MappingProxyType wrapping for read-only enforcement.
For dict/list fields in frozen Pydantic models, use copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization).
Use frozen Pydantic models for config/identity. Use separate mutable-via-copy models (with model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 with adopted conventions: use @computed_field for derived values instead of storing + validating redundant fields; use NotBlankStr from core.types for all identifier/name fields including optional and tuple variants instead of manual whitespace validators.
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (e.g. multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare create_task.
Keep functions under 50 lines and files under 800 lines.
Handle errors explicitly, never silently swallow exceptions.
Validate at system boundaries (user input, external APIs, config files).
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases. Vendor names may only appear in: (1) Operations design page (docs/design/operations.md), (2) .claude/ skill/agent files, (3) third-party import paths. Tests must use test-provider, test-small-001, etc.
Pure data models, enums, and re-exports do NOT need logging.

Files:

  • src/synthorg/hr/performance/config.py
  • src/synthorg/api/controllers/collaboration.py
  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/hr/performance/models.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • src/synthorg/hr/performance/tracker.py
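The frozen-config versus mutable-via-copy split described above can be sketched with Pydantic v2; the model names and fields below are hypothetical, chosen only to illustrate the convention.

```python
from pydantic import BaseModel, ConfigDict, computed_field

class SamplerConfig(BaseModel):
    """Static config: frozen, never mutated after construction."""

    model_config = ConfigDict(frozen=True)
    sample_rate: float = 0.01
    max_records: int = 1000

class SamplerState(BaseModel):
    """Runtime state: evolves via model_copy(update=...), never in place."""

    samples_taken: int = 0

    @computed_field  # derived value, not stored and validated redundantly
    @property
    def has_sampled(self) -> bool:
        return self.samples_taken > 0

config = SamplerConfig()
state = SamplerState()
next_state = state.model_copy(update={"samples_taken": state.samples_taken + 1})
assert state.samples_taken == 0  # the original state object is untouched
assert next_state.samples_taken == 1
assert next_state.has_sampled
```

Keeping `SamplerConfig` and `SamplerState` as separate models is what the "never mix static config fields with mutable runtime fields" rule asks for.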
src/synthorg/!(observability)/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/synthorg/!(observability)/**/*.py: Every module with business logic MUST have: from synthorg.observability import get_logger followed by logger = get_logger(__name__). Never use import logging / logging.getLogger() / print() in application code. Always use variable name logger (not _logger, not log).
Always use event name constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider). Import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT.
Use structured logging with keyword arguments: always logger.info(EVENT, key=value) — never logger.info("msg %s", val).
All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG is for object creation, internal flow, entry/exit of key functions.

Files:

  • src/synthorg/hr/performance/config.py
  • src/synthorg/api/controllers/collaboration.py
  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/hr/performance/models.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • src/synthorg/hr/performance/tracker.py
🧠 Learnings (12)
📚 Learning: 2026-03-16T06:18:07.373Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:18:07.373Z
Learning: Applies to src/synthorg/!(observability)/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG is for object creation, internal flow, entry/exit of key functions.

Applied to files:

  • src/synthorg/api/controllers/collaboration.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • src/synthorg/hr/performance/tracker.py
📚 Learning: 2026-03-15T11:48:14.867Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T11:48:14.867Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • src/synthorg/api/controllers/collaboration.py
📚 Learning: 2026-03-16T06:18:07.373Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:18:07.373Z
Learning: Applies to src/synthorg/**/*.py : Validate at system boundaries (user input, external APIs, config files).

Applied to files:

  • src/synthorg/api/controllers/collaboration.py
📚 Learning: 2026-03-16T06:18:07.373Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:18:07.373Z
Learning: Applies to tests/**/*.py : NEVER skip, dismiss, or ignore flaky tests — always fix them fully and fundamentally. For timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins.

Applied to files:

  • tests/unit/hr/performance/test_collaboration_override_store.py
📚 Learning: 2026-03-16T06:18:07.373Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:18:07.373Z
Learning: Applies to src/synthorg/**/*.py : Use Pydantic v2 with adopted conventions: use `computed_field` for derived values instead of storing + validating redundant fields; use `NotBlankStr` from `core.types` for all identifier/name fields including optional and tuple variants instead of manual whitespace validators.

Applied to files:

  • src/synthorg/hr/performance/models.py
📚 Learning: 2026-03-15T11:48:14.867Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T11:48:14.867Z
Learning: Applies to **/*.py : Models: Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict). Use computed_field for derived values instead of storing + validating redundant fields. Use NotBlankStr (from core.types) for all identifier/name fields — including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants — instead of manual whitespace validators.

Applied to files:

  • src/synthorg/hr/performance/models.py
📚 Learning: 2026-03-16T06:18:07.373Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:18:07.373Z
Learning: Applies to src/synthorg/**/*.py : Use frozen Pydantic models for config/identity. Use separate mutable-via-copy models (with `model_copy(update=...)`) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.

Applied to files:

  • src/synthorg/hr/performance/models.py
📚 Learning: 2026-03-16T06:18:07.373Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:18:07.373Z
Learning: Applies to **/*.py : Use `except A, B:` syntax (no parentheses) for exception handling — ruff enforces this on Python 3.14 (PEP 758 except syntax).

Applied to files:

  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • src/synthorg/hr/performance/tracker.py
📚 Learning: 2026-03-16T06:18:07.373Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:18:07.373Z
Learning: Applies to src/synthorg/**/*.py : Handle errors explicitly, never silently swallow exceptions.

Applied to files:

  • src/synthorg/hr/performance/llm_calibration_sampler.py
📚 Learning: 2026-03-16T06:18:07.373Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:18:07.373Z
Learning: Applies to src/synthorg/!(observability)/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.

Applied to files:

  • src/synthorg/hr/performance/tracker.py
📚 Learning: 2026-03-15T11:48:14.867Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T11:48:14.867Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.

Applied to files:

  • src/synthorg/hr/performance/tracker.py
📚 Learning: 2026-03-16T06:18:07.373Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:18:07.373Z
Learning: Applies to src/synthorg/!(observability)/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` followed by `logger = get_logger(__name__)`. Never use `import logging` / `logging.getLogger()` / `print()` in application code. Always use variable name `logger` (not `_logger`, not `log`).

Applied to files:

  • src/synthorg/hr/performance/tracker.py
🧬 Code graph analysis (11)
tests/unit/hr/performance/test_models.py (3)
tests/unit/hr/performance/conftest.py (3)
  • make_calibration_record (72-93)
  • make_collab_metric (46-69)
  • make_task_metric (16-43)
src/synthorg/hr/performance/models.py (2)
  • drift (231-233)
  • CollaborationScoreResult (160-186)
web/src/api/types.ts (1)
  • CollaborationScoreResult (683-689)
tests/unit/hr/performance/test_tracker_enhancements.py (5)
src/synthorg/hr/performance/collaboration_override_store.py (2)
  • CollaborationOverrideStore (26-129)
  • set_override (37-51)
src/synthorg/hr/performance/models.py (1)
  • CollaborationOverride (248-299)
src/synthorg/hr/performance/tracker.py (6)
  • override_store (429-431)
  • get_collaboration_score (211-254)
  • record_collaboration_event (186-209)
  • get_snapshot (256-312)
  • sampler (434-436)
  • get_collaboration_metrics (400-426)
src/synthorg/api/controllers/collaboration.py (1)
  • set_override (207-269)
src/synthorg/hr/performance/llm_calibration_sampler.py (2)
  • should_sample (95-101)
  • sample (103-168)
web/src/api/endpoints/collaboration.ts (2)
web/src/api/types.ts (5)
  • CollaborationScoreResult (683-689)
  • ApiResponse (125-127)
  • OverrideResponse (697-704)
  • SetOverrideRequest (691-695)
  • CalibrationSummaryResponse (719-724)
web/src/api/client.ts (2)
  • apiClient (12-16)
  • unwrap (70-80)
web/src/api/types.ts (2)
src/synthorg/hr/performance/models.py (2)
  • CollaborationScoreResult (160-186)
  • LlmCalibrationRecord (189-245)
src/synthorg/api/controllers/collaboration.py (3)
  • SetOverrideRequest (33-54)
  • OverrideResponse (57-76)
  • CalibrationSummaryResponse (79-102)
tests/unit/api/controllers/test_collaboration.py (4)
src/synthorg/hr/performance/collaboration_override_store.py (3)
  • CollaborationOverrideStore (26-129)
  • set_override (37-51)
  • get_active_override (53-84)
src/synthorg/hr/performance/models.py (1)
  • CollaborationOverride (248-299)
src/synthorg/hr/performance/tracker.py (2)
  • PerformanceTracker (50-486)
  • override_store (429-431)
src/synthorg/api/controllers/collaboration.py (1)
  • set_override (207-269)
tests/unit/hr/performance/conftest.py (2)
src/synthorg/hr/performance/models.py (2)
  • CollaborationOverride (248-299)
  • LlmCalibrationRecord (189-245)
web/src/api/types.ts (1)
  • LlmCalibrationRecord (706-717)
src/synthorg/hr/performance/collaboration_override_store.py (2)
src/synthorg/hr/performance/models.py (1)
  • CollaborationOverride (248-299)
src/synthorg/api/controllers/collaboration.py (2)
  • set_override (207-269)
  • clear_override (272-303)
tests/unit/hr/performance/test_collaboration_override_store.py (4)
src/synthorg/hr/performance/collaboration_override_store.py (4)
  • set_override (37-51)
  • get_active_override (53-84)
  • list_overrides (104-129)
  • clear_override (86-102)
src/synthorg/hr/performance/models.py (1)
  • CollaborationOverride (248-299)
tests/unit/hr/performance/conftest.py (1)
  • make_collaboration_override (96-113)
src/synthorg/api/controllers/collaboration.py (2)
  • set_override (207-269)
  • clear_override (272-303)
src/synthorg/hr/performance/models.py (1)
web/src/api/types.ts (1)
  • LlmCalibrationRecord (706-717)
src/synthorg/hr/performance/llm_calibration_sampler.py (2)
src/synthorg/hr/performance/models.py (3)
  • LlmCalibrationRecord (189-245)
  • CollaborationMetricRecord (70-132)
  • drift (231-233)
web/src/api/types.ts (1)
  • LlmCalibrationRecord (706-717)
src/synthorg/hr/performance/tracker.py (4)
src/synthorg/hr/performance/collaboration_override_store.py (1)
  • get_active_override (53-84)
src/synthorg/hr/performance/llm_calibration_sampler.py (3)
  • LlmCalibrationSampler (58-309)
  • should_sample (95-101)
  • sample (103-168)
src/synthorg/hr/performance/config.py (1)
  • PerformanceConfig (10-80)
src/synthorg/hr/performance/models.py (2)
  • CollaborationScoreResult (160-186)
  • CollaborationMetricRecord (70-132)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/synthorg/hr/performance/collaboration_override_store.py`:
- Around lines 86-102: The clear_override method currently pops any stored
override without checking its expires_at, so expired entries are treated as
present. Update clear_override (in collaboration_override_store.py) to look up
self._overrides.get(str(agent_id)) and verify the override's expires_at (same
logic used in get_active_override) before deleting and logging
PERF_OVERRIDE_CLEARED: only pop and return True for non-expired overrides,
otherwise return False without logging. Add a regression test that inserts an
expired override and asserts clear_override returns False and does not emit
PERF_OVERRIDE_CLEARED.

In `@src/synthorg/hr/performance/llm_calibration_sampler.py`:
- Around lines 103-168: The sample() method (and the related _call_llm()) is
too large and mixes orchestration, logging, prompt assembly, response parsing,
validation, and persistence. Split responsibilities into small helpers:
extract prompt construction into a build_prompt(record, behavioral_score)
helper, move the raw LLM invocation into _call_llm_raw(prompt) and response
parsing into parse_llm_response(raw_response), which returns (llm_score,
rationale, cost_usd), and encapsulate record creation/persistence into
create_calibration_record(record, llm_score, rationale, cost_usd,
behavioral_score), which returns the LlmCalibrationRecord and appends it to
self._records. Keep sample() focused on flow control (prune, start log, call
helpers, catch/handle exceptions, final log) and ensure _prune_expired, the
PERF_LLM_SAMPLE_STARTED/PERF_LLM_SAMPLE_COMPLETED logging, and the
MemoryError/RecursionError re-raise remain intact. Apply the same splitting
pattern to the related block around _call_llm (lines ~222-309) so each
function stays under 50 lines.
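A structural sketch of the suggested split. The helper names mirror the comment, but the bodies are illustrative stubs (dicts instead of the real Pydantic models, a plain callable instead of the provider client), not the project's actual logic; the shape to note is that sample() keeps only flow control while each helper does one job.

```python
def build_prompt(summary: str, behavioral_score: float) -> str:
    """Prompt assembly, isolated from orchestration."""
    return f"Rate this interaction (behavioral score {behavioral_score}):\n{summary}"

def parse_llm_response(raw: dict) -> tuple[float, str, float]:
    """Response parsing: returns (llm_score, rationale, cost_usd)."""
    return float(raw["score"]), str(raw["rationale"]), float(raw.get("cost_usd", 0.0))

def create_calibration_record(records: list, llm_score: float, rationale: str,
                              cost_usd: float, behavioral_score: float) -> dict:
    """Record creation and persistence (dict stands in for LlmCalibrationRecord)."""
    record = {
        "llm_score": llm_score,
        "rationale": rationale,
        "cost_usd": cost_usd,
        "drift": llm_score - behavioral_score,
    }
    records.append(record)
    return record

def sample(summary: str, behavioral_score: float, call_llm_raw, records: list):
    """Flow control only: call helpers, contain provider failures."""
    prompt = build_prompt(summary, behavioral_score)
    try:
        raw = call_llm_raw(prompt)
        llm_score, rationale, cost_usd = parse_llm_response(raw)
    except (MemoryError, RecursionError):  # project style: unparenthesized on 3.14
        raise  # fatal conditions propagate; real code logs first
    except Exception:
        return None  # provider failure: honor the None-on-failure contract
    return create_calibration_record(records, llm_score, rationale, cost_usd,
                                     behavioral_score)

records: list = []
result = sample("good handoff", 4.0,
                lambda p: {"score": 6.0, "rationale": "clear", "cost_usd": 0.01},
                records)
assert result is not None and result["drift"] == 2.0
```

The real sample() would additionally keep the _prune_expired call and the PERF_LLM_SAMPLE_STARTED/COMPLETED logging around this flow.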
- Around lines 145-154: The code currently validates rationale only when
constructing LlmCalibrationRecord, so whitespace-only rationales from
_call_llm() raise instead of honoring the sampler's None-on-failure contract.
Update _call_llm() to strip and validate the provider rationale (e.g., return
None or an explicit failure indication when rationale.strip() == ""), ensure
callers (the sampler) treat a blank/whitespace rationale as None so
NotBlankStr is never invoked on bad data, and add a regression test for a
provider payload like {"score": 7.0, "rationale": "   "} asserting the sampler
returns None (or its fallback) rather than raising when creating
LlmCalibrationRecord.

In `@src/synthorg/hr/performance/tracker.py`:
- Around lines 238-243: The INFO log emitting PERF_OVERRIDE_APPLIED currently
includes a sensitive actor field. Remove the applied_by parameter from the
logger.info(PERF_OVERRIDE_APPLIED, agent_id=agent_id, score=override.score,
applied_by=override.applied_by) call so it emits only non-PII fields (e.g.,
agent_id and score), while applied_by stays stored on the override object
itself (override.applied_by) and is never emitted to logs.
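A minimal sketch of the change. The stub logger stands in for the project's structured get_logger(), and the event constant value is invented; the applied_by field remains on the override object but never reaches the log event.

```python
PERF_OVERRIDE_APPLIED = "perf.override.applied"  # stand-in event constant

class StubLogger:
    """Captures structured log calls so the emitted fields can be inspected."""

    def __init__(self) -> None:
        self.events: list[tuple[str, dict]] = []

    def info(self, event: str, **fields: object) -> None:
        self.events.append((event, fields))

class Override:
    def __init__(self, score: float, applied_by: str) -> None:
        self.score = score
        self.applied_by = applied_by  # stored for audit lookups, never logged

logger = StubLogger()
override = Override(score=8.5, applied_by="ops@example.com")

# Before: logger.info(PERF_OVERRIDE_APPLIED, agent_id=..., score=..., applied_by=...)
# After: only non-PII fields are emitted.
logger.info(PERF_OVERRIDE_APPLIED, agent_id="agent-1", score=override.score)

event, fields = logger.events[0]
assert event == PERF_OVERRIDE_APPLIED
assert "applied_by" not in fields
assert fields == {"agent_id": "agent-1", "score": 8.5}
```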
- Around lines 475-499: The except blocks that catch MemoryError and
RecursionError around calls to self._collaboration_strategy.score(...) and
sampler.sample(...) should log a structured error with context before
re-raising. Add a logger.error call in each of those handlers that includes
PERF_LLM_SAMPLE_FAILED (or an appropriate constant), agent_id=record.agent_id,
record_id=record.id, and exc_info=True, then re-raise the exception so the
existing behavior is preserved.
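A sketch of the suggested handler shape. The stub logger and the event-constant value stand in for the project's observability module, and the function is a simplified stand-in for the tracker code; on Python 3.14 the except clause would be written without parentheses per the project's PEP 758 style.

```python
PERF_LLM_SAMPLE_FAILED = "perf.llm_sample.failed"  # stand-in event constant

class StubLogger:
    def __init__(self) -> None:
        self.errors: list[tuple[str, dict]] = []

    def error(self, event: str, **fields: object) -> None:
        self.errors.append((event, fields))

def score_with_sampling(record_id: str, agent_id: str, score_fn, logger) -> float:
    try:
        return score_fn()
    except (MemoryError, RecursionError):
        # Log WITH context before re-raising -- never propagate silently.
        logger.error(
            PERF_LLM_SAMPLE_FAILED,
            agent_id=agent_id,
            record_id=record_id,
            exc_info=True,
        )
        raise  # existing behavior preserved: fatal errors still propagate

logger = StubLogger()

def exploding_score() -> float:
    raise RecursionError("synthetic failure")

try:
    score_with_sampling("rec-1", "agent-1", exploding_score, logger)
except RecursionError:
    pass  # caller still sees the exception

assert logger.errors[0][0] == PERF_LLM_SAMPLE_FAILED
assert logger.errors[0][1]["agent_id"] == "agent-1"
```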
- Around lines 450-459: The current _schedule_sampling() creates unbounded
background tasks and can leak setup failures. Replace the bare
asyncio.create_task usage and the _background_tasks set with a structured
concurrency approach (e.g., an owned asyncio.TaskGroup, or an asyncio.Queue
plus worker tasks bounded by an asyncio.Semaphore) so sampling is bounded and
shutdown-able: have record_collaboration_event() enqueue records or call a
non-blocking should_sample check, let a dedicated TaskGroup/worker loop run
_maybe_sample(record) under try/except to contain exceptions, cap concurrency
with a semaphore or a limited worker count, and implement an async shutdown
method that cancels/drains the TaskGroup or stops the workers and awaits
pending _maybe_sample completions. Update references to _background_tasks,
_maybe_sample, and _schedule_sampling accordingly.
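A sketch of the queue-plus-workers variant the comment describes. The class and method names are illustrative stand-ins (the real code would wire this into the tracker and log contained exceptions): producers enqueue records instead of spawning tasks, a fixed worker pool bounds concurrency, and shutdown() drains the queue before cancelling the workers.

```python
import asyncio

class SamplingWorkerPool:
    """Bounded, shutdown-able background sampling (stand-in names)."""

    def __init__(self, sample_fn, workers: int = 2) -> None:
        self._sample_fn = sample_fn
        self._queue: asyncio.Queue = asyncio.Queue()
        # A fixed worker count caps concurrency; no unbounded create_task.
        self._workers = [asyncio.create_task(self._worker()) for _ in range(workers)]

    def enqueue(self, record) -> None:
        """Non-blocking producer side, callable from record_collaboration_event()."""
        self._queue.put_nowait(record)

    async def _worker(self) -> None:
        while True:
            record = await self._queue.get()
            try:
                await self._sample_fn(record)
            except Exception:
                pass  # real code: log with context instead of swallowing
            finally:
                self._queue.task_done()

    async def shutdown(self) -> None:
        await self._queue.join()  # drain pending records first
        for task in self._workers:
            task.cancel()
        await asyncio.gather(*self._workers, return_exceptions=True)

async def main() -> list:
    sampled: list = []

    async def sample(record) -> None:
        await asyncio.sleep(0)
        sampled.append(record)

    pool = SamplingWorkerPool(sample, workers=2)
    for i in range(5):
        pool.enqueue(i)
    await pool.shutdown()  # all five records processed before workers stop
    return sampled

results = asyncio.run(main())
assert sorted(results) == [0, 1, 2, 3, 4]
```

An owned asyncio.TaskGroup is the other option the comment offers; the queue variant is shown here because it gives record_collaboration_event() a non-blocking enqueue path.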

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9967a704-e8d8-459f-b65c-bbfc2b992370

📥 Commits

Reviewing files that changed from the base of the PR and between 91a2cb9 and a42d81e.

📒 Files selected for processing (9)
  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • src/synthorg/hr/performance/tracker.py
  • tests/unit/api/controllers/test_collaboration.py
  • tests/unit/hr/performance/test_llm_calibration_sampler.py
  • tests/unit/hr/performance/test_tracker_enhancements.py
  • web/src/api/client.ts
  • web/src/api/endpoints/collaboration.ts
  • web/src/api/types.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Web
  • GitHub Check: Build Sandbox
  • GitHub Check: Build Backend
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Analyze (python)
🧰 Additional context used
🧬 Code graph analysis (5)
tests/unit/hr/performance/test_llm_calibration_sampler.py (3)
tests/unit/hr/performance/conftest.py (2)
  • make_calibration_record (72-93)
  • make_collab_metric (46-69)
src/synthorg/hr/performance/tracker.py (1)
  • sampler (436-438)
src/synthorg/hr/performance/models.py (1)
  • drift (231-233)
web/src/api/client.ts (1)
web/src/api/types.ts (2)
  • ApiResponse (125-127)
  • ErrorDetail (113-120)
src/synthorg/hr/performance/collaboration_override_store.py (2)
src/synthorg/hr/performance/models.py (1)
  • CollaborationOverride (248-299)
src/synthorg/api/controllers/collaboration.py (2)
  • set_override (207-269)
  • clear_override (272-303)
web/src/api/endpoints/collaboration.ts (2)
web/src/api/types.ts (5)
  • CollaborationScoreResult (683-689)
  • ApiResponse (125-127)
  • OverrideResponse (697-704)
  • SetOverrideRequest (691-695)
  • CalibrationSummaryResponse (719-724)
web/src/api/client.ts (3)
  • apiClient (12-16)
  • unwrap (70-80)
  • unwrapVoid (86-95)
src/synthorg/hr/performance/llm_calibration_sampler.py (2)
web/src/api/types.ts (1)
  • LlmCalibrationRecord (706-717)
src/synthorg/hr/performance/models.py (3)
  • LlmCalibrationRecord (189-245)
  • CollaborationMetricRecord (70-132)
  • drift (231-233)

Comment on lines +238 to +243
logger.info(
PERF_OVERRIDE_APPLIED,
agent_id=agent_id,
score=override.score,
applied_by=override.applied_by,
)

⚠️ Potential issue | 🟠 Major

Drop applied_by from this INFO log.

This branch runs on every score/snapshot read while an override is active, and applied_by can carry a human identifier or email. Keep attribution in the override record itself, but avoid emitting raw actor identifiers in routine logs.

🔒 Suggested change
                 logger.info(
                     PERF_OVERRIDE_APPLIED,
                     agent_id=agent_id,
                     score=override.score,
-                    applied_by=override.applied_by,
+                    override_source="human",
                 )

As per coding guidelines, "Major/high-impact issues: Compliance/privacy risks (PII retention, logging sensitive data -- like emails and other user identifiers, GDPR/CCPA violations)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/performance/tracker.py` around lines 238 - 243, The INFO log
emitting PERF_OVERRIDE_APPLIED currently includes a sensitive actor field;
remove the applied_by parameter from the logger.info call in the code that logs
overrides (the block using logger.info(PERF_OVERRIDE_APPLIED, agent_id=agent_id,
score=override.score, applied_by=override.applied_by)). Update that call to only
include non-PII fields (e.g., agent_id and score) while leaving applied_by
stored on the override object itself (override.applied_by) and not emitted to
logs.

Comment on lines +450 to +459
if self._sampler is None:
return
if record.interaction_summary is None:
return
if not self._sampler.should_sample():
return

task = asyncio.create_task(self._maybe_sample(record))
self._background_tasks.add(task)
task.add_done_callback(self._background_tasks.discard)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether PerformanceTracker has any owned task lifecycle or backpressure
# for collaboration sampling.
ast-grep --pattern 'PerformanceTracker($$$)'
rg -n --type=py -C2 '\b(create_task|TaskGroup|Semaphore|Queue|aclose|shutdown|lifespan)\b'

Repository: Aureliolo/synthorg

Length of output: 50375


🏁 Script executed:

wc -l src/synthorg/hr/performance/tracker.py

Repository: Aureliolo/synthorg

Length of output: 104


🏁 Script executed:

sed -n '440,470p' src/synthorg/hr/performance/tracker.py

Repository: Aureliolo/synthorg

Length of output: 1097


🏁 Script executed:

# Find the _schedule_sampling method definition
rg -n '_schedule_sampling' src/synthorg/hr/performance/tracker.py -B5 -A10

Repository: Aureliolo/synthorg

Length of output: 1987


🏁 Script executed:

# Check for any shutdown, cleanup, or lifecycle methods in PerformanceTracker
rg -n '(def.*shut|def.*close|def.*drain|async def.*exit|__aenter__|__aexit__|_background_tasks)' src/synthorg/hr/performance/tracker.py -B2 -A3

Repository: Aureliolo/synthorg

Length of output: 879


🏁 Script executed:

# Check the class definition, __init__, and any cleanup methods
sed -n '1,150p' src/synthorg/hr/performance/tracker.py | head -100

Repository: Aureliolo/synthorg

Length of output: 3838


🏁 Script executed:

# Check for __del__, __aexit__, or any cleanup pattern in the entire file
rg -n '(__del__|__aexit__|__aenter__|async def.*clean|async def.*shutdown|async def.*drain)' src/synthorg/hr/performance/tracker.py

Repository: Aureliolo/synthorg

Length of output: 44


🏁 Script executed:

# Check if _background_tasks is used anywhere else (e.g., in cleanup)
rg -n '_background_tasks' src/synthorg/hr/performance/tracker.py

Repository: Aureliolo/synthorg

Length of output: 308


🏁 Script executed:

# Check if PerformanceTracker is async context manager or has lifespan methods
ast-grep --pattern 'class PerformanceTracker { $$$ }'

Repository: Aureliolo/synthorg

Length of output: 44


🏁 Script executed:

# Check where record_collaboration_event is called to understand the call path
rg -n 'record_collaboration_event' src/synthorg/hr/performance/tracker.py -A3

Repository: Aureliolo/synthorg

Length of output: 191


🏁 Script executed:

# Verify if there are any tests that show expected behavior for background tasks
rg -n '_background_tasks' tests/ -l | head -5

Repository: Aureliolo/synthorg

Length of output: 116


🏁 Script executed:

# Find record_collaboration_event definition and context
sed -n '185,215p' src/synthorg/hr/performance/tracker.py

Repository: Aureliolo/synthorg

Length of output: 1009


🏁 Script executed:

# Check if record_collaboration_event is async or sync
rg -n 'def record_collaboration_event|async def record_collaboration_event' src/synthorg/hr/performance/tracker.py -A2

Repository: Aureliolo/synthorg

Length of output: 172


🏁 Script executed:

# Verify the exact signature and if _schedule_sampling is called
sed -n '190,212p' src/synthorg/hr/performance/tracker.py

Repository: Aureliolo/synthorg

Length of output: 780


🏁 Script executed:

# Check if _maybe_sample handles exceptions properly
sed -n '461,495p' src/synthorg/hr/performance/tracker.py

Repository: Aureliolo/synthorg

Length of output: 1192



Wrap background sampling in an owned TaskGroup or queue with backpressure.

_schedule_sampling() executes should_sample() and asyncio.create_task() on the ingestion path without failure containment, so setup failures can escape record_collaboration_event(). More critically, _background_tasks is unbounded and grows one task per sampled record with no cap or drain mechanism.

Refactor to use an owned asyncio.TaskGroup, queue, or semaphore to bound concurrent sampling tasks and cleanly shut down pending work. This contains setup failures, prevents unbounded growth, and aligns with the structured concurrency guideline (prefer TaskGroup over bare create_task).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/performance/tracker.py` around lines 450 - 459, The current
_schedule_sampling() creates unbounded background tasks and can leak setup
failures; replace the bare asyncio.create_task usage and _background_tasks set
with a structured concurrency approach (e.g., an owned asyncio.TaskGroup or an
asyncio.Queue + worker tasks protected by an asyncio.Semaphore) so sampling is
bounded and shutdown-able: have record_collaboration_event() enqueue records or
call into a non-blocking should_sample check, then let a dedicated
TaskGroup/worker loop run _maybe_sample(record) under try/except to contain
exceptions, use a semaphore or limited worker count to cap concurrency, and
implement an async shutdown method that cancels/drains the TaskGroup or stops
workers and awaits pending _maybe_sample completions; update references to
_background_tasks, _maybe_sample, and _schedule_sampling accordingly.
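The refactor suggested above can be sketched as a bounded queue drained by a worker task; names like `BoundedSampler`, `schedule`, and `max_pending` are illustrative, not the repo's actual API:

```python
import asyncio

class BoundedSampler:
    """Sketch: queue-based background sampling with backpressure and shutdown."""

    def __init__(self, max_pending: int = 16) -> None:
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)
        self._worker: asyncio.Task | None = None
        self.failures: list[Exception] = []

    def start(self) -> None:
        self._worker = asyncio.create_task(self._run())

    def schedule(self, record: object) -> bool:
        # Non-blocking enqueue: drop (and report) instead of growing unboundedly.
        try:
            self._queue.put_nowait(record)
            return True
        except asyncio.QueueFull:
            return False

    async def _run(self) -> None:
        while True:
            record = await self._queue.get()
            try:
                await self._sample(record)
            except Exception as exc:  # real code: logger.warning(..., exc_info=True)
                self.failures.append(exc)
            finally:
                self._queue.task_done()

    async def _sample(self, record: object) -> None:
        await asyncio.sleep(0)  # stand-in for the LLM round-trip

    async def shutdown(self) -> None:
        await self._queue.join()  # drain pending work before cancelling
        if self._worker is not None:
            self._worker.cancel()

async def demo() -> list[bool]:
    sampler = BoundedSampler(max_pending=2)
    sampler.start()
    accepted = [sampler.schedule(i) for i in range(2)]
    await sampler.shutdown()
    return accepted

accepted = asyncio.run(demo())
```

The queue caps in-flight work, `shutdown()` drains before cancelling, and exceptions in `_sample` are contained inside the worker loop rather than escaping the ingestion path.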

Comment on lines +475 to +499
try:
behavioral_result = await self._collaboration_strategy.score(
agent_id=record.agent_id,
records=(record,),
)
except MemoryError, RecursionError:
raise
except Exception:
logger.warning(
PERF_LLM_SAMPLE_FAILED,
agent_id=record.agent_id,
record_id=record.id,
reason="behavioral_score_failed",
exc_info=True,
)
return

try:
await sampler.sample(
record=record,
behavioral_score=behavioral_result.score,
)
except MemoryError, RecursionError:
raise
except Exception:

⚠️ Potential issue | 🟡 Minor

Log the fatal re-raise branches with context.

Both MemoryError / RecursionError branches bypass the structured logger, so the only signal is an unstructured background-task failure without agent_id / record_id. Emit logger.error(...) before each raise.

🧭 Suggested change
         try:
             behavioral_result = await self._collaboration_strategy.score(
                 agent_id=record.agent_id,
                 records=(record,),
             )
         except MemoryError, RecursionError:
+            logger.error(
+                PERF_LLM_SAMPLE_FAILED,
+                agent_id=record.agent_id,
+                record_id=record.id,
+                reason="behavioral_score_fatal",
+                exc_info=True,
+            )
             raise
         except Exception:
             logger.warning(
                 PERF_LLM_SAMPLE_FAILED,
                 agent_id=record.agent_id,
@@
         try:
             await sampler.sample(
                 record=record,
                 behavioral_score=behavioral_result.score,
             )
         except MemoryError, RecursionError:
+            logger.error(
+                PERF_LLM_SAMPLE_FAILED,
+                agent_id=record.agent_id,
+                record_id=record.id,
+                reason="llm_sample_fatal",
+                exc_info=True,
+            )
             raise
         except Exception:
             logger.warning(
                 PERF_LLM_SAMPLE_FAILED,
                 agent_id=record.agent_id,

As per coding guidelines, "All error paths must log at WARNING or ERROR with context before raising."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/performance/tracker.py` around lines 475 - 499, The except
blocks that catch MemoryError and RecursionError around calls to
self._collaboration_strategy.score(...) and sampler.sample(...) should log a
structured error with context before re-raising; add a logger.error call in each
of those except MemoryError, RecursionError handlers that includes
PERF_LLM_SAMPLE_FAILED (or an appropriate constant), agent_id=record.agent_id,
record_id=record.id and exc_info=True, then re-raise the exception so the
existing behavior is preserved.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
src/synthorg/hr/performance/llm_calibration_sampler.py (1)

103-169: 🛠️ Refactor suggestion | 🟠 Major

sample() still exceeds the 50-line function limit.

The method at Lines 103-169 remains above the repository’s max function size and should be split further (e.g., record creation/persistence helper) to satisfy the enforced limit.

As per coding guidelines, "Functions must be less than 50 lines; files must be less than 800 lines."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/performance/llm_calibration_sampler.py` around lines 103 -
169, The sample() method is too long; extract the LLM call handling and
calibration record construction/persistence into small helpers to reduce
sample() under 50 lines: keep sample() orchestrating flow and calls to
self._prune_expired(), await self._call_llm(record) (leave exception handling
for MemoryError/RecursionError and general Exception logging in sample()), then
move the block that creates LlmCalibrationRecord(...) and updates self._records
(including agent_key logic and logger.info(PERF_LLM_SAMPLE_COMPLETED,...)) into
a new method like _build_and_store_calibration_record(record, behavioral_score,
llm_score, rationale, cost_usd) that returns the calibration_record; update
sample() to call that helper and return its result. Ensure the new helper uses
NotBlankStr for rationale and model_used and sets sampled_at=datetime.now(UTC).
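One way the suggested split could look, using simplified stand-ins for the repo's models (`CalibrationRecord` and `Sampler` here are illustrative, not the actual `LlmCalibrationRecord` / `LlmCalibrationSampler`):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class CalibrationRecord:
    """Simplified stand-in for LlmCalibrationRecord."""
    agent_id: str
    behavioral_score: float
    llm_score: float
    sampled_at: datetime

    @property
    def drift(self) -> float:
        # Mirrors the computed drift: LLM score minus behavioral score.
        return self.llm_score - self.behavioral_score

class Sampler:
    def __init__(self) -> None:
        self._records: dict[str, list[CalibrationRecord]] = {}

    def _build_and_store_calibration_record(
        self, agent_id: str, behavioral_score: float, llm_score: float
    ) -> CalibrationRecord:
        # Extracted helper: sample() keeps orchestration and exception
        # handling; record construction and persistence live here.
        record = CalibrationRecord(
            agent_id=agent_id,
            behavioral_score=behavioral_score,
            llm_score=llm_score,
            sampled_at=datetime.now(timezone.utc),
        )
        self._records.setdefault(agent_id, []).append(record)
        return record

sampler = Sampler()
rec = sampler._build_and_store_calibration_record("agent-1", 0.8, 0.6)
```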
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/synthorg/hr/performance/collaboration_override_store.py`:
- Around line 113-116: In clear_override(), when detecting an expired override
(override.expires_at is not None and override.expires_at <= now), log the
eviction as an INFO state transition before deleting it; emit an INFO-level log
via the module logger that includes identifying data such as agent_key and the
override identifier (or the override object) to mirror the behavior in
get_active_override(), then perform del
self._overrides[agent_key] and return False.

In `@src/synthorg/hr/performance/llm_calibration_sampler.py`:
- Around line 357-362: The retention-prune loop that filters self._records by
sampled_at and deletes empty agent buckets currently performs deletions
silently; add an INFO-level log for each state transition: when records are
pruned for an agent_key and when an agent_key bucket is deleted. Modify the loop
in the LLMCalibrationSampler (the block referencing self._records, agent_key,
and r.sampled_at) to call the module/class logger at INFO with messages that
include the agent_key, number of records removed (or remaining), and the cutoff
timestamp so observers can track retention-driven changes.

---

Duplicate comments:
In `@src/synthorg/hr/performance/llm_calibration_sampler.py`:
- Around line 103-169: The sample() method is too long; extract the LLM call
handling and calibration record construction/persistence into small helpers to
reduce sample() under 50 lines: keep sample() orchestrating flow and calls to
self._prune_expired(), await self._call_llm(record) (leave exception handling
for MemoryError/RecursionError and general Exception logging in sample()), then
move the block that creates LlmCalibrationRecord(...) and updates self._records
(including agent_key logic and logger.info(PERF_LLM_SAMPLE_COMPLETED,...)) into
a new method like _build_and_store_calibration_record(record, behavioral_score,
llm_score, rationale, cost_usd) that returns the calibration_record; update
sample() to call that helper and return its result. Ensure the new helper uses
NotBlankStr for rationale and model_used and sets sampled_at=datetime.now(UTC).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 689d1049-2dc6-4658-a506-c10f7d0c92a9

📥 Commits

Reviewing files that changed from the base of the PR and between a42d81e and 7a2f9b4.

📒 Files selected for processing (4)
  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • tests/unit/hr/performance/test_collaboration_override_store.py
  • tests/unit/hr/performance/test_llm_calibration_sampler.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: No from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations
Use except A, B: syntax (no parentheses) for exception handling — PEP 758 exception syntax enforced by ruff on Python 3.14
All public functions and classes must have type hints; mypy strict mode enforced
Use Google-style docstrings required on all public classes and functions; enforced by ruff D rules
Create new objects for immutability; never mutate existing objects. For non-Pydantic internal collections, use copy.deepcopy() at construction and MappingProxyType for read-only enforcement. For Pydantic frozen models, use copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)
Use frozen Pydantic models (v2) for config and identity; separate mutable-via-copy models for runtime state using model_copy(update=...). Never mix static config fields with mutable runtime fields in one model
Use Pydantic v2 conventions: @computed_field for derived values instead of storing redundant fields; use NotBlankStr from core.types for all identifier/name fields (including optional NotBlankStr | None and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (e.g., multiple tool invocations, parallel agent calls) for structured concurrency; avoid bare create_task
Line length: 88 characters (ruff configured)
Functions must be less than 50 lines; files must be less than 800 lines
Handle errors explicitly; never silently swallow exceptions
Validate at system boundaries: user input, external APIs, config files

Files:

  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
  • tests/unit/hr/performance/test_collaboration_override_store.py
  • tests/unit/hr/performance/test_llm_calibration_sampler.py
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/synthorg/**/*.py: Every module with business logic MUST have: from synthorg.observability import get_logger then logger = get_logger(__name__). Use logger variable name exclusively (not _logger, not log)
Never use import logging / logging.getLogger() / print() in application code — use the get_logger() function from synthorg.observability
Always use event name constants from synthorg.observability.events.<domain> modules (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget). Import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT. Use structured logging: logger.info(EVENT, key=value) — never logger.info("msg %s", val)
All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, and key function entry/exit
Pure data models, enums, and re-exports do NOT require logging

Files:

  • src/synthorg/hr/performance/collaboration_override_store.py
  • src/synthorg/hr/performance/llm_calibration_sampler.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use pytest markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow to categorize tests
Minimum 80% code coverage (enforced in CI via pytest --cov-fail-under=80)
Use asyncio_mode = "auto" — no manual @pytest.mark.asyncio needed on async tests
Set 30-second timeout per test; use pytest-xdist via -n auto — ALWAYS include -n auto when running pytest, never run tests sequentially
Prefer @pytest.mark.parametrize for testing similar cases
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned tests. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001, test-provider, test-small-001, etc. Use large/medium/small as aliases. Real vendor names only appear in: (1) Operations design page (docs/design/operations.md), (2) .claude/ skill/agent files, (3) third-party import paths (e.g., litellm.types.llms.openai)
Use Hypothesis for property-based testing with @given and @settings decorators. Hypothesis profiles: ci (200 examples, default) and dev (1000 examples) controlled via HYPOTHESIS_PROFILE env var. Run dev profile: HYPOTHESIS_PROFILE=dev uv run python -m pytest tests/ -m unit -n auto -k properties. .hypothesis/ is gitignored
NEVER skip, dismiss, or ignore flaky tests — always fix them fully and fundamentally. For timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins

Files:

  • tests/unit/hr/performance/test_collaboration_override_store.py
  • tests/unit/hr/performance/test_llm_calibration_sampler.py
🧠 Learnings (6)
📚 Learning: 2026-03-15T11:48:14.867Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T11:48:14.867Z
Learning: Applies to src/synthorg/**/*.py : All state transitions must log at INFO.

Applied to files:

  • src/synthorg/hr/performance/collaboration_override_store.py
📚 Learning: 2026-03-16T07:07:34.946Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:07:34.946Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, and key function entry/exit

Applied to files:

  • src/synthorg/hr/performance/collaboration_override_store.py
📚 Learning: 2026-03-16T07:07:34.946Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:07:34.946Z
Learning: Applies to **/*.py : Handle errors explicitly; never silently swallow exceptions

Applied to files:

  • src/synthorg/hr/performance/llm_calibration_sampler.py
📚 Learning: 2026-03-16T07:07:34.946Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:07:34.946Z
Learning: Applies to **/*.py : Use `except A, B:` syntax (no parentheses) for exception handling — PEP 758 exception syntax enforced by ruff on Python 3.14

Applied to files:

  • src/synthorg/hr/performance/llm_calibration_sampler.py
📚 Learning: 2026-03-16T07:07:34.946Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:07:34.946Z
Learning: Applies to **/*.py : Functions must be less than 50 lines; files must be less than 800 lines

Applied to files:

  • src/synthorg/hr/performance/llm_calibration_sampler.py
📚 Learning: 2026-03-16T07:07:34.946Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:07:34.946Z
Learning: Applies to tests/**/*.py : NEVER skip, dismiss, or ignore flaky tests — always fix them fully and fundamentally. For timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins

Applied to files:

  • tests/unit/hr/performance/test_collaboration_override_store.py
  • tests/unit/hr/performance/test_llm_calibration_sampler.py
🧬 Code graph analysis (4)
src/synthorg/hr/performance/collaboration_override_store.py (2)
src/synthorg/hr/performance/models.py (1)
  • CollaborationOverride (248-299)
src/synthorg/api/controllers/collaboration.py (2)
  • set_override (207-269)
  • clear_override (272-303)
src/synthorg/hr/performance/llm_calibration_sampler.py (1)
src/synthorg/hr/performance/models.py (3)
  • LlmCalibrationRecord (189-245)
  • CollaborationMetricRecord (70-132)
  • drift (231-233)
tests/unit/hr/performance/test_collaboration_override_store.py (3)
src/synthorg/hr/performance/collaboration_override_store.py (4)
  • set_override (37-51)
  • get_active_override (53-84)
  • list_overrides (125-150)
  • clear_override (86-123)
src/synthorg/hr/performance/models.py (1)
  • CollaborationOverride (248-299)
tests/unit/hr/performance/conftest.py (1)
  • make_collaboration_override (96-113)
tests/unit/hr/performance/test_llm_calibration_sampler.py (4)
src/synthorg/hr/performance/llm_calibration_sampler.py (2)
  • LlmCalibrationSampler (58-362)
  • sample (103-168)
tests/unit/hr/performance/conftest.py (2)
  • make_calibration_record (72-93)
  • make_collab_metric (46-69)
src/synthorg/hr/performance/tracker.py (1)
  • sampler (436-438)
src/synthorg/hr/performance/models.py (1)
  • drift (231-233)
🔇 Additional comments (3)
tests/unit/hr/performance/test_collaboration_override_store.py (1)

157-170: Deterministic default-time test looks good.

Lines 157-170 now use a runtime-derived reference time, which keeps this test stable over calendar time.

tests/unit/hr/performance/test_llm_calibration_sampler.py (1)

366-401: Retention pruning test is deterministic and solid.

The datetime pinning in Lines 372-381 eliminates clock-based flakiness for pruning assertions.

src/synthorg/hr/performance/llm_calibration_sampler.py (1)

274-274: The current syntax is correct. PEP 758 allows parentheses-free multi-exception syntax (except ValueError, TypeError:) only when there is no as clause. Since this code uses as exc to bind the exception, parentheses are required by the PEP 758 specification. The suggested change would produce invalid Python 3.14 syntax.

			> Likely an incorrect or invalid review comment.
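A minimal illustration of the PEP 758 distinction; the parens-free form appears only in comments so the snippet runs on any Python 3 version:

```python
# Valid on every Python 3 version: parenthesized tuple with `as`.
try:
    raise RecursionError("too deep")
except (MemoryError, RecursionError) as exc:
    caught = type(exc).__name__

# On Python 3.14 (PEP 758) the parentheses may be dropped, but ONLY when
# there is no `as` clause:
#
#     except MemoryError, RecursionError:          # ok on 3.14
#     except (MemoryError, RecursionError) as e:   # parens still required
```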

Comment on lines +357 to +362
for agent_key in list(self._records):
self._records[agent_key] = [
r for r in self._records[agent_key] if r.sampled_at >= cutoff
]
if not self._records[agent_key]:
del self._records[agent_key]

⚠️ Potential issue | 🟠 Major

Log retention-prune deletions at INFO.

Lines 357-362 remove records and delete agent buckets from _records, but no INFO state-transition log is emitted. This creates an observability gap for retention-driven data changes.

As per coding guidelines, "All state transitions must log at INFO."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/hr/performance/llm_calibration_sampler.py` around lines 357 -
362, The retention-prune loop that filters self._records by sampled_at and
deletes empty agent buckets currently performs deletions silently; add an
INFO-level log for each state transition: when records are pruned for an
agent_key and when an agent_key bucket is deleted. Modify the loop in the
LLMCalibrationSampler (the block referencing self._records, agent_key, and
r.sampled_at) to call the module/class logger at INFO with messages that include
the agent_key, number of records removed (or remaining), and the cutoff
timestamp so observers can track retention-driven changes.
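A hedged sketch of the prune loop with the requested INFO logging; stdlib `logging` and the literal event strings stand in for synthorg's `get_logger` and event constants:

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger(__name__)  # repo code: synthorg.observability.get_logger

def prune_expired(
    records: dict[str, list[datetime]], cutoff: datetime
) -> dict[str, list[datetime]]:
    """Drop samples older than cutoff, logging each state transition at INFO."""
    for agent_key in list(records):
        kept = [t for t in records[agent_key] if t >= cutoff]
        removed = len(records[agent_key]) - len(kept)
        if removed:
            logger.info(
                "calibration records pruned: agent_key=%s removed=%d cutoff=%s",
                agent_key, removed, cutoff.isoformat(),
            )
        if kept:
            records[agent_key] = kept
        else:
            # Bucket emptied by retention: delete it and log the transition.
            del records[agent_key]
            logger.info("calibration bucket deleted: agent_key=%s", agent_key)
    return records

now = datetime.now(timezone.utc)
data = {
    "agent-a": [now, now - timedelta(days=40)],
    "agent-b": [now - timedelta(days=40)],
}
pruned = prune_expired(data, cutoff=now - timedelta(days=30))
```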

… override (#232)

Add two D3 enhancements to the collaboration scoring system:

1. **LLM calibration sampling**: New `LlmCalibrationSampler` service that
   probabilistically samples 1% of collaboration interactions (configurable),
   sends them to an LLM for independent evaluation, and stores calibration
   records for drift analysis against the behavioral strategy. Opt-in via
   `llm_sampling_model` config (defaults to None/disabled).

2. **Human override via API**: New `CollaborationOverrideStore` and
   `CollaborationController` at `/agents/{agent_id}/collaboration` with
   GET/POST/DELETE endpoints for managing score overrides. Overrides take
   precedence over computed scores with optional expiration.

New models: `LlmCalibrationRecord`, `CollaborationOverride`, plus
`interaction_summary` field on `CollaborationMetricRecord` and
`override_active` field on `CollaborationScoreResult`.

`PerformanceTracker` integrated with both services and added to `AppState`.

Closes #232
Pre-reviewed by 8 agents, 17 findings addressed:

- Fix user identity extraction (use Request.scope + AuthenticatedUser)
- Fix error types: ServiceUnavailableError for unconfigured services
- Add MemoryError/RecursionError guards on all except Exception blocks
- Separate behavioral score + LLM sample try blocks in _maybe_sample
- Convert LlmCalibrationRecord.drift to @computed_field
- Add expires_at > applied_at validator on CollaborationOverride
- Add constructor validation for sampling_rate/retention_days
- Change interaction_summary to NotBlankStr | None
- Convert CalibrationSummaryResponse.record_count to @computed_field
- Add allow_inf_nan=False to all DTOs
- Log raw LLM response before raising in _call_llm
- Hoist NotBlankStr import to module level in sampler
- Add max_length=4096 to CollaborationOverride.reason
- Add API controller tests (11 tests)
- Add _call_llm edge case tests (null content, out-of-range score)
- Wire performance_tracker into create_app
- Update CLAUDE.md (events, package structure) and design spec D3
Critical:
- Fix str.format() injection on user-controlled interaction_summary
  (escape curly braces, add prompt boundary markers)
- Reject requests when user identity cannot be determined instead of
  storing applied_by="unknown"

Major:
- Wrap JSON parsing in explicit try/except with raw content logging
- Thread now= parameter through get_collaboration_score for snapshot
  consistency
- Log 404 branches before raising NotFoundError (CLAUDE.md rule)
- Remove reflected agent_id from error messages (use generic text)
- Update docstrings: PerformanceTracker (sampler/override_store args),
  PerformanceConfig (3 new fields), CollaborationMetricRecord
  (interaction_summary), set_override (Raises section),
  CalibrationSummaryResponse (records attribute), _call_llm (Raises)

Medium:
- Update docs/architecture/decisions.md D3 to reflect implemented
  LLM sampling and human override
- Fix test_default_now_uses_current_time time-bomb (use runtime clock)
- Fix header mutation on shared client (pass headers to request)
- Remove duplicate _make_override helper (use conftest factory)
- Evict expired overrides from dict on get_active_override
- Add max_length=2048 to LlmCalibrationRecord.rationale
- Use Field(default=()) for CalibrationSummaryResponse.records
- Extract _require_override_store helper to DRY controller
- Render None metrics as "not observed" in LLM prompt
- Add model tests for LlmCalibrationRecord and CollaborationOverride
- Add 503 API tests when override store not configured
- Add constructor validation tests for LlmCalibrationSampler
- Add CollaborationOverride._validate_expiration_ordering test
- Add behavioral strategy failure path test in _maybe_sample
- Add negative LLM score test and calibration-with-sampler API test
- Add frontend TypeScript types and collaboration endpoint module
- Fix tuple[str, float] to tuple[NotBlankStr, float] in model fields

Minor:
- Fix conftest make_collab_metric type annotation (NotBlankStr | None)
- Replace parametrized **kwargs pattern with explicit test methods to
  avoid dict[str, float | int] type incompatibility with _make_sampler
  keyword args.
- Override store: log eviction at INFO (state transition, not debug)
- Sampler: prune stale records on reads (get_calibration_records,
  get_drift_summary), not just on sample()
- Tracker: fire-and-forget sampling via asyncio.create_task with
  tracked task set — record_collaboration_event no longer blocks on
  the LLM round-trip
- Parametrize 503 override-store-not-configured API tests
- Make retention pruning test deterministic (monkeypatch datetime)
- Pass now=NOW to get_collaboration_score in expired-override test
- Frontend: add unwrapVoid helper, use it in clearOverride to validate
  response body
- Frontend: make SetOverrideRequest.expires_in_days optional (matches
  backend default=None)
- clear_override: check expiration before clearing — expired overrides
  return False and are silently evicted (not logged as CLEARED)
- Split _call_llm into _build_prompt + _parse_llm_response + _call_llm
  to keep each method under 50 lines
- Validate rationale is non-blank after stripping in _parse_llm_response
  — whitespace-only rationale now raises ValueError (caught by sample()
  as None) instead of hitting NotBlankStr in record construction
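The fire-and-forget change keeps `record_collaboration_event` off the LLM round-trip: the sampling coroutine runs as a background task, and the tracked task set holds a strong reference so the task is not garbage-collected mid-flight. A sketch of the pattern under assumed names (`Tracker.fire_and_forget` is illustrative):

```python
import asyncio


class Tracker:
    """Illustrative holder for fire-and-forget background tasks."""

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def fire_and_forget(self, coro) -> None:
        # Keep a strong reference so the event loop cannot GC the
        # task before it finishes; drop the reference once done.
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)


async def main() -> list[int]:
    tracker = Tracker()
    results: list[int] = []

    async def sample(n: int) -> None:
        await asyncio.sleep(0)  # stand-in for the LLM round-trip
        results.append(n)

    tracker.fire_and_forget(sample(1))
    # The caller is not blocked; here we drain the set only to observe
    # the result deterministically.
    await asyncio.gather(*tracker._tasks)
    return results
```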

Skipped 4 findings:
- applied_by in PERF_OVERRIDE_APPLIED log: internal user ID for audit,
  not PII
- Log before re-raising MemoryError/RecursionError: logging during OOM
  may fail; immediate re-raise is the codebase-wide pattern
- Replace create_task with TaskGroup/Queue: over-engineering for 1%
  sampling rate; provider rate limiter already bounds concurrency
- Split sample() further: 47 lines, under the 50-line limit
(The eviction logging mirrors the INFO-level PERF_OVERRIDE_EXPIRED log
already emitted by get_active_override, so both eviction paths produce
consistent structured logs for operational visibility.)
@Aureliolo Aureliolo force-pushed the feat/collaboration-scoring-enhancements branch from 7a2f9b4 to ca9a861 on March 16, 2026 07:34
@Aureliolo Aureliolo merged commit b3f3330 into main Mar 16, 2026
25 of 27 checks passed
@Aureliolo Aureliolo deleted the feat/collaboration-scoring-enhancements branch March 16, 2026 07:35
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview March 16, 2026 07:35 with GitHub Actions
Aureliolo added a commit that referenced this pull request Mar 16, 2026
🤖 I have created a release *beep* *boop*
---


## [0.2.8](v0.2.7...v0.2.8) (2026-03-16)


### Features

* add RRF rank fusion to memory ranking
([#478](#478))
([42242b5](42242b5))
* collaboration scoring enhancements — LLM sampling and human override
([#477](#477))
([b3f3330](b3f3330))


### Bug Fixes

* add .gitattributes to enforce LF line endings for Go files
([#483](#483))
([1b8c7b6](1b8c7b6))
* **cli:** Windows uninstall, update UX, health check, sigstore
([#476](#476))
([470ca72](470ca72))


### Refactoring

* **web:** extract WebSocket subscription into reusable composable
([#475](#475))
([96e6c46](96e6c46)),
closes [#351](#351)


### Maintenance

* bump hypothesis from 6.151.5 to 6.151.9 in the minor-and-patch group
([#482](#482))
([a7297d5](a7297d5))
* bump nginxinc/nginx-unprivileged from `aec540f` to `ccbac1a` in
/docker/web ([#479](#479))
([176e052](176e052))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>