
feat: Prometheus /metrics endpoint and OTLP exporter (#1122)#1135

Merged
Aureliolo merged 12 commits into main from feat/prometheus-otlp
Apr 8, 2026

Conversation

@Aureliolo (Owner) commented Apr 7, 2026

Summary

Adds Prometheus /metrics scrape endpoint and OTLP log exporter, closing the most significant telemetry gap (G1) for enterprise control-plane positioning.

Phase 1: Prometheus /metrics endpoint

  • GET /api/v1/metrics -- unauthenticated Prometheus scrape target
  • PrometheusCollector with dedicated CollectorRegistry (no global pollution)
  • Metric families:
    • synthorg_agents_total{status, trust_level} -- agent counts by status and trust level
    • synthorg_tasks_total{status, agent} -- task counts by lifecycle status and assigned agent
    • synthorg_cost_total -- total accumulated cost
    • synthorg_budget_used_percent / synthorg_budget_monthly_usd -- budget utilization
    • synthorg_coordination_efficiency / synthorg_coordination_overhead_percent -- push-updated by coordination collector
    • synthorg_security_evaluations_total{verdict} -- security verdict counter (bounded: allow/deny/escalate)
  • set_prometheus_collector() on AppState for deferred initialization
  • Returns 503 when collector not configured
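
The dedicated-registry pattern can be sketched with prometheus_client. The class below is an illustrative stand-in for PrometheusCollector, not the actual implementation; metric and method names are assumptions.

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest


class MetricsCollector:
    """Sketch: collector with its own registry (no global REGISTRY pollution)."""

    def __init__(self, prefix: str = "synthorg") -> None:
        self.registry = CollectorRegistry()
        self._agents_total = Gauge(
            f"{prefix}_agents_total",
            "Agent counts by status and trust level",
            labelnames=("status", "trust_level"),
            registry=self.registry,
        )

    def refresh(self, agents: list[tuple[str, str]]) -> None:
        # Clear first so (status, trust_level) pairs that disappeared
        # between scrapes do not keep reporting stale values.
        self._agents_total.clear()
        for status, trust in agents:
            self._agents_total.labels(status=status, trust_level=trust).inc()

    def scrape(self) -> bytes:
        # Prometheus text exposition format, usable as the /metrics body.
        return generate_latest(self.registry)
```

A dedicated CollectorRegistry keeps the scrape output limited to these families and avoids collisions with other libraries that register against the global default registry.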

Phase 2: OTLP exporter

  • SinkType.OTLP + OtlpProtocol enum (HTTP/protobuf only; gRPC rejected at init with NotImplementedError)
  • OtlpHandler -- background daemon thread with batched export
  • Maps structlog events to OTLP log records via self.format(record) (runs ProcessorFormatter + foreign_pre_chain) with correlation IDs (request_id, task_id, agent_id) as trace context
  • Configurable: endpoint, protocol, headers, export interval, batch size, timeout
  • SinkType.PROMETHEUS returns NullHandler (pull-based, no log handler needed)
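
A minimal stdlib sketch of the batched background-export pattern described above; the export callback and parameters are hypothetical, and the real OtlpHandler additionally builds OTLP payloads and POSTs them over HTTP.

```python
import logging
import queue
import threading


class BatchedExportHandler(logging.Handler):
    """Sketch: buffer records on a queue; a daemon thread exports in batches."""

    def __init__(self, export, batch_size: int = 100, interval: float = 5.0) -> None:
        super().__init__()
        self._export = export  # callable taking a list of formatted strings
        self._queue: queue.Queue = queue.Queue()
        self._batch_size = batch_size
        self._interval = interval
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def emit(self, record: logging.LogRecord) -> None:
        self._queue.put(record)  # never block the logging caller

    def _drain(self) -> list:
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self.format(self._queue.get_nowait()))
            except queue.Empty:
                break
        return batch

    def _run(self) -> None:
        while not self._stop.wait(self._interval):
            batch = self._drain()
            if batch:
                self._export(batch)

    def close(self) -> None:
        # Always drain remaining records on shutdown.
        self._stop.set()
        self._worker.join()
        batch = self._drain()
        if batch:
            self._export(batch)
        super().close()
```

Formatting through self.format() mirrors the PR's formatter-bypass fix: it lets a configured ProcessorFormatter and foreign_pre_chain run before export.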

Security hardening

  • CRLF validation on OTLP headers (header injection prevention)
  • SSRF: private/loopback IP rejection on OTLP endpoint
  • TLS warning for unencrypted HTTP with auth headers
  • /metrics always excluded from auth (even with custom auth.exclude_paths)
  • Security verdict label cardinality bounded to allow/deny/escalate
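
The CRLF and SSRF checks above can be sketched with the stdlib. These validator names are illustrative, not SynthOrg's actual config code, and a production SSRF guard would also need DNS-resolution checks.

```python
import ipaddress
from urllib.parse import urlparse


def validate_otlp_headers(headers: dict) -> None:
    """Reject CR/LF in header names or values (header-injection prevention)."""
    for name, value in headers.items():
        if any(c in name or c in value for c in ("\r", "\n")):
            raise ValueError(f"CR/LF not allowed in OTLP header: {name!r}")


def validate_otlp_endpoint(endpoint: str) -> None:
    """Reject private/loopback IP literals in the OTLP endpoint (SSRF guard)."""
    host = urlparse(endpoint).hostname
    if host is None:
        raise ValueError(f"OTLP endpoint has no host: {endpoint!r}")
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return  # hostname, not an IP literal; DNS checks are out of scope here
    if addr.is_private or addr.is_loopback:
        raise ValueError(f"OTLP endpoint uses private/loopback IP: {host}")
```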

Documentation

  • docs/architecture/acg-glossary.md -- bidirectional ACG-to-SynthOrg concept mapping
  • docs/design/operations.md -- /metrics endpoint (all 9 metric families with types/labels), OTLP sink, G1 status updated
  • CLAUDE.md -- metrics event constants, otlp_handler.py logging exception

Test plan

  • 15,616 tests pass (13 new tests added)
  • Unit tests for:
    • MetricsController (200/503 with body assertion)
    • PrometheusCollector (init, refresh, partial failure, all metric families, gauge clearing, coordination push, security verdicts with validation)
    • OtlpHandler (emit, batch-ready, drain, format via self.format(), gRPC rejection, export failure with mocked urllib, close-always-drains)
    • build_handler dispatch (PROMETHEUS NullHandler, OTLP delegation)
    • SinkConfig validation (Prometheus/OTLP fields, cross-type rejection, batch_size/timeout fields, CRLF headers, SSRF)

Review coverage

Pre-reviewed by 15 agents, 18 findings addressed. Post-PR reviewed by 14 local agents + 3 external reviewers (CodeRabbit, Copilot, Gemini), 21 additional findings addressed:

  • 4 CRITICAL: gRPC rejection, formatter bypass, stale gauges, auth exclusion
  • 8 MAJOR: unused OTEL deps, verdict cardinality, test determinism, CRLF/SSRF/TLS, setter logs, docs G1 status, budget metric text
  • 5 MEDIUM: 503 body assertion, /metrics docs, OTLP constants, task-by-agent gauge
  • 2 MINOR: docstring push/pull model, lock semantics
  • 1 INFRA: pre-push hook worker crashes (-n 4 for full-suite runs)

Deferred to #1148: daily budget %, per-agent budget, per-agent task count infrastructure (requires time-windowed and per-agent CostTracker queries).

Closes #1122
Closes #1124

Copilot AI review requested due to automatic review settings April 7, 2026 23:45
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview April 7, 2026 23:46 — with GitHub Actions Inactive

coderabbitai bot commented Apr 7, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1c0623f6-67ea-4ece-92af-70e5253b6a0a

📥 Commits

Reviewing files that changed from the base of the PR and between 68f1fa4 and 9cc33ab.

📒 Files selected for processing (1)
  • src/synthorg/api/app.py

Walkthrough

Adds a Prometheus scrape endpoint (GET /metrics) via a new Litestar MetricsController, a PrometheusCollector for in-memory metrics, and deferred wiring/support for a Prometheus collector in AppState and app startup. Introduces OTLP export: new OtlpHandler, OtlpProtocol enum, SinkType.OTLP, OTLP config fields and validation (HTTP/JSON only, endpoint safety checks), and OTLP sink builder. build_handler gains PROMETHEUS/OTLP dispatch branches. Documentation (design spec, ACG glossary) and extensive unit tests for metrics, OTLP, config, and enums were added.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 37.98%, which is below the required threshold of 40.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Title check — The PR title 'feat: Prometheus /metrics endpoint and OTLP exporter' clearly describes the main changes and aligns with the primary objectives of adding both the Prometheus metrics endpoint and OTLP export capabilities.
  • Description check — The PR description provides a comprehensive overview of both Phase 1 (Prometheus metrics) and Phase 2 (OTLP exporter) implementations, detailing metric families, security measures, and documentation updates that directly relate to the changeset.
  • Linked Issues check — The PR fulfills requirements from #1122 by implementing the Prometheus /metrics endpoint with the specified metric families, the OTLP exporter with configurable options, and security hardening (CRLF/SSRF/TLS), and from #1124 by creating the ACG-to-SynthOrg glossary and updating design documentation.
  • Out of Scope Changes check — All major changes align with objectives: new metrics endpoint and collector (Phase 1), OTLP handler and configuration (Phase 2), documentation updates, and comprehensive test coverage. Minor changes include logging level adjustments for startup events and updates to CLAUDE.md conventions, which support the core implementation.





@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements comprehensive telemetry and observability enhancements, including a new Prometheus metrics scrape endpoint and an OTLP log exporter. It introduces an ACG Glossary to map architectural concepts to the Agentic Computation Graph formalism and updates the application state to manage business metrics such as costs, budget utilization, and agent/task statuses. Review feedback identifies a critical issue where the custom OTLP handler ignores protocol configurations and hardcodes paths, and it suggests clearing Prometheus gauges during refreshes to prevent stale data reporting.

Comment on lines +194 to +195
url = self._endpoint.rstrip("/") + "/v1/logs"
request = urllib.request.Request(url, data=body, method="POST") # noqa: S310

Severity: high

The OtlpHandler implementation currently ignores the protocol configuration and always uses HTTP with a JSON payload. This contradicts the OtlpProtocol enum which includes GRPC and HTTP_PROTOBUF (the default). Additionally, the code hardcodes the /v1/logs path, which may lead to incorrect URLs if the user provides a full signal-specific endpoint. Since opentelemetry-exporter-otlp-proto-http is already a dependency, consider using the official OTLPLogExporter to support multiple protocols and encodings correctly.

Comment on lines +209 to +213
for (status, trust), count in counts.items():
    self._agents_total.labels(
        status=status,
        trust_level=trust,
    ).set(count)

Severity: medium

Gauges in prometheus_client persist their last value until explicitly updated or cleared. In _refresh_agent_metrics, if an agent status or trust level combination is no longer present in the agents list, its corresponding gauge will remain at its previous value instead of dropping to zero. This results in stale and incorrect metrics. You should clear the gauge before updating it with the current snapshot.

Suggested change (clear the gauge before repopulating):

self._agents_total.clear()
for (status, trust), count in counts.items():
    self._agents_total.labels(
        status=status,
        trust_level=trust,
    ).set(count)

Comment on lines +230 to +231
for status, count in status_counts.items():
    self._tasks_total.labels(status=status).set(count)

Severity: medium

Similar to the agent metrics, task status gauges will retain stale values if a particular status is no longer present in the tasks list. The gauge should be cleared before the update loop to ensure it accurately reflects the current state.

Suggested change (clear before the update loop):

self._tasks_total.clear()
for status, count in status_counts.items():
    self._tasks_total.labels(status=status).set(count)


github-actions bot commented Apr 7, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA 9cc33ab.
Ensure that dependencies are being submitted on PR branches. Re-running this action after a short time may resolve the issue. See the documentation for more information and troubleshooting advice.

OpenSSF Scorecard

Package                  Version   Score     Details
pip/importlib-metadata   8.7.1     Unknown   Unknown
pip/protobuf             5.29.6    Unknown   Unknown

Scanned Files

  • uv.lock


codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 96 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.87%. Comparing base (af11f0a) to head (9cc33ab).
⚠️ Report is 7 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines                              Patch %   Lines
src/synthorg/observability/config.py                  61.61%    27 missing, 11 partials
src/synthorg/observability/otlp_handler.py            75.22%    21 missing, 7 partials
src/synthorg/observability/prometheus_collector.py    85.00%    13 missing, 2 partials
src/synthorg/api/app.py                               44.44%    4 missing, 1 partial
src/synthorg/api/controllers/metrics.py               82.14%    5 missing
src/synthorg/api/state.py                             75.00%    4 missing, 1 partial
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1135      +/-   ##
==========================================
- Coverage   88.98%   88.87%   -0.11%     
==========================================
  Files         851      855       +4     
  Lines       49633    50014     +381     
  Branches     4977     5029      +52     
==========================================
+ Hits        44164    44450     +286     
- Misses       4533     4607      +74     
- Partials      936      957      +21     


Copilot AI left a comment

Pull request overview

Adds first-class observability primitives to SynthOrg by introducing a Prometheus scrape endpoint and an OTLP sink option, along with supporting configuration/enums, tests, and documentation updates.

Changes:

  • Add /api/v1/metrics endpoint (unauthenticated) plus a PrometheusCollector with an isolated CollectorRegistry.
  • Add new sink types (PROMETHEUS, OTLP), OTLP configuration fields, and an OtlpHandler implementation.
  • Add unit tests and docs (including an ACG glossary and ops design updates) to cover/describe the new telemetry surfaces.

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 12 comments.

File — Description
uv.lock Adds Prometheus/OpenTelemetry dependencies and updates transitive pins.
pyproject.toml Declares new Prometheus/OpenTelemetry dependencies.
src/synthorg/observability/sinks.py Routes SinkType.PROMETHEUS to a NullHandler; adds OTLP handler builder dispatch.
src/synthorg/observability/prometheus_collector.py Implements collector + scrape-time refresh + push-updated coordination/security counters.
src/synthorg/observability/otlp_handler.py Implements OTLP log shipper via background thread and HTTP POST.
src/synthorg/observability/events/metrics.py Adds metrics/OTLP event constants.
src/synthorg/observability/enums.py Adds SinkType.PROMETHEUS, SinkType.OTLP, and OtlpProtocol.
src/synthorg/observability/config.py Adds Prometheus/OTLP sink config fields + validation/rejection rules.
src/synthorg/api/state.py Adds prometheus_collector storage + setter/getter on AppState.
src/synthorg/api/controllers/metrics.py Adds metrics controller returning Prometheus exposition output (200/503).
src/synthorg/api/controllers/__init__.py Registers MetricsController in ALL_CONTROLLERS.
src/synthorg/api/app.py Excludes /metrics from auth middleware by default.
tests/unit/api/controllers/test_metrics.py Adds endpoint tests for 200/503 behavior and content type.
tests/unit/observability/test_prometheus_collector.py Adds refresh/init/security/coordination/prefix tests for the collector.
tests/unit/observability/test_otlp_handler.py Adds core behavior + factory + protocol tests for OTLP handler.
tests/unit/observability/test_sinks.py Tests build_handler dispatch for PROMETHEUS/OTLP.
tests/unit/observability/test_config.py Adds validation tests for Prometheus/OTLP sink config.
tests/unit/observability/test_enums.py Tests new enums.
tests/unit/observability/test_events.py Includes metrics in domain discovery test.
tests/unit/observability/conftest.py Extends SinkConfigFactory defaults for new fields.
docs/design/operations.md Documents metrics endpoint + OTLP sink; marks G1 as closed.
docs/DESIGN_SPEC.md Links the new ACG glossary doc.
docs/architecture/acg-glossary.md Adds bidirectional ACG↔SynthOrg concept mapping.
CLAUDE.md Updates logging guidance to allow otlp_handler.py stdlib logging/print.


Comment on lines +201 to +213
agents = await app_state.agent_registry.list_active()
counts: Counter[tuple[str, str]] = Counter()
for agent in agents:
    status = str(agent.status)
    trust = str(agent.tools.access_level)
    counts[(status, trust)] += 1
# Only set gauges for observed (status, trust_level) pairs;
# unobserved pairs remain at their previous value.
for (status, trust), count in counts.items():
    self._agents_total.labels(
        status=status,
        trust_level=trust,
    ).set(count)
Copilot AI Apr 7, 2026

_refresh_agent_metrics() only sets gauges for label pairs observed in the current scrape, which means label series for agents that disappear (or change status/trust_level) will retain their previous non-zero values indefinitely. This produces stale counts in Prometheus. Clear the labeled gauge (e.g., self._agents_total.clear()) before repopulating, or explicitly set missing label pairs to 0 each refresh.

tasks, _ = await app_state.task_engine.list_tasks()
status_counts: Counter[str] = Counter()
for task in tasks:
    status_counts[str(task.status)] += 1
Copilot AI Apr 7, 2026

Similar to agents, _refresh_task_metrics() only updates label values for statuses present in the current tasks list. If a status disappears between scrapes, its previous gauge value remains, causing incorrect task totals. Consider calling self._tasks_total.clear() before setting counts (or explicitly zeroing missing statuses).

Suggested change (add a clear after the counting loop):

    status_counts[str(task.status)] += 1
self._tasks_total.clear()

Comment on lines +113 to +122
def record_security_verdict(self, verdict: str) -> None:
    """Increment the security verdict counter.

    Called by a thin hook around ``SecOpsService.evaluate_pre_tool()``.

    Args:
        verdict: The verdict string (e.g. ``"allow"``, ``"deny"``).
    """
    self._security_evaluations.labels(verdict=verdict).inc()
Copilot AI Apr 7, 2026

record_security_verdict() accepts an arbitrary verdict: str and uses it as a Prometheus label value. If the caller ever passes unexpected values (typos, exception messages, etc.), this can create unbounded label cardinality. Since SynthOrg already has a bounded verdict set (allow/deny/escalate), validate against that set (or accept SecurityVerdictType and use .value).
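
One way to bound the cardinality as the comment suggests is to validate against the known verdict set before labeling. This sketch assumes a prometheus_client-style counter and is not the actual SynthOrg code.

```python
_ALLOWED_VERDICTS = frozenset({"allow", "deny", "escalate"})


def record_security_verdict(counter, verdict: str) -> None:
    """Increment the verdict counter, rejecting values outside the bounded set."""
    if verdict not in _ALLOWED_VERDICTS:
        raise ValueError(f"unknown security verdict: {verdict!r}")
    counter.labels(verdict=verdict).inc()
```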

Comment on lines +174 to +188
async def _refresh_budget_metrics(self, app_state: AppState) -> None:
    """Update budget utilization gauges from CostTracker."""
    if not app_state.has_cost_tracker:
        return
    try:
        tracker = app_state.cost_tracker
        if tracker.budget_config is None:
            return
        monthly = tracker.budget_config.total_monthly
        self._budget_monthly_usd.set(monthly)
        if monthly > 0:
            total = await tracker.get_total_cost()
            self._budget_used_percent.set(
                min(100.0, (total / monthly) * 100.0),
            )
Copilot AI Apr 7, 2026

Prometheus budget metrics currently compute budget_used_percent using tracker.get_total_cost() without a time window. This doesn’t necessarily represent monthly spend (and can diverge from the ‘monthly’ wording in the metric/help text and docs). Consider computing month-to-date spend explicitly (e.g., start at UTC month boundary) or adjusting the metric name/help text to match the actual aggregation period used by CostTracker.

Comment on lines +228 to +257
@property
def has_prometheus_collector(self) -> bool:
    """Check whether the Prometheus collector is configured."""
    return self._prometheus_collector is not None

@property
def prometheus_collector(self) -> PrometheusCollector:
    """Return Prometheus collector or raise 503."""
    return self._require_service(
        self._prometheus_collector,
        "prometheus_collector",
    )

def set_prometheus_collector(
    self,
    collector: PrometheusCollector,
) -> None:
    """Set the Prometheus collector (deferred initialisation).

    Args:
        collector: Fully configured Prometheus collector.

    Raises:
        RuntimeError: If the collector was already configured.
    """
    if self._prometheus_collector is not None:
        msg = "Prometheus collector already configured"
        logger.error(API_APP_STARTUP, error=msg)
        raise RuntimeError(msg)
    self._prometheus_collector = collector
Copilot AI Apr 7, 2026

The collector is never configured anywhere in the app factory/startup path (no callers of set_prometheus_collector() exist), so /api/v1/metrics will always return 503 in the default production wiring. If the intention is to enable metrics by default, initialize PrometheusCollector during app startup (or behind config) and call app_state.set_prometheus_collector(...); otherwise, the docs should clearly state the additional wiring required.

Comment on lines +177 to +206
# OTLP JSON format: wrap in resourceLogs envelope
payload = {
    "resourceLogs": [
        {
            "resource": {"attributes": []},
            "scopeLogs": [
                {
                    "scope": {"name": "synthorg"},
                    "logRecords": log_records,
                },
            ],
        },
    ],
}
body = json.dumps(payload).encode()

# Use /v1/logs path for OTLP HTTP JSON
url = self._endpoint.rstrip("/") + "/v1/logs"
request = urllib.request.Request(url, data=body, method="POST")  # noqa: S310
request.add_header("Content-Type", "application/json")
for name, value in self._extra_headers.items():
    request.add_header(name, value)

try:
    with urllib.request.urlopen(  # noqa: S310
        request,
        timeout=self._timeout,
    ):
        pass
except Exception as exc:
Copilot AI Apr 7, 2026

The OTLP HTTP request is sent with Content-Type: application/json, but the configured protocol is named HTTP_PROTOBUF, and the payload shape uses ad-hoc dicts (e.g., attributes is a plain dict) rather than a protobuf/protojson ExportLogsServiceRequest encoding. This is very likely incompatible with standard OTLP collectors. Prefer using the official OpenTelemetry OTLP exporters (already added as dependencies) to build and send compliant requests, or rename/restrict this to the actually supported encoding.

Comment on lines +219 to +234
def test_export_failure_increments_dropped_count(self) -> None:
    handler = _make_handler(batch_size=1)
    try:
        handler.emit(_make_record("will fail"))
        # Manually drain and try to export (will fail because no server)
        records: list[logging.LogRecord] = []
        while True:
            try:
                records.append(handler._queue.get_nowait())
            except queue.Empty:
                break
        if records:
            handler._export_batch(records)
        with handler._pending_lock:
            assert handler._dropped_count > 0
    finally:
Copilot AI Apr 7, 2026

test_export_failure_increments_dropped_count performs a real network call to http://localhost:4318 via _export_batch(). This makes the unit test environment-dependent (it can fail if something is listening on that port, and may be slow/flaky depending on network stack behavior). Mock urllib.request.urlopen (or the handler’s export method) to deterministically simulate failures/success.
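
A deterministic version of that failure path can patch urllib.request.urlopen, as the comment suggests. The helper below is a stand-in for the handler's export step, not the real _export_batch.

```python
import urllib.error
import urllib.request
from unittest import mock


def export_via_urllib(url: str, body: bytes, timeout: float) -> None:
    """Stand-in for the handler's HTTP export step."""
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=timeout):
        pass


# Simulate a failing collector without touching the network:
with mock.patch(
    "urllib.request.urlopen",
    side_effect=urllib.error.URLError("connection refused"),
):
    try:
        export_via_urllib("http://localhost:4318/v1/logs", b"{}", 1.0)
        failed = False
    except urllib.error.URLError:
        failed = True  # the real handler would increment _dropped_count here
```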

Comment on lines +189 to +203
otlp_export_interval_seconds: float = Field(
    default=5.0,
    gt=0,
    description="Seconds between OTLP export batches",
)
otlp_batch_size: int = Field(
    default=100,
    gt=0,
    description="Records per OTLP export batch",
)
otlp_timeout_seconds: float = Field(
    default=10.0,
    gt=0,
    description="HTTP request timeout in seconds for OTLP export",
)
Copilot AI Apr 7, 2026

SinkConfig introduces _DEFAULT_OTLP_EXPORT_INTERVAL, but OTLP defaults are still duplicated as literals (e.g., default=5.0 on the field, and magic numbers 100/10.0 inside _reject_otlp_fields). This makes future default changes error-prone. Consider defining constants for all OTLP defaults (export interval, batch size, timeout, protocol) and reusing them in both Field(default=...) and rejection checks.

| Endpoint | Purpose |
|----------|---------|
| `/api/v1/health` | Health check, readiness |
| `/api/v1/metrics` | Prometheus metrics scrape endpoint (unauthenticated). Exposes: agent counts by status, cost totals, budget utilization, coordination efficiency/overhead, security verdict counts. Refreshed per-scrape from in-memory services. |
Copilot AI Apr 7, 2026

This update marks G1 as Closed and states the /api/v1/metrics endpoint exposes coordination/security verdict metrics. However, the current codebase doesn’t wire PrometheusCollector into app startup (so the endpoint returns 503 by default), and there are no call sites that push coordination metrics or record security verdicts. Either add the missing wiring/hooks or soften this doc claim to reflect what’s actually enabled in Phase 1.

pyproject.toml Outdated
Comment on lines +26 to +28
"opentelemetry-api==1.34.1",
"opentelemetry-exporter-otlp-proto-http==1.34.1",
"opentelemetry-sdk==1.34.1",
Copilot AI Apr 7, 2026

New dependencies add OpenTelemetry packages (opentelemetry-api, opentelemetry-sdk, opentelemetry-exporter-otlp-proto-http), but the implementation in src/synthorg/observability/otlp_handler.py does not import/use them (it uses urllib with a custom JSON payload instead). To keep the dependency footprint justified, either switch to the official OTLP exporter implementation or remove the unused OpenTelemetry dependencies.

Suggested change (delete the three unused dependencies):

- "opentelemetry-api==1.34.1",
- "opentelemetry-exporter-otlp-proto-http==1.34.1",
- "opentelemetry-sdk==1.34.1",

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/synthorg/api/app.py (1)

1204-1221: ⚠️ Potential issue | 🟠 Major

Always append /metrics to auth exclusions, even with custom exclude_paths.

Right now /metrics is only included in the default branch. If users provide custom auth.exclude_paths and omit it, Prometheus scraping silently breaks.

Suggested fix
 def _build_auth_exclude_paths(
     auth: AuthConfig,
     prefix: str,
     ws_path: str,
 ) -> tuple[str, ...]:
     """Compute auth middleware exclude paths with fail-safe defaults."""
     setup_status_path = f"^{prefix}/setup/status$"
+    metrics_path = f"^{prefix}/metrics$"
     exclude_paths = (
         auth.exclude_paths
         if auth.exclude_paths is not None
         else (
             f"^{prefix}/health$",
-            f"^{prefix}/metrics$",
+            metrics_path,
             "^/docs",
             "^/api$",
             f"^{prefix}/auth/setup$",
             f"^{prefix}/auth/login$",
             setup_status_path,
         )
     )
+    if metrics_path not in exclude_paths:
+        exclude_paths = (*exclude_paths, metrics_path)
     if setup_status_path not in exclude_paths:
         exclude_paths = (*exclude_paths, setup_status_path)
     if ws_path not in exclude_paths:
         exclude_paths = (*exclude_paths, ws_path)
     return exclude_paths
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/api/app.py` around lines 1204 - 1221, The current exclude_paths
logic only adds the metrics path in the default branch, so when
auth.exclude_paths is provided the Prometheus path f"^{prefix}/metrics$" may be
omitted; update the function that builds exclude_paths (referencing
auth.exclude_paths, exclude_paths, prefix, ws_path, setup_status_path) to always
ensure the metrics regex f"^{prefix}/metrics$" is present—after initializing
exclude_paths, check for that metrics entry and append it if missing (similar to
how setup_status_path and ws_path are handled) so custom auth.exclude_paths
cannot accidentally drop /metrics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Around line 26-28: The project currently accepts OtlpProtocol.GRPC but never
uses it in OtlpHandler, causing runtime failures; either add the missing gRPC
exporter dependency "opentelemetry-exporter-otlp-proto-grpc==1.34.1" to
pyproject.toml and update OtlpHandler to branch on the protocol and use the gRPC
exporter when OtlpProtocol.GRPC is selected, or enforce only HTTP by modifying
_validate_otlp_fields() to reject non-HTTP protocols (raise a clear
configuration error referencing OtlpProtocol and OtlpHandler) so the system
never attempts to export over unsupported gRPC.

In `@src/synthorg/api/state.py`:
- Around line 241-257: The set_prometheus_collector method currently logs only
on error but should mirror other setters by emitting a debug log on successful
configuration; after assigning self._prometheus_collector = collector in
set_prometheus_collector, add a logger.debug call using the same API_APP_STARTUP
event (e.g., logger.debug(API_APP_STARTUP, note="Prometheus collector
configured")) so successful initialization is recorded consistently with
set_review_gate_service and set_approval_timeout_scheduler.

In `@src/synthorg/observability/enums.py`:
- Around line 52-61: The OtlpProtocol enum advertises GRPC but the export path
(_export_batch) only implements HTTP, so add a guard in OtlpHandler.__init__ to
reject unimplemented GRPC: import/use OtlpProtocol and if self._protocol ==
OtlpProtocol.GRPC raise NotImplementedError("gRPC transport not implemented;
please use HTTP_PROTOBUF") so callers using OtlpProtocol.GRPC fail fast
(alternatively remove GRPC from OtlpProtocol if you prefer to hide it).

In `@src/synthorg/observability/otlp_handler.py`:
- Around line 103-129: The handler currently bypasses the configured formatter
and foreign_pre_chain by using record.getMessage() inside _format_as_otlp_dict;
update _format_as_otlp_dict (and the analogous method around lines 266-276) to
call self.format(record) to produce the formatted message (so the
ProcessorFormatter and foreign_pre_chain run) and use that formatted string as
the "body" in the returned OTLP dict; ensure you still read
severity_number/severity_text/time_unix_nano/attributes from the record but
derive the log body via self.format(record) instead of record.getMessage().
- Around line 61-73: The constructor stores self._protocol but the class only
implements HTTP JSON export; update the code to make the interface consistent by
validating and rejecting unsupported protocols: in the constructor (where
_protocol is set) check if _protocol is OtlpProtocol.HTTP_PROTOBUF and raise a
clear ValueError if not, and update the config validation (otlp_protocol
handling) to match this restriction; alternatively, if you want to support GRPC,
implement GRPC export paths inside _export_batch, _send_* helpers and adjust
content-type/path accordingly — but do not leave _protocol unused; reference the
_protocol field, the _export_batch method, and the otlp_protocol config
validation when making the change.

In `@src/synthorg/observability/prometheus_collector.py`:
- Around line 196-213: In _refresh_agent_metrics, clear the existing Gauge
label series before repopulating so stale (status, trust_level) combinations
are removed: call the clear/remove API on self._agents_total before iterating
and setting counts. Make the same change in the analogous task refresh
function by clearing self._tasks_total prior to repopulation, so removed label
combinations no longer appear in /metrics.
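The clear-then-repopulate pattern can be sketched with a tiny stand-in for a labelled gauge (prometheus_client metrics expose an equivalent `clear()` on the metric family); the class and function names here are illustrative, not the project's:

```python
class LabelledGaugeStandIn:
    """Tiny stand-in for a labelled prometheus_client Gauge (illustrative)."""

    def __init__(self) -> None:
        self._series: dict[tuple[str, str], float] = {}

    def clear(self) -> None:
        self._series.clear()

    def set(self, status: str, trust_level: str, value: float) -> None:
        self._series[(status, trust_level)] = value

    def samples(self) -> dict[tuple[str, str], float]:
        return dict(self._series)

agents_total = LabelledGaugeStandIn()

def refresh_agent_metrics(counts: dict[tuple[str, str], int]) -> None:
    # Clearing first is the whole fix: any (status, trust_level) combination
    # missing from `counts` no longer lingers as a stale series.
    agents_total.clear()
    for (status, trust_level), n in counts.items():
        agents_total.set(status, trust_level, n)

refresh_agent_metrics({("active", "high"): 3, ("idle", "low"): 1})
refresh_agent_metrics({("active", "high"): 2})
print(agents_total.samples())  # {('active', 'high'): 2}
```

Without the `clear()`, the second refresh would still export `("idle", "low")` at its old value.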

In `@tests/unit/api/controllers/test_metrics.py`:
- Around line 71-74: Update test_returns_503_when_collector_not_configured to
also assert the response body from GET "/metrics" so the error message stays
stable. After creating the TestClient with _make_app(collector=None) and
checking resp.status_code == 503, assert that resp.content (or
resp.data/resp.text, as appropriate) equals the expected message
b"# No metrics collector configured\n" (or its decoded string) to ensure the
controller returns the specific error body.
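Reduced to its essentials, the assertion the comment asks for looks like this; `metrics_endpoint` is a hypothetical stand-in for the controller, not the project's actual code:

```python
def metrics_endpoint(collector) -> tuple[int, bytes]:
    """Hypothetical stand-in for the controller's unconfigured branch."""
    if collector is None:
        return 503, b"# No metrics collector configured\n"
    return 200, b""  # the real controller would render the registry here

status, body = metrics_endpoint(None)
# Pinning the body as well as the status means a reworded error message
# fails this test loudly instead of slipping through unnoticed.
assert status == 503
assert body == b"# No metrics collector configured\n"
```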

In `@tests/unit/observability/test_otlp_handler.py`:
- Around line 80-88: The tests race the live flusher thread against a real
HTTP exporter. Make them deterministic by stubbing the exporter and preventing
the flusher from starting in the test harness: update _make_handler (and the
tests test_batch_ready_signal and test_export_failure_increments_dropped_count)
to accept a test-only flag that disables the background flusher, or to inject
a no-op/stub exporter. In timing-sensitive assertions, mock time.monotonic()
and asyncio.sleep() so batch-ready signals and queue state advance
deterministically. Apply the same change to the other affected tests (around
lines 219-235) so they no longer depend on a real localhost:4318 or a racing
flusher thread.
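The injectable-exporter shape can be sketched as follows; this is a minimal model under assumptions (class names, the `start_flusher` flag, and the batch trigger are hypothetical, standing in for the real handler's threaded flusher):

```python
import queue

class StubExporter:
    """Records batches instead of POSTing to localhost:4318 (a test double)."""

    def __init__(self) -> None:
        self.batches: list[list[dict]] = []

    def export(self, batch: list[dict]) -> None:
        self.batches.append(batch)

class HandlerUnderTest:
    """Sketch of a handler with an injectable exporter and optional flusher."""

    def __init__(self, exporter, batch_size: int = 2,
                 start_flusher: bool = True) -> None:
        self._exporter = exporter
        self._batch_size = batch_size
        self._queue: queue.SimpleQueue = queue.SimpleQueue()
        if start_flusher:
            pass  # the real handler would start its daemon thread here

    def emit(self, record: dict) -> None:
        self._queue.put(record)
        if self._queue.qsize() >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        batch = []
        while not self._queue.empty():
            batch.append(self._queue.get())
        if batch:
            self._exporter.export(batch)

exporter = StubExporter()
handler = HandlerUnderTest(exporter, batch_size=2, start_flusher=False)
handler.emit({"body": "a"})
handler.emit({"body": "b"})  # batch-ready: flushed synchronously, no race
print(exporter.batches)  # [[{'body': 'a'}, {'body': 'b'}]]
```

With the flusher disabled and the exporter stubbed, the batch-ready path runs synchronously on the test thread, so there is nothing left to race.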

---

Outside diff comments:
In `@src/synthorg/api/app.py`:
- Around line 1204-1221: The current exclude_paths logic only adds the metrics
path in the default branch, so when auth.exclude_paths is provided, the
Prometheus path f"^{prefix}/metrics$" may be omitted. Update the function that
builds exclude_paths (referencing auth.exclude_paths, exclude_paths, prefix,
ws_path, and setup_status_path) to always ensure the metrics regex
f"^{prefix}/metrics$" is present: after initializing exclude_paths, check for
the metrics entry and append it if missing (similar to how setup_status_path
and ws_path are handled), so custom auth.exclude_paths cannot accidentally
drop /metrics.

📥 Commits

Reviewing files that changed from the base of the PR and between bd0929e and fb7e361.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
  • CLAUDE.md
  • docs/DESIGN_SPEC.md
  • docs/architecture/acg-glossary.md
  • docs/design/operations.md
  • pyproject.toml
  • src/synthorg/api/app.py
  • src/synthorg/api/controllers/__init__.py
  • src/synthorg/api/controllers/metrics.py
  • src/synthorg/api/state.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/enums.py
  • src/synthorg/observability/events/metrics.py
  • src/synthorg/observability/otlp_handler.py
  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/observability/sinks.py
  • tests/unit/api/controllers/test_metrics.py
  • tests/unit/observability/conftest.py
  • tests/unit/observability/test_config.py
  • tests/unit/observability/test_enums.py
  • tests/unit/observability/test_events.py
  • tests/unit/observability/test_otlp_handler.py
  • tests/unit/observability/test_prometheus_collector.py
  • tests/unit/observability/test_sinks.py

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic must import logger via `from synthorg.observability import get_logger` and initialize with `logger = get_logger(__name__)`

Applied to files:

  • CLAUDE.md
  • src/synthorg/api/state.py
  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Use structured logging: always `logger.info(EVENT, key=value)` — never `logger.info("msg %s", val)`

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Engine: Agent orchestration, execution loops, parallel execution, task decomposition, routing, task assignment, centralized single-writer task state engine (TaskEngine), task lifecycle, recovery, shutdown, workspace isolation, coordination (multi-agent pipeline: TopologyDispatcher protocol, 4 dispatchers — SAS/centralized/decentralized/context-dependent, wave execution, workspace lifecycle integration, CoordinationSectionConfig company config bridge, build_coordinator factory), coordination error classification, prompt policy validation, checkpoint recovery (checkpoint/, per-turn persistence, heartbeat detection, CheckpointRecoveryStrategy), approval gate (escalation detection, context parking/resume, EscalationInfo/ResumePayload models), stagnation detection (stagnation/, StagnationDetector protocol, ToolRepetitionDetector, dual-signal analysis, corrective prompt injection), agent runtime state (AgentRuntimeState, lightweight per-agent execution status for dashboard queries and recove...

Applied to files:

  • docs/architecture/acg-glossary.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/api/**/*.py : API package (api/): Litestar REST + WebSocket with controllers, guards, channels, JWT + API key + WS ticket auth, approval gate integration, coordination endpoint, collaboration endpoint, settings endpoint, provider management endpoint (CRUD + test + presets), backup endpoint, RFC 9457 structured errors, AppState hot-reload slots, service auto-wiring (Phase 1 at construction, Phase 2 on startup), lifecycle helpers

Applied to files:

  • src/synthorg/api/controllers/__init__.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from synthorg.observability.events domain-specific modules (e.g., PROVIDER_CALL_START from events.provider). Import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT.

Applied to files:

  • src/synthorg/api/state.py
  • src/synthorg/observability/events/metrics.py
📚 Learning: 2026-03-20T08:28:32.845Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T08:28:32.845Z
Learning: Applies to src/synthorg/providers/**/*.py : Providers: LLM provider abstraction (LiteLLM adapter), auth types (api_key/oauth/custom_header/none), presets (PROVIDER_PRESETS), runtime CRUD (ProviderManagementService with asyncio.Lock serialization), hot-reload via AppState swap.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Security: SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies: disabled/weighted/per-category/milestone), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume).

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.

Applied to files:

  • src/synthorg/observability/events/metrics.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic must have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`

Applied to files:

  • src/synthorg/observability/otlp_handler.py
🔇 Additional comments (23)
docs/DESIGN_SPEC.md (1)

35-35: LGTM! Clean addition of the ACG Glossary reference.

The new Supporting Pages entry correctly links to the newly added glossary document and provides a clear description. The table formatting is consistent with existing entries.

docs/architecture/acg-glossary.md (6)

52-66: Reverse mapping table is consistent.

The SynthOrg → ACG table correctly mirrors the forward mapping established in the earlier tables. All entries are bidirectionally consistent.


69-80: Well-documented SynthOrg extensions.

The extensions table clearly identifies features beyond the ACG formalism, providing helpful context for readers comparing the two frameworks. The module references and descriptions are clear.


82-99: Clear explanation of partial fidelity areas.

The differences section helpfully explains where SynthOrg implements ACG concepts through different abstractions. The forward-looking recommendation for a NodeType enum (lines 90-91) is appropriate documentation of a potential future enhancement.


20-28: Update event domain count on line 24—actual count is 78, not 82+.

The "82+ event constant domains" claim is inaccurate. A search of src/synthorg/observability/events/ reveals exactly 78 event modules. The "There is no Node type" claim on line 25 is correct—no Node class exists in the engine.

Update the table with the accurate event domain count: 78 event constant domains or similar wording to reflect the real codebase structure.

> Likely an incorrect or invalid review comment.

40-45: Both claims in the glossary table are accurate.

The verification confirms:

  1. BudgetEnforcer "3-layer" (line 43): The class docstring explicitly states "Budget enforcement: pre-flight, in-flight, and auto-downgrade"—three distinct enforcement layers.
  2. "hybrid->plan_execute at 80%" (line 44): The budget_tight_threshold field defaults to 80 (percent) in the loop selector, triggering budget-aware downgrade behavior at this utilization level.

Both facts in the glossary table are supported by the codebase implementation.


8-12: Both references in the glossary are valid: arXiv paper 2603.22386 exists ("From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents"), and the evaluation document exists at docs/research/acg-formalism-evaluation.md.

docs/design/operations.md (1)

1295-1295: Telemetry documentation updates look consistent and actionable.

These additions clearly close the documented G1 gap and make /api/v1/metrics + OTLP sink behavior discoverable in the operations spec.

Also applies to: 1539-1539, 1839-1839

tests/unit/observability/test_events.py (1)

209-209: Good update to event-domain discovery coverage.

Adding metrics to the expected module set keeps the discovery assertion aligned with the new observability domain.

src/synthorg/api/controllers/__init__.py (1)

25-25: Controller wiring is consistent.

MetricsController is imported, registered in ALL_CONTROLLERS, and exported in __all__, so default app registration is complete.

Also applies to: 53-53, 109-109

tests/unit/observability/test_sinks.py (1)

447-477: Nice dispatch coverage for new sink types.

These tests validate both pull-based Prometheus no-op handling and OTLP builder delegation in build_handler.

CLAUDE.md (1)

91-93: Guidance update is aligned with OTLP/metrics additions.

The logging exception allowlist and metrics event examples are now consistent with the new observability components.

tests/unit/observability/test_enums.py (1)

7-12: Enum coverage update looks correct.

The new assertions properly capture the expanded SinkType set and the OtlpProtocol values.

Also applies to: 67-81, 137-152

src/synthorg/observability/enums.py (1)

40-49: LGTM!

The PROMETHEUS and OTLP sink type additions are well-documented with clear attribute descriptions distinguishing pull-based (Prometheus) vs push-based (OTLP) semantics.

src/synthorg/observability/sinks.py (1)

386-397: LGTM!

The dispatch logic correctly handles both new sink types:

  • PROMETHEUS returns a NullHandler since it's pull-based (metrics served via the /metrics endpoint, not a log handler)
  • OTLP delegates to build_otlp_handler following the same pattern as SYSLOG and HTTP sinks

Both appropriately return early, bypassing _attach_formatter_and_routing, which isn't applicable to these sink types.

tests/unit/observability/conftest.py (1)

51-57: LGTM!

The factory correctly adds all new Prometheus and OTLP configuration fields with defaults matching SinkConfig's field definitions. Since the factory defaults to sink_type=CONSOLE, these OTLP/Prometheus fields will pass the _reject_otlp_fields validation (which only rejects non-default values for non-OTLP sinks). Tests can override sink_type as needed.

src/synthorg/api/state.py (1)

228-239: LGTM!

The has_prometheus_collector and prometheus_collector properties follow the established pattern used by other services (e.g., has_persistence/persistence). The _require_service call ensures a clear 503 error when accessed before configuration.

tests/unit/observability/test_config.py (3)

663-715: LGTM!

Comprehensive test coverage for SinkType.PROMETHEUS validation:

  • Default and custom prometheus_port values
  • Cross-type field rejection (file_path, syslog_host, http_url, otlp_endpoint)
  • Immutability enforcement
  • JSON serialization roundtrip

721-858: LGTM!

Thorough test coverage for SinkType.OTLP validation including:

  • Required otlp_endpoint with blank/scheme validation
  • All OTLP-specific fields (protocol, headers, interval, batch_size, timeout)
  • Positive value constraints
  • Cross-type field rejection
  • Empty header name validation
  • Immutability and JSON roundtrip

864-912: LGTM!

Good coverage ensuring prometheus_port and otlp_endpoint are rejected by non-Prometheus/OTLP sink types (CONSOLE, FILE, HTTP, SYSLOG).

tests/unit/api/controllers/test_metrics.py (1)

14-42: LGTM!

Well-structured test helpers:

  • _make_app cleanly mocks AppState with configurable collector presence
  • _make_collector uses a real CollectorRegistry with test metrics, ensuring the Prometheus exposition format is exercised end-to-end
  • AsyncMock for refresh correctly matches the async signature in the controller
src/synthorg/observability/events/metrics.py (1)

1-20: LGTM!

Event constants follow the established taxonomy pattern with clear organization by subsystem (Prometheus scrape, coordination, OTLP export). The Final[str] type hints ensure immutability, and the naming convention (metrics.<subsystem>.<action>) is consistent with other event modules.

src/synthorg/api/controllers/metrics.py (1)

19-57: LGTM!

The MetricsController implementation is clean and follows established patterns:

  • Unauthenticated endpoint as documented (standard for Prometheus scrape targets)
  • Proper 503 handling when collector not configured with informative comment body
  • Correct Prometheus exposition content type (text/plain; version=0.0.4; charset=utf-8)
  • Appropriate log levels: WARNING for failures, DEBUG for successful scrapes (high-frequency operation)
  • Async refresh() call allows non-blocking metric collection

Comment on lines +71 to +74
def test_returns_503_when_collector_not_configured(self) -> None:
with TestClient(app=_make_app(collector=None)) as client:
resp = client.get("/metrics")
assert resp.status_code == 503

🧹 Nitpick | 🔵 Trivial

Consider asserting the 503 response body.

The controller returns a specific error message (b"# No metrics collector configured\n"). Asserting this ensures the response remains informative:

💡 Suggested enhancement
     def test_returns_503_when_collector_not_configured(self) -> None:
         with TestClient(app=_make_app(collector=None)) as client:
             resp = client.get("/metrics")
             assert resp.status_code == 503
+            assert "No metrics collector configured" in resp.text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/api/controllers/test_metrics.py` around lines 71 - 74, Update the
test_returns_503_when_collector_not_configured to also assert the response body
content from GET "/metrics" so the error message remains stable; after creating
the TestClient with _make_app(collector=None) and checking resp.status_code ==
503, add an assertion that resp.content (or resp.data/resp.text as appropriate)
equals the expected message b"# No metrics collector configured\n" (or its
decoded string) to ensure the controller returns the specific error body.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

♻️ Duplicate comments (2)
tests/unit/observability/test_otlp_handler.py (1)

52-64: ⚠️ Potential issue | 🟠 Major

These OTLP handler tests still race the live flusher.

_make_handler() starts a real background thread and points it at http://localhost:4318. That makes test_batch_ready_signal() nondeterministic and lets the live flusher drain the export-failure tests' queued records before the patched urlopen() block runs. Please disable the flusher in the test harness or inject a synchronous stub exporter so these assertions stop depending on thread timing and a real localhost endpoint.

As per coding guidelines: "NEVER skip, dismiss, or ignore flaky tests -- always fix them fully and fundamentally."

Also applies to: 80-88, 222-252

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/observability/test_otlp_handler.py` around lines 52 - 64, The
tests race the OtlpHandler's background flusher started by _make_handler(),
causing nondeterministic behavior in test_batch_ready_signal and related
export-failure tests; update _make_handler to disable or control the background
flusher (e.g., add a parameter or use an OtlpHandler constructor flag to prevent
starting the flusher, or explicitly stop/replace the flusher thread after
construction) or inject a synchronous stub exporter into the handler so exports
occur deterministically in the test thread (target symbols: _make_handler,
OtlpHandler, test_batch_ready_signal, and any patched urlopen usage); ensure the
handler used in tests does not start a live background thread and that exports
are run synchronously so tests no longer depend on thread timing.
src/synthorg/observability/otlp_handler.py (1)

61-72: ⚠️ Potential issue | 🔴 Critical

The exporter still doesn't emit OTLP/HTTP-protobuf.

The only accepted protocol here is HTTP_PROTOBUF, but _export_batch() posts application/json and serializes each record as ad-hoc JSON (body is a plain string and attributes is a dict). Real OTLP collectors expect protobuf bytes or the protobuf-JSON mapping with AnyValue / KeyValue shapes, so this payload will not interoperate reliably. Please switch to a real OTLP encoder/exporter or rename/restrict the supported protocol to what is actually sent.

Also applies to: 187-208

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/otlp_handler.py` around lines 61 - 72, The code
claims to support OtlpProtocol.HTTP_PROTOBUF but _export_batch() sends ad-hoc
JSON; either implement real OTLP/HTTP‑protobuf serialization or stop advertising
protobuf support. Update the constructor/validation around OtlpProtocol (the
protocol parameter and self._protocol) and one of: (A) replace the current JSON
body assembly in _export_batch() with proper OTLP protobuf encoding using the
opentelemetry/protobuf-generated message types (KeyValue/AnyValue/ResourceSpans
or ExportTraceServiceRequest as appropriate) or call an existing OTLP HTTP
protobuf exporter to produce bytes and send content-type application/x-protobuf,
or (B) change the allowed protocol enum and default to HTTP_JSON (or raise
NotImplementedError when protocol == OtlpProtocol.HTTP_PROTOBUF) and set
content-type to application/json while documenting the JSON-only behavior.
Ensure references to OtlpProtocol and the _export_batch method are updated
accordingly.
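If option (B) is chosen, the JSON body still has to follow the OTLP/JSON mapping (the proto3 JSON form of `ExportLogsServiceRequest`, posted to the collector's `/v1/logs` path with `Content-Type: application/json`) rather than ad-hoc shapes. A hedged sketch of that encoding; note that `int64` fields such as `timeUnixNano` and `intValue` are JSON strings under the proto3 mapping, and `body`/`attributes` use the `AnyValue`/`KeyValue` shapes:

```python
import time


def to_any_value(value) -> dict:
    """Map a Python scalar to the OTLP AnyValue JSON shape."""
    if isinstance(value, bool):  # bool before int: bool subclasses int
        return {"boolValue": value}
    if isinstance(value, int):
        return {"intValue": str(value)}  # int64 is a JSON string in OTLP/JSON
    if isinstance(value, float):
        return {"doubleValue": value}
    return {"stringValue": str(value)}


def build_export_request(records: list[dict]) -> dict:
    """Wrap flat {body, attributes} dicts into an ExportLogsServiceRequest."""
    log_records = [
        {
            "timeUnixNano": str(rec.get("time_unix_nano", time.time_ns())),
            "severityText": rec.get("level", "INFO"),
            "body": to_any_value(rec["body"]),
            "attributes": [
                {"key": k, "value": to_any_value(v)}
                for k, v in rec.get("attributes", {}).items()
            ],
        }
        for rec in records
    ]
    return {
        "resourceLogs": [
            {
                "resource": {
                    "attributes": [
                        {"key": "service.name", "value": {"stringValue": "synthorg"}}
                    ]
                },
                "scopeLogs": [{"scope": {}, "logRecords": log_records}],
            }
        ]
    }
```

This keeps the payload interoperable with standard collectors while staying JSON-only; the service name and record field names above are illustrative, not the handler's current attributes.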
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/synthorg/api/app.py`:
- Around line 1204-1219: The /metrics endpoint is being exposed but
AppState.prometheus_collector is never registered, so MetricsController will
return 503; in create_app() or _build_lifecycle() wire the Prometheus collector
from the logging/metrics configuration into AppState.prometheus_collector during
app construction/startup (or explicitly skip/adopt a disabled mode and remove
the route) so the collector exists before you add metrics_path to exclude_paths;
update the initialization logic that reads the logging config to instantiate and
assign the collector (or, if intentionally disabled, remove metrics_path from
advertised routes and document the deviation) — touch MetricsController,
AppState.prometheus_collector, create_app(), and _build_lifecycle() to implement
this registration.

In `@src/synthorg/api/state.py`:
- Around line 257-258: Replace the debug-level logs used after successful
AppState mutations with info-level logs: change logger.debug(API_APP_STARTUP,
note="Prometheus collector configured") (and the analogous logger.debug calls at
the other locations referenced) to logger.info(...) so that successful
service-configuration transitions are logged at INFO; keep the same
API_APP_STARTUP event and note text, and update the calls adjacent to
assignments like self._prometheus_collector = collector and the other
state-assignment sites (the debug statements at the other referenced locations)
to use logger.info.

In `@src/synthorg/observability/config.py`:
- Around line 208-212: prometheus_port is a dead config knob—either implement a
secondary metrics listener or remove/disable the setting; for now remove the
prometheus_port Field (and the duplicate block around lines ~387-398) from the
config to avoid misleading operators, and update observability/sinks.py so
SinkType.PROMETHEUS is no longer silently mapped to a NullHandler (instead
validate/raise if a PROMETHEUS sink or prometheus_port is present), or
alternatively implement the secondary listener wiring to honor prometheus_port;
reference symbols to change: prometheus_port (Field in config.py) and
SinkType.PROMETHEUS (in observability/sinks.py).
- Around line 82-114: _validate_otlp_endpoint_safety currently only rejects IP
literals (addr parsed from hostname) so hostnames like "otel.internal" that
resolve to private/loopback addresses bypass SSRF checks; update
_validate_otlp_endpoint_safety(endpoint, hostname, *, has_headers) to perform
DNS resolution of the hostname (A/AAAA) and reject if any resolved IP
is_private/is_loopback/is_link_local, or delegate to the repository's DNS-aware
SSRF validator utility instead of only parsing literals; if DNS resolution
fails, treat it as an error (or fail-closed) and surface a clear ValueError;
also apply the same hostname-resolution check to the other OTLP validation path
in this module that mirrors this logic.
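A minimal DNS-aware check along those lines could look like the sketch below (an illustration, not the repository's validator; note the usual TOCTOU caveat that DNS answers can change between validation and the actual export request):

```python
import ipaddress
import socket


def resolve_and_check(hostname: str) -> None:
    """Reject hostnames whose A/AAAA records land in private, loopback,
    or link-local ranges; fail closed on resolution errors."""
    try:
        infos = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        raise ValueError(
            f"Cannot resolve OTLP endpoint host {hostname!r}"
        ) from exc
    for info in infos:
        # Strip any IPv6 zone id (e.g. 'fe80::1%eth0') before parsing.
        addr = ipaddress.ip_address(info[4][0].split("%")[0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            raise ValueError(
                f"OTLP endpoint host {hostname!r} resolves to "
                f"disallowed address {addr}"
            )
```

Checking every resolved address (not just the first) matters because an attacker-controlled name can return a mix of public and private records.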

In `@src/synthorg/observability/prometheus_collector.py`:
- Around line 131-136: Before raising the ValueError when an unknown verdict is
detected (the block that checks verdict against self._VALID_VERDICTS and raises
ValueError), emit a warning-level log that includes the rejected verdict and the
expected valid set; e.g., call self._logger.warning(...) or the module's metrics
warning helper with a message like "Rejected unknown security verdict %r;
expected %s" (including verdict and sorted(self._VALID_VERDICTS)) immediately
before the raise so the bad label is recorded with context.
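The log-before-raise shape is small; sketched here with stdlib logging for portability (the project would use its structlog `logger.warning(EVENT, key=value)` form with an event constant, and the event name below is hypothetical):

```python
import logging

logger = logging.getLogger(__name__)  # project uses get_logger(__name__)

_VALID_VERDICTS = frozenset({"allow", "deny", "escalate"})


def record_verdict(verdict: str) -> None:
    """Increment the verdict counter, rejecting unknown labels loudly."""
    if verdict not in _VALID_VERDICTS:
        # Log with context BEFORE raising, per the project's error-path rule.
        logger.warning(
            "metrics.security.verdict_rejected",  # hypothetical event name
            extra={"verdict": verdict, "valid": sorted(_VALID_VERDICTS)},
        )
        raise ValueError(f"Unknown security verdict: {verdict!r}")
    # ... increment the bounded counter here ...
```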
- Around line 181-182: Call app_state.cost_tracker.get_total_cost() once per
scrape and reuse that single snapshot for both metrics instead of awaiting it
twice; store the result in a local variable (e.g., total = await
app_state.cost_tracker.get_total_cost()) and then call
self._cost_total.set(total) and compute the budget percent for
self._budget_used_percent from that same total value (and apply the same change
to the other occurrence where get_total_cost() is awaited again). Ensure you
reference the existing symbols app_state.cost_tracker.get_total_cost(),
self._cost_total.set(), and self._budget_used_percent (synthorg_cost_total /
synthorg_budget_used_percent) so both metrics come from the identical snapshot.
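The single-snapshot refresh could be sketched as follows; `get_total_cost()` comes from the review, while the class and field names here are illustrative:

```python
import asyncio


class CostMetrics:
    """Sketch: take one cost snapshot per scrape so synthorg_cost_total and
    synthorg_budget_used_percent cannot disagree mid-scrape."""

    def __init__(self, cost_tracker, monthly_budget_usd: float) -> None:
        self._cost_tracker = cost_tracker
        self._monthly_budget_usd = monthly_budget_usd
        self.cost_total = 0.0
        self.budget_used_percent = 0.0

    async def refresh(self) -> None:
        # Awaited exactly once; both gauges derive from the same value.
        total = await self._cost_tracker.get_total_cost()
        self.cost_total = total
        self.budget_used_percent = 100.0 * total / self._monthly_budget_usd
```

In the real collector the two assignments would be `self._cost_total.set(total)` and `self._budget_used_percent.set(...)` on the prometheus_client gauges.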
- Around line 183-188: Generic except blocks around the METRICS_SCRAPE_FAILED
logging are swallowing critical system exceptions; import builtins if not
present and add an explicit guard before each "except Exception:" to re-raise
system errors, e.g. insert "except (builtins.MemoryError,
builtins.RecursionError): raise" immediately before the existing "except
Exception:" blocks that call logger.warning(...) (the handlers referencing
METRICS_SCRAPE_FAILED and component="cost_tracker") so
MemoryError/RecursionError are not downgraded to warnings.
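A sketch of the guarded scrape handler, shown with the parenthesized `except` form for portability (on the project's Python 3.14, the PEP 758 comma form without parentheses applies):

```python
import builtins
import logging

logger = logging.getLogger(__name__)  # project uses get_logger(__name__)


def refresh_cost_metrics(read_cost):
    """Scrape one component; downgrade ordinary failures to a warning,
    but never swallow system errors."""
    try:
        return read_cost()
    except (builtins.MemoryError, builtins.RecursionError):
        raise  # system errors must propagate, not become warnings
    except Exception:
        # METRICS_SCRAPE_FAILED event constant in the real module.
        logger.warning(
            "metrics.scrape.failed", extra={"component": "cost_tracker"}
        )
        return None
```

Ordering matters: the narrow system-error clause must precede the broad `except Exception:` so those exceptions are re-raised before the generic handler can catch them.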
- Around line 60-63: The Gauge _agents_total is defined as "Number of registered
agents" but _refresh_agent_metrics() only reads
AgentRegistryService.list_active(), so the metric never reports non-ACTIVE
statuses; either change _refresh_agent_metrics() to iterate over a new
AgentRegistryService.list_all() (or equivalent) and aggregate counts by
agent.status and trust_level before setting _agents_total, or rename the metric
and its description (and variable _agents_total) to indicate it counts only
active agents (e.g., synthorg_active_agents_total / "Number of active agents")
and keep using list_active(); update any label handling in
_refresh_agent_metrics() accordingly.
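The aggregation side of the first option could be sketched as below; `Agent` and the `list_all()` source are illustrative stand-ins, not the repository's actual models:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    status: str       # e.g. "active", "suspended": illustrative values
    trust_level: str  # e.g. "low", "high"


def aggregate_agent_counts(agents) -> Counter:
    """Count agents per (status, trust_level) label pair so the gauge can
    report every status, matching the 'registered agents' description.

    In the real collector `agents` would come from a hypothetical
    AgentRegistryService.list_all() rather than list_active().
    """
    return Counter((a.status, a.trust_level) for a in agents)
```

When writing these counts into the labeled gauge, previously seen label pairs that drop to zero should also be reset explicitly, otherwise stale series linger in the registry between scrapes.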


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: db453f2c-380e-423b-bff5-530c6add3490

📥 Commits

Reviewing files that changed from the base of the PR and between fb7e361 and d535acf.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • docs/design/operations.md
  • pyproject.toml
  • scripts/run_affected_tests.py
  • src/synthorg/api/app.py
  • src/synthorg/api/state.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/otlp_handler.py
  • src/synthorg/observability/prometheus_collector.py
  • tests/unit/api/controllers/test_metrics.py
  • tests/unit/observability/test_otlp_handler.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Build Backend
  • GitHub Check: Build Web
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (7)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Python version: 3.14+ with PEP 649 native lazy annotations. Do not use from __future__ import annotations.

Use PEP 758 except syntax: use except A, B: (no parentheses) on Python 3.14 -- ruff enforces this.

Type hints: all public functions must have type hints. Use mypy strict mode.

Docstrings: Google style required on all public classes and functions. Enforced by ruff D rules.

Immutability: create new objects, never mutate existing ones. For non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction and wrap with MappingProxyType for read-only enforcement.

For frozen Pydantic models, use copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persisting for persistence) to copy dict/list fields.

Config vs runtime state: use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.

Use Pydantic v2: BaseModel, model_validator, computed_field, ConfigDict. Use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time.

Use @computed_field for derived values instead of storing and validating redundant fields (e.g. TokenUsage.total_tokens).

Use NotBlankStr (from core.types) for all identifier/name fields -- including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants -- instead of manual whitespace validators.

Async concurrency: prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (e.g. multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare create_task.

Line length: 88 characters (enforced by ruff).

Functions: keep < 50 lines. Files: keep < 800 lines.

Error handling: handle explicitly, never silently swallow exceptions.

Validation: validate at...

Files:

  • src/synthorg/api/app.py
  • scripts/run_affected_tests.py
  • tests/unit/observability/test_otlp_handler.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/prometheus_collector.py
  • tests/unit/api/controllers/test_metrics.py
  • src/synthorg/api/state.py
  • src/synthorg/observability/otlp_handler.py
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Every module with business logic MUST have: from synthorg.observability import get_logger then logger = get_logger(__name__). Never use import logging / logging.getLogger() / print() in application code.

Logger variable name: always logger (not _logger, not log).

Event names: always use constants from the domain-specific module under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool, etc.). Import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT

Structured logging: always use logger.info(EVENT, key=value) -- never logger.info("msg %s", val).

All error paths must log at WARNING or ERROR with context before raising.

All state transitions must log at INFO level.

DEBUG level logs are for object creation, internal flow, entry/exit of key functions.

Files:

  • src/synthorg/api/app.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/api/state.py
  • src/synthorg/observability/otlp_handler.py
src/**/*.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

  • src/synthorg/api/app.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/api/state.py
  • src/synthorg/observability/otlp_handler.py
docs/design/**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

ALWAYS read the relevant docs/design/ page before implementing any feature or planning any issue. The design spec is the starting point for architecture, data models, and behavior.

Files:

  • docs/design/operations.md
docs/design/operations.md

📄 CodeRabbit inference engine (CLAUDE.md)

Vendor names (Anthropic, OpenAI, Claude, GPT, etc.) may only appear in the Operations design page provider list.

Files:

  • docs/design/operations.md
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Test markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow.

Coverage: 80% minimum (enforced in CI).

Async tests: use asyncio_mode = "auto" -- no manual @pytest.mark.asyncio needed.

Test timeout: 30 seconds per test (global in pyproject.toml). Do not add per-file pytest.mark.timeout(30) markers; non-default overrides like timeout(60) are allowed.

Test parallelism: pytest-xdist via -n 8. ALWAYS include -n 8 when running pytest locally, never run tests sequentially.

Parametrize: prefer @pytest.mark.parametrize for testing similar cases.

Vendor-agnostic: NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: test-provider, test-small-001, etc.

Flaky tests: NEVER skip, dismiss, or ignore. Always fix them fully and fundamentally. For timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic. For tasks that must block indefinitely until cancelled, use asyncio.Event().wait() instead of asyncio.sleep(large_number).
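
The "mock `asyncio.sleep()` for determinism" guidance can be sketched like this; `poll_until_ready` is a hypothetical function under test, not project code.

```python
# Deterministic timing test: the fake sleep records delays and returns
# instantly, so the test never waits in real time.
import asyncio
from unittest import mock


async def poll_until_ready(check, interval: float = 5.0, attempts: int = 3) -> bool:
    """Poll check() up to `attempts` times, sleeping `interval` between tries."""
    for _ in range(attempts):
        if check():
            return True
        await asyncio.sleep(interval)
    return False


def test_polls_without_real_delay() -> None:
    results = iter([False, False, True])
    sleeps: list[float] = []

    async def fake_sleep(delay: float) -> None:
        sleeps.append(delay)  # record the delay instead of waiting

    with mock.patch("asyncio.sleep", fake_sleep):
        ready = asyncio.run(poll_until_ready(lambda: next(results)))

    assert ready
    assert sleeps == [5.0, 5.0]  # two sleeps before the third check succeeds


test_polls_without_real_delay()
```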

Files:

  • tests/unit/observability/test_otlp_handler.py
  • tests/unit/api/controllers/test_metrics.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

  • tests/unit/observability/test_otlp_handler.py
  • tests/unit/api/controllers/test_metrics.py
src/synthorg/observability/{setup,sinks,syslog_handler,http_handler,otlp_handler}.py

📄 CodeRabbit inference engine (CLAUDE.md)

Exception: the observability handler and setup modules (setup.py, sinks.py, syslog_handler.py, http_handler.py, otlp_handler.py) may use stdlib logging and print(..., file=sys.stderr) for handler construction, bootstrap, and error reporting code.

Files:

  • src/synthorg/observability/otlp_handler.py
🧠 Learnings (38)
📓 Common learnings
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: If implementation deviates from the spec (better approach found, scope evolved, etc.), alert the user and explain why. Do NOT silently diverge -- every deviation needs explicit user approval.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: Update the relevant `docs/design/` page to reflect the new reality when approved deviations occur.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: Every implementation plan must be presented to the user for accept/deny before coding starts.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: At every phase of planning and implementation, be critical -- actively look for ways to improve the design in the spirit of robustness, correctness, simplicity, and future-proofing.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: Surface improvements as suggestions, not silent changes -- user decides.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: Prioritize issues by dependency order, not priority labels -- unblocked dependencies come first.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: Pure data models, enums, and re-exports do NOT need logging.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: Signed commits: required on `main` via branch protection -- all commits must be GPG/SSH signed.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: Branches: use `<type>/<slug>` naming from main.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: Pre-commit hooks enforce: trailing-whitespace, end-of-file-fixer, check-yaml, check-toml, check-json, check-merge-conflict, check-added-large-files, no-commit-to-branch (main), ruff check+format, gitleaks, hadolint (Dockerfile), golangci-lint + go vet (CLI conditional), no-em-dashes, no-redundant-timeout, eslint-web (React, conditional).
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: Pre-push hooks: mypy type-check (affected modules only) + pytest unit tests (affected modules only) + golangci-lint + go vet + go test (CLI, conditional) + eslint-web (conditional). Foundational module changes (core, config, observability) or conftest changes trigger full runs.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: GitHub issue queries: use `gh issue list` via Bash (not MCP tools). MCP `list_issues` has unreliable field data.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: Merge strategy: squash merge. PR body becomes the squash commit message on main. Trailers (e.g. `Release-As`, `Closes `#N``) must be in the PR body to land in the final commit.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: PR issue references: preserve existing `Closes `#NNN`` references -- never remove unless explicitly asked.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: After finishing an issue implementation: always create a feature branch (`<type>/<slug>`), commit, and push. Do NOT create a PR automatically. Do NOT leave work uncommitted on main.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: NEVER create a PR directly -- `gh pr create` is blocked by hookify. ALWAYS use `/pre-pr-review` to create PRs. It runs automated checks and review agents before creating the PR.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: For trivial/docs-only changes: use `/pre-pr-review quick` which skips agents but still runs automated checks.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: After the PR exists, use `/aurelio-review-pr` to handle external reviewer feedback.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: When review agents find valid issues (including pre-existing issues in surrounding code, suggestions, and findings adjacent to the PR's changes), fix them all. No deferring, no "out of scope" skipping.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: Shell usage: NEVER use `cd` in Bash commands. The working directory is already set to the project root. Use absolute paths or run commands directly. Exception: `bash -c "cd <dir> && <cmd>"` is safe (runs in a child process).
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T06:16:10.548Z
Learning: NEVER use Bash to write or modify files -- use the Write or Edit tools. Do not use `cat >`, `cat << EOF`, `echo >`, `echo >>`, `sed -i`, `python -c "open(...).write(...)"`, or `tee` to create or modify files.
📚 Learning: 2026-04-01T21:17:42.724Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T21:17:42.724Z
Learning: Applies to pyproject.toml : Required runtime dependencies: `mem0ai` (default and currently only memory backend), `cryptography` (Fernet encryption), `faker` (multi-locale name generation)

Applied to files:

  • pyproject.toml
📚 Learning: 2026-04-01T18:57:46.655Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T18:57:46.655Z
Learning: Applies to pyproject.toml : Docs group in `pyproject.toml` includes: `zensical`, `mkdocstrings[python]`, `griffe-pydantic`

Applied to files:

  • pyproject.toml
📚 Learning: 2026-04-07T22:12:30.142Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-07T22:12:30.142Z
Learning: Applies to !(src/synthorg/observability/{setup,sinks,syslog_handler,http_handler}.py)**/*.py : Never use `import logging` / `logging.getLogger()` / `print()` in application code (exception: `observability/setup.py`, `observability/sinks.py`, `observability/syslog_handler.py`, and `observability/http_handler.py` may use stdlib `logging`)

Applied to files:

  • pyproject.toml
  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to tests/**/*.py : Test markers: `pytest.mark.unit`, `pytest.mark.integration`, `pytest.mark.e2e`, `pytest.mark.slow`. Coverage: 80% minimum. Async: `asyncio_mode = 'auto'` — no manual `pytest.mark.asyncio` needed. Timeout: 30 seconds per test. Parallelism: `pytest-xdist` via `-n auto` — ALWAYS include `-n auto` when running pytest, never run tests sequentially.

Applied to files:

  • scripts/run_affected_tests.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Parallelism: pytest-xdist via `-n auto` — ALWAYS include `-n auto` when running pytest, never run tests sequentially.

Applied to files:

  • scripts/run_affected_tests.py
📚 Learning: 2026-04-02T18:48:02.489Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T18:48:02.489Z
Learning: ALWAYS run pytest with `-n auto` for parallel execution with pytest-xdist; never run tests sequentially

Applied to files:

  • scripts/run_affected_tests.py
📚 Learning: 2026-04-07T22:12:30.142Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-07T22:12:30.142Z
Learning: Pre-push hooks check: mypy type-check (affected modules only) + pytest unit tests (affected modules only) + golangci-lint + go vet + go test (CLI, conditional on `cli/**/*.go`) + eslint-web (web dashboard); foundational module changes (core, config, observability) or conftest changes trigger full runs

Applied to files:

  • scripts/run_affected_tests.py
📚 Learning: 2026-04-07T22:12:30.142Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-07T22:12:30.142Z
Learning: ALWAYS include `-n 8` when running pytest locally; never run tests sequentially; CI uses `-n auto`

Applied to files:

  • scripts/run_affected_tests.py
📚 Learning: 2026-04-02T08:47:46.313Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T08:47:46.313Z
Learning: Run unit tests with `uv run python -m pytest tests/ -m unit -n auto`; integration tests with `-m integration -n auto`; e2e tests with `-m e2e -n auto`

Applied to files:

  • scripts/run_affected_tests.py
📚 Learning: 2026-04-02T08:47:46.313Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T08:47:46.313Z
Learning: Run full pytest suite with coverage: `uv run python -m pytest tests/ -n auto --cov=synthorg --cov-fail-under=80`

Applied to files:

  • scripts/run_affected_tests.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/api/**/*.py : API package (api/): Litestar REST + WebSocket with controllers, guards, channels, JWT + API key + WS ticket auth, approval gate integration, coordination endpoint, collaboration endpoint, settings endpoint, provider management endpoint (CRUD + test + presets), backup endpoint, RFC 9457 structured errors, AppState hot-reload slots, service auto-wiring (Phase 1 at construction, Phase 2 on startup), lifecycle helpers

Applied to files:

  • docs/design/operations.md
  • tests/unit/api/controllers/test_metrics.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Documentation source in `docs/` (Markdown, built with Zensical). Design spec in `docs/design/` (7 pages: index, agents, organization, communication, engine, memory, operations). Architecture in `docs/architecture/` (overview, tech-stack, decision log). Roadmap in `docs/roadmap/`. Security in `docs/security.md`. Licensing in `docs/licensing.md`. Reference in `docs/reference/`. REST API reference in `docs/rest-api.md`. Library reference in `docs/api/` (auto-generated from docstrings). Custom templates in `docs/overrides/`. Config in `mkdocs.yml`.

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability must use structured logging with correlation tracking and log sinks

Applied to files:

  • docs/design/operations.md
  • src/synthorg/observability/config.py
  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Security: SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies: disabled/weighted/per-category/milestone), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume).

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to tests/**/*.py : NEVER skip, dismiss, or ignore flaky tests — always fix them fully and fundamentally. For timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-03-31T21:07:37.470Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T21:07:37.470Z
Learning: Applies to tests/**/*.py : NEVER skip, dismiss, or ignore flaky tests — always fix them fully and fundamentally; mock `time.monotonic()` and `asyncio.sleep()` for determinism; use `asyncio.Event().wait()` for indefinite blocking instead of `asyncio.sleep(large_number)`

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-04-01T15:36:39.993Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T15:36:39.993Z
Learning: Applies to tests/**/*.py : Flaky tests: NEVER skip, dismiss, or ignore flaky tests -- always fix them fully and fundamentally. For timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-04-07T22:12:30.142Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-07T22:12:30.142Z
Learning: Applies to tests/**/*.py : NEVER skip, dismiss, or ignore flaky tests—always fix them fully and fundamentally; for timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins; for tasks that must block indefinitely until cancelled, use `asyncio.Event().wait()` instead of `asyncio.sleep(large_number)`

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to tests/**/*.py : Fix flaky tests completely and fundamentally; for timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-04-02T07:18:02.381Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T07:18:02.381Z
Learning: Applies to tests/**/*.py : Never skip, dismiss, or ignore flaky tests; fix them fully and fundamentally by mocking `time.monotonic()` and `asyncio.sleep()` for determinism or using `asyncio.Event().wait()` for indefinite blocking

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability package (observability/): structured logging, correlation tracking, log sinks; event constants organized by domain under observability/events/ (e.g., events.api, events.tool, events.git, events.context_budget, events.backup)

Applied to files:

  • src/synthorg/observability/config.py
  • src/synthorg/api/state.py
  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability includes structured logging via `get_logger(__name__)`, correlation tracking, and log sinks.

Applied to files:

  • src/synthorg/observability/config.py
  • src/synthorg/api/state.py
  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-31T16:09:24.320Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T16:09:24.320Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from `synthorg.observability.events.<domain>` modules (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly and use in structured logging

Applied to files:

  • src/synthorg/observability/config.py
  • src/synthorg/api/state.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : Use DEBUG logging for object creation, internal flow, and entry/exit of key functions.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : All state transitions must log at INFO.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, and key function entry/exit

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-17T06:43:14.114Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:43:14.114Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions. Pure data models, enums, and re-exports do NOT need logging.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from synthorg.observability.events domain-specific modules (e.g., PROVIDER_CALL_START from events.provider). Import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic must import `from synthorg.observability import get_logger` and define `logger = get_logger(__name__)`

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`. Never use import logging / logging.getLogger() / print() in application code.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-04-02T07:18:02.381Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T07:18:02.381Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly from the domain module

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-17T06:43:14.114Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:43:14.114Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`. Never use `import logging` / `logging.getLogger()` / `print()` in application code. Variable name: always `logger`.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-20T11:18:48.128Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T11:18:48.128Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have `from synthorg.observability import get_logger` followed by `logger = get_logger(__name__)`.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic must import logger via `from synthorg.observability import get_logger` and initialize with `logger = get_logger(__name__)`

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-20T08:28:32.845Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T08:28:32.845Z
Learning: Applies to src/synthorg/providers/**/*.py : Providers: LLM provider abstraction (LiteLLM adapter), auth types (api_key/oauth/custom_header/none), presets (PROVIDER_PRESETS), runtime CRUD (ProviderManagementService with asyncio.Lock serialization), hot-reload via AppState swap.

Applied to files:

  • src/synthorg/api/state.py
🔇 Additional comments (4)
scripts/run_affected_tests.py (3)

179-179: LGTM on the full_suite=True call sites.

The three fallback paths (git errors at lines 179 and 185, and foundational module changes at line 197) correctly pass full_suite=True to reduce worker count during resource-intensive full-suite runs.

Also applies to: 185-185, 197-197


204-204: Affected-tests path correctly uses default worker count.

The selective test run (line 204) appropriately omits full_suite=True, using the higher worker count since fewer tests mean less resource pressure.


150-170: The concern about duplicate -n arguments is unfounded. pytest CLI arguments override addopts settings in pyproject.toml -- this is standard pytest behavior and does not cause errors or conflicts.

The design is intentional and correct:

  • addopts = [..., "-n=8"] ensures the project default aligns with the coding guideline: "ALWAYS include -n 8 when running pytest locally"
  • The script overrides with -n 4 for full-suite runs to prevent resource exhaustion in the pre-push hook
  • The script keeps -n 8 for affected tests (per guidelines)

No changes needed. Remove the comment.

> Likely an incorrect or invalid review comment.
src/synthorg/observability/prometheus_collector.py (1)

223-225: Nice fix for stale label cleanup.

Clearing both labeled gauges before repopulating prevents disappeared agent/task combinations from lingering across scrapes.

Also applies to: 249-250
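
The stale-label problem this fixes can be illustrated with a plain dict standing in for a prometheus_client labeled Gauge; the names below are hypothetical, not the project's code.

```python
# Without the clear() before repopulating, label combinations for agents that
# no longer exist would linger across scrapes, reporting stale values forever.


def collect_agent_gauge(gauge: dict, agents: list[dict]) -> None:
    """Rebuild the labeled gauge from the current agent set."""
    gauge.clear()  # drop stale (status, trust_level) label combinations
    for agent in agents:
        key = (agent["status"], agent["trust_level"])
        gauge[key] = gauge.get(key, 0) + 1


gauge: dict = {}
collect_agent_gauge(gauge, [{"status": "active", "trust_level": "high"}])
collect_agent_gauge(gauge, [{"status": "idle", "trust_level": "low"}])
# After the second scrape only the current label set remains.
```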

Comment on lines +183 to +188
except Exception:
    logger.warning(
        METRICS_SCRAPE_FAILED,
        component="cost_tracker",
        exc_info=True,
    )
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Show each broad catch in this collector with surrounding context.
rg -n -C3 --type=py 'except builtins\.MemoryError, RecursionError:|except Exception:' src/synthorg/observability/prometheus_collector.py

Repository: Aureliolo/synthorg

Length of output: 1160


Add system-error guards to all generic exception handlers.

These except Exception: blocks (at lines 183, 205, 231, 255) catch MemoryError and RecursionError, causing system failures to be downgraded to warnings. Add the project's except builtins.MemoryError, RecursionError: raise guard before each generic handler:

except builtins.MemoryError, RecursionError:
    raise
except Exception:
    logger.warning(...)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/prometheus_collector.py` around lines 183 - 188,
generic except blocks around the METRICS_SCRAPE_FAILED logging are swallowing
critical system exceptions; import builtins if not already present and add an
explicit guard before each "except Exception:" to re-raise system errors, i.e.
insert "except builtins.MemoryError, RecursionError: raise" (the project's
mandated PEP 758 comma-separated form) immediately before the existing
"except Exception:" blocks that call logger.warning(...) (the handlers
referencing METRICS_SCRAPE_FAILED and component="cost_tracker") so
MemoryError/RecursionError are not downgraded to warnings.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

♻️ Duplicate comments (1)
src/synthorg/api/state.py (1)

241-258: ⚠️ Potential issue | 🟡 Minor

Finish the INFO-level transition logging in the remaining deferred setters.

This patch fixes several set_* success paths, but set_review_gate_service(), set_approval_timeout_scheduler(), and set_settings_service() still emit DEBUG after successful state changes. At the default INFO root level those lifecycle transitions remain invisible.

As per coding guidelines: "All state transitions must log at INFO level."

Also applies to: 322-322, 638-638, 754-754

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/api/state.py` around lines 241 - 258, three deferred setter
methods (set_review_gate_service, set_approval_timeout_scheduler,
set_settings_service) currently log success at DEBUG so their lifecycle
transitions are invisible at default INFO level; change those success-path
logger.debug calls to logger.info and use a descriptive note string analogous to
set_prometheus_collector ("Prometheus collector configured") — e.g., "Review
gate service configured", "Approval timeout scheduler configured", "Settings
service configured" — leaving error handling and RuntimeError raises unchanged
so callers and tests keep the same behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/operations.md`:
- Line 1839: Update the G1 table row to reflect that Prometheus metrics export
is implemented (replace "will follow" with a statement that PrometheusCollector
is instantiated and /metrics endpoint is registered), and change the closing
paragraph so telemetry export is no longer described as the primary gap; instead
state current status (PrometheusCollector registered during on_startup and OTLP
sink type implemented) and list any remaining minor work. Also add a brief note
describing the deviation from the original design, reference PrometheusCollector
and on_startup, and record that this deviation was approved per the project's
deviation/approval process.

In `@src/synthorg/observability/config.py`:
- Around line 242-244: The SinkConfig currently allows OtlpProtocol.GRPC but
only fails later at handler init; add validation in the config model so the
unsupported gRPC protocol is rejected during model validation: implement a pydantic field validator or
root_validator on SinkConfig that checks the otlp_protocol field (and any other
OTLP-related fields referenced around the otlp_protocol definition and the block
noted at lines 405-429) and raises a ValueError if otlp_protocol ==
OtlpProtocol.GRPC, with a clear message like "OTLP gRPC transport is not
supported; use HTTP_JSON", ensuring invalid YAML/config is rejected at parse
time.
- Around line 93-143: The direct-IP rejection in _validate_otlp_endpoint_safety
currently rejects loopback literals (127.0.0.1, ::1) while the later hostname
exemption allows "localhost", causing inconsistency; fix by making the loopback
allowlist consistent: define localhost_names = {"localhost","127.0.0.1","::1"}
up front and check if hostname is in that set before calling
_is_private_ip(hostname) so loopback literals are allowed the same way as
"localhost", and ensure the later DNS-resolution and the HTTP/unencrypted
warning logic also use the same localhost_names set.
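The consistent-allowlist shape can be sketched like this; `endpoint_allowed` and `_is_private_ip` are simplified stand-ins for the real `_validate_otlp_endpoint_safety` logic (DNS resolution and the unencrypted-HTTP warning are omitted):

```python
import ipaddress
from urllib.parse import urlparse

# One shared set, used by the IP check, the DNS check, and the
# unencrypted-HTTP warning alike.
LOCALHOST_NAMES = {"localhost", "127.0.0.1", "::1"}


def _is_private_ip(hostname: str) -> bool:
    try:
        return ipaddress.ip_address(hostname).is_private
    except ValueError:
        return False  # not an IP literal; DNS-based checks happen elsewhere


def endpoint_allowed(endpoint: str) -> bool:
    """Simplified stand-in for the endpoint safety validation."""
    hostname = urlparse(endpoint).hostname or ""
    if hostname in LOCALHOST_NAMES:
        return True  # loopback literals exempted the same way as "localhost"
    return not _is_private_ip(hostname)
```

Checking membership in `LOCALHOST_NAMES` before `_is_private_ip` is what removes the inconsistency: `127.0.0.1` and `::1` are loopback (and hence private), so without the early exemption they would be rejected while `localhost` passed.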

In `@src/synthorg/observability/otlp_handler.py`:
- Around line 94-101: The catch-all except blocks in otlp_handler.py are
currently swallowing critical system exceptions; before each "except Exception:"
add a guard to re-raise MemoryError and RecursionError (i.e., "except
(MemoryError, RecursionError): raise") and then keep the generic handler for the
remaining exceptions. Apply this change to the try/except around
self._queue.put_nowait / self._pending_lock / self._batch_ready.set (shown in
the diff) and the other try/except blocks noted in the comment ranges (around
the exporter/send logic at the blocks covering 150-157, 180-184, and 212-232) so
system-level errors are not downgraded to handleError() or warnings.
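The guard pattern looks like this; `_enqueue` is a placeholder for the real queue/pending-lock logic in `OtlpHandler`:

```python
import logging


class QueueingHandler(logging.Handler):
    """Sketch of the system-exception guard; _enqueue stands in for the
    real put_nowait / _pending_lock / _batch_ready.set sequence."""

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self._enqueue(record)
        # The project's Python 3.14 convention (PEP 758) writes this
        # without parentheses: except MemoryError, RecursionError:
        except (MemoryError, RecursionError):
            raise  # never downgrade system-level errors
        except Exception:
            self.handleError(record)  # ordinary failures stay non-fatal

    def _enqueue(self, record: logging.LogRecord) -> None:
        raise NotImplementedError
```

The same two-clause structure applies to each catch-all block in the exporter/send paths.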
- Around line 159-175: _drain_and_flush currently pulls the entire _queue into
memory then slices into batches; change _drain_and_flush to repeatedly pull up
to _batch_size items from self._queue (using get_nowait in a loop or pop
batch-wise) and call self._export_batch for each batch as you form it, so you
never accumulate the whole backlog in records. Update the pending count
decrement to subtract the number of records actually exported (inside the same
critical section guarded by self._pending_lock or by decrementing per-batch) and
keep references to _queue, _batch_size, _export_batch, _pending_lock, and
_pending_count to locate where to modify.
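A bounded-memory drain can be sketched with a plain `queue.Queue`; the pending-count bookkeeping is noted in the docstring rather than reproduced, since it lives behind the handler's `_pending_lock`:

```python
import queue
from collections.abc import Callable


def drain_and_flush(
    q: queue.Queue,
    batch_size: int,
    export_batch: Callable[[list], None],
) -> int:
    """Drain q one batch at a time so the full backlog is never in memory.

    Returns the number of records exported; the real handler would
    decrement _pending_count by this amount (or per batch) under
    _pending_lock.
    """
    exported = 0
    while True:
        batch: list = []
        while len(batch) < batch_size:
            try:
                batch.append(q.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return exported
        export_batch(batch)
        exported += len(batch)
```

Each batch is formed and exported before the next is pulled, so peak memory is bounded by `batch_size` instead of the backlog size.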
- Around line 111-141: The OTLP JSON payload built by _format_as_otlp_dict() is
invalid: change the "body" to an AnyValue object (wrap the formatted message as
{ "stringValue": "<message>" }), convert "attributes" from a dict to a repeated
KeyValue array (list of { "key": "<name>", "value": { "stringValue": "<value>" }
} entries, using _CORRELATION_FIELDS and record.name), and make time_unix_nano a
string (str(int(record.created * 1_000_000_000))). Also update the export logic
that assembles/sends the OTLP/HTTP JSON payload to expect the new AnyValue and
KeyValue array formats so the top-level resource/log records JSON matches the
OpenTelemetry OTLP/HTTP JSON spec.
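The corrected per-record shape, as a sketch; the correlation field names and `format_as_otlp_dict` itself are taken from the comment, and only the log-record level of the payload is shown (the surrounding resourceLogs/scopeLogs envelope is omitted):

```python
import logging

# Correlation field names assumed from the handler description.
_CORRELATION_FIELDS = ("request_id", "task_id", "agent_id")


def format_as_otlp_dict(record: logging.LogRecord, message: str) -> dict:
    """Shape one log record per the OTLP/HTTP JSON mapping (sketch)."""
    attributes = [  # repeated KeyValue entries, not a plain dict
        {"key": "logger.name", "value": {"stringValue": record.name}},
    ]
    for field in _CORRELATION_FIELDS:
        value = getattr(record, field, None)
        if value is not None:
            attributes.append(
                {"key": field, "value": {"stringValue": str(value)}}
            )
    return {
        # uint64 nanoseconds serialize as a JSON string, not a number
        "timeUnixNano": str(int(record.created * 1_000_000_000)),
        "severityText": record.levelname,
        "body": {"stringValue": message},  # AnyValue wrapper, not a bare str
        "attributes": attributes,
    }
```

The three changes mirror the review finding: `body` becomes an AnyValue object, `attributes` becomes a KeyValue array, and the timestamp becomes a string.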

In `@src/synthorg/observability/prometheus_collector.py`:
- Around line 47-114: The __init__ is too long because each metric is created
inline; refactor to register metrics from a data-driven spec so __init__ stays
under 50 lines: create a list of metric specs (tuples/dicts) containing the
metric type (Info/Gauge/PromCounter), name suffix, help text, label names (if
any) and target attribute name (e.g. "_info", "_agents_total", "_tasks_total",
"_cost_total", "_budget_used_percent", "_budget_monthly_usd",
"_coordination_efficiency", "_coordination_overhead_percent",
"_security_evaluations"), then loop over that spec to instantiate each metric
with f"{prefix}_{suffix}" and registry=self.registry and assign to the
corresponding self.<attribute>; preserve the existing call
self._info.info({"version": __version__}) and
logger.debug(METRICS_COLLECTOR_INITIALIZED, prefix=prefix).
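A framework-agnostic sketch of the spec-driven registration; the `gauge`/`counter` factories stand in for `prometheus_client.Gauge`/`Counter` bound to `registry=self.registry`, and the spec is abbreviated to a few of the families:

```python
from collections.abc import Callable


def register_metrics(
    target: object,
    prefix: str,
    gauge: Callable,
    counter: Callable,
) -> None:
    """Instantiate metrics from a data-driven spec (sketch)."""
    specs = [
        # (factory, name suffix, help text, label names, attribute name)
        (gauge, "agents_total", "Agents by status and trust level",
         ("status", "trust_level"), "_agents_total"),
        (gauge, "tasks_total", "Tasks by status and agent",
         ("status", "agent"), "_tasks_total"),
        (gauge, "cost_total", "Total accumulated cost", (), "_cost_total"),
        (gauge, "budget_used_percent", "Budget utilization percent",
         (), "_budget_used_percent"),
        (gauge, "budget_monthly_usd", "Monthly budget (USD)",
         (), "_budget_monthly_usd"),
        (counter, "security_evaluations", "Security verdicts",
         ("verdict",), "_security_evaluations"),
    ]
    for factory, suffix, help_text, labels, attr in specs:
        setattr(target, attr, factory(f"{prefix}_{suffix}", help_text, labels))
```

`__init__` then reduces to one loop plus the preserved `self._info.info(...)` and `logger.debug(...)` calls, keeping it well under the 50-line limit.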
- Around line 195-225: When a source is unavailable or an exception occurs the
metric methods leave stale values; update _refresh_cost_gauge,
_refresh_budget_metrics, _refresh_agent_metrics, and _refresh_task_metrics to
explicitly reset their Prometheus metrics to a safe “no-data” value instead of
doing nothing: e.g. in _refresh_cost_gauge set self._cost_total.set(0.0) (or
call the metric reset/clear API if available) when total_cost is None; in
_refresh_budget_metrics when app_state.has_cost_tracker is False or
tracker.budget_config is None set self._budget_monthly_usd.set(0.0) and
self._budget_used_percent.set(0.0); and wrap exception handlers to reset the
same gauges (self._budget_monthly_usd, self._budget_used_percent, etc.) before
logging so errors don’t leave stale data. Ensure the same pattern is applied to
the agent and task refresh methods and their associated metric symbols.

In `@tests/unit/observability/test_prometheus_collector.py`:
- Around line 66-70: The task stub _make_task currently doesn't set assigned_to,
so MagicMock creates a truthy child mock and tests hit per-task synthetic agent
labels; update _make_task to accept and set an assigned_to parameter (e.g.,
default None or empty string) and in the task-count test cases pass explicit
assigned_to values for repeated tasks to assert same-agent aggregation exercised
by PrometheusCollector._refresh_task_metrics; ensure tests that expect the
empty-agent bucket pass assigned_to="" or None as appropriate.

---

Duplicate comments:
In `@src/synthorg/api/state.py`:
- Around line 241-258: Three deferred setter methods (set_review_gate_service,
set_approval_timeout_scheduler, set_settings_service) currently log success at
DEBUG so their lifecycle transitions are invisible at default INFO level; change
those success-path logger.debug calls to logger.info and use a descriptive note
string analogous to set_prometheus_collector ("Prometheus collector configured")
— e.g., "Review gate service configured", "Approval timeout scheduler
configured", "Settings service configured" — leaving error handling and
RuntimeError raises unchanged so callers and tests keep the same behavior.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0d732b60-b9b3-42b2-bc8f-8416badb0750

📥 Commits

Reviewing files that changed from the base of the PR and between d535acf and 846a0d6.

📒 Files selected for processing (12)
  • docs/design/operations.md
  • src/synthorg/api/app.py
  • src/synthorg/api/state.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/enums.py
  • src/synthorg/observability/otlp_handler.py
  • src/synthorg/observability/prometheus_collector.py
  • tests/unit/observability/conftest.py
  • tests/unit/observability/test_config.py
  • tests/unit/observability/test_enums.py
  • tests/unit/observability/test_otlp_handler.py
  • tests/unit/observability/test_prometheus_collector.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Build Web
  • GitHub Check: Build Backend
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

No from __future__ import annotations in Python code -- Python 3.14 has PEP 649 native lazy annotations.

Use PEP 758 except syntax: use except A, B: (no parentheses) on Python 3.14. Ruff enforces this rule.

All public functions must have type hints and docstrings in Google style. Mypy strict mode is enforced. Ruff D rules enforce documentation on public classes and functions.

Use Pydantic v2 with BaseModel, model_validator, computed_field, and ConfigDict. Always include allow_inf_nan=False in all ConfigDict declarations. Use @computed_field for derived values instead of storing redundant fields. Use NotBlankStr (from core.types) for all identifier/name fields.

Line length must be 88 characters (ruff enforced).

Functions must be less than 50 lines, and files must be less than 800 lines.

Logging variable name must always be logger (not _logger or log).

Use vendor-agnostic names everywhere: NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases. Tests must use test-provider, test-small-001, etc.

Files:

  • src/synthorg/observability/enums.py
  • tests/unit/observability/test_enums.py
  • tests/unit/observability/conftest.py
  • tests/unit/observability/test_prometheus_collector.py
  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/observability/config.py
  • tests/unit/observability/test_otlp_handler.py
  • src/synthorg/observability/otlp_handler.py
  • tests/unit/observability/test_config.py
  • src/synthorg/api/state.py
  • src/synthorg/api/app.py
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Every module with business logic must have: from synthorg.observability import get_logger and then logger = get_logger(__name__).

Never use import logging, logging.getLogger(), or print() in application code. Exceptions: observability/setup.py, observability/sinks.py, observability/syslog_handler.py, observability/http_handler.py, and observability/otlp_handler.py may use stdlib logging and print(..., file=sys.stderr) for handler construction and bootstrap.

Event names must always use constants from domain-specific modules under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool, etc.). Import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT.

Use structured logging kwargs: always logger.info(EVENT, key=value). Never use logger.info('msg %s', val).

Files:

  • src/synthorg/observability/enums.py
  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/otlp_handler.py
  • src/synthorg/api/state.py
  • src/synthorg/api/app.py
src/**/*.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

  • src/synthorg/observability/enums.py
  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/otlp_handler.py
  • src/synthorg/api/state.py
  • src/synthorg/api/app.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Tests must use markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow.

Use asyncio_mode = 'auto' for pytest -- no manual @pytest.mark.asyncio needed.

Timeout is 30 seconds per test (global in pyproject.toml). Do not add per-file pytest.mark.timeout(30) markers; non-default overrides like timeout(60) are allowed.

Prefer @pytest.mark.parametrize for testing similar cases.

Files:

  • tests/unit/observability/test_enums.py
  • tests/unit/observability/conftest.py
  • tests/unit/observability/test_prometheus_collector.py
  • tests/unit/observability/test_otlp_handler.py
  • tests/unit/observability/test_config.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

  • tests/unit/observability/test_enums.py
  • tests/unit/observability/conftest.py
  • tests/unit/observability/test_prometheus_collector.py
  • tests/unit/observability/test_otlp_handler.py
  • tests/unit/observability/test_config.py
🧠 Learnings (56)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability includes structured logging via `get_logger(__name__)`, correlation tracking, and log sinks.

Further learnings (all learnt from CR, file CLAUDE.md, timestamp 2026-04-08T08:19:23.846Z; broken per-entry metadata trimmed):

  • Always read the relevant `docs/design/` page before implementing any feature or planning any issue. The design spec is the starting point for architecture, data models, and behavior.
  • If implementation deviates from the design spec, alert the user and explain why before proceeding. Every deviation needs explicit user approval and the relevant `docs/design/` page must be updated to reflect the new reality.
  • Every implementation plan must be presented to the user for accept/deny before coding starts. At every phase of planning and implementation, be critical and actively look for design improvements.
  • Prioritize issues by dependency order, not priority labels. Unblocked dependencies come first.
  • Create new objects instead of mutating existing ones. For non-Pydantic internal collections (registries, `BaseTool`), use `copy.deepcopy()` at construction plus `MappingProxyType` wrapping for read-only enforcement.
  • Use frozen Pydantic models for config/identity. Use separate mutable-via-copy models (with `model_copy(update=...)`) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
  • Prefer `asyncio.TaskGroup` for fan-out/fan-in parallel operations in new code (e.g., multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare `create_task`.
  • Handle errors explicitly, never silently swallow exceptions.
  • Validate at system boundaries: user input, external APIs, and config files.
  • All error paths must log at WARNING or ERROR with context before raising.
  • All state transitions must log at INFO level.
  • Use DEBUG level for object creation, internal flow, and entry/exit of key functions.
  • Pure data models, enums, and re-exports do NOT need logging.
  • All provider calls must go through `BaseCompletionProvider` which applies retry and rate limiting automatically. Never implement retry logic in driver subclasses or calling code.
  • `RetryConfig` and `RateLimiterConfig` are set per-provider in `ProviderConfig`.
  • Retryable errors (`is_retryable=True`): `RateLimitError`, `ProviderTimeoutError`, `ProviderConnectionError`, `ProviderInternalError`. Non-retryable errors raise immediately. `RetryExhaustedError` signals all retries failed.
  • Rate limiter automatically respects `RateLimitError.retry_after` from providers and pauses future requests.
  • Coverage minimum is 80% (enforced in CI).
  • Always include `-n 8` when running pytest locally with parallelism via pytest-xdist. Never run tests sequentially. CI uses `-n auto`.
  • Vendor names may only appear in: (1) `docs/design/operations.md` operations design page, (2) `.claude/` skill/agent files, (3) third-party import paths/module names, (4) `src/synthorg/providers/presets.py` which is user-facing runtime data.
  • Property-based testing: Python uses Hypothesis (`given` + `settings`). Profiles configured in `tests/conftest.py`: `ci` (deterministic, `max_examples=10` + `derandomize=True`), `dev` (1000 examples), `fuzz` (10,000 examples, no deadline), `extreme` (500,000 examples, no deadline). Controlled via `HYPOTHESIS_PROFILE` env var.
  • When Hypothesis finds a failure, it is a real bug. Do NOT just rerun and move on. Fix the underlying bug and add an explicit `example(...)` decorator to the test so the case is permanently covered in CI.
  • Never skip, dismiss, or ignore flaky tests. Always fix them fully and fundamentally. For timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins. For tasks that must block indefinitely until cancelled, use `asyncio.Event().wait()` instead of `asyncio.sleep(large_number)`.
  • Commits must follow format: `<type>: <description>` with types: feat, fix, refactor, docs, test, chore, perf, ci. Enforced by commitizen (commit-msg hook).
  • Signed commits are required on `main` via branch protection. All commits must be GPG/SSH signed.
  • Branches must follow format: `<type>/<slug>` from main.
  • Pre-commit hooks enforce: trailing-whitespace, end-of-file-fixer, check-yaml, check-toml, check-json, check-merge-conflict, check-added-large-files, no-commit-to-branch (main), ruff check+format, gitleaks, hadolint (Dockerfile linting), golangci-lint + go vet (CLI), no-em-dashes, no-redundant-timeout, eslint-web (web dashboard, zero warnings).
  • Hookify rule: `block-pr-create` blocks direct `gh pr create` (must use `/pre-pr-review`).
  • Hookify rule: `enforce-parallel-tests` enforces `-n 8` with pytest.
  • Hookify rule: `no-cd-prefix` blocks `cd` prefix in Bash commands.
  • Hookify rule: `no-local-coverage` blocks `--cov` flags locally (CI handles coverage).
  • Pre-push hooks enforce: mypy type-check (affected modules only) + pytest unit tests (affected modules only) + golangci-lint + go vet + go test (CLI) + eslint-web (web dashboard). Foundational module changes (core, config, observability) or conftest changes trigger full runs.
  • Squash merge strategy: PR body becomes the squash commit message on main. Trailers (e.g., `Release-As`, `Closes #N`) must be in the PR body to land in the final commit.
  • Preserve existing `Closes #NNN` references in PR issue references. Never remove unless explicitly asked.
  • After finishing an issue implementation, always create a feature branch (`<type>/<slug>`), commit, and push. Do NOT create a PR automatically. Do NOT leave work uncommitted on main.
  • NEVER create a PR directly. ALWAYS use `/pre-pr-review` to create PRs. It runs automated checks + review agents + fixes before creating the PR. For trivial/docs-only changes: `/pre-pr-review quick` skips agents but still runs automated checks.
  • After the PR exists, use `/aurelio-review-pr` to handle external reviewer feedback.
  • Fix everything valid during code review. Never skip valid issues found by review agents, including pre-existing issues in surrounding code, suggestions, and adjacent findings. No deferring or 'out of scope' skipping.
  • See `web/CLAUDE.md` for full web dashboard component inventory, design token rules, and design system references.
  • NEVER use `cd` in Bash commands. The working directory is already set to the project root. Use absolute paths or run commands directly. Exception: `bash -c "cd <dir> && <cmd>"` is safe (runs in child process).
  • NEVER use Bash to write or modify files. Use Write or Edit tools instead. Do not use `cat >`, `cat << EOF`, `echo >`, `echo >>`, `sed -i`, or `tee` to create or modify files (read-only uses like piping to stdout are fine).
  • Quick command: `uv sync` -- install all deps (dev + test).
  • Quick command: `uv run ruff check src/ tests/` -- lint Python code.
  • Quick command: `uv run ruff format src/ tests/` -- format Python code.
  • Quick command: `uv run mypy src/ tests/` -- type-check (strict mode).
  • See `web/CLAUDE.md` for web dashboard commands, design system, and component inventory.
  • See `cli/CLAUDE.md` for CLI commands, flags, and reference. Key rule: use `go -C cli` (never `cd cli`).
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability package (observability/): structured logging, correlation tracking, log sinks; event constants organized by domain under observability/events/ (e.g., events.api, events.tool, events.git, events.context_budget, events.backup)

Applied to files:

  • src/synthorg/observability/enums.py
  • tests/unit/observability/test_enums.py
  • tests/unit/observability/conftest.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/otlp_handler.py
  • tests/unit/observability/test_config.py
  • src/synthorg/api/state.py
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability includes structured logging via `get_logger(__name__)`, correlation tracking, and log sinks.

Applied to files:

  • src/synthorg/observability/enums.py
  • tests/unit/observability/test_enums.py
  • tests/unit/observability/conftest.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/otlp_handler.py
  • tests/unit/observability/test_config.py
  • src/synthorg/api/state.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability must use structured logging with correlation tracking and log sinks

Applied to files:

  • src/synthorg/observability/enums.py
  • docs/design/operations.md
  • tests/unit/observability/test_enums.py
  • tests/unit/observability/conftest.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/otlp_handler.py
  • tests/unit/observability/test_config.py
📚 Learning: 2026-03-31T14:31:11.894Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:31:11.894Z
Learning: Applies to {docs/rest-api.md,docs/_generated/api-reference.html} : REST API reference in `docs/rest-api.md` + `docs/_generated/api-reference.html` (generated by `scripts/export_openapi.py`)

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...

Applied to files:

  • docs/design/operations.md
  • src/synthorg/api/app.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/api/**/*.py : API package (api/): Litestar REST + WebSocket with controllers, guards, channels, JWT + API key + WS ticket auth, approval gate integration, coordination endpoint, collaboration endpoint, settings endpoint, provider management endpoint (CRUD + test + presets), backup endpoint, RFC 9457 structured errors, AppState hot-reload slots, service auto-wiring (Phase 1 at construction, Phase 2 on startup), lifecycle helpers

Applied to files:

  • docs/design/operations.md
  • src/synthorg/api/app.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Documentation source in `docs/` (Markdown, built with Zensical). Design spec in `docs/design/` (7 pages: index, agents, organization, communication, engine, memory, operations). Architecture in `docs/architecture/` (overview, tech-stack, decision log). Roadmap in `docs/roadmap/`. Security in `docs/security.md`. Licensing in `docs/licensing.md`. Reference in `docs/reference/`. REST API reference in `docs/rest-api.md`. Library reference in `docs/api/` (auto-generated from docstrings). Custom templates in `docs/overrides/`. Config in `mkdocs.yml`.

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Security: SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies: disabled/weighted/per-category/milestone), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume).

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-31T16:09:24.320Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T16:09:24.320Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from `synthorg.observability.events.<domain>` modules (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly and use in structured logging

Applied to files:

  • tests/unit/observability/test_enums.py
  • src/synthorg/observability/config.py
  • tests/unit/observability/test_config.py
  • src/synthorg/api/state.py
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from the domain-specific module under `synthorg.observability.events` in logging calls

Applied to files:

  • tests/unit/observability/test_enums.py
  • tests/unit/observability/test_config.py
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Use event constants from `synthorg.observability.events.<domain>` (e.g., `API_REQUEST_STARTED` from `events.api`); import directly and log with structured kwargs: `logger.info(EVENT, key=value)`, never interpolated strings

Applied to files:

  • tests/unit/observability/test_enums.py
📚 Learning: 2026-04-07T22:12:30.142Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-07T22:12:30.142Z
Learning: Applies to !(src/synthorg/observability/{setup,sinks,syslog_handler,http_handler}.py)**/*.py : Never use `import logging` / `logging.getLogger()` / `print()` in application code (exception: `observability/setup.py`, `observability/sinks.py`, `observability/syslog_handler.py`, and `observability/http_handler.py` may use stdlib `logging`)

Applied to files:

  • tests/unit/observability/test_enums.py
  • tests/unit/observability/test_config.py
📚 Learning: 2026-03-17T06:43:14.114Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:43:14.114Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions. Pure data models, enums, and re-exports do NOT need logging.

Applied to files:

  • tests/unit/observability/test_enums.py
  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/api/state.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic must import `from synthorg.observability import get_logger` and define `logger = get_logger(__name__)`

Applied to files:

  • tests/unit/observability/test_enums.py
  • src/synthorg/api/state.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-15T16:55:07.730Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T16:55:07.730Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/api/state.py
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, and key function entry/exit

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/api/state.py
📚 Learning: 2026-03-31T20:07:03.035Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T20:07:03.035Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising; all state transitions must log at INFO; DEBUG for object creation, internal flow, entry/exit of key functions

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/api/state.py
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/budget/**/*.py : Budget tracking includes pre-flight/in-flight checks, auto-downgrade, billing periods, cost tiers, quota/subscription. CFO includes anomaly detection, efficiency analysis, downgrade recommendations.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-04-07T22:12:30.141Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-07T22:12:30.141Z
Learning: Applies to src/synthorg/**/*.py : Use `computed_field` for derived values instead of storing + validating redundant fields (e.g. `TokenUsage.total_tokens`)

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/budget/**/*.py : Budget package (budget/): cost tracking, budget enforcement (pre-flight/in-flight checks, auto-downgrade), billing periods, cost tiers, quota/subscription tracking, CFO cost optimization (anomaly detection, efficiency analysis, downgrade recommendations, approval decisions), spending reports, budget errors (BudgetExhaustedError, DailyLimitExceededError, QuotaExhaustedError)

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : Handle errors explicitly, never silently swallow. Validate at system boundaries (user input, external APIs, config files).

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/observability/config.py
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to **/*.py : Handle errors explicitly; never silently swallow exceptions

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to **/*.py : Handle errors explicitly—never silently swallow exceptions.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-14T15:43:05.601Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T15:43:05.601Z
Learning: Applies to **/*.py : Handle errors explicitly, never silently swallow exceptions

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-04-07T22:12:30.142Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-07T22:12:30.142Z
Learning: Applies to src/synthorg/**/*.py : Handle errors explicitly, never silently swallow them

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-26T15:18:16.848Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-26T15:18:16.848Z
Learning: Applies to src/**/*.py : Handle errors explicitly, never silently swallow them. All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-04-02T21:38:30.127Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T21:38:30.127Z
Learning: Applies to {**/*.py,web/src/**/*.{ts,tsx}} : Handle errors explicitly; never silently swallow exceptions

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-04-07T22:12:30.142Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-07T22:12:30.142Z
Learning: Applies to src/synthorg/**/*.py : Validate at system boundaries (user input, external APIs, config files)

Applied to files:

  • src/synthorg/observability/config.py
📚 Learning: 2026-03-16T20:14:00.937Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T20:14:00.937Z
Learning: Applies to **/*.py : Validate: at system boundaries (user input, external APIs, config files).

Applied to files:

  • src/synthorg/observability/config.py
📚 Learning: 2026-04-01T14:22:06.315Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T14:22:06.315Z
Learning: Applies to {**/*.py,web/src/**/*.{ts,tsx}} : Validate at system boundaries (user input, external APIs, config files)

Applied to files:

  • src/synthorg/observability/config.py
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to **/*.py : Validate at system boundaries (user input, external APIs, config files)

Applied to files:

  • src/synthorg/observability/config.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/security/**/*.py : Security package (security/): SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume)

Applied to files:

  • src/synthorg/observability/config.py
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to tests/**/*.py : NEVER skip, dismiss, or ignore flaky tests — always fix them fully and fundamentally. For timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-03-31T21:07:37.470Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T21:07:37.470Z
Learning: Applies to tests/**/*.py : NEVER skip, dismiss, or ignore flaky tests — always fix them fully and fundamentally; mock `time.monotonic()` and `asyncio.sleep()` for determinism; use `asyncio.Event().wait()` for indefinite blocking instead of `asyncio.sleep(large_number)`

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-04-01T15:36:39.993Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T15:36:39.993Z
Learning: Applies to tests/**/*.py : Flaky tests: NEVER skip, dismiss, or ignore flaky tests -- always fix them fully and fundamentally. For timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-04-07T22:12:30.142Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-07T22:12:30.142Z
Learning: Applies to tests/**/*.py : NEVER skip, dismiss, or ignore flaky tests—always fix them fully and fundamentally; for timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins; for tasks that must block indefinitely until cancelled, use `asyncio.Event().wait()` instead of `asyncio.sleep(large_number)`

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to tests/**/*.py : Fix flaky tests completely and fundamentally; for timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-04-02T07:18:02.381Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T07:18:02.381Z
Learning: Applies to tests/**/*.py : Never skip, dismiss, or ignore flaky tests; fix them fully and fundamentally by mocking `time.monotonic()` and `asyncio.sleep()` for determinism or using `asyncio.Event().wait()` for indefinite blocking

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : Use DEBUG logging for object creation, internal flow, and entry/exit of key functions.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : All state transitions must log at INFO.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : All state transitions must log at INFO level

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-14T15:43:05.601Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T15:43:05.601Z
Learning: Applies to src/**/*.py : All state transitions must log at INFO level

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-04-07T22:12:30.142Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-07T22:12:30.142Z
Learning: Applies to src/synthorg/**/*.py : DEBUG level logging is for object creation, internal flow, entry/exit of key functions

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from synthorg.observability.events domain-specific modules (e.g., PROVIDER_CALL_START from events.provider). Import directly: from synthorg.observability.events.<domain> import EVENT_CONSTANT.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-15T19:14:27.144Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T19:14:27.144Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`. Never use import logging / logging.getLogger() / print() in application code.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-04-02T07:18:02.381Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T07:18:02.381Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly from the domain module

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-17T06:43:14.114Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:43:14.114Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`. Never use `import logging` / `logging.getLogger()` / `print()` in application code. Variable name: always `logger`.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-20T11:18:48.128Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T11:18:48.128Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic MUST have `from synthorg.observability import get_logger` followed by `logger = get_logger(__name__)`.

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-19T11:33:01.580Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T11:33:01.580Z
Learning: Applies to src/synthorg/**/*.py : Every module with business logic must import logger via `from synthorg.observability import get_logger` and initialize with `logger = get_logger(__name__)`

Applied to files:

  • src/synthorg/api/state.py
📚 Learning: 2026-03-20T08:28:32.845Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T08:28:32.845Z
Learning: Applies to src/synthorg/providers/**/*.py : Providers: LLM provider abstraction (LiteLLM adapter), auth types (api_key/oauth/custom_header/none), presets (PROVIDER_PRESETS), runtime CRUD (ProviderManagementService with asyncio.Lock serialization), hot-reload via AppState swap.

Applied to files:

  • src/synthorg/api/state.py
  • src/synthorg/api/app.py
📚 Learning: 2026-03-26T15:18:16.848Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-26T15:18:16.848Z
Learning: Applies to src/synthorg/api/**/*.py : Litestar API must include setup wizard, auth/, auto-wiring, and lifecycle management

Applied to files:

  • src/synthorg/api/app.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/engine/**/*.py : Engine package (engine/): agent orchestration, parallel execution, task decomposition, routing, TaskEngine (centralized single-writer), task lifecycle/recovery/shutdown, workspace isolation, coordination (4 dispatchers: SAS/centralized/decentralized/context-dependent, wave execution), approval gates (escalation detection, context parking/resume), stagnation detection (ToolRepetitionDetector, corrective prompt injection), AgentRuntimeState (execution status), context budget management, conversation compaction (oldest-turns summarizer)

Applied to files:

  • src/synthorg/api/app.py

Comment on lines +159 to +175
    def _drain_and_flush(self) -> None:
        """Drain all queued records and export as OTLP batches."""
        records: list[logging.LogRecord] = []
        while True:
            try:
                records.append(self._queue.get_nowait())
            except queue.Empty:
                break

        with self._pending_lock:
            self._pending_count = max(0, self._pending_count - len(records))

        for start in range(0, len(records), self._batch_size):
            batch = records[start : start + self._batch_size]
            if batch:
                self._export_batch(batch)


⚠️ Potential issue | 🟠 Major

Do not drain the entire queue into memory before batching.

_drain_and_flush() empties the whole queue into a list and only then slices it. Under backlog, this defeats the configured batch size and can turn a logging spike into a large transient allocation or a long shutdown stall.

Suggested fix
-        records: list[logging.LogRecord] = []
-        while True:
-            try:
-                records.append(self._queue.get_nowait())
-            except queue.Empty:
-                break
-
-        with self._pending_lock:
-            self._pending_count = max(0, self._pending_count - len(records))
-
-        for start in range(0, len(records), self._batch_size):
-            batch = records[start : start + self._batch_size]
-            if batch:
-                self._export_batch(batch)
+        while True:
+            batch: list[logging.LogRecord] = []
+            while len(batch) < self._batch_size:
+                try:
+                    batch.append(self._queue.get_nowait())
+                except queue.Empty:
+                    break
+
+            if not batch:
+                return
+
+            with self._pending_lock:
+                self._pending_count = max(0, self._pending_count - len(batch))
+
+            self._export_batch(batch)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/otlp_handler.py` around lines 159 - 175,
_drain_and_flush currently pulls the entire _queue into memory and then slices it
into batches; change _drain_and_flush to repeatedly pull up to _batch_size items
from self._queue (using get_nowait in a loop or pop batch-wise) and call
self._export_batch for each batch as you form it, so you never accumulate the
whole backlog in records. Update the pending count decrement to subtract the
number of records actually exported (inside the same critical section guarded by
self._pending_lock or by decrementing per-batch) and keep references to _queue,
_batch_size, _export_batch, _pending_lock, and _pending_count to locate where to
modify.
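The bounded-drain pattern the suggested fix describes can be demonstrated with the stdlib `queue` module alone; the names `drain_in_batches`, `batch_size`, and `export` below are illustrative, not symbols from the PR:

```python
import queue


def drain_in_batches(q: queue.Queue, batch_size: int, export) -> int:
    """Drain q in batches of at most batch_size items, exporting each
    batch as it is formed so peak memory is bounded by batch_size."""
    exported = 0
    while True:
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(q.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return exported
        export(batch)  # never holds more than batch_size records at once
        exported += len(batch)


q = queue.Queue()
for i in range(10):
    q.put(i)
batches = []
drain_in_batches(q, 4, batches.append)
print([len(b) for b in batches])  # → [4, 4, 2]
```

Unlike the original drain-then-slice version, a backlog of N records never allocates a list of N entries here, so a logging spike cannot turn into one large transient allocation.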

Comment on lines +47 to +114
    def __init__(self, *, prefix: str = "synthorg") -> None:
        self._prefix = prefix
        self.registry = CollectorRegistry()

        # -- Info --------------------------------------------------------
        self._info = Info(
            f"{prefix}_app",
            "SynthOrg application info",
            registry=self.registry,
        )
        self._info.info({"version": __version__})

        # -- Agent gauges ------------------------------------------------
        self._agents_total = Gauge(
            f"{prefix}_active_agents_total",
            "Number of active agents",
            ["status", "trust_level"],
            registry=self.registry,
        )

        # -- Task gauges -------------------------------------------------
        self._tasks_total = Gauge(
            f"{prefix}_tasks_total",
            "Number of tasks by status and agent",
            ["status", "agent"],
            registry=self.registry,
        )

        # -- Cost gauges -------------------------------------------------
        self._cost_total = Gauge(
            f"{prefix}_cost_total",
            "Total accumulated cost",
            registry=self.registry,
        )

        # -- Budget gauges -----------------------------------------------
        self._budget_used_percent = Gauge(
            f"{prefix}_budget_used_percent",
            "Accumulated cost as percentage of monthly budget limit",
            registry=self.registry,
        )
        self._budget_monthly_usd = Gauge(
            f"{prefix}_budget_monthly_usd",
            "Monthly budget limit in USD",
            registry=self.registry,
        )

        # -- Coordination gauges (push-updated) --------------------------
        self._coordination_efficiency = Gauge(
            f"{prefix}_coordination_efficiency",
            "Coordination efficiency ratio",
            registry=self.registry,
        )
        self._coordination_overhead_percent = Gauge(
            f"{prefix}_coordination_overhead_percent",
            "Coordination overhead percentage",
            registry=self.registry,
        )

        # -- Security counters -------------------------------------------
        self._security_evaluations = PromCounter(
            f"{prefix}_security_evaluations_total",
            "Security evaluation verdicts",
            ["verdict"],
            registry=self.registry,
        )

        logger.debug(METRICS_COLLECTOR_INITIALIZED, prefix=prefix)

🛠️ Refactor suggestion | 🟠 Major

Register the metric definitions from data instead of inline.

__init__ is now well past the repository's 50-line limit. A small spec table plus a loop would keep setup readable without introducing one-off helper methods.

As per coding guidelines, "**/*.py: Functions must be less than 50 lines, and files must be less than 800 lines."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/prometheus_collector.py` around lines 47 - 114,
The __init__ is too long because each metric is created inline; refactor to
register metrics from a data-driven spec so __init__ stays under 50 lines:
create a list of metric specs (tuples/dicts) containing the metric type
(Info/Gauge/PromCounter), name suffix, help text, label names (if any) and
target attribute name (e.g. "_info", "_agents_total", "_tasks_total",
"_cost_total", "_budget_used_percent", "_budget_monthly_usd",
"_coordination_efficiency", "_coordination_overhead_percent",
"_security_evaluations"), then loop over that spec to instantiate each metric
with f"{prefix}_{suffix}" and registry=self.registry and assign to the
corresponding self.<attribute>; preserve the existing call
self._info.info({"version": __version__}) and
logger.debug(METRICS_COLLECTOR_INITIALIZED, prefix=prefix).
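The data-driven registration the refactor prompt describes might look like the following sketch. The metric names mirror the PR, but the spec-table shape, the `Collector` class name, and the omission of the Info/coordination metrics are assumptions made to keep the example short:

```python
from prometheus_client import CollectorRegistry, Counter, Gauge

# (attribute, name suffix, help text, labels) — a compact spec table
# keeps __init__ short while producing the same registry contents.
_GAUGE_SPECS = [
    ("_agents_total", "active_agents_total", "Number of active agents", ["status", "trust_level"]),
    ("_tasks_total", "tasks_total", "Number of tasks by status and agent", ["status", "agent"]),
    ("_cost_total", "cost_total", "Total accumulated cost", []),
    ("_budget_used_percent", "budget_used_percent", "Cost as percent of monthly budget", []),
    ("_budget_monthly_usd", "budget_monthly_usd", "Monthly budget limit in USD", []),
]


class Collector:
    def __init__(self, prefix: str = "synthorg") -> None:
        self.registry = CollectorRegistry()
        # One loop replaces the repeated inline Gauge(...) blocks.
        for attr, suffix, doc, labels in _GAUGE_SPECS:
            metric = Gauge(f"{prefix}_{suffix}", doc, labels, registry=self.registry)
            setattr(self, attr, metric)
        self._security_evaluations = Counter(
            f"{prefix}_security_evaluations_total",
            "Security evaluation verdicts",
            ["verdict"],
            registry=self.registry,
        )
```

Because each metric is still assigned to a named attribute, call sites such as `self._cost_total.set(...)` are unchanged; only the registration moves into data.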

Comment on lines +195 to +225
    def _refresh_cost_gauge(self, total_cost: float | None) -> None:
        """Update cost gauge from a pre-fetched total."""
        if total_cost is not None:
            self._cost_total.set(total_cost)

    def _refresh_budget_metrics(
        self,
        app_state: AppState,
        total_cost: float | None,
    ) -> None:
        """Update budget utilization gauges from CostTracker config."""
        if not app_state.has_cost_tracker:
            return
        try:
            tracker = app_state.cost_tracker
            if tracker.budget_config is None:
                return
            monthly = tracker.budget_config.total_monthly
            self._budget_monthly_usd.set(monthly)
            if monthly > 0 and total_cost is not None:
                self._budget_used_percent.set(
                    min(100.0, (total_cost / monthly) * 100.0),
                )
        except (MemoryError, RecursionError):
            raise
        except Exception:
            logger.warning(
                METRICS_SCRAPE_FAILED,
                component="budget",
                exc_info=True,
            )

⚠️ Potential issue | 🟠 Major

Reset snapshot families when a source is unavailable.

_refresh_cost_gauge(), _refresh_budget_metrics(), _refresh_agent_metrics(), and _refresh_task_metrics() all have no-data/error paths that leave the previous scrape's value in place. After one successful scrape, a later outage or deconfiguration of one service keeps exporting stale totals and counts as if they were current.

Also applies to: 227-279

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/prometheus_collector.py` around lines 195 - 225,
When a source is unavailable or an exception occurs the metric methods leave
stale values; update _refresh_cost_gauge, _refresh_budget_metrics,
_refresh_agent_metrics, and _refresh_task_metrics to explicitly reset their
Prometheus metrics to a safe “no-data” value instead of doing nothing: e.g. in
_refresh_cost_gauge set self._cost_total.set(0.0) (or call the metric
reset/clear API if available) when total_cost is None; in
_refresh_budget_metrics when app_state.has_cost_tracker is False or
tracker.budget_config is None set self._budget_monthly_usd.set(0.0) and
self._budget_used_percent.set(0.0); and wrap exception handlers to reset the
same gauges (self._budget_monthly_usd, self._budget_used_percent, etc.) before
logging so errors don’t leave stale data. Ensure the same pattern is applied to
the agent and task refresh methods and their associated metric symbols.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

♻️ Duplicate comments (4)
docs/design/operations.md (1)

1839-1839: ⚠️ Potential issue | 🟡 Minor

G1 is marked closed here, but the section below still frames telemetry export as the primary open gap.

Please update the “Priority for closing gaps” and “Recommended Framing” text to remove the stale G1-open language so the page is internally consistent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/operations.md` at line 1839, The document still speaks of G1 as
an open gap despite the diff marking G1 closed (PrometheusCollector instantiated
in on_startup(), /metrics returning 200 with 9 metric families, and OTLP
implemented via HTTP/JSON); update the “Priority for closing gaps” and
“Recommended Framing” sections to remove any phrasing that treats G1 as open and
instead reflect that telemetry export is implemented (reference G1,
PrometheusCollector, on_startup(), /metrics, and OTLP HTTP/JSON) and adjust any
remaining priorities or framing language accordingly to maintain internal
consistency.
src/synthorg/observability/otlp_handler.py (1)

171-186: ⚠️ Potential issue | 🟠 Major

Drain in bounded batches instead of materializing the whole backlog.

_drain_and_flush() empties the full queue into records before slicing, so a logging spike can turn exporter lag into one large transient allocation. Pull and export at most self._batch_size records per iteration so batch_size is also your memory bound.

🛠️ Suggested fix
-        records: list[logging.LogRecord] = []
-        while True:
-            try:
-                records.append(self._queue.get_nowait())
-            except queue.Empty:
-                break
-
-        with self._pending_lock:
-            self._pending_count = max(0, self._pending_count - len(records))
-
-        for start in range(0, len(records), self._batch_size):
-            batch = records[start : start + self._batch_size]
-            if batch:
-                self._export_batch(batch)
+        while True:
+            batch: list[logging.LogRecord] = []
+            while len(batch) < self._batch_size:
+                try:
+                    batch.append(self._queue.get_nowait())
+                except queue.Empty:
+                    break
+
+            if not batch:
+                return
+
+            with self._pending_lock:
+                self._pending_count = max(0, self._pending_count - len(batch))
+
+            self._export_batch(batch)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/otlp_handler.py` around lines 171 - 186, the current
implementation of _drain_and_flush collects the entire queue into records and
then slices into batches which can cause large memory spikes; change it to
repeatedly pop up to self._batch_size items from self._queue and call
_export_batch for each batch immediately so memory stays bounded. Specifically,
in _drain_and_flush use a loop that builds a single batch by calling
self._queue.get_nowait up to self._batch_size times (breaking on queue.Empty),
call self._export_batch(batch) when batch is non-empty, and under
self._pending_lock decrement self._pending_count by the number of records
actually consumed (or update it per batch), repeating until the queue is empty;
keep references to self._queue, self._batch_size, self._export_batch, and
self._pending_count/_pending_lock to locate the changes.
src/synthorg/observability/prometheus_collector.py (2)

195-225: ⚠️ Potential issue | 🟠 Major

Clear snapshot metrics when a source is absent or errors.

After one successful scrape, the no-data and exception paths here leave the previous cost, budget, agent, and task values in the registry. /metrics will keep exporting stale numbers during outages or after a service is deconfigured unless these gauges and label families are reset before returning.

Also applies to: 227-279


47-114: 🛠️ Refactor suggestion | 🟠 Major

Shrink __init__(); it is still well over the 50-line cap.

Inline metric registration keeps this constructor around 67 body lines. A small metric-spec table plus a loop would keep startup readable and satisfy the repository limit without changing behavior.

As per coding guidelines: "Keep functions under 50 lines and files under 800 lines."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/prometheus_collector.py` around lines 47 - 114,
The __init__ constructor is too long due to inline metric registrations;
refactor it by extracting the metric definitions into a small spec list and
iterating to register metrics (or call a tiny helper like
_register_gauge/_register_counter) so you keep the same attributes (_info,
_agents_total, _tasks_total, _cost_total, _budget_used_percent,
_budget_monthly_usd, _coordination_efficiency, _coordination_overhead_percent,
_security_evaluations) and the same CollectorRegistry usage but replace the
repeated Gauge/PromCounter instantiation blocks with a loop over a table of
metric specs (name, doc, labels, type, target_attribute) to bring __init__ under
50 lines while preserving behavior and the METRICS_COLLECTOR_INITIALIZED debug
call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/architecture/acg-glossary.md`:
- Line 24: Replace the hardcoded "82+ event constant domains" text in the
Execution Trace row of acg-glossary.md (the table cell containing "Strong |
SynthOrg's trace is richer... 82+ event constant domains.") with a less brittle
phrase such as "80+ event constant domains" or "numerous event constant domains
organized by domain" to avoid frequent doc churn; update the table cell text
where the string "82+ event constant domains" appears so the rest of the row
(mentions of per-turn cost, token usage, tool fingerprints, stagnation signals,
quality scores) remains unchanged.

In `@scripts/run_affected_tests.py`:
- Around line 150-157: The _run_pytest function sets workers = "4" when
full_suite is True, which violates the repo rule to always use -n 8 for local
pytest; change the implementation so workers is always "8" (never "4") and
ensure any call sites that currently route fallback paths through
full_suite=True (the fallback invocations referenced near the previous diff) do
not force fewer workers—keep full_suite semantics for other behavior but always
pass or use "-n 8" when assembling the pytest args in _run_pytest so all local
runs use -n 8.

In `@src/synthorg/api/controllers/metrics.py`:
- Around line 49-53: Wrap the calls to
app_state.prometheus_collector.refresh(app_state) and
generate_latest(collector.registry) in a try/except block that catches all
exceptions, logs a structured METRICS_SCRAPE_FAILED event via logger.error or
logger.warning with context (include exception info, collector identity from
app_state.prometheus_collector and the function names collector.refresh and
generate_latest), and then return a scrape-safe response (e.g., an empty metrics
payload or a minimal valid exposition text) instead of letting the framework
raise a raw 500; ensure you still emit METRICS_SCRAPE_COMPLETED on success and
reference the same symbols (collector.refresh, generate_latest,
METRICS_SCRAPE_FAILED, METRICS_SCRAPE_COMPLETED) so future reviewers can find
the change.
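A stdlib-only sketch of the scrape-safe shape described above. `render_metrics`, the `SimpleNamespace` app state, and the stubbed `generate_latest` are stand-ins for the real symbols in `metrics.py`; only the try/except pattern is the point:

```python
import asyncio
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)

def generate_latest(registry: object) -> bytes:
    """Stub for prometheus_client.generate_latest."""
    return b"synthorg_cost_total 0.0\n"

EMPTY_EXPOSITION = b""  # an empty body is still a valid exposition response

async def render_metrics(app_state) -> bytes:
    collector = app_state.prometheus_collector
    try:
        await collector.refresh(app_state)
        payload = generate_latest(collector.registry)
    except Exception:
        # Structured failure event instead of a raw 500 to the scraper.
        logger.exception("METRICS_SCRAPE_FAILED")
        return EMPTY_EXPOSITION
    logger.info("METRICS_SCRAPE_COMPLETED")
    return payload

async def _ok_refresh(state) -> None:
    return None

async def _broken_refresh(state) -> None:
    raise RuntimeError("backend down")

healthy = SimpleNamespace(
    prometheus_collector=SimpleNamespace(refresh=_ok_refresh, registry=None)
)
broken = SimpleNamespace(
    prometheus_collector=SimpleNamespace(refresh=_broken_refresh, registry=None)
)
ok = asyncio.run(render_metrics(healthy))
failed = asyncio.run(render_metrics(broken))
```

Prometheus treats an empty 200 body as a valid (if empty) scrape, so the target stays up in the scraper's view while the failure is visible in logs.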

In `@src/synthorg/observability/config.py`:
- Around line 122-127: The current except socket.gaierror block silently returns
(allowing unresolved hostnames to bypass SSRF checks); change this to fail
closed by surfacing the resolution failure: catch socket.gaierror around
socket.getaddrinfo(hostname, ...) and raise a configuration/validation error
(e.g., ValueError or a ConfigValidationError) that includes hostname and the
original exception; alternatively, if you prefer runtime re-checking, add a
clear comment and perform the same getaddrinfo lookup immediately before each
export path (where the exporter is invoked) and raise if resolution fails—use
the symbols addrs, hostname, socket.getaddrinfo and socket.gaierror to locate
and update the code.

In `@src/synthorg/observability/enums.py`:
- Around line 52-61: The OtlpProtocol enum currently exposes HTTP_JSON which
conflicts with the PR objective to support HTTP/protobuf; update the enum
OtlpProtocol so the supported transport is named/valued for HTTP/protobuf (e.g.,
replace or add HTTP_PROTOBUF = "http/protobuf" and remove or deprecate
HTTP_JSON), then update any reference/tests/docs that use OtlpProtocol.HTTP_JSON
to use OtlpProtocol.HTTP_PROTOBUF (or add a migration/alias if needed) so the
public config and design docs/tests align with the HTTP/protobuf scope;
alternatively, if you intentionally kept HTTP_JSON, add an explicit note in the
design docs and tests documenting the scope change and adjust enum docstring to
reflect the approved deviation.

In `@src/synthorg/observability/otlp_handler.py`:
- Around line 61-73: The handler currently only supports HTTP_JSON but the
design calls for HTTP/protobuf support; either implement HTTP/protobuf or
explicitly restrict the API to JSON. To fix, in the constructor handling of the
protocol (referencing OtlpProtocol and the protocol parameter) either add a
branch for HTTP_PROTOBUF that sets up protobuf serialization and Content-Type
"application/x-protobuf" and uses the OTLP protobuf request encoding for the
export/send methods (see the send/serialize logic around lines 203-223), or
change the accepted protocols to reject HTTP_PROTOBUF (raise NotImplementedError
for HTTP_PROTOBUF and gRPC) and update the class docstring and any config
surface to state "JSON-only" (update references to OtlpProtocol usage and any
code that sets Content-Type to "application/json") so the implementation matches
the declared scope.

In `@src/synthorg/observability/prometheus_collector.py`:
- Around line 178-180: The availability guards are checking bound methods
(has_cost_tracker, has_agent_registry, has_task_engine) instead of calling them,
so they always evaluate True; change the checks to call the boolean methods
(e.g., use app_state.has_cost_tracker(), app_state.has_agent_registry(),
app_state.has_task_engine()) before entering the try blocks that access
app_state.cost_tracker.get_total_cost,
app_state.agent_registry.get_active_agents, and
app_state.task_engine.get_queue_stats respectively, and apply the same fix to
the other occurrences noted (the blocks around lines 206-207, 229-230, 257-258)
so missing services are skipped rather than hitting the raising accessors.
- Around line 116-143: The metrics helper record_security_verdict currently
rejects a valid upstream verdict "output_scan" because _VALID_VERDICTS only
contains {"allow","deny","escalate"}; update the function to either accept the
full shared domain or restrict its callers to pre-tool verdicts. Fix options:
(1) add "output_scan" to _VALID_VERDICTS and ensure the _security_evaluations
Prometheus metric (labels via _security_evaluations.labels) can handle that
label, or (2) rename/document record_security_verdict as record_pre_tool_verdict
and validate against a narrowed set at call sites so AuditVerdictStr (from
src/synthorg/security/models.py) remains unchanged; choose one approach and make
consistent changes to _VALID_VERDICTS, record_security_verdict, and callers to
avoid raising ValueError for upstream-valid verdicts.

In `@tests/unit/api/controllers/test_metrics.py`:
- Around line 65-69: Update the test test_calls_refresh to assert the async
refresh was awaited rather than merely called: replace the current use of
collector.refresh.assert_called_once() with
collector.refresh.assert_awaited_once() so the AsyncMock returned by
_make_collector is verified as awaited by the controller when GET /metrics is
exercised in the TestClient; ensure you reference the collector fixture produced
by _make_collector and the test method test_calls_refresh when making this
change.
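The distinction the comment relies on can be shown with a bare `AsyncMock` (the `controller` coroutine here is a stand-in for the real GET /metrics handler): an `AsyncMock` records calls and awaits separately, and only the await assertion proves the controller actually awaited `refresh`.

```python
import asyncio
from unittest.mock import AsyncMock

collector = AsyncMock()

async def controller(state: object) -> None:
    await collector.refresh(state)

asyncio.run(controller(object()))

collector.refresh.assert_called_once()   # passes even for un-awaited calls
collector.refresh.assert_awaited_once()  # the stronger check the test needs
```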

---

Duplicate comments:
In `@docs/design/operations.md`:
- Line 1839: The document still speaks of G1 as an open gap despite the diff
marking G1 closed (PrometheusCollector instantiated in on_startup(), /metrics
returning 200 with 9 metric families, and OTLP implemented via HTTP/JSON);
update the “Priority for closing gaps” and “Recommended Framing” sections to
remove any phrasing that treats G1 as open and instead reflect that telemetry
export is implemented (reference G1, PrometheusCollector, on_startup(),
/metrics, and OTLP HTTP/JSON) and adjust any remaining priorities or framing
language accordingly to maintain internal consistency.

In `@src/synthorg/observability/otlp_handler.py`:
- Around line 171-186: The current implementation of _drain_and_flush collects the
entire queue into records and then slices into batches, which can cause large
memory spikes; change it to repeatedly pop up to self._batch_size items from
self._queue and call _export_batch for each batch immediately so memory stays
bounded. Specifically, in _drain_and_flush use a loop that builds a single batch
by calling self._queue.get_nowait up to self._batch_size times (breaking on
queue.Empty), call self._export_batch(batch) when batch is non-empty, and under
self._pending_lock decrement self._pending_count by the number of records
actually consumed (or update it per batch), repeating until the queue is empty;
keep references to self._queue, self._batch_size, self._export_batch, and
self._pending_count/_pending_lock to locate the changes.
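The bounded drain described above can be sketched with the stdlib queue. The class below is a minimal stand-in: attribute names mirror the `OtlpHandler` fields cited in the comment, and `_export_batch` just records batches instead of sending them:

```python
import queue
import threading

class Drainer:
    """Sketch: pop at most _batch_size items per pass, flush each batch."""

    def __init__(self, batch_size: int) -> None:
        self._queue: queue.Queue = queue.Queue()
        self._batch_size = batch_size
        self._pending_count = 0
        self._pending_lock = threading.Lock()
        self.exported: list[list] = []

    def put(self, record: object) -> None:
        self._queue.put(record)
        with self._pending_lock:
            self._pending_count += 1

    def _export_batch(self, batch: list) -> None:
        self.exported.append(batch)

    def _drain_and_flush(self) -> None:
        while True:
            batch: list = []
            for _ in range(self._batch_size):
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            if not batch:
                return
            self._export_batch(batch)  # flush before pulling more records
            with self._pending_lock:
                self._pending_count -= len(batch)

d = Drainer(batch_size=2)
for i in range(5):
    d.put(i)
d._drain_and_flush()
# exports batches of at most 2: [[0, 1], [2, 3], [4]]
```

Memory stays bounded at one batch, and `_pending_count` is decremented per batch actually consumed rather than once for the whole queue.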

In `@src/synthorg/observability/prometheus_collector.py`:
- Around line 47-114: The __init__ constructor is too long due to inline metric
registrations; refactor it by extracting the metric definitions into a small
spec list and iterating to register metrics (or call a tiny helper like
_register_gauge/_register_counter) so you keep the same attributes (_info,
_agents_total, _tasks_total, _cost_total, _budget_used_percent,
_budget_monthly_usd, _coordination_efficiency, _coordination_overhead_percent,
_security_evaluations) and the same CollectorRegistry usage but replace the
repeated Gauge/PromCounter instantiation blocks with a loop over a table of
metric specs (name, doc, labels, type, target_attribute) to bring __init__ under
50 lines while preserving behavior and the METRICS_COLLECTOR_INITIALIZED debug
call.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 147fe9bd-9c0d-45f1-b9d9-8381b3d28bea

📥 Commits

Reviewing files that changed from the base of the PR and between 846a0d6 and 3b04890.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (24)
  • CLAUDE.md
  • docs/DESIGN_SPEC.md
  • docs/architecture/acg-glossary.md
  • docs/design/operations.md
  • pyproject.toml
  • scripts/run_affected_tests.py
  • src/synthorg/api/app.py
  • src/synthorg/api/controllers/__init__.py
  • src/synthorg/api/controllers/metrics.py
  • src/synthorg/api/state.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/enums.py
  • src/synthorg/observability/events/metrics.py
  • src/synthorg/observability/otlp_handler.py
  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/observability/sinks.py
  • tests/unit/api/controllers/test_metrics.py
  • tests/unit/observability/conftest.py
  • tests/unit/observability/test_config.py
  • tests/unit/observability/test_enums.py
  • tests/unit/observability/test_events.py
  • tests/unit/observability/test_otlp_handler.py
  • tests/unit/observability/test_prometheus_collector.py
  • tests/unit/observability/test_sinks.py

Comment on lines +122 to +127
try:
addrs = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
except socket.gaierror:
# DNS resolution failed -- skip check (hostname may be valid
# at runtime even if not resolvable at config-load time).
return

⚠️ Potential issue | 🟠 Major

Fail closed when the OTLP hostname cannot be resolved.

Returning on socket.gaierror lets the endpoint bypass the SSRF guard entirely. A hostname that resolves to a private address only in the runtime environment will still be accepted here and used later by the exporter. Reject unresolved hosts at validation time, or repeat the resolution check immediately before each export.

🛡️ Suggested fix
-        except socket.gaierror:
-            # DNS resolution failed -- skip check (hostname may be valid
-            # at runtime even if not resolvable at config-load time).
-            return
+        except socket.gaierror as exc:
+            msg = (
+                f"otlp_endpoint hostname {hostname!r} could not be resolved "
+                "during safety validation"
+            )
+            raise ValueError(msg) from exc
As per coding guidelines: "Validate at system boundaries (user input, external APIs, config files)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/config.py` around lines 122 - 127, The current
except socket.gaierror block silently returns (allowing unresolved hostnames to
bypass SSRF checks); change this to fail closed by surfacing the resolution
failure: catch socket.gaierror around socket.getaddrinfo(hostname, ...) and
raise a configuration/validation error (e.g., ValueError or a
ConfigValidationError) that includes hostname and the original exception;
alternatively, if you prefer runtime re-checking, add a clear comment and
perform the same getaddrinfo lookup immediately before each export path (where
the exporter is invoked) and raise if resolution fails—use the symbols addrs,
hostname, socket.getaddrinfo and socket.gaierror to locate and update the code.

Comment on lines +52 to +61
class OtlpProtocol(StrEnum):
"""OpenTelemetry Protocol transport.

Attributes:
HTTP_JSON: HTTP with JSON encoding (the only implemented transport).
GRPC: gRPC transport (not implemented; rejected at handler init).
"""

HTTP_JSON = "http/json"
GRPC = "grpc"

⚠️ Potential issue | 🟠 Major

OtlpProtocol contract appears to drift from the stated issue objective.

The objective text for this PR tracks OTLP as HTTP/protobuf (with gRPC rejected), but this enum exposes HTTP_JSON = "http/json" as the primary supported transport. That creates a different public config contract than the issue scope.

Please either:

  1. align enum/config/docs/tests to HTTP/protobuf, or
  2. explicitly document and approve the scope change in design docs.
Minimal enum alignment sketch
 class OtlpProtocol(StrEnum):
@@
-    Attributes:
-        HTTP_JSON: HTTP with JSON encoding (the only implemented transport).
+    Attributes:
+        HTTP_PROTOBUF: HTTP with protobuf encoding (the only implemented transport).
         GRPC: gRPC transport (not implemented; rejected at handler init).
@@
-    HTTP_JSON = "http/json"
+    HTTP_PROTOBUF = "http/protobuf"
     GRPC = "grpc"

As per coding guidelines: if implementation deviates from the design spec/scope, surface it explicitly and update the relevant docs/design/ page once approved.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/enums.py` around lines 52 - 61, The OtlpProtocol
enum currently exposes HTTP_JSON which conflicts with the PR objective to
support HTTP/protobuf; update the enum OtlpProtocol so the supported transport
is named/valued for HTTP/protobuf (e.g., replace or add HTTP_PROTOBUF =
"http/protobuf" and remove or deprecate HTTP_JSON), then update any
reference/tests/docs that use OtlpProtocol.HTTP_JSON to use
OtlpProtocol.HTTP_PROTOBUF (or add a migration/alias if needed) so the public
config and design docs/tests align with the HTTP/protobuf scope; alternatively,
if you intentionally kept HTTP_JSON, add an explicit note in the design docs and
tests documenting the scope change and adjust enum docstring to reflect the
approved deviation.

Comment on lines +178 to +180
if app_state.has_cost_tracker:
try:
total_cost = await app_state.cost_tracker.get_total_cost()

⚠️ Potential issue | 🟠 Major

These availability guards are checking bound methods, not booleans.

In src/synthorg/api/state.py, has_cost_tracker, has_agent_registry, and has_task_engine are plain methods. These checks are therefore always truthy, so missing services fall through to the raising accessors and log false scrape failures instead of being skipped cleanly. That also explains why the current unit tests do not catch this—they mock these accessors as properties.

🛠️ Suggested fix
-        if app_state.has_cost_tracker:
+        if app_state.has_cost_tracker():
             try:
                 total_cost = await app_state.cost_tracker.get_total_cost()
@@
-        if not app_state.has_cost_tracker:
+        if not app_state.has_cost_tracker():
             return
@@
-        if not app_state.has_agent_registry:
+        if not app_state.has_agent_registry():
             return
@@
-        if not app_state.has_task_engine:
+        if not app_state.has_task_engine():
             return

Also applies to: 206-207, 229-230, 257-258

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/prometheus_collector.py` around lines 178 - 180,
The availability guards are checking bound methods (has_cost_tracker,
has_agent_registry, has_task_engine) instead of calling them, so they always
evaluate True; change the checks to call the boolean methods (e.g., use
app_state.has_cost_tracker(), app_state.has_agent_registry(),
app_state.has_task_engine()) before entering the try blocks that access
app_state.cost_tracker.get_total_cost,
app_state.agent_registry.get_active_agents, and
app_state.task_engine.get_queue_stats respectively, and apply the same fix to
the other occurrences noted (the blocks around lines 206-207, 229-230, 257-258)
so missing services are skipped rather than hitting the raising accessors.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
docs/design/operations.md (1)

1846-1849: ⚠️ Potential issue | 🟡 Minor

Resolve the remaining stale G1 statement in the same section.

These lines correctly mark G1 closed, but the later “Recommended Framing” text (Lines 1856-1857) still says telemetry export is the primary gap, which is contradictory. Please update that trailing paragraph to keep the section internally consistent.

Based on learnings: "If implementation deviates from the spec, alert the user and explain why before proceeding. Every deviation needs explicit user approval and the relevant docs/design/ page must be updated to reflect the new reality."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/operations.md` around lines 1846 - 1849, The trailing
"Recommended Framing" paragraph is stale because G1 (telemetry export) is
already closed—PrometheusCollector is instantiated in on_startup(), /metrics
returns 200 with 9 metric families, and the OTLP HTTP/JSON sink is
implemented—so update that paragraph to remove/replace the claim that telemetry
export is the primary gap and instead call out G3 (policy-as-code) and G4
(coordination metrics API) as the highest remaining priorities with G2 as
smallest; also add the explicit deviation policy note ("If implementation
deviates from the spec, alert the user and explain why before proceeding...")
and ensure the docs/design/ page is updated to reflect any accepted deviations.
src/synthorg/observability/prometheus_collector.py (1)

178-178: ⚠️ Potential issue | 🔴 Critical

Method reference instead of method call — availability checks are always truthy.

has_cost_tracker, has_agent_registry, and has_task_engine are methods on AppState (see src/synthorg/api/state.py), but they're being evaluated as bound method objects rather than called. A bound method is always truthy, so these guards never skip when services are unavailable.

🐛 Proposed fix
-        if app_state.has_cost_tracker:
+        if app_state.has_cost_tracker():
             try:
                 total_cost = await app_state.cost_tracker.get_total_cost()

Apply the same fix at lines 206, 229, and 257:

  • Line 206: if not app_state.has_cost_tracker():
  • Line 229: if not app_state.has_agent_registry():
  • Line 257: if not app_state.has_task_engine():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/prometheus_collector.py` at line 178, The
availability checks are using bound method objects instead of calling them, so
replace uses of app_state.has_cost_tracker, app_state.has_agent_registry, and
app_state.has_task_engine with method calls (app_state.has_cost_tracker(),
app_state.has_agent_registry(), app_state.has_task_engine()) and ensure the
boolean logic is correct (use `if not app_state.has_cost_tracker():` or `if
app_state.has_cost_tracker():` matching the original intent) in the
PrometheusCollector code where these guards occur.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/synthorg/api/controllers/metrics.py`:
- Around line 41-47: The guard mistakenly checks the bound method object instead
of calling it: change the condition to call app_state.has_prometheus_collector()
so it evaluates the boolean result; update the block around
has_prometheus_collector (and the early-return 503
Response/METRICS_SCRAPE_FAILED log) to trigger when the method returns False
rather than testing the method object itself.

In `@src/synthorg/observability/prometheus_collector.py`:
- Around line 119-130: Update the docstring for record_security_verdict to list
all allowed verdict values to match _VALID_VERDICTS (including "output_scan" in
addition to "allow", "deny", and "escalate"); modify the "Args" section so the
verdict description enumerates the full set of valid strings and optionally
refers to _VALID_VERDICTS for canonical source.

In `@tests/unit/api/controllers/test_metrics.py`:
- Around line 17-23: The test incorrectly mocks has_prometheus_collector as a
PropertyMock but AppState defines has_prometheus_collector as a method; change
the mock to be callable (e.g., set type(mock_state).has_prometheus_collector =
MagicMock(return_value=(collector is not None)) or patch the instance method
accordingly) and keep prometheus_collector as a PropertyMock only when the
collector is non-None; update references to has_prometheus_collector() in the
test to call the mock as a method so the mock signature matches the real
AppState.has_prometheus_collector method.

---

Duplicate comments:
In `@docs/design/operations.md`:
- Around line 1846-1849: The trailing "Recommended Framing" paragraph is stale
because G1 (telemetry export) is already closed—PrometheusCollector is
instantiated in on_startup(), /metrics returns 200 with 9 metric families, and
the OTLP HTTP/JSON sink is implemented—so update that paragraph to
remove/replace the claim that telemetry export is the primary gap and instead
call out G3 (policy-as-code) and G4 (coordination metrics API) as the highest
remaining priorities with G2 as smallest; also add the explicit deviation policy
note ("If implementation deviates from the spec, alert the user and explain why
before proceeding...") and ensure the docs/design/ page is updated to reflect
any accepted deviations.

In `@src/synthorg/observability/prometheus_collector.py`:
- Line 178: The availability checks are using bound method objects instead of
calling them, so replace uses of app_state.has_cost_tracker,
app_state.has_agent_registry, and app_state.has_task_engine with method calls
(app_state.has_cost_tracker(), app_state.has_agent_registry(),
app_state.has_task_engine()) and ensure the boolean logic is correct (use `if
not app_state.has_cost_tracker():` or `if app_state.has_cost_tracker():`
matching the original intent) in the PrometheusCollector code where these guards
occur.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b731e700-8625-4724-b12d-c6e1c9092797

📥 Commits

Reviewing files that changed from the base of the PR and between 3b04890 and c14c6b5.

📒 Files selected for processing (5)
  • docs/architecture/acg-glossary.md
  • docs/design/operations.md
  • src/synthorg/api/controllers/metrics.py
  • src/synthorg/observability/prometheus_collector.py
  • tests/unit/api/controllers/test_metrics.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Build Web
  • GitHub Check: Build Backend
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

ALWAYS read the relevant docs/design/ page before implementing any feature or planning any issue. DESIGN_SPEC.md is a pointer file linking to the 12 design pages. The design spec is the starting point for architecture, data models, and behavior.

Files:

  • docs/architecture/acg-glossary.md
  • docs/design/operations.md
**/*

📄 CodeRabbit inference engine (CLAUDE.md)

NEVER use cd in Bash commands -- the working directory is already set to the project root. Use absolute paths or run commands directly. Exception: bash -c "cd <dir> && <cmd>" is safe (runs in a child process). Use this for tools without a -C flag -- e.g. bash -c "cd web && npm install".

NEVER use Bash to write or modify files -- use the Write or Edit tools. Do not use cat >, cat << EOF, echo >, echo >>, sed -i, python -c "open(...).write(...)", or `tee` to create or modify files (read-only/inspection uses like piping to stdout are fine). This applies to all files and all subagents.

Branches must follow <type>/<slug> naming from main

Files:

  • docs/architecture/acg-glossary.md
  • src/synthorg/api/controllers/metrics.py
  • docs/design/operations.md
  • tests/unit/api/controllers/test_metrics.py
  • src/synthorg/observability/prometheus_collector.py
**/*.{py,ts,tsx,go}

📄 CodeRabbit inference engine (CLAUDE.md)

No from __future__ import annotations in Python files -- Python 3.14 has PEP 649 native lazy annotations

Files:

  • src/synthorg/api/controllers/metrics.py
  • tests/unit/api/controllers/test_metrics.py
  • src/synthorg/observability/prometheus_collector.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use PEP 758 except syntax: except A, B: (no parentheses) -- ruff enforces this on Python 3.14

All public functions must have type hints. Use mypy strict mode for type checking.

Use Google-style docstrings, required on public classes and functions (enforced by ruff D rules)

Create new objects instead of mutating existing ones. For non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement. For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, serializing for persistence).

Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves (e.g. agent execution state, task progress). Never mix static config fields with mutable runtime fields in one model.

Use Pydantic v2 with BaseModel, model_validator, computed_field, and ConfigDict. In all ConfigDict declarations, use allow_inf_nan=False to reject NaN/Inf in numeric fields at validation time. Use @computed_field for derived values instead of storing + validating redundant fields (e.g. TokenUsage.total_tokens). Use NotBlankStr (from core.types) for all identifier/name fields -- including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants -- instead of manual whitespace validators.

Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (e.g. multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare create_task. Existing code is being migrated incrementally.

Keep functions under 50 lines and files under 800 lines

Handle errors explicitly, never silently swallow exceptions

Validate at system boundaries (user input, external APIs, config files)

Use line length of 88 ...

Files:

  • src/synthorg/api/controllers/metrics.py
  • tests/unit/api/controllers/test_metrics.py
  • src/synthorg/observability/prometheus_collector.py
src/**/*.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

  • src/synthorg/api/controllers/metrics.py
  • src/synthorg/observability/prometheus_collector.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow

Maintain 80% minimum code coverage (enforced in CI)

Use asyncio_mode = "auto" for async tests -- no manual @pytest.mark.asyncio needed

ALWAYS include -n 8 when running pytest locally for parallelism via pytest-xdist, never run tests sequentially. CI uses -n auto (fewer cores on runners).

Prefer @pytest.mark.parametrize for testing similar cases

Use Hypothesis for property-based testing with @given + @settings. Hypothesis profiles configured in tests/conftest.py: ci (deterministic, max_examples=10 + derandomize=True), dev (1000 examples), fuzz (10,000 examples, no deadline), extreme (500,000 examples, no deadline). Controlled via HYPOTHESIS_PROFILE env var. When Hypothesis finds a failure, it is a real bug -- read the failing example, fix the underlying bug, and add an explicit @example(...) decorator permanently.

NEVER skip, dismiss, or ignore flaky tests -- always fix them fully and fundamentally. For timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic instead of widening timing margins. For tasks that must block indefinitely until cancelled, use asyncio.Event().wait() instead of asyncio.sleep(large_number).

Files:

  • tests/unit/api/controllers/test_metrics.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

  • tests/unit/api/controllers/test_metrics.py
🧠 Learnings (39)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability includes structured logging via `get_logger(__name__)`, correlation tracking, and log sinks.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T12:09:03.440Z
Learning: If implementation deviates from the spec (better approach found, scope evolved, etc.), alert the user and explain why -- user decides whether to proceed or update the spec. Do NOT silently diverge -- every deviation needs explicit user approval.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T12:09:03.440Z
Learning: When approved deviations occur, update the relevant `docs/design/` page to reflect the new reality.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T12:09:03.440Z
Learning: Every implementation plan must be presented to the user for accept/deny before coding starts. At every phase of planning and implementation, be critical -- actively look for ways to improve the design in the spirit of what we're building (robustness, correctness, simplicity, future-proofing where it's free). Surface improvements as suggestions, not silent changes -- user decides. Prioritize issues by dependency order, not priority labels -- unblocked dependencies come first.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T12:09:03.440Z
Learning: After finishing an issue implementation: always create a feature branch (`<type>/<slug>`), commit, and push -- do NOT create a PR automatically. Do NOT leave work uncommitted on main -- branch, commit, push immediately after finishing.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T12:09:03.440Z
Learning: NEVER create a PR directly -- `gh pr create` is blocked by hookify. ALWAYS use `/pre-pr-review` to create PRs -- it runs automated checks + review agents + fixes before creating the PR. For trivial/docs-only changes: `/pre-pr-review quick` skips agents but still runs automated checks. After the PR exists, use `/aurelio-review-pr` to handle external reviewer feedback. Fix everything valid -- never skip valid issues found by review agents.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T12:09:03.440Z
Learning: Pre-commit hooks enforce: trailing-whitespace, end-of-file-fixer, check-yaml, check-toml, check-json, check-merge-conflict, check-added-large-files, no-commit-to-branch (main), ruff check+format, gitleaks, hadolint (Dockerfile linting), golangci-lint + go vet (CLI, conditional on `cli/**/*.go`), no-em-dashes, no-redundant-timeout, eslint-web (web dashboard, zero warnings, conditional on `web/src/**/*.{ts,tsx}`)
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T12:09:03.440Z
Learning: Hookify rules (committed in `.claude/hookify.*.md`): `block-pr-create` blocks direct `gh pr create` (must use `/pre-pr-review`), `enforce-parallel-tests` enforces `-n 8` with pytest, `no-cd-prefix` blocks `cd` prefix in Bash commands, `no-local-coverage` blocks `--cov` flags locally (CI handles coverage)
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T12:09:03.440Z
Learning: Pre-push hooks: mypy type-check (affected modules only) + pytest unit tests (affected modules only) + golangci-lint + go vet + go test (CLI) + eslint-web (web dashboard) (fast gate before push, skipped in pre-commit.ci). Foundational module changes (core, config, observability) or conftest changes trigger full runs.
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T12:09:03.440Z
Learning: Use `gh issue list` via Bash for GitHub issue queries (not MCP tools) -- MCP `list_issues` has unreliable field data
Learnt from: CR
URL: 
File: CLAUDE.md:undefined-undefined
Timestamp: 2026-04-08T12:09:03.440Z
Learning: Use squash merge strategy -- PR body becomes the squash commit message on main. Trailers (e.g. `Release-As`, `Closes #N`) must be in the PR body to land in the final commit. Preserve existing `Closes #NNN` references -- never remove unless explicitly asked.
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability package (observability/): structured logging, correlation tracking, log sinks; event constants organized by domain under observability/events/ (e.g., events.api, events.tool, events.git, events.context_budget, events.backup)

Applied to files:

  • docs/architecture/acg-glossary.md
📚 Learning: 2026-03-18T21:23:23.586Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-18T21:23:23.586Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from the domain-specific module under synthorg.observability.events (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool). Import directly from synthorg.observability.events.<domain>.

Applied to files:

  • docs/architecture/acg-glossary.md
📚 Learning: 2026-03-20T11:18:48.128Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T11:18:48.128Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.

Applied to files:

  • docs/architecture/acg-glossary.md
📚 Learning: 2026-03-15T18:28:13.207Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:28:13.207Z
Learning: Applies to src/synthorg/**/*.py : Event names: always use constants from domain-specific modules under synthorg.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget, etc.). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`.

Applied to files:

  • docs/architecture/acg-glossary.md
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`); import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`

Applied to files:

  • docs/architecture/acg-glossary.md
📚 Learning: 2026-04-08T11:07:20.358Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-08T11:07:20.358Z
Learning: Applies to src/synthorg/**/*.py : Use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`). Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`

Applied to files:

  • docs/architecture/acg-glossary.md
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to src/synthorg/**/*.py : Always use event name constants from domain-specific modules under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`); import directly rather than using string literals

Applied to files:

  • docs/architecture/acg-glossary.md
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Engine: Agent orchestration, execution loops, parallel execution, task decomposition, routing, task assignment, centralized single-writer task state engine (TaskEngine), task lifecycle, recovery, shutdown, workspace isolation, coordination (multi-agent pipeline: TopologyDispatcher protocol, 4 dispatchers — SAS/centralized/decentralized/context-dependent, wave execution, workspace lifecycle integration, CoordinationSectionConfig company config bridge, build_coordinator factory), coordination error classification, prompt policy validation, checkpoint recovery (checkpoint/, per-turn persistence, heartbeat detection, CheckpointRecoveryStrategy), approval gate (escalation detection, context parking/resume, EscalationInfo/ResumePayload models), stagnation detection (stagnation/, StagnationDetector protocol, ToolRepetitionDetector, dual-signal analysis, corrective prompt injection), agent runtime state (AgentRuntimeState, lightweight per-agent execution status for dashboard queries and recove...

Applied to files:

  • docs/architecture/acg-glossary.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/api/**/*.py : API package (api/): Litestar REST + WebSocket with controllers, guards, channels, JWT + API key + WS ticket auth, approval gate integration, coordination endpoint, collaboration endpoint, settings endpoint, provider management endpoint (CRUD + test + presets), backup endpoint, RFC 9457 structured errors, AppState hot-reload slots, service auto-wiring (Phase 1 at construction, Phase 2 on startup), lifecycle helpers

Applied to files:

  • src/synthorg/api/controllers/metrics.py
  • docs/design/operations.md
📚 Learning: 2026-04-08T11:07:20.359Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-08T11:07:20.359Z
Learning: If implementation deviates from the spec (better approach found, scope evolved, etc.), alert the user and explain why -- user decides whether to proceed or update the spec. Do NOT silently diverge. When approved deviations occur, update the relevant `docs/design/` page to reflect the new reality

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-18T08:23:08.912Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-18T08:23:08.912Z
Learning: When approved deviations occur, update the relevant `docs/design/` page to reflect the new reality.

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-31T14:31:11.894Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:31:11.894Z
Learning: Applies to docs/design/*.md : Update the relevant `docs/design/` page when approved deviations occur to reflect the new reality

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-30T10:41:40.176Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-30T10:41:40.176Z
Learning: Update the relevant `docs/design/` page to reflect new reality when approved deviations from spec occur

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-21T11:08:01.542Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-21T11:08:01.542Z
Learning: When implementation deviates from the spec (better approach found, scope evolved, etc.), alert the user and explain why — user decides whether to proceed or update the spec. Do NOT silently diverge — every deviation needs explicit user approval.

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to docs/design/**/*.md : Design specification pages in `docs/design/` must be consulted before implementing features (7 pages: index, agents, organization, communication, engine, memory, operations)

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Alert the user and explain why if implementation deviates from the design spec; do NOT silently diverge

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Documentation source in `docs/` (Markdown, built with Zensical). Design spec in `docs/design/` (7 pages: index, agents, organization, communication, engine, memory, operations). Architecture in `docs/architecture/` (overview, tech-stack, decision log). Roadmap in `docs/roadmap/`. Security in `docs/security.md`. Licensing in `docs/licensing.md`. Reference in `docs/reference/`. REST API reference in `docs/rest-api.md`. Library reference in `docs/api/` (auto-generated from docstrings). Custom templates in `docs/overrides/`. Config in `mkdocs.yml`.

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability must use structured logging with correlation tracking and log sinks

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Security: SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies: disabled/weighted/per-category/milestone), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume).

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-15T16:55:07.730Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T16:55:07.730Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, and key function entry/exit

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-31T20:07:03.035Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T20:07:03.035Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising; all state transitions must log at INFO; DEBUG for object creation, internal flow, entry/exit of key functions

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-17T06:43:14.114Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:43:14.114Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, entry/exit of key functions. Pure data models, enums, and re-exports do NOT need logging.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/budget/**/*.py : Budget tracking includes pre-flight/in-flight checks, auto-downgrade, billing periods, cost tiers, quota/subscription. CFO includes anomaly detection, efficiency analysis, downgrade recommendations.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Applies to src/synthorg/budget/**/*.py : Budget package (budget/): cost tracking, budget enforcement (pre-flight/in-flight checks, auto-downgrade), billing periods, cost tiers, quota/subscription tracking, CFO cost optimization (anomaly detection, efficiency analysis, downgrade recommendations, approval decisions), spending reports, budget errors (BudgetExhaustedError, DailyLimitExceededError, QuotaExhaustedError)

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : Handle errors explicitly, never silently swallow. Validate at system boundaries (user input, external APIs, config files).

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to **/*.py : Handle errors explicitly; never silently swallow exceptions

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to **/*.py : Handle errors explicitly—never silently swallow exceptions.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-14T15:43:05.601Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T15:43:05.601Z
Learning: Applies to **/*.py : Handle errors explicitly, never silently swallow exceptions

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-04-08T11:07:20.358Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-08T11:07:20.358Z
Learning: Applies to **/*.py : Handle all errors explicitly; never silently swallow exceptions

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-26T15:18:16.848Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-26T15:18:16.848Z
Learning: Applies to src/**/*.py : Handle errors explicitly, never silently swallow them. All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-04-02T21:38:30.127Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T21:38:30.127Z
Learning: Applies to {**/*.py,web/src/**/*.{ts,tsx}} : Handle errors explicitly; never silently swallow exceptions

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/security/**/*.py : Security package (security/): SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume)

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/security/**/*.py : Security module includes SecOps agent, rule engine (soft-allow/hard-deny), audit log, output scanner, risk classifier, autonomy levels (4 strategies), timeout policies.

Applied to files:

  • src/synthorg/observability/prometheus_collector.py
🔇 Additional comments (10)
docs/design/operations.md (3)

1295-1295: Accurate API surface update for /api/v1/metrics.

This line now documents the implemented endpoint behavior and metric exposure clearly.


1539-1539: OTLP sink documentation is consistent with current implementation direction.

Good alignment with the configured HTTP-only OTLP transport path.


1839-1839: G1 closure entry is clear and properly captures the approved deviation.

This is the right level of specificity for design/spec traceability.

src/synthorg/observability/prometheus_collector.py (4)

47-114: __init__ exceeds 50-line limit.

This issue was previously flagged. The constructor body spans ~67 lines. Consider a data-driven metric registration approach as suggested in the prior review.


195-225: Stale cost/budget values persist when sources become unavailable.

When total_cost is None or the cost tracker becomes unavailable between scrapes, _refresh_cost_gauge and _refresh_budget_metrics leave the previous scrape's values in place. After one successful scrape, a subsequent outage exports stale totals as if they were current.

Consider resetting to a sentinel (e.g., 0.0 or NaN) or clearing the gauge when the source is unavailable.


227-253: LGTM!

The agent metrics refresh correctly clears stale label series before repopulating, and the metric name synthorg_active_agents_total accurately reflects that only active agents are counted.


255-279: LGTM!

Task metrics refresh follows the same correct pattern: clearing stale series and properly handling the (status, agent) label combinations.

docs/architecture/acg-glossary.md (1)

1-99: LGTM!

The ACG glossary provides a comprehensive bidirectional mapping between ACG formalism concepts and SynthOrg architecture. The tables are well-organized with appropriate fidelity labels and notes explaining where SynthOrg diverges from or extends ACG. The "82+" specificity issue from the prior review was addressed by using "numerous event constant domains."

src/synthorg/api/controllers/metrics.py (1)

49-71: LGTM!

The error handling correctly re-raises system errors (MemoryError, RecursionError), logs failures with context, and returns appropriate HTTP status codes with scrape-safe Prometheus-format bodies.

tests/unit/api/controllers/test_metrics.py (1)

45-75: LGTM!

Good test coverage for the /metrics endpoint. Tests verify content-type, metric presence, async refresh invocation, and the 503 response when no collector is configured. The past review suggestions (using assert_awaited_once() and asserting the 503 body) have been addressed.

Comment on lines +41 to +47
if not app_state.has_prometheus_collector:
logger.warning(METRICS_SCRAPE_FAILED, reason="collector not configured")
return Response(
content=b"# No metrics collector configured\n",
media_type=_PROMETHEUS_CONTENT_TYPE,
status_code=503,
)

⚠️ Potential issue | 🔴 Critical

Method reference instead of method call — collector availability check is always truthy.

has_prometheus_collector is a method on AppState (see src/synthorg/api/state.py:229), but it's being evaluated as a bound method object rather than called. Bound methods are always truthy, so this guard will never trigger the 503 response.

🐛 Proposed fix
-        if not app_state.has_prometheus_collector:
+        if not app_state.has_prometheus_collector():
             logger.warning(METRICS_SCRAPE_FAILED, reason="collector not configured")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/api/controllers/metrics.py` around lines 41 - 47, The guard
mistakenly checks the bound method object instead of calling it: change the
condition to call app_state.has_prometheus_collector() so it evaluates the
boolean result; update the block around has_prometheus_collector (and the
early-return 503 Response/METRICS_SCRAPE_FAILED log) to trigger when the method
returns False rather than testing the method object itself.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (7)
docs/design/operations.md (1)

1846-1849: ⚠️ Potential issue | 🟡 Minor

Update the final G1 framing too.

Lines 1856-1857 below still say telemetry export is the primary gap, which now conflicts with this updated gap table and priority paragraph that close G1. Please make the closing section tell the same story as the rest of the page.

Based on learnings: "Update the relevant docs/design/ page when approved deviations occur to reflect the new reality."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/operations.md` around lines 1846 - 1849, Update the final G1
framing so it matches the updated gap table and priority paragraph: change the
text that still describes telemetry export as the primary open gap to state that
G1 is closed because PrometheusCollector is instantiated in on_startup(),
/metrics returns 200 with nine metric families, and the OTLP HTTP/JSON sink is
implemented; then explicitly call out that G3 (policy-as-code) and G4
(coordination metrics API) are now the highest remaining differentiation
opportunities and that G2 is the smallest implementation scope so the conclusion
aligns with the rest of the page.
tests/unit/observability/test_otlp_handler.py (1)

253-261: ⚠️ Potential issue | 🟠 Major

This test still depends on a live OTLP export path.

close() flushes remaining records, so starting the flusher and leaving urlopen unpatched here makes the test hit localhost:4318 and race a real background thread. Stub the exporter in this test too, and keep _start_flusher=False unless thread behavior is the thing under test.

As per coding guidelines: "Never skip, dismiss, or ignore flaky tests. Fix them fully and fundamentally."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/observability/test_otlp_handler.py` around lines 253 - 261, The
test test_close_always_drains_remaining_records currently starts the background
flusher and leaves the OTLP export path unpatched causing a real network call;
change the _make_handler call to use start_flusher=False and inject/stub the
OTLP exporter (or monkeypatch urllib.request.urlopen used by the exporter) so
emit() and close() operate against the stubbed exporter rather than hitting
localhost:4318; keep using _make_record and handler.emit/_close to exercise
flushing, but ensure the handler._exporter or underlying urlopen is replaced
with a dummy exporter that records calls to avoid spinning up real threads or
network I/O.
src/synthorg/observability/otlp_handler.py (2)

171-186: ⚠️ Potential issue | 🟠 Major

Do not materialize the whole backlog before batching.

_drain_and_flush() drains the entire queue into records and only then slices it. Under a logging spike—or during close()—that defeats _batch_size and can turn backlog flushes into large transient allocations or long stalls.

♻️ Suggested shape
     def _drain_and_flush(self) -> None:
-        records: list[logging.LogRecord] = []
-        while True:
-            try:
-                records.append(self._queue.get_nowait())
-            except queue.Empty:
-                break
-
-        with self._pending_lock:
-            self._pending_count = max(0, self._pending_count - len(records))
-
-        for start in range(0, len(records), self._batch_size):
-            batch = records[start : start + self._batch_size]
-            if batch:
-                self._export_batch(batch)
+        while True:
+            batch: list[logging.LogRecord] = []
+            while len(batch) < self._batch_size:
+                try:
+                    batch.append(self._queue.get_nowait())
+                except queue.Empty:
+                    break
+            if not batch:
+                return
+            with self._pending_lock:
+                self._pending_count = max(0, self._pending_count - len(batch))
+            self._export_batch(batch)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/otlp_handler.py` around lines 171 - 186, draining
the entire queue into a list in _drain_and_flush causes large transient
allocations and stalls; change it to consume and export in incremental batches
instead of materializing all records. Rework _drain_and_flush to loop:
repeatedly build a local batch by repeatedly calling self._queue.get_nowait up
to self._batch_size (stop when queue.Empty), call self._export_batch(batch) for
each non-empty batch, and update self._pending_count under self._pending_lock
per batch (using max(0, ...)). Ensure you still exit when the queue is empty and
that _export_batch, _queue, _batch_size, _pending_count, and _pending_lock are
used as the unique symbols to locate the code.

57-73: ⚠️ Potential issue | 🟠 Major

This exporter is still JSON-only, not the HTTP/protobuf transport promised in #1122.

_export_batch() hard-codes JSON serialization and Content-Type: application/json, and protocol never changes that behavior. If the accepted scope is still "HTTP/protobuf + reject gRPC", this needs either a protobuf implementation or an explicit spec/docs change before merge.

Based on learnings: "When implementation deviates from the design spec, alert the user and explain why. Do NOT silently diverge -- every deviation needs explicit user approval."

Also applies to: 203-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/otlp_handler.py` around lines 57 - 73, The
implementation currently hard-codes JSON serialization and Content-Type in
_export_batch while __init__ accepts a protocol parameter, so either add real
HTTP/protobuf support or explicitly reject that protocol; update behavior by
either (A) implementing protobuf serialization and request headers in
_export_batch (use OtlpProtocol.HTTP_PROTOBUF, set Content-Type:
application/x-protobuf, and serialize using the OTLP protobuf message types) or
(B) change __init__ to raise NotImplementedError when protocol ==
OtlpProtocol.HTTP_PROTOBUF (similar to the gRPC branch) so the API does not
silently diverge from the spec.
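Option (B) could be sketched as a fail-fast constructor check (enum values and class shape are assumed from the PR description, not taken from the actual source):

```python
import enum

class OtlpProtocol(enum.Enum):
    HTTP_JSON = "http/json"
    HTTP_PROTOBUF = "http/protobuf"
    GRPC = "grpc"

class OtlpHandler:
    def __init__(self, protocol: OtlpProtocol) -> None:
        # Reject transports the exporter does not actually implement at
        # construction time, rather than silently exporting JSON for them.
        if protocol is not OtlpProtocol.HTTP_JSON:
            raise NotImplementedError(
                f"OTLP transport {protocol.value!r} is not implemented; use http/json"
            )
        self._protocol = protocol

handler = OtlpHandler(OtlpProtocol.HTTP_JSON)
try:
    OtlpHandler(OtlpProtocol.HTTP_PROTOBUF)
    rejected = False
except NotImplementedError:
    rejected = True
```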
src/synthorg/observability/prometheus_collector.py (2)

47-114: 🛠️ Refactor suggestion | 🟠 Major

__init__ still exceeds the 50-line limit.

The inline metric registrations make the constructor longer than the repository rule permits. A small metric-spec table plus a loop would keep the setup readable without more one-off helpers.

As per coding guidelines: "Keep functions under 50 lines and files under 800 lines."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/prometheus_collector.py` around lines 47 - 114,
The __init__ method in PrometheusCollector is over 50 lines because each metric
is registered inline; refactor by defining a small metric specification list
(e.g., entries with name_suffix, help, type, label_names) and iterate to create
and assign metrics instead of the repeated one-off Gauge/Info/PromCounter calls.
Locate the __init__ function and replace the repeated registrations for symbols
like _info, _agents_total, _tasks_total, _cost_total, _budget_used_percent,
_budget_monthly_usd, _coordination_efficiency, _coordination_overhead_percent,
and _security_evaluations with a loop that constructs the appropriate metric
class (Info/Gauge/PromCounter) based on the spec, sets any initial info (for
__version__), and registers them with self.registry; keep the external behavior
and metric names identical. Ensure logger.debug(METRICS_COLLECTOR_INITIALIZED,
prefix=prefix) remains at the end of __init__.

196-227: ⚠️ Potential issue | 🟠 Major

Clear/reset metric families on every no-data or error path.

After one successful scrape, later "service missing" or exception paths leave the previous samples in place because these helpers only mutate metrics on success. /metrics then keeps exporting stale cost, budget, agent, and task data as current.

Also applies to: 228-280

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/prometheus_collector.py` around lines 196 - 227,
The helpers leave stale metric values when they early-return or on exceptions;
update _refresh_cost_gauge and _refresh_budget_metrics (and the similar helpers
around lines 228-280) to explicitly reset/clear the Prometheus gauges on all
no-data or error paths—e.g., when total_cost is None, when
app_state.has_cost_tracker is False, when tracker.budget_config is None, and
inside the except Exception: block set self._cost_total.set(0.0),
self._budget_monthly_usd.set(0.0) and self._budget_used_percent.set(0.0) (or
appropriate zero/default values) so metrics do not export stale samples after
failures. Ensure the same clear-on-no-data-and-exception pattern is applied to
the other metric-refresh functions referenced.
src/synthorg/observability/config.py (1)

122-127: ⚠️ Potential issue | 🟠 Major

Do not fail open when OTLP DNS resolution fails.

Returning on socket.gaierror lets an unresolved hostname bypass the private/loopback SSRF guard entirely. Reject it here, or repeat the resolution check immediately before export, but don't silently accept an unverified target.

As per coding guidelines: "Validate at system boundaries: user input, external APIs, config files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/config.py` around lines 122 - 127, The current
try/except around socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
swallows socket.gaierror and returns, which lets an unresolved hostname bypass
the private/loopback SSRF guard; instead, reject unresolved OTLP targets here
(or ensure a mandatory re-check immediately before export). Replace the early
return in the socket.gaierror handler with a raised configuration/validation
error (e.g., raise ValueError or a ConfigError) so callers of the config loader
know the hostname failed resolution, and keep the existing logic that inspects
addrs for private/loopback addresses to enforce the SSRF guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/operations.md`:
- Line 1295: The docs currently list synthorg_active_agents_total and omit the
agent label on synthorg_tasks_total, which conflicts with `#1122` that defines
synthorg_agents_total{status,trust_level} and
synthorg_tasks_total{status,agent}; update the design doc to explicitly match
the implemented metric names and labels or, if the implementation changed
intentionally, document the divergence and rationale: replace
synthorg_active_agents_total with synthorg_agents_total (include labels status
and trust_level) and add the agent label to synthorg_tasks_total, or add a short
“Deviation from `#1122`” note describing the intentional change and approval
decision referencing the implementation that produced
synthorg_active_agents_total and the removed/added labels.

In `@pyproject.toml`:
- Around line 25-26: Remove the redundant explicit dependency
"prometheus-client>=0.21.0" from pyproject.toml because litestar[prometheus]
already brings prometheus-client transitively; keep the existing
"litestar[standard,structlog,pydantic,brotli,prometheus]==2.21.1" entry and
delete the separate prometheus-client line so the dependency list is not
duplicated.

In `@src/synthorg/observability/events/metrics.py`:
- Around line 16-20: The OTLP-related constants (METRICS_OTLP_EXPORT_COMPLETED,
METRICS_OTLP_EXPORT_FAILED, METRICS_OTLP_FLUSHER_STARTED,
METRICS_OTLP_FLUSHER_STOPPED) belong in a dedicated OTLP events module instead
of metrics.py and should use the dot-taxonomy consistent with the repo (e.g.,
events.otlp.export.completed, events.otlp.export.failed,
events.otlp.flusher.started, events.otlp.flusher.stopped); move these constants
into a new observability/events/otlp.py (or similar) module, update any
imports/usages to reference the new symbols, and remove them from metrics.py so
metric domain only contains scrape/collector metric events.

In `@src/synthorg/observability/prometheus_collector.py`:
- Around line 60-65: The metric name and label set in the Gauge created as
self._agents_total diverge from the agreed contract; change the metric name from
f"{prefix}_active_agents_total" to "synthorg_agents_total" (or derive exactly
from the agreed prefix if variable) and replace the labels ["status",
"trust_level"] plus remove any "output_scan" label usage, instead adding a
"verdict" label constrained to the values "allow","deny","escalate"; also update
the corresponding metrics created around the other occurrences (the similar
Gauge/Counter initializations referenced at lines ~116-118) so their names and
label sets match the spec. Ensure any code that writes/updates these metrics
(references to self._agents_total.inc/set or label values) is updated to provide
the "verdict" value and drop "output_scan" so the runtime exports conform to
synthorg_agents_total{status,trust_level,verdict}.

In `@tests/unit/observability/test_config.py`:
- Around line 720-749: Update the handler docstring in otlp_handler.py to
accurately state that the implementation supports HTTP/JSON only (not
HTTP/protobuf or gRPC), explicitly note this is an approved deviation from spec
`#1122` (or link to the approval), and remove or correct any claims like
"HTTP/protobuf or gRPC"; reference the actual implementation details that
enforce JSON (uses json.dumps() and sends content-type "application/json") so
readers and reviewers (and tests such as
tests/unit/observability/test_config.py) are not misled.

---

Duplicate comments:
In `@docs/design/operations.md`:
- Around line 1846-1849: Update the final G1 framing so it matches the updated
gap table and priority paragraph: change the text that still describes telemetry
export as the primary open gap to state that G1 is closed because
PrometheusCollector is instantiated in on_startup(), /metrics returns 200 with
nine metric families, and the OTLP HTTP/JSON sink is implemented; then
explicitly call out that G3 (policy-as-code) and G4 (coordination metrics API)
are now the highest remaining differentiation opportunities and that G2 is the
smallest implementation scope so the conclusion aligns with the rest of the
page.

In `@src/synthorg/observability/config.py`:
- Around line 122-127: The current try/except around
socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP) swallows
socket.gaierror and returns, which lets an unresolved hostname bypass the
private/loopback SSRF guard; instead, reject unresolved OTLP targets here (or
ensure a mandatory re-check immediately before export). Replace the early return
in the socket.gaierror handler with a raised configuration/validation error
(e.g., raise ValueError or a ConfigError) so callers of the config loader know
the hostname failed resolution, and keep the existing logic that inspects addrs
for private/loopback addresses to enforce the SSRF guard.

In `@src/synthorg/observability/otlp_handler.py`:
- Around line 171-186: Draining the entire queue into a list in
_drain_and_flush causes large transient allocations and stalls; change it to
consume and export in incremental batches instead of materializing all records.
Rework _drain_and_flush to loop: repeatedly build a local batch by repeatedly
calling self._queue.get_nowait up to self._batch_size (stop when queue.Empty),
call self._export_batch(batch) for each non-empty batch, and update
self._pending_count under self._pending_lock per batch (using max(0, ...)).
Ensure you still exit when the queue is empty and that _export_batch, _queue,
_batch_size, _pending_count, and _pending_lock are used as the unique symbols to
locate the code.
- Around line 57-73: The implementation currently hard-codes JSON serialization
and Content-Type in _export_batch while __init__ accepts a protocol parameter,
so either add real HTTP/protobuf support or explicitly reject that protocol;
update behavior by either (A) implementing protobuf serialization and request
headers in _export_batch (use OtlpProtocol.HTTP_PROTOBUF, set Content-Type:
application/x-protobuf, and serialize using the OTLP protobuf message types) or
(B) change __init__ to raise NotImplementedError when protocol ==
OtlpProtocol.HTTP_PROTOBUF (similar to the gRPC branch) so the API does not
silently diverge from the spec.

In `@src/synthorg/observability/prometheus_collector.py`:
- Around line 47-114: The __init__ method in PrometheusCollector is over 50
lines because each metric is registered inline; refactor by defining a small
metric specification list (e.g., entries with name_suffix, help, type,
label_names) and iterate to create and assign metrics instead of the repeated
one-off Gauge/Info/PromCounter calls. Locate the __init__ function and replace
the repeated registrations for symbols like _info, _agents_total, _tasks_total,
_cost_total, _budget_used_percent, _budget_monthly_usd,
_coordination_efficiency, _coordination_overhead_percent, and
_security_evaluations with a loop that constructs the appropriate metric class
(Info/Gauge/PromCounter) based on the spec, sets any initial info (for
__version__), and registers them with self.registry; keep the external behavior
and metric names identical. Ensure logger.debug(METRICS_COLLECTOR_INITIALIZED,
prefix=prefix) remains at the end of __init__.
- Around line 196-227: The helpers leave stale metric values when they
early-return or on exceptions; update _refresh_cost_gauge and
_refresh_budget_metrics (and the similar helpers around lines 228-280) to
explicitly reset/clear the Prometheus gauges on all no-data or error paths—e.g.,
when total_cost is None, when app_state.has_cost_tracker is False, when
tracker.budget_config is None, and inside the except Exception: block set
self._cost_total.set(0.0), self._budget_monthly_usd.set(0.0) and
self._budget_used_percent.set(0.0) (or appropriate zero/default values) so
metrics do not export stale samples after failures. Ensure the same
clear-on-no-data-and-exception pattern is applied to the other metric-refresh
functions referenced.

In `@tests/unit/observability/test_otlp_handler.py`:
- Around line 253-261: The test test_close_always_drains_remaining_records
currently starts the background flusher and leaves the OTLP export path
unpatched causing a real network call; change the _make_handler call to use
start_flusher=False and inject/stub the OTLP exporter (or monkeypatch
urllib.request.urlopen used by the exporter) so emit() and close() operate
against the stubbed exporter rather than hitting localhost:4318; keep using
_make_record and handler.emit/_close to exercise flushing, but ensure the
handler._exporter or underlying urlopen is replaced with a dummy exporter that
records calls to avoid spinning up real threads or network I/O.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7b34f708-f99d-48d6-8940-a46bf1246e30

📥 Commits

Reviewing files that changed from the base of the PR and between 3971791 and 3f738aa.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
  • CLAUDE.md
  • docs/DESIGN_SPEC.md
  • docs/architecture/acg-glossary.md
  • docs/design/operations.md
  • pyproject.toml
  • src/synthorg/api/app.py
  • src/synthorg/api/controllers/__init__.py
  • src/synthorg/api/controllers/metrics.py
  • src/synthorg/api/state.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/enums.py
  • src/synthorg/observability/events/metrics.py
  • src/synthorg/observability/otlp_handler.py
  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/observability/sinks.py
  • tests/unit/api/controllers/test_metrics.py
  • tests/unit/observability/conftest.py
  • tests/unit/observability/test_config.py
  • tests/unit/observability/test_enums.py
  • tests/unit/observability/test_events.py
  • tests/unit/observability/test_otlp_handler.py
  • tests/unit/observability/test_prometheus_collector.py
  • tests/unit/observability/test_sinks.py

Comment on lines +16 to +20
# OTLP export events
METRICS_OTLP_EXPORT_COMPLETED: Final[str] = "metrics.otlp.export_completed"
METRICS_OTLP_EXPORT_FAILED: Final[str] = "metrics.otlp.export_failed"
METRICS_OTLP_FLUSHER_STARTED: Final[str] = "metrics.otlp.flusher_started"
METRICS_OTLP_FLUSHER_STOPPED: Final[str] = "metrics.otlp.flusher_stopped"

🧹 Nitpick | 🔵 Trivial

Move OTLP exporter events to their own domain/module.

These constants describe log-shipping lifecycle, not scrape/collector metrics, and values like metrics.otlp.export_completed / metrics.otlp.flusher_started also drift from the dot-delimited taxonomy used elsewhere in this module. A dedicated OTLP event module with ...export.completed / ...flusher.started naming would keep routing and event queries cleaner.

Based on learnings: "Observability package (observability/): structured logging, correlation tracking, log sinks; event constants organized by domain under observability/events/ (e.g., events.api, events.tool, events.git, events.context_budget, events.backup)"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/events/metrics.py` around lines 16 - 20, The
OTLP-related constants (METRICS_OTLP_EXPORT_COMPLETED,
METRICS_OTLP_EXPORT_FAILED, METRICS_OTLP_FLUSHER_STARTED,
METRICS_OTLP_FLUSHER_STOPPED) belong in a dedicated OTLP events module instead
of metrics.py and should use the dot-taxonomy consistent with the repo (e.g.,
events.otlp.export.completed, events.otlp.export.failed,
events.otlp.flusher.started, events.otlp.flusher.stopped); move these constants
into a new observability/events/otlp.py (or similar) module, update any
imports/usages to reference the new symbols, and remove them from metrics.py so
metric domain only contains scrape/collector metric events.
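A sketch of the proposed observability/events/otlp.py module. The constant names and dot-taxonomy values follow the review's suggestion, not the merged code:

```python
# Sketch of a dedicated observability/events/otlp.py; names and values
# are the review's proposal, not necessarily what landed in the repo.
from typing import Final

OTLP_EXPORT_COMPLETED: Final[str] = "otlp.export.completed"
OTLP_EXPORT_FAILED: Final[str] = "otlp.export.failed"
OTLP_FLUSHER_STARTED: Final[str] = "otlp.flusher.started"
OTLP_FLUSHER_STOPPED: Final[str] = "otlp.flusher.stopped"
```

Keeping these in their own module leaves events/metrics.py scoped to scrape/collector events and makes event-name queries like `otlp.*` match only log-shipping lifecycle events.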

Comment on lines +60 to +65
self._agents_total = Gauge(
f"{prefix}_active_agents_total",
"Number of active agents",
["status", "trust_level"],
registry=self.registry,
)

⚠️ Potential issue | 🟠 Major

These exported metric contracts now diverge from #1122.

The linked objective promises synthorg_agents_total{status,trust_level} and a verdict label bounded to allow/deny/escalate, but this file now emits synthorg_active_agents_total and admits an out-of-contract output_scan verdict value. Those are user-visible scrape changes, so either restore the agreed contract or update the design/docs/objective before merge.

Based on learnings: "When implementation deviates from the design spec, alert the user and explain why. Do NOT silently diverge -- every deviation needs explicit user approval."

Also applies to: 116-118

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/prometheus_collector.py` around lines 60 - 65, The
metric name and label set in the Gauge created as self._agents_total diverge
from the agreed contract; change the metric name from
f"{prefix}_active_agents_total" to "synthorg_agents_total" (or derive exactly
from the agreed prefix if variable) and replace the labels ["status",
"trust_level"] plus remove any "output_scan" label usage, instead adding a
"verdict" label constrained to the values "allow","deny","escalate"; also update
the corresponding metrics created around the other occurrences (the similar
Gauge/Counter initializations referenced at lines ~116-118) so their names and
label sets match the spec. Ensure any code that writes/updates these metrics
(references to self._agents_total.inc/set or label values) is updated to provide
the "verdict" value and drop "output_scan" so the runtime exports conform to
synthorg_agents_total{status,trust_level,verdict}.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/design/operations.md (1)

1532-1540: ⚠️ Potential issue | 🟡 Minor

Finish documenting OTLP as a first-class sink.

The table now lists three shipping sink types, but the lead-in sentence still says "two", and the custom_sinks section later on this page still enumerates only file, syslog, and http. Right now the page advertises OTLP without documenting how to configure it end-to-end.

Based on learnings: "Update the relevant docs/design/ page when approved deviations occur to reflect the new reality."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/operations.md` around lines 1532 - 1540, The docs claim "two"
shipping sink types but the table includes a third (OTLP), and the custom_sinks
section still lists only file, syslog, and http; update the lead-in sentence to
reflect three sink types, add OTLP to the `custom_sinks` enumeration and provide
end-to-end configuration guidance for the OTLP sink (how to set the
transport/endpoint, format OTLP JSON mapping, and correlation ID->trace context
behavior) so the page documents OTLP as a first-class sink alongside `syslog`
and `http`.
♻️ Duplicate comments (3)
docs/design/operations.md (2)

1295-1295: ⚠️ Potential issue | 🟠 Major

Keep the metric name aligned with the accepted /metrics contract.

This row documents synthorg_active_agents_total, but the accepted contract for #1122 is synthorg_agents_total{status,trust_level}. If the rename was intentional, add an explicit approved-deviation note here; otherwise the design page should keep the original series name so dashboards and implementations target the same metric.

Based on learnings: "When implementation deviates from the design spec, alert the user and explain why -- user decides whether to proceed or update the spec."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/operations.md` at line 1295, The documented metric name
`synthorg_active_agents_total` conflicts with the accepted `/metrics` contract
`synthorg_agents_total{status,trust_level}` (per `#1122`); update the table row to
use `synthorg_agents_total` with the `{status,trust_level}` label signature and
adjust the description accordingly, or if the rename was intentional add an
explicit "approved deviation" note explaining the reason and linking to the
approving issue/PR (`#1122`) so dashboards/implementations remain consistent.

1839-1849: ⚠️ Potential issue | 🟡 Minor

Remove the remaining "G1 is the primary gap" wording.

This section now marks G1 as closed, but the "Recommended Framing" paragraph below still says telemetry export is the primary gap. Keeping both statements in the same design page makes the control-plane assessment internally inconsistent.

Based on learnings: "Update the relevant docs/design/ page when approved deviations occur to reflect the new reality."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/operations.md` around lines 1839 - 1849, Update the "Recommended
Framing" paragraph to remove the statement that "G1 is the primary gap" and
instead reflect that G1 is closed; explicitly state current open gaps (G2, G3,
G4, G5, G6) or rephrase to focus on remaining priorities (e.g., G3 and G4
highest, G2 smallest), referencing the existing G1–G6 labels and the sentence
that currently claims G1 as primary so the page is internally consistent with
the status table and closure note for G1.
src/synthorg/observability/otlp_handler.py (1)

177-193: ⚠️ Potential issue | 🟠 Major

Drain and export batch-by-batch instead of copying the whole backlog first.

_drain_and_flush() materializes every pending record into records and only then slices it. Under backlog, that doubles peak memory and delays the first export until the entire queue has been copied.

♻️ Proposed fix
     def _drain_and_flush(self) -> None:
         """Drain all queued records and export as OTLP batches."""
-        records: list[logging.LogRecord] = []
         while True:
-            try:
-                records.append(self._queue.get_nowait())
-            except queue.Empty:
-                break
-
-        with self._pending_lock:
-            self._pending_count = max(0, self._pending_count - len(records))
-
-        for start in range(0, len(records), self._batch_size):
-            batch = records[start : start + self._batch_size]
-            if batch:
-                self._export_batch(batch)
+            batch: list[logging.LogRecord] = []
+            while len(batch) < self._batch_size:
+                try:
+                    batch.append(self._queue.get_nowait())
+                except queue.Empty:
+                    break
+
+            if not batch:
+                return
+
+            with self._pending_lock:
+                self._pending_count = max(0, self._pending_count - len(batch))
+
+            self._export_batch(batch)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/otlp_handler.py` around lines 177 - 193, The
current _drain_and_flush() copies the entire queue into records before slicing,
which increases peak memory and delays exports; change it to drain the queue
into batches incrementally: loop calling self._queue.get_nowait(), append to a
local batch list until it reaches self._batch_size or queue.Empty occurs, then
call self._export_batch(batch) immediately, clear the batch and continue until
the queue is empty; keep a drained_count counter and after draining update
self._pending_count inside the self._pending_lock (self._pending_count = max(0,
self._pending_count - drained_count)); reference methods/fields:
_drain_and_flush, self._queue, self._batch_size, self._export_batch,
self._pending_lock, and self._pending_count.
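The batch-wise drain can be shown as a runnable sketch. Here "export" just collects batches so the shape is easy to verify; the real method would call self._export_batch and adjust _pending_count under _pending_lock per batch:

```python
import queue


def drain_in_batches(q: queue.SimpleQueue, batch_size: int) -> list[list[int]]:
    """Drain q batch-by-batch, exporting each batch before pulling the next.

    Runnable sketch of the proposed _drain_and_flush rework; collecting
    batches stands in for the real self._export_batch(batch) call.
    """
    exported: list[list[int]] = []
    while True:
        batch: list[int] = []
        while len(batch) < batch_size:
            try:
                batch.append(q.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return exported
        exported.append(batch)  # real code: self._export_batch(batch)


q: queue.SimpleQueue = queue.SimpleQueue()
for i in range(5):
    q.put(i)
print(drain_in_batches(q, batch_size=2))  # [[0, 1], [2, 3], [4]]
```

Peak memory is now one batch rather than the whole backlog, and the first export happens as soon as the first batch fills instead of after the entire queue is copied.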
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/synthorg/observability/otlp_handler.py`:
- Around line 84-89: The pending buffer is unbounded (self._queue:
queue.SimpleQueue) so emit() can enqueue forever while _export_batch() is
blocked; change the queue to a bounded queue (e.g.,
queue.Queue(maxsize=MAX_PENDING)) and update emit() to attempt non-blocking put
or check full() under self._pending_lock, incrementing self._dropped_count on
overflow and returning early; remove/avoid relying on a separate _pending_count
by using queue.qsize() (or keep it but update/guard it consistently under
_pending_lock) and ensure _export_batch() reads from the bounded queue normally,
preserving existing shutdown/_batch_ready logic and using the same drop
accounting for the other enqueue sites referenced around _pending_count lines.
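The bounded-buffer emit() can be sketched with a stdlib queue.Queue. MAX_PENDING and the drop-newest policy are illustrative choices, not the repo's:

```python
import logging
import queue

MAX_PENDING = 10_000  # illustrative cap; the real limit would be configurable


class BoundedEmitSketch:
    """Sketch of the bounded-buffer emit() the review proposes."""

    def __init__(self, max_pending: int = MAX_PENDING) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=max_pending)
        self._dropped_count = 0

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self._queue.put_nowait(record)  # never blocks the logging caller
        except queue.Full:
            # Drop-oldest would also work; here we drop-newest and count it
            # so the drop total can be surfaced in diagnostics.
            self._dropped_count += 1
```

Because queue.Queue enforces maxsize internally, emit() can never grow the pending buffer without bound while _export_batch() is blocked, and qsize() replaces the hand-maintained _pending_count.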

---

Outside diff comments:
In `@docs/design/operations.md`:
- Around line 1532-1540: The docs claim "two" shipping sink types but the table
includes a third (OTLP), and the custom_sinks section still lists only file,
syslog, and http; update the lead-in sentence to reflect three sink types, add
OTLP to the `custom_sinks` enumeration and provide end-to-end configuration
guidance for the OTLP sink (how to set the transport/endpoint, format OTLP JSON
mapping, and correlation ID->trace context behavior) so the page documents OTLP
as a first-class sink alongside `syslog` and `http`.

---

Duplicate comments:
In `@docs/design/operations.md`:
- Line 1295: The documented metric name `synthorg_active_agents_total` conflicts
with the accepted `/metrics` contract
`synthorg_agents_total{status,trust_level}` (per `#1122`); update the table row to
use `synthorg_agents_total` with the `{status,trust_level}` label signature and
adjust the description accordingly, or if the rename was intentional add an
explicit "approved deviation" note explaining the reason and linking to the
approving issue/PR (`#1122`) so dashboards/implementations remain consistent.
- Around line 1839-1849: Update the "Recommended Framing" paragraph to remove
the statement that "G1 is the primary gap" and instead reflect that G1 is
closed; explicitly state current open gaps (G2, G3, G4, G5, G6) or rephrase to
focus on remaining priorities (e.g., G3 and G4 highest, G2 smallest),
referencing the existing G1–G6 labels and the sentence that currently claims G1
as primary so the page is internally consistent with the status table and
closure note for G1.

In `@src/synthorg/observability/otlp_handler.py`:
- Around line 177-193: The current _drain_and_flush() copies the entire queue
into records before slicing, which increases peak memory and delays exports;
change it to drain the queue into batches incrementally: loop calling
self._queue.get_nowait(), append to a local batch list until it reaches
self._batch_size or queue.Empty occurs, then call self._export_batch(batch)
immediately, clear the batch and continue until the queue is empty; keep a
drained_count counter and after draining update self._pending_count inside the
self._pending_lock (self._pending_count = max(0, self._pending_count -
drained_count)); reference methods/fields: _drain_and_flush, self._queue,
self._batch_size, self._export_batch, self._pending_lock, and
self._pending_count.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 566e98b4-2f7e-44dd-86be-6a43f22e0c09

📥 Commits

Reviewing files that changed from the base of the PR and between 3f738aa and 1cfe8e1.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • docs/design/operations.md
  • src/synthorg/observability/otlp_handler.py
  • tests/unit/observability/test_otlp_handler.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Build Web
  • GitHub Check: Build Backend
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Do not use from __future__ import annotations in Python files -- Python 3.14 has PEP 649 native lazy annotations.

Use PEP 758 except syntax: except A, B: (no parentheses) -- ruff enforces this on Python 3.14.

Include type hints on all public functions, enforced with mypy strict mode.

Use Google style docstrings, required on public classes and functions (enforced by ruff D rules).

Use frozen Pydantic models for config/identity; separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.

Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict). Use allow_inf_nan=False in all ConfigDict declarations to reject NaN/Inf in numeric fields at validation time.

Use @computed_field for derived values instead of storing + validating redundant fields (e.g. TokenUsage.total_tokens).

Use NotBlankStr (from core.types) for all identifier/name fields -- including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants -- instead of manual whitespace validators.

Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (e.g. multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare create_task.

Maintain line length of 88 characters (enforced by ruff).

Handle errors explicitly, never silently swallow exceptions.

Validate at system boundaries (user input, external APIs, config files).

Use logger as the variable name (not _logger, not log).

Pure data models, enums, and re-exports do NOT need logging.

Files:

  • tests/unit/observability/test_otlp_handler.py
  • src/synthorg/observability/otlp_handler.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Mark tests with @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow.

Use asyncio_mode = "auto" for async tests -- no manual @pytest.mark.asyncio needed.

Prefer @pytest.mark.parametrize for testing similar cases in Python tests.

NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in test files. Tests must use test-provider, test-small-001, etc.

Use Hypothesis for property-based testing with @given + @settings in Python tests.

NEVER skip, dismiss, or ignore flaky tests -- always fix them fully and fundamentally. For timing-sensitive tests, mock time.monotonic() and asyncio.sleep() to make them deterministic. For tasks that must block indefinitely, use asyncio.Event().wait() instead of asyncio.sleep(large_number).

Files:

  • tests/unit/observability/test_otlp_handler.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

  • tests/unit/observability/test_otlp_handler.py
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Create new objects and never mutate existing ones. For non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement.

Keep functions under 50 lines and files under 800 lines.

Every module with business logic MUST have: from synthorg.observability import get_logger then logger = get_logger(__name__).

Never use import logging / logging.getLogger() / print() in application code. Exception: observability/setup.py, observability/sinks.py, observability/syslog_handler.py, observability/http_handler.py, and observability/otlp_handler.py may use stdlib logging and print(..., file=sys.stderr).

Always use constants from domain-specific modules under synthorg.observability.events for event names (e.g., API_REQUEST_STARTED from events.api, TOOL_INVOKE_START from events.tool).

Use structured kwargs in logging: always logger.info(EVENT, key=value) -- never logger.info("msg %s", val).
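To illustrate the call shape with a minimal stand-in (a stub, not the project's structlog-backed logger):

```python
# Minimal stand-in for a structlog-style logger, to show the call shape only.
class StubLogger:
    def __init__(self) -> None:
        self.events: list[tuple[str, dict]] = []

    def info(self, event: str, **kwargs) -> None:
        self.events.append((event, kwargs))


logger = StubLogger()

# Structured: event-name constant plus key=value context.
TASK_ASSIGNED = "task.assigned"  # in project code this comes from events.*
logger.info(TASK_ASSIGNED, task_id="t-1", agent_id="a-9")

# Not this: logger.info("assigned task %s to %s", task_id, agent_id)

assert logger.events == [("task.assigned", {"task_id": "t-1", "agent_id": "a-9"})]
```

The kwargs survive as machine-queryable fields in every sink, whereas %-interpolation flattens the context into an opaque string.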

All error paths must log at WARNING or ERROR with context before raising.

All state transitions must log at INFO.

Use DEBUG for object creation, internal flow, entry/exit of key functions.

RetryExhaustedError signals that all retries failed -- the engine layer catches this to trigger fallback chains.

Maintain 80% minimum test coverage (enforced in CI).

NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases. Vendor names only in: (1) Operations design page provider list, (2) .claude/ files, (3) third-party import paths, (4) provider presets in src/synthorg/providers/presets.py. Tests must use test-provider, test-small-001, etc.

Files:

  • src/synthorg/observability/otlp_handler.py
src/**/*.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

  • src/synthorg/observability/otlp_handler.py
🧠 Learnings (33)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability includes structured logging via `get_logger(__name__)`, correlation tracking, and log sinks.
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability must use structured logging with correlation tracking and log sinks
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability package (observability/): structured logging, correlation tracking, log sinks; event constants organized by domain under observability/events/ (e.g., events.api, events.tool, events.git, events.context_budget, events.backup)
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue. The design spec is the starting point for architecture, data models, and behavior.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Alert the user and explain why when implementation deviates from the design spec. Do NOT silently diverge -- every deviation needs explicit user approval.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Update relevant `docs/design/` pages when approved deviations occur to reflect the new reality.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Every implementation plan must be presented to the user for accept/deny before coding starts.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: At every phase of planning and implementation, be critical -- actively look for ways to improve the design. Surface improvements as suggestions, not silent changes -- user decides.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Prioritize issues by dependency order, not priority labels -- unblocked dependencies come first.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Enforce 30 seconds per test timeout globally in `pyproject.toml` -- do not add per-file `pytest.mark.timeout(30)` markers; non-default overrides like `timeout(60)` are allowed.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Use parallelism with `pytest-xdist` via `-n 8` -- ALWAYS include `-n 8` when running pytest locally, never run tests sequentially.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Control Hypothesis profile via `HYPOTHESIS_PROFILE` env var: `ci` (deterministic, `max_examples=10` + `derandomize=True`), `dev` (1000 examples), `fuzz` (10,000 examples, no deadline), `extreme` (500,000 examples, no deadline).
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Use commit format `<type>: <description>` with types: feat, fix, refactor, docs, test, chore, perf, ci. Enforced by commitizen.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Signed commits are required on `main` via branch protection -- all commits must be GPG/SSH signed.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Use branch naming `<type>/<slug>` from main.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Pre-commit hooks include: trailing-whitespace, end-of-file-fixer, check-yaml, check-toml, check-json, check-merge-conflict, check-added-large-files, no-commit-to-branch (main), ruff check+format, gitleaks, hadolint (Dockerfile linting), golangci-lint + go vet (CLI), no-em-dashes, no-redundant-timeout, eslint-web (web dashboard).
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Enforce hookify rules: `block-pr-create` blocks direct `gh pr create`, `enforce-parallel-tests` enforces `-n 8` with pytest, `no-cd-prefix` blocks `cd` prefix in Bash commands, `no-local-coverage` blocks `--cov` flags locally.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Pre-push hooks: mypy type-check (affected modules only) + pytest unit tests (affected modules only) + golangci-lint + go vet + go test (CLI) + eslint-web (web dashboard). Foundational module changes (core, config, observability) or conftest changes trigger full runs.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Pre-commit.ci: autoupdate disabled -- Dependabot owns hook version bumps via `pre-commit` ecosystem.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Use `gh issue list` via Bash for GitHub issue queries (not MCP tools) -- MCP `list_issues` has unreliable field data.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Merge strategy: squash merge -- PR body becomes the squash commit message on main. Trailers (e.g. `Release-As`, `Closes #N`) must be in the PR body to land in the final commit.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Preserve existing `Closes #NNN` issue references in PR bodies -- never remove unless explicitly asked.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: After finishing an issue implementation: always create a feature branch (`<type>/<slug>`), commit, and push -- do NOT create a PR automatically. Do NOT leave work uncommitted on main.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: NEVER create a PR directly -- `gh pr create` is blocked by hookify. ALWAYS use `/pre-pr-review` to create PRs -- it runs automated checks + review agents + fixes before creating the PR. For trivial/docs-only changes: `/pre-pr-review quick` skips agents but still runs automated checks.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: After the PR exists, use `/aurelio-review-pr` to handle external reviewer feedback.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: The `/commit-push-pr` command is effectively blocked (it calls `gh pr create` internally).
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: Fix everything valid when review agents find issues -- never skip. No deferring, no "out of scope" skipping.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: NEVER use `cd` in Bash commands -- the working directory is already set to the project root. Use absolute paths or run commands directly. Exception: `bash -c "cd <dir> && <cmd>"` is safe (runs in a child process).
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: NEVER use Bash to write or modify files -- use the Write or Edit tools. Do not use `cat >`, `cat << EOF`, `echo >`, `echo >>`, `sed -i`, `python -c "open(...).write(...)"`, or `tee` to create or modify files.
Learnt from: CR
File: CLAUDE.md
Timestamp: 2026-04-08T14:12:23.521Z
Learning: A PostToolUse hook (`scripts/check_web_design_system.py`) enforces web dashboard design system rules on every Edit/Write to `web/src/`.
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to tests/**/*.py : NEVER skip, dismiss, or ignore flaky tests — always fix them fully and fundamentally. For timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-03-31T21:07:37.470Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T21:07:37.470Z
Learning: Applies to tests/**/*.py : NEVER skip, dismiss, or ignore flaky tests — always fix them fully and fundamentally; mock `time.monotonic()` and `asyncio.sleep()` for determinism; use `asyncio.Event().wait()` for indefinite blocking instead of `asyncio.sleep(large_number)`

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-03-20T21:44:04.528Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-20T21:44:04.528Z
Learning: Applies to tests/**/*.py : Fix flaky tests completely and fundamentally; for timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-04-01T15:36:39.993Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T15:36:39.993Z
Learning: Applies to tests/**/*.py : Flaky tests: NEVER skip, dismiss, or ignore flaky tests -- always fix them fully and fundamentally. For timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic instead of widening timing margins

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-04-08T12:31:46.839Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-08T12:31:46.839Z
Learning: Applies to tests/**/*.py : Never skip, dismiss, or ignore flaky tests -- always fix them fully and fundamentally. For timing-sensitive tests, mock `time.monotonic()` and `asyncio.sleep()` to make them deterministic. For tasks that must block indefinitely, use `asyncio.Event().wait()` instead of `asyncio.sleep(large_number)`.

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-04-02T07:18:02.381Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T07:18:02.381Z
Learning: Applies to tests/**/*.py : Never skip, dismiss, or ignore flaky tests; fix them fully and fundamentally by mocking `time.monotonic()` and `asyncio.sleep()` for determinism or using `asyncio.Event().wait()` for indefinite blocking

Applied to files:

  • tests/unit/observability/test_otlp_handler.py
📚 Learning: 2026-04-08T12:31:46.839Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-08T12:31:46.839Z
Learning: Applies to **/*.{py,ts,tsx,js,go} : If implementation deviates from the design spec (better approach found, scope evolved, etc.), alert the user and explain why -- user decides whether to proceed or update the spec. Do NOT silently diverge -- every deviation needs explicit user approval.

Applied to files:

  • docs/design/operations.md
  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-18T08:23:08.912Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-18T08:23:08.912Z
Learning: When approved deviations occur, update the relevant `docs/design/` page to reflect the new reality.

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-31T14:31:11.894Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:31:11.894Z
Learning: Applies to docs/design/*.md : Update the relevant `docs/design/` page when approved deviations occur to reflect the new reality

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-30T10:41:40.176Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-30T10:41:40.176Z
Learning: Update the relevant `docs/design/` page to reflect new reality when approved deviations from spec occur

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-21T11:08:01.542Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-21T11:08:01.542Z
Learning: When implementation deviates from the spec (better approach found, scope evolved, etc.), alert the user and explain why — user decides whether to proceed or update the spec. Do NOT silently diverge — every deviation needs explicit user approval.

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to docs/design/**/*.md : Design specification pages in `docs/design/` must be consulted before implementing features (7 pages: index, agents, organization, communication, engine, memory, operations)

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-31T14:17:24.182Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T14:17:24.182Z
Learning: Alert the user and explain why if implementation deviates from the design spec; do NOT silently diverge

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-19T07:13:44.964Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:13:44.964Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue — DESIGN_SPEC.md is a pointer file linking to 7 design pages (Agents, Organization, Communication, Engine, Memory, Operations)

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to docs/design/*.md : Design spec pages: 7 pages in `docs/design/` — index, agents, organization, communication, engine, memory, operations

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-15T18:38:44.202Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-15T18:38:44.202Z
Learning: Always read the relevant `docs/design/` page before implementing any feature or planning any issue. DESIGN_SPEC.md is a pointer file linking to the 7 design pages (index, agents, organization, communication, engine, memory, operations).

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-21T12:54:22.557Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-21T12:54:22.557Z
Learning: Version bumping (pre-1.0): `fix:` = patch, `feat:` = patch, `feat!:` = minor, `BREAKING CHANGE` trailer = minor. Update version in `pyproject.toml` (`[tool.commitizen].version`) and `src/synthorg/__init__.py` (`__version__`)

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/**/*.py : Package structure: src/synthorg/ organized as: api/ (REST+WebSocket, Litestar), auth/ (auth subpackage), backup/ (scheduled/manual backups), budget/ (cost tracking, CFO), cli/ (superseded by Go CLI), communication/ (message bus, meetings), config/ (YAML loading), core/ (domain models, resilience config), engine/ (orchestration, task state, coordination, approval gates, stagnation detection, context budget, compaction), hr/ (hiring, performance, promotion), memory/ (pluggable backend, Mem0, retrieval, consolidation), persistence/ (operational data, SQLite, settings), observability/ (logging, correlation, sinks), providers/ (LLM abstraction, LiteLLM, auth types, presets, runtime CRUD), settings/ (runtime-editable, typed definitions, encryption, config bridge), security/ (SecOps, rule engine, output scanning, progressive trust, autonomy levels), templates/ (company templates, personalities), tools/ (registry, built-in tools, git, sandbox, code_runner, MCP...

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-19T07:12:14.508Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-19T07:12:14.508Z
Learning: Applies to src/synthorg/api/**/*.py : API package (api/): Litestar REST + WebSocket with controllers, guards, channels, JWT + API key + WS ticket auth, approval gate integration, coordination endpoint, collaboration endpoint, settings endpoint, provider management endpoint (CRUD + test + presets), backup endpoint, RFC 9457 structured errors, AppState hot-reload slots, service auto-wiring (Phase 1 at construction, Phase 2 on startup), lifecycle helpers

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-16T06:24:56.341Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T06:24:56.341Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability must use structured logging with correlation tracking and log sinks

Applied to files:

  • docs/design/operations.md
  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Security: SecOps agent, rule engine (soft-allow/hard-deny, fail-closed), audit log, output scanner, output scan response policies (redact/withhold/log-only/autonomy-tiered), risk classifier, risk tier classifier, action type registry, ToolInvoker security integration, progressive trust (4 strategies: disabled/weighted/per-category/milestone), autonomy levels (presets, resolver, change strategy), timeout policies (park/resume).

Applied to files:

  • docs/design/operations.md
📚 Learning: 2026-03-17T06:30:14.180Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T06:30:14.180Z
Learning: Applies to src/synthorg/observability/**/*.py : Observability includes structured logging via `get_logger(__name__)`, correlation tracking, and log sinks.

Applied to files:

  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-17T22:08:13.456Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-17T22:08:13.456Z
Learning: Applies to src/synthorg/**/*.py : Handle errors explicitly, never silently swallow. Validate at system boundaries (user input, external APIs, config files).

Applied to files:

  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to **/*.py : Handle errors explicitly—never silently swallow exceptions.

Applied to files:

  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-04-08T12:31:46.839Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-08T12:31:46.839Z
Learning: Applies to **/*.py : Handle errors explicitly, never silently swallow exceptions.

Applied to files:

  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to **/*.py : Handle errors explicitly; never silently swallow exceptions

Applied to files:

  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-26T15:18:16.848Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-26T15:18:16.848Z
Learning: Applies to src/**/*.py : Handle errors explicitly, never silently swallow them. All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-04-02T21:38:30.127Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T21:38:30.127Z
Learning: Applies to {**/*.py,web/src/**/*.{ts,tsx}} : Handle errors explicitly; never silently swallow exceptions

Applied to files:

  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-14T15:43:05.601Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T15:43:05.601Z
Learning: Applies to **/*.py : Handle errors explicitly, never silently swallow exceptions

Applied to files:

  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-14T16:18:57.267Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-14T16:18:57.267Z
Learning: Applies to src/ai_company/!(observability)/**/*.py : All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-31T20:07:03.035Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-31T20:07:03.035Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising; all state transitions must log at INFO; DEBUG for object creation, internal flow, entry/exit of key functions

Applied to files:

  • src/synthorg/observability/otlp_handler.py
📚 Learning: 2026-03-16T07:22:28.134Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-16T07:22:28.134Z
Learning: Applies to src/synthorg/**/*.py : All error paths must log at WARNING or ERROR with context before raising. All state transitions must log at INFO. DEBUG for object creation, internal flow, and key function entry/exit

Applied to files:

  • src/synthorg/observability/otlp_handler.py

Comment on lines +84 to +89
self._queue: queue.SimpleQueue[logging.LogRecord] = queue.SimpleQueue()
self._pending_count = 0
self._pending_lock = threading.Lock()
self._dropped_count = 0
self._shutdown = threading.Event()
self._batch_ready = threading.Event()

⚠️ Potential issue | 🟠 Major

Bound the pending buffer before a collector outage can exhaust process memory.

emit() keeps accepting records, but there is no maximum pending-record limit. While _export_batch() is blocked on network I/O, request threads can keep enqueueing indefinitely, so a slow or unreachable collector turns a telemetry failure into unbounded memory growth. Please add a bounded pending buffer and convert overflow into dropped-record accounting.

Also applies to: 98-109

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/otlp_handler.py` around lines 84 - 89, The pending
buffer is unbounded (self._queue: queue.SimpleQueue) so emit() can enqueue
forever while _export_batch() is blocked; change the queue to a bounded queue
(e.g., queue.Queue(maxsize=MAX_PENDING)) and update emit() to attempt
non-blocking put or check full() under self._pending_lock, incrementing
self._dropped_count on overflow and returning early; remove/avoid relying on a
separate _pending_count by using queue.qsize() (or keep it but update/guard it
consistently under _pending_lock) and ensure _export_batch() reads from the
bounded queue normally, preserving existing shutdown/_batch_ready logic and
using the same drop accounting for the other enqueue sites referenced around
_pending_count lines.
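The bounded-buffer fix the reviewer describes can be sketched as follows (class and constant names are hypothetical, not the actual `OtlpHandler` code; a real handler would use a much larger limit):

```python
import logging
import queue
import threading

_MAX_PENDING = 4  # tiny for illustration; a real handler might use 10_000


class BoundedEmitSketch:
    """Sketch of the suggested fix: emit() drops (and counts) records once
    the pending buffer is full, instead of growing without bound while the
    exporter is blocked on network I/O."""

    def __init__(self) -> None:
        self._queue: queue.Queue[logging.LogRecord] = queue.Queue(maxsize=_MAX_PENDING)
        self._dropped_count = 0
        self._dropped_lock = threading.Lock()

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self._queue.put_nowait(record)  # non-blocking: never stalls request threads
        except queue.Full:
            with self._dropped_lock:
                self._dropped_count += 1  # overflow becomes drop accounting


handler = BoundedEmitSketch()
record = logging.LogRecord("test", logging.INFO, __file__, 1, "msg", None, None)
for _ in range(6):
    handler.emit(record)
assert handler._queue.qsize() == 4
assert handler._dropped_count == 2
```

With `queue.Queue(maxsize=...)` the flusher thread still reads batches normally, while overflow during a collector outage surfaces as a dropped-record counter rather than unbounded memory growth.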

Aureliolo added 11 commits April 8, 2026 16:31
- Add SinkType.PROMETHEUS and SinkType.OTLP to enums
- Add OtlpProtocol enum (HTTP_PROTOBUF, GRPC)
- Extend SinkConfig with prometheus_port and OTLP fields
- Add cross-type field validation for all 6 sink types
- Create PrometheusCollector with scrape-time refresh from AppState
- Create MetricsController (unauthenticated /metrics endpoint)
- Create OtlpHandler mapping structlog events to OTLP log records
- Wire PrometheusCollector into AppState
- Exclude /metrics from auth middleware
- Add prometheus-client and opentelemetry-* dependencies
- Add metrics event constants module
- Update operations.md with /metrics endpoint and close G1 gap
- 27 new tests across collector, controller, handler, enums, config
- Add all 5 metric families: agent counts (by status+trust level),
  task counts (by status), cost total, budget utilization (used %,
  monthly limit), coordination metrics (efficiency, overhead %)
- Add set_prometheus_collector() to AppState for deferred init
- Fix PrometheusCollector.refresh() to isolate per-service errors
- Fix OtlpHandler.close() to always drain remaining records
- Add otlp_batch_size + otlp_timeout_seconds to SinkConfig
- Protect OtlpHandler._dropped_count with lock
- Update CLAUDE.md: add metrics events + otlp_handler.py exception
- Add tests: export failure, build_handler dispatch, partial failure,
  budget utilization, task counts, trust level labels, config fields

Pre-reviewed by 15 agents, 18 findings addressed
…ilot, Gemini

CRITICAL:
- Reject OtlpProtocol.GRPC with NotImplementedError (not implemented)
- Use self.format(record) in OtlpHandler instead of record.getMessage()
  so ProcessorFormatter and foreign_pre_chain run
- Clear agent/task gauge labels before refresh to drop stale combos
- Always append /metrics to auth exclude_paths (like setup_status_path)

MAJOR:
- Remove unused opentelemetry-api/-sdk/-exporter-otlp-proto-http deps
  (handler uses urllib, not official exporters)
- Validate security verdict against bounded set (prevent label explosion)
- Mock urllib in test_export_failure for determinism under xdist
- Add CRLF validation for OTLP headers (header injection prevention)
- Add SSRF check: reject private/loopback IPs in OTLP endpoint
- Add TLS warning for unencrypted HTTP with auth headers
- Add success debug logs to set_prometheus_collector, set_task_engine,
  set_auth_service, set_backup_service (consistency with other setters)
- Fix operations.md G1 status: "infrastructure added" not "closed"
- Fix budget metric help text to clarify accumulated vs monthly cost

MEDIUM:
- Assert 503 response body in test_returns_503_when_collector_not_configured
- Expand /metrics endpoint description in operations.md with all 9 metrics
- Extract _DEFAULT_OTLP_BATCH_SIZE/_TIMEOUT constants (no magic numbers)
- Add agent label to tasks_total gauge (issue #1122 acceptance criteria)

Closes #1122
Closes #1124
- Wire PrometheusCollector in on_startup() so /metrics returns 200
- Change setter success logs from DEBUG to INFO (state transitions)
- Rename OtlpProtocol.HTTP_PROTOBUF to HTTP_JSON (sends JSON, not protobuf)
- Rename synthorg_agents_total to synthorg_active_agents_total (uses list_active)
- Remove dead prometheus_port config knob and all its validation/tests
- DNS-aware SSRF: resolve hostnames and reject private IPs (best-effort)
- Log warning before raising ValueError on unknown security verdict
- Fetch get_total_cost() once per scrape, share across cost+budget metrics
- Add except MemoryError, RecursionError: raise guards in all refresh methods
- Add _start_flusher flag to OtlpHandler for deterministic tests
- Update tests for all of the above
- Update G1 docs: collector IS wired in on_startup(), status now "Closed"
- Reject OtlpProtocol.GRPC at SinkConfig validation (fail-fast at parse)
- Fix loopback SSRF inconsistency: allow 127.0.0.1/::1 same as localhost
- Add except MemoryError, RecursionError: raise to all otlp_handler excepts
- OTLP JSON payload now matches OTLP/HTTP JSON spec: AnyValue body,
  KeyValue array attributes, string timeUnixNano, camelCase field names
- Fix _make_task: set assigned_to=None explicitly (prevent MagicMock child)
- Promote review_gate/approval_timeout setter logs from DEBUG to INFO
- Move GRPC rejection tests from handler to config level
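The SSRF-guard behavior described across these commits (reject private IPs, but allow loopback the same as `localhost`) can be sketched with the stdlib `ipaddress` module; the function name is hypothetical, and the DNS-aware hostname resolution the commits mention is omitted here:

```python
import ipaddress
from urllib.parse import urlsplit


def rejects_otlp_endpoint(url: str) -> bool:
    """Sketch of the SSRF guard: reject literal private IPs, but allow
    loopback (127.0.0.1/::1) the same as localhost. The DNS-aware variant
    would resolve hostnames to IPs first before applying the same check."""
    host = urlsplit(url).hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # hostname, not an IP literal; resolution handled elsewhere
    return ip.is_private and not ip.is_loopback


assert rejects_otlp_endpoint("http://10.0.0.5:4318/v1/logs")       # private: reject
assert not rejects_otlp_endpoint("http://127.0.0.1:4318/v1/logs")  # loopback: allow
assert not rejects_otlp_endpoint("https://collector.example.com/v1/logs")
```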

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (9)
src/synthorg/observability/enums.py (1)

52-61: ⚠️ Potential issue | 🟠 Major

OTLP transport contract does not match the stated HTTP/protobuf objective.

At Line 60, HTTP_JSON = "http/json" codifies a different public protocol than the PR objective (HTTP/protobuf). Please either align enum/config/docs/tests to protobuf or explicitly document and approve this scope change.

Suggested alignment sketch
 class OtlpProtocol(StrEnum):
@@
-    HTTP_JSON = "http/json"
+    HTTP_PROTOBUF = "http/protobuf"
     GRPC = "grpc"

As per coding guidelines: "If implementation deviates from the design spec (better approach found, scope evolved, etc.), alert the user and explain why. Do NOT silently diverge -- every deviation needs explicit user approval."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/enums.py` around lines 52 - 61, The OtlpProtocol
enum currently defines HTTP_JSON = "http/json" which conflicts with the PR
objective of using HTTP/protobuf; update the implementation to match the
intended transport or explicitly document the scope change: either set
OtlpProtocol.HTTP_JSON (or better rename to HTTP_PROTOBUF) to the protocol
string "http/protobuf" and update the class docstring and any references
(configs, tests, serializers) that use OtlpProtocol or the string literal, or if
you intend to keep JSON, change the docstring and tests to state HTTP/JSON and
get explicit approval for the scope change; search for uses of
OtlpProtocol.HTTP_JSON, OtlpProtocol, and "http/json" to align all usages.
src/synthorg/api/controllers/metrics.py (1)

41-47: ⚠️ Potential issue | 🔴 Critical

Call the collector guard instead of testing the method object.

has_prometheus_collector is a bound method here, so this condition is always truthy and the 503 branch never runs. An unconfigured app will fall through to app_state.prometheus_collector instead of failing cleanly.

🐛 Proposed fix
-        if not app_state.has_prometheus_collector:
+        if not app_state.has_prometheus_collector():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/api/controllers/metrics.py` around lines 41 - 47, The condition
is checking the bound method object app_state.has_prometheus_collector instead
of invoking it, so the 503 branch never runs; change the guard to call the
method (use app_state.has_prometheus_collector()) when deciding to log
METRICS_SCRAPE_FAILED and return the 503 Response, ensuring you still reference
app_state.prometheus_collector only when the method returns True.
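The truthy-bound-method failure mode is easy to reproduce in isolation. A stripped-down stand-in for AppState (hypothetical shape, not the project's actual class) showing why the original guard never fires:

```python
class AppState:
    """Minimal stand-in: a guard method over an optional collector."""

    def __init__(self) -> None:
        self._collector = None

    def has_prometheus_collector(self) -> bool:
        return self._collector is not None


state = AppState()

# Bug: the bound method object is always truthy, so the 503 branch
# guarded by `if not state.has_prometheus_collector:` can never run.
assert bool(state.has_prometheus_collector)

# Calling the method gives the real answer for an unconfigured app.
assert state.has_prometheus_collector() is False
```

Making the guard a `@property` instead of a plain method would remove this bug class entirely, at the cost of diverging from the existing call sites.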
docs/design/operations.md (1)

1304-1304: ⚠️ Potential issue | 🟠 Major

Keep the documented agent series aligned with #1122.

This row still advertises synthorg_active_agents_total, while the linked objective defines synthorg_agents_total{status,trust_level}. If the rename was intentional, add the approved deviation note here; otherwise update the metric name so the published contract stays consistent.

Based on learnings: "When implementation deviates from the design spec (better approach found, scope evolved, etc.), alert the user and explain why. Do NOT silently diverge -- every deviation needs explicit user approval."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/operations.md` at line 1304, The docs row describes the
Prometheus metric as synthorg_active_agents_total but the design `#1122` defines
synthorg_agents_total{status,trust_level}; update the documentation to match the
approved contract by renaming synthorg_active_agents_total to
synthorg_agents_total{status,trust_level} (and adjust any accompanying label
descriptions), or if the implementation intentionally deviated, add an explicit
“approved deviation” note explaining why synthorg_active_agents_total was chosen
instead and reference the design item (`#1122`) and the responsible change (metric
name change).
src/synthorg/observability/events/metrics.py (1)

16-20: 🧹 Nitpick | 🔵 Trivial

Split OTLP lifecycle events out of events.metrics.

These constants describe exporter lifecycle, not metrics scrape/collector state, and names like metrics.otlp.export_completed also break the repo's dotted event taxonomy. Move them to a dedicated OTLP module and rename them to otlp.export.completed / otlp.flusher.started so routing and queries stay consistent.

Based on learnings: "Observability package (observability/): structured logging, correlation tracking, log sinks; event constants organized by domain under observability/events/ (e.g., events.api, events.tool, events.git, events.context_budget, events.backup)"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/events/metrics.py` around lines 16 - 20, The OTLP
lifecycle constants METRICS_OTLP_EXPORT_COMPLETED, METRICS_OTLP_EXPORT_FAILED,
METRICS_OTLP_FLUSHER_STARTED and METRICS_OTLP_FLUSHER_STOPPED belong in a
dedicated OTLP events module, not in events.metrics; create a new observability
events module for OTLP (e.g., events.otlp) and move these constants there,
renaming both the constant identifiers and their string values to use the OTLP
domain (e.g., METRICS_OTLP_EXPORT_COMPLETED -> OTLP_EXPORT_COMPLETED with value
"otlp.export.completed", OTLP_EXPORT_FAILED -> "otlp.export.failed",
OTLP_FLUSHER_STARTED -> "otlp.flusher.started", OTLP_FLUSHER_STOPPED ->
"otlp.flusher.stopped"), update any imports/usages to the new names and module,
and remove the original definitions from events.metrics so the dotted event
taxonomy remains consistent.
src/synthorg/observability/config.py (1)

122-127: ⚠️ Potential issue | 🟠 Major

Fail closed on OTLP DNS resolution errors.

Returning on socket.gaierror lets an unresolved hostname skip the private/loopback check entirely. That weakens the SSRF boundary and accepts configs that may resolve to an internal collector later.

🛡️ Proposed fix
-        except socket.gaierror:
-            # DNS resolution failed -- skip check (hostname may be valid
-            # at runtime even if not resolvable at config-load time).
-            return
+        except socket.gaierror as exc:
+            msg = (
+                f"otlp_endpoint hostname {hostname!r} could not be resolved "
+                "during safety validation"
+            )
+            raise ValueError(msg) from exc

As per coding guidelines: "Validate at system boundaries (user input, external APIs, config files)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/config.py` around lines 122 - 127, The try/except
around socket.getaddrinfo (hostname, addrs) swallows socket.gaierror and returns
early, which lets unresolved hostnames bypass the private/loopback checks;
change the behavior to fail-closed by treating DNS resolution failures as
configuration errors: when socket.getaddrinfo raises socket.gaierror for
hostname, propagate/raise a clear exception (e.g., ValueError or a
ConfigValidationError) or return an explicit failure so the caller rejects the
config instead of returning; keep the rest of the validation (the addrs loop and
private/loopback checks) unchanged so unresolved hostnames cannot skip those
checks.
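A self-contained sketch of the fail-closed validation described above, using only the stdlib. validate_otlp_endpoint and its error messages are illustrative assumptions, not the project's actual API:

```python
import ipaddress
import socket
from urllib.parse import urlparse


def validate_otlp_endpoint(url: str) -> None:
    """Reject endpoints resolving to private/loopback ranges; fail closed on DNS errors."""
    hostname = urlparse(url).hostname
    if hostname is None:
        raise ValueError("otlp_endpoint has no hostname")
    try:
        addrs = socket.getaddrinfo(hostname, None)
    except socket.gaierror as exc:
        # Fail closed: an unresolvable hostname must not bypass the SSRF check.
        msg = f"otlp_endpoint hostname {hostname!r} could not be resolved"
        raise ValueError(msg) from exc
    for *_, sockaddr in addrs:
        ip = ipaddress.ip_address(sockaddr[0])
        if ip.is_private or ip.is_loopback:
            raise ValueError(f"otlp_endpoint resolves to disallowed address {ip}")
```

The `.invalid` TLD is reserved (RFC 2606) and guaranteed never to resolve, which makes the fail-closed branch easy to exercise in tests without network access.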
src/synthorg/observability/otlp_handler.py (2)

177-192: ⚠️ Potential issue | 🟠 Major

Full queue drain before batching may cause large transient allocation.

_drain_and_flush() collects the entire queue into a list before slicing into batches. Under backlog, this defeats the configured batch size limit and creates a large temporary allocation.

🛠️ Proposed fix: batch incrementally
     def _drain_and_flush(self) -> None:
         """Drain all queued records and export as OTLP batches."""
-        records: list[logging.LogRecord] = []
-        while True:
-            try:
-                records.append(self._queue.get_nowait())
-            except queue.Empty:
-                break
-
-        with self._pending_lock:
-            self._pending_count = max(0, self._pending_count - len(records))
-
-        for start in range(0, len(records), self._batch_size):
-            batch = records[start : start + self._batch_size]
-            if batch:
-                self._export_batch(batch)
+        while True:
+            batch: list[logging.LogRecord] = []
+            while len(batch) < self._batch_size:
+                try:
+                    batch.append(self._queue.get_nowait())
+                except queue.Empty:
+                    break
+            if not batch:
+                return
+            with self._pending_lock:
+                self._pending_count = max(0, self._pending_count - len(batch))
+            self._export_batch(batch)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/otlp_handler.py` around lines 177 - 192,
_drain_and_flush currently pulls the entire _queue into a records list then
slices into batches, which can allocate a very large temporary list under
backlog; change _drain_and_flush to read and process in incremental batches
(loop: build a single batch by calling self._queue.get_nowait up to
self._batch_size or until queue.Empty, then call self._export_batch(batch)),
update self._pending_count under self._pending_lock by subtracting the number of
records actually processed (not the length of a full list), and avoid creating a
full records list; keep use of _export_batch, _queue, _batch_size,
_pending_lock, and _pending_count to locate and modify the implementation.
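The incremental strategy can be checked in isolation. A minimal drain_in_batches sketch (hypothetical name, no locking) showing that at most one batch is materialized at a time:

```python
import queue


def drain_in_batches(q: queue.SimpleQueue, batch_size: int) -> list[list]:
    """Drain q without ever building a full-backlog list."""
    batches = []
    while True:
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(q.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return batches
        batches.append(batch)


q = queue.SimpleQueue()
for i in range(5):
    q.put(i)
```

In the real handler each completed batch would be handed to _export_batch immediately instead of being accumulated, so peak memory is bounded by batch_size.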

84-84: ⚠️ Potential issue | 🟠 Major

Unbounded queue may exhaust memory during collector outage.

SimpleQueue has no maximum size. If the OTLP collector is unreachable, emit() continues enqueueing indefinitely while _export_batch() blocks or fails, potentially causing unbounded memory growth under high log volume.

🛠️ Proposed fix using bounded queue
+_MAX_PENDING = 10_000  # Configurable cap on queued records
+
     def __init__(
         ...
     ) -> None:
         ...
-        self._queue: queue.SimpleQueue[logging.LogRecord] = queue.SimpleQueue()
+        self._queue: queue.Queue[logging.LogRecord] = queue.Queue(maxsize=_MAX_PENDING)
         ...

     def emit(self, record: logging.LogRecord) -> None:
         """Queue a record for batched OTLP export."""
         try:
-            self._queue.put_nowait(record)
+            try:
+                self._queue.put_nowait(record)
+            except queue.Full:
+                with self._pending_lock:
+                    self._dropped_count += 1
+                return
             with self._pending_lock:
                 self._pending_count += 1
                 if self._pending_count >= self._batch_size:
                     self._batch_ready.set()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/otlp_handler.py` at line 84, The current use of
queue.SimpleQueue for self._queue allows unbounded growth if the OTLP collector
is down; change self._queue to a bounded queue (e.g., queue.Queue(maxsize=...))
and make the behavior explicit in emit() and _export_batch(): in emit(), attempt
to enqueue with non-blocking put (put_nowait) or put with a short timeout and
handle queue.Full by dropping the record and optionally incrementing a
dropped-log metric or logging a warning; in _export_batch(), consume items using
get_nowait or get with a timeout until the batch is full or the queue is empty;
update type hints for self._queue and ensure any existing blocking assumptions
are adjusted accordingly.
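The drop-on-full policy is straightforward to demonstrate. A minimal sketch with invented names (BoundedEmitSketch, dropped), independent of the real handler's locking and batching:

```python
import queue


class BoundedEmitSketch:
    """Sketch of drop-on-full emit: shed load instead of growing without bound."""

    def __init__(self, maxsize: int) -> None:
        self._queue: queue.Queue[str] = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def emit(self, record: str) -> None:
        try:
            self._queue.put_nowait(record)
        except queue.Full:
            # Collector outage: count and drop rather than block the caller.
            self.dropped += 1


h = BoundedEmitSketch(maxsize=2)
for r in ("a", "b", "c"):
    h.emit(r)
```

Surfacing the drop counter as its own metric (e.g. a dropped-records counter) lets operators distinguish a healthy pipeline from one silently shedding logs.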
src/synthorg/observability/prometheus_collector.py (2)

47-114: 🧹 Nitpick | 🔵 Trivial

__init__ exceeds 50-line limit.

The constructor body spans ~67 lines. A data-driven registration pattern was suggested in prior reviews but not applied.

♻️ Proposed refactor using metric specs
# Define metric specs as class-level data
_METRIC_SPECS: tuple[tuple[str, type, str, tuple[str, ...]], ...] = (
    ("active_agents_total", Gauge, "Number of active agents", ("status", "trust_level")),
    ("tasks_total", Gauge, "Number of tasks by status and agent", ("status", "agent")),
    ("cost_total", Gauge, "Total accumulated cost", ()),
    # ... etc
)

def __init__(self, *, prefix: str = "synthorg") -> None:
    self._prefix = prefix
    self.registry = CollectorRegistry()
    
    # Register info metric
    self._info = Info(f"{prefix}_app", "SynthOrg application info", registry=self.registry)
    self._info.info({"version": __version__})
    
    # Register metrics from specs
    for suffix, metric_type, help_text, labels in self._METRIC_SPECS:
        metric = metric_type(f"{prefix}_{suffix}", help_text, labels or [], registry=self.registry)
        setattr(self, f"_{suffix}", metric)
    
    # ... security counter (special case with _VALID_VERDICTS)
    logger.debug(METRICS_COLLECTOR_INITIALIZED, prefix=prefix)

As per coding guidelines, "Functions must be less than 50 lines."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/prometheus_collector.py` around lines 47 - 114,
The __init__ method in PrometheusCollector exceeds the 50-line limit; refactor
it to a data-driven registration using a class-level metric spec tuple (e.g.,
_METRIC_SPECS) that lists metric suffix, metric class (Gauge/PromCounter), help
text, and label names, then in __init__ iterate over _METRIC_SPECS to
instantiate MetricClass(f"{prefix}_{suffix}", help, labels,
registry=self.registry) and setattr it to the instance (use a consistent
attribute name pattern like f"_{suffix}"); keep the Info metric initialization
(self._info) and the security counter (self._security_evaluations) as special
cases if needed (e.g., to pass _VALID_VERDICTS), and finish with the existing
logger.debug(METRICS_COLLECTOR_INITIALIZED, prefix=prefix).
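A runnable illustration of the spec-driven pattern, with plain strings standing in for prometheus_client metric objects so the shape of the refactor is visible without the dependency; all names here are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    suffix: str
    help_text: str
    labels: tuple[str, ...]


_METRIC_SPECS: tuple[MetricSpec, ...] = (
    MetricSpec("agents_total", "Agent counts by status and trust level", ("status", "trust_level")),
    MetricSpec("tasks_total", "Task counts by status and agent", ("status", "agent")),
    MetricSpec("cost_total", "Total accumulated cost", ()),
)


class CollectorSketch:
    """Spec-driven registration keeps __init__ short; strings stand in for Gauges."""

    def __init__(self, prefix: str = "synthorg") -> None:
        for spec in _METRIC_SPECS:
            # Real code would construct a Gauge/Counter bound to a CollectorRegistry.
            setattr(self, f"_{spec.suffix}", f"{prefix}_{spec.suffix}")
```

Special cases (the Info metric, the verdict-bounded security counter) stay as explicit lines after the loop, so the data-driven part only covers the uniform metrics.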

196-199: ⚠️ Potential issue | 🟡 Minor

Stale cost metric not cleared when tracker unavailable.

When total_cost is None (tracker unavailable or errored), the gauge retains its previous value. Subsequent scrapes will report stale data as current until the process restarts.

🛠️ Proposed fix
     def _refresh_cost_gauge(self, total_cost: float | None) -> None:
         """Update cost gauge from a pre-fetched total."""
         if total_cost is not None:
             self._cost_total.set(total_cost)
+        else:
+            # Reset to 0 when cost data is unavailable to avoid stale values
+            self._cost_total.set(0.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/prometheus_collector.py` around lines 196 - 199,
The _refresh_cost_gauge currently leaves self._cost_total unchanged when
total_cost is None, causing stale values to persist; update _refresh_cost_gauge
so that when total_cost is None it explicitly clears the metric (e.g., call
self._cost_total.set(0.0) or remove/reset the gauge) and otherwise set the
provided value as before, referring to the _refresh_cost_gauge method and the
_cost_total gauge.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/operations.md`:
- Around line 1545-1552: The docs mention OTLP as a supported sink but the
runtime-settings/custom_sinks schema still omits it; update the custom_sinks
schema block to include an "otlp" sink entry (alongside file, syslog, http) that
documents required fields such as transport (http), endpoint/url, headers,
format (OTLP JSON), batch/window settings, timeout/retry, authentication
options, and the mapping options for structlog->OTLP (correlation_id ->
trace/span context); also add a short example config snippet and validation
notes so operators can configure OTLP sinks via the same
runtime-settings/custom_sinks structure.
- Around line 1859-1867: The opening sentence "All gaps G1-G5 are now closed. G6
(budget history granularity) remains low-priority." contradicts the later
statement that remaining gaps are G3 and G4; update the text so both statements
align: either change the opening sentence to list the actual remaining gaps
(e.g., G3 and G4 with G6 low-priority) or change the concluding paragraph to
reflect that G1–G5 are closed and only G6 remains; ensure references to
SynthOrg, "Recommended Framing", and the gap identifiers G1–G6 are consistent
and remove the stale sentence so readers aren’t pointed in two directions.

In `@src/synthorg/api/app.py`:
- Around line 540-547: The Prometheus bootstrap (checking
app_state.has_prometheus_collector, importing PrometheusCollector, and calling
app_state.set_prometheus_collector) can raise on import/construct and currently
will crash startup; wrap the import + PrometheusCollector() construction in a
try/except Exception as e, log the failure at WARNING or ERROR with contextual
text and the exception, and do NOT re-raise so that app_state remains without a
collector (allowing the degraded /metrics 503 behavior); keep using the same
symbols: app_state.has_prometheus_collector, PrometheusCollector, and
app_state.set_prometheus_collector.

In `@src/synthorg/observability/config.py`:
- Around line 243-270: LogConfig currently only dedupes SYSLOG and HTTP sinks;
add OTLP sinks to that duplicate-detection logic by treating the OTLP identity
as the tuple (otlp_endpoint, otlp_protocol, normalized_otlp_headers). Update the
code that builds the dedupe key (where SYSLOG/HTTP keys are created) to include
an OTLP branch that returns None for missing endpoint or a key like ("otlp",
otlp_endpoint, otlp_protocol.value or name, tuple(sorted(otlp_headers))) so
header order differences don't bypass dedupe; apply the same change where
LogConfig validates/aggregates sinks (including the other OTLP field group using
otlp_export_interval_seconds/otlp_batch_size/otlp_timeout_seconds) to ensure
identical OTLP destinations are rejected.

---

Duplicate comments:
In `@docs/design/operations.md`:
- Line 1304: The docs row describes the Prometheus metric as
synthorg_active_agents_total but the design `#1122` defines
synthorg_agents_total{status,trust_level}; update the documentation to match the
approved contract by renaming synthorg_active_agents_total to
synthorg_agents_total{status,trust_level} (and adjust any accompanying label
descriptions), or if the implementation intentionally deviated, add an explicit
“approved deviation” note explaining why synthorg_active_agents_total was chosen
instead and reference the design item (`#1122`) and the responsible change (metric
name change).

In `@src/synthorg/api/controllers/metrics.py`:
- Around line 41-47: The condition is checking the bound method object
app_state.has_prometheus_collector instead of invoking it, so the 503 branch
never runs; change the guard to call the method (use
app_state.has_prometheus_collector()) when deciding to log METRICS_SCRAPE_FAILED
and return the 503 Response, ensuring you still reference
app_state.prometheus_collector only when the method returns True.

In `@src/synthorg/observability/config.py`:
- Around line 122-127: The try/except around socket.getaddrinfo (hostname,
addrs) swallows socket.gaierror and returns early, which lets unresolved
hostnames bypass the private/loopback checks; change the behavior to fail-closed
by treating DNS resolution failures as configuration errors: when
socket.getaddrinfo raises socket.gaierror for hostname, propagate/raise a clear
exception (e.g., ValueError or a ConfigValidationError) or return an explicit
failure so the caller rejects the config instead of returning; keep the rest of
the validation (the addrs loop and private/loopback checks) unchanged so
unresolved hostnames cannot skip those checks.

In `@src/synthorg/observability/enums.py`:
- Around line 52-61: The OtlpProtocol enum currently defines HTTP_JSON =
"http/json" which conflicts with the PR objective of using HTTP/protobuf; update
the implementation to match the intended transport or explicitly document the
scope change: either set OtlpProtocol.HTTP_JSON (or better rename to
HTTP_PROTOBUF) to the protocol string "http/protobuf" and update the class
docstring and any references (configs, tests, serializers) that use OtlpProtocol
or the string literal, or if you intend to keep JSON, change the docstring and
tests to state HTTP/JSON and get explicit approval for the scope change; search
for uses of OtlpProtocol.HTTP_JSON, OtlpProtocol, and "http/json" to align all
usages.

In `@src/synthorg/observability/events/metrics.py`:
- Around line 16-20: The OTLP lifecycle constants METRICS_OTLP_EXPORT_COMPLETED,
METRICS_OTLP_EXPORT_FAILED, METRICS_OTLP_FLUSHER_STARTED and
METRICS_OTLP_FLUSHER_STOPPED belong in a dedicated OTLP events module, not in
events.metrics; create a new observability events module for OTLP (e.g.,
events.otlp) and move these constants there, renaming both the constant
identifiers and their string values to use the OTLP domain (e.g.,
METRICS_OTLP_EXPORT_COMPLETED -> OTLP_EXPORT_COMPLETED with value
"otlp.export.completed", OTLP_EXPORT_FAILED -> "otlp.export.failed",
OTLP_FLUSHER_STARTED -> "otlp.flusher.started", OTLP_FLUSHER_STOPPED ->
"otlp.flusher.stopped"), update any imports/usages to the new names and module,
and remove the original definitions from events.metrics so the dotted event
taxonomy remains consistent.

In `@src/synthorg/observability/otlp_handler.py`:
- Around line 177-192: _drain_and_flush currently pulls the entire _queue
into a records list then slices into batches, which can allocate a very large
temporary list under backlog; change _drain_and_flush to read and process in
incremental batches (loop: build a single batch by calling
self._queue.get_nowait up to self._batch_size or until queue.Empty, then call
self._export_batch(batch)), update self._pending_count under self._pending_lock
by subtracting the number of records actually processed (not the length of a
full list), and avoid creating a full records list; keep use of _export_batch,
_queue, _batch_size, _pending_lock, and _pending_count to locate and modify the
implementation.
- Line 84: The current use of queue.SimpleQueue for self._queue allows unbounded
growth if the OTLP collector is down; change self._queue to a bounded queue
(e.g., queue.Queue(maxsize=...)) and make the behavior explicit in emit() and
_export_batch(): in emit(), attempt to enqueue with non-blocking put
(put_nowait) or put with a short timeout and handle queue.Full by dropping the
record and optionally incrementing a dropped-log metric or logging a warning; in
_export_batch(), consume items using get_nowait or get with a timeout until the
batch is full or the queue is empty; update type hints for self._queue and
ensure any existing blocking assumptions are adjusted accordingly.

In `@src/synthorg/observability/prometheus_collector.py`:
- Around line 47-114: The __init__ method in PrometheusCollector exceeds the
50-line limit; refactor it to a data-driven registration using a class-level
metric spec tuple (e.g., _METRIC_SPECS) that lists metric suffix, metric class
(Gauge/PromCounter), help text, and label names, then in __init__ iterate over
_METRIC_SPECS to instantiate MetricClass(f"{prefix}_{suffix}", help, labels,
registry=self.registry) and setattr it to the instance (use a consistent
attribute name pattern like f"_{suffix}"); keep the Info metric initialization
(self._info) and the security counter (self._security_evaluations) as special
cases if needed (e.g., to pass _VALID_VERDICTS), and finish with the existing
logger.debug(METRICS_COLLECTOR_INITIALIZED, prefix=prefix).
- Around line 196-199: The _refresh_cost_gauge currently leaves self._cost_total
unchanged when total_cost is None, causing stale values to persist; update
_refresh_cost_gauge so that when total_cost is None it explicitly clears the
metric (e.g., call self._cost_total.set(0.0) or remove/reset the gauge) and
otherwise set the provided value as before, referring to the _refresh_cost_gauge
method and the _cost_total gauge.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8b188354-4bab-49e7-bd29-244fd92c11db

📥 Commits

Reviewing files that changed from the base of the PR and between 1cfe8e1 and 68f1fa4.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (22)
  • CLAUDE.md
  • docs/DESIGN_SPEC.md
  • docs/architecture/acg-glossary.md
  • docs/design/operations.md
  • src/synthorg/api/app.py
  • src/synthorg/api/controllers/__init__.py
  • src/synthorg/api/controllers/metrics.py
  • src/synthorg/api/state.py
  • src/synthorg/observability/config.py
  • src/synthorg/observability/enums.py
  • src/synthorg/observability/events/metrics.py
  • src/synthorg/observability/otlp_handler.py
  • src/synthorg/observability/prometheus_collector.py
  • src/synthorg/observability/sinks.py
  • tests/unit/api/controllers/test_metrics.py
  • tests/unit/observability/conftest.py
  • tests/unit/observability/test_config.py
  • tests/unit/observability/test_enums.py
  • tests/unit/observability/test_events.py
  • tests/unit/observability/test_otlp_handler.py
  • tests/unit/observability/test_prometheus_collector.py
  • tests/unit/observability/test_sinks.py

Comment on lines +1545 to +1552
In addition to the 11 default sinks, three shipping sink types are available for centralized
log aggregation and telemetry export:

| Sink Type | Transport | Format | Description |
|-----------|-----------|--------|-------------|
| Syslog | UDP or TCP to a configurable endpoint | JSON | Ship structured logs to rsyslog, syslog-ng, or Graylog |
| HTTP | Batched POST to a configurable URL | JSON array | Ship log batches to any JSON-accepting endpoint |
| OTLP | HTTP POST to an OpenTelemetry collector | OTLP JSON | Map structlog events to OTLP log records with correlation IDs as trace context |

⚠️ Potential issue | 🟠 Major

Document how operators actually configure OTLP sinks.

This section now announces OTLP as a supported sink, but the custom_sinks schema later in the page still only documents file, syslog, and http. Please update that runtime-settings block with the OTLP fields as part of the same doc change, otherwise the new sink is still effectively undocumented for operators.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/operations.md` around lines 1545 - 1552, The docs mention OTLP as
a supported sink but the runtime-settings/custom_sinks schema still omits it;
update the custom_sinks schema block to include an "otlp" sink entry (alongside
file, syslog, http) that documents required fields such as transport (http),
endpoint/url, headers, format (OTLP JSON), batch/window settings, timeout/retry,
authentication options, and the mapping options for structlog->OTLP
(correlation_id -> trace/span context); also add a short example config snippet
and validation notes so operators can configure OTLP sinks via the same
runtime-settings/custom_sinks structure.

Comment on lines +1859 to +1867
All gaps G1-G5 are now closed. G6 (budget history granularity) remains low-priority.

### Recommended Framing

SynthOrg should be positioned as an **orchestrated agent control plane**: policy-as-code,
metered coordination, and observable agent behavior -- all enforced from a single control
surface. This framing is accurate today for inventory, policy enforcement, and token
metering. Telemetry export (G1) is the primary gap between internal capability and the
external claim.
metering. With G1 closed (Prometheus `/metrics` + OTLP HTTP/JSON), the remaining gaps
are G3 (policy-as-code) and G4 (coordination metrics API) for full control-plane coverage.

⚠️ Potential issue | 🟡 Minor

Resolve the contradictory gap summary.

Line 1859 says G1-G5 are closed, but the concluding paragraph immediately says the remaining gaps are G3 and G4. One of those statements is stale, so the page currently points readers in two different directions.

📝 Possible cleanup
- metering. With G1 closed (Prometheus `/metrics` + OTLP HTTP/JSON), the remaining gaps
- are G3 (policy-as-code) and G4 (coordination metrics API) for full control-plane coverage.
+ metering. With G1-G5 closed, the only remaining low-priority gap is G6
+ (budget history granularity).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/operations.md` around lines 1859 - 1867, The opening sentence
"All gaps G1-G5 are now closed. G6 (budget history granularity) remains
low-priority." contradicts the later statement that remaining gaps are G3 and
G4; update the text so both statements align: either change the opening sentence
to list the actual remaining gaps (e.g., G3 and G4 with G6 low-priority) or
change the concluding paragraph to reflect that G1–G5 are closed and only G6
remains; ensure references to SynthOrg, "Recommended Framing", and the gap
identifiers G1–G6 are consistent and remove the stale sentence so readers aren’t
pointed in two directions.

Comment on lines +540 to +547
# Wire Prometheus collector (no dependencies, runs in-process).
if not app_state.has_prometheus_collector:
from synthorg.observability.prometheus_collector import ( # noqa: PLC0415
PrometheusCollector,
)

app_state.set_prometheus_collector(PrometheusCollector())


⚠️ Potential issue | 🟠 Major

Prometheus collector bootstrapping can hard-fail app startup.

At Lines 540-547, import/constructor exceptions are unhandled. That bypasses the intended degraded mode (/metrics returning 503 when unconfigured) and can take the whole API down on startup.

Resilient initialization pattern
         # Wire Prometheus collector (no dependencies, runs in-process).
         if not app_state.has_prometheus_collector:
-            from synthorg.observability.prometheus_collector import (  # noqa: PLC0415
-                PrometheusCollector,
-            )
-
-            app_state.set_prometheus_collector(PrometheusCollector())
+            try:
+                from synthorg.observability.prometheus_collector import (  # noqa: PLC0415
+                    PrometheusCollector,
+                )
+                app_state.set_prometheus_collector(PrometheusCollector())
+            except (MemoryError, RecursionError):
+                raise
+            except Exception:
+                logger.warning(
+                    API_APP_STARTUP,
+                    note="Prometheus collector unavailable; /metrics will return 503",
+                    exc_info=True,
+                )

As per coding guidelines: "Handle errors explicitly, never silently swallow. All error paths must log at WARNING or ERROR with context before raising."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/api/app.py` around lines 540 - 547, The Prometheus bootstrap
(checking app_state.has_prometheus_collector, importing PrometheusCollector, and
calling app_state.set_prometheus_collector) can raise on import/construct and
currently will crash startup; wrap the import + PrometheusCollector()
construction in a try/except Exception as e, log the failure at WARNING or ERROR
with contextual text and the exception, and do NOT re-raise so that app_state
remains without a collector (allowing the degraded /metrics 503 behavior); keep
using the same symbols: app_state.has_prometheus_collector, PrometheusCollector,
and app_state.set_prometheus_collector.

Comment on lines +243 to +270
# OTLP fields
otlp_endpoint: str | None = Field(
default=None,
description="OTLP collector endpoint URL",
)
otlp_protocol: OtlpProtocol = Field(
default=OtlpProtocol.HTTP_JSON,
description="OTLP transport protocol",
)
otlp_headers: tuple[tuple[str, str], ...] = Field(
default=(),
description="Extra OTLP headers as (name, value) pairs",
)
otlp_export_interval_seconds: float = Field(
default=5.0,
gt=0,
description="Seconds between OTLP export batches",
)
otlp_batch_size: int = Field(
default=_DEFAULT_OTLP_BATCH_SIZE,
gt=0,
description="Records per OTLP export batch",
)
otlp_timeout_seconds: float = Field(
default=_DEFAULT_OTLP_TIMEOUT,
gt=0,
description="HTTP request timeout in seconds for OTLP export",
)

⚠️ Potential issue | 🟠 Major

Add OTLP endpoint deduping to LogConfig.

OTLP is another catch-all shipping sink, but LogConfig still only rejects duplicate SYSLOG and HTTP destinations. Two identical OTLP sinks will export every record twice, skewing downstream counts and doubling ingestion cost.

➕ Follow-up outside this hunk
+    `@model_validator`(mode="after")
+    def _validate_no_duplicate_otlp_endpoints(self) -> Self:
+        """Ensure no duplicate OTLP endpoints."""
+        endpoints = [
+            s.otlp_endpoint
+            for s in self.sinks
+            if s.sink_type == SinkType.OTLP and s.otlp_endpoint is not None
+        ]
+        counts = Counter(endpoints)
+        dupes = sorted(endpoint for endpoint, count in counts.items() if count > 1)
+        if dupes:
+            msg = f"Duplicate OTLP endpoints: {dupes}"
+            raise ValueError(msg)
+        return self

Also applies to: 301-306

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/synthorg/observability/config.py` around lines 243 - 270, LogConfig
currently only dedupes SYSLOG and HTTP sinks; add OTLP sinks to that
duplicate-detection logic by treating the OTLP identity as the tuple
(otlp_endpoint, otlp_protocol, normalized_otlp_headers). Update the code that
builds the dedupe key (where SYSLOG/HTTP keys are created) to include an OTLP
branch that returns None for missing endpoint or a key like ("otlp",
otlp_endpoint, otlp_protocol.value or name, tuple(sorted(otlp_headers))) so
header order differences don't bypass dedupe; apply the same change where
LogConfig validates/aggregates sinks (including the other OTLP field group using
otlp_export_interval_seconds/otlp_batch_size/otlp_timeout_seconds) to ensure
identical OTLP destinations are rejected.
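The normalized identity key the prompt describes can be sketched standalone. The function names, the `"http_json"` string, and the example endpoints are illustrative; only the `("otlp", endpoint, protocol, sorted headers)` shape comes from the prompt:

```python
from collections import Counter


def otlp_dedupe_key(endpoint, protocol, headers):
    """Identity key for an OTLP sink: header order is normalized by sorting,
    so two sinks that differ only in header order still collide."""
    if endpoint is None:
        return None  # unconfigured sink: never counts as a duplicate
    return ("otlp", endpoint, protocol, tuple(sorted(headers)))


def find_duplicate_otlp_keys(sinks):
    """sinks: iterable of (endpoint, protocol, headers) triples.
    Returns the sorted list of keys that appear more than once."""
    keys = [k for k in (otlp_dedupe_key(*s) for s in sinks) if k is not None]
    return sorted(k for k, n in Counter(keys).items() if n > 1)


sinks = [
    ("https://otel.example:4318", "http_json", (("x-a", "1"), ("x-b", "2"))),
    ("https://otel.example:4318", "http_json", (("x-b", "2"), ("x-a", "1"))),
]
dupes = find_duplicate_otlp_keys(sinks)
# Both sinks normalize to the same key, so exactly one duplicate is reported.
```

This is stricter than the endpoint-only `Counter` in the follow-up diff above: two sinks pointing at the same endpoint but with different protocols or headers are treated as distinct destinations.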

@Aureliolo Aureliolo merged commit aaeaae9 into main Apr 8, 2026
28 of 29 checks passed
@Aureliolo Aureliolo deleted the feat/prometheus-otlp branch April 8, 2026 15:02
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview April 8, 2026 15:03 — with GitHub Actions Inactive
Aureliolo added a commit that referenced this pull request Apr 9, 2026
🤖 I have created a release *beep* *boop*
---


## [0.6.5](v0.6.4...v0.6.5) (2026-04-09)


### Features

* add control-plane API endpoints batch
([#1118](#1118),
[#1119](#1119),
[#1120](#1120),
[#1121](#1121))
([#1138](#1138))
([af11f0a](af11f0a))
* engine intelligence v2 -- trace enrichment, compaction, versioning
eval ([#1139](#1139))
([ed57dfa](ed57dfa)),
closes [#1123](#1123)
[#1125](#1125)
[#1113](#1113)
* generalize versioning to VersionSnapshot[T] for all entity types
([#1155](#1155))
([5f563ce](5f563ce)),
closes [#1131](#1131)
[#1132](#1132)
[#1133](#1133)
* implement auxiliary tool categories -- design, communication,
analytics ([#1152](#1152))
([b506ba4](b506ba4))
* implement multi-project support -- engine orchestration
([#242](#242))
([#1153](#1153))
([74f1362](74f1362))
* implement SharedKnowledgeStore append-only + MVCC consistency model
(Phase 1.5) ([#1134](#1134))
([965d3a1](965d3a1)),
closes [#1130](#1130)
* implement shutdown strategies and SUSPENDED task status
([#1151](#1151))
([6a0db11](6a0db11))
* persistent cost aggregation for project-lifetime budgets
([#1173](#1173))
([5c212c5](5c212c5)),
closes [#1156](#1156)
* Prometheus /metrics endpoint and OTLP exporter
([#1122](#1122))
([#1135](#1135))
([aaeaae9](aaeaae9)),
closes [#1124](#1124)
* Prometheus metrics -- daily budget %, per-agent cost, per-agent budget
% ([#1154](#1154))
([581c494](581c494)),
closes [#1148](#1148)


### Bug Fixes

* communication hardening -- meeting cooldown, circuit breaker backoff,
debate fallback
([#1140](#1140))
([fe82894](fe82894)),
closes [#1115](#1115)
[#1116](#1116)
[#1117](#1117)


### CI/CD

* bump wrangler from 4.80.0 to 4.81.0 in /.github in the all group
([#1144](#1144))
([b7c0945](b7c0945))


### Maintenance

* bump python from `6869258` to `5e59aae` in /docker/backend in the all
group ([#1141](#1141))
([01e99c2](01e99c2))
* bump python from `6869258` to `5e59aae` in /docker/sandbox in the all
group ([#1143](#1143))
([ea755bd](ea755bd))
* bump python from `6869258` to `5e59aae` in /docker/web in the all
group ([#1142](#1142))
([5416dd9](5416dd9))
* bump the all group across 1 directory with 2 updates
([#1181](#1181))
([d3d5adf](d3d5adf))
* bump the all group across 1 directory with 3 updates
([#1146](#1146))
([c609e6c](c609e6c))
* bump the all group in /cli with 2 updates
([#1177](#1177))
([afd9cde](afd9cde))
* bump the all group in /site with 3 updates
([#1178](#1178))
([7cff82a](7cff82a))
* bump the all group with 2 updates
([#1180](#1180))
([199a1a8](199a1a8))
* bump vitest from 4.1.2 to 4.1.3 in /site in the all group
([#1145](#1145))
([a8c1194](a8c1194))
* consolidated web deps (11 packages + hono security + test fixes)
([#1150](#1150))
([63a9390](63a9390)),
closes [#1147](#1147)
[#1136](#1136)
[#1137](#1137)
* pin Docker Python base image to 3.14.x
([#1182](#1182))
([8ffdd86](8ffdd86))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues:

- Add docs/architecture/acg-glossary.md -- bidirectional ACG-to-SynthOrg concept mapping
- Prometheus /metrics endpoint and OTLP exporter