
Implement enterprise logging system with comprehensive configuration #72

@Aureliolo

Summary

Implement a foundational enterprise-grade logging system that all subsequent code must use. This is a core infrastructure requirement — once merged, CLAUDE.md and contributing guidelines must mandate its use in every module.

The design spec lists Observable as a core design principle: "Every agent action, communication, and decision is logged and visible." Currently the codebase has zero logging — no imports, no loggers, no configuration. This issue fills that gap before M2+ modules start building on top.

Requirements

1. Structured Logging Library

  • Use structlog for structured, key-value logging (JSON in production, colored console in dev)
  • Wrap stdlib logging so third-party libraries (uvicorn, FastAPI, httpx, etc.) are captured too
  • All log entries must be machine-parseable (JSON lines format for files/streams)
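The issue mandates structlog, but the "machine-parseable JSON lines" contract can be illustrated with the stdlib alone. The sketch below (with a hypothetical `JsonLinesFormatter` name) shows the shape each entry should take — one JSON object per line with `timestamp`, `level`, `logger`, and `event` keys, matching the enrichment fields in section 5:

```python
import io
import json
import logging

class JsonLinesFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical name)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        return json.dumps(entry)

# Wire it to an in-memory stream to show the output is machine-parseable.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonLinesFormatter())
logger = logging.getLogger("ai_company.demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("agent started")

parsed = json.loads(stream.getvalue().splitlines()[0])
```

structlog's `JSONRenderer` plus its stdlib `ProcessorFormatter` would replace this formatter in the real implementation; the line-oriented output contract stays the same.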

2. Log Categories (Named Loggers)

Every subsystem gets its own named logger with independent configuration:

| Logger Name | Purpose | Default Level |
|---|---|---|
| `ai_company.core` | Domain model lifecycle events | INFO |
| `ai_company.engine` | Agent execution, task lifecycle | DEBUG |
| `ai_company.communication` | Inter-agent messages, bus events | INFO |
| `ai_company.providers` | LLM API calls, tokens, latency | INFO |
| `ai_company.budget` | Cost tracking, limit alerts, spending | INFO |
| `ai_company.security` | Audit trail, approvals, denials | INFO |
| `ai_company.memory` | Memory reads/writes, consolidation | DEBUG |
| `ai_company.tools` | Tool invocations, results, permissions | INFO |
| `ai_company.api` | HTTP requests, responses, middleware | INFO |
| `ai_company.cli` | CLI command invocations | INFO |
| `ai_company.config` | Config loading, validation, changes | INFO |
| `ai_company.templates` | Template loading, company building | INFO |
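Because these names form a dotted hierarchy, unlisted loggers inherit their level from the nearest configured ancestor — only the `ai_company` root and the subsystems that deviate from it need explicit entries. A minimal sketch (the helper name is illustrative):

```python
import logging

# Defaults from the table: the root gets INFO, noisier subsystems opt into DEBUG.
DEFAULT_LEVELS = {
    "ai_company": "INFO",
    "ai_company.engine": "DEBUG",
    "ai_company.memory": "DEBUG",
}

def apply_levels(levels: dict[str, str]) -> None:
    """Set each named logger's level; descendants inherit automatically."""
    for name, level in levels.items():
        logging.getLogger(name).setLevel(level)

apply_levels(DEFAULT_LEVELS)
```

After this runs, `ai_company.budget` (not listed) resolves to INFO through the `ai_company` root, while `ai_company.engine` is DEBUG.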

3. Multiple Output Sinks (Handlers)

All configurable, all toggleable independently:

| Sink | Format | Purpose |
|---|---|---|
| Console (stderr) | Colored, human-readable | Development / interactive use |
| Main log file | JSON lines | General application log |
| Audit log file | JSON lines | Security events only (approvals, denials, permission checks) |
| Error log file | JSON lines | WARNING+ only, for quick triage |
| Agent activity log | JSON lines | Per-agent actions, task state changes |
| Cost/usage log | JSON lines | LLM calls, token counts, costs (feeds analytics) |
| Debug log file | JSON lines | All levels including DEBUG/TRACE (rotated aggressively) |
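Routing the audit sink to security events only is a natural fit for a handler-level filter: attach all sinks to the `ai_company` root logger and let each handler decide which records it accepts. A stdlib sketch (class name hypothetical):

```python
import io
import logging

class OnlySecurity(logging.Filter):
    """Pass only records from the ai_company.security subtree (audit trail)."""

    def filter(self, record: logging.LogRecord) -> bool:
        return record.name.startswith("ai_company.security")

# Stand-in for the audit log file; a FileHandler would be used in practice.
audit_stream = io.StringIO()
audit_handler = logging.StreamHandler(audit_stream)
audit_handler.addFilter(OnlySecurity())

root = logging.getLogger("ai_company")
root.addHandler(audit_handler)
root.setLevel(logging.INFO)

logging.getLogger("ai_company.security").info("approval granted")
logging.getLogger("ai_company.engine").info("task started")

audit_lines = audit_stream.getvalue().splitlines()  # only the security event
```

The error sink is simpler still — `handler.setLevel(logging.WARNING)` with no filter — and the same pattern covers the agent-activity and cost sinks.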

4. Configuration System

Logging must be fully configurable via:

a) Python API (programmatic)

```python
from ai_company.logging import configure_logging, LogConfig

configure_logging(LogConfig(
    level="DEBUG",
    console_enabled=True,
    console_format="colored",  # or "json"
    file_enabled=True,
    file_path="logs/ai_company.log",
    file_rotation="10 MB",
    file_retention=30,  # days
    audit_log_enabled=True,
    audit_log_path="logs/audit.log",
    error_log_enabled=True,
    cost_log_enabled=True,
    per_logger_levels={
        "ai_company.engine": "DEBUG",
        "ai_company.providers": "WARNING",
    },
    json_indent=None,  # compact in production
    include_caller=True,  # file:line info
    include_timestamp=True,
    timestamp_format="iso",  # or "unix" or custom
    correlation_id_enabled=True,  # request/task correlation
))
```
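The Deliverables call for Pydantic models; as a dependency-free sketch of the same surface, a dataclass stand-in shows the fields and defaults implied above (the real `LogConfig` would add Pydantic validation):

```python
from dataclasses import dataclass, field

@dataclass
class LogConfig:
    """Sketch of the config surface; the real model would be Pydantic."""

    level: str = "INFO"
    console_enabled: bool = True
    console_format: str = "colored"      # "colored" | "json"
    file_enabled: bool = False
    file_path: str = "logs/ai_company.log"
    file_rotation: str = "10 MB"
    file_retention: int = 30             # days
    audit_log_enabled: bool = True
    per_logger_levels: dict = field(default_factory=dict)
    include_caller: bool = True
    correlation_id_enabled: bool = True

cfg = LogConfig(level="DEBUG", per_logger_levels={"ai_company.providers": "WARNING"})
```

Pydantic buys validation on top of this shape — rejecting unknown level names, parsing `"10 MB"` into bytes — which is why it is the deliverable rather than a plain dataclass.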

b) YAML configuration file

```yaml
logging:
  level: INFO
  console:
    enabled: true
    format: colored  # colored | json | plain
    level: null  # inherits from root
  files:
    main:
      enabled: true
      path: logs/ai_company.log
      format: json
      rotation: "10 MB"
      retention_days: 30
      level: null
    audit:
      enabled: true
      path: logs/audit.log
      format: json
      rotation: "50 MB"
      retention_days: 365
    errors:
      enabled: true
      path: logs/errors.log
      format: json
      level: WARNING
    debug:
      enabled: false
      path: logs/debug.log
      format: json
      rotation: "50 MB"
      retention_days: 7
    cost:
      enabled: true
      path: logs/cost.log
      format: json
  loggers:
    ai_company.engine: DEBUG
    ai_company.security: INFO
    ai_company.providers: WARNING
  context:
    include_caller: true
    include_timestamp: true
    timestamp_format: iso
    correlation_id: true
```

c) Environment variable overrides

```bash
AI_COMPANY_LOG_LEVEL=DEBUG
AI_COMPANY_LOG_CONSOLE_FORMAT=json
AI_COMPANY_LOG_FILE_ENABLED=true
AI_COMPANY_LOG_FILE_PATH=logs/app.log
AI_COMPANY_LOG_AUDIT_ENABLED=true
AI_COMPANY_LOG_PER_LOGGER="engine=DEBUG,providers=WARNING"
```
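The compound `AI_COMPANY_LOG_PER_LOGGER` value needs a small parser that expands the short names into full dotted logger names. A sketch of one plausible parsing rule (the helper and prefix handling are illustrative, not a spec):

```python
import os

PREFIX = "AI_COMPANY_LOG_"

def parse_per_logger(spec: str) -> dict[str, str]:
    """Expand "engine=DEBUG,providers=WARNING" into full logger names."""
    levels: dict[str, str] = {}
    for pair in filter(None, spec.split(",")):
        name, _, level = pair.partition("=")
        levels[f"ai_company.{name.strip()}"] = level.strip().upper()
    return levels

os.environ[PREFIX + "PER_LOGGER"] = "engine=DEBUG, providers=warning"
overrides = parse_per_logger(os.environ[PREFIX + "PER_LOGGER"])
```

Whitespace and case are normalized so shell quoting quirks don't silently drop an override; the result feeds directly into `per_logger_levels`.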

5. Contextual Enrichment (Automatic Fields)

Every log entry automatically includes (when available):

  • timestamp — ISO 8601
  • level — DEBUG/INFO/WARNING/ERROR/CRITICAL
  • logger — dotted logger name
  • event — the log message
  • correlation_id — ties related operations together (e.g., a single task execution)
  • agent_id — which agent produced this log
  • task_id — which task is being worked on
  • caller — file:line of the log call
  • thread / async_task — concurrency context
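The `correlation_id` field is the one that needs real plumbing: it must follow a task across coroutines without being passed explicitly. `contextvars` gives exactly that, and a log-record factory stamps it onto every record. A stdlib sketch (structlog's `contextvars` integration achieves the same thing):

```python
import contextvars
import logging

# ContextVar values propagate across async tasks without explicit plumbing.
correlation_id = contextvars.ContextVar("correlation_id", default=None)

_old_factory = logging.getLogRecordFactory()

def _with_correlation(*args, **kwargs):
    """Stamp the current correlation ID onto every record created."""
    record = _old_factory(*args, **kwargs)
    record.correlation_id = correlation_id.get()
    return record

logging.setLogRecordFactory(_with_correlation)

# At the start of a task execution, the engine would set the ID once:
correlation_id.set("task-42")
logger = logging.getLogger("ai_company.engine")
record = logger.makeRecord(logger.name, logging.INFO, __file__, 0,
                           "step done", (), None)
```

`agent_id` and `task_id` would be carried the same way, each in its own `ContextVar`, so enrichment stays automatic for every log call inside the task.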

6. Log Rotation & Retention

  • File rotation by size (default 10 MB) and/or time
  • Configurable retention period per sink
  • Compressed archived logs (.gz)
  • Audit logs: long retention (365 days default)
  • Debug logs: short retention (7 days default)
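Size-based rotation plus `.gz` compression needs no extra dependency: `RotatingFileHandler` exposes a `rotator` hook that runs at rollover time. A sketch under the stated "stdlib only" constraint (the `gzip_rotator` helper is illustrative):

```python
import gzip
import logging.handlers
import os
import shutil
import tempfile

def gzip_rotator(source: str, dest: str) -> None:
    """Compress the rotated file to dest.gz and remove the original."""
    with open(source, "rb") as sf, gzip.open(dest + ".gz", "wb") as df:
        shutil.copyfileobj(sf, df)
    os.remove(source)

log_dir = tempfile.mkdtemp()
path = os.path.join(log_dir, "ai_company.log")
# Tiny maxBytes so the demo actually rotates; production would use 10 MB.
handler = logging.handlers.RotatingFileHandler(path, maxBytes=200, backupCount=3)
handler.rotator = gzip_rotator  # stdlib hook: called as rotator(source, dest)

logger = logging.getLogger("rotation_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
for i in range(50):
    logger.info("message %d with some padding to trigger rotation", i)
handler.close()

archives = [f for f in os.listdir(log_dir) if f.endswith(".gz")]
```

Retention (deleting archives older than N days) is a small sweep over the log directory on startup or rollover; `TimedRotatingFileHandler` covers the time-based variant with the same `rotator` hook.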

7. Performance Requirements

  • Logging must not block the main event loop (async-safe)
  • File I/O on background thread (via QueueHandler or structlog async)
  • Minimal overhead when a log level is disabled (lazy evaluation)
  • Support for sampling high-volume debug logs in production
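The non-blocking requirement maps directly onto the stdlib `QueueHandler`/`QueueListener` pair: the caller only enqueues the record, and a background thread does the actual I/O. A minimal sketch:

```python
import io
import logging
import logging.handlers
import queue

log_queue: "queue.Queue[logging.LogRecord]" = queue.Queue()
stream = io.StringIO()              # stand-in for a real file handler
target = logging.StreamHandler(stream)

# The listener drains the queue on a background thread, so stream/file I/O
# never blocks the caller (or the asyncio event loop).
listener = logging.handlers.QueueListener(log_queue, target)
listener.start()

logger = logging.getLogger("ai_company.async_demo")
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.setLevel(logging.INFO)
logger.info("non-blocking write")

listener.stop()  # flushes remaining records before returning
output = stream.getvalue()
```

All file sinks can share one queue and one listener; only the cheap enqueue happens on the hot path, and disabled levels are rejected before even that.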

8. Testing & Development Support

  • caplog / log_capture fixture for testing log output in pytest
  • Ability to assert on structured log fields (not just message text)
  • configure_logging(testing=True) preset that captures to list
  • Log output in test runs configurable via --log-cli-level
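The `testing=True` preset amounts to swapping the sinks for a handler that appends structured dicts to a list, so tests can assert on fields rather than scraping message text. A sketch of the capture side (the class name is hypothetical):

```python
import logging

class ListHandler(logging.Handler):
    """Testing preset: capture structured entries into a plain list."""

    def __init__(self) -> None:
        super().__init__()
        self.entries: list[dict] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.entries.append({
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        })

capture = ListHandler()
logger = logging.getLogger("ai_company.under_test")
logger.addHandler(capture)
logger.setLevel(logging.DEBUG)
logger.warning("budget threshold reached")

entry = capture.entries[0]  # assert on fields, not on formatted strings
```

A pytest fixture would install this handler in setup and remove it in teardown; pytest's built-in `caplog` covers the stdlib path, while this preset also sees structlog-enriched fields.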

Deliverables

  • src/ai_company/logging/ module with all components
  • Pydantic models for LogConfig and sub-configs
  • configure_logging() function (programmatic API)
  • YAML config integration (loads from company config file)
  • Environment variable override support
  • All 6+ output sinks implemented and toggleable
  • Per-logger level configuration
  • Contextual enrichment (correlation ID, agent ID, task ID, caller info)
  • File rotation and retention
  • Async-safe implementation
  • Pytest fixtures for log testing
  • Unit tests with 80%+ coverage
  • Update CLAUDE.md: all new code must use the logging system (no bare print(), no raw logging)
  • Update contributing guide with logging usage examples
  • Backfill logging into existing M1 domain models (core/, budget/)

Post-Implementation Enforcement

Once merged, add to CLAUDE.md:

```markdown
## Logging (MANDATORY)

- **Every module** must obtain a logger: `logger = get_logger(__name__)`
- **Never** use `print()` or raw `logging.getLogger()` — always use `ai_company.logging`
- **All public functions** must log entry/exit at DEBUG level
- **All error paths** must log at ERROR/WARNING with context
- **All state transitions** (task status, agent lifecycle) must log at INFO
- **Security events** (approvals, denials, permission checks) must use the audit logger
- **Cost events** (LLM calls, token usage) must include cost metadata
```

Dependencies

  • structlog — structured logging
  • Possibly python-json-logger — or rely on structlog's built-in JSON renderer
  • No other external deps needed (rotation via stdlib logging.handlers)


Labels

  • prio:critical — Blocks other work, must do first
  • scope:large — 3+ days of work
  • spec:architecture — DESIGN_SPEC Section 15 - Technical Architecture
  • spec:budget — DESIGN_SPEC Section 10 - Cost & Budget Management
  • spec:providers — DESIGN_SPEC Section 9 - Model Provider Layer
  • type:feature — New feature implementation
  • type:infra — CI/CD, tooling, project setup
