
feat: add remote runtime investigation with Slack thread context (#301) #630

Merged
davincios merged 4 commits into Tracer-Cloud:main from hamzzaaamalik:issue/301-remote-runtime-investigation
Apr 17, 2026

Conversation

hamzzaaamalik commented Apr 17, 2026

Collaborator

Summary

Adds opensre investigate --service <name> — a runtime investigation workflow for deployed services. Instead of passing an alert file, OpenSRE pulls live signals (deployment status, recent logs, health probe) from the configured remote ops provider and feeds them into the existing RCA pipeline as evidence. Optionally incorporates Slack thread context via --slack-thread.

Fixes #301.

What changed

New capability: opensre investigate --service <name>

Gathers per-service evidence and runs the standard RCA pipeline against it:

  1. Resolves <name> against the existing named-remote registry (load_named_remotes())
  2. Fetches deployment status via RemoteOpsProvider.status()
  3. Pulls the most recent ~100 log lines via the new fetch_logs() method
  4. Probes /health / /ok via existing poll_deployment_health()
  5. Packages everything into a rich raw_alert dict the investigation graph already knows how to consume
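
The five steps above can be sketched roughly as follows. This is an illustration only, not the code in app/remote/runtime_alert.py: `FakeProvider`, the `ServiceStatus` fields, the `probe_health` callable, the registry shape, and the `deployment` key are all assumptions made for the example. Only the `service`, `recent_logs`, `health_probe`, and `investigation_origin` keys are named in this PR.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any, Callable, Optional


@dataclass
class ServiceStatus:
    # Hypothetical shape; the real status object may carry other fields.
    url: Optional[str]
    deployment_status: str


class FakeProvider:
    """Stand-in for a RemoteOpsProvider implementation (assumption)."""

    def status(self, scope: str) -> ServiceStatus:
        return ServiceStatus(url=None, deployment_status="SUCCESS")

    def fetch_logs(self, scope: str, *, lines: int) -> str:
        raise RuntimeError("CLI not installed")


def build_payload(
    name: str,
    remotes: dict[str, str],
    provider: FakeProvider,
    probe_health: Callable[[str], dict[str, Any]],
) -> dict[str, Any]:
    # 1. Resolve the name against the registry of named remotes.
    if name not in remotes:
        raise KeyError(f"unknown service: {name}")
    scope = remotes[name]

    # 2. Deployment status.
    status = provider.status(scope)

    # 3. Recent logs; a failure becomes a placeholder string, not an exception.
    try:
        logs = provider.fetch_logs(scope, lines=100)
    except Exception as exc:
        logs = f"(logs unavailable: {exc})"

    # 4. Health probe, skipped when no URL is known.
    health: dict[str, Any]
    if status.url is None:
        health = {"skipped": "no service URL"}
    else:
        try:
            health = probe_health(status.url)
        except TimeoutError as exc:
            health = {"error": str(exc)}

    # 5. Package everything as a raw_alert-style dict.
    #    alert_source is intentionally not set (see the design notes).
    return {
        "service": name,
        "deployment": asdict(status),
        "recent_logs": logs,
        "health_probe": health,
        "investigation_origin": "remote_runtime",
    }


payload = build_payload(
    "api-backend",
    {"api-backend": "proj/api"},
    FakeProvider(),
    lambda url: {"status_code": 200},
)
```

In this sketch every failure mode ends up as data inside the payload, so the downstream pipeline still receives a complete dict even when logs or the health probe are missing.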

New capability: --slack-thread CHANNEL/TS

Optional flag that pulls a specific Slack thread via conversations.replies and includes the messages as investigation context. Requires SLACK_BOT_TOKEN in the environment. Graceful: Slack fetch failures are captured in the payload instead of failing the investigation.
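
A minimal sketch of how a CHANNEL/TS reference might be parsed and fetched without ever raising into the caller. The function names `parse_thread_ref` and `fetch_thread`, the `opener` parameter, and the returned dict shape are assumptions for illustration; the actual helpers live in app/remote/slack_context.py and may differ. The endpoint and its `channel`, `ts`, and `limit` parameters are those of Slack's conversations.replies method.

```python
from __future__ import annotations

import json
import urllib.parse
import urllib.request
from typing import Any, Callable

SLACK_REPLIES_URL = "https://slack.com/api/conversations.replies"


def parse_thread_ref(ref: str) -> tuple[str, str]:
    """Split 'CHANNEL/TS' into its two parts, rejecting malformed input."""
    channel, sep, ts = ref.partition("/")
    if not sep or not channel or not ts:
        raise ValueError(f"expected CHANNEL/TS, got {ref!r}")
    return channel, ts


def fetch_thread(
    channel: str,
    ts: str,
    token: str,
    *,
    limit: int = 100,
    opener: Callable[..., Any] = urllib.request.urlopen,  # injectable for tests
) -> dict[str, Any]:
    """Return {'messages': [...]} on success or {'error': '...'} on failure."""
    query = urllib.parse.urlencode(
        {"channel": channel, "ts": ts, "limit": min(limit, 100)}
    )
    request = urllib.request.Request(
        f"{SLACK_REPLIES_URL}?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with opener(request, timeout=10) as response:
            body = json.load(response)
    except Exception as exc:
        # Network and HTTP failures are captured instead of propagated.
        return {"error": str(exc)}
    if not body.get("ok"):
        # Slack reports API-level failures as ok=false plus an error code.
        return {"error": body.get("error", "unknown Slack error")}
    return {"messages": body.get("messages", [])}
```

Returning an error dict rather than raising is what lets the investigation continue when Slack is unreachable or the token lacks the required scope.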

Extensible provider interface

Added fetch_logs(scope, *, lines) -> str as an abstract method on RemoteOpsProvider. Unlike the existing logs() (which streams to stdout for interactive use), fetch_logs() captures output so it can feed the agent. Implemented for Railway; other providers follow the same pattern.
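
The difference between the streaming logs() and the capturing fetch_logs() can be illustrated with a small sketch. `CommandProvider` is a hypothetical provider invented for this example (it is not the Railway implementation), and its choices to raise on a non-zero exit and to trim output to the last `lines` lines are assumptions.

```python
from __future__ import annotations

import subprocess
import sys
from abc import ABC, abstractmethod


class RemoteOpsProvider(ABC):
    @abstractmethod
    def logs(self, scope: str) -> None:
        """Stream logs to the terminal for interactive use."""

    @abstractmethod
    def fetch_logs(self, scope: str, *, lines: int) -> str:
        """Capture recent logs and return them as text."""


class CommandProvider(RemoteOpsProvider):
    """Hypothetical provider that shells out to a fixed command."""

    def __init__(self, command: list[str]) -> None:
        self._command = command

    def logs(self, scope: str) -> None:
        # Output goes straight to the parent's stdout/stderr.
        subprocess.run(self._command, check=False)

    def fetch_logs(self, scope: str, *, lines: int) -> str:
        # scope is unused here; a real provider would use it to pick the service.
        result = subprocess.run(
            self._command, capture_output=True, text=True, check=False
        )
        if result.returncode != 0:
            raise RuntimeError(f"log command exited with {result.returncode}")
        # Merge stdout and stderr so nothing the command printed is lost.
        combined = (result.stdout + result.stderr).strip()
        return "\n".join(combined.splitlines()[-lines:])


provider = CommandProvider(
    [sys.executable, "-c", "import sys; print('out'); print('err', file=sys.stderr)"]
)
captured = provider.fetch_logs("demo", lines=100)
```

Because fetch_logs() is abstract, any provider subclass that does not implement it cannot be instantiated, which surfaces missing implementations early.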

Graceful fallbacks

  • Logs unavailable → (logs unavailable: <reason>) in the payload
  • Health probe timeout → error captured in health_probe
  • Missing service URL → health probe skipped
  • Slack fetch errors → captured in slack_thread.error
  • Unsupported provider → friendly OpenSREError with remediation hint
  • Mutual exclusion with --input / --input-json / --interactive / --print-template → rejected with clear message
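
The mutual-exclusion rule in the last bullet could be checked along these lines. This is a stdlib-only sketch: the function name, parameter names, and the use of ValueError are assumptions, whereas the PR itself reports such errors through OpenSREError.

```python
from __future__ import annotations

from typing import Optional


def check_service_flags(
    service: Optional[str],
    *,
    input_path: Optional[str] = None,
    input_json: Optional[str] = None,
    interactive: bool = False,
    print_template: bool = False,
) -> None:
    """Reject --service when combined with any other alert-input flag."""
    if service is None:
        return
    candidates = {
        "--input": input_path,
        "--input-json": input_json,
        "--interactive": interactive,
        "--print-template": print_template,
    }
    # A flag conflicts when it was given a truthy value.
    conflicts = [flag for flag, value in candidates.items() if value]
    if conflicts:
        raise ValueError(
            "--service cannot be combined with " + ", ".join(conflicts)
        )
```

For example, `check_service_flags("api-backend", interactive=True)` raises, while `check_service_flags("api-backend")` returns silently.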

Acceptance criteria coverage (from #301)

| # | Criterion | Implementation |
| --- | --- | --- |
| 1 | User can start remote investigation for deployed service | opensre investigate --service <name> |
| 2 | Pulls logs, health, deploy metadata from hosting provider | fetch_logs, poll_deployment_health, provider.status |
| 3 | Incorporates Slack context | --slack-thread CHANNEL/TS + SLACK_BOT_TOKEN |
| 4 | Output highlights likely causes, recent changes, next steps | Standard RCA pipeline (now with richer evidence) |
| 5 | Initial implementation documented and designed to extend | New docs page + nav entry; RemoteOpsProvider ABC extensible |

Design notes

  • alert_source is deliberately left unset in the payload. detect_sources.py filters core integrations (Grafana/Datadog/Honeycomb/Coralogix) by alert_source, so setting it to "remote_runtime" would silently suppress all four. The LLM in extract_alert may infer an alert_source from the log contents, which is the correct behavior.
  • No new state fields: extra context (service, recent_logs, health_probe, slack_thread) rides on raw_alert. Avoids the AgentState TypedDict / Pydantic drift-test churn and keeps the change local.
  • investigation_origin = "remote_runtime": purely informational marker for future tooling; no routing impact.
  • Narrow Slack approach: this PR does not add a full Slack read integration (config + catalog + detect_sources + agent tool). It exposes exactly what #301 calls for, namely that the investigation workflow can incorporate a specific Slack thread, via a CLI flag and env var. A full Slack integration (so the agent can dynamically search/fetch more Slack context) is a natural follow-up.
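
The first design note can be made concrete with a simplified model of source filtering. This is not the code in detect_sources.py: the function signature, the `configured` set, and the exact matching rule are assumptions chosen to show why an unrecognised alert_source would suppress every core integration.

```python
from __future__ import annotations

from typing import Any

CORE_INTEGRATIONS = ("grafana", "datadog", "honeycomb", "coralogix")


def detect_sources(raw_alert: dict[str, Any], configured: set[str]) -> list[str]:
    """Return the core integrations that would be queried for this alert."""
    alert_source = raw_alert.get("alert_source")
    selected = []
    for name in CORE_INTEGRATIONS:
        if name not in configured:
            continue
        # Assumed rule: once alert_source is set, only a matching
        # integration survives the filter.
        if alert_source and alert_source != name:
            continue
        selected.append(name)
    return selected


configured = {"grafana", "datadog"}
unset = detect_sources({"service": "api-backend"}, configured)
tagged = detect_sources({"alert_source": "remote_runtime"}, configured)
```

Under this model `unset` keeps both configured integrations while `tagged` is empty, which is the silent suppression the note describes.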

Files changed

New (7):

  • app/remote/runtime_alert.py: build_runtime_alert_payload() helper
  • app/remote/slack_context.py: parse_slack_thread_ref() + fetch_slack_thread()
  • docs/remote-runtime-investigation.mdx: user-facing documentation
  • tests/remote/test_ops_fetch_logs.py: 6 tests
  • tests/remote/test_runtime_alert.py: 11 tests
  • tests/remote/test_slack_context.py: 13 tests
  • tests/cli/test_investigate_service_flag.py: 10 tests

Modified (4):

  • app/remote/ops.py: new fetch_logs() abstract + Railway impl
  • app/cli/commands/general.py: --service and --slack-thread flags + handler
  • docs/docs.json: nav entry under "Investigations"
  • .env.example: SLACK_BOT_TOKEN documented

Test plan

  • ruff check app/ tests/ — clean
  • python -m mypy app/ — 0 issues in 356 files
  • python -m pytest tests/remote/test_ops_fetch_logs.py tests/remote/test_runtime_alert.py tests/remote/test_slack_context.py tests/cli/test_investigate_service_flag.py -v: 40 / 40 passing
  • Full suite with coverage — zero regressions (same 6 failed / 176 errors as main, all pre-existing TypedDict issues on Python 3.11)
  • Manual smoke: opensre investigate --service <deployed-railway-service> against a real Railway deployment (local env required; reviewers with access please confirm)
  • Manual smoke: --slack-thread CHANNEL/TS with a real SLACK_BOT_TOKEN + channels:history scope

Follow-ups

  • Add fetch_logs() implementations for other providers (EC2, ECS, Vercel) — currently Railway-only.
  • Full Slack read integration (bot token in integration store, agent tool for dynamic thread/search queries) if the team wants the agent to pull Slack context proactively rather than only when the user specifies a thread.

greptile-apps Bot commented Apr 17, 2026

Contributor

Greptile Summary

This PR adds opensre investigate --service <name>, a runtime investigation flow that fetches live signals (deployment status, logs, health probe) from the configured remote ops provider and feeds them into the existing RCA pipeline. An optional --slack-thread CHANNEL/TS flag pulls Slack thread context via conversations.replies when SLACK_BOT_TOKEN is set. All previously flagged issues (stderr discarding in fetch_logs, env isolation in tests, the dead "and not service" clause) have been addressed in this revision.

Confidence Score: 5/5

Safe to merge — all previously flagged issues resolved, no new P0/P1 findings.

All three concerns from the prior review round (stderr discarding in fetch_logs, the env isolation gap in tests, the dead "and not service" clause) have been addressed. poll_deployment_health only ever raises TimeoutError, so the narrow except in _probe_health is correct. OpenSREError extends click.ClickException, so all user-facing errors render cleanly without tracebacks. The single concrete provider (Railway) fully implements the new abstract method. 40 tests pass with good branch coverage across the happy path, graceful fallbacks, and mutual-exclusion guards.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| app/cli/commands/general.py | Adds --service/--slack-thread flags and _run_service_investigation(); mutual exclusion, analytics, and error handling are all correct; OpenSREError extends click.ClickException so errors render cleanly. |
| app/remote/ops.py | Adds fetch_logs() as @abstractmethod on RemoteOpsProvider and implements it in RailwayRemoteOpsProvider with proper stdout/stderr merging; only one concrete provider exists so the abstract method is non-breaking. |
| app/remote/runtime_alert.py | New helper that resolves the named remote, fetches status/logs/health, and packages them into a raw_alert dict; graceful fallbacks for all three failure modes are present and tested. |
| app/remote/slack_context.py | New Slack thread helper; correctly validates CHANNEL/TS format, caps limit at 100, handles HTTP errors and API ok=false gracefully, and does not log the bot token. |
| tests/cli/test_investigate_service_flag.py | 10 tests covering happy path, output file, mutual exclusion (parametrised), error propagation, and Slack token validation; env isolation added via monkeypatch.delenv where needed. |
| tests/remote/test_ops_fetch_logs.py | 6 tests for RailwayRemoteOpsProvider.fetch_logs() covering stdout-only, stderr-only, mixed, whitespace stripping, non-zero exit, and --tail argument passthrough. |
| tests/remote/test_runtime_alert.py | 11 tests for build_runtime_alert_payload() covering happy path, missing name, unknown service, status failure, log unavailability, health timeout, no URL, and Slack thread inclusion/error capture. |
| tests/remote/test_slack_context.py | 13 tests for parse_slack_thread_ref() and fetch_slack_thread() covering parsing, malformed refs, missing token, success, ok=false, HTTP error, unexpected exceptions, and limit capping. |
| .env.example | Documents the new SLACK_BOT_TOKEN env var with correct scope requirements; placed appropriately alongside the existing SLACK_WEBHOOK_URL. |
| docs/remote-runtime-investigation.mdx | New user-facing doc covering prerequisites, usage, Slack thread context, mutual exclusion, provider extension, and known limitations; accurate and complete. |

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as investigate_command
    participant SvcInv as _run_service_investigation
    participant Builder as build_runtime_alert_payload
    participant Provider as RailwayRemoteOpsProvider
    participant Health as poll_deployment_health
    participant Slack as fetch_slack_thread
    participant RCA as run_investigation_cli

    User->>CLI: opensre investigate --service api-backend --slack-thread C01/1234
    CLI->>SvcInv: service=api-backend, slack_thread=C01/1234
    SvcInv->>SvcInv: validate no conflicting flags
    SvcInv->>SvcInv: read SLACK_BOT_TOKEN from env
    SvcInv->>Builder: build_runtime_alert_payload(service, slack_thread_ref, token)
    Builder->>Builder: load_named_remotes() — validate name
    Builder->>Builder: load_remote_ops_config() — resolve provider/project/service
    Builder->>Provider: status(scope)
    Provider-->>Builder: ServiceStatus(url, health, deployment_status, ...)
    Builder->>Provider: fetch_logs(scope, lines=100)
    Provider-->>Builder: log lines or (logs unavailable: ...)
    Builder->>Health: poll_deployment_health(url, max_attempts=2)
    Health-->>Builder: HealthPollStatus(status_code=200) or TimeoutError captured
    Builder->>Slack: fetch_slack_thread(channel, ts, token)
    Slack-->>Builder: messages dict or error dict
    Builder-->>SvcInv: raw_alert dict
    SvcInv->>RCA: run_investigation_cli(raw_alert=..., alert_name=..., ...)
    RCA-->>SvcInv: result dict
    SvcInv->>User: write_json(result, output) + SystemExit(0)

Reviews (4): Last reviewed commit: "Merge remote-tracking branch 'upstream/m..."

Comment thread app/cli/commands/general.py Outdated
Comment thread app/remote/ops.py Outdated
Comment thread tests/cli/test_investigate_service_flag.py
@hamzzaaamalik

Collaborator Author

Noticed test_validate_provider_credentials_returns_success_for_valid_anthropic_key in tests/cli/wizard/test_validation.py is failing on multiple open PRs (including mine, #630, the follow-up on #301).

Root cause: _load_anthropic_client() in app/cli/wizard/validation.py re-imports the real Anthropic class whenever AnthropicAuthError is None, which silently overrides the test's monkeypatch.setattr(...). When the test runs first in a pytest-xdist worker (i.e. before any other test has triggered _load_anthropic_client), the real Anthropic SDK picks up CI's ANTHROPIC_API_KEY secret and the key gets rejected. The failure is flaky because it depends on xdist worker distribution: adding new tests in any PR can shift the distribution and trigger it.

Repro on clean main:
git checkout main
ANTHROPIC_API_KEY="anything" pytest tests/cli/wizard/test_validation.py::test_validate_provider_credentials_returns_success_for_valid_anthropic_key
FAILED ... ValidationResult(ok=False, detail='Anthropic rejected the API key.')
Fix: ~2 lines per test — also monkeypatch AnthropicAuthError (and OpenAIAuthError for symmetry) so _load_anthropic_client doesn't re-import.

Opening a small PR now to fix it, should unblock everyone's CI.
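
The override described above can be reproduced in isolation. The sketch below is a simplified model, not the code in app/cli/wizard/validation.py: a SimpleNamespace stands in for the module, and `RealClient` / `RealAuthError` stand in for what the lazy import would bring in from the SDK.

```python
from types import SimpleNamespace


class RealClient:
    """Stand-in for the SDK client the lazy import would load."""


class RealAuthError(Exception):
    """Stand-in for the SDK's authentication error."""


class FakeClient:
    """What a test would patch in."""


class FakeAuthError(Exception):
    """Patched alongside the client in the fixed tests."""


def make_module() -> SimpleNamespace:
    # Both names start as None until the lazy loader runs.
    return SimpleNamespace(Anthropic=None, AnthropicAuthError=None)


def load_client(module: SimpleNamespace) -> type:
    # The guard only looks at the error class, so a patched client
    # is overwritten whenever the error class is still None.
    if module.AnthropicAuthError is None:
        module.Anthropic = RealClient
        module.AnthropicAuthError = RealAuthError
    return module.Anthropic


# Patching only the client: the loader replaces it with the real one.
partial = make_module()
partial.Anthropic = FakeClient
partial_result = load_client(partial)

# Patching both names: the guard is skipped and the fake survives.
full = make_module()
full.Anthropic = FakeClient
full.AnthropicAuthError = FakeAuthError
full_result = load_client(full)
```

Here `partial_result` is `RealClient` and `full_result` is `FakeClient`, which mirrors why patching the error class as well makes the test independent of execution order.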

@hamzzaaamalik

Collaborator Author

Please merge it first: #632

davincios merged commit afff9b4 into Tracer-Cloud:main on Apr 17, 2026
11 checks passed

Development

Successfully merging this pull request may close these issues.

Add remote runtime investigation workflows using logs and Slack context

2 participants