feat: add remote runtime investigation with Slack thread context (#301) by hamzzaaamalik · Pull Request #630 · Tracer-Cloud/opensre

hamzzaaamalik · 2026-04-17T10:10:20Z

Summary

Adds opensre investigate --service <name> — a runtime investigation workflow for deployed services. Instead of passing an alert file, OpenSRE pulls live signals (deployment status, recent logs, health probe) from the configured remote ops provider and feeds them into the existing RCA pipeline as evidence. Optionally incorporates Slack thread context via --slack-thread.

Fixes #301.

What changed

New capability: `opensre investigate --service <name>`

Gathers per-service evidence and runs the standard RCA pipeline against it:

Resolves <name> against the existing named-remote registry (load_named_remotes())
Fetches deployment status via RemoteOpsProvider.status()
Pulls the most recent ~100 log lines via the new fetch_logs() method
Probes /health / /ok via existing poll_deployment_health()
Packages everything into a rich raw_alert dict the investigation graph already knows how to consume
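
The gathering steps above can be sketched roughly as follows. This is a minimal illustration, not the code in app/remote/runtime_alert.py: `build_payload_sketch`, the `ServiceStatus` stand-in, the injected callables, and the `skipped` key are all assumptions made for the example. Only the payload keys `service`, `recent_logs`, `health_probe`, and `investigation_origin` come from this PR description.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ServiceStatus:
    """Hypothetical stand-in for the provider's status result."""

    url: Optional[str]
    deployment_status: str


def build_payload_sketch(
    service: str,
    get_status: Callable[[], ServiceStatus],
    get_logs: Callable[[], str],
    probe: Callable[[str], int],
) -> dict[str, Any]:
    """Collect status, logs, and health into one raw_alert-style dict."""
    status = get_status()

    # Log failures are recorded in the payload instead of aborting the run.
    try:
        recent_logs = get_logs()
    except Exception as exc:
        recent_logs = f"(logs unavailable: {exc})"

    # Without a service URL there is nothing to probe, so the probe is skipped.
    if status.url is None:
        health: dict[str, Any] = {"skipped": "no service URL"}
    else:
        try:
            health = {"status_code": probe(status.url)}
        except TimeoutError as exc:
            health = {"error": str(exc)}

    return {
        "service": service,
        "deployment_status": status.deployment_status,
        "recent_logs": recent_logs,
        "health_probe": health,
        "investigation_origin": "remote_runtime",
    }
```

Passing the provider calls in as plain callables keeps the sketch runnable without a Railway account; the real helper resolves them from the named-remote registry.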

New capability: `--slack-thread CHANNEL/TS`

Optional flag that pulls a specific Slack thread via conversations.replies and includes the messages as investigation context. Requires SLACK_BOT_TOKEN in the environment. Graceful: Slack fetch failures are captured in the payload instead of failing the investigation.
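
As a rough sketch of the idea, assuming nothing about the actual contents of app/remote/slack_context.py: the function names `parse_thread_ref` and `fetch_thread_sketch`, the injectable `get_json` transport, and the shape of the returned dicts are hypothetical. The Slack Web API method `conversations.replies` and its `channel`, `ts`, and `limit` parameters are real.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Callable

SLACK_REPLIES_URL = "https://slack.com/api/conversations.replies"


def parse_thread_ref(ref: str) -> tuple[str, str]:
    """Split 'CHANNEL/TS' into (channel, ts); raise ValueError if malformed."""
    channel, sep, ts = ref.strip().partition("/")
    if not sep or not channel or not ts or "/" in ts:
        raise ValueError(f"expected CHANNEL/TS, got {ref!r}")
    return channel, ts


def _http_get_json(url: str, token: str) -> dict[str, Any]:
    """Default transport: GET with a bearer token, decode the JSON body."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_thread_sketch(
    channel: str,
    ts: str,
    token: str,
    *,
    limit: int = 100,
    get_json: Callable[[str, str], dict[str, Any]] = _http_get_json,
) -> dict[str, Any]:
    """Return {'messages': [...]} on success or {'error': '...'} on any failure."""
    query = urllib.parse.urlencode({"channel": channel, "ts": ts, "limit": min(limit, 100)})
    try:
        body = get_json(f"{SLACK_REPLIES_URL}?{query}", token)
    except Exception as exc:  # network errors must not fail the investigation
        return {"error": f"slack request failed: {exc}"}
    if not body.get("ok"):
        return {"error": str(body.get("error", "unknown Slack API error"))}
    return {"messages": body.get("messages", [])}
```

Returning an error dict rather than raising is what lets the caller store the failure under `slack_thread.error` and continue.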

Extensible provider interface

Added fetch_logs(scope, *, lines) -> str as an abstract method on RemoteOpsProvider. Unlike the existing logs() (which streams to stdout for interactive use), fetch_logs() captures output so it can feed the agent. Implemented for Railway; other providers follow the same pattern.
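
The difference between streaming and capturing might look like the sketch below. `ProviderSketch` and `EchoProvider` are hypothetical names, and the child process is a small Python one-liner standing in for the Railway CLI, whose actual invocation is not shown in this PR description.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class ProviderSketch(ABC):
    """Hypothetical reduction of the provider interface to its two log methods."""

    @abstractmethod
    def logs(self, scope: str) -> None:
        """Stream logs to the terminal for interactive use."""

    @abstractmethod
    def fetch_logs(self, scope: str, *, lines: int) -> str:
        """Capture logs as a string so they can be passed to the agent."""


class EchoProvider(ProviderSketch):
    """Stand-in provider whose 'CLI' writes one line to stdout and one to stderr."""

    def _command(self, scope: str, lines: int) -> list[str]:
        script = "import sys; print('out:' + sys.argv[1]); print('err', file=sys.stderr)"
        return [sys.executable, "-c", script, f"{scope}:{lines}"]

    def logs(self, scope: str) -> None:
        # Inherit the parent's stdout/stderr: output goes straight to the terminal.
        subprocess.run(self._command(scope, 100), check=False)

    def fetch_logs(self, scope: str, *, lines: int) -> str:
        # Merge stderr into stdout so diagnostics are not silently dropped.
        completed = subprocess.run(
            self._command(scope, lines),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            check=False,
        )
        return completed.stdout.strip()
```

Because `fetch_logs` is abstract, a provider subclass that omits it cannot be instantiated, which makes the gap visible as soon as a new provider is added.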

Graceful fallbacks

Logs unavailable → (logs unavailable: <reason>) in the payload
Health probe timeout → error captured in health_probe
Missing service URL → health probe skipped
Slack fetch errors → captured in slack_thread.error
Unsupported provider → friendly OpenSREError with remediation hint
Mutual exclusion with --input / --input-json / --interactive / --print-template → rejected with clear message
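
The mutual-exclusion guard in the last item can be pictured with a small sketch. The helper names and the use of `ValueError` are assumptions for illustration; per the review on this PR, the real handler raises `OpenSREError`, which extends `click.ClickException`.

```python
from typing import Mapping, Optional

# Flags that select a different input mode and so conflict with --service.
CONFLICTING_FLAGS = ("--input", "--input-json", "--interactive", "--print-template")


def conflicting_flags(service: Optional[str], supplied: Mapping[str, object]) -> list[str]:
    """Return the conflicting flags the user actually set alongside --service."""
    if not service:
        return []
    return [flag for flag in CONFLICTING_FLAGS if supplied.get(flag)]


def check_service_flags(service: Optional[str], supplied: Mapping[str, object]) -> None:
    """Reject the invocation with a clear message if any conflict is present."""
    conflicts = conflicting_flags(service, supplied)
    if conflicts:
        raise ValueError(f"--service cannot be combined with {', '.join(conflicts)}")
```

Checking before any provider call means a bad flag combination fails fast, without touching the network.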

Acceptance criteria coverage (from #301)

| # | Criterion | Implementation |
|---|---|---|
| 1 | User can start remote investigation for deployed service | `opensre investigate --service <name>` |
| 2 | Pulls logs, health, deploy metadata from hosting provider | `fetch_logs`, `poll_deployment_health`, `provider.status` |
| 3 | Incorporates Slack context | `--slack-thread CHANNEL/TS` + `SLACK_BOT_TOKEN` |
| 4 | Output highlights likely causes, recent changes, next steps | Standard RCA pipeline (now with richer evidence) |
| 5 | Initial implementation documented and designed to extend | New docs page + nav entry; `RemoteOpsProvider` ABC extensible |

Design notes

alert_source is deliberately left unset in the payload. detect_sources.py filters core integrations (Grafana/Datadog/Honeycomb/Coralogix) by alert_source, so setting it to "remote_runtime" would silently suppress all four. The LLM in extract_alert may infer an alert_source from the log contents, which is the correct behavior.
No new state fields: extra context (service, recent_logs, health_probe, slack_thread) rides on raw_alert. Avoids the AgentState TypedDict / Pydantic drift-test churn and keeps the change local.
investigation_origin = "remote_runtime": purely informational marker for future tooling; no routing impact.
Narrow Slack approach: this PR does not add a full Slack read integration (config + catalog + detect_sources + agent tool). It exposes exactly what #301 (Add remote runtime investigation workflows using logs and Slack context) calls for, namely that the investigation workflow can incorporate a specific Slack thread, via a CLI flag and env var. A full Slack integration (so the agent can dynamically search/fetch more Slack context) is a natural follow-up.
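
To make the first design note concrete, here is a heavily simplified guess at how a filter keyed on alert_source would behave. The actual logic in detect_sources.py is not shown in this PR, so `select_sources_sketch` and its matching rule are assumptions; only the four integration names and the `alert_source` key come from the note itself.

```python
from typing import Any, Mapping

CORE_INTEGRATIONS = ("grafana", "datadog", "honeycomb", "coralogix")


def select_sources_sketch(raw_alert: Mapping[str, Any], configured: set[str]) -> list[str]:
    """Assumed rule: an explicit alert_source keeps only the matching integration."""
    available = [name for name in CORE_INTEGRATIONS if name in configured]
    alert_source = raw_alert.get("alert_source")
    if not alert_source:
        # Unset: every configured integration stays eligible.
        return available
    return [name for name in available if name == alert_source]


configured = {"grafana", "datadog"}
unset = select_sources_sketch({"service": "api"}, configured)
tagged = select_sources_sketch({"alert_source": "remote_runtime"}, configured)
```

Under this assumed rule, `unset` keeps both configured integrations while `tagged` is empty, which is the silent suppression the note warns about.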

Files changed

New (7):

app/remote/runtime_alert.py — build_runtime_alert_payload() helper
app/remote/slack_context.py — parse_slack_thread_ref() + fetch_slack_thread()
docs/remote-runtime-investigation.mdx — user-facing documentation
tests/remote/test_ops_fetch_logs.py — 6 tests
tests/remote/test_runtime_alert.py — 11 tests
tests/remote/test_slack_context.py — 13 tests
tests/cli/test_investigate_service_flag.py — 10 tests

Modified (4):

app/remote/ops.py — new fetch_logs() abstract + Railway impl
app/cli/commands/general.py — --service and --slack-thread flags + handler
docs/docs.json — nav entry under "Investigations"
.env.example — SLACK_BOT_TOKEN documented

Test plan

ruff check app/ tests/ — clean
python -m mypy app/ — 0 issues in 356 files
python -m pytest tests/remote/test_ops_fetch_logs.py tests/remote/test_runtime_alert.py tests/remote/test_slack_context.py tests/cli/test_investigate_service_flag.py -v — 40 / 40 passing
Full suite with coverage — zero regressions (same 6 failed / 176 errors as main, all pre-existing TypedDict issues on Python 3.11)
Manual smoke: opensre investigate --service <deployed-railway-service> against a real Railway deployment (local env required; reviewers with access please confirm)
Manual smoke: --slack-thread CHANNEL/TS with a real SLACK_BOT_TOKEN + channels:history scope

Follow-ups

Add fetch_logs() implementations for other providers (EC2, ECS, Vercel) — currently Railway-only.
Full Slack read integration (bot token in integration store, agent tool for dynamic thread/search queries) if the team wants the agent to pull Slack context proactively rather than only when the user specifies a thread.

greptile-apps · 2026-04-17T10:14:36Z

Greptile Summary

This PR adds opensre investigate --service <name>, a runtime investigation flow that fetches live signals (deployment status, logs, health probe) from the configured remote ops provider and feeds them into the existing RCA pipeline. An optional --slack-thread CHANNEL/TS flag pulls Slack thread context via conversations.replies when SLACK_BOT_TOKEN is set. All previously flagged issues (stderr discarding in fetch_logs, env isolation in tests, dead `and not service` clause) have been addressed in this revision.

Confidence Score: 5/5

Safe to merge — all previously flagged issues resolved, no new P0/P1 findings.

All three concerns from the prior review round (stderr discarding in fetch_logs, env isolation gap in tests, dead `and not service` clause) have been addressed. poll_deployment_health only ever raises TimeoutError, so the narrow except in _probe_health is correct. OpenSREError extends click.ClickException, so all user-facing errors render cleanly without tracebacks. The single concrete provider (Railway) fully implements the new abstract method. 40 tests pass with good branch coverage across happy path, graceful fallbacks, and mutual-exclusion guards.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| app/cli/commands/general.py | Adds --service/--slack-thread flags and _run_service_investigation(); mutual exclusion, analytics, and error handling are all correct; OpenSREError extends click.ClickException so errors render cleanly. |
| app/remote/ops.py | Adds fetch_logs() as an @abstractmethod on RemoteOpsProvider and implements it in RailwayRemoteOpsProvider with proper stdout/stderr merging; only one concrete provider exists so the abstract method is non-breaking. |
| app/remote/runtime_alert.py | New helper that resolves the named remote, fetches status/logs/health, and packages them into a raw_alert dict; graceful fallbacks for all three failure modes are present and tested. |
| app/remote/slack_context.py | New Slack thread helper; correctly validates CHANNEL/TS format, caps limit at 100, handles HTTP errors and API ok=false gracefully, and does not log the bot token. |
| tests/cli/test_investigate_service_flag.py | 10 tests covering happy path, output file, mutual exclusion (parametrised), error propagation, and Slack token validation; env isolation added via monkeypatch.delenv where needed. |
| tests/remote/test_ops_fetch_logs.py | 6 tests for RailwayRemoteOpsProvider.fetch_logs() covering stdout-only, stderr-only, mixed, whitespace stripping, non-zero exit, and --tail argument passthrough. |
| tests/remote/test_runtime_alert.py | 11 tests for build_runtime_alert_payload() covering happy path, missing name, unknown service, status failure, log unavailability, health timeout, no URL, and Slack thread inclusion/error capture. |
| tests/remote/test_slack_context.py | 13 tests for parse_slack_thread_ref() and fetch_slack_thread() covering parsing, malformed refs, missing token, success, ok=false, HTTP error, unexpected exceptions, and limit capping. |
| .env.example | Documents the new SLACK_BOT_TOKEN env var with correct scope requirements; placed appropriately alongside the existing SLACK_WEBHOOK_URL. |
| docs/remote-runtime-investigation.mdx | New user-facing doc covering prerequisites, usage, Slack thread context, mutual exclusion, provider extension, and known limitations; accurate and complete. |

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as investigate_command
    participant SvcInv as _run_service_investigation
    participant Builder as build_runtime_alert_payload
    participant Provider as RailwayRemoteOpsProvider
    participant Health as poll_deployment_health
    participant Slack as fetch_slack_thread
    participant RCA as run_investigation_cli

    User->>CLI: opensre investigate --service api-backend --slack-thread C01/1234
    CLI->>SvcInv: service=api-backend, slack_thread=C01/1234
    SvcInv->>SvcInv: validate no conflicting flags
    SvcInv->>SvcInv: read SLACK_BOT_TOKEN from env
    SvcInv->>Builder: build_runtime_alert_payload(service, slack_thread_ref, token)
    Builder->>Builder: load_named_remotes() — validate name
    Builder->>Builder: load_remote_ops_config() — resolve provider/project/service
    Builder->>Provider: status(scope)
    Provider-->>Builder: ServiceStatus(url, health, deployment_status, ...)
    Builder->>Provider: fetch_logs(scope, lines=100)
    Provider-->>Builder: log lines or (logs unavailable: ...)
    Builder->>Health: poll_deployment_health(url, max_attempts=2)
    Health-->>Builder: HealthPollStatus(status_code=200) or TimeoutError captured
    Builder->>Slack: fetch_slack_thread(channel, ts, token)
    Slack-->>Builder: messages dict or error dict
    Builder-->>SvcInv: raw_alert dict
    SvcInv->>RCA: run_investigation_cli(raw_alert=..., alert_name=..., ...)
    RCA-->>SvcInv: result dict
    SvcInv->>User: write_json(result, output) + SystemExit(0)

Reviews (4): Last reviewed commit: "Merge remote-tracking branch 'upstream/m..."

hamzzaaamalik · 2026-04-17T12:42:43Z

I noticed that test_validate_provider_credentials_returns_success_for_valid_anthropic_key in tests/cli/wizard/test_validation.py is failing on multiple open PRs (including mine, #630, the follow-up on #301).

Root cause: _load_anthropic_client() in app/cli/wizard/validation.py re-imports the real Anthropic class whenever AnthropicAuthError is None, which silently overrides the test's monkeypatch.setattr(...). When the test runs first in a pytest-xdist worker (i.e. before any other test has triggered _load_anthropic_client), the real Anthropic SDK picks up CI's ANTHROPIC_API_KEY secret and the key gets rejected. The failure is flaky because it depends on xdist worker distribution: adding new tests in any PR can shift the distribution and trigger it.

Repro on clean main:
git checkout main
ANTHROPIC_API_KEY="anything" pytest tests/cli/wizard/test_validation.py::test_validate_provider_credentials_returns_success_for_valid_anthropic_key
FAILED ... ValidationResult(ok=False, detail='Anthropic rejected the API key.')
Fix: ~2 lines per test — also monkeypatch AnthropicAuthError (and OpenAIAuthError for symmetry) so _load_anthropic_client doesn't re-import.
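
The mechanism can be reproduced in miniature without the Anthropic SDK. Everything below is a stand-in built for illustration: `load_client`, the `SimpleNamespace` playing the role of the validation module, and the placeholder classes are not the code in app/cli/wizard/validation.py.

```python
from types import SimpleNamespace


class RealClient:
    """Stand-in for the real SDK client that would read ANTHROPIC_API_KEY."""


class FakeClient:
    """Stand-in for the test double installed by monkeypatch.setattr."""


class RealAuthError(Exception):
    """Stand-in for the SDK's authentication error class."""


class FakeAuthError(Exception):
    """Stand-in for the extra patch proposed in the fix."""


def load_client(module_state: SimpleNamespace) -> type:
    """Lazy loader: re-imports BOTH names whenever the error class is unset."""
    if module_state.AnthropicAuthError is None:
        module_state.Anthropic = RealClient
        module_state.AnthropicAuthError = RealAuthError
    return module_state.Anthropic


# Patching only the client leaves the sentinel at None, so the patch is overwritten.
partially_patched = SimpleNamespace(Anthropic=FakeClient, AnthropicAuthError=None)
# Patching the error class too stops the loader from re-importing.
fully_patched = SimpleNamespace(Anthropic=FakeClient, AnthropicAuthError=FakeAuthError)
```

In this sketch `load_client(partially_patched)` returns `RealClient`, while `load_client(fully_patched)` returns `FakeClient`, which mirrors why patching the error class as well makes the test independent of execution order.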

Opening a small PR now to fix it, should unblock everyone's CI.

hamzzaaamalik · 2026-04-17T12:53:03Z

Please merge it first: #632

feat: add remote runtime investigation with Slack thread context (Tracer-Cloud#301)

0a0212b

greptile-apps Bot reviewed Apr 17, 2026

Comment thread app/cli/commands/general.py Outdated

Comment thread app/remote/ops.py Outdated

fix: address Greptile review — remove dead guard, merge stderr in fetch_logs

4142a22

greptile-apps Bot reviewed Apr 17, 2026

Comment thread tests/cli/test_investigate_service_flag.py

fix(tests): isolate SLACK_BOT_TOKEN env var in service-flag CLI tests

77f000a

Merge remote-tracking branch 'upstream/main' into issue/301-remote-runtime-investigation

9f61ccf

davincios merged commit afff9b4 into Tracer-Cloud:main Apr 17, 2026
11 checks passed

davincios merged 4 commits into Tracer-Cloud:main from hamzzaaamalik:issue/301-remote-runtime-investigation
