feat: add remote runtime investigation with Slack thread context (#301) #630
Greptile Summary
This PR adds
Confidence Score: 5/5
Safe to merge: all previously flagged issues resolved, no new P0/P1 findings.
All three concerns from the prior review round (stderr discarding in fetch_logs, env isolation gap in tests, dead
No files require special attention.
Important Files Changed
Noticed `test_validate_provider_credentials_returns_success_for_valid_anthropic_key` in `tests/cli/wizard/test_validation.py` is failing on multiple open PRs (including mine, #630, the follow-up on #301).

Root cause: `_load_anthropic_client()` in `app/cli/wizard/validation.py` re-imports the real `Anthropic` class whenever `AnthropicAuthError` is `None`, which silently overrides the test's `monkeypatch.setattr(...)`. When the test runs first in a pytest-xdist worker (i.e. before any other test has triggered `_load_anthropic_client`), the real Anthropic SDK picks up CI's `ANTHROPIC_API_KEY` secret and the key gets rejected.

The failure is flaky because it depends on xdist worker distribution: adding new tests in any PR can shift the distribution and trigger it.

Repro on clean main:

Opening a small PR now to fix it, which should unblock everyone's CI.
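To make the root cause concrete, here is a minimal offline sketch of the clobbering pattern. Everything in it is a stand-in: `fake_sdk`, `FakeClient`, `fresh_module`, `load_client_buggy`, `load_client_guarded`, and `simulate` are hypothetical names, not the code in `app/cli/wizard/validation.py`, and the guarded variant is only one possible fix, not necessarily the one in the follow-up PR.

```python
import types

# Hypothetical stand-in for the real SDK package, so the sketch runs offline.
fake_sdk = types.SimpleNamespace(
    Anthropic=type("RealAnthropic", (), {}),
    AuthenticationError=type("AuthenticationError", (Exception,), {}),
)


class FakeClient:
    """Test double that a test would install via monkeypatch.setattr(...)."""


def fresh_module():
    # Mimics a module whose SDK names are populated lazily on first use.
    return types.SimpleNamespace(Anthropic=None, AnthropicAuthError=None)


def load_client_buggy(mod):
    # Re-imports both names whenever the error class is unset,
    # which overwrites a client the test has already patched in.
    if mod.AnthropicAuthError is None:
        mod.Anthropic = fake_sdk.Anthropic
        mod.AnthropicAuthError = fake_sdk.AuthenticationError
    return mod.Anthropic


def load_client_guarded(mod):
    # Fills in each name only if nothing (including a test) has set it yet.
    if mod.Anthropic is None:
        mod.Anthropic = fake_sdk.Anthropic
    if mod.AnthropicAuthError is None:
        mod.AnthropicAuthError = fake_sdk.AuthenticationError
    return mod.Anthropic


def simulate(loader):
    mod = fresh_module()        # first test to run in this worker
    mod.Anthropic = FakeClient  # what monkeypatch.setattr would do
    return loader(mod)


print(simulate(load_client_buggy).__name__)    # RealAnthropic
print(simulate(load_client_guarded).__name__)  # FakeClient
```

In the buggy variant the patched client survives only if some earlier test already populated the error class, which is why the outcome depends on test ordering within a worker.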
Please merge it first: #632
…ntime-investigation
Summary
Adds `opensre investigate --service <name>`, a runtime investigation workflow for deployed services. Instead of passing an alert file, OpenSRE pulls live signals (deployment status, recent logs, health probe) from the configured remote ops provider and feeds them into the existing RCA pipeline as evidence. Optionally incorporates Slack thread context via `--slack-thread`.

Fixes #301.
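As a rough illustration of the flow summarized above, the sketch below assembles a `raw_alert`-style dict from a few signal callables and records failures in the payload instead of raising. The function name `build_payload_sketch`, its parameters, the `status` key, and the exact dict shapes are assumptions for illustration; the PR's actual helper is `build_runtime_alert_payload()`, whose signature is not shown in this description.

```python
from typing import Any, Callable, Optional


def build_payload_sketch(
    service: str,
    get_status: Callable[[], dict],
    get_logs: Callable[[], str],
    probe_health: Callable[[], dict],
    get_slack_thread: Optional[Callable[[], list]] = None,
) -> dict[str, Any]:
    # alert_source is intentionally not set (see the design notes in this PR).
    payload: dict[str, Any] = {
        "service": service,
        "investigation_origin": "remote_runtime",  # informational marker only
        "status": get_status(),                    # key name is an assumption
    }
    try:
        payload["recent_logs"] = get_logs()
    except Exception as exc:
        # Log failures become evidence text rather than aborting the run.
        payload["recent_logs"] = f"(logs unavailable: {exc})"
    try:
        payload["health_probe"] = probe_health()
    except Exception as exc:
        payload["health_probe"] = {"error": str(exc)}
    if get_slack_thread is not None:
        try:
            payload["slack_thread"] = {"messages": get_slack_thread()}
        except Exception as exc:
            payload["slack_thread"] = {"error": str(exc)}
    return payload


def failing_logs() -> str:
    raise RuntimeError("provider CLI not installed")


payload = build_payload_sketch(
    "api",
    get_status=lambda: {"state": "deployed"},
    get_logs=failing_logs,
    probe_health=lambda: {"ok": True},
)
print(payload["recent_logs"])  # (logs unavailable: provider CLI not installed)
```

Passing the signal sources as callables keeps the sketch independent of any particular provider and makes each fallback easy to exercise in a test.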
What changed
New capability: `opensre investigate --service <name>`
Gathers per-service evidence and runs the standard RCA pipeline against it:
- Resolves `<name>` against the existing named-remote registry (`load_named_remotes()`)
- Reads deployment status via `RemoteOpsProvider.status()`
- Captures recent logs via the new `fetch_logs()` method
- Probes `/health` / `/ok` via existing `poll_deployment_health()`
- Builds a `raw_alert` dict the investigation graph already knows how to consume

New capability: `--slack-thread CHANNEL/TS`
Optional flag that pulls a specific Slack thread via `conversations.replies` and includes the messages as investigation context. Requires `SLACK_BOT_TOKEN` in the environment. Graceful: Slack fetch failures are captured in the payload instead of failing the investigation.

Extensible provider interface
Added `fetch_logs(scope, *, lines) -> str` as an abstract method on `RemoteOpsProvider`. Unlike the existing `logs()` (which streams to stdout for interactive use), `fetch_logs()` captures output so it can feed the agent. Implemented for Railway; other providers follow the same pattern.

Graceful fallbacks
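- Log fetch failure: `(logs unavailable: <reason>)` in the payload
- Health probe failure: recorded in `health_probe`
- Slack fetch failure: recorded in `slack_thread.error`
- `OpenSREError` with remediation hint
- `--input` / `--input-json` / `--interactive` / `--print-template` → rejected with clear message

Acceptance criteria coverage (from #301)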
- `opensre investigate --service <name>`
- `fetch_logs`, `poll_deployment_health`, `provider.status`
- `--slack-thread CHANNEL/TS` + `SLACK_BOT_TOKEN`
- `RemoteOpsProvider` ABC extensible

Design notes
- `alert_source` is deliberately left unset in the payload. `detect_sources.py` filters core integrations (Grafana/Datadog/Honeycomb/Coralogix) by `alert_source`, so setting it to `"remote_runtime"` would silently suppress all four. The LLM in `extract_alert` may infer an `alert_source` from the log contents, which is the correct behavior.
- The runtime evidence (`service`, `recent_logs`, `health_probe`, `slack_thread`) rides on `raw_alert`. Avoids the `AgentState` TypedDict / Pydantic drift-test churn and keeps the change local.
- `investigation_origin = "remote_runtime"`: purely informational marker for future tooling; no routing impact.

Files changed
New (7):
- `app/remote/runtime_alert.py`: `build_runtime_alert_payload()` helper
- `app/remote/slack_context.py`: `parse_slack_thread_ref()` + `fetch_slack_thread()`
- `docs/remote-runtime-investigation.mdx`: user-facing documentation
- `tests/remote/test_ops_fetch_logs.py`: 6 tests
- `tests/remote/test_runtime_alert.py`: 11 tests
- `tests/remote/test_slack_context.py`: 13 tests
- `tests/cli/test_investigate_service_flag.py`: 10 tests

Modified (4):
- `app/remote/ops.py`: new `fetch_logs()` abstract + Railway impl
- `app/cli/commands/general.py`: `--service` and `--slack-thread` flags + handler
- `docs/docs.json`: nav entry under "Investigations"
- `.env.example`: `SLACK_BOT_TOKEN` documented

Test plan
- `ruff check app/ tests/`: clean
- `python -m mypy app/`: 0 issues in 356 files
- `python -m pytest tests/remote/test_ops_fetch_logs.py tests/remote/test_runtime_alert.py tests/remote/test_slack_context.py tests/cli/test_investigate_service_flag.py -v`: 40 / 40 passing
- … `TypedDict` issues on Python 3.11)
- `opensre investigate --service <deployed-railway-service>` against a real Railway deployment (local env required; reviewers with access please confirm)
- `--slack-thread CHANNEL/TS` with a real `SLACK_BOT_TOKEN` + `channels:history` scope

Follow-ups
- `fetch_logs()` implementations for other providers (EC2, ECS, Vercel); currently Railway-only.
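For anyone picking up that follow-up, here is a hedged sketch of what a capturing `fetch_logs()` might look like for a command-backed provider. `RemoteOpsProviderSketch` is a trimmed-down stand-in for the real ABC, and `CommandLogProvider` with its `_command()` helper is purely hypothetical; the placeholder command just prints numbered lines so the example runs anywhere, whereas a real provider would call its own CLI or API.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class RemoteOpsProviderSketch(ABC):
    """Hypothetical, trimmed-down stand-in for RemoteOpsProvider."""

    @abstractmethod
    def fetch_logs(self, scope: str, *, lines: int) -> str:
        """Return recent log text instead of streaming it to stdout."""


class CommandLogProvider(RemoteOpsProviderSketch):
    """Captures logs by running a command; the command builder is an assumption."""

    def _command(self, scope: str, lines: int) -> list[str]:
        # Placeholder command that emits numbered lines via the current
        # Python interpreter. Swap in the provider's own CLI invocation.
        script = f"for i in range({lines}): print('{scope} line', i)"
        return [sys.executable, "-c", script]

    def fetch_logs(self, scope: str, *, lines: int) -> str:
        result = subprocess.run(
            self._command(scope, lines),
            capture_output=True,  # capture stdout and stderr, do not stream
            text=True,
            timeout=30,
            check=False,
        )
        if result.returncode != 0:
            # Keep stderr in the error so the caller can record the reason.
            raise RuntimeError(result.stderr.strip() or "log command failed")
        return result.stdout


logs = CommandLogProvider().fetch_logs("api", lines=3)
print(logs.splitlines())  # ['api line 0', 'api line 1', 'api line 2']
```

Surfacing stderr in the raised error is a deliberate choice here, since the caller can then fold the reason into a `(logs unavailable: <reason>)` fallback.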