Add NATS critical upstream failures detection rules #86
This PR adds a comprehensive CRE rule for detecting high-severity failures in NATS clusters. The rule is based on the investigation and reproduction of three real-world issues affecting data availability and HA guarantees.
Risk & Impact
These are high-severity failure modes (see NATS#6890, NATS#6921, and NATS#6929).
Detection of these errors is not possible via standard health checks alone, making this CRE crucial for production observability.
Test Environment
Reproducible test setup (maintainers invited): NATS REPRO
Live CRE link: CRE Playground Link
REPRODUCED SCENARIOS:
1. health.log matching: "JetStream stream 'app > some-stream' is not current: monitor goroutine not running"
2. DeliverPolicy=LastPerSubject, AckPolicy=Explicit, and MaxAckPending=1 on nats:latest with a Go pull-based consumer, matching: "Fetch error: context deadline exceeded"
3. nats stream-check --unsynced and raftz/jsz output, matching: "OBJ_.+│.*UNSYNCED.*"
DEMO
https://asciinema.org/a/gRDKu5PNdw1kdOFXRrhvvS4CN
https://asciinema.org/a/5fVpviwvENVCTvq1vRkUO0zqT
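For reviewers who want to sanity-check the three signatures from the reproduced scenarios outside the playground, here is a minimal Python sketch. It is not the CRE rule itself. The rule names and the `classify` helper are hypothetical, and the first pattern generalizes the quoted 'app > some-stream' pair to any account and stream, which is an assumption about how the message varies.

```python
import re

# Hypothetical rule names. The patterns come from the signatures quoted in
# this PR; the first one is generalized to any account/stream (assumption).
SIGNATURES = {
    "jetstream-monitor-goroutine-stopped": re.compile(
        r"JetStream stream '.+ > .+' is not current: monitor goroutine not running"
    ),
    "pull-consumer-fetch-timeout": re.compile(
        r"Fetch error: context deadline exceeded"
    ),
    # The "│" is the box-drawing column separator seen in the table output.
    "object-store-stream-unsynced": re.compile(r"OBJ_.+│.*UNSYNCED.*"),
}


def classify(line):
    """Return the names of all signatures found in a single log line."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(line)]


if __name__ == "__main__":
    sample = (
        "JetStream stream 'app > some-stream' is not current: "
        "monitor goroutine not running"
    )
    print(classify(sample))
```

Using `search` rather than `match` lets each signature appear anywhere in the line, so timestamps or log-level prefixes do not prevent a hit.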
closes #77
/claim #77