Add NATS critical upstream failures detection rules #86

varshith257 · 2025-06-15T23:49:39Z

This PR adds a CRE rule that detects high-severity failures in NATS clusters. The rule is based on investigating and reproducing three real-world issues that affect data availability and HA guarantees.

Risk & Impact

These are high-severity failure modes that can:

Cause data unavailability (NATS#6890, NATS#6921)
Result in silent data loss across nodes (NATS#6929)

Standard health checks alone cannot detect these errors, which makes this CRE important for production observability.

Test Environment

Reproducible test setup (maintainers invited): NATS REPRO
Live CRE link: CRE Playground Link

REPRODUCED SCENARIOS:

Stream leader failure after node restart due to missing monitor goroutine

Reproduced via custom Go server cluster with JetStream enabled and stream auto-creation
Detected via health.log matching:
"JetStream stream 'app > some-stream' is not current: monitor goroutine not running"
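A minimal sketch of how the health.log match above could be checked offline. It is not the rule shipped in this PR: the function name `find_stalled_streams` and the choice to capture the account and stream names (instead of hard-coding `app > some-stream`) are assumptions made for illustration.

```python
import re

# Pattern generalised from the quoted health.log line. Capturing the account
# and stream names is an illustrative assumption, not part of the CRE rule.
MONITOR_NOT_RUNNING = re.compile(
    r"JetStream stream '(?P<account>[^']+?) > (?P<stream>[^']+)' is not current: "
    r"monitor goroutine not running"
)


def find_stalled_streams(lines):
    """Return (account, stream) pairs for every matching health-check line."""
    hits = []
    for line in lines:
        match = MONITOR_NOT_RUNNING.search(line)
        if match:
            hits.append((match.group("account"), match.group("stream")))
    return hits


sample = [
    "JetStream stream 'app > some-stream' is not current: monitor goroutine not running",
    "healthz ok",  # hypothetical healthy line, only here as a negative case
]
print(find_stalled_streams(sample))
```

Only the first sample line is taken from the PR description; the second is a placeholder to show that unrelated lines are ignored.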

Consumer deadlock when using DeliverPolicy=LastPerSubject, AckPolicy=Explicit, and MaxAckPending=1

Reproduced using nats:latest with a Go pull-based consumer
Detected via fetch failure logs:
"Fetch error: context deadline exceeded"
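A single fetch timeout can also occur on an idle stream, so one way to reduce noise is to look for a run of consecutive timeouts. The sketch below assumes that approach; the threshold of 3 and both function names are hypothetical and are not taken from the rule in this PR.

```python
# Log text quoted in the PR description.
FETCH_TIMEOUT = "Fetch error: context deadline exceeded"


def longest_timeout_run(lines):
    """Length of the longest run of consecutive lines containing the timeout."""
    longest = current = 0
    for line in lines:
        if FETCH_TIMEOUT in line:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any other line breaks the run
    return longest


def looks_deadlocked(lines, threshold=3):
    """Flag a possible deadlock; `threshold` is an assumed value for illustration."""
    return longest_timeout_run(lines) >= threshold


stuck = [FETCH_TIMEOUT] * 4
recovering = [FETCH_TIMEOUT, "received message", FETCH_TIMEOUT]  # second entry is a placeholder
print(looks_deadlocked(stuck), looks_deadlocked(recovering))
```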

Object Store replication drift with no visible errors, but silently inconsistent data

Verified using nats stream-check --unsynced and raftz/jsz output
Detected via stream-check logs matching:
"OBJ_.+│.*UNSYNCED.*"
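The pattern above can be exercised directly with Python's `re` module. The table rows below are invented placeholders shaped only to test the regex; they are not real stream-check output, and the column layout is an assumption.

```python
import re

# Regex quoted in the PR description; "│" is the box-drawing column separator.
UNSYNCED_OBJ = re.compile(r"OBJ_.+│.*UNSYNCED.*")

# Hypothetical rows: bucket names, node names, and columns are made up.
rows = [
    "│ OBJ_bucket-a │ n1 │ UNSYNCED │",
    "│ OBJ_bucket-b │ n2 │ IN SYNC │",
    "│ KV_settings │ n3 │ UNSYNCED │",
]

# Keep only rows that name an Object Store stream AND report it as unsynced.
flagged = [row for row in rows if UNSYNCED_OBJ.search(row)]
print(flagged)
```

The second row is skipped because it is in sync, and the third because it is not an `OBJ_` stream, so only Object Store drift is reported.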

DEMO

https://asciinema.org/a/gRDKu5PNdw1kdOFXRrhvvS4CN

https://asciinema.org/a/5fVpviwvENVCTvq1vRkUO0zqT

closes #77
/claim #77

varshith257 · 2025-06-16T00:55:46Z

Note

This PR reproduces high-severity upstream failures observed in the latest NATS release. These issues are under active discussion in the official NATS GitHub repository.
Please refer to the CRE References section for details.

rules/cre-2025-0082/nats-jetstream-ha.yaml

rules/cre-2025-0082/test.log

varshith257 added 2 commits June 16, 2025 05:16

feat: add NATS Jetstream CRE

169874c

fix format

f85a27b

algora-pbc bot added the 🙋 Bounty claim label Jun 15, 2025

algora-pbc bot mentioned this pull request Jun 15, 2025

[Multiple Winners] NATS: Reproduce A High-Severity Failure & Write a Detection Rule [Submit by June 15 11:59 pm ET] #77

Closed

Merge branch 'main' into cre/NATS-Jetstream

a1dc181

tonymeehan reviewed Jun 17, 2025

rules/cre-2025-0082/nats-jetstream-ha.yaml (outdated, resolved)

varshith257 commented Jun 17, 2025

rules/cre-2025-0082/test.log (resolved)

varshith257 added 4 commits June 21, 2025 22:55

Merge branch 'main' into cre/NATS-Jetstream

da6d4d3

add negate conditions

b51e1a8

Update nats-jetstream-ha.yaml

727c9ad

Update tags.yaml

c521c5f

varshith257 requested a review from tonymeehan June 21, 2025 17:46

tonymeehan merged commit e0cf716 into prequel-dev:main Jun 25, 2025
2 checks passed

This was referenced Sep 1, 2025

Supabase (self-hosted): Reproduce High-Severity Failures from the Troubleshooting Guide & Write a CRE Rule [Submit by September 3 11:59 pm ET] #131

Closed

Redis Troubleshooting Rules #132

Closed
