@varshith257 varshith257 commented Jun 15, 2025

This PR adds a comprehensive CRE rule for detecting high-severity failures in NATS clusters. The rule is based on investigating and reproducing three real-world issues that affect data availability and HA guarantees.

Risk & Impact

These are high-severity failure modes that can:

  • Cause data unavailability (NATS#6890, NATS#6921)
  • Result in silent data loss across nodes (NATS#6929)

Standard health checks alone cannot detect these errors, which makes this CRE crucial for production observability.

Test Environment

Reproducible test setup (maintainers invited): NATS REPRO
Live CRE link: CRE Playground Link

REPRODUCED SCENARIOS:

  1. Stream leader failure after node restart due to missing monitor goroutine
     • Reproduced via a custom Go server cluster with JetStream enabled and stream auto-creation
     • Detected via health.log matching:
       "JetStream stream 'app > some-stream' is not current: monitor goroutine not running"
  2. Consumer deadlock when using DeliverPolicy=LastPerSubject, AckPolicy=Explicit, and MaxAckPending=1
     • Reproduced using nats:latest with a Go pull-based consumer
     • Detected via fetch failure logs:
       "Fetch error: context deadline exceeded"
  3. Object Store replication drift with no visible errors, but silently inconsistent data
     • Verified using nats stream-check --unsynced and raftz/jsz output
     • Detected via stream-check logs matching:
       "OBJ_.+│.*UNSYNCED.*"
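As a rough illustration of how the three log signatures above could be checked against collected log lines, here is a minimal Python sketch. It is not the CRE rule from this PR: the rule labels, the `classify` helper, and the generalized stream name in the first pattern are assumptions made for the example. The second and third patterns are copied from the scenarios as written.

```python
import re

# Hypothetical labels mapped to the log signatures listed in the scenarios.
# The first pattern generalizes the quoted stream name ('app > some-stream');
# that generalization is an assumption, not part of the PR text.
PATTERNS = {
    "stream-monitor-goroutine-missing": re.compile(
        r"JetStream stream '.+' is not current: monitor goroutine not running"
    ),
    "consumer-fetch-timeout": re.compile(
        r"Fetch error: context deadline exceeded"
    ),
    "object-store-unsynced": re.compile(r"OBJ_.+│.*UNSYNCED.*"),
}


def classify(line):
    """Return the labels of every pattern that matches the given log line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]


if __name__ == "__main__":
    # The first two samples are the messages quoted above; the third is a
    # made-up table row shaped to fit the stream-check pattern.
    samples = [
        "JetStream stream 'app > some-stream' is not current: monitor goroutine not running",
        "Fetch error: context deadline exceeded",
        "OBJ_demo │ 3 │ UNSYNCED",
        "server is ready",
    ]
    for sample in samples:
        print(classify(sample), "<-", sample)
```

Note that "context deadline exceeded" on its own can also appear for ordinary timeouts, so a real rule would likely need to correlate it with the consumer configuration from scenario 2 rather than rely on the string alone.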

DEMO

https://asciinema.org/a/gRDKu5PNdw1kdOFXRrhvvS4CN

https://asciinema.org/a/5fVpviwvENVCTvq1vRkUO0zqT

closes #77
/claim #77

varshith257 commented Jun 16, 2025

Note

This PR reproduces high-severity upstream failures observed in the latest NATS release. These issues are actively being discussed in the official NATS GitHub repository.
Please refer to the CRE References section for details.

@varshith257 varshith257 requested a review from tonymeehan June 21, 2025 17:46
@tonymeehan tonymeehan merged commit e0cf716 into prequel-dev:main Jun 25, 2025
2 checks passed
Development

Successfully merging this pull request may close these issues.

[Multiple Winners] NATS: Reproduce A High-Severity Failure & Write a Detection Rule [Submit by June 15 11:59 pm ET]