@asr2003
Contributor

@asr2003 asr2003 commented Jun 9, 2025

/claim #69
Closes #69

This PR adds a new Prequel rule CRE-2025-0080 to detect critical Redpanda startup/runtime issues that can lead to degraded performance, unavailability, or data loss. The rule targets system log patterns related to disk errors, RPC failures, Raft instability and snapshot corruption.

Based on Redpanda Official HA Recommendations

The following failure scopes are directly drawn from Redpanda’s official documentation on high availability:

https://docs.redpanda.com/current/deploy/deployment-option/self-hosted/manual/high-availability/#multi-broker-deployment

  • Broker failure - Loss of function for an individual broker or its VM
  • Rack / switch failure - Brokers in the same rack become unreachable
  • Data center failure - Brokers or hosts across a full AZ become unavailable

Additional Failure Modes Covered by This Rule

In addition to the HA docs, this rule also detects:

  • Disk full errors - Blocking new produce requests or snapshots
  • Missing or unreadable snapshots - Cold-start issues from empty kvstore or bad mounts
  • Segment CRC mismatches - Possible early corruption detection via parser.cc logs
  • Raft and transport instability - Timeout patterns in node status and append entry logs
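As a rough illustration of how log-pattern detection for the four failure modes above could work, here is a minimal Python sketch. The regular expressions are hypothetical placeholders chosen for this example; they are not the expressions used in CRE-2025-0080 and are not verbatim Redpanda log messages. The names `FAILURE_PATTERNS`, `classify_line`, and `summarize` are likewise made up for illustration.

```python
import re
from collections import Counter

# Hypothetical keyword patterns, one per failure mode listed above.
# These are illustrative assumptions, NOT the patterns in the merged rule.
FAILURE_PATTERNS = {
    "disk_full": re.compile(r"no space left on device", re.IGNORECASE),
    "snapshot_missing": re.compile(
        r"snapshot.*(missing|not found|unreadable)", re.IGNORECASE
    ),
    "segment_crc_mismatch": re.compile(r"parser\.cc.*crc", re.IGNORECASE),
    "raft_or_rpc_timeout": re.compile(
        r"(node_status|append_entries).*timed? ?out", re.IGNORECASE
    ),
}


def classify_line(line):
    """Return the name of every failure mode whose pattern matches the line."""
    return [
        name
        for name, pattern in FAILURE_PATTERNS.items()
        if pattern.search(line)
    ]


def summarize(lines):
    """Count matches per failure mode across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(classify_line(line))
    return counts
```

Running `summarize` over the lines of a log file such as the attached sample would give a per-category count, which is a convenient way to sanity-check a pattern set before encoding it in a rule.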

Why It Matters

These patterns are frequently encountered in real-world outages and cluster stalls. This CRE helps teams proactively detect:

  • Misconfigured volume mounts
  • Underprovisioned disk scenarios
  • Quorum-threatening network issues
  • CRC-level anomalies during recovery or startup

Playground Link - Check here

Sample logs: test.log

2025-06-10.03-59-15-VEED.mp4

@Harsh9485
Contributor

Hey @asr2003, your PR is very good! However, it should include a working reproduction project and a demo video. If you add these, your PR will be strong enough to be merged, I hope. 😁

@asr2003
Contributor Author

asr2003 commented Jun 9, 2025

Thanks for the reminder! I have added a link to the reproducer repository in the PR description. Due to device memory constraints, I am unable to submit a video recording: attempting to capture all reproducers would cause system instability. However, the repo includes detailed logs and reproduction instructions.

@Harsh9485
Contributor

If you’re unable to upload a demo video, it might be problematic for the reviewers. I’m happy to help—if you send me an invite to your repository, I can run it on my machine and check whether the reproduction is valid.

@Lyndon-prequel
Contributor

@asr2003 we do require a video for all submissions. Please submit.
@Harsh9485 very kind of you to provide feedback and support.

@asr2003
Contributor Author

asr2003 commented Jun 9, 2025

@Lyndon-prequel Okay. I will get help from a Prequel community member.

@asr2003
Contributor Author

asr2003 commented Jun 11, 2025 via email

@asr2003 asr2003 requested a review from tonymeehan June 11, 2025 17:30
@asr2003
Contributor Author

asr2003 commented Jun 11, 2025

Done!

@tonymeehan tonymeehan merged commit 3c4cfde into prequel-dev:main Jun 13, 2025
2 checks passed
@asr2003 asr2003 deleted the cre-redpanda branch June 13, 2025 19:29
Development

Successfully merging this pull request may close these issues.

[New Rule] Redpanda: Reproduce A High-Severity Failure & Write a Detection Rule

4 participants