[TEST] Chaos testing - Scenario 02 Kafka Broker Failure #42
Merged
oskarszoon merged 3 commits on Oct 23, 2025
…in#40)

Implement comprehensive chaos testing for Kafka broker failures to validate system resilience and recovery capabilities.

## Changes

### New Test Implementation
- **test/chaos/scenario_02_kafka_broker_failure_test.go**: Complete chaos test for Kafka broker failure scenarios with 9 test phases:
  1. Baseline performance validation
  2. Latency injection (3s via toxiproxy)
  3. Sync producer behavior under latency
  4. Async producer behavior under latency
  5. Complete broker failure simulation (100% connection drops)
  6. Producer failure handling validation
  7. Consumer failure handling validation
  8. System recovery verification
  9. Message consistency validation

### Test Automation
- **test/chaos/run_scenario_02.sh**: Automated test runner with:
  - Pre-flight checks for Kafka and toxiproxy services
  - Auto-start docker compose if needed
  - Service connectivity verification
  - Automatic cleanup after test completion
  - Colored output for better readability

### Infrastructure Fixes
- **compose/docker-compose-ss.yml**:
  - Pin PostgreSQL to version 16 (prevent breaking upgrades)
  - Expose Kafka ports 9092 and 9093 to host
  - Update Kafka advertise listener to localhost for external access
  - Fixes PostgreSQL restart loop and Kafka connectivity issues

### Documentation
- **test/chaos/README.md**: Updated with:
  - Complete Scenario 2 documentation
  - Test phases and expected results
  - Usage instructions with helper scripts
  - Updated test duration estimates
  - Marked Scenario 2 as implemented

## Test Results
All tests passing (132.26 seconds):
- ✅ Baseline Performance
- ✅ Latency Injection
- ✅ Producer With Latency (sync and async)
- ✅ Broker Failure Injection
- ✅ Producer/Consumer Under Failure
- ✅ Recovery Verification
- ✅ Message Consistency

## Testing
```bash
# Run Scenario 2 only
./test/chaos/run_scenario_02.sh

# Run all chaos tests
go test -v ./test/chaos/...
```

## Prerequisites
- Docker compose with toxiproxy-kafka running
- Kafka accessible on localhost:9092 (direct) and localhost:19092 (via toxiproxy)
- Toxiproxy API on localhost:8475

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
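The latency injection phase can be illustrated with a minimal Go sketch that talks to the Toxiproxy HTTP API directly. This is not the code from the test file: the proxy name `kafka`, the toxic name `kafka_latency`, and the helper names `latencyToxic` and `addToxic` are assumptions made for illustration. Only the API address `localhost:8475` and the 3s latency come from the description above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// toxic mirrors the JSON body Toxiproxy accepts when a toxic is created.
type toxic struct {
	Name       string         `json:"name"`
	Type       string         `json:"type"`
	Stream     string         `json:"stream"`
	Toxicity   float64        `json:"toxicity"`
	Attributes map[string]int `json:"attributes"`
}

// latencyToxic describes a toxic that delays all downstream traffic by ms
// milliseconds. The toxic name is hypothetical.
func latencyToxic(ms int) toxic {
	return toxic{
		Name:       "kafka_latency",
		Type:       "latency",
		Stream:     "downstream",
		Toxicity:   1.0, // apply to 100% of connections
		Attributes: map[string]int{"latency": ms},
	}
}

// addToxic POSTs the toxic to <apiURL>/proxies/<proxy>/toxics and treats any
// non-2xx response as an error.
func addToxic(client *http.Client, apiURL, proxy string, t toxic) error {
	body, err := json.Marshal(t)
	if err != nil {
		return fmt.Errorf("encode toxic: %w", err)
	}
	resp, err := client.Post(apiURL+"/proxies/"+proxy+"/toxics", "application/json", bytes.NewReader(body))
	if err != nil {
		return fmt.Errorf("post toxic: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("toxiproxy returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Assumed values: API on localhost:8475 (from the prerequisites) and a
	// proxy registered under the hypothetical name "kafka".
	err := addToxic(http.DefaultClient, "http://localhost:8475", "kafka", latencyToxic(3000))
	if err != nil {
		fmt.Println("could not inject latency:", err)
	}
}
```

A broker failure could be modeled the same way with a different toxic type, for example Toxiproxy's `timeout` toxic. Whether the test uses that mechanism or disables the proxy is not stated here.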
Update GitHub Actions workflow to include Kafka broker failure chaos testing:
- Start kafka-shared and toxiproxy-kafka services alongside PostgreSQL services
- Add Scenario 02 test execution step with 5-minute timeout
- Update artifact upload to capture both scenario results
- Add additional wait time for Kafka to be fully ready

Both chaos test scenarios now run automatically on pull requests:
- Scenario 01: Database Latency
- Scenario 02: Kafka Broker Failure

Test results are uploaded as artifacts for review.
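As an alternative to a fixed wait, readiness can be polled. The sketch below is a hypothetical helper, not part of the workflow or the runner script: `waitForTCP` and its timeout values are assumptions, and an open port only shows that the listener is up, not that the broker is fully ready.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForTCP polls addr until a TCP connection succeeds or the timeout
// elapses. Each dial attempt is limited to one interval.
func waitForTCP(addr string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("tcp", addr, interval)
		if err == nil {
			conn.Close()
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%s not reachable after %s: %w", addr, timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	// localhost:9092 is the direct Kafka address listed in the prerequisites;
	// the 5s timeout and 500ms interval are illustrative only.
	if err := waitForTCP("localhost:9092", 5*time.Second, 500*time.Millisecond); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("Kafka port is accepting connections")
}
```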
Apply gci formatting to align constant declarations properly. No functional changes, only code formatting improvements.
Contributor
🤖 Claude Code Review
Status: Complete
    }
    defer producer.Close()

    // Try to send message - should fail
Contributor
The testConsumerGroup constant is declared but never used in the test. Consider removing it or implementing consumer group tests if they were intended to be part of this chaos test scenario.
oskarszoon approved these changes on Oct 23, 2025
torrejonv pushed a commit to torrejonv/teranode that referenced this pull request on Oct 26, 2025
…in#42) Co-authored-by: Claude <noreply@anthropic.com>
oskarszoon added a commit to oskarszoon/teranode that referenced this pull request on Jun 2, 2026

Address the open CodeQL code-scanning alerts:
- httpimpl: bound-check int->uint32 conversions before casting in GetBlocks (offset) and GetNearestForkHeights (range) to prevent truncation/wrap-around (go/incorrect-integer-conversion, bsv-blockchain#90 bsv-blockchain#91 bsv-blockchain#117)
- daemon: change formatBytes to take uint64, removing the unchecked int64(limit) narrowing of the cgroup memory limit (bsv-blockchain#113)
- dashboard p2pStore: reject __proto__/constructor/prototype peer_id keys before any plain-object write to prevent prototype pollution from WebSocket data (bsv-blockchain#79 bsv-blockchain#81 bsv-blockchain#83)
- dashboard urlUtils: remove the dead sanitizeUrl function entirely (no callers, no tests); its broken script-tag regex was the source of the alerts (js/bad-tag-filter, js/incomplete-multi-character-sanitization, bsv-blockchain#2 bsv-blockchain#4)
- centrifuge client: use textContent instead of innerHTML in drawText to fix DOM XSS from server-controlled push data (bsv-blockchain#1)
- grpc_helper: document that SecurityLevel 1 leaves the server cert unverified (MITM); behaviour unchanged, bsv-blockchain#42 is an intentional config-gated mode dismissed on GitHub
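The bound-checked int->uint32 conversion mentioned in the first item can be sketched roughly as follows. The helper name `toUint32` and the error value are hypothetical; the actual change in httpimpl may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// errOutOfRange is a hypothetical sentinel error for this sketch.
var errOutOfRange = errors.New("value out of uint32 range")

// toUint32 converts n to uint32 only if it fits, so negative or oversized
// inputs are rejected instead of silently wrapping around.
func toUint32(n int) (uint32, error) {
	if n < 0 || uint64(n) > math.MaxUint32 {
		return 0, errOutOfRange
	}
	return uint32(n), nil
}

func main() {
	for _, n := range []int{42, -1} {
		v, err := toUint32(n)
		if err != nil {
			fmt.Printf("%d rejected: %v\n", n, err)
			continue
		}
		fmt.Printf("%d converted to %d\n", n, v)
	}
}
```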



Overview
Implement comprehensive chaos testing for Kafka broker failures to validate system resilience and recovery capabilities.
Changes Summary
🧪 New Test Implementation
🔧 Test Automation
🚀 CI/CD Integration
🔨 Infrastructure Fixes
localhost:9092 for external access
📚 Documentation
Test Phases (9 Total)
Test Results
✅ All tests passing (132.26 seconds)
--- PASS: TestScenario02_KafkaBrokerFailure (132.07s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Baseline_Performance (0.01s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Inject_Latency (0.00s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Producer_With_Latency (3.01s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Async_Producer_With_Latency (6.01s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Inject_Broker_Failure (0.01s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Producer_Under_Failure (91.01s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Consumer_Under_Failure (30.00s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Recovery (2.01s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Message_Consistency (0.00s)
PASS
What This Tests
✅ Latency Handling - Producers continue working under high latency
✅ Failure Detection - System correctly detects broker failures
✅ Error Handling - Producers and consumers fail gracefully with appropriate errors
✅ Recovery - Full system recovery after broker restoration
✅ Data Integrity - No message loss during chaos events
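One way to express the data integrity check is to compare the IDs of messages sent with those received after recovery. The sketch below assumes each message carries a unique string ID; the function name `missingMessages` and the sample IDs are hypothetical, and the actual test may verify consistency differently.

```go
package main

import (
	"fmt"
	"sort"
)

// missingMessages returns, in sorted order, every ID in sent that never
// appears in received. Duplicate deliveries in received are tolerated.
func missingMessages(sent, received []string) []string {
	seen := make(map[string]struct{}, len(received))
	for _, id := range received {
		seen[id] = struct{}{}
	}
	var missing []string
	for _, id := range sent {
		if _, ok := seen[id]; !ok {
			missing = append(missing, id)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	sent := []string{"msg-1", "msg-2", "msg-3"}
	received := []string{"msg-3", "msg-1", "msg-1"} // msg-1 delivered twice
	fmt.Println(missingMessages(sent, received))    // prints [msg-2]
}
```

An empty result means every sent message was observed at least once, which matches an at-least-once reading of "no message loss".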
CI Workflow Integration
The GitHub Actions workflow now runs both chaos test scenarios on every PR:
Starts Services
Runs Tests
Uploads Results
Local Testing