[TEST] Chaos testing - Scenario 02 Kafka Broker Failure #42

Merged
oskarszoon merged 3 commits into bsv-blockchain:main from rid3thespiral:feature/chaos-test-kafka-broker-failure on Oct 23, 2025

@rid3thespiral

Contributor

Overview

Implement comprehensive chaos testing for Kafka broker failures to validate system resilience and recovery capabilities.

Changes Summary

🧪 New Test Implementation

  • test/chaos/scenario_02_kafka_broker_failure_test.go (470 lines)
    • Complete chaos test with 9 comprehensive test phases
    • Tests sync and async Kafka producers under various failure conditions
    • Validates consumer behavior during broker failures
    • Verifies message consistency and no data loss

🔧 Test Automation

  • test/chaos/run_scenario_02.sh (executable script)
    • Pre-flight checks for Kafka and toxiproxy services
    • Auto-start docker compose if services not running
    • Service connectivity verification
    • Automatic cleanup after test completion
    • Colored output for better readability

🚀 CI/CD Integration

  • .github/workflows/teranode_pr_chaostests.yaml
    • Added Scenario 02 to automated PR testing
    • Starts Kafka services alongside PostgreSQL
    • Runs both chaos test scenarios on every PR
    • Uploads test results as artifacts

🔨 Infrastructure Fixes

  • compose/docker-compose-ss.yml
    • Pin PostgreSQL to version 16 (prevents breaking upgrades)
    • Expose Kafka ports 9092 and 9093 to host
    • Update Kafka advertised listener to localhost:9092 for external access
    • Fixes PostgreSQL restart loop and Kafka connectivity issues

📚 Documentation

  • test/chaos/README.md
    • Complete Scenario 2 documentation
    • Test phases and expected results
    • Usage instructions with helper scripts
    • Updated test duration estimates
    • Marked Scenario 2 as implemented

Test Phases (9 Total)

  1. Baseline Performance - Validates normal Kafka operations
  2. Latency Injection - Injects 3000ms latency via toxiproxy
  3. Sync Producer With Latency - Tests producer behavior under latency
  4. Async Producer With Latency - Tests async producer with latency
  5. Broker Failure Injection - Simulates 100% connection drops
  6. Producer Under Failure - Validates producer error handling
  7. Consumer Under Failure - Validates consumer error handling
  8. Recovery Verification - Tests full system recovery
  9. Message Consistency - Confirms no message loss

Test Results

All tests passing (132.26 seconds)

--- PASS: TestScenario02_KafkaBrokerFailure (132.07s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Baseline_Performance (0.01s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Inject_Latency (0.00s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Producer_With_Latency (3.01s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Async_Producer_With_Latency (6.01s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Inject_Broker_Failure (0.01s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Producer_Under_Failure (91.01s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Consumer_Under_Failure (30.00s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Recovery (2.01s)
    --- PASS: TestScenario02_KafkaBrokerFailure/Message_Consistency (0.00s)
PASS

What This Tests

Latency Handling - Producers continue working under high latency
Failure Detection - System correctly detects broker failures
Error Handling - Producers and consumers fail gracefully with appropriate errors
Recovery - Full system recovery after broker restoration
Data Integrity - No message loss during chaos events

CI Workflow Integration

The GitHub Actions workflow now runs both chaos test scenarios on every PR:

  1. Starts Services

    • PostgreSQL + toxiproxy-postgres
    • Kafka + toxiproxy-kafka
  2. Runs Tests

    • Scenario 01: Database Latency (~45s)
    • Scenario 02: Kafka Broker Failure (~132s) ⭐ NEW
  3. Uploads Results

    • Test artifacts available for download
    • Both scenario results captured

Local Testing

# Run Scenario 2 only
./test/chaos/run_scenario_02.sh

# Run all chaos tests
go test -v ./test/chaos/...

# Run both scenarios
./test/chaos/run_scenario_01.sh
./test/chaos/run_scenario_02.sh

Prerequisites

For local testing, you need:

  • Docker Compose running with the toxiproxy services
  • Kafka accessible on localhost:9092 (direct) and localhost:19092 (via toxiproxy)
  • Toxiproxy API on localhost:8475

The helper script auto-starts the services if they are not running.

Files Changed

  • ✅ .github/workflows/teranode_pr_chaostests.yaml - CI integration
  • ✅ compose/docker-compose-ss.yml - Infrastructure fixes
  • ✅ test/chaos/README.md - Updated documentation
  • ✅ test/chaos/run_scenario_02.sh - Test runner script
  • ✅ test/chaos/scenario_02_kafka_broker_failure_test.go - Test implementation

Related

  • Complements PR #39: Chaos testing - Scenario 01 Database Latency
  • Part of the chaos testing suite expansion
  • Addresses infrastructure issues found during testing

rid3thespiral and others added 3 commits October 23, 2025 08:25
…in#40)

Implement comprehensive chaos testing for Kafka broker failures to validate
system resilience and recovery capabilities.

## Changes

### New Test Implementation
- **test/chaos/scenario_02_kafka_broker_failure_test.go**: Complete chaos test
  for Kafka broker failure scenarios with 9 test phases:
  1. Baseline performance validation
  2. Latency injection (3s via toxiproxy)
  3. Sync producer behavior under latency
  4. Async producer behavior under latency
  5. Complete broker failure simulation (100% connection drops)
  6. Producer failure handling validation
  7. Consumer failure handling validation
  8. System recovery verification
  9. Message consistency validation

### Test Automation
- **test/chaos/run_scenario_02.sh**: Automated test runner with:
  - Pre-flight checks for Kafka and toxiproxy services
  - Auto-start docker compose if needed
  - Service connectivity verification
  - Automatic cleanup after test completion
  - Colored output for better readability

### Infrastructure Fixes
- **compose/docker-compose-ss.yml**:
  - Pin PostgreSQL to version 16 (prevent breaking upgrades)
  - Expose Kafka ports 9092 and 9093 to host
  - Update Kafka advertise listener to localhost for external access
  - Fixes PostgreSQL restart loop and Kafka connectivity issues

### Documentation
- **test/chaos/README.md**: Updated with:
  - Complete Scenario 2 documentation
  - Test phases and expected results
  - Usage instructions with helper scripts
  - Updated test duration estimates
  - Marked Scenario 2 as implemented

## Test Results

All tests passing (132.26 seconds):
- ✅ Baseline Performance
- ✅ Latency Injection
- ✅ Producer With Latency (sync and async)
- ✅ Broker Failure Injection
- ✅ Producer/Consumer Under Failure
- ✅ Recovery Verification
- ✅ Message Consistency

## Testing

```bash
# Run Scenario 2 only
./test/chaos/run_scenario_02.sh

# Run all chaos tests
go test -v ./test/chaos/...
```

## Prerequisites

- Docker compose with toxiproxy-kafka running
- Kafka accessible on localhost:9092 (direct) and localhost:19092 (via toxiproxy)
- Toxiproxy API on localhost:8475

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Update GitHub Actions workflow to include Kafka broker failure chaos testing:

- Start kafka-shared and toxiproxy-kafka services alongside PostgreSQL services
- Add Scenario 02 test execution step with 5-minute timeout
- Update artifact upload to capture both scenario results
- Add additional wait time for Kafka to be fully ready

Both chaos test scenarios now run automatically on pull requests:
- Scenario 01: Database Latency
- Scenario 02: Kafka Broker Failure

Test results are uploaded as artifacts for review.

Apply gci formatting to align constant declarations properly.
No functional changes, only code formatting improvements.

@github-actions

github-actions Bot commented Oct 23, 2025

Contributor

🤖 Claude Code Review

Status: Complete


Summary:
This PR implements comprehensive Kafka broker failure chaos testing. The implementation follows established patterns from Scenario 01 and includes proper test phases, error handling, and CI integration.

Issues Found:

  1. Minor: Unused constant testConsumerGroup at line 43 - declared but never used in the test

Strengths:

  • Well-structured test with clear phases (baseline, latency injection, failure simulation, recovery)
  • Proper cleanup with deferred toxiproxy reset
  • Good coverage of sync and async producer behaviors
  • CI integration follows existing workflow patterns
  • Infrastructure fixes (PostgreSQL pinning, Kafka port exposure) address real issues
  • Consistent with existing chaos test patterns from Scenario 01

Notes:

  • The advertised listener change in docker-compose (kafka-shared to localhost:9092) enables external host access but may affect other scenarios. Ensure this works for all use cases.
  • Test duration (~132s) is reasonable for chaos testing and properly documented in README.

}
defer producer.Close()

// Try to send message - should fail

Contributor

The testConsumerGroup constant is declared but never used in the test. Consider removing it or implementing consumer group tests if they were intended to be part of this chaos test scenario.

@oskarszoon oskarszoon self-requested a review October 23, 2025 11:58
@oskarszoon oskarszoon merged commit 7317ca1 into bsv-blockchain:main Oct 23, 2025
8 checks passed
torrejonv pushed a commit to torrejonv/teranode that referenced this pull request Oct 26, 2025
oskarszoon added a commit to oskarszoon/teranode that referenced this pull request Jun 2, 2026
Address the open CodeQL code-scanning alerts:

- httpimpl: bound-check int->uint32 conversions before casting in
  GetBlocks (offset) and GetNearestForkHeights (range) to prevent
  truncation/wrap-around (go/incorrect-integer-conversion, bsv-blockchain#90 bsv-blockchain#91 bsv-blockchain#117)
- daemon: change formatBytes to take uint64, removing the unchecked
  int64(limit) narrowing of the cgroup memory limit (bsv-blockchain#113)
- dashboard p2pStore: reject __proto__/constructor/prototype peer_id
  keys before any plain-object write to prevent prototype pollution
  from WebSocket data (bsv-blockchain#79 bsv-blockchain#81 bsv-blockchain#83)
- dashboard urlUtils: remove the dead sanitizeUrl function entirely
  (no callers, no tests); its broken script-tag regex was the source
  of the alerts (js/bad-tag-filter, js/incomplete-multi-character-sanitization, bsv-blockchain#2 bsv-blockchain#4)
- centrifuge client: use textContent instead of innerHTML in drawText
  to fix DOM XSS from server-controlled push data (bsv-blockchain#1)
- grpc_helper: document that SecurityLevel 1 leaves the server cert
  unverified (MITM); behaviour unchanged, bsv-blockchain#42 is an intentional
  config-gated mode dismissed on GitHub