A comprehensive platform for running controlled chaos experiments in Kubernetes clusters to improve system resilience and reliability.
"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production." - Principles of Chaos Engineering
Platform Dashboard: Central view of all chaos experiments and system health metrics
- Overview
- Key Features
- Comparison with Other Tools
- Screenshots
- Tech Stack
- Architecture
- Case Studies
- Getting Started
- Usage
- API Documentation
- Performance
- Development
- Future Enhancements
- Contributing
- FAQ
- Security
- License
- Acknowledgments
Chaos Engineering as a Platform enables organizations to systematically test their system resilience through controlled failure injection. By deliberately introducing failures in a controlled environment, teams can identify weaknesses before they manifest in production outages.
The platform provides a unified interface for defining, scheduling, executing, and monitoring chaos experiments across different infrastructure components. It helps teams build confidence in their systems' ability to withstand turbulent conditions and unexpected failures.
- DevOps Teams: Improve system reliability and reduce incidents
- SRE Teams: Validate service level objectives and error budgets
- Platform Engineers: Ensure platform resilience under various failure conditions
- Development Teams: Build more robust applications by understanding failure modes
- QA Teams: Test application behavior under adverse conditions
- Microservice Resilience Testing: Verify that your microservices can handle the failure of dependencies
- Kubernetes Upgrade Validation: Test application behavior during cluster upgrades
- Disaster Recovery Drills: Practice recovery procedures in a controlled environment
- Performance Degradation Testing: Understand system behavior under resource constraints
- CI/CD Pipeline Integration: Automatically validate resilience before deployment
- Multiple Experiment Types:
  - Pod Failures: Terminate pods to test service resilience and recovery
  - Network Delays: Introduce latency to test timeout handling and degraded performance
  - CPU Stress: Consume CPU resources to test throttling and resource limits
  - Memory Stress: Consume memory resources to test OOM handling
  - External Targets: Test third-party dependencies through their APIs
- User-Friendly Dashboard: Intuitive web interface for creating, managing, and monitoring experiments
- Safety Guardrails:
  - Blast radius limitations to prevent cascading failures
  - Automatic experiment termination based on system health metrics
  - Protected namespaces and services configuration
  - Gradual impact increase with automatic rollback
- Detailed Metrics:
  - Real-time experiment status and progress
  - System health metrics during experiments
  - Historical experiment results and trends
  - Custom metric collection for specific services
- API Integration:
  - RESTful API for integration with CI/CD pipelines
  - Webhook notifications for experiment events
  - Integration with incident management systems
  - Custom experiment extensions
- Kubernetes Native:
  - Designed to work seamlessly with Kubernetes clusters
  - Uses Kubernetes RBAC for access control
  - Leverages Kubernetes resources for experiment execution
  - Compatible with multiple Kubernetes distributions
Dashboard Overview: Real-time view of active experiments and system health
Experiment Creation: Intuitive interface for defining chaos experiments
Experiment Results: Detailed metrics and logs from completed experiments
Metrics Dashboard: Comprehensive visualization of system behavior during experiments
API Integration: Examples of integrating chaos experiments with CI/CD pipelines
| Feature | Chaos Engineering as a Platform | Chaos Mesh | Litmus Chaos | Gremlin |
|---|---|---|---|---|
| Open Source | β | β | β | β |
| Kubernetes Native | β | β | β | β |
| External API Testing | β | β | β | β |
| Safety Guardrails | β | β | ||
| Metrics Integration | β | β | β | β |
| Scheduling | β | β | β | β |
| CI/CD Integration | β | β | β | |
| User Interface | β | β | β | β |
| Custom Experiments | β | β | β |
- Backend: Go 1.19+
- API: RESTful API with JSON
- Database: PostgreSQL for persistent storage
- Caching: Redis for caching and pub/sub
- Monitoring: Prometheus and Grafana
- Frontend: Bootstrap 5, Vanilla JavaScript, Chart.js
- Container Orchestration: Kubernetes 1.19+
- CI/CD: GitHub Actions
- Documentation: Markdown and OpenAPI
The platform follows a layered microservices architecture built on Kubernetes:
- User Interface Layer:
  - Web dashboard for experiment management
  - CLI tools for automation and scripting
  - API clients for programmatic access
- API Gateway Layer:
  - Authentication and authorization
  - Rate limiting and request validation
  - Request routing and load balancing
- Core Services Layer:
  - Chaos operator for experiment execution
  - Experiment scheduler for timing control
  - Safety systems for blast radius management
  - Results processor for metrics collection
- Infrastructure Layer:
  - Kubernetes cluster for container orchestration
  - PostgreSQL database for experiment storage
  - Redis for caching and pub/sub messaging
  - Object storage for experiment artifacts
- Monitoring Layer:
  - Prometheus for metrics collection
  - Grafana for metrics visualization
  - AlertManager for notifications
  - Distributed tracing for request flow analysis
- External Integrations:
  - Cloud provider APIs
  - Notification systems (Slack, Email, PagerDuty)
  - CI/CD systems (Jenkins, GitHub Actions, GitLab CI)
Architecture Diagram: High-level overview of system components and interactions
A major e-commerce platform used Chaos Engineering as a Platform to simulate database failures during Black Friday preparation. They discovered and fixed several critical issues in their fallback mechanisms, resulting in zero downtime during their highest traffic period.
A financial services company implemented regular chaos experiments as part of their compliance requirements. The platform's detailed reporting helped them demonstrate resilience to auditors and reduce the time spent on compliance activities by 40%.
A growing SaaS startup integrated chaos experiments into their CI/CD pipeline, automatically testing new deployments against common failure scenarios. This practice helped them maintain a 99.99% uptime while deploying to production multiple times per day.
Chaos Engineering as a Platform can be installed in several ways:
- Docker Compose: Ideal for local development and testing
- Kubernetes: Recommended for production deployments
- Helm Chart: Easy deployment on Kubernetes using Helm
- Operator: Kubernetes operator for advanced deployment scenarios
Choose the installation method that best fits your environment and requirements.
- Docker and Docker Compose (for local development)
- Kubernetes cluster (for production deployment)
- Go 1.19 or later (for development)
- Clone the repository:
git clone https://github.com/yourusername/chaos-engineering-as-a-platform.git
cd chaos-engineering-as-a-platform
- Set up environment variables:
cp .env.example .env
# Edit .env file to customize your environment variables
- Start the local development environment:
docker-compose up -d
- Access the services:
- Web Dashboard: http://localhost
- API Server: http://localhost:8080
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
- Create a Kubernetes secret from your environment variables:
# Create a secret from your .env file
kubectl create secret generic chaos-platform-env --from-env-file=.env
- Create the required secrets:
kubectl apply -f deployments/kubernetes/secrets.yaml
- Deploy the database:
kubectl apply -f deployments/kubernetes/postgres.yaml
- Deploy the API server and chaos operator:
kubectl apply -f deployments/kubernetes/api-server.yaml
kubectl apply -f deployments/kubernetes/chaos-operator.yaml
- Deploy the monitoring stack:
kubectl apply -f deployments/kubernetes/monitoring.yaml
- Deploy the ingress:
kubectl apply -f deployments/kubernetes/ingress.yaml
For detailed usage instructions, refer to the User Guide.
- Create a target:
  - Navigate to the "Targets" section
  - Click "New Target"
  - Fill in the required information:
    - Name: A descriptive name for the target
    - Type: Select the target type (pod, deployment, service, etc.)
    - Namespace: The Kubernetes namespace
    - Selector: Label selector to identify the target resources
  - Click "Create Target"
- Create an experiment:
  - Navigate to the "Experiments" section
  - Click "New Experiment"
  - Fill in the required information:
    - Name: A descriptive name for the experiment
    - Type: Select the experiment type (pod-failure, network-delay, etc.)
    - Target: Select a previously created target
    - Parameters: Configure experiment-specific parameters
    - Duration: Set how long the experiment should run
    - Schedule: Optionally set a recurring schedule
  - Click "Create Experiment"
- Run the experiment:
  - Find your experiment in the list
  - Click the "Play" button
  - Monitor the results in real-time through the dashboard
  - View detailed metrics in Grafana
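The form fields described above map naturally onto two small data structures. The sketch below is one way they might be modeled in Go; the struct names, JSON field names, the `latency` parameter, and the use of Go duration strings such as `5m` are all assumptions for illustration, not the platform's documented schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Target mirrors the fields of the "New Target" form. The JSON field names
// are assumptions for illustration.
type Target struct {
	Name      string `json:"name"`
	Type      string `json:"type"` // e.g. "pod", "deployment", "service"
	Namespace string `json:"namespace"`
	Selector  string `json:"selector"` // Kubernetes label selector, e.g. "app=checkout"
}

// Experiment mirrors the fields of the "New Experiment" form.
type Experiment struct {
	Name       string            `json:"name"`
	Type       string            `json:"type"` // e.g. "pod-failure", "network-delay"
	Target     string            `json:"target"`
	Parameters map[string]string `json:"parameters,omitempty"`
	Duration   string            `json:"duration"`           // assumed to be a Go duration string
	Schedule   string            `json:"schedule,omitempty"` // optional recurring schedule
}

// Validate checks the required fields and that Duration parses to a
// positive length of time.
func (e Experiment) Validate() error {
	if e.Name == "" || e.Type == "" || e.Target == "" {
		return fmt.Errorf("name, type, and target are required")
	}
	d, err := time.ParseDuration(e.Duration)
	if err != nil {
		return fmt.Errorf("invalid duration %q: %w", e.Duration, err)
	}
	if d <= 0 {
		return fmt.Errorf("duration must be positive, got %s", d)
	}
	return nil
}

func main() {
	target := Target{Name: "checkout-pods", Type: "pod", Namespace: "shop", Selector: "app=checkout"}
	exp := Experiment{
		Name:       "checkout-latency",
		Type:       "network-delay",
		Target:     target.Name,
		Parameters: map[string]string{"latency": "200ms"}, // hypothetical parameter
		Duration:   "5m",
	}
	if err := exp.Validate(); err != nil {
		fmt.Println("invalid experiment:", err)
		return
	}
	out, err := json.MarshalIndent(exp, "", "  ")
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

Validating the definition on the client side catches a malformed duration or a missing target before the experiment reaches the cluster.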
The platform provides a comprehensive RESTful API for integration with other systems.
- POST /api/v1/experiments: Create a new experiment
- GET /api/v1/experiments: List all experiments
- GET /api/v1/experiments/{id}: Get an experiment by ID
- POST /api/v1/experiments/{id}/execute: Execute an experiment
- POST /api/v1/experiments/{id}/stop: Stop a running experiment
- PUT /api/v1/experiments/{id}: Update an experiment
- DELETE /api/v1/experiments/{id}: Delete an experiment
- GET /api/v1/experiments/{id}/results: Get experiment results
- POST /api/v1/targets: Create a new target
- GET /api/v1/targets: List all targets
- GET /api/v1/targets/{id}: Get a target by ID
- PUT /api/v1/targets/{id}: Update a target
- DELETE /api/v1/targets/{id}: Delete a target
- GET /api/v1/targets/{id}/status: Get target status
- POST /api/v1/schedules: Create a new schedule
- GET /api/v1/schedules: List all schedules
- GET /api/v1/schedules/{id}: Get a schedule by ID
- PUT /api/v1/schedules/{id}: Update a schedule
- DELETE /api/v1/schedules/{id}: Delete a schedule
For complete API documentation, see the API Reference.
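As a rough illustration of how a CI/CD job might drive the experiment endpoints, the sketch below creates an experiment and then executes it. The two paths come from the endpoint list; everything else is an assumption: the request body, the `id` field in the response, and the `201`/`202` status codes. A stub server from `net/http/httptest` stands in for the real API so the example runs without a deployment.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// createAndExecute creates an experiment and then triggers it, returning the
// experiment ID. The request and response shapes are assumptions.
func createAndExecute(baseURL string, spec map[string]string) (string, error) {
	body, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	resp, err := http.Post(baseURL+"/api/v1/experiments", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return "", fmt.Errorf("create failed: %s", resp.Status)
	}
	var created struct {
		ID string `json:"id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&created); err != nil {
		return "", err
	}

	execURL := baseURL + "/api/v1/experiments/" + created.ID + "/execute"
	execResp, err := http.Post(execURL, "application/json", nil)
	if err != nil {
		return "", err
	}
	defer execResp.Body.Close()
	if execResp.StatusCode != http.StatusAccepted {
		return "", fmt.Errorf("execute failed: %s", execResp.Status)
	}
	return created.ID, nil
}

// newStubServer stands in for the real API server so the sketch runs offline.
func newStubServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/experiments", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusCreated)
		json.NewEncoder(w).Encode(map[string]string{"id": "exp-1"})
	})
	mux.HandleFunc("/api/v1/experiments/exp-1/execute", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusAccepted)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	id, err := createAndExecute(srv.URL, map[string]string{"name": "demo", "type": "pod-failure"})
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("started experiment", id)
}
```

Against a real deployment, `srv.URL` would be replaced by the API server address, and the request would also need whatever authentication the server requires.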
Chaos Engineering as a Platform is designed to be lightweight and performant:
- Low Resource Consumption: The entire platform can run with minimal resources (0.5 CPU, 512MB RAM per component)
- Scalable Architecture: Components can be scaled independently based on your needs
- Efficient Experiment Execution: Minimal overhead when running experiments
- Fast API Response Times: Average API response time under 100ms
- Optimized Database Queries: Efficient data storage and retrieval
Benchmarks (on a standard 3-node Kubernetes cluster):
| Metric | Value |
|---|---|
| Maximum concurrent experiments | 100+ |
| API requests per second | 1000+ |
| Dashboard response time | < 200ms |
| Memory usage (idle) | ~1GB total |
| CPU usage (idle) | < 0.5 cores total |
The application uses environment variables for configuration. These are stored in a .env file which is not committed to the repository for security reasons.
- PORT: The port on which the API server runs
- ENVIRONMENT: The environment (development, staging, production)
- DATABASE_URL: PostgreSQL connection string
- KUBECONFIG: Path to Kubernetes configuration file
- KUBE_TOKEN: Kubernetes authentication token
- NAMESPACE: Kubernetes namespace
- MOCK_KUBERNETES: Whether to use a mock Kubernetes client (for development)
- PROMETHEUS_ENABLED: Whether to enable Prometheus metrics
- GF_SECURITY_ADMIN_PASSWORD: Grafana admin password
For a complete list of environment variables, see the .env.example file.
.
├── cmd/                    # Command-line applications
│   ├── api-server/         # API server entry point
│   ├── chaos-operator/     # Chaos operator entry point
│   └── cli/                # CLI tool entry point
├── deployments/            # Deployment configurations
│   ├── docker/             # Docker build files
│   ├── kubernetes/         # Kubernetes manifests
│   │   └── config/         # Kubernetes configuration files
│   ├── grafana/            # Grafana dashboards
│   └── prometheus/         # Prometheus configuration
├── docs/                   # Documentation
│   ├── api/                # API documentation
│   └── images/             # Screenshots and diagrams
├── pkg/                    # Library packages
│   ├── api/                # API server implementation
│   ├── chaos/              # Chaos experiment implementations
│   │   ├── executor/       # Experiment execution logic
│   │   └── experiments/    # Experiment type definitions
│   ├── config/             # Configuration management
│   ├── k8s/                # Kubernetes client and utilities
│   ├── monitoring/         # Metrics and monitoring
│   └── storage/            # Database storage
├── scripts/                # Utility scripts
├── tests/                  # Integration tests
├── web/                    # Web dashboard
├── .env.example            # Example environment variables
├── .gitignore              # Git ignore file
├── docker-compose.yml      # Local development setup
├── go.mod                  # Go module definition
├── LICENSE                 # MIT License
└── README.md               # This file
We use Semantic Versioning for this project:
- Major version: Incompatible API changes
- Minor version: New functionality in a backward-compatible manner
- Patch version: Backward-compatible bug fixes
Current stable version: v1.2.3
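The versioning rules above can be expressed as a small compatibility check. The sketch below parses tags of the form `vMAJOR.MINOR.PATCH` and reports whether an upgrade stays within the same major version. It is a simplified illustration: pre-release and build-metadata suffixes are not handled, and the `Version` type is not part of the project.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Version is a parsed MAJOR.MINOR.PATCH triple.
type Version struct{ Major, Minor, Patch int }

// Parse accepts tags such as "v1.2.3" or "1.2.3".
func Parse(s string) (Version, error) {
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return Version{}, fmt.Errorf("expected MAJOR.MINOR.PATCH, got %q", s)
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return Version{}, fmt.Errorf("invalid component %q in %q", p, s)
		}
		nums[i] = n
	}
	return Version{nums[0], nums[1], nums[2]}, nil
}

// Compatible reports whether moving from v to next is a backward-compatible
// upgrade: same major version, and not older than v.
func (v Version) Compatible(next Version) bool {
	if next.Major != v.Major {
		return false
	}
	if next.Minor != v.Minor {
		return next.Minor > v.Minor
	}
	return next.Patch >= v.Patch
}

func (v Version) String() string {
	return fmt.Sprintf("v%d.%d.%d", v.Major, v.Minor, v.Patch)
}

func main() {
	current, err := Parse("v1.2.3")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, tag := range []string{"v1.2.4", "v1.3.0", "v2.0.0"} {
		next, err := Parse(tag)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s -> %s compatible: %t\n", current, next, current.Compatible(next))
	}
}
```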
# Build all components
go build ./...
# Build specific components
go build ./cmd/api-server
go build ./cmd/chaos-operator
go build ./cmd/cli
# Using make
make build # Build all components
make run # Run the entire platform locally
make run-api # Run the API server locally
make run-operator # Run the chaos operator locally
make test # Run all tests
make clean # Clean build artifacts
make docker-build # Build Docker images
make docker-run # Run with Docker Compose (recommended for local development)
make k8s-deploy # Deploy to Kubernetes
make init-db # Initialize the database schema
make run-examples # Run the example code
# Run all tests
go test ./...
# Run tests with coverage
go test -cover ./...
# Using make
make test # Run all tests
make test-coverage # Run tests with coverage
- Additional experiment types (DNS failures, disk I/O stress)
- Enhanced reporting capabilities with exportable results
- Multi-cluster support for cross-cluster experiments
- Automated experiment suggestions based on system architecture
- Integration with service mesh technologies
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Please make sure your code follows the project's coding standards and includes appropriate tests.
The platform is safe to use in production with proper configuration. It includes safety guardrails that limit the blast radius of experiments and prevent cascading failures. We recommend starting in a staging environment and moving to production gradually.
Compared with Chaos Mesh and Litmus Chaos: while all three are Kubernetes-native chaos engineering tools, Chaos Engineering as a Platform aims to be a more comprehensive solution, with enhanced safety features, external API testing capabilities, and deeper metrics integration.
Custom experiments are supported: the platform is designed to be extensible. You can create custom experiment types by implementing the experiment interface and registering them with the platform.
The platform supports Kubernetes 1.19 and above. It has been tested extensively on GKE, EKS, AKS, and self-managed Kubernetes clusters.
A hosted version is not yet available, but it is on our roadmap. For now, you need to deploy and manage the platform yourself.
Chaos Engineering as a Platform takes security seriously. The platform includes several security features:
- Role-Based Access Control: Fine-grained permissions for different user roles
- Audit Logging: Comprehensive logging of all actions for accountability
- Secure API: Authentication and authorization for all API endpoints
- Protected Namespaces: Prevent experiments from affecting critical infrastructure
- Secure Defaults: Conservative default settings to prevent accidental damage
If you discover a security vulnerability within this project, please send an email to security@example.com. All security vulnerabilities will be promptly addressed.
This project is licensed under the MIT License - see the LICENSE file for details.
- Chaos Mesh - Inspiration for some of the chaos experiment implementations
- Kubernetes - The foundation for our platform
- Prometheus and Grafana - For metrics and visualization
- Litmus Chaos - For chaos engineering patterns and practices
- Gremlin - For chaos engineering concepts and methodologies
Built with ❤️ by Flack