[Feature]: Chaos testing project #8174

@gbartolini

Description

Currently, testing for CloudNativePG involves a series of unit tests that evaluate small components, as well as a comprehensive set of end-to-end tests that assess complete features, including responses to simulated failures. However, we lack a systematic approach to address failures that occur unexpectedly and randomly.

We want to expand the scope of our tests by introducing a full-fledged chaos testing framework that can better validate CloudNativePG's resilience, fault tolerance, and recovery mechanisms.

By adopting chaos testing, we aim to:

  • Increase confidence that our services remain functional under adverse conditions.
  • Identify weak points that traditional testing may not uncover.

This self-contained project is perfect for the LFX Mentorship Program. The mentee can become a component owner of the project.

Expected outcomes

  • Selection of a Kubernetes-native chaos testing framework (e.g., LitmusChaos or Chaos Mesh).
  • Design and automation of an initial set of chaos experiments covering common failure scenarios.
  • Integration of these experiments into CI/CD to ensure reproducible testing.
  • Collection of clear observability metrics (e.g., failover time, data consistency) to assess resilience and recovery.
  • Documentation and guidelines to help contributors create and run new chaos experiments safely.
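
As a rough illustration of the "failover time" metric mentioned above, the following minimal sketch derives the longest write outage from a series of timestamped probe results gathered during an experiment. The `Probe` type, the `longest_outage` function, and the sample data are all hypothetical and are not part of CloudNativePG or of any chaos framework. The sketch also assumes that an external prober records whether a write against the cluster succeeded at each point in time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    """One probe result (hypothetical structure, for illustration only)."""
    t: float   # seconds since the start of the experiment
    ok: bool   # True if a test write against the cluster succeeded


def longest_outage(probes: List[Probe]) -> Optional[float]:
    """Return the longest observed outage in seconds.

    An outage runs from the first failed probe in a run of failures to the
    first successful probe after it. Returns 0.0 if no probe failed, and
    None if the series ends while an outage is still in progress.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for p in sorted(probes, key=lambda p: p.t):
        if not p.ok and outage_start is None:
            outage_start = p.t              # outage begins
        elif p.ok and outage_start is not None:
            longest = max(longest, p.t - outage_start)
            outage_start = None             # service recovered
    if outage_start is not None:
        return None                         # never recovered in this window
    return longest


if __name__ == "__main__":
    # Made-up samples: writes fail from t=2 until they succeed again at t=5.
    samples = [
        Probe(0, True), Probe(1, True), Probe(2, False), Probe(3, False),
        Probe(4, False), Probe(5, True), Probe(6, True),
    ]
    print(longest_outage(samples))  # prints 3.0
```

The resolution of such a measurement is bounded by the probe interval, so an actual experiment would likely probe frequently and pair this figure with a separate data consistency check.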

Recommended Skills

  • Experience with chaos testing frameworks (preferably LitmusChaos or Chaos Mesh).
  • Familiarity with Kubernetes, PostgreSQL, and CloudNativePG.
  • Understanding of observability tools such as Prometheus or Grafana.

Additional context

This should be a separate project under the CloudNativePG organisation.

Metadata

Status

Done
