[RFC] Dynamic Correctness Testing Framework #3220
Date: Dec 27, 2024
Status: Accepted (Implementation in progress)
Note
This document is about the context and infrastructure for dynamic testing. For information on how dynamic testing works on a per-test level today, check the Dynamic Testing Primer instead.
Overview
While the current integration testing framework is effective for development, there have been several bug reports around query correctness due to testing blind-spots. Some common sources for issues include:
- Using complex data types such as arrays, multi-typed fields, nested fields, or timestamp fields.
- Queries involving multiple indices, such as JOINs or wildcard indices.
- Interactions between our plugin and supported external interfaces such as JDBC and Spark.
- Running queries with large datasets, such as those involving large documents, high-cardinality data, and associated LIMIT queries or aggregations.
In addition to the obvious challenges with predicting these edge cases ahead of time, we have the additional issue that the SQL plugin has multiple runtime contexts:
- Different query engines such as Legacy, V2, Spark, or (soon) Catalyst.
- Several configuration options that have effects on the runtime behavior of queries (e.g. pagination mode, size or memory limits, type tolerance).
- Supporting both SQL and PPL querying.
- Index-specific settings such as shards or date formats.
The current integration testing process doesn't scale sufficiently to detect edge cases across all of these scenarios. Historically the approach has been to fix bugs as they are found, but given the project's current scale and in-progress refactoring initiatives, we need a testing suite that can keep up.
Inspired by database testing initiatives in other projects, especially SQLancer, Google's OSS-Fuzz, and the PostgreSQL BuildFarm, this RFC proposes a project to implement a distributed random testing process that can generate a large number of test scenarios to validate correct behavior. With such a suite, the plugin can be "soak tested" by running many randomized tests and reporting errors.
Glossary
- Crash Bugs: Bugs that are directly visible as erroneous behavior (e.g. error responses, crashes). Distinct from Logic Bugs.
- Fuzzing: A testing technique that involves providing invalid, unexpected, or random data as inputs to a system. This usually has a heavier emphasis on asserting the system doesn't crash, as opposed to making sure the responses are actually correct.
- Logic Bugs: Bugs where the system does not have an obviously incorrect behavior (e.g. error responses, crashes), but nonetheless returns incorrect results. Distinct from Crash Bugs.
- Property-based testing: A testing methodology that focuses on verifying properties of the system under test for a wide range of randomly-selected inputs. Further reading: What is Hypothesis? and ScalaCheck.
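As a toy illustration of the methodology (framework-free; the names run_property, limit_property, and gen_example are invented for this sketch, not part of any planned design):

```python
import random

def run_property(prop, gen, n_examples=200, seed=0):
    """Minimal property-based testing loop: generate random inputs,
    check the property for each, and return the first counterexample."""
    rng = random.Random(seed)
    for _ in range(n_examples):
        example = gen(rng)
        if not prop(example):
            return example  # counterexample found
    return None  # property held for every generated example

# Example property: applying LIMIT k to a result set never yields more
# than k rows (a trivial stand-in for a real query-level property).
def limit_property(example):
    rows, k = example
    return len(rows[:k]) <= k

def gen_example(rng):
    rows = [rng.randint(-100, 100) for _ in range(rng.randint(0, 50))]
    return rows, rng.randint(0, 10)

counterexample = run_property(limit_property, gen_example)
```

A real framework like ScalaCheck adds shrinking (minimizing the failing input) and reproducible seeds on top of this basic loop.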
Current State
Our current testing infrastructure consists of several components that serve different purposes, but none of them clearly address the gaps identified here.
- The Main Integration Testing Suite:
- Provides comprehensive coverage for current features, individually.
- Effective for catching regressions in existing functionality.
- Limited in its ability to detect edge cases or cross-feature interactions.
- Comparison Testing Suite:
- Attempts to validate query results by comparing against SQLite.
- Useful for basic SQL functionality shared between our implementation and SQLite.
- Limited by significant differences in feature sets:
- Cannot test OpenSearch-specific features (e.g. complex data types, PPL).
- May produce false positives due to intentional differences in behavior.
- These tests have not been run in a long time and currently appear to be out of commission.
We may consider reusing elements of the comparison testing framework for the new suite. In particular, both this framework and the proposed solution connect to OpenSearch via JDBC. The main concern is whether we can parallelize the workload easily.
- Mutation Testing Suite (PiTest):
- Helps validate the effectiveness of our existing test suite under code mutations.
- Identifies areas where our tests may be insufficient.
- Still constrained by the scope of our existing tests, as it doesn't generate new test cases.
- These tests also currently appear to be out of commission.
While these tools provide valuable testing capabilities, they fall short in several key areas:
- Limited ability to generate diverse, representative test data.
- Insufficient coverage of complex interactions between different SQL and PPL features.
- Difficulty in testing across multiple runtime contexts and configurations.
- Challenges in scaling to test large datasets or high-volume query scenarios.
Analysis of Domain
The most important question to ask is, "why can't we use an existing system?" SQL testing in particular has a lot of existing similar initiatives. A cursory look at these initiatives (ref: SQLancer, Sqllogictest, SQuaLity) reveals a few issues:
- These suites aren’t designed with PPL or generally non-SQL query languages in mind, so the work to migrate their code to support both languages would be significant.
- These suites generally assume that the database is not read-only: they include detailed tests for modifying database files directly (including corrupting them), as well as several tests that depend on the behavior of SQL's write functions (INSERT, UPDATE). Removing this functionality and replacing it with a system that can set up a correctly-configured OpenSearch cluster would again be similar to writing the system from scratch.
- Since our SQL implementation has several limitations that aren't shared by other systems, we would also need to remove tests that rely on these limitations not being present. This is nontrivial since many of these systems generate queries in bulk, and may implicitly rely on those same assumptions even for functionality we support.
Compared to trying to adapt these solutions, the most reliable long-term option seems to be writing a custom system. In particular, these methods are well-documented, so we can likely achieve a similar degree of effectiveness for less effort than the migration would cost. This will also give us a lot of flexibility to toggle specific features under test, such as the assortment of unimplemented features in V2 (#1889, #1718, #1487). The flexibility of supporting OpenSearch-specific semantics and options will also open routes for implementing similar testing on core OpenSearch.
Despite these limitations, all of the linked projects have extensive research behind them that can be used to guide our own implementation. Referencing them and taking what we can is likely valuable.
Goals
Looking at the limitations of our current test cases, I propose the following goals:
- Scalability: Currently, it's nontrivial to run our integration tests in parallel without multiple clusters due to cross-test interference. To run the tests at the scale necessary for these types of suites to find deep bugs, we need to ensure test isolation from the start. This ties in with the next goal:
- Cost Effectiveness: The costs of the suite should directly correlate with the amount of tests we're running. For the most part, running clusters is the largest cost, so the suite should be able to load balance parallelized tests across a small number of clusters.
- Model Real Usage: The existing suites use a small number of test indices tailored to specific features. The indices generated for the new tests should more closely model real indices.
Implementation Strategy
The goal is to create a separate project that can connect to OpenSearch SQL via a dynamically chosen connector (e.g. JDBC or REST), similar to how the comparison testing suite currently uses JDBC. For the first viable version, only the REST connector will be supported, for simplicity.
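A minimal sketch of what the REST connector could look like. The endpoint paths match the SQL plugin's public REST API (_plugins/_sql and _plugins/_ppl), but the RestAdapter class and its interface are illustrative, not the planned design; the actual HTTP POST is left to the caller:

```python
import json

class RestAdapter:
    """Hypothetical adapter that builds requests for the SQL plugin's
    REST endpoints; a sibling JDBC adapter would share the interface."""

    ENDPOINTS = {"sql": "/_plugins/_sql", "ppl": "/_plugins/_ppl"}

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def build_request(self, query, language="sql"):
        """Return (url, json_body) for a query in the given language."""
        path = self.ENDPOINTS[language]
        return self.base_url + path, json.dumps({"query": query})

adapter = RestAdapter("http://localhost:9200")
url, body = adapter.build_request("SELECT 1", language="sql")
```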
The role of the test suite as a system is shown in this System Context diagram. It will run independently of our existing per-PR CI on a specified schedule, informing developers of issues found. The suite will have some config that specifies how many batches it should run for during each invocation, similar to how Hypothesis does max_examples.

The testing framework itself needs two main components: the actual test runner, and a tool that can create clusters to test with various configs. For example, toggling cluster settings, whether to use Spark, and other cluster-level options. Making the cluster manager a separate container will help encapsulate a lot of grungy infrastructure code in a dedicated tool. That said, for the initial version, we can just have the pipeline spin up a single test cluster that the runner uses, with a single default configuration, meaning its box can be replaced by the CI pipeline for now.

One unit of work we introduce at this level is test batches, based on the throughput benchmarking below. The idea is that both creating clusters and creating indices are expensive compared to running read-only queries or incremental updates on indices, so we want to reuse created indices wherever possible (aiming for ≥ 100 tests/index). To have a high test throughput, we run tests in cluster-level batches, where each cluster-level batch is made up of several index-level batches. The exact number of tests per batch will be configurable.
- When starting a cluster-level batch, configure a cluster. The cluster will be assigned to run a fixed number of index-level batches.
- For each index-level batch, configure (1 to N) indices, and populate them with sample data.
- For each index-level batch, generate (read-only) queries for the sample data set.
- Validate query results on the data set in parallel.
- Once the index-level batch is done, delete the associated indices. Once all index-level batches are done, tear down the cluster and consider the cluster-level batch complete.
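The batch lifecycle above can be sketched as a nested loop; all class and function names here are hypothetical, and tiny in-memory stubs stand in for the real cluster manager, generators, and test suites:

```python
class FakeCluster:
    """Stub standing in for a configured OpenSearch cluster."""
    def __init__(self):
        self.indices = set()
    def delete_index(self, name):
        self.indices.discard(name)

class FakeClusterManager:
    """Stub standing in for the cluster configuration manager."""
    def create(self):
        return FakeCluster()
    def teardown(self, cluster):
        cluster.indices.clear()

def run_cluster_batch(cluster_mgr, make_indices, generate_query, run_test,
                      index_batches=4, tests_per_batch=200):
    """One cluster-level batch: several index-level batches, each reusing
    its indices for many generated queries before cleanup."""
    results = []
    cluster = cluster_mgr.create()                # cluster-level batch begins
    try:
        for _ in range(index_batches):            # index-level batches
            indices = make_indices(cluster)
            try:
                for _ in range(tests_per_batch):  # reuse indices across tests
                    query = generate_query(indices)
                    results.append(run_test(cluster, query))
            finally:
                for index in indices:             # index-level batch is done
                    cluster.delete_index(index)
    finally:
        cluster_mgr.teardown(cluster)             # cluster-level batch is done
    return results

def make_indices(cluster):
    names = ["test-idx-0", "test-idx-1"]
    cluster.indices.update(names)
    return names

results = run_cluster_batch(
    FakeClusterManager(), make_indices,
    generate_query=lambda idx: f"SELECT COUNT(*) FROM {idx[0]}",
    run_test=lambda cluster, q: True,
    index_batches=2, tests_per_batch=3)
```

The try/finally nesting mirrors the cleanup ordering described above: indices are deleted when their index-level batch ends, and the cluster is torn down only once every index-level batch has run.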
Zooming into the test runtime, we can look more at its core components. The bulk of the work for "supporting features" happens in the Query Generator and the Data Generator, the rest of the work is mostly plumbing.

For properties we can test, the following high-level strategies are available:
- Property-based testing as a testing methodology is described in the glossary. For a technology choice, ScalaCheck works well with our current JVM-heavy ecosystem as a widely-supported property-based testing framework.
- Pivoted Query Synthesis (PQS), Non-Optimizing Reference Engine Construction (NoREC), Ternary Logic Partitioning (TLP), Query Plan Guidance (QPG), and Differential Query Plans (DQP) are all described in the SQLancer Papers.
- SQuaLity also has some techniques aggregated from other database systems which we can take from.
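As a concrete example of one of these strategies, Ternary Logic Partitioning splits a predicate p into the three cases p, NOT p, and p IS NULL; the rows matched by the three partitioned queries together must reproduce the unfiltered table. A sketch over in-memory rows, with None modeling SQL NULL:

```python
rows = [{"age": 30}, {"age": None}, {"age": 12}, {"age": 75}]

def pred(row):
    """WHERE age > 18 under three-valued logic: True, False, or None (NULL)."""
    return None if row["age"] is None else row["age"] > 18

baseline   = list(rows)                              # SELECT * FROM t
part_true  = [r for r in rows if pred(r) is True]    # ... WHERE p
part_false = [r for r in rows if pred(r) is False]   # ... WHERE NOT p
part_null  = [r for r in rows if pred(r) is None]    # ... WHERE p IS NULL

# TLP property: the three partitions together must account for every
# baseline row; an engine with a logic bug in NULL handling breaks this.
assert len(part_true) + len(part_false) + len(part_null) == len(baseline)
```

In the real suite, the four queries would be sent to the cluster through an adapter and their result sets compared, rather than evaluated in-process.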
Implementation Roadmap
The goal is to get a minimum viable suite running by the end of January, which is able to be extended to support more query features along with current feature development.
Bootstrapping
- Identify specific historic bugs to work toward and identify minimal query and test data features necessary to reproduce them.
- Write a CI pipeline that can invoke the test runner after setting up a cluster configuration which reproduces the bugs.
- Write an adapter to communicate with the cluster.
- For each bug:
- Write a test suite to check for a specific property that aligns with the bug. Properties should be broad enough to apply to multiple queries as a general "correctness rule", not be bug-specific assertion errors. In particular, the more general the properties can be while still being related to correctness, the more powerful. Query Partitioning is maybe one of the stronger examples.
- Write a data generator that can create an index which reproduces the bug.
- Extend the query generator to produce queries that would reproduce the bug.
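For instance, a data generator aimed at the complex-type blind spots listed in the Overview might emit documents like the following; field names and value shapes are purely illustrative:

```python
import random

def generate_document(rng):
    """Produce one document exercising arrays, nested fields,
    multi-typed fields, and date fields."""
    return {
        # array field of variable length
        "tags": [rng.choice(["a", "b", "c"]) for _ in range(rng.randint(0, 4))],
        # nested object field
        "meta": {"owner": rng.choice(["alice", "bob"]),
                 "depth": rng.randint(0, 5)},
        # multi-typed field: int, string, or missing (null)
        "value": rng.choice([rng.randint(0, 100), str(rng.random()), None]),
        # date field in a fixed format
        "created": f"2024-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
    }

rng = random.Random(42)
docs = [generate_document(rng) for _ in range(100)]
```

Seeding the generator keeps failing datasets reproducible, which matters once a generated document triggers a bug.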
By the end of this process, all major components in the diagram above should be stubbed and ready to be incrementally added to. The fact that the suite finds known existing bugs can give confidence for identifying bugs in new features that violate the same properties.
Importantly, for bootstrapping we skip writing a full cluster configuration manager and stick to a minimal adapter and index configuration.
Extending
Once we can find known bugs, the main steps for identifying unknown bugs are similar:
- Categorize bugs that come up, since that indicates gaps in the suite. Categories may include run configuration options, languages, or data types (per the issue categories identified in the Overview).
- Add any necessary properties that would align with those categories as a test suite.
- Add any missing language features related to those categories for query generation.
- Add any missing data generation features related to those categories.
Open Questions
- How will we handle testing of queries that involve very large result sets or long-running operations?
- We can leave this for the future, likely a lot of the behavior here can be reproduced by shrinking the pagination thresholds in the OpenSearch config.
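For example, pagination paths could be triggered with small test data by shrinking the standard index.max_result_window index setting; a sketch of building such a settings payload (the adapter call that PUTs it to the index is assumed):

```python
import json

def pagination_settings(max_result_window=20):
    """Build the (path, body) for an index-settings update that lowers
    the standard index.max_result_window threshold on a test index."""
    return "/_settings", json.dumps(
        {"index": {"max_result_window": max_result_window}})

path, body = pagination_settings()
```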
- How will we handle testing of queries that involve external resources (e.g., SQL functions that call external APIs)? Spark is the primary example here.
- The configuration manager is meant to handle this, but it's not entirely clear how. CloudFormation? Terraform? We would like input from someone who has done similar things before, or who knows how we currently do integration testing for Spark.
- How will we handle testing of security features and access controls within this framework?
- As part of the cluster configuration.
- How will we prioritize which features or combinations of features to test first?
- Either features under active development, or features that have a high frequency of bug reports.
- What provisions will we make for testing query behavior under different cluster states (e.g., node failures, network partitions)?
- Leaving this open for the future, not sure if it's really practical or useful right now.
Structurizr DSL for the diagrams
workspace "OpenSearch-SQL" "Correctness Testing" {
    !identifiers hierarchical
    model {
        user = person "Developer"
        correctness = softwareSystem "Correctness Testing Framework" {
            tester = container "Test Runtime" {
                queries = component "Query Generator" {
                    description "Implements the bulk of the logic for toggling language features and building queries"
                }
                scheduler = component "Scheduler" {
                    description "Handles the async runtime for parallelizing tests, and managing batch-level configurations"
                }
                index = component "Data Generator" {
                    description "Creates sample indices and populates them with type-level data"
                }
                adapter = component "Adapters" {
                    description "Defines methods for running queries on a cluster (JDBC, REST, Async Jobs...)"
                }
                test = component "Test Suites" {
                    description "Defines correctness properties of interest (TLP, PQS, pagination semantics...)"
                }
                scheduler -> test "Runs with configured connection details, adapter, language features, and suite-scoped indices"
                scheduler -> index "Requests test indices"
                test -> queries "Requests queries with features of interest"
                test -> adapter "Executes queries"
            }
            config = container "Cluster configuration manager"
            cluster = container "OpenSearch Cluster" {
                tags "Database"
                tags "External"
            }
            bin = container "SQL Binary" {
                tags "External"
            }
            tester.adapter -> cluster "Executes queries"
            # config -> tester.scheduler "Connection details"
            tester.scheduler -> config "Requests a new cluster"
            tester.index -> cluster "Creates sample indices"
        }
        results = softwareSystem "Test Report" {
            tags "External"
        }
        ci = softwareSystem "CI Pipeline" {
            tags "External"
        }
        codebase = softwareSystem "Codebase" {
            tags "External"
        }
        user -> ci "Schedules"
        user -> results "Reads"
        user -> codebase "Patches"
        ci -> codebase "Pulls"
        ci -> correctness.tester "Runs"
        ci -> correctness.tester.scheduler "Invokes with config"
        ci -> correctness.bin "Compiles"
        correctness.tester -> correctness.config "Requests cluster for batch"
        correctness.config -> correctness.bin "Runs"
        correctness.config -> correctness.cluster "Configures"
        correctness.tester -> correctness.cluster "Queries via tests"
        correctness.tester -> results "Results"
        correctness.tester.scheduler -> results "Appends results to report"
    }
    views {
        systemContext correctness "Diagram1" {
            include *
            include user
            include codebase
            autolayout lr
        }
        container correctness "Diagram2" {
            include *
            autolayout lr
        }
        component correctness.tester "Diagram3" {
            include *
            autolayout lr
        }
        styles {
            element "Element" {
                color #ffffff
            }
            element "Person" {
                background #D00000
                shape person
            }
            element "Software System" {
                background #D00000
            }
            element "Container" {
                background #D00000
            }
            element "Component" {
                background #D00000
            }
            element "Database" {
                shape cylinder
            }
            element "External" {
                background #888888
            }
        }
    }
    configuration {
        scope softwaresystem
    }
}