Skip to content

ESQL: Golden tests#139598

Merged
GalLalouche merged 24 commits intoelastic:mainfrom
GalLalouche:feature/golden_tests
Jan 19, 2026
Merged

ESQL: Golden tests#139598
GalLalouche merged 24 commits intoelastic:mainfrom
GalLalouche:feature/golden_tests

Conversation

@GalLalouche
Copy link
Copy Markdown
Contributor

@GalLalouche GalLalouche commented Dec 16, 2025

This PR adds golden tests to ES|QL!

● I added a GoldenTestsReadme.md with instructions on how to add new golden tests.
● I tried to keep this (relatively) minimal for the first PR. In the future we can add more features, configurations, stages (e.g., local node_reduce), etc.
● As a proof of concept, I've re-implemented (most) of the tests in SubstitudeRoundToTests using the golden testing framework.
○ When I recently worked on these tests (#138023,#138765), the majority of my time was spent mechanically fixing the expected output in the tests.
○ This was a good use case since I needed to add more support for test nesting due to the loops in the original tests.
○ While more text is being produced when using golden tests (70k lines!), fixing the expected output is a lot simpler, since you can just bulldoze the output! In this case, eyeballing the difference is enough to be convinced that the change is correct. So the maintenance cost goes down.
○ One of the tests (testRoundToWithTimeSeriesIndices) was skipped because its output isn't actually consistent across runs. The original test only verifies a very specific subset of the AST, so it's not a good candidate for golden tests.

@GalLalouche GalLalouche added >test Issues or PRs that are addressing/adding tests Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL labels Dec 16, 2025
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@luigidellaquila luigidellaquila self-requested a review December 23, 2025 08:18
Copy link
Copy Markdown
Member

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @GalLalouche, I had a first look at the PR and it looks awesome.

The output is a bit more verbose than the old tests, but definitely clear and manageable.

I left a comment, but I'll need some time to check all the details in depth.

new EsqlFunctionRegistry(),
CsvTests.loadIndexResolution(CsvTests.testDatasets(parsedStatement)),
defaultLookupResolution(),
new EnrichResolution(),
Copy link
Copy Markdown
Member

@luigidellaquila luigidellaquila Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of reusing CSV tests datasets.

nit: The only thing that concerns me a bit here is that we are mixing resolutions from three different places (CSV indices, AnalyzerTestUtils lookup resolution and an empty EnrichResolution), making it harder to understand what data we have when we write new tests, or how to add new datasets.

Long term, it would be good to get all of them from a single place, taking as a single source the datasets defined in CsvTestsDataLoader.

This is for a follow-up PR for sure.

Copy link
Copy Markdown
Member

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two more hints to make the CI green

"mapping-default-incompatible.json",
"employees_incompatible.csv"
).noSubfields();
private static final TestDataset ALL_TYPES = new TestDataset("test", "mapping-all-types.json", "test.csv");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test.csv file doesn't seem to exist, that's the reason for the CI failures.
I tried to add an empty file with that name and the tests seem to pass, but I don't know if that was the intention (or if you had some additional data)

import org.elasticsearch.xpack.esql.session.Versioned;
import org.elasticsearch.xpack.esql.stats.SearchStats;

import java.io.File;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some APIs are forbidden in ES (eg. java.io.File), you'll have to replace them (Eg. with java.nio.file) to make the CI happy.

You can check API usage with

./gradlew :x-pack:plugin:esql:forbiddenApisTest

it will give you a full list of violations

Copy link
Copy Markdown
Member

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added another round of comments.

I'll wait for the CI to be green, but apart from that I think it's very good 🚀

* </pre>
*/
final StringBuilder treeString(StringBuilder sb, int depth, BitSet hasParentPerDepth) {
public String goldenTestToString() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It would be good to have some javadocs here, explaining the difference between goldenTestToString and goldenTestNodeString

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code doesn't exist anymore but, I added some docs to the relevant format enum.

List<Object> props = nodeProperties();
for (Object prop : props) {
if (prop instanceof NameId) {
continue;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's irrelevant, but instead of skipping, we could add a constant placeholder (eg. "$NameId$"), just to distinguish the cases where this element is present vs absent.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this to use a # prefix. This way, I'm removing them in the golden tests using regex, as suggested by @astefan, but we don't lose this information.

When a golden test fails, the test runner will create a new `.actual` file, in the same directory as
the `.expected` file. `.actual` files are git ignored by default, you can manually delete them after
inspection, or if you wish to avoid the `.actual` file creation, you can use `-Dgolden.noactual`.
If you want to automatically delete them on test success, you can use `-Dgolden.cleanactuals`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-Dgolden.cleanactuals should be -Dgolden.cleanactual

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thanks!

Copy link
Copy Markdown
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't go through all the changes, I have a big ask before I continue reviewing: don't change the core classes of ESQL for something that is inherently testing related. Externalize whatever you need from Node and all related Node classes in code that is testing related only.


@Override
protected String propertiesToString(boolean skipIfChild) {
return super.propertiesToString(false /* skipIfChild */);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's with the inline comment? This code style is inconsistent with the rest of the code in ESQL (maybe ES itself, not sure).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My way of avoiding Boolean blindness :) Any way, I refactored this code, and the new version uses an enum instead.

* Like the above, but ensures consistent tree rendering for golden tests. Overrides of this method should avoid adding information
* that can change between repeated test runs, such as object identities or pseudo-random generated names.
*/
public String goldenTestNodeString() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you modifying all these core classes of ESQL for something that is inherently testing related?
This is production code that should be transparent to whatever testing infrastructure needs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored the code so the changes to production classes isn't as bad, but some classes their production code had to change since they truncated their lists. I opted to make the old nodeString method final, to make implemented who wish to override this implementation take the new format enum, as it would ensure that future classes would be aware of the requested format, and also made sure I haven't missed any cases.

As for the named identifiers, I'm removing those via regex instead of doing in production, so that code is much shorter now!

}

@Override
public String toString() {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astefan I think here it would have been safer to toString() final to ensure all implementers are updated and pass the format to recursive call correctly, but I've avoided that for now to minimize production classes changes, as there's 28 classes to update at least!

return null;
}
if (randomBoolean() && result.contains("*") == false) {
if (randomBoolean() && result.contains("*") == false && result.startsWith("`") == false) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luigidellaquila Since I changed the name to already be wrapped with `, I had to fix this.

.collect(Collectors.toList());
}

private static String normalizeColumnName(String name) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luigidellaquila Since I added a new column with - in it, I changed this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense.
I think we never get quoted column names here so we should be OK for now, but maybe it makes sense to add an && name.startsWith("`") == false just to be sure.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

import java.nio.charset.StandardCharsets;

/** Adds -Dgolden.bulldoze to the reproduction line for golden test failures. */
public class GoldenTestReproduceInfoPrinter extends RunListener {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid the CI won't accept classes that extend RunListener directly

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually dumber than that: it thinks it's a test class because it has method that start with test 🤦. Apparently, making this a nested static class works though.

Copy link
Copy Markdown
Member

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Copy Markdown
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've went through another round with the PR. I think the idea of the golden tests is good, it needs some polishing though.
Also, I really believe that at least some of the examples provided (and their expected files) shouldn't be merged. Some of the tests are not, imho, good examples of golden testing.

cause the golden test to fail nonetheless if you are verifying that the logical plan hasn't changed.

One should feel comfortable bulldozing failed tests if one expects changes in the underlying tree.
Golden tests should be the *first* line of defense, not the only one!
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expand on this idea by encouraging to write additional unit tests to check for some specific things that cannot be verified through the plan String representation. In these optimization steps we don't always check only the string output of the resulting plan.

For example, LogicalPlanOptimizerTests:355:

assertThat(agg.aggregates().get(0).id(), equalTo(Expressions.attribute(agg.groupings().get(0)).id()));

LogicalPlanOptimizerTests:548:

        assertWarnings(
            "No limit defined, adding default limit of [1000]",
            "Line 2:28: Field 'b' shadowed by field at line 2:47",
            "Line 2:9: Field 'b' shadowed by field at line 2:47"
        );

And there are also some inline stats related tests which, due to the specific characteristics of this command, they have an additional "mocking" step to help with some more advanced checks. See LogicalPlanOptimizerTests.testDoubleInlineStatsPrunning_With_MV_Functions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +458 to +461
int maxProperties = switch (format) {
case LIMITED -> TO_STRING_MAX_PROP;
case FULL -> Integer.MAX_VALUE;
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass max width and max props values as parameters to NodeStringFormat enum values and here and down below use that enum's value directly without this switch here and the one below.

Something like this

public enum NodeStringFormat {
        /** No list truncation, no line breaks due to string width. */
        FULL(Integer.MAX_VALUE, Integer.MAX_VALUE),
        /** List truncation and line breaks due to string width applied. */
        LIMITED(TO_STRING_MAX_PROP, TO_STRING_MAX_WIDTH);

        private int toStringMaxPropsCount;
        private int toStringMaxWidth;

        NodeStringFormat(int toStringMaxPropsCount, int toStringMaxWidth) {
            this.toStringMaxPropsCount = toStringMaxPropsCount;
            this.toStringMaxWidth = toStringMaxWidth;
        }
    }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, of course! Thank you!

logger.info("Bulldozing file {}", outputPath);
return createNewOutput(outputPath, plan);
} else {
if (Files.exists(outputPath)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me what happens if the expected output file doesn't exist and in fact it should. I see in this piece of code that it's auto-generated if it doesn't exist.
Shouldn't there be the case that tells the author of the test to "generate the output file first or use this option to auto-generate it"? Or there shouldn't be a need for this and always auto-creating it is fine?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's always created if it doesn't exist. From the README:

The test runner will pick up the "orphan" test and create the required outputs (see next section)
based on the test name.
You can also specify which plans you want to verify (all plans are output by default)

Comment on lines +288 to +295
protected enum Stage {
ANALYZER,
LOGICAL,
PHYSICAL,
// There's no LOCAL_LOGICAL here since in product we use PlannerUtils.localPlan to produce the local physical plan directly from
// non-local physical plan.
LOCAL_PHYSICAL
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is called a Stage, I think it would be more appropriate to have the values named as ANALYSIS, LOGICAL_OPTIMIZATION, PHYSICAL_OPTIMIZATION, LOCAL_PHYSICAL_OPTIMIZATION.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return s.lines().map(String::strip).collect(Collectors.joining("\n"));
}

private static String extractTestName() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have experience with this low-level way of dealing with tests, but gut-feeling wise I'd say the ES code base should have something for finding out test names, test classes names etc.

I quickly searched and I found that ESTestCase has something called getTestName(). I'd look into that to see maybe reinventing the wheel here is not needed.

private <T extends QueryPlan<T>> TestResult verifyOrWrite(T plan, Stage stage) throws IOException {
var outputPath = outputPath(stage);
if (System.getProperty("golden.bulldoze") != null) {
logger.info("Bulldozing file {}", outputPath);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you maybe have an alternative to "Bulldoze", something more techy/IT-specific? I'd resonate more with a word like that. But, if you feel strongly about this one, we can keep it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just the term we used at my previous place :) Renamed to overwrite.

return;
}
if (isGoldenTest(failure)) {
printToErr(captureDelegate(failure).replace("REPRODUCE WITH:", "BULLDOZE WITH:") + " -Dgolden.bulldoze");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have used the logger everywhere if I were to write such a test class. Why the use of System.err.println?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This follows the printing semantics of ReproduceInfoPrinter, which also writes to System.err.println: https://github.com/elastic/elasticsearch/blob/main/test/framework/src/main/java/org/elasticsearch/test/junit/listeners/ReproduceInfoPrinter.java#L113

@@ -0,0 +1,1793 @@
LimitExec[1000[INTEGER],null]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I were to search and double check the test itself and the test output file, I think it would have been easier for this folder to have the name of the test itself, meaning testRoundToTransformToQueryAndTagsWithCustomizedUpperLimit no underscores anywhere, just the same name as the test. In the IDE I would simply copy the test name and then search for the file with the same name (pasted from the test class).

Also, tbh this expected file is much much harder to grasp and reason about than the actual test and imho this is not a good example of golden testing being a better alternative. The file is huge I don't humanly see how someone would scroll the entire thing and double check it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Renamed the paths to use camelCase instead of underscore_case.
  • I agree that some of the golden tests end up with a huge amount of filler, rendering them useless. I think it might be salvageable by allowing tests to specify how the output is being printed: for example, in this case most of the lines are just from the round_to buckets being printed. This test would still have merit even if it ignored that portion, but that's probably a task for another PR! I've remove all the tests that ended up in huge outputs.

@GalLalouche GalLalouche requested a review from astefan January 16, 2026 14:22
Copy link
Copy Markdown
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you, @GalLalouche!

@GalLalouche GalLalouche merged commit b35518e into elastic:main Jan 19, 2026
35 checks passed
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Jan 21, 2026
This PR adds golden tests to ES|QL!

● I added a GoldenTestsReadme.md with instructions on how to add new golden tests.
● I tried to keep this (relatively) minimal for the first PR. In the future we can add more features, configurations, stages (e.g., local node_reduce), etc.
● As a proof of concept, I've re-implemented (most) of the tests in SubstitudeRoundToTests using the golden testing framework.
○ When I recently worked on these tests (elastic#138023,elastic#138765), the majority of my time was spent mechanically fixing the expected output in the tests.
○ This was a good use case since I needed to add more support for test nesting due to the loops in the original tests.
○ While more text is being produced when using golden tests (70k lines!), fixing the expected output is a lot simpler, since you can just bulldoze the output! In this case, eyeballing the difference is enough to be convinced that the change is correct. So the maintenance cost goes down.
○ One of the tests (testRoundToWithTimeSeriesIndices) was skipped because its output isn't actually consistent across runs. The original test only verifies a very specific subset of the AST, so it's not a good candidate for golden tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants