Skip to content

sql: intentionally randomize the order of results when no ORDER BY is specified #24223

@rjnn

Description

@rjnn

Currently, CockroachDB queries that intuitively would scan in some natural order don't guarantee to hold to that order. This is consistent with the SQL standard, which does not require that queries return results in order unless an ORDER BY is specified. Most single node databases do return results in order, as does cockroach when the results are on a single node. This is confounded by the fact that the EXPLAIN plans for a query with an ORDER BY looks identical to a query that happens to return in that order, so developers often look at the plan to ensure it's optimized, and then neglect to add a final ORDER BY, since the plans are "identical".

Consider an example, selecting by a table's primary key column:

root@:26257/tpcc> explain select * from district order by d_w_id limit 10;
+-----------+-------+------------------+
|   Tree    | Field |   Description    |
+-----------+-------+------------------+
| limit     |       |                  |
|  └── scan |       |                  |
|           | table | district@primary |
|           | spans | ALL              |
|           | limit | 10               |
+-----------+-------+------------------+
(5 rows)

root@:26257/tpcc> explain select * from district limit 10;
+-----------+-------+------------------+
|   Tree    | Field |   Description    |
+-----------+-------+------------------+
| limit     |       |                  |
|  └── scan |       |                  |
|           | table | district@primary |
|           | spans | ALL              |
|           | limit | 10               |
+-----------+-------+------------------+
(5 rows)

The real problem is that without an explicit ORDER BY, when the table is distributed, cockroachdb optimizes results delivery by returning the first values it receives, regardless of order. Thus, users often encounter this artifact long after the fact, as it "always works" in testing on small data sizes, and only fails at scale.

One solution is to throw in a little bit of randomization when no ORDER BY is specified. Care should be taken to not do so at the cost of performance (perhaps only do this for a small fraction of rows). This is what golang does, and it helps build the intuition early that results won't be ordered unless you explictly specify that they are.

I would like to do the same in CockroachDB, unless I get strong pushback that this is an anti-feature. One useful data point is that this wrong ordering assumption has already bitten users (including some engineers who work on CockroachDB such as myself and @mjibson).

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-sql-executionRelating to SQL execution.A-testingTesting tools and infrastructureC-enhancementSolution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)S-3-productivitySevere issues that impede the productivity of CockroachDB developers.

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions