creating crdb indexes can fail on sufficiently large tables

Followup from https://github.com/oxidecomputer/omicron/issues/9866 -- see that issue for context.

With CockroachDB, if we take a table with a sufficiently large amount of data in it, and then create an index (in, e.g., a migration), it will fail with an out of memory error.

## Steps to reproduce

Run this reproduction script in Omicron (so that Omicron's copy of `cockroach` is in the PATH): https://gist.github.com/sunshowers/a5e690e8aa40df4ef00882b1e71e1f2e

This fails with:

```
[nexus1] ERROR: transaction committed but schema change aborted with error: (53200): failed to construct index entries during backfill: root: memory budget exceeded: 2437120 bytes requested, 132620137 currently allocated, 134217728 bytes in budget
[nexus1] SQLSTATE: XXA00
[nexus1] HINT: Consider increasing --max-sql-memory startup parameter.
[nexus1] --
[nexus1] Some of the non-DDL statements may have committed successfully, but some of the DDL statement(s) failed.
[nexus1] Manual inspection may be required to determine the actual state of the database.
[nexus1] --
[nexus1] See: https://github.com/cockroachdb/cockroach/issues/42061
[nexus1] Failed running "sql"
```

## What does "sufficiently large amount of data" mean?

TODO, still analyzing this. A quick summary is that for the example in the repro script above, 1.7M rows doesn't fail but 1.8M rows does.

## Possible options

### Reducing the batch size

For index backfills, [`bulkio.index_backfill.batch_size`](https://github.com/oxidecomputer/cockroach/blob/367bca413bc24e6213a45663fccd583cc726ba08/pkg/sql/backfill.go#L75-L84) controls how many rows are loaded in at once. The default value of this tunable is 50,000, and we don't alter it. Reducing the batch size sufficiently appears to get past this issue (TODO still need to verify this claim for real.)

https://github.com/cockroachdb/cockroach/pull/135563 uses a halving/backoff strategy for the batch size. But due to licensing issues we can't use it directly. If we decide to fix this in our fork of CockroachDB we'll need to reimplement this idea ourselves.

But we can also do this at the migration driver layer. For example, we can either conditionally (if the migration is a `CREATE INDEX` DDL) or unconditionally reduce the batch size to a smaller value of, say, 5000. That should get us past this issue.

### `--max-sql-memory`

The `--max-sql-memory` parameter is set on each node at startup time. By default it is 128 MiB, and we do not alter the default:

https://github.com/oxidecomputer/omicron/blob/79fac7deb9acd9272b867904da03f20076d09265/smf/cockroachdb/method_script.sh#L25-L33

Bumping this up to a larger value can help.

	args=(
	'--insecure'
	'--listen-addr' "$LISTEN_ADDR"
	'--http-addr' '127.0.0.1:8080'
	'--store' "$DATASTORE"
	'--join' "$JOIN_ADDRS"
	)

	exec /opt/oxide/cockroachdb/bin/cockroach start "${args[@]}" &

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

creating crdb indexes can fail on sufficiently large tables #9874

Steps to reproduce

What does "sufficiently large amount of data" mean?

Possible options

Reducing the batch size

`--max-sql-memory`

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

creating crdb indexes can fail on sufficiently large tables #9874

Description

Steps to reproduce

What does "sufficiently large amount of data" mean?

Possible options

Reducing the batch size

--max-sql-memory

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`--max-sql-memory`