
Slow (>50s) and failing schema change transactions #60724

@aeneasr

Description


Describe the problem

We are running a local setup of the most recent version of k3d and are deploying a single-node CockroachDB cluster (see the reproduce section).

Alongside that, we are also running a job which executes a Go program. The purpose of the Go program is to apply migrations (128 individual transactions). To achieve that, it:

  1. Creates a schema: `CREATE SCHEMA IF NOT EXISTS ...`
  2. Switches the session to that schema: `SET search_path TO ...`
  3. Executes several (around 140) transactions in sequence with SERIALIZABLE isolation and `readonly: false`. It also runs some queries outside of a transaction context.
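The steps above can be sketched as an ordered statement plan. This is a minimal illustration, not the actual migration runner; the schema name and the migration statement in `main` are hypothetical placeholders, and the real program issues these through jackc/pgx rather than printing them:

```go
package main

import "fmt"

// migrationPlan returns the ordered statements issued for one schema:
// create the schema, switch the session's search_path to it, then run
// each migration inside its own SERIALIZABLE read-write transaction.
func migrationPlan(schema string, migrations []string) []string {
	plan := []string{
		fmt.Sprintf("CREATE SCHEMA IF NOT EXISTS %s", schema),
		fmt.Sprintf("SET search_path TO %s", schema),
	}
	for _, m := range migrations {
		// The runner wraps each migration in its own transaction;
		// shown here as explicit BEGIN/COMMIT for illustration.
		plan = append(plan, "BEGIN ISOLATION LEVEL SERIALIZABLE", m, "COMMIT")
	}
	return plan
}

func main() {
	plan := migrationPlan("app_a", []string{"CREATE TABLE schema_migration (version TEXT)"})
	for _, stmt := range plan {
		fmt.Println(stmt)
	}
}
```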

These migrations run in 8 different workers, which also check whether migrations need to be applied for other search_paths.
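The eight-worker fan-out can be sketched with a `sync.WaitGroup`; the schema names and the body of `applyMigrations` below are hypothetical stand-ins for the real job:

```go
package main

import (
	"fmt"
	"sync"
)

// applyMigrations stands in for the real migration runner; here it only
// records which search_path a worker handled.
func applyMigrations(searchPath string, done chan<- string) {
	done <- searchPath
}

func main() {
	// Hypothetical schema names, one per worker.
	schemas := []string{"a", "b", "c", "d", "e", "f", "g", "h"}
	done := make(chan string, len(schemas))

	var wg sync.WaitGroup
	for _, s := range schemas {
		wg.Add(1)
		go func(schema string) {
			defer wg.Done()
			applyMigrations(schema, done)
		}(s)
	}
	wg.Wait()
	close(done)

	n := 0
	for range done {
		n++
	}
	fmt.Println("workers finished:", n)
}
```

Note that with SERIALIZABLE isolation, parallel workers touching the same tables can contend and force transaction restarts, which is one thing worth ruling out here.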

Most transactions execute fine, although we have observed that running the same code against PostgreSQL and CRDB results in:

  • 6 seconds against PostgreSQL 9.6
  • 55 seconds against CRDB 20.2.5

I believe this was already raised in #52556. Unfortunately, I cannot downgrade to confirm this, as older versions of CRDB do not support CREATE SCHEMA statements.

However, some transactions seem to time out, with execution times of > 45 seconds. And while some statements do have quite a bit of latency (e.g. SELECT FROM SCHEMA_MIGRATION),

(screenshots: per-statement latency details)

most have sub-100ms latency, which is probably still a lot, but not the main problem.

If we take a look at the Transactions table however, we can see several extremely slow INSERT and CREATE TABLE statements:

(screenshot: Transactions table showing slow INSERT and CREATE TABLE statements)

The transactions themselves can be seen here:

(screenshot: the transactions in question)

As you can see, the individual statements execute in under 100ms total, but the transaction takes much longer.

When looking at the trace (you can find it here), we can see that there are a lot of spans with:

11:13:26.010093 | .981607 | ... sql/catalog/catalogkv/catalogkv.go:148 [n1,job=634311153829183489,scExec,id=123] fetching descriptor with ID 114
11:13:26.012773 | .  2680 | ... kv/txn.go:818 [n1,job=634311153829183489,scExec,id=123] client.Txn did AutoCommit. err: <nil>

On my client (the Go service), I see two types of errors:

  1. {"audience":"application","error":{"message":"error committing or rolling back transaction: read tcp 10.42.0.57:41234-\u003e10.43.218.92:26257: i/o timeout","
  2. {"audience":"application","error":{"message":"problem checking for migration version 20191100000001000004: ERROR: relation \"schema_migration\" does not exist (SQLSTATE 42P01
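These two errors usually call for different client-side handling: CockroachDB asks clients to retry serialization failures (SQLSTATE 40001) on the same connection, while an I/O timeout means the connection, and with it all session state such as `search_path`, is gone. A minimal sketch of such a classifier, where the specific SQLSTATE code and the message-matching heuristic are assumptions to verify:

```go
package main

import (
	"fmt"
	"strings"
)

// classify reports whether an error should be retried on the same
// connection (CockroachDB's retryable serialization failure, SQLSTATE
// 40001) or requires a fresh connection whose session state (including
// search_path) must be re-established.
func classify(sqlstate, msg string) (retrySameConn, reconnect bool) {
	switch {
	case sqlstate == "40001":
		// Serialization failure: retry the whole transaction.
		return true, false
	case strings.Contains(msg, "i/o timeout"):
		// The TCP connection is dead; SET search_path is lost with it.
		return false, true
	default:
		return false, false
	}
}

func main() {
	retry, _ := classify("40001", "restart transaction")
	fmt.Println("retry on 40001:", retry)
	_, reconnect := classify("", "read tcp 10.42.0.57:41234: i/o timeout")
	fmt.Println("reconnect on i/o timeout:", reconnect)
}
```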

The second error is intriguing, because schema_migration definitely exists: the schema change shown is one that is applied only after several other schema changes have completed, and they all use this table to store their state.

One possibility for the second error is that, due to the I/O timeout, the Go SQL connection is re-established, which causes the SET search_path to be lost because we are dealing with a new session.
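If that hypothesis holds, one mitigation is to carry the search_path in the connection string instead of a one-off SET, so any re-established connection starts in the right schema. This assumes pgx forwards unrecognized URL parameters such as `search_path` to the server as runtime parameters (worth verifying for the pgx version in use); the host, database, and schema names below are hypothetical:

```go
package main

import (
	"fmt"
	"net/url"
)

// dsnWithSearchPath builds a postgres:// URL whose search_path is applied
// on every (re)connect, so a dropped-and-redialed session does not
// silently fall back to the default schema.
func dsnWithSearchPath(host, db, schema string) string {
	u := url.URL{
		Scheme: "postgres",
		Host:   host,
		Path:   "/" + db,
	}
	q := url.Values{}
	q.Set("search_path", schema)
	u.RawQuery = q.Encode()
	return u.String()
}

func main() {
	fmt.Println(dsnWithSearchPath("cockroachdb:26257", "defaultdb", "my_schema"))
	// prints: postgres://cockroachdb:26257/defaultdb?search_path=my_schema
}
```

Alternatively, schema-qualifying every table reference in the migration SQL removes the dependency on session state entirely.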

To Reproduce

I hope the above is detailed enough for an investigation.

We are using the CRDB helm chart v5.0.5 with the following values:

cockroach-values.yaml

```yaml
conf:
  single-node: true
ingress:
  enabled: true
  hosts:
    - cockroachdb.local.oryapis.localhost
statefulset:
  replicas: 1
  resources:
    limits:
      cpu: "1"
      memory: 2048Mi
    requests:
      cpu: 10m
      memory: 512Mi
storage:
  persistentVolume:
    size: 10Gi
```

Expected behavior

Transactions and schema changes should succeed.

Additional data / screenshots

See the screenshots and trace excerpt above.

Environment:

  • CockroachDB version 20.2.5
  • Server OS: Kubernetes k3d
  • Client app: Go program using the jackc/pgx driver

Additional context

Migrations are failing and cause the system to be unresponsive until they restart and eventually succeed.

Jira issue: CRDB-3149

Labels: A-schema-changes, C-bug, O-community, T-sql-foundations, X-blathers-triaged