
Slow (>50s) and failing schema change transactions #60724

@aeneasr

Description


Describe the problem

We are running a local setup of the most recent version of k3d and are deploying a single-node CockroachDB cluster (see the reproduce section).

Alongside that, we are also running a job which executes a Go program. The purpose of the Go program is to apply migrations (128 individual transactions). To achieve that, it:

  1. Creates a schema: `CREATE SCHEMA IF NOT EXISTS ...`
  2. Switches the session to that schema: `SET search_path TO ...`
  3. Executes several (around 140) transactions in sequence with SERIALIZABLE isolation and `readonly: false`. It also runs some queries outside of a transaction context.
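The steps above can be sketched as an ordered statement plan. This is a minimal illustration, not the actual migration runner; the schema name and the migration statement in `main` are hypothetical placeholders, and the real program issues these through jackc/pgx rather than printing them:

```go
package main

import "fmt"

// migrationPlan returns the ordered statements issued for one schema:
// create the schema, switch the session's search_path to it, then run
// each migration inside its own SERIALIZABLE read-write transaction.
func migrationPlan(schema string, migrations []string) []string {
	plan := []string{
		fmt.Sprintf("CREATE SCHEMA IF NOT EXISTS %s", schema),
		fmt.Sprintf("SET search_path TO %s", schema),
	}
	for _, m := range migrations {
		// The runner wraps each migration in its own transaction;
		// shown here as explicit BEGIN/COMMIT for illustration.
		plan = append(plan, "BEGIN ISOLATION LEVEL SERIALIZABLE", m, "COMMIT")
	}
	return plan
}

func main() {
	plan := migrationPlan("app_a", []string{"CREATE TABLE schema_migration (version TEXT)"})
	for _, stmt := range plan {
		fmt.Println(stmt)
	}
}
```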

These migrations run in 8 different workers, which also check whether migrations need to be applied for other search_paths.
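The eight-worker fan-out can be sketched with a `sync.WaitGroup`; the schema names and the body of `applyMigrations` below are hypothetical stand-ins for the real job:

```go
package main

import (
	"fmt"
	"sync"
)

// applyMigrations stands in for the real migration runner; here it only
// records which search_path a worker handled.
func applyMigrations(searchPath string, done chan<- string) {
	done <- searchPath
}

func main() {
	// Hypothetical schema names, one per worker.
	schemas := []string{"a", "b", "c", "d", "e", "f", "g", "h"}
	done := make(chan string, len(schemas))

	var wg sync.WaitGroup
	for _, s := range schemas {
		wg.Add(1)
		go func(schema string) {
			defer wg.Done()
			applyMigrations(schema, done)
		}(s)
	}
	wg.Wait()
	close(done)

	n := 0
	for range done {
		n++
	}
	fmt.Println("workers finished:", n)
}
```

Note that with SERIALIZABLE isolation, parallel workers touching the same tables can contend and force transaction restarts, which is one thing worth ruling out here.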

Most transactions execute fine, although we have observed that running the same code against PostgreSQL and CRDB results in:

  • 6 seconds against PostgreSQL 9.6
  • 55 seconds against CRDB 20.2.5

I believe this was already raised in #52556. Unfortunately, I cannot downgrade to confirm this, as older versions of CRDB do not support CREATE SCHEMA statements.

However, some transactions seem to time out, with execution times of > 45 seconds. And while some statements do have quite a bit of latency (e.g. SELECT FROM SCHEMA_MIGRATION),

(screenshots: per-statement latency details)

most have sub-100ms latency, which is probably still a lot, but not the main problem.

If we take a look at the Transactions table however, we can see several extremely slow INSERT and CREATE TABLE statements:

(screenshot: Transactions table showing slow INSERT and CREATE TABLE statements)

The transactions themselves can be seen here:

(screenshot: the transactions in question)

As you can see, the individual statements execute in under 100ms total, but the transaction takes much longer.

When looking at the trace (you can find it here), we can see that there are a lot of spans with:

11:13:26.010093 | .981607 | ... sql/catalog/catalogkv/catalogkv.go:148 [n1,job=634311153829183489,scExec,id=123] fetching descriptor with ID 114
11:13:26.012773 | .  2680 | ... kv/txn.go:818 [n1,job=634311153829183489,scExec,id=123] client.Txn did AutoCommit. err: <nil>

On my client (the Go service), I see two types of errors:

  1. {"audience":"application","error":{"message":"error committing or rolling back transaction: read tcp 10.42.0.57:41234-\u003e10.43.218.92:26257: i/o timeout","
  2. {"audience":"application","error":{"message":"problem checking for migration version 20191100000001000004: ERROR: relation \"schema_migration\" does not exist (SQLSTATE 42P01
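These two errors usually call for different client-side handling: CockroachDB asks clients to retry serialization failures (SQLSTATE 40001) on the same connection, while an I/O timeout means the connection, and with it all session state such as `search_path`, is gone. A minimal sketch of such a classifier, where the specific SQLSTATE code and the message-matching heuristic are assumptions to verify:

```go
package main

import (
	"fmt"
	"strings"
)

// classify reports whether an error should be retried on the same
// connection (CockroachDB's retryable serialization failure, SQLSTATE
// 40001) or requires a fresh connection whose session state (including
// search_path) must be re-established.
func classify(sqlstate, msg string) (retrySameConn, reconnect bool) {
	switch {
	case sqlstate == "40001":
		// Serialization failure: retry the whole transaction.
		return true, false
	case strings.Contains(msg, "i/o timeout"):
		// The TCP connection is dead; SET search_path is lost with it.
		return false, true
	default:
		return false, false
	}
}

func main() {
	retry, _ := classify("40001", "restart transaction")
	fmt.Println("retry on 40001:", retry)
	_, reconnect := classify("", "read tcp 10.42.0.57:41234: i/o timeout")
	fmt.Println("reconnect on i/o timeout:", reconnect)
}
```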

The second error is intriguing, because schema_migration definitely exists: the schema change shown is one that is applied only after several other schema changes have completed, and they all use this table to store their state.

One possibility for the second error is that, due to the I/O timeout, the Go SQL connection is re-established, which causes the SET search_path to be lost because we are dealing with a new session.
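If that hypothesis holds, one mitigation is to carry the search_path in the connection string instead of a one-off SET, so any re-established connection starts in the right schema. This assumes pgx forwards unrecognized URL parameters such as `search_path` to the server as runtime parameters (worth verifying for the pgx version in use); the host, database, and schema names below are hypothetical:

```go
package main

import (
	"fmt"
	"net/url"
)

// dsnWithSearchPath builds a postgres:// URL whose search_path is applied
// on every (re)connect, so a dropped-and-redialed session does not
// silently fall back to the default schema.
func dsnWithSearchPath(host, db, schema string) string {
	u := url.URL{
		Scheme: "postgres",
		Host:   host,
		Path:   "/" + db,
	}
	q := url.Values{}
	q.Set("search_path", schema)
	u.RawQuery = q.Encode()
	return u.String()
}

func main() {
	fmt.Println(dsnWithSearchPath("cockroachdb:26257", "defaultdb", "my_schema"))
	// prints: postgres://cockroachdb:26257/defaultdb?search_path=my_schema
}
```

Alternatively, schema-qualifying every table reference in the migration SQL removes the dependency on session state entirely.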

To Reproduce

I hope the above is detailed enough for an investigation.

We are using the CRDB helm chart v5.0.5 with the following values:

cockroach-values.yaml

```yaml
conf:
  single-node: true
ingress:
  enabled: true
  hosts:
    - cockroachdb.local.oryapis.localhost
statefulset:
  replicas: 1
  resources:
    limits:
      cpu: "1"
      memory: 2048Mi
    requests:
      cpu: 10m
      memory: 512Mi
storage:
  persistentVolume:
    size: 10Gi
```

Expected behavior

Transactions and schema changes should succeed.

Additional data / screenshots

See the screenshots and trace excerpt above.

Environment:

  • CockroachDB version 20.2.5
  • Server OS: Kubernetes k3d
  • Client app: Go program using the jackc/pgx driver

Additional context

Migrations are failing and cause the system to be unresponsive until they restart and eventually succeed.

Jira issue: CRDB-3149

Labels: A-schema-changes, C-bug, O-community, T-sql-foundations, X-blathers-triaged