
Polynomial latency growth for database isolated ("Bridge") multi-tenant models #63206

@aeneasr

Description


Describe the problem

We are running a multi-tenant system which isolates tenants on a per-database level ("Bridge model"). The documentation recommends using schemas ("Limit the number of databases you create. If you need to create multiple tables with the same name in your cluster, do so in different user-defined schemas, in the same database.").

In #60724 and our Cockroach Cloud Ticket 7926 we were told to rely on database isolation instead. Unfortunately, here too we find CockroachDB struggling with this multi-tenant model.

CockroachDB does not cope well with stressful workloads and schema changes running in parallel - this is documented, and we have observed it in production as well. The effects of this issue are:

  1. Creating new tenants (databases + tables + indices; no data) takes increasingly more time the more tenants there are.
  2. Memory consumption increases over time - all of our Cockroach Cloud cluster nodes are at 85% memory usage.
  3. Queries become extremely slow with execution times of up to 60s for simple SELECT * FROM foo; (~500 rows) statements while tenant workloads (create database, tables, indices; no data) are running.
  4. The effect is observable on both multi- and single-node.

To Reproduce

To reproduce this, we clone one of our projects (you can find the migrations here), create a single-node instance of CRDB with 4GB RAM and 2 vCPUs, and then create 1000 tenants by creating 1000 DBs and executing a series of CREATE TABLE, DROP, ALTER, ADD INDEX, ... statements for each:

git clone https://github.com/ory/kratos.git
docker rm -f kratos_test_database_cockroach || true
docker run -m 4GB --cpus=2 --name kratos_test_database_cockroach -p 3446:26257 -p 3447:8080 -d cockroachdb/cockroach:v20.2.7 start-single-node --insecure

rm migrate.log
export LOG_LEVEL=debug
export DSN=cockroach://root@127.0.0.1:3446

cd kratos
go build -tags sqlite -o kratos .

# Create 1000 databases and run the SQL schema changes for all 1000 databases
for i in {1..1000}
do
    docker exec kratos_test_database_cockroach cockroach sql --insecure --execute="CREATE DATABASE tenant_$i"
    ./kratos migrate sql --yes "$DSN/tenant_$i?sslmode=disable" >> migrate.log 2>&1
done 

I ran this workload overnight (8+ hours) and we are now at 526 databases, so the workload has not yet completed.

Expected behavior

Given that the databases are empty

root@:26257/defaultdb> use tenant_68;
SET

Time: 8ms total (execution 8ms / network 0ms)

root@:26257/tenant_68> show tables;
  schema_name |              table_name               | type  | owner | estimated_row_count
--------------+---------------------------------------+-------+-------+----------------------
  public      | continuity_containers                 | table | root  |                   0
  public      | courier_messages                      | table | root  |                   0
  public      | identities                            | table | root  |                   0
  public      | identity_credential_identifiers       | table | root  |                   0
  public      | identity_credential_types             | table | root  |                   0
  public      | identity_credentials                  | table | root  |                   0
  public      | identity_recovery_addresses           | table | root  |                   0
  public      | identity_recovery_tokens              | table | root  |                   0
  public      | identity_verifiable_addresses         | table | root  |                   0
  public      | identity_verification_tokens          | table | root  |                   0
  public      | schema_migration                      | table | root  |                 111
  public      | selfservice_errors                    | table | root  |                   0
  public      | selfservice_login_flow_methods        | table | root  |                   0
  public      | selfservice_login_flows               | table | root  |                   0
  public      | selfservice_recovery_flow_methods     | table | root  |                   0
  public      | selfservice_recovery_flows            | table | root  |                   0
  public      | selfservice_registration_flow_methods | table | root  |                   0
  public      | selfservice_registration_flows        | table | root  |                   0
  public      | selfservice_settings_flow_methods     | table | root  |                   0
  public      | selfservice_settings_flows            | table | root  |                   0
  public      | selfservice_verification_flow_methods | table | root  |                   0
  public      | selfservice_verification_flows        | table | root  |                   0
  public      | sessions                              | table | root  |                   0
(23 rows)

Time: 2.109s total (execution 2.109s / network 0.001s)

we would not expect such a significant increase in latency.

Additional data / screenshots

You can parse the log file and output it as CSV using

echo "migration,execution" > out.csv
grep seconds migrate.log | awk '{print NR,$3}' | sed -e 's/msg=//' -e 's/ /,/' >> out.csv
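The same extraction can be sketched in stdlib Python, which is easier to extend later. It assumes, as the pipeline above does, that each relevant log line contains `msg=<seconds> seconds`; the exact log format and the `sample` lines are assumptions for illustration:

```python
import csv
import io
import re

def log_to_csv(lines):
    """Extract per-migration execution times into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["migration", "execution"])
    n = 0
    for line in lines:
        # Matches the `msg=<seconds> seconds` fragment the shell pipeline keys on.
        m = re.search(r"msg=([\d.]+) seconds", line)
        if m:
            n += 1
            writer.writerow([n, m.group(1)])
    return out.getvalue()

# Hypothetical sample lines standing in for migrate.log contents.
sample = [
    "time=... level=debug msg=0.8378 seconds",
    "unrelated log line",
    "time=... level=debug msg=0.7610 seconds",
]
print(log_to_csv(sample))
```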

which in our case shows polynomial (2nd-degree) growth with an R^2 of almost 1:
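To make that fit reproducible without a spreadsheet, here is a stdlib-only sketch that fits a 2nd-degree polynomial via the normal equations and reports R^2. The synthetic timings below merely stand in for the real out.csv data:

```python
def quadratic_fit(xs, ys):
    """Fit y = a*x^2 + b*x + c by solving the 3x3 normal equations."""
    n = len(xs)
    s = [sum(x**k for x in xs) for k in range(5)]  # s[k] = sum of x^k
    A = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], n]]
    rhs = [sum(y * x * x for x, y in zip(xs, ys)),
           sum(y * x for x, y in zip(xs, ys)),
           sum(ys)]
    # Gaussian elimination with partial pivoting.
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        rhs[i], rhs[p] = rhs[p], rhs[i]
        for r in range(i + 1, 3):
            f = A[r][i] / A[i][i]
            for c in range(i, 3):
                A[r][c] -= f * A[i][c]
            rhs[r] -= f * rhs[i]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        coef[i] = (rhs[i] - sum(A[i][c] * coef[c] for c in range(i + 1, 3))) / A[i][i]
    return coef

def r_squared(xs, ys, coef):
    a, b, c = coef
    mean = sum(ys) / len(ys)
    ss_res = sum((y - (a * x * x + b * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Synthetic data shaped like the observed trend: per-migration time
# growing quadratically with the tenant count (coefficients are made up).
xs = list(range(1, 101))
ys = [0.002 * x * x + 0.05 * x + 1.0 for x in xs]
coef = quadratic_fit(xs, ys)
print(round(r_squared(xs, ys, coef), 4))  # ≈ 1.0 for this noise-free data
```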

[Chart: execution time vs. migration number]

All other metrics point towards the same issue - I have therefore included screenshots from all metrics dashboards:

[Screenshots of the metrics dashboards: Usage (530MB), Overview, Hardware, Runtime, SQL, Storage, Replication, Distributed, Queues, Slow Requests, Changefeeds]

Environment:

  • CockroachCloud 20.2.7
  • CockroachDB 20.2.7
  • Server OS: macOS / Docker, CockroachCloud
  • Client app: Any

Additional context

We are heavily impacted by this issue and it has been a month-long effort to somehow get multi-tenancy to work in CRDB. The current effects are:

  • Our production system, which was scheduled to launch last month, could not launch due to this issue and Slow (>50s) and failing schema change transactions #60724.
  • CockroachDB does not offer Row Level Security (see the AWS blog post), making shared-table ("Pool") multi-tenant models impossible to implement, especially for security-sensitive applications like ours.
  • On CockroachCloud ("production") we observe increasingly poor query performance (up to 60s for a simple SELECT), with memory usage capping out at around 600 databases.
  • Running a multi-node setup (which we do in Cockroach Cloud) makes it worse.
  • If the time it takes to spin up a tenant increases with the number of tenants in the system, we cannot make release announcements on e.g. Hacker News (which regularly features Ory on the front page), as the traffic spikes would completely kill the system.

This currently ruins our economic model, in which we would like to offer a "free tier" with strong usage limits that developers can use to try out the system, as a $1200 p/mo CockroachCloud cluster can only handle around 300-400 tenants before they have stored any data!

We were also planning to request a multi-region cluster setup in CockroachCloud, which we expect to be even more expensive. All of this leads us to conclude that CockroachDB is currently unable to support multi-tenant systems in an economical fashion.

While we are big fans of this technology and want to introduce it to our 30k+ deployments, especially with the new releases of https://github.com/ory/hydra, https://github.com/ory/keto, and https://github.com/ory/kratos, we are left with a rather frustrating experience of going back and forth: being told to try different things and, in the end, quickly hitting the same roadblock.

I have also tried to create a "debug zip", but unfortunately it cannot be generated due to query timeouts:

23 tables found
requesting table details for tenant_103.public.schema_migration... writing: debug/schema/tenant_103/public_schema_migration.json.err.txt
  ^- resulted in operation "requesting table details for tenant_103.public.schema_migration" timed out after 10s
requesting table details for tenant_103.public.identities... writing: debug/schema/tenant_103/public_identities.json.err.txt
  ^- resulted in operation "requesting table details for tenant_103.public.identities" timed out after 10s
requesting table details for tenant_103.public.identity_credential_types... writing: debug/schema/tenant_103/public_identity_credential_types.json.err.txt
  ^- resulted in operation "requesting table details for tenant_103.public.identity_credential_types" timed out after 10s
requesting table details for tenant_103.public.identity_credentials... writing:

...

I then killed all workloads and re-ran debug zip, but it takes painfully long to collect the table details (one database with 23 tables takes about 30 seconds), so I aborted the process.

Comparison to PostgreSQL

To compare, with PostgreSQL

docker rm -f kratos_test_database_postgres || true
docker run -m 4GB --cpus=2  --name kratos_test_database_postgres -p 3445:5432 -e POSTGRES_PASSWORD=secret -e POSTGRES_DB=postgres -d postgres:13.2 postgres

sleep 5

rm migrate-psql.log
export LOG_LEVEL=debug
export DSN=postgres://postgres:secret@127.0.0.1:3445
for i in {1..1000}
do
    docker exec kratos_test_database_postgres psql -U postgres -c "CREATE DATABASE tenant_$i"
    ./kratos migrate sql --yes "$DSN/tenant_$i?sslmode=disable" >> migrate-psql.log 2>&1
done

the container stays below 20% CPU usage

[Screenshot: container CPU usage]

and requires less than a tenth of the time to execute the statements:

grep seconds migrate-psql.log | awk '{print NR,$3}' | sed -e 's/msg=//' -e 's/ /,/'
1,0.8378
2,0.7610
3,0.7952
4,0.7446
5,0.8967
6,0.7752
7,0.8294
8,0.7808
9,0.9094

and does not show any significant change in execution time - it looks like O(1); the spikes correlate with me closing browser tabs and using Google Sheets in Safari:
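A crude way to quantify "looks like O(1)" versus the CockroachDB curve is to compare the mean of the first and last 10% of per-migration timings. The two series below are synthetic stand-ins (invented coefficients) for the measured PostgreSQL and CockroachDB data:

```python
def growth_factor(times):
    """Ratio of the mean of the last 10% of timings to the first 10%.

    A value near 1 suggests flat (O(1)) behavior; a large value
    indicates execution time grows with the migration number.
    """
    k = max(1, len(times) // 10)
    head = sum(times[:k]) / k
    tail = sum(times[-k:]) / k
    return tail / head

flat = [0.8] * 100                                  # PostgreSQL-like series
growing = [0.002 * i * i + 1 for i in range(1, 101)]  # CRDB-like series

print(round(growth_factor(flat), 2))     # → 1.0
print(growth_factor(growing) > 10)       # → True
```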

[Chart: execution time vs. migration number (PostgreSQL)]

Jira issue: CRDB-6472

Labels

C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · O-community (Originated from the community) · T-sql-foundations (SQL Foundations Team, formerly SQL Schema + SQL Sessions) · X-blathers-triaged (blathers was able to find an owner)
