kvserver: sustained constraint non-conformance for MR schemas #108127
Description
Originally posted by @dikshant in #106128 (comment)
Here is a debug zip, see below for repro steps.
https://drive.google.com/file/d/1Ilkl1vWS8CpyuNDku93dC9XLs0U9aI4k/view?usp=sharing
I tried this on v23.1.7 on an 18-node multi-region cluster in roachprod.
So this is interesting. Mapping replicas to replica_localities using @j82w's fixed query shows the correct mappings:
root@localhost:26257/meetup> SELECT DISTINCT
-> split_part(unnest(replica_localities), ',', 2) replica_localities,
-> unnest(replicas) replica,
-> range_id
-> FROM [SHOW RANGE FROM TABLE product FOR ROW ('europe-west1', '2f22da46-d983-4878-8ad2-a6e6ff7e8f39')];
replica_localities | replica | range_id
-------------------------+---------+-----------
region=europe-west1 | 7 | 69
region=europe-west1 | 9 | 69
region=europe-central2 | 11 | 69
region=europe-central2 | 12 | 69
region=europe-north1 | 15 | 69
(5 rows)
However, the violating ranges are still reported, even after waiting 10+ minutes:
root@localhost:26257/meetup> SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0;
zone_id | subzone_id | type | config | report_id | violation_start | violating_ranges
----------+------------+------------------+--------------------+-----------+-------------------------------+-------------------
116 | 0 | voter_constraint | +region=us-east4:2 | 1 | 2023-08-02 23:44:58.271424+00 | 3
(1 row)
Time: 105ms total (execution 105ms / network 0ms)
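To see which object zone_id 116 refers to, the report can be joined against the zone configuration metadata. This is only a sketch: it assumes crdb_internal.zones exposes zone_id, subzone_id, target, database_name, table_name and index_name as it does on the versions I have looked at, and crdb_internal is not a stable interface.

```sql
-- Resolve each violating zone_id/subzone_id to the object it configures.
-- Assumption: crdb_internal.zones has the columns selected below.
SELECT
    z.target,
    z.database_name,
    z.table_name,
    z.index_name,
    s.type,
    s.config,
    s.violation_start,
    s.violating_ranges
FROM system.replication_constraint_stats AS s
JOIN crdb_internal.zones AS z
    ON z.zone_id = s.zone_id
   AND z.subzone_id = s.subzone_id
WHERE s.violating_ranges > 0;
```

If target comes back as the database itself, the violation is being counted against the database-level zone config rather than a specific table or partition.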
Reproduction steps:
- Create an MR cluster. I used:
roachprod create dikshant-test -n 18 --gce-zones 'us-east4-a','us-east4-a','us-east4-a','us-central1-a','us-central1-a','us-central1-a','europe-west1-b','europe-west1-b','europe-west1-b','europe-central2-b','europe-central2-b','europe-central2-b',"europe-north1-b","europe-north1-b","europe-north1-b","us-west1-a","us-west1-a","us-west1-a" && roachprod stage dikshant-test release v23.1.7 && roachprod start dikshant-test:1-18
- Apply the following DDL and DML:
https://gist.github.com/dikshant/d4d170d70e493119b7cb6306aedb7551
- Check for violating ranges after waiting for ~10 minutes:
SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0;
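Since the check depends on waiting, it may help to confirm that the report itself is fresh and not a stale row. A rough sketch, assuming system.reports_meta has the id and generated columns described in the replication reports docs and that report_id joins to reports_meta.id:

```sql
-- How often the replication reports are regenerated.
SHOW CLUSTER SETTING kv.replication_reports.interval;

-- Show when the report behind each violating row was last generated.
-- Assumption: system.reports_meta(id, generated) as described in the docs.
SELECT
    s.zone_id,
    s.config,
    s.violating_ranges,
    m.generated,
    now() - m.generated AS report_age
FROM system.replication_constraint_stats AS s
JOIN system.reports_meta AS m
    ON m.id = s.report_id
WHERE s.violating_ranges > 0;
```

A small report_age alongside an old violation_start would suggest the violation is being recomputed on every pass and is not just a leftover entry.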
It seems the violating range always has the primary region in its config. I don't know if this is expected behavior.
For example, I ran an ALTER to change the primary region:
ALTER DATABASE "meetup" SET PRIMARY REGION "us-west1";
SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0;
And got:
zone_id | subzone_id | type | config | report_id | violation_start | violating_ranges
----------+------------+------------------+--------------------+-----------+-------------------------------+-------------------
116 | 0 | voter_constraint | +region=us-west1:2 | 1 | 2023-08-03 00:11:17.833192+00 | 2
(1 row)
Whereas running:
SET alter_primary_region_super_region_override = 'on';
ALTER DATABASE "meetup" SET PRIMARY REGION "europe-west1";
Gives us (after waiting a bit):
SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0;
zone_id | subzone_id | type | config | report_id | violation_start | violating_ranges
----------+------------+------------------+------------------------+-----------+-------------------------------+-------------------
116 | 0 | voter_constraint | +region=europe-west1:2 | 1 | 2023-08-03 00:16:19.488701+00 | 8
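For comparison, the effective zone config for the database can be inspected after each ALTER to see which voter_constraints the report is evaluating. This is just a sketch of what I would check, not output from the cluster above:

```sql
-- Print the zone configuration currently applied to the database,
-- including voter_constraints and lease_preferences.
SHOW ZONE CONFIGURATION FROM DATABASE meetup;
```

The voter_constraints value shown there should match the config column of the violating row (for example +region=europe-west1:2) if the report is reading the database-level zone.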
Jira issue: CRDB-30324