kvserver: sustained constraint non-conformance for MR schemas #108127

@irfansharif

Originally posted by @dikshant in #106128 (comment)

Here is a debug zip, see below for repro steps.
https://drive.google.com/file/d/1Ilkl1vWS8CpyuNDku93dC9XLs0U9aI4k/view?usp=sharing

I tried this on v23.1.7 on an 18-node multi-region cluster in roachprod.

So this is interesting. Mapping replicas to replica_localities using @j82w's fixed query shows the correct mappings:

root@localhost:26257/meetup> SELECT DISTINCT
                          ->       split_part(unnest(replica_localities), ',', 2) replica_localities,
                          ->       unnest(replicas) replica,
                          ->       range_id
                          ->     FROM [SHOW RANGE FROM TABLE product FOR ROW ('europe-west1', '2f22da46-d983-4878-8ad2-a6e6ff7e8f39')];
    replica_localities   | replica | range_id
-------------------------+---------+-----------
  region=europe-west1    |       7 |       69
  region=europe-west1    |       9 |       69
  region=europe-central2 |      11 |       69
  region=europe-central2 |      12 |       69
  region=europe-north1   |      15 |       69
(5 rows)

However, the constraint violation is still reported, even after waiting 10+ minutes:

root@localhost:26257/meetup> SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0;
  zone_id | subzone_id |       type       |       config       | report_id |        violation_start        | violating_ranges
----------+------------+------------------+--------------------+-----------+-------------------------------+-------------------
      116 |          0 | voter_constraint | +region=us-east4:2 |         1 | 2023-08-02 23:44:58.271424+00 |                3
(1 row)

Time: 105ms total (execution 105ms / network 0ms)
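
To see which object zone_id 116 refers to, the violating rows can be joined against `crdb_internal.zones`. The query below is only a sketch following the pattern used in the replication reports documentation; it was not run against this cluster, and the choice of columns is an assumption about what is useful here.

```sql
-- Sketch: attach the zone target (database/table/index/partition) to each
-- violating row so that the zone behind zone_id 116 can be identified.
SELECT z.zone_id,
       z.subzone_id,
       z.target,
       z.database_name,
       z.table_name,
       s.type,
       s.config,
       s.violation_start,
       s.violating_ranges
FROM system.replication_constraint_stats AS s
JOIN crdb_internal.zones AS z
  ON z.zone_id = s.zone_id AND z.subzone_id = s.subzone_id
WHERE s.violating_ranges > 0;
```

If the `target` column points at the database or one of its tables, `SHOW ZONE CONFIGURATION FROM DATABASE meetup;` (or the table equivalent) can then be compared against the reported `voter_constraint`.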

Reproduction steps:

  1. Create a multi-region (MR) cluster. I used:

    roachprod create dikshant-test -n 18 --gce-zones 'us-east4-a','us-east4-a','us-east4-a','us-central1-a','us-central1-a','us-central1-a','europe-west1-b','europe-west1-b','europe-west1-b','europe-central2-b','europe-central2-b','europe-central2-b',"europe-north1-b","europe-north1-b","europe-north1-b","us-west1-a","us-west1-a","us-west1-a" &&
      roachprod stage dikshant-test release v23.1.7 &&
      roachprod start dikshant-test:1-18
    
  2. Apply the following DDL and DML:
    https://gist.github.com/dikshant/d4d170d70e493119b7cb6306aedb7551

  3. Check for violating ranges after waiting for ~10 minutes:

    SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0;
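
Because `system.replication_constraint_stats` is filled in by a periodically generated report, it may be worth confirming that the report was actually regenerated during the wait. The following is a sketch that assumes the `system.reports_meta` table and the `kv.replication_reports.interval` cluster setting are available in this version; it was not run as part of this repro.

```sql
-- When was each replication report last generated?
-- The report_id of 1 shown in the constraint stats output is assumed to
-- correspond to id = 1 in this table.
SELECT id, generated FROM system.reports_meta;

-- How often are the replication reports regenerated?
SHOW CLUSTER SETTING kv.replication_reports.interval;
```

If `generated` keeps advancing while `violation_start` stays fixed, the violation is being re-observed on every report run rather than being a stale row.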
    

It seems the violated constraint always names the primary region in its config. I don't know if this is expected behavior.

For example, I ran an ALTER to change the primary region:

ALTER DATABASE "meetup" SET PRIMARY REGION "us-west1";
SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0;

And got:

  zone_id | subzone_id |       type       |       config       | report_id |        violation_start        | violating_ranges
----------+------------+------------------+--------------------+-----------+-------------------------------+-------------------
      116 |          0 | voter_constraint | +region=us-west1:2 |         1 | 2023-08-03 00:11:17.833192+00 |                2
(1 row)

Whereas running:

SET alter_primary_region_super_region_override = 'on';
ALTER DATABASE "meetup" SET PRIMARY REGION "europe-west1";

Gives us (after waiting a bit):

SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0;
  zone_id | subzone_id |       type       |         config         | report_id |        violation_start        | violating_ranges
----------+------------+------------------+------------------------+-----------+-------------------------------+-------------------
      116 |          0 | voter_constraint | +region=europe-west1:2 |         1 | 2023-08-03 00:16:19.488701+00 |                8

Jira issue: CRDB-30324

Labels

A-kv-distribution: Relating to rebalancing and leasing.
A-kv-server: Relating to the KV-level RPC server
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-kv: KV Team
branch-release-23.1: Used to mark GA and release blockers, technical advisories, and bugs for 23.1
