Keeper expungement failing on a4x2 #6945

@andrewjstone

Description

Steps to reproduce

All omdb operations were performed on sled g0.

  1. Launch a4x2.
  2. Set clickhouse-policy to both via omdb.
  3. Regenerate a blueprint and make it the target.
  4. hyperstop g2 (the node with one keeper).
  5. Expunge g2 via omdb.
  6. Regenerate a couple of blueprints and set each as the target.
  7. Ensure the zones get expunged in the blueprints.

Evidence

Sled g2 can definitely no longer be reached. I see a log entry about failing to contact it in Nexus on g3. However, the keeper cluster still lists g2's keeper as a member, both in keeper-config.xml and via the clickhouse keeper-client command.

root@oxz_clickhouse_keeper_1d4c8dac:~# clickhouse keeper-client --host [fd00:1122:3344:104::21]
Connected to ZooKeeper at [fd00:1122:3344:104::21]:9181 with session_id 8
Keeper feature flag FILTERED_LIST: enabled
Keeper feature flag MULTI_READ: enabled
Keeper feature flag CHECK_NOT_EXISTS: disabled
/ :) get /keeper/config
server.1=fd00:1122:3344:101::21:9234;participant;1
server.2=fd00:1122:3344:104::21:9234;participant;1
server.3=fd00:1122:3344:103::21:9234;participant;1
server.4=fd00:1122:3344:103::22:9234;participant;1
server.5=fd00:1122:3344:102::21:9234;participant;1

The keeper on sled g2 is server.5.
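
For anyone repeating this check, the `get /keeper/config` output can be parsed mechanically instead of read by eye. This is a minimal Python sketch, assuming each line has the form `server.<id>=<ipv6>:<port>;<role>;<priority>` as in the output above. `RaftMember` and `parse_keeper_config` are hypothetical names used only for illustration; they are not part of omicron or ClickHouse, and treating the last field as a priority is an assumption.

```python
from dataclasses import dataclass
from ipaddress import IPv6Address

# Output of `get /keeper/config`, copied from the session above.
KEEPER_CONFIG = """\
server.1=fd00:1122:3344:101::21:9234;participant;1
server.2=fd00:1122:3344:104::21:9234;participant;1
server.3=fd00:1122:3344:103::21:9234;participant;1
server.4=fd00:1122:3344:103::22:9234;participant;1
server.5=fd00:1122:3344:102::21:9234;participant;1
"""


@dataclass(frozen=True)
class RaftMember:
    """One line of the keeper raft config (hypothetical helper type)."""
    server_id: int
    address: IPv6Address
    port: int
    role: str
    priority: int  # assumption: the last field is a priority


def parse_keeper_config(text: str) -> dict:
    """Map server id -> RaftMember for every `server.N=...` line."""
    members = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("server."):
            continue
        key, _, value = line.partition("=")
        server_id = int(key.split(".", 1)[1])
        endpoint, role, priority = value.split(";")
        # The address is an unbracketed IPv6 literal, so the port is
        # whatever follows the last colon.
        host, _, port = endpoint.rpartition(":")
        members[server_id] = RaftMember(
            server_id, IPv6Address(host), int(port), role, int(priority)
        )
    return members


if __name__ == "__main__":
    members = parse_keeper_config(KEEPER_CONFIG)
    expunged_id = 5  # the keeper on sled g2, per this report
    if expunged_id in members:
        print(f"server.{expunged_id} is still a raft member: {members[expunged_id]}")
```

After a successful expungement, `expunged_id` should no longer be a key in the parsed result.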

I then checked that the leader has been committing keeper log entries and that the committed index is increasing.

/ :) lgif
first_log_idx   1
first_log_term  1
last_log_idx    1515
last_log_term   1
last_committed_log_idx  1515
leader_committed_log_idx        1515
target_committed_log_idx        1515
last_snapshot_idx       0
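
Checking that the index is increasing means running `lgif` more than once and comparing the results. A rough Python sketch of that comparison follows, assuming the output is whitespace-separated `name value` pairs as shown above. `parse_lgif` and `committed_index_advanced` are hypothetical helper names, not an existing API.

```python
# Output of `lgif`, copied from the session above.
LGIF_OUTPUT = """\
first_log_idx   1
first_log_term  1
last_log_idx    1515
last_log_term   1
last_committed_log_idx  1515
leader_committed_log_idx        1515
target_committed_log_idx        1515
last_snapshot_idx       0
"""


def parse_lgif(text: str) -> dict:
    """Parse `name value` lines into a dict of ints, skipping anything else."""
    info = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            info[parts[0]] = int(parts[1])
    return info


def committed_index_advanced(before: dict, after: dict) -> bool:
    """True if the leader's committed index grew between two samples."""
    key = "leader_committed_log_idx"
    return after[key] > before[key]


if __name__ == "__main__":
    sample = parse_lgif(LGIF_OUTPUT)
    print("leader_committed_log_idx =", sample["leader_committed_log_idx"])
    # To confirm progress, capture a second `lgif` output later and call
    # committed_index_advanced(sample, parse_lgif(second_output)).
```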

I then checked crdb to see what the configuration was:

root@[fd00:1122:3344:101::3]:32221/omicron> select * from bp_clickhouse_cluster_config ;
              blueprint_id             | generation | max_used_server_id | max_used_keeper_id |   cluster_name   |            cluster_secret            | highest_seen_keeper_leader_committed_log_index
---------------------------------------+------------+--------------------+--------------------+------------------+--------------------------------------+-------------------------------------------------
  16dfac44-0091-453a-b5e0-2e1b8cad2329 |          2 |                  3 |                  5 | oximeter_cluster | 5b815633-062c-438d-8acc-1858bb059e9e |                                              0
  69cdc490-9a9d-46e9-b0c0-c8661b0b4794 |          2 |                  3 |                  5 | oximeter_cluster | 5b815633-062c-438d-8acc-1858bb059e9e |                                              0
  79d919a7-13cd-4b47-9e9c-d15515c8532f |          2 |                  3 |                  5 | oximeter_cluster | 5b815633-062c-438d-8acc-1858bb059e9e |                                              0
  bc23843c-1b2a-49d0-9b7b-224f1ed2e892 |          2 |                  3 |                  5 | oximeter_cluster | 5b815633-062c-438d-8acc-1858bb059e9e |                                              0

Interestingly, the highest_seen_keeper_leader_committed_log_index is 0 for all blueprints.

There are also no related rows in inventory:

root@[fd00:1122:3344:101::3]:32221/omicron> select * from  inv_clickhouse_keeper_membership;
  inv_collection_id | queried_keeper_id | leader_committed_log_index | raft_config
--------------------+-------------------+----------------------------+--------------
(0 rows)


Time: 4ms total (execution 4ms / network 1ms)

root@[fd00:1122:3344:101::3]:32221/omicron>

It appears that retrieving this inventory data from clickhouse-admin-keeper is not working, which results in a failure to modify the keepers.
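
To make the mismatch explicit, the three observations can be put side by side. The Python sketch below only restates the evidence in this report; it is not the Reconfigurator's actual logic, and `Evidence` and `diagnose` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Values observed in this report (hypothetical container type)."""
    keeper_leader_committed_log_idx: int  # from `lgif` on the keeper
    blueprint_highest_seen_idx: int       # from bp_clickhouse_cluster_config
    inventory_membership_rows: int        # from inv_clickhouse_keeper_membership


def diagnose(e: Evidence) -> list:
    """Return human-readable findings; an empty list means nothing stands out."""
    findings = []
    if e.inventory_membership_rows == 0:
        findings.append("inventory has no keeper membership rows")
    if e.keeper_leader_committed_log_idx > e.blueprint_highest_seen_idx:
        findings.append(
            "blueprint index "
            f"{e.blueprint_highest_seen_idx} lags keeper index "
            f"{e.keeper_leader_committed_log_idx}"
        )
    return findings


if __name__ == "__main__":
    # Numbers taken from the outputs above.
    observed = Evidence(
        keeper_leader_committed_log_idx=1515,
        blueprint_highest_seen_idx=0,
        inventory_membership_rows=0,
    )
    for finding in diagnose(observed):
        print(finding)
```

Both findings are consistent with the keeper membership never making it into inventory.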
