Steps to reproduce
All omdb ops performed on sled g0
- Launch a4x2
- Set clickhouse-policy to both via omdb
- regenerate a blueprint and make target
- hyperstop g2 (node with one keeper)
- expunge g2 via omdb
- regenerate a couple blueprints and set as targets
- Ensure the zones get expunged in the blueprints
Evidence
Sled g2 can definitely no longer be reached. I see a log related to failing to contact it in the nexus node on g3. However, the keeper still shows it in inventory both in keeper-config.xml and via the clickhouse keeper-client command.
root@oxz_clickhouse_keeper_1d4c8dac:~# clickhouse keeper-client --host [fd00:1122:3344:104::21]
Connected to ZooKeeper at [fd00:1122:3344:104::21]:9181 with session_id 8
Keeper feature flag FILTERED_LIST: enabled
Keeper feature flag MULTI_READ: enabled
Keeper feature flag CHECK_NOT_EXISTS: disabled
/ :) get /keeper/config
server.1=fd00:1122:3344:101::21:9234;participant;1
server.2=fd00:1122:3344:104::21:9234;participant;1
server.3=fd00:1122:3344:103::21:9234;participant;1
server.4=fd00:1122:3344:103::22:9234;participant;1
server.5=fd00:1122:3344:102::21:9234;participant;1
The keeper on sled g2 is server.5
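To make the membership check repeatable, the `get /keeper/config` output above can be parsed into a map of server id to address. This is only a minimal sketch: the function names are hypothetical, and it assumes g2's keeper sits in the `fd00:1122:3344:102::/64` subnet, as the listing above suggests. It is not part of omdb or clickhouse-admin.

```python
# Sample copied from the `get /keeper/config` output in this report.
KEEPER_CONFIG = """\
server.1=fd00:1122:3344:101::21:9234;participant;1
server.2=fd00:1122:3344:104::21:9234;participant;1
server.3=fd00:1122:3344:103::21:9234;participant;1
server.4=fd00:1122:3344:103::22:9234;participant;1
server.5=fd00:1122:3344:102::21:9234;participant;1
"""


def parse_keeper_config(text):
    """Parse `server.N=host:port;role;priority` lines into a dict keyed by N."""
    members = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("server."):
            continue
        key, value = line.split("=", 1)
        server_id = int(key[len("server."):])
        endpoint, role, priority = value.split(";")
        # The port is everything after the last colon of the IPv6 endpoint.
        host, port = endpoint.rsplit(":", 1)
        members[server_id] = {
            "host": host,
            "port": int(port),
            "role": role,
            "priority": int(priority),
        }
    return members


def members_with_prefix(members, prefix):
    """Return the sorted server ids whose host starts with the given prefix."""
    return sorted(sid for sid, m in members.items() if m["host"].startswith(prefix))


members = parse_keeper_config(KEEPER_CONFIG)
# Assumption: this prefix identifies sled g2, per the listing above.
stale = members_with_prefix(members, "fd00:1122:3344:102:")
print(stale)
```

If the expungement had taken effect, the list of ids under g2's prefix would be expected to be empty.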
I then checked that keeper log entries are being committed by the leader and that the committed index is increasing.
/ :) lgif
first_log_idx 1
first_log_term 1
last_log_idx 1515
last_log_term 1
last_committed_log_idx 1515
leader_committed_log_idx 1515
target_committed_log_idx 1515
last_snapshot_idx 0
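The "increasing" check amounts to taking two `lgif` samples and comparing `leader_committed_log_idx`. A rough sketch follows, with hypothetical helper names. The first sample is the output above; the second sample's value is made up purely for illustration.

```python
# First sample copied from the `lgif` output in this report.
LGIF_EARLIER = """\
first_log_idx 1
first_log_term 1
last_log_idx 1515
last_log_term 1
last_committed_log_idx 1515
leader_committed_log_idx 1515
target_committed_log_idx 1515
last_snapshot_idx 0
"""

# Hypothetical later sample: the value 1530 is illustrative, not observed.
LGIF_LATER = LGIF_EARLIER.replace("1515", "1530")


def parse_lgif(text):
    """Parse `name value` lines from lgif output into a dict of ints."""
    info = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            info[parts[0]] = int(parts[1])
    return info


def commits_advancing(earlier, later):
    """True if the leader's committed log index grew between two samples."""
    return later["leader_committed_log_idx"] > earlier["leader_committed_log_idx"]


earlier = parse_lgif(LGIF_EARLIER)
later = parse_lgif(LGIF_LATER)
print(commits_advancing(earlier, later))
```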
I then checked crdb to see what the configuration was:
root@[fd00:1122:3344:101::3]:32221/omicron> select * from bp_clickhouse_cluster_config ;
blueprint_id | generation | max_used_server_id | max_used_keeper_id | cluster_name | cluster_secret | highest_seen_keeper_leader_committed_log_index
---------------------------------------+------------+--------------------+--------------------+------------------+--------------------------------------+-------------------------------------------------
16dfac44-0091-453a-b5e0-2e1b8cad2329 | 2 | 3 | 5 | oximeter_cluster | 5b815633-062c-438d-8acc-1858bb059e9e | 0
69cdc490-9a9d-46e9-b0c0-c8661b0b4794 | 2 | 3 | 5 | oximeter_cluster | 5b815633-062c-438d-8acc-1858bb059e9e | 0
79d919a7-13cd-4b47-9e9c-d15515c8532f | 2 | 3 | 5 | oximeter_cluster | 5b815633-062c-438d-8acc-1858bb059e9e | 0
bc23843c-1b2a-49d0-9b7b-224f1ed2e892 | 2 | 3 | 5 | oximeter_cluster | 5b815633-062c-438d-8acc-1858bb059e9e | 0
Interestingly, highest_seen_keeper_leader_committed_log_index is 0 for all blueprints, even though the keeper leader itself reports a committed log index of 1515.
There are also no related rows in inventory:
root@[fd00:1122:3344:101::3]:32221/omicron> select * from inv_clickhouse_keeper_membership;
inv_collection_id | queried_keeper_id | leader_committed_log_index | raft_config
--------------------+-------------------+----------------------------+--------------
(0 rows)
Time: 4ms total (execution 4ms / network 1ms)
root@[fd00:1122:3344:101::3]:32221/omicron>
It appears that retrieving this inventory data from clickhouse-admin-keeper is not working, which results in a failure to modify the keepers.