Bug Report
Hi guys,
We recently took a TiKV node offline, and afterwards our TiKV cluster kept reporting the same error logs over and over:
[2024/11/19 18:13:27.752 +08:00] [WARN] [client.rs:138] ["failed to update PD client"] [error="Other(\"[components/pd_client/src/util.rs:306]: cancel reconnection due to too small interval\")"]
[2024/11/19 18:13:27.754 +08:00] [ERROR] [util.rs:460] ["request failed"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"invalid store ID 6365, not found\") }))"]
[2024/11/19 18:13:27.754 +08:00] [ERROR] [util.rs:469] ["reconnect failed"] [err_code=KV:PD:Unknown] [err="Other(\"[components/pd_client/src/util.rs:306]: cancel reconnection due to too small interval\")"]
TiKV is trying to resolve the address of another TiKV node (store ID 6365) from PD. However, store 6365 was taken offline and has already been removed as a tombstone, so PD just returns the error message: "invalid store ID 6365, not found."
I think this endless stream of error logs is caused by the code below.
Lines 114 to 124 in 3bd8c24:

Err(e) => {
    // Tombstone store may be removed manually or automatically
    // after 30 days of deletion. PD returns
    // "invalid store ID %d, not found" for such store id.
    // See https://github.com/tikv/pd/blob/v7.3.0/server/grpc_service.go#L777-L780
    if format!("{:?}", e).contains("not found") {
        RESOLVE_STORE_COUNTER_STATIC.not_found.inc();
        info!("resolve store not found"; "store_id" => store_id);
        self.router.report_store_maybe_tombstone(store_id);
    }
    return Err(box_err!(e));
If a store ID is not found in PD, PD returns the error message "invalid store ID xxxx, not found." In that case, TiKV's resolver (get_address) should probably return Error::StoreTombstone(store_id) so the Raft client can tell that the store is a tombstone and break out of its reconnect loop.
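The proposed mapping could be sketched like this. Note these are minimal hypothetical types for illustration, not TiKV's actual resolver code; only the "not found" string check mirrors the snippet above:

```rust
// Hypothetical minimal error type, standing in for TiKV's ResolveError.
#[derive(Debug, PartialEq)]
enum ResolveError {
    StoreTombstone(u64),
    Other(String),
}

// Classify a PD error message the way the snippet above does with
// `format!("{:?}", e).contains("not found")`, but surface a dedicated
// StoreTombstone variant instead of a generic boxed error.
fn classify_resolve_error(store_id: u64, pd_err: &str) -> ResolveError {
    if pd_err.contains("not found") {
        ResolveError::StoreTombstone(store_id)
    } else {
        ResolveError::Other(pd_err.to_string())
    }
}

fn main() {
    let e = classify_resolve_error(6365, "invalid store ID 6365, not found");
    assert_eq!(e, ResolveError::StoreTombstone(6365));
    println!("{:?}", e);
}
```

With a variant like this, the caller can match on StoreTombstone rather than re-parsing error strings at every layer.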
tikv/src/server/raft_client.rs
Lines 837 to 845 in 3bd8c24:

if let ResolveError::StoreTombstone(_) = e {
    let mut pool = pool.lock().unwrap();
    if let Some(s) = pool.connections.remove(&(back_end.store_id, conn_id)) {
        s.set_conn_state(ConnState::Disconnected);
    }
    pool.tombstone_stores.insert(back_end.store_id);
    return;
}
continue;
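To illustrate why a dedicated StoreTombstone error would end the retries while a generic error does not, here is a self-contained simulation of the loop above. The types and the run_reconnect_loop function are hypothetical stand-ins, not TiKV's real connection pool:

```rust
#[derive(Debug)]
enum ResolveError {
    StoreTombstone(u64),
    Other(String),
}

// Feed the loop a scripted sequence of resolve outcomes and report how
// many attempts ran plus which stores were recorded as tombstones.
fn run_reconnect_loop(attempts: Vec<Result<(), ResolveError>>) -> (usize, Vec<u64>) {
    let mut tombstone_stores = Vec::new();
    let mut tries = 0;
    for outcome in attempts {
        tries += 1;
        match outcome {
            Ok(()) => break, // connected successfully
            Err(ResolveError::StoreTombstone(id)) => {
                // Mirror of `pool.tombstone_stores.insert(...)` + `return`:
                // record the store and stop retrying.
                tombstone_stores.push(id);
                break;
            }
            // Mirror of the trailing `continue`: any other error retries.
            Err(ResolveError::Other(_)) => continue,
        }
    }
    (tries, tombstone_stores)
}

fn main() {
    // If the resolver only ever returns Other(...), the loop never hits
    // the StoreTombstone arm and retries forever; with the dedicated
    // variant it stops as soon as PD reports the store is gone.
    let (tries, stores) = run_reconnect_loop(vec![
        Err(ResolveError::Other("cancel reconnection".into())),
        Err(ResolveError::StoreTombstone(6365)),
        Err(ResolveError::Other("never reached".into())),
    ]);
    assert_eq!(tries, 2);
    assert_eq!(stores, vec![6365]);
    println!("stopped after {} attempts; tombstones: {:?}", tries, stores);
}
```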
What version of TiKV are you using?
5.0.6
What operating system and CPU are you using?
Linux Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Steps to reproduce
- Stop a tombstoned TiKV service.
- Remove the store with pd-ctl's remove-tombstone command.
What did you expect?
The error logs to stop after the store is removed.
What happened?
TiKV keeps reporting these errors repeatedly.