
resolving tikv store address return an error "invalid store ID xxxx, not found" leading to raft client infinite loop #17875

@SonglinLife

Description


Bug Report

Hi Guys,

Recently, we took a TiKV node offline, and then found that the rest of our TiKV cluster kept reporting the same error logs:

[2024/11/19 18:13:27.752 +08:00] [WARN] [client.rs:138] ["failed to update PD client"] [error="Other(\"[components/pd_client/src/util.rs:306]: cancel reconnection due to too small interval\")"]
[2024/11/19 18:13:27.754 +08:00] [ERROR] [util.rs:460] ["request failed"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"invalid store ID 6365, not found\") }))"]
[2024/11/19 18:13:27.754 +08:00] [ERROR] [util.rs:469] ["reconnect failed"] [err_code=KV:PD:Unknown] [err="Other(\"[components/pd_client/src/util.rs:306]: cancel reconnection due to too small interval\")"]

TiKV asks PD to resolve the address of another TiKV node, whose store ID is 6365. However, store 6365 was taken offline and has already been removed as a tombstone, so PD just returns the error message "invalid store ID 6365, not found."

I think this endless stream of error logs is caused by the code below.

tikv/src/server/resolve.rs, lines 114 to 124 at commit 3bd8c24:

Err(e) => {
    // Tombstone store may be removed manually or automatically
    // after 30 days of deletion. PD returns
    // "invalid store ID %d, not found" for such store id.
    // See https://github.com/tikv/pd/blob/v7.3.0/server/grpc_service.go#L777-L780
    if format!("{:?}", e).contains("not found") {
        RESOLVE_STORE_COUNTER_STATIC.not_found.inc();
        info!("resolve store not found"; "store_id" => store_id);
        self.router.report_store_maybe_tombstone(store_id);
    }
    return Err(box_err!(e));
}

If a store ID is not found in PD, PD returns the error message "invalid store ID xxxx, not found." TiKV's resolver get_address should perhaps return an Error::StoreTombstone(store_id) instead, so that the raft client knows the store is a tombstone and can break out of the retry loop:

if let ResolveError::StoreTombstone(_) = e {
    let mut pool = pool.lock().unwrap();
    if let Some(s) = pool.connections.remove(&(back_end.store_id, conn_id)) {
        s.set_conn_state(ConnState::Disconnected);
    }
    pool.tombstone_stores.insert(back_end.store_id);
    return;
}
continue;
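To make the suggestion concrete, here is a minimal, self-contained sketch of the idea (this is not TiKV's actual code; ResolveError, classify_pd_error, and should_retry are hypothetical names): the resolver classifies PD's "not found" message into a dedicated tombstone variant, and the retry loop gives up permanently for that variant instead of looping forever.

```rust
/// Hypothetical error type for address resolution. A dedicated
/// StoreTombstone variant lets callers distinguish "this store is gone
/// forever" from transient failures.
#[derive(Debug, PartialEq)]
enum ResolveError {
    StoreTombstone(u64),
    Other(String),
}

/// Classify a raw PD error string for the given store ID. PD returns
/// "invalid store ID %d, not found" for stores whose tombstone record
/// has been removed, so the message substring is the only signal.
fn classify_pd_error(store_id: u64, raw: &str) -> ResolveError {
    if raw.contains("not found") {
        ResolveError::StoreTombstone(store_id)
    } else {
        ResolveError::Other(raw.to_string())
    }
}

/// A retry loop should stop permanently on tombstone stores and keep
/// retrying only transient errors.
fn should_retry(err: &ResolveError) -> bool {
    !matches!(err, ResolveError::StoreTombstone(_))
}

fn main() {
    // The store ID from the logs above, as an example.
    let e = classify_pd_error(6365, "invalid store ID 6365, not found");
    assert_eq!(e, ResolveError::StoreTombstone(6365));
    assert!(!should_retry(&e)); // tombstone: stop the loop

    let transient = classify_pd_error(1, "deadline exceeded");
    assert!(should_retry(&transient)); // transient: keep retrying
    println!("ok");
}
```

With this split, the raft client's loop can remove the connection and record the store as a tombstone exactly once, rather than re-resolving it on every tick.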

What version of TiKV are you using?

5.0.6

What operating system and CPU are you using?

Linux, Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz

Steps to reproduce

  1. Stop a TiKV service whose store is in the Tombstone state.
  2. Use pd-ctl remove-tombstone to remove the store record from PD.

What did you expect?

The error logs should stop once the tombstone store is removed.

What happened?

All TiKV nodes kept reporting the error continuously.

Labels

    affects-7.5: This bug affects the 7.5.x (LTS) versions.
    affects-8.1: This bug affects the 8.1.x (LTS) versions.
    affects-8.5: This bug affects the 8.5.x (LTS) versions.
    report/community: The community has encountered this bug.
    severity/minor
    type/bug: The issue is confirmed as a bug.
