schedule async reload for region that has unavailable tiflash peers to avoid load un-balance issue #1029
/cc @crazycs520
Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
fe14eb0 to 2d20e8f
github.com/tikv/client-go/v2 => ../
)

replace github.com/pingcap/tidb => github.com/windtalker/tidb v1.1.0-beta.0.20231020063218-4d1c15539f3f
There is a cyclic dependency between tidb and the client-go integration tests. Because I changed the interface of Cluster, the integration tests would fail, so I have used a local tidb to make the tests pass.
}

if store.storeType == tikvrpc.TiFlash {
	r.hasUnavailableTiFlashStore = true
It seems the unavailable flag is set when the region is loaded from PD.
If all the TiFlash stores are up when the region is loaded and one of them goes down later, hasUnavailableTiFlashStore will still be false. The cached region might then be used continuously and never loaded again, so the reload will also be skipped.
Correct me if I have misunderstood something.
Yes, your understanding is right. If all the TiFlash nodes are up and there are continuous queries on TiFlash, the region will not become outdated, and if one of the TiFlash nodes goes down, the region cache is not aware of it. Although the region cache is not aware of the down node, TiDB MPP handles this correctly: for each MPP query, TiDB sends an isAlive RPC to all the candidate TiFlash nodes, and if it fails to get a response or the response is false, TiDB will not send tasks to that TiFlash node.
> TiDB will not send task to that TiFlash node.

Even after the node has recovered, does TiDB still not send tasks to it?
Once the region cache can "see" the TiFlash node, TiDB will send tasks to it when it is back. In the case we discussed above, the region cache can always "see" the TiFlash node, so TiDB will send tasks to it after it has recovered.
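To make the isAlive behaviour discussed in this thread concrete, here is a minimal sketch of filtering candidate TiFlash nodes by a liveness probe. It is not the TiDB implementation: `prober`, `filterAlive`, and `errNoResponse` are hypothetical names introduced only for illustration, and the probe stands in for the isAlive RPC.

```go
package main

import "errors"

// errNoResponse is a hypothetical sentinel for a probe that got no reply.
var errNoResponse = errors.New("no response from store")

// prober is a hypothetical stand-in for the isAlive RPC: it reports
// whether the store at addr answered and said it is alive.
type prober func(addr string) (alive bool, err error)

// filterAlive keeps only the candidates whose probe succeeded and
// returned true. A probe error or a false response excludes the store,
// mirroring the rule described in the comment above.
func filterAlive(candidates []string, probe prober) []string {
	alive := make([]string, 0, len(candidates))
	for _, addr := range candidates {
		ok, err := probe(addr)
		if err != nil || !ok {
			continue // do not dispatch tasks to this store
		}
		alive = append(alive, addr)
	}
	return alive
}
```

Because the probe runs per query in this sketch, a recovered node is picked up again as soon as it answers, provided it is still among the candidates that the region cache returns.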
/run-unit-tests
Describe
Ref pingcap/tidb#35418 for details.
The basic idea is that for a region that has unavailable TiFlash peers, we should reload the region so that TiDB becomes aware when the related TiFlash node is back.
This PR
- `Region`: to indicate whether this region has unavailable TiFlash peers and log the last load time for this region
- `GetTiFlashRPCContext`

Note: in order to resolve the cyclic dependency between the client-go integration tests and tidb, I have to use a local tidb in this PR; I will update it once the TiDB code is merged.
Test
Test step
Test result
Without this pr

When the TiFlash node comes back at around 15:22, the load stays extremely unbalanced between the 2 TiFlash nodes

With this pr
When the TiFlash node comes back at around 16:45, the load is automatically rebalanced at ~16:55