storage: large raft log not truncating for unknown reason #34269

@andreimatei

I looked at a cluster of 6 nodes, two of which were down. One range was under-replicated. The replication queue would like to up-replicate it but doesn't, because the Raft log is considered too large (the log is 50MB and the cutoff is 16MB, I think). There are tens of thousands of log entries.
The Raft log is not being truncated because the Raft log queue does not think it should queue the range. Why not is the million-dollar question. The queue is aware of the size of the log and correctly uses it to assign a high score. Unfortunately we don't have the truncateDecision because we never log it when it's negative (I'm fixing that). I can only assume that the truncation decision is circularly related to the under-replicated status, but I don't have a smoking gun.
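
To make the size gate concrete, here is a minimal Go sketch of the kind of check described above. The constant and function names are hypothetical and are not CockroachDB identifiers, and the 16MB value is only the cutoff as recalled in this report.

```go
package main

import "fmt"

// Hypothetical cutoff: 16 MiB, the value recalled in this report.
// It is not taken from the CockroachDB source.
const raftLogSizeCutoff = 16 << 20

// canUpReplicate is a hypothetical helper. It reports whether a range whose
// Raft log has the given size would pass a simple "log not too large" gate.
func canUpReplicate(raftLogBytes int64) bool {
	return raftLogBytes <= raftLogSizeCutoff
}

func main() {
	// The ~50MB log from this report exceeds the cutoff.
	fmt.Println(canUpReplicate(50 << 20))
	// A 1MB log would pass the gate.
	fmt.Println(canUpReplicate(1 << 20))
}
```

Under these assumptions, the range stays under-replicated for as long as the log is not truncated below the cutoff.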
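
As a rough illustration of the logging fix mentioned above, the sketch below formats a decision the same way whether it is positive or negative, so that a refusal to truncate leaves a trace. The type, its fields, and the rule for deciding are assumptions made for this example and do not reproduce the real truncateDecision.

```go
package main

import "fmt"

// truncateDecisionSketch is a hypothetical stand-in for the real
// truncateDecision; the fields are assumptions made for illustration.
type truncateDecisionSketch struct {
	LogSizeBytes int64
	OldIndex     uint64 // first index currently in the log
	NewIndex     uint64 // first index to keep after truncation
}

// ShouldTruncate is true only when truncation would drop at least one entry.
func (d truncateDecisionSketch) ShouldTruncate() bool {
	return d.NewIndex > d.OldIndex
}

// String renders the decision identically for positive and negative outcomes.
func (d truncateDecisionSketch) String() string {
	return fmt.Sprintf("should truncate: %t [log size=%d bytes, index %d -> %d]",
		d.ShouldTruncate(), d.LogSizeBytes, d.OldIndex, d.NewIndex)
}

func main() {
	d := truncateDecisionSketch{LogSizeBytes: 50 << 20, OldIndex: 10, NewIndex: 10}
	// Log unconditionally, including when the decision is negative.
	fmt.Println(d)
}
```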
The only information I could get about the opinion of the raft log queue is the following:

image 1

Here's the admin UI report for that range.
r28024-PM.pdf

Here's the range info from a debug.zip.
28024.txt

Here's a debug.zip (internal only)
https://drive.google.com/open?id=1T_cGn_8NlztXvHHOgTroLl870kuI2iTD

Another interesting thing going on in that cluster is that some stores seem to be periodically throttled in the storePool of the node with the large Raft log (n2), because they seem to be refusing snapshots (and I'm not talking about the nodes that are dead). This can be seen in the following trace, taken from the Manual Enqueue page for the replication queue.
https://gist.github.com/andreimatei/eeacd0a79eec809e2494961fb9f19be3

However, I have no reason to believe that the throttling has anything to do with the refusal to truncate. The throttling might independently have been a problem for the up-replication.

Tobi, I believe you know the truncation decision code the best...

cc @petermattis

Labels: C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)