cephfs: MDCache request cleanup #62630
Conversation
In cases where there is a single element in a batch_op_map,new_batch_head is a nullptr, when this is retried at Finisher we'd hit one of the asserts when dereferencing Fixes: https://tracker.ceph.com/issues/70769 Signed-off-by: Abhishek Lekshmanan <abhishek.lekshmanan@cern.ch>
Ignore null requests Signed-off-by: Abhishek Lekshmanan <abhishek.lekshmanan@cern.ch>
|
The main fix is the first commit; the second commit adds a couple of null checks, though this may prevent hiding actual issues so happy to drop it. |
|
@mchangir PTAL |
|
@theanalyst the crash is most probably this issue #62684 |
if the new_batch_head is not found it defaults to null, and this is passed to Finisher. We see this happening quite a lot if you launch requests with find etc on a large tree where a single inode might have only a single batch request and then you pass nullptr to Finisher which crashes. So I believe passing a non null new_batch_head is needed. What do you think? |
@theanalyst would you be able to form a set of bash commands that mimic your workload ? |
|
I've just seen this in the wild on a single active v18.2.7 fs. I don't have any more detail about the use-case that triggered it. |
|
@mchangir I believe @theanalyst shared the reproducer elsewhere (via mail?). Could you please update the tracker with those details please. I will plan to have a look then. |
|
From the codeflow, what I can understand, when a session gets evicted it tries to clear the requests, This While following the comment, mds wants to just proceed the request marking it as a killed request. But when it comes to batch_head request, either it should replaced by new batch head(which is not killed) request completely in the |
|
@mchangir gentle nudge on this. |
| } else { | ||
| mds->finisher->queue(new C_MDS_RetryRequest(this, new_batch_head)); |
There was a problem hiding this comment.
this looks sensible to me ... wrapping the re-queuing in an else block probably got missed earlier
There was a problem hiding this comment.
For most of the crashes we see in the Finisher, this seems like the most logical place. I also tried adding print statements and I can see that we indeed queue up a null new_batch_head when the request gets killed and the inode doesn't have a new_batch_head. Finally we crash in finish where we try to access this nullptr
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
(unmarking stale) I haven't been following the discussion here closely. @mchangir - Is this good to go? What about @sajibreadd comment #62630 (comment) ? |
|
This PR is under test in https://tracker.ceph.com/issues/73693. |
When BatchGetAttr finds no new head, we end up passing a potential null request object to Finisher which crashes later. Handle this
Fixes: https://tracker.ceph.com/issues/70769
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins test classic perfJenkins Job | Jenkins Job Definitionjenkins test crimson perfJenkins Job | Jenkins Job Definitionjenkins test signedJenkins Job | Jenkins Job Definitionjenkins test make checkJenkins Job | Jenkins Job Definitionjenkins test make check arm64Jenkins Job | Jenkins Job Definitionjenkins test submodulesJenkins Job | Jenkins Job Definitionjenkins test dashboardJenkins Job | Jenkins Job Definitionjenkins test dashboard cephadmJenkins Job | Jenkins Job Definitionjenkins test apiJenkins Job | Jenkins Job Definitionjenkins test docsReadTheDocs | Github Workflow Definitionjenkins test ceph-volume allJenkins Jobs | Jenkins Jobs Definitionjenkins test windowsJenkins Job | Jenkins Job Definitionjenkins test rook e2eJenkins Job | Jenkins Job Definition