client: crash caused by invalid iterator in _readdir_cache_cb #64627

Merged
vshankar merged 1 commit into ceph:main from zhsgao:client_crash_on_invalid_iterator
Sep 3, 2025

Conversation

@zhsgao zhsgao commented Jul 23, 2025

The capacity of `readdir_cache` may change while `client_lock` is released during iteration over `readdir_cache`. That invalidates the iterator, and using the invalidated iterator in the next iteration crashes the client.
The crash may happen at `Dentry *dn = *pd` (`pd` points to invalid memory), or at `if (pd >= dir->readdir_cache.end() || *pd != dn)` (`pd` is smaller than `begin()` if `idx` is negative).
Use an index instead of an iterator to solve this problem.

Fixes: https://tracker.ceph.com/issues/72247
Signed-off-by: Zhansong Gao <zhsgao@hotmail.com>
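
The failure mode described above can be sketched outside Ceph. The example below is a minimal illustration, not the code from this PR: `Entry` and `walk_by_index` are hypothetical names, and the callback stands in for the window in which `client_lock` is released. It shows that a position kept as an index remains usable after the vector reallocates, provided it is range-checked against the current size on each pass.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a cached directory entry; not the Ceph Dentry type.
struct Entry {
  int id;
};

// Walk `cache` by index, invoking `cb` for each entry. The callback models
// the window in which the lock is dropped: it may grow `cache`, which can
// reallocate the vector's storage and invalidate any iterator held here.
// An index holds no pointer into that storage, so it stays meaningful as
// long as it is checked against the current size on every pass.
std::vector<int> walk_by_index(std::vector<Entry*>& cache,
                               const std::function<void(Entry*)>& cb) {
  std::vector<int> visited;
  for (std::size_t idx = 0; idx < cache.size(); ++idx) {
    Entry* e = cache[idx];     // fresh lookup, never a saved iterator
    visited.push_back(e->id);
    cb(e);                     // `cache` may be resized in here
  }
  return visited;
}
```

A saved `std::vector` iterator in the same loop would be invalidated by the first `push_back` that reallocates; the index avoids holding a pointer into the old storage.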

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@github-actions github-actions bot added the cephfs Ceph File System label Jul 23, 2025
@zhsgao zhsgao force-pushed the client_crash_on_invalid_iterator branch from 33c18be to 9e0488d July 23, 2025 07:26
@zhsgao zhsgao changed the title from "client: crash caused by invalid iterator" to "client: crash caused by invalid iterator in _readdir_cache_cb" Jul 23, 2025
@vshankar vshankar requested a review from a team July 24, 2025 12:47
@vshankar

@dparmar18 PTAL

@vshankar

jenkins retest this please

vshankar commented Aug 4, 2025

@dparmar18 ptal to review the fix.

vshankar commented Aug 4, 2025

jenkins test windows

@dparmar18

> @dparmar18 ptal to review the fix.

on it now

vshankar commented Aug 8, 2025

@dparmar18 gentle nudge on this.

@dparmar18

@zhsgao the code looks good, can this be reproduced locally/easily? Do you have/know any instances where it crashed?

zhsgao commented Aug 12, 2025

> @zhsgao the code looks good, can this be reproduced locally/easily? Do you have/know any instances where it crashed?

I have seen a few crashes, and the coredumps show that they happen at `*pd != dn` in `if (pd >= dir->readdir_cache.end() || *pd != dn)`, so I think they are caused by the invalid iterator `pd`.
I haven't tried to reproduce it yet; it may not be easy to reproduce.
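
The check quoted above compares the iterator against `end()` only, so a position before `begin()` passes the range test and is then dereferenced. A minimal sketch of doing the same revalidation by index follows; `Entry` and `slot_still_holds` are hypothetical names used for illustration and are not taken from the Ceph source or from this PR's diff.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a cached directory entry; not the Ceph Dentry type.
struct Entry {
  int id;
};

// Returns true only if `idx` is a valid position in `cache` and that slot
// still holds `expected`. Both range tests run before any element access,
// so a negative or stale index is rejected without touching memory.
bool slot_still_holds(const std::vector<Entry*>& cache, long idx,
                      const Entry* expected) {
  if (idx < 0) return false;
  if (static_cast<std::size_t>(idx) >= cache.size()) return false;
  return cache[static_cast<std::size_t>(idx)] == expected;
}
```

A caller in this style would treat a `false` result as "the cache changed underneath me" and bail out or retry rather than continue the walk.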

@dparmar18

> > @zhsgao the code looks good, can this be reproduced locally/easily? Do you have/know any instances where it crashed?
>
> I have seen a few crashes, and the coredumps show that they happen at `*pd != dn` in `if (pd >= dir->readdir_cache.end() || *pd != dn)`, so I think they are caused by the invalid iterator `pd`. I haven't tried to reproduce it yet; it may not be easy to reproduce.

yeah, if the invalid `*pd` is dereferenced then it should coredump, but I'm keen to know the cases where it happens. Is there any pattern you've noticed in your workload that might have triggered it?

@vshankar vshankar left a comment

LGTM

@vshankar

This PR is under test in https://tracker.ceph.com/issues/72565.

@dparmar18 dparmar18 left a comment

code looks good

zhsgao commented Aug 13, 2025

> > > @zhsgao the code looks good, can this be reproduced locally/easily? Do you have/know any instances where it crashed?
> >
> > I have seen a few crashes, and the coredumps show that they happen at `*pd != dn` in `if (pd >= dir->readdir_cache.end() || *pd != dn)`, so I think they are caused by the invalid iterator `pd`. I haven't tried to reproduce it yet; it may not be easy to reproduce.
>
> yeah, if the invalid `*pd` is dereferenced then it should coredump, but I'm keen to know the cases where it happens. Is there any pattern you've noticed in your workload that might have triggered it?

I have tried to reproduce the crash but have not been successful, so there is no case for it yet.

@vshankar

> This PR is under test in https://tracker.ceph.com/issues/72565.

Have to rerun tests due to unrelated infra failures.

vshankar added a commit to vshankar/ceph that referenced this pull request Aug 22, 2025

@vshankar vshankar left a comment

vshankar commented Sep 3, 2025

Nice work @zhsgao

@vshankar vshankar merged commit 6a69922 into ceph:main Sep 3, 2025
17 of 19 checks passed
