mds: blocklist clients with "bloated" session metadata#52944

Merged
vshankar merged 4 commits into ceph:main from vshankar:wip-61947
Aug 25, 2023

@vshankar
Contributor

@vshankar vshankar commented Aug 11, 2023

Fixes: http://tracker.ceph.com/issues/61947

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@vshankar vshankar added the cephfs Ceph File System label Aug 11, 2023
@vshankar vshankar requested a review from a team August 11, 2023 11:31
Member

@batrick batrick left a comment

I think a PendingReleaseNote would be appropriate.

It'd also be good if we had some kind of cluster warning about this. Maybe add a message parameter to evict_client so it can issue the cluster warning (I'm not sure SessionMap.cc can do it easily).

@dparmar18
Contributor

@vshankar the comments are yet to be addressed right?

@vshankar
Contributor Author

> @vshankar the comments are yet to be addressed right?

Pushing an update (I mark it resolved before an update just for my sanity).

@vshankar
Contributor Author

Added note in PendingReleaseNotes.

Contributor

@robbat2 robbat2 left a comment

approving as some of the original users affected by this. Minor ask, not-critical: having a perf metric about how many sessions exceeded the threshold would be valuable.

@vshankar
Contributor Author

> approving as some of the original users affected by this. Minor ask, not-critical: having a perf metric about how many sessions exceeded the threshold would be valuable.

Good idea. I'll include this and push an update.

Buggy clients (or maybe an MDS bug) cause a huge buildup of
`completed_requests` metadata in their session information.
This could cause the MDS to go read-only when it's flushing
session metadata to the journal, since the bloated metadata
causes the OSDOp payload to exceed the maximum write size.

Blocklist such clients so as to allow the MDS to continue
servicing requests.

Fixes: http://tracker.ceph.com/issues/61947
Signed-off-by: Venky Shankar <vshankar@redhat.com>
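
The check described in the commit message above could look roughly like the following. This is a minimal, self-contained sketch and not the code merged in this PR: `SessionSketch`, `FlushPlan`, `plan_session_flush`, and the `threshold_bytes` parameter are hypothetical names chosen for illustration, and the 16 MiB value in the usage note is an arbitrary example. The `threshold_exceeded` field mirrors the perf metric requested in review.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an MDS client session (not Ceph's Session class).
struct SessionSketch {
  std::string client_id;
  std::size_t encoded_metadata_bytes;  // encoded size, including completed_requests
  bool blocklisted;
};

// Hypothetical outcome of one pass over the sessions before a flush.
struct FlushPlan {
  std::vector<std::string> to_flush;      // sessions safe to persist
  std::vector<std::string> to_blocklist;  // sessions with bloated metadata
  std::uint64_t threshold_exceeded = 0;   // counter in the spirit of a perf metric
};

// Split sessions into those small enough to persist and those whose
// metadata exceeds the threshold; the latter are marked for blocklisting
// so that the remaining sessions can still be written out.
FlushPlan plan_session_flush(std::vector<SessionSketch>& sessions,
                             std::size_t threshold_bytes) {
  assert(threshold_bytes > 0);
  FlushPlan plan;
  for (auto& s : sessions) {
    if (s.encoded_metadata_bytes > threshold_bytes) {
      s.blocklisted = true;
      plan.to_blocklist.push_back(s.client_id);
      ++plan.threshold_exceeded;
    } else {
      plan.to_flush.push_back(s.client_id);
    }
  }
  return plan;
}
```

With a 16 MiB threshold, a session carrying 32 MiB of metadata would land in `to_blocklist` while a 1 KiB session would land in `to_flush`. How the real MDS measures session size and performs the eviction is outside what this sketch assumes.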
…data threshold being exceeded

Signed-off-by: Venky Shankar <vshankar@redhat.com>
... when its session metadata is bloated due to buildup of
`completed_requests`.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
…mds config

Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar
Contributor Author

fixed and updated.

Contributor

@dparmar18 dparmar18 left a comment

LGTM

@vshankar
Contributor Author

fs suite test results - https://pulpito.ceph.com/vshankar-2023-08-22_10:18:46-fs-wip-vshankar-testing-20230822.064807-testing-default-smithi/

will go through the run list tomorrow and do the needful.

@vshankar
Contributor Author

> fs suite test results - https://pulpito.ceph.com/vshankar-2023-08-22_10:18:46-fs-wip-vshankar-testing-20230822.064807-testing-default-smithi/
>
> will go through the run list tomorrow and do the needful.

Test run looks fine - no failures related to this change. I'll merge this shortly.

@vshankar
Contributor Author

jenkins test api

Contributor Author

@vshankar vshankar left a comment

@izxl007
Contributor

izxl007 commented Aug 20, 2025

@vshankar Hi~
There is another scenario regarding this issue.
When the total size of all sessions exceeds the OSD's limit, the MDS will also go read-only.
My idea is to flush the sessions in batches, with each batch kept within the OSD's limit.
I don't know if this idea has any other impacts. Could you give me some suggestions?
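
The batching idea in the comment above can be sketched as a simple greedy grouping. This is only an illustration of the proposal, not something that exists in the MDS: `batch_sessions` and `max_batch_bytes` are hypothetical names, and the sketch ignores whether the session map could actually be written in several operations without other side effects, which is the open question raised here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: group sessions (given by their encoded sizes) into
// batches so that each batch's total size stays within max_batch_bytes.
// Returns, per batch, the indices of the sessions it contains.
std::vector<std::vector<std::size_t>> batch_sessions(
    const std::vector<std::size_t>& sizes, std::size_t max_batch_bytes) {
  assert(max_batch_bytes > 0);
  std::vector<std::vector<std::size_t>> batches;
  std::vector<std::size_t> current;
  std::size_t current_bytes = 0;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    // Close the current batch if adding this session would exceed the limit.
    if (!current.empty() && current_bytes + sizes[i] > max_batch_bytes) {
      batches.push_back(current);
      current.clear();
      current_bytes = 0;
    }
    // A single session larger than the limit still gets its own batch;
    // that oversized case is the one this PR handles by blocklisting.
    current.push_back(i);
    current_bytes += sizes[i];
  }
  if (!current.empty()) {
    batches.push_back(current);
  }
  return batches;
}
```

For sizes `{4, 4, 4, 10, 1}` and a limit of 8, this groups the sessions as `{0, 1}`, `{2}`, `{3}`, `{4}`; the third batch holds a single oversized session, which batching alone cannot fix.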

5 participants