Skip to content

mds/quiesce: agent: avoid a race condition with rapid db updates#56956

Merged
leonid-s-usov merged 1 commit intomainfrom
wip-lusov-quiesce-agent-race
Apr 18, 2024
Merged

mds/quiesce: agent: avoid a race condition with rapid db updates#56956
leonid-s-usov merged 1 commit intomainfrom
wip-lusov-quiesce-agent-race

Conversation

@leonid-s-usov
Copy link
Contributor

@leonid-s-usov leonid-s-usov commented Apr 17, 2024

When new roots begin processing but don't yet make it into the
currently tracked set, there is a window for the next update
with the same roots to treat them as new.

We fix it by simplifying the agent model, getting rid of
the intermediate working set. Since we never remove or add
items into the current roots collection, it's safe to update the
current set directly from the pending set.

The race was due to the fact that db_update() relied on the current
to deduce new roots into pending, while the same new root
could have already been seen and posted into the working set.
This would lead to submitting the same new root twice.
Without the working set such race isn't possible.

Fixes: https://tracker.ceph.com/issues/65545

Checklist

  • Tracker (select at least one)
    • References tracker ticket
  • Component impact
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • No doc update is appropriate
  • Tests (select at least one)
    • Includes bug reproducer
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@leonid-s-usov leonid-s-usov requested a review from batrick April 17, 2024 13:40
@github-actions github-actions bot added cephfs Ceph File System tests labels Apr 17, 2024
@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-quiesce-agent-race branch from c3ff08e to ebc02e0 Compare April 17, 2024 13:44
@leonid-s-usov leonid-s-usov requested a review from a team April 17, 2024 13:45
When new roots begin processing but don't yet make it into the
currently tracked set, there is a window for the next update
with the same roots to treat them as new.

We fix it by simplifying the agent model, getting rid of
the intermediate `working` set. Since we never remove or add
items into the current roots collection, it's safe to update the
current set directly from the pending set.

The race was due to the fact that `db_update()` relied on the `current`
to deduce new roots into `pending`, while the same new root
could have already been seen and posted into the `working` set.
This would lead to submitting the same new root twice.
Without the `working` set such race isn't possible.

Fixes: https://tracker.ceph.com/issues/65545
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-quiesce-agent-race branch from ebc02e0 to 2a3faf1 Compare April 17, 2024 13:52
@leonid-s-usov
Copy link
Contributor Author

Passed 8 tests on ubuntu machines here: https://pulpito.ceph.com/leonidus-2024-04-18_07:15:14-fs-wip-lusov-quiesce-agent-race-distro-default-smithi/

the other tests were dead due to an infra issue

@leonid-s-usov
Copy link
Contributor Author

jenkins test api

@leonid-s-usov
Copy link
Contributor Author

The downstream backport: https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/597

@leonid-s-usov leonid-s-usov merged commit f84b45c into main Apr 18, 2024
@leonid-s-usov leonid-s-usov deleted the wip-lusov-quiesce-agent-race branch April 18, 2024 12:35
leonid-s-usov added a commit that referenced this pull request Apr 18, 2024
When new roots begin processing but don't yet make it into the
currently tracked set, there is a window for the next update
with the same roots to treat them as new.

We fix it by simplifying the agent model, getting rid of
the intermediate `working` set. Since we never remove or add
items into the current roots collection, it's safe to update the
current set directly from the pending set.

The race was due to the fact that `db_update()` relied on the `current`
to deduce new roots into `pending`, while the same new root
could have already been seen and posted into the `working` set.
This would lead to submitting the same new root twice.
Without the `working` set such race isn't possible.

Fixes: https://tracker.ceph.com/issues/65570
Original-Issue: https://tracker.ceph.com/issues/65545
Original-PR: #56956
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit 2a3faf1)
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Aug 7, 2024
When new roots begin processing but don't yet make it into the
currently tracked set, there is a window for the next update
with the same roots to treat them as new.

We fix it by simplifying the agent model, getting rid of
the intermediate `working` set. Since we never remove or add
items into the current roots collection, it's safe to update the
current set directly from the pending set.

The race was due to the fact that `db_update()` relied on the `current`
to deduce new roots into `pending`, while the same new root
could have already been seen and posted into the `working` set.
This would lead to submitting the same new root twice.
Without the `working` set such race isn't possible.

Fixes: https://tracker.ceph.com/issues/65570
Original-Issue: https://tracker.ceph.com/issues/65545
Original-PR: ceph#56956
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit 2a3faf1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cephfs Ceph File System tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants