
[WIP] crimson/osd/osd_operations/pg_advance: make PGAdvanceMap a pg_operation #51425

Closed

Matan-B wants to merge 16 commits into ceph:main from Matan-B:wip-matanb-crimson-osdmap-enoent

Conversation


@Matan-B Matan-B commented May 10, 2023

[eb27130] Fixes the following case:

DEBUG 2023-05-09 13:27:22,883 [shard 0] osd - pg_advance_map(id=13033, detail=PGAdvanceMap(pg=10.16 from=76 to=77)): complete
DEBUG 2023-05-09 13:27:22,883 [shard 0] osd - pg_advance_map(id=13152, detail=PGAdvanceMap(pg=15.11 from=76 to=77)): sending pg temp
DEBUG 2023-05-09 13:27:22,883 [shard 0] osd - pg_advance_map(id=12506, detail=PGAdvanceMap(pg=3.1b from=77 to=76)): start: getting map 78
DEBUG 2023-05-09 13:27:22,883 [shard 0] osd - get_local_map loading osdmap.78 from disk
INFO  2023-05-09 13:27:22,883 [shard 0] osd - load_map osdmap.78
DEBUG 2023-05-09 13:27:22,883 [shard 0] osd - load_map_bl loading osdmap.78 from disk
DEBUG 2023-05-09 13:27:22,883 [shard 0] alienstore - read
...
DEBUG 2023-05-09 13:27:22,957 [shard 0] osd - pg_advance_map(id=13074, detail=PGAdvanceMap(pg=12.d from=76 to=77)): complete
DEBUG 2023-05-09 13:27:22,957 [shard 0] osd - pg_advance_map(id=12975, detail=PGAdvanceMap(pg=8.b from=76 to=77)): sending pg temp
DEBUG 2023-05-09 13:27:22,957 [shard 0] osd - pg_advance_map(id=13144, detail=PGAdvanceMap(pg=15.7 from=76 to=77)): complete
DEBUG 2023-05-09 13:27:22,957 [shard 0] osd - pg_advance_map(id=12833, detail=PGAdvanceMap(pg=2.2 from=76 to=77)): sending pg temp
DEBUG 2023-05-09 13:27:22,957 [shard 0] osd - OSDMeta::load_map: start read gave enoent on 78
ERROR 2023-05-09 13:27:22,957 [shard 0] none - ../src/crimson/osd/osd_meta.cc:40 : In function 'OSDMeta::load_map(epoch_t)::<lambda()>', abort(%s)
abort() called
// See PGAdvanceMap(pg=3.1b from=77 to=76)
// from is set via `pg->get_osdmap_epoch()`
// 'from' is already newer than 'to'
// we start iterating from 77+1 (osdmap.78 does not exist)

Drafted for discussion; comments are left in the commit. The fix assumes that if the PG was already advanced to a later osdmap than `to`, we can skip the PGAdvanceMap operation.
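The proposed guard can be sketched as follows (a minimal illustration with hypothetical names, not the actual crimson code): if the PG's current epoch is already at or past `to`, there is no range of maps to advance through, so the operation is skipped rather than starting an iteration that would try to load a non-existent osdmap such as 78 in the log above.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>

using epoch_t = uint32_t;

// Hypothetical sketch: decide whether a PGAdvanceMap has any work to do.
// Returns the inclusive epoch range [from+1, to] to iterate, or nullopt
// when the PG already sits at or past `to` and the op can be skipped
// (avoiding a load of an epoch past the newest map we hold).
std::optional<std::pair<epoch_t, epoch_t>>
plan_advance(epoch_t pg_epoch, epoch_t to)
{
  if (pg_epoch >= to) {
    return std::nullopt;               // already advanced far enough
  }
  return std::make_pair(pg_epoch + 1, to);
}
```

With the failing case from the log (PG at epoch 77, `to` = 76), `plan_advance(77, 76)` yields no range instead of starting an iteration at epoch 78.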

https://gist.github.com/Matan-B/67a6c2a9b9c581600799c2c37704e913

Fixes: https://tracker.ceph.com/issues/59165


@Matan-B Matan-B requested review from athanatos and rzarzynski May 10, 2023 14:14
@Matan-B Matan-B force-pushed the wip-matanb-crimson-osdmap-enoent branch from eb27130 to fcfe7a6 Compare May 10, 2023 15:34
@athanatos
Contributor

So the bug is that the PG already has a map epoch after to? Is it that we're processing the same maps from multiple MOSDMap messages? It's entirely possible, likely even, for an OSD to get several MOSDMap messages with overlapping map sequences close together since any peer OSD will send one if it thinks the OSD is out of date. Likely means we're not correctly maintaining the state that would avoid processing the same map more than once.


@athanatos athanatos left a comment


(See question above)


Matan-B commented May 11, 2023

> So the bug is that the PG already has a map epoch after to? Is it that we're processing the same maps from multiple MOSDMap messages? It's entirely possible, likely even, for an OSD to get several MOSDMap messages with overlapping map sequences close together since any peer OSD will send one if it thinks the OSD is out of date. Likely means we're not correctly maintaining the state that would avoid processing the same map more than once.

@athanatos, I think this is caused by misusing Seastar's `do_for_each` implementation.
The only termination check is `begin != end`, which assumes the begin iterator is smaller than the end iterator [1].
This could also be resolved by using `do_until` with a safer stop condition, although I prefer the current solution.

I will also note that in an early test run, this issue no longer appears, with 56/60 tests passing (Crimson suite run x3) [2].

[1]https://github.com/scylladb/seastar/blob/68b5aece19d1b3656a95c6869ff510db2e402dec/include/seastar/core/loop.hh#L424

[2]https://pulpito.ceph.com/matan-2023-05-11_08:22:25-crimson-rados-crimson-only-no-mclock-osdmap-enoent-distro-crimson-smithi/
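The `do_for_each` hazard described above can be modeled with a plain loop (an assumed simplification; seastar's real implementation is future-based): when the only stop condition is `begin != end` and begin starts past end, the iteration walks off the intended range instead of doing nothing.

```cpp
#include <cassert>

// Simplified model of a loop whose only stop condition is `begin != end`
// (as in seastar::do_for_each). `cap` bounds the iteration count so the
// runaway case terminates for demonstration purposes.
int iterations_done(int begin, int end, int cap)
{
  int n = 0;
  while (begin != end && n < cap) {
    ++begin;   // "visit" epoch `begin`, e.g. load the next osdmap
    ++n;
  }
  return n;
}
```

With `begin = 78` and `end = 77` (the inverted from/to case from the log), the loop does not stop at zero iterations; it keeps incrementing, which corresponds to attempting to load osdmap.78 and beyond.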

@Matan-B Matan-B requested a review from athanatos May 11, 2023 13:52
@Matan-B Matan-B marked this pull request as ready for review May 11, 2023 17:10
@Matan-B Matan-B requested a review from a team as a code owner May 11, 2023 17:10
@athanatos
Contributor

It's not so much that it's a misuse as that the code as written encodes an assumption that we don't initiate PGAdvanceMap on the same epoch twice. It seems to me like we really do want to enforce that invariant, so I'd like to understand how the PG advanced past that epoch in the first place. Tolerating that case seems at best inefficient and at worst would hide a problem.


@athanatos athanatos left a comment


(See above)


Matan-B commented May 15, 2023

> It's not so much that it's a misuse as that the code as written encodes an assumption that we don't initiate PGAdvanceMap on the same epoch twice. It seems to me like we really do want to enforce that invariant, so I'd like to understand how the PG advanced past that epoch in the first place. Tolerating that case seems at best inefficient and at worst would hide a problem.

@athanatos, I have made some remarks based on a more thorough examination of this issue and provided an abstract solution/redesign of the current way we schedule a pg advance map operation.
Could you please take a look at https://gist.github.com/Matan-B/67a6c2a9b9c581600799c2c37704e913 for initial feedback?
Thank you.


Regarding the previous comment:

> So the bug is that the PG already has a map epoch after to? Is it that we're processing the same maps from multiple MOSDMap messages? It's entirely possible, likely even, for an OSD to get several MOSDMap messages with overlapping map sequences close together since any peer OSD will send one if it thinks the OSD is out of date. Likely means we're not correctly maintaining the state that would avoid processing the same map more than once.

IIUC, this is not an issue of processing the same map more than once; that state is being maintained successfully.

@Matan-B Matan-B requested a review from athanatos May 15, 2023 18:50

@athanatos athanatos left a comment


@Matan-B IIRC, the plan here was to add a mutex or some other mechanism to correctly pipeline concurrent MOSDMap messages, right?


Matan-B commented May 21, 2023

> @Matan-B IIRC, the plan here was to add a mutex or some other mechanism to correctly pipeline concurrent MOSDMap messages, right?

@athanatos Right, it's wip.

Matan-B added 7 commits May 24, 2023 14:06
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Added logs
If case reconstructed - epoch is unsigned.

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
…GAdvanceMap ctor

No connection is passed as this is not a pg_operation

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
@Matan-B Matan-B changed the title crimson/osd/osd_operations/pg_advance: Fix PGAdvanceMap::start() [WIP]crimson/osd/osd_operations/pg_advance: make PGAdvanceMap a pg_operation May 24, 2023
@Matan-B Matan-B force-pushed the wip-matanb-crimson-osdmap-enoent branch from fcfe7a6 to 5eb7d11 Compare May 24, 2023 14:36
Comment on lines +121 to +126
[this, conn, advance_from, advance_to, &state](auto& pg) {
return start_pg_operation_with_state<PGAdvanceMap>(
state,
conn,
shard_services.local(),
pg.second, advance_from, advance_to,


This will essentially allow us to schedule a PGAdvanceMap as an exclusive pg_operation with bounded advance_from and advance_to epochs.

(Pushed for early round of review)
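The invariant such an exclusive, epoch-bounded operation is meant to preserve can be sketched as follows (hypothetical names, not the actual crimson types): each advance must start exactly at the PG's current epoch and move it strictly forward, so the same epoch is never advanced twice and inverted ranges are rejected.

```cpp
#include <cassert>
#include <cstdint>

using epoch_t = uint32_t;

// Hypothetical sketch of the invariant behind an exclusive PGAdvanceMap:
// per PG, advances are contiguous and strictly increasing.
struct AdvanceTracker {
  epoch_t current;  // epoch the PG currently sits at

  // Returns true iff the (from, to] range may be scheduled.
  bool try_advance(epoch_t from, epoch_t to) {
    if (from != current || to <= from) {
      return false;  // stale, overlapping, or inverted range
    }
    current = to;
    return true;
  }
};
```

Under this scheme, the failing case from the log (from=77, to=76) is rejected up front instead of ever reaching the map-loading loop.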

@Matan-B Matan-B force-pushed the wip-matanb-crimson-osdmap-enoent branch 2 times, most recently from 0c5bb48 to d340687 Compare May 29, 2023 14:22
@Matan-B Matan-B force-pushed the wip-matanb-crimson-osdmap-enoent branch from d340687 to 38ed821 Compare May 29, 2023 14:23
@Matan-B Matan-B requested a review from athanatos May 29, 2023 14:42
Matan-B added 9 commits May 29, 2023 16:12
…peration

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
…ional

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
@Matan-B Matan-B force-pushed the wip-matanb-crimson-osdmap-enoent branch from 38ed821 to 66c4485 Compare May 29, 2023 16:30

Matan-B commented May 31, 2023

Closing in favor of #51860


3 participants