osd: EC optimizations: new types and additions to data structures by bill-scales · Pull Request #62451 · ceph/ceph

bill-scales · 2025-03-23T15:05:17Z

Add new types and make additions to data structures for the EC optimizations feature.

EC optimized pools do not always update every shard for every I/O. This complicates both recovery (where peering uses the pg log and pg info structures to reconcile differences between the shards) and backfill (where peering uses the object_info_t structure to compare the object version number of different shards to reconcile differences).

The pg_log_entry_t, pg_info_t and pg_fast_info_t structures are extended to store extra information that will be required for recovery. The object_info_t structure is extended to store extra information that will be required for backfill and scrubbing.

EC optimized pools restrict the selection of the primary to shards that are always updated. The pg_pool_t structure is extended to track the set of shards that are not suitable to become the primary.


bill-scales · 2025-03-24T07:35:39Z

jenkins test make check arm64

bill-scales · 2025-03-24T15:47:53Z

jenkins test make check arm64

rzarzynski

Looks fine except a decoding issue and some nits.

rzarzynski · 2025-03-26T18:06:45Z

src/osd/ECTypes.h

+
+  void encode(ceph::buffer::list &bl) const {
+    using ceph::encode;
+    encode(id, bl);


No versioning. I think that's because this is a wrapper over an integer.

Correct - it copies what is done for shard_id_t in src/include/types.h

src/osd/osd_types.cc

rzarzynski · 2025-03-26T18:26:13Z

src/osd/osd_types.h


+  /// EC partial writes: test if a shard is a non-primary
+  bool is_nonprimary_shard(const shard_id_t shard) const {
+    return !nonprimary_shards.empty() && nonprimary_shards.contains(shard);


ACK, is_nonprimary_shard returns false for empty nonprimary_shards (legacy EC paths).

rzarzynski · 2025-03-26T18:32:23Z

src/osd/osd_types.cc

 void pg_info_t::decode(ceph::buffer::list::const_iterator &bl)
 {
-  DECODE_START(32, bl);
+  DECODE_START(33, bl);


rzarzynski · 2025-03-26T18:33:28Z

src/osd/osd_types.h


  interval_set<snapid_t> purged_snaps;

+  std::map<shard_id_t,std::pair<eversion_t,eversion_t>> partial_writes_last_complete; ///< last_complete for shards not modified by a partial write


nit: space after ,.

rzarzynski · 2025-03-26T18:34:55Z

src/osd/osd_types.cc

 void pg_log_entry_t::decode(ceph::buffer::list::const_iterator &bl)
 {
-  DECODE_START_LEGACY_COMPAT_LEN(14, 4, 4, bl);
+  DECODE_START_LEGACY_COMPAT_LEN(15, 4, 4, bl);


rzarzynski · 2025-03-26T18:36:11Z

src/osd/osd_types.h

+    return written_shards.empty() || written_shards.contains(shard);
+  }
+  bool is_present_shard(const shard_id_t shard) const {
+    return present_shards.empty() || present_shards.contains(shard);


rzarzynski

LGTM!

rzarzynski · 2025-03-28T16:38:57Z

src/osd/osd_types.cc

 void pg_pool_t::decode(ceph::buffer::list::const_iterator& bl)
 {
-  DECODE_START_LEGACY_COMPAT_LEN(31, 5, 5, bl);
+  DECODE_START_LEGACY_COMPAT_LEN(32, 5, 5, bl);


src/osd/ECTypes.h

aainscow · 2025-03-31T08:16:01Z

jenkins test make check

bill-scales · 2025-04-01T06:59:35Z

jenkins test make check

Add some extra types required by the EC optimizations code:

- raw_shard_id_t is an equivalent type to shard_id_t but is used for storing raw shards. Strong typing prevents bugs where code forgets to translate between the two types.
- shard_id_map is a mini_flat_map indexed by shard_id_t which will be used by the EC optimizations I/O path to track updates to each shard.
- shard_id_set is a bitset_set of shard_id_t which is a compact and fast way of storing a set of shards involved in an EC operation.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
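As an illustration of the strong-typing and bitset ideas in this commit message, here is a minimal sketch. `ShardId`, `RawShardId`, `ShardSet`, `kMaxShards` and `to_shard` are hypothetical stand-ins invented for this example; they are not the Ceph definitions of shard_id_t, raw_shard_id_t or shard_id_set, and the mapping table is an assumed way of translating between the two id spaces.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins, NOT the Ceph types. Two distinct structs mean the
// compiler rejects code that passes a raw shard where a shard is expected.
struct ShardId    { std::int8_t id; };
struct RawShardId { std::int8_t id; };

constexpr std::size_t kMaxShards = 32;  // assumed upper bound for this sketch

// Assumed translation step: the only way to get a ShardId from a RawShardId
// is to go through an explicit mapping table.
inline ShardId to_shard(RawShardId raw,
                        const std::array<std::int8_t, kMaxShards>& mapping) {
  return ShardId{mapping.at(static_cast<std::size_t>(raw.id))};
}

// Compact set of shards backed by a fixed-size bitset.
class ShardSet {
  std::bitset<kMaxShards> bits_;
public:
  void insert(ShardId s) { bits_.set(static_cast<std::size_t>(s.id)); }
  bool contains(ShardId s) const {
    return bits_.test(static_cast<std::size_t>(s.id));
  }
  bool empty() const { return bits_.none(); }
  std::size_t size() const { return bits_.count(); }
};
```

With these definitions, `ShardSet::insert(RawShardId{1})` fails to compile, which is the class of bug the commit message says strong typing is meant to prevent.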

EC optimized pools do not update every shard on every I/O. The primary must have a complete log and requires objects to have up-to-date object attributes, so the choice of primary has to be restricted. For a K+M EC pool with optimizations enabled, the first data shard and all M coding parity shards are always updated and can become the primary; the other shards are marked as nonprimary. The new set nonprimary_shards stores the shards that cannot become the primary. By default it is an empty set, which retains existing behavior. When optimizations are enabled on an EC pool, this set is filled in to restrict the choice of primary.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
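The K+M rule described in this commit message can be sketched as follows. This is an illustration only: `compute_nonprimary_shards` is a hypothetical helper, `ShardId` is a plain-integer stand-in for shard_id_t, and the sketch assumes data shards are numbered 0..K-1 and parity shards K..K+M-1, which ignores any chunk remapping. The `is_nonprimary_shard` check mirrors the one shown in the review excerpt from src/osd/osd_types.h.

```cpp
#include <cassert>
#include <set>

using ShardId = int;  // stand-in for shard_id_t in this sketch

// Hypothetical helper. Assumes data shards are 0..k-1 and parity shards are
// k..k+m-1. Shard 0 and every parity shard stay eligible to be primary.
std::set<ShardId> compute_nonprimary_shards(int k, int m,
                                            bool optimizations_enabled) {
  (void)m;  // parity shards are never added to the set
  std::set<ShardId> nonprimary;
  if (!optimizations_enabled) {
    return nonprimary;  // empty set keeps the legacy behavior
  }
  for (ShardId s = 1; s < k; ++s) {
    nonprimary.insert(s);  // data shards other than the first
  }
  return nonprimary;
}

// Same logic as the accessor in the diff: an empty set means no restriction.
bool is_nonprimary_shard(const std::set<ShardId>& nonprimary_shards,
                         ShardId shard) {
  return !nonprimary_shards.empty() && nonprimary_shards.count(shard) > 0;
}
```

Under these assumptions, a 4+2 pool with optimizations enabled ends up with shards 1, 2 and 3 marked nonprimary, while shards 0, 4 and 5 remain candidates for primary.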

Add partial_writes_last_complete map to pg_info_t and pg_fast_info_t. For optimized EC pools not all shards receive every log entry. As log entries are marked completed, the partial writes last complete map is updated to track shards that did not receive the log entry. Each map entry stores an eversion range. The first version is the last completion the shard participated in; the second version tracks subsequent updates where the shard was not updated. For example, the range 88'10-88'12 means a shard completed update 10 and that updates 11 and 12 intentionally did not update the shard. This information is used during peering to distinguish a shard that is missing updates from a shard that intentionally did not participate in an update, in order to work out what recovery is required. By default this map is empty, indicating that every shard is expected to participate in an update and have a copy of the log entry.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
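To make the 88'10-88'12 example concrete, here is a rough sketch of how such a map could evolve as entries complete. The update rule in `mark_complete` is an assumption made for illustration, not the logic in the PR; `Version`, `ShardId`, `PartialWrites` and `mark_complete` are hypothetical names, with `Version` standing in for eversion_t.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>

// Simplified stand-in for eversion_t, printed as epoch'version in the text.
struct Version {
  std::uint32_t epoch;
  std::uint64_t version;
  bool operator==(const Version& o) const {
    return epoch == o.epoch && version == o.version;
  }
};

using ShardId = int;  // stand-in for shard_id_t
using PartialWrites = std::map<ShardId, std::pair<Version, Version>>;

// Hypothetical update rule (an assumption, not the PR's code):
// - a shard that was written no longer needs a range, so its entry is dropped
// - a shard that was skipped gets a range starting at the previous
//   last_complete, and the end of the range advances on each later skip
// An empty `written` set means every shard was written.
void mark_complete(PartialWrites& pwlc, int shard_count,
                   const std::set<ShardId>& written,
                   Version previous_last_complete, Version completed) {
  for (ShardId s = 0; s < shard_count; ++s) {
    const bool wrote = written.empty() || written.count(s) > 0;
    if (wrote) {
      pwlc.erase(s);
      continue;
    }
    auto it = pwlc.find(s);
    if (it == pwlc.end()) {
      pwlc.emplace(s, std::make_pair(previous_last_complete, completed));
    } else {
      it->second.second = completed;
    }
  }
}
```

In this sketch, a full write at 88'10 followed by two writes at 88'11 and 88'12 that skip shard 1 leaves a single entry for shard 1 holding the range 88'10-88'12, matching the example in the commit message.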

…nty_t

Add two new sets to pg_log_entry_t for use by EC optimized pools. The written shards set tracks which shards were written to; the present shards set tracks which shards were in the acting set at the time of the write. An empty set (the default) is used to indicate all shards. For pools without allow_ec_optimizations the written set is empty (indicating all shards are written) and the present set is empty and unused.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
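The "empty set means all shards" convention can be sketched like this. `LogEntrySketch` and `skipped_on_purpose` are hypothetical names for this example. `is_present_shard` follows the accessor visible in the review excerpt; the name `is_written_shard` is an assumption, because the excerpt cuts off before showing the name of the accessor for written_shards.

```cpp
#include <cassert>
#include <set>

using ShardId = int;  // stand-in for shard_id_t

// Hypothetical, reduced view of a log entry holding only the two new sets.
struct LogEntrySketch {
  std::set<ShardId> written_shards;  // empty means all shards were written
  std::set<ShardId> present_shards;  // empty means all shards were present

  // Assumed accessor name; the body matches the line shown in the diff.
  bool is_written_shard(ShardId s) const {
    return written_shards.empty() || written_shards.count(s) > 0;
  }
  bool is_present_shard(ShardId s) const {
    return present_shards.empty() || present_shards.count(s) > 0;
  }
};

// Illustrative helper: a shard that was in the acting set but not written
// did not miss the update, it was deliberately left out of it.
bool skipped_on_purpose(const LogEntrySketch& e, ShardId s) {
  return e.is_present_shard(s) && !e.is_written_shard(s);
}
```

A default-constructed entry behaves like a legacy pool: every shard counts as written, so no shard is ever reported as skipped on purpose.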

EC optimized pools do not always update every shard for every write I/O; this includes not updating the object_info_t (OI attribute). This means different shards can have OI indicating the object is at different versions. When an I/O updates a subset of the shards, the OI for the updated shards will record the old version number for the unmodified shards in the shard_versions map. The latest OI therefore has a record of the expected version number for all the shards, which can be used to work out what needs to be backfilled. An empty shard_versions map implies that the OI attribute should be the same on all shards.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
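A small sketch of how the latest OI could be consulted for a per-shard expected version. `ObjectInfoSketch`, `expected_version` and `needs_backfill` are hypothetical names, versions are reduced to a single integer rather than an eversion_t, and the backfill comparison is an assumed reading of the description above rather than the PR's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using ShardId = int;                  // stand-in for shard_id_t
using ObjectVersion = std::uint64_t;  // simplified; not an eversion_t

// Hypothetical, reduced view of object_info_t with only the relevant fields.
struct ObjectInfoSketch {
  ObjectVersion version;  // version recorded by the latest update
  // Old versions for shards the latest update did not modify.
  // Empty means every shard is expected to be at `version`.
  std::map<ShardId, ObjectVersion> shard_versions;
};

// Version a shard is expected to hold according to the latest OI.
ObjectVersion expected_version(const ObjectInfoSketch& latest, ShardId shard) {
  auto it = latest.shard_versions.find(shard);
  return it == latest.shard_versions.end() ? latest.version : it->second;
}

// Assumed backfill test: a shard needs backfill when what it holds differs
// from what the latest OI says it should hold.
bool needs_backfill(const ObjectInfoSketch& latest, ShardId shard,
                    ObjectVersion on_shard) {
  return on_shard != expected_version(latest, shard);
}
```

For example, if the latest OI is at version 12 and records version 10 for shard 1, then shard 1 holding version 10 is up to date, whereas shard 0 holding version 10 would need backfill.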

bill-scales · 2025-04-07T14:02:23Z

Rebased - addressed conflict with #61804

bill-scales · 2025-04-07T18:34:24Z

jenkins test make check arm64

bill-scales · 2025-04-07T18:39:17Z

jenkins test api

bill-scales · 2025-04-07T18:40:04Z

jenkins test dashboard cephadm

bill-scales · 2025-04-08T06:08:35Z

jenkins test make check arm64

bill-scales · 2025-04-08T07:41:11Z

jenkins test make check arm64

bill-scales · 2025-04-08T09:12:53Z

jenkins test make check arm64

sseshasa · 2025-04-09T15:15:50Z

Rados Approved.
See https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrackercephcomissues70727

ronen-fr · 2025-04-10T10:02:29Z

This PR (specifically: commit 88ac6d9) breaks the osd-scrub-repair test.
I'll fix the test

bill-scales requested a review from a team as a code owner March 23, 2025 15:05

github-actions bot added the core label Mar 23, 2025

rzarzynski reviewed Mar 26, 2025

markhpc added the performance label Mar 27, 2025

bill-scales force-pushed the ec_data_structs branch from 55ee635 to 98a4e53 Compare March 27, 2025 16:19

rzarzynski approved these changes Mar 28, 2025

rzarzynski added the needs-qa label Mar 28, 2025

aainscow reviewed Mar 30, 2025

yuriw added the wip-yuri13-testing label Mar 31, 2025

athanatos approved these changes Apr 1, 2025

bill-scales force-pushed the ec_data_structs branch from 98a4e53 to 9358262 Compare April 7, 2025 14:00

bill-scales requested review from a team as code owners April 7, 2025 14:00

bill-scales requested review from pecastro and pujaoshahu and removed request for a team April 7, 2025 14:00

github-actions bot added dashboard documentation mgr pybind labels Apr 7, 2025

github-project-automation bot added this to Ceph-Dashboard Apr 7, 2025

github-project-automation bot moved this to New in Ceph-Dashboard Apr 7, 2025

bill-scales and others added 5 commits April 7, 2025 15:00

bill-scales force-pushed the ec_data_structs branch from 9358262 to 88ac6d9 Compare April 7, 2025 14:01

bill-scales mentioned this pull request Apr 7, 2025

osd: EC Optimizations: Backfill changes for partial writes #62710

Merged

14 tasks

aainscow self-requested a review April 8, 2025 06:54

aainscow approved these changes Apr 8, 2025

github-project-automation bot moved this from New to Reviewer approved in Ceph-Dashboard Apr 8, 2025

yuriw merged commit ee1b273 into ceph:main Apr 9, 2025
14 checks passed

github-project-automation bot moved this from Reviewer approved to Done in Ceph-Dashboard Apr 9, 2025

