osd: EC optimizations: new types and additions to data structures #62451

Merged
yuriw merged 5 commits into ceph:main from bill-scales:ec_data_structs on Apr 9, 2025

@bill-scales (Contributor) commented Mar 23, 2025

Add new types and make additions to data structures for the EC optimizations feature.

EC optimized pools do not always update every shard for every I/O. This complicates recovery (where peering uses the pg log and pg info structures to reconcile differences between the shards) and backfill (where peering uses the object_info_t structure to compare the object version numbers of different shards to reconcile differences).

The pg_log_entry_t, pg_info_t and pg_fast_info_t structures are extended to store extra information that will be required for recovery. The object_info_t structure is extended to store extra information that will be required for backfill and scrubbing.

EC optimized pools restrict the selection of the primary to shards that are always updated. The pg_pool_t structure is extended to track the set of shards that are not suitable to become the primary.


@bill-scales bill-scales requested a review from a team as a code owner March 23, 2025 15:05
@github-actions github-actions bot added the core label Mar 23, 2025
@bill-scales
Contributor Author

jenkins test make check arm64

Contributor

@rzarzynski rzarzynski left a comment

Looks fine except a decoding issue and some nits.


void encode(ceph::buffer::list &bl) const {
using ceph::encode;
encode(id, bl);
Contributor

No versioning. I think that's for a wrapper over an integer.

Contributor Author

Correct - it copies what is done for shard_id_t in src/include/types.h
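
A minimal sketch of what an unversioned encoding of a thin integer wrapper looks like. The `toy_shard_id` type and the `byte_buffer` alias are hypothetical stand-ins, not Ceph's `shard_id_t` or `ceph::buffer::list`; the only point illustrated is that the encoded form is the bare integer with no version header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a byte buffer; Ceph uses ceph::buffer::list.
using byte_buffer = std::vector<std::uint8_t>;

// Hypothetical wrapper over a small integer, in the spirit of shard_id_t.
struct toy_shard_id {
  std::int8_t id = 0;

  // Unversioned: the encoded form is exactly the wrapped integer.
  void encode(byte_buffer& bl) const {
    bl.push_back(static_cast<std::uint8_t>(id));
  }

  // Reads one byte at offset `pos` and advances the offset.
  void decode(const byte_buffer& bl, std::size_t& pos) {
    assert(pos < bl.size());
    id = static_cast<std::int8_t>(bl[pos++]);
  }
};
```

Because the layout is a single integer, there is nothing for a version header to describe; a wrapper like this would only need versioning if it later grew extra fields.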


/// EC partial writes: test if a shard is a non-primary
bool is_nonprimary_shard(const shard_id_t shard) const {
return !nonprimary_shards.empty() && nonprimary_shards.contains(shard);
Contributor

ACK, is_nonprimary_shard returns false for empty nonprimary_shards (legacy EC paths).

void pg_info_t::decode(ceph::buffer::list::const_iterator &bl)
{
-  DECODE_START(32, bl);
+  DECODE_START(33, bl);
Contributor

ACK.
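
The change from version 32 to 33 in the decode path can be illustrated with a toy versioned structure. This is only a sketch: `toy_info`, the one-byte version prefix, and the free `encode`/`decode` functions are hypothetical and far simpler than Ceph's encoding macros. It shows a decoder reading a newer field only when the encoded version carries it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using byte_buffer = std::vector<std::uint8_t>;  // hypothetical buffer type

// Hypothetical structure: `extra` plays the role of a field added in v33.
struct toy_info {
  std::uint8_t base = 0;
  std::uint8_t extra = 0;  // keeps its default when decoding older data
};

// Encode with an explicit version byte so old and new layouts can coexist.
inline void encode(const toy_info& v, std::uint8_t struct_v, byte_buffer& bl) {
  bl.push_back(struct_v);
  bl.push_back(v.base);
  if (struct_v >= 33) {
    bl.push_back(v.extra);
  }
}

// Decode: read the version first, then only the fields that version carries.
inline toy_info decode(const byte_buffer& bl) {
  assert(bl.size() >= 2);
  toy_info v;
  const std::uint8_t struct_v = bl[0];
  v.base = bl[1];
  if (struct_v >= 33) {
    assert(bl.size() >= 3);
    v.extra = bl[2];
  }
  return v;
}
```

Data written at version 32 decodes with `extra` left at its default, which is the property that lets a new, empty-by-default field be added without breaking older encodings.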


interval_set<snapid_t> purged_snaps;

std::map<shard_id_t,std::pair<eversion_t,eversion_t>> partial_writes_last_complete; ///< last_complete for shards not modified by a partial write
Contributor

nit: space after ,.

void pg_log_entry_t::decode(ceph::buffer::list::const_iterator &bl)
{
-  DECODE_START_LEGACY_COMPAT_LEN(14, 4, 4, bl);
+  DECODE_START_LEGACY_COMPAT_LEN(15, 4, 4, bl);
Contributor

ACK.

return written_shards.empty() || written_shards.contains(shard);
}
bool is_present_shard(const shard_id_t shard) const {
return present_shards.empty() || present_shards.contains(shard);
Contributor

ACK.

Contributor

@rzarzynski rzarzynski left a comment

LGTM!

void pg_pool_t::decode(ceph::buffer::list::const_iterator& bl)
{
-  DECODE_START_LEGACY_COMPAT_LEN(31, 5, 5, bl);
+  DECODE_START_LEGACY_COMPAT_LEN(32, 5, 5, bl);
Contributor

OK!

@aainscow
Contributor

jenkins test make check

@bill-scales
Contributor Author

jenkins test make check

bill-scales and others added 5 commits April 7, 2025 15:00
Add some extra types required by the EC optimizations code:

raw_shard_id_t is an equivalent type to shard_id_t but is used
for storing raw shards. Strong typing prevents bugs where code
forgets to translate between the two types.

shard_id_map is a mini_flat_map indexed by shard_id_t which will
be used by the EC optimizations I/O path to track updates to
each shard.

shard_id_set is a bitset_set of shard_id_t which is a compact
and fast way of storing a set of shards involved in an EC
operation.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
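
The two ideas in this commit message, strong typing and a bitset-backed shard set, can be sketched as follows. All names here (`toy_shard_id`, `toy_raw_shard_id`, `toy_shard_set`, `max_shards`) are hypothetical and the code is not Ceph's `bitset_set`; it only illustrates the technique under the assumption that shard ids are small non-negative integers.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Two distinct wrapper types over the same integer. They are unrelated
// types, so passing one where the other is expected fails to compile.
struct toy_shard_id { std::int8_t id; };
struct toy_raw_shard_id { std::int8_t id; };

static_assert(!std::is_convertible_v<toy_raw_shard_id, toy_shard_id>,
              "raw and non-raw shard ids must not convert implicitly");

// Hypothetical compact set of shards backed by a fixed-size bitset.
class toy_shard_set {
 public:
  static constexpr std::size_t max_shards = 128;  // assumed upper bound

  void insert(toy_shard_id s) { bits_.set(index(s)); }
  bool contains(toy_shard_id s) const { return bits_.test(index(s)); }
  bool empty() const { return bits_.none(); }
  std::size_t size() const { return bits_.count(); }

 private:
  // Maps a shard id to its bit position; ids are assumed non-negative.
  static std::size_t index(toy_shard_id s) {
    assert(s.id >= 0);
    return static_cast<std::size_t>(s.id);
  }
  std::bitset<max_shards> bits_;
};
```

Membership tests and inserts are single bit operations, which is what makes a bitset a compact and fast representation for the small shard counts of an EC operation.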
EC optimized pools do not update every shard on every I/O. The primary
must have a complete log and requires objects to have up to date object
attributes, so the choice of primary has to be restricted. Shards that
cannot become a primary are listed in the nonprimary_shards set.

For a K+M EC pool with optimizations enabled, the first data shard and all
M coding parity shards are always updated and can become a primary; the
other shards will be marked as nonprimary.

The new set nonprimary_shards stores shards that cannot become the primary.
By default it is an empty set, which retains existing behavior. When
optimizations are enabled on an EC pool this set will be filled in to
restrict the choice of primary.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
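
A small sketch of the rule described in this commit message. It assumes, purely for illustration, that data shards are numbered 0..K-1 and coding shards K..K+M-1, and it uses plain `int` and `std::set` instead of Ceph's shard types. `make_nonprimary_shards` is a hypothetical helper; `is_nonprimary_shard` mirrors the check quoted in the review above.

```cpp
#include <cassert>
#include <set>

// Hypothetical helper: build the nonprimary set for a K+M pool, assuming
// data shards are numbered 0..K-1 and coding shards K..K+M-1.
inline std::set<int> make_nonprimary_shards(int k, int m) {
  assert(k >= 1 && m >= 0);
  std::set<int> nonprimary;
  for (int shard = 1; shard < k; ++shard) {
    nonprimary.insert(shard);  // every data shard other than the first
  }
  return nonprimary;
}

// Mirrors the reviewed check: an empty set means no restriction applies.
inline bool is_nonprimary_shard(const std::set<int>& nonprimary, int shard) {
  return !nonprimary.empty() && nonprimary.count(shard) > 0;
}
```

For a 4+2 pool under these assumptions, shards 1, 2 and 3 would be nonprimary, while shard 0 and the coding shards 4 and 5 remain eligible. An empty set leaves every shard eligible, matching the existing behavior.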
Add partial_writes_last_complete map to pg_info_t and pg_fast_info_t.
For optimized EC pools not all shards receive every log entry. As
log entries are marked completed, the partial writes last complete
map is updated to track shards that did not receive the log entry.

Each map entry stores an eversion range. The first version is the last
completion the shard participated in; the second version tracks subsequent
updates where the shard was not updated. For example, the range 88'10-88'12
means a shard completed update 10 and that updates 11 and 12 intentionally
did not update the shard. This information is used during peering to
distinguish a shard that is missing updates from a shard that intentionally
did not participate in an update to work out what recovery is required.

By default this map is empty indicating that every shard is expected to
participate in an update and have a copy of the log entry.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
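
One way to picture how such a range could be consulted is sketched below. This is an illustrative reading of the commit message, not the peering code: `toy_eversion`, `toy_pwlc_map` and `effective_last_complete` are hypothetical names, and shards are plain integers.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>

// Simplified stand-in for eversion_t: epoch'version, e.g. 88'10.
struct toy_eversion {
  std::uint32_t epoch;
  std::uint64_t version;
};

inline bool operator==(const toy_eversion& a, const toy_eversion& b) {
  return std::tie(a.epoch, a.version) == std::tie(b.epoch, b.version);
}
inline bool operator<(const toy_eversion& a, const toy_eversion& b) {
  return std::tie(a.epoch, a.version) < std::tie(b.epoch, b.version);
}

// shard -> (last completion the shard took part in, last update it skipped)
using toy_pwlc_map = std::map<int, std::pair<toy_eversion, toy_eversion>>;

// Hypothetical: the version a shard can be treated as having completed.
inline toy_eversion effective_last_complete(const toy_pwlc_map& pwlc,
                                            int shard,
                                            toy_eversion shard_last_complete) {
  assert(shard >= 0);
  const auto it = pwlc.find(shard);
  if (it == pwlc.end()) {
    return shard_last_complete;  // no entry: the shard sees every update
  }
  const toy_eversion& from = it->second.first;
  const toy_eversion& to = it->second.second;
  const bool reached_start = !(shard_last_complete < from);
  if (reached_start && shard_last_complete < to) {
    return to;  // the later updates skipped this shard on purpose
  }
  return shard_last_complete;  // behind the range start: really missing updates
}
```

With the range 88'10-88'12 from the example, a shard at 88'10 is treated as complete up to 88'12, while a shard at 88'9 is still behind and would need recovery.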
…nty_t

Add two new sets to the pg_log_entry for use by EC optimization pools.
The written shards set tracks which shards were written to, the
present shards set tracks which shards were in the acting set at the
time of the write.

An empty set (default) is used to indicate all shards. For pools without
allow_ec_optimizations the written set is empty (indicating all shards are
written) and the present set is empty and unused.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
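
The "empty set means all shards" convention can be sketched with a toy log entry. `toy_log_entry` and `count_written` are hypothetical and use `std::set<int>` rather than Ceph's shard set type; the two predicates follow the shape of the `is_written_shard` and `is_present_shard` lines quoted in the review above.

```cpp
#include <cassert>
#include <set>

// Hypothetical, simplified log entry carrying the two shard sets.
struct toy_log_entry {
  std::set<int> written_shards;  // empty means every shard was written
  std::set<int> present_shards;  // empty means every shard was present

  bool is_written_shard(int shard) const {
    return written_shards.empty() || written_shards.count(shard) > 0;
  }
  bool is_present_shard(int shard) const {
    return present_shards.empty() || present_shards.count(shard) > 0;
  }
};

// Hypothetical helper: how many of `total` shards this entry wrote to.
inline int count_written(const toy_log_entry& e, int total) {
  assert(total >= 0);
  int n = 0;
  for (int shard = 0; shard < total; ++shard) {
    if (e.is_written_shard(shard)) {
      ++n;
    }
  }
  return n;
}
```

Treating the empty set as "all" means entries from pools without optimizations need no extra data and behave as before.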
EC optimized pools do not always update every shard for every write I/O,
and this includes not updating the object_info_t (OI attribute). This means
different shards can have OI indicating the object is at different
versions. When an I/O updates a subset of the shards, the OI for the
updated shards will record the old version number for the unmodified
shards in the shard_versions map. The latest OI therefore has a record
of the expected version number for all the shards which can be used to
work out what needs to be backfilled.

An empty shard_versions map implies that the OI attribute should be the
same on all shards.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
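
A sketch of how the latest OI could be used to decide what needs backfilling. It is a simplification under stated assumptions: versions are plain integers rather than eversion_t, shards are plain integers, and `toy_object_info`, `expected_version` and `needs_backfill` are hypothetical names rather than Ceph APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified object info as held by the most recently updated shards.
struct toy_object_info {
  std::uint64_t version;                        // version of the latest OI
  std::map<int, std::uint64_t> shard_versions;  // shard -> older expected version

  // A shard listed in shard_versions is expected to be at the recorded older
  // version; any other shard is expected to match the latest OI version.
  std::uint64_t expected_version(int shard) const {
    const auto it = shard_versions.find(shard);
    return it == shard_versions.end() ? version : it->second;
  }
};

// Hypothetical check: a shard needs backfill if its own OI version differs
// from what the latest OI says it should be.
inline bool needs_backfill(const toy_object_info& latest, int shard,
                           std::uint64_t shard_oi_version) {
  assert(shard >= 0);
  return shard_oi_version != latest.expected_version(shard);
}
```

With an empty `shard_versions` map every shard is expected to carry the same version, which matches the behavior for pools without optimizations.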
@bill-scales
Contributor Author

Rebased - addressed conflict with #61804

@bill-scales
Contributor Author

jenkins test make check arm64

@bill-scales
Contributor Author

jenkins test api

@bill-scales
Contributor Author

jenkins test dashboard cephadm

@bill-scales
Contributor Author

jenkins test make check arm64

@aainscow aainscow self-requested a review April 8, 2025 06:54
@github-project-automation github-project-automation bot moved this from New to Reviewer approved in Ceph-Dashboard Apr 8, 2025
@bill-scales
Contributor Author

jenkins test make check arm64

@sseshasa
Contributor

sseshasa commented Apr 9, 2025

@yuriw yuriw merged commit ee1b273 into ceph:main Apr 9, 2025
14 checks passed
@github-project-automation github-project-automation bot moved this from Reviewer approved to Done in Ceph-Dashboard Apr 9, 2025
@ronen-fr
Contributor

This PR (specifically: commit 88ac6d9) breaks the osd-scrub-repair test.
I'll fix the test

8 participants