osd: EC optimizations: new types and additions to data structures#62451
|
jenkins test make check arm64 |
rzarzynski
left a comment
Looks fine except a decoding issue and some nits.
|
|
||
| void encode(ceph::buffer::list &bl) const { | ||
| using ceph::encode; | ||
| encode(id, bl); |
No versioning. I think that's fine for a wrapper over an integer.
Correct - it copies what is done for shard_id_t in src/include/types.h
|
|
||
| /// EC partial writes: test if a shard is a non-primary | ||
| bool is_nonprimary_shard(const shard_id_t shard) const { | ||
| return !nonprimary_shards.empty() && nonprimary_shards.contains(shard); |
ACK, is_nonprimary_shard returns false for empty nonprimary_shards (legacy EC paths).
| void pg_info_t::decode(ceph::buffer::list::const_iterator &bl) | ||
| { | ||
| DECODE_START(32, bl); | ||
| DECODE_START(33, bl); |
src/osd/osd_types.h
Outdated
|
|
||
| interval_set<snapid_t> purged_snaps; | ||
|
|
||
| std::map<shard_id_t,std::pair<eversion_t,eversion_t>> partial_writes_last_complete; ///< last_complete for shards not modified by a partial write |
| void pg_log_entry_t::decode(ceph::buffer::list::const_iterator &bl) | ||
| { | ||
| DECODE_START_LEGACY_COMPAT_LEN(14, 4, 4, bl); | ||
| DECODE_START_LEGACY_COMPAT_LEN(15, 4, 4, bl); |
| return written_shards.empty() || written_shards.contains(shard); | ||
| } | ||
| bool is_present_shard(const shard_id_t shard) const { | ||
| return present_shards.empty() || present_shards.contains(shard); |
55ee635 to
98a4e53
Compare
| void pg_pool_t::decode(ceph::buffer::list::const_iterator& bl) | ||
| { | ||
| DECODE_START_LEGACY_COMPAT_LEN(31, 5, 5, bl); | ||
| DECODE_START_LEGACY_COMPAT_LEN(32, 5, 5, bl); |
|
jenkins test make check |
|
jenkins test make check |
98a4e53 to
9358262
Compare
Add some extra types required by the EC optimizations code: raw_shard_id_t is an equivalent type to shard_id_t but is used for storing raw shards. Strong typing prevents bugs where code forgets to translate between the two types. shard_id_map is a mini_flat_map indexed by shard_id_t which will be used by the EC optimizations I/O path to track updates to each shard. shard_id_set is a bitset_set of shard_id_t which is a compact and fast way of storing a set of shards involved in an EC operation. Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
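The strong-typing idea in this commit message can be illustrated with a minimal sketch. The types below are simplified stand-ins, not the definitions added by this PR: `shard_set_sketch` and `to_shard` are hypothetical names, the chunk mapping is assumed to be a plain vector, and the 128-shard capacity is an arbitrary choice for the example.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-ins for illustration only; the real Ceph types differ.
struct shard_id_t {
  std::int8_t id;
  explicit constexpr shard_id_t(std::int8_t i) : id(i) {}
  friend constexpr bool operator==(shard_id_t a, shard_id_t b) { return a.id == b.id; }
};

struct raw_shard_id_t {
  std::int8_t id;
  explicit constexpr raw_shard_id_t(std::int8_t i) : id(i) {}
  friend constexpr bool operator==(raw_shard_id_t a, raw_shard_id_t b) { return a.id == b.id; }
};

// Hypothetical translation step: look up the shard for a raw shard in a
// mapping table. Passing a shard_id_t here does not compile, which is the
// class of bug that the separate types are meant to catch.
inline shard_id_t to_shard(raw_shard_id_t raw, const std::vector<shard_id_t>& mapping) {
  return mapping.at(static_cast<std::size_t>(raw.id));
}

// Hypothetical bitset-backed set of shards. Assumes 0 <= id < 128;
// std::bitset throws std::out_of_range for positions outside that range.
class shard_set_sketch {
  std::bitset<128> bits;
public:
  void insert(shard_id_t s) { bits.set(static_cast<std::size_t>(s.id)); }
  bool contains(shard_id_t s) const { return bits.test(static_cast<std::size_t>(s.id)); }
  bool empty() const { return bits.none(); }
  std::size_t size() const { return bits.count(); }
};
```

Because both constructors are `explicit`, a plain integer cannot silently become either kind of shard id, and the two id types cannot be mixed without going through a translation function.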
EC optimizations pools do not update every shard on every I/O. The primary must have a complete log and requires objects to have up-to-date object attributes, so the choice of primary has to be restricted. Shards that cannot become a primary are listed in the nonprimary_shards set. For a K+M EC pool with optimizations enabled, the first data shard and all M coding parity shards are always updated and can become a primary; the other shards are marked as nonprimary. The new set nonprimary_shards stores shards that cannot become the primary. By default it is an empty set, which retains existing behavior. When optimizations are enabled on an EC pool, this set is filled in to restrict the choice of primary. Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
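As a rough sketch of the rule described in this commit message, the following assumes that data shards are numbered 0..K-1 and parity shards K..K+M-1, which ignores any chunk remapping a real pool may apply. `compute_nonprimary_shards` and `primary_candidates` are hypothetical helpers; only the shape of `is_nonprimary_shard` follows the snippet under review.

```cpp
#include <set>
#include <vector>

using shard_t = int;  // simplified stand-in for shard_id_t

// Assumption: shards 0..k-1 hold data and shards k..k+m-1 hold parity.
// With optimizations enabled, every data shard except the first is nonprimary.
inline std::set<shard_t> compute_nonprimary_shards(int k, int m, bool optimizations_enabled) {
  (void)m;  // parity shards are always updated, so none of them is excluded
  std::set<shard_t> nonprimary;
  if (!optimizations_enabled) {
    return nonprimary;  // empty set: legacy behavior, any shard may be primary
  }
  for (shard_t s = 1; s < k; ++s) {
    nonprimary.insert(s);
  }
  return nonprimary;
}

// Same shape as the check in the diff: an empty set never excludes a shard.
inline bool is_nonprimary_shard(const std::set<shard_t>& nonprimary, shard_t s) {
  return !nonprimary.empty() && nonprimary.count(s) != 0;
}

// Hypothetical helper listing the shards that may still become primary.
inline std::vector<shard_t> primary_candidates(int k, int m, const std::set<shard_t>& nonprimary) {
  std::vector<shard_t> out;
  for (shard_t s = 0; s < k + m; ++s) {
    if (!is_nonprimary_shard(nonprimary, s)) {
      out.push_back(s);
    }
  }
  return out;
}
```

For a 4+2 pool under these assumptions, shards 1, 2 and 3 are nonprimary and shards 0, 4 and 5 remain eligible.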
Add partial_writes_last_complete map to pg_info_t and pg_fast_info_t. For optimized EC pools not all shards receive every log entry. As log entries are marked completed, the partial writes last complete map is updated to track shards that did not receive the log entry. Each map entry stores an eversion range. The first version is the last completion the shard participated in; the second version tracks subsequent updates where the shard was not updated. For example, the range 88'10-88'12 means a shard completed update 10 and that updates 11 and 12 intentionally did not update the shard. This information is used during peering to distinguish a shard that is missing updates from a shard that intentionally did not participate in an update, in order to work out what recovery is required. By default this map is empty, indicating that every shard is expected to participate in an update and have a copy of the log entry. Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
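The 88'10-88'12 example can be traced with a small sketch. This is an illustration of the described bookkeeping under assumptions, not the PR's implementation: `eversion`, `record_completion` and `shard_is_up_to_date` are hypothetical, and clearing the entry when a shard is written is an assumed policy.

```cpp
#include <map>
#include <tuple>
#include <utility>

// Simplified stand-in for eversion_t: (epoch, version), printed as epoch'version.
struct eversion {
  unsigned epoch;
  unsigned long long version;
};
inline bool operator==(const eversion& a, const eversion& b) {
  return std::tie(a.epoch, a.version) == std::tie(b.epoch, b.version);
}
inline bool operator<(const eversion& a, const eversion& b) {
  return std::tie(a.epoch, a.version) < std::tie(b.epoch, b.version);
}

using version_range = std::pair<eversion, eversion>;
using pwlc_map = std::map<int, version_range>;  // shard -> (last participated, last skipped)

// Hypothetical update step, run per shard as a log entry completes.
// Assumed policy: a written shard needs no entry; a skipped shard either
// opens a range starting at the previous completion or extends its range.
inline void record_completion(pwlc_map& pwlc, int shard, bool shard_written,
                              eversion previous, eversion completed) {
  if (shard_written) {
    pwlc.erase(shard);
    return;
  }
  auto it = pwlc.find(shard);
  if (it == pwlc.end()) {
    pwlc.emplace(shard, version_range{previous, completed});
  } else {
    it->second.second = completed;
  }
}

// Hypothetical peering check: a shard behind the authoritative version is
// still considered up to date if its range covers the gap.
inline bool shard_is_up_to_date(const pwlc_map& pwlc, int shard,
                                eversion shard_last_update, eversion authoritative) {
  if (shard_last_update == authoritative) {
    return true;
  }
  auto it = pwlc.find(shard);
  if (it == pwlc.end()) {
    return false;  // no record of intentional skips: the shard is missing updates
  }
  return !(shard_last_update < it->second.first) && !(it->second.second < authoritative);
}
```

In this sketch, a shard that wrote update 88'10 and then skipped 88'11 and 88'12 ends up with the range 88'10-88'12, while a shard with no map entry that is at 88'10 would be treated as missing two updates.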
…nty_t Add two new sets to the pg_log_entry for use by EC optimization pools. The written shards set tracks which shards were written to, the present shards set tracks which shards were in the acting set at the time of the write. An empty set (default) is used to indicate all shards. For pools without allow_ec_optimizations the written set is empty (indicating all shards are written) and the present set is empty and unused. Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
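The "empty set means all shards" convention can be sketched as follows. The struct is a hypothetical stand-in that uses `std::set<int>` in place of the PR's shard set type; the two accessors mirror the checks shown in the diff.

```cpp
#include <set>

// Hypothetical stand-in for the two new pg_log_entry_t fields.
struct log_entry_sketch {
  std::set<int> written_shards;  // empty: every shard was written
  std::set<int> present_shards;  // empty: every shard was in the acting set

  bool is_written_shard(int shard) const {
    return written_shards.empty() || written_shards.count(shard) != 0;
  }
  bool is_present_shard(int shard) const {
    return present_shards.empty() || present_shards.count(shard) != 0;
  }
};
```

A default-constructed entry therefore reports every shard as written and present, which matches the behavior for pools without allow_ec_optimizations.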
EC optimized pools do not always update every shard for every write I/O; this includes not updating the object_info_t (OI attribute). This means different shards can have OI indicating the object is at different versions. When an I/O updates a subset of the shards, the OI for the updated shards will record the old version number for the unmodified shards in the shard_versions map. The latest OI therefore has a record of the expected version number for all the shards, which can be used to work out what needs to be backfilled. An empty shard_versions map implies that the OI attribute should be the same on all shards. Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
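A minimal sketch of how a shard_versions map could be maintained and consulted is shown below. It is an assumption-laden illustration rather than the PR's code: `oi_sketch`, `apply_write`, `expected_version` and `needs_backfill` are hypothetical, versions are reduced to plain integers, and dropping a shard's entry once it is written again is an assumed policy.

```cpp
#include <cstdint>
#include <map>
#include <set>

// Hypothetical stand-in for the relevant parts of object_info_t.
struct oi_sketch {
  std::uint64_t version = 0;
  std::map<int, std::uint64_t> shard_versions;  // shard -> older version it is expected to hold
};

// Apply a write that touches only `written`. Unmodified shards keep the
// oldest version recorded for them; written shards no longer need an entry.
inline void apply_write(oi_sketch& oi, std::uint64_t new_version,
                        const std::set<int>& written, int total_shards) {
  for (int s = 0; s < total_shards; ++s) {
    if (written.count(s) != 0) {
      oi.shard_versions.erase(s);
    } else {
      // emplace does not overwrite, so an already stale shard keeps its version
      oi.shard_versions.emplace(s, oi.version);
    }
  }
  oi.version = new_version;
}

// Version the latest OI expects a shard to hold; shards without an entry
// are expected to match the OI's own version.
inline std::uint64_t expected_version(const oi_sketch& oi, int shard) {
  auto it = oi.shard_versions.find(shard);
  return it == oi.shard_versions.end() ? oi.version : it->second;
}

// A shard needs backfill when it holds something other than the expected version.
inline bool needs_backfill(const oi_sketch& oi, int shard, std::uint64_t shard_version) {
  return shard_version != expected_version(oi, shard);
}
```

With six shards starting at version 10, a write of version 11 to shards 0, 4 and 5 leaves shards 1 to 3 expected at version 10, so a shard 2 that still reports version 10 does not need backfill.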
9358262 to
88ac6d9
Compare
|
Rebased - addressed conflict with #61804 |
|
jenkins test make check arm64 |
|
jenkins test api |
|
jenkins test dashboard cephadm |
|
jenkins test make check arm64 |
|
jenkins test make check arm64 |
|
Rados Approved. |
|
This PR (specifically: commit 88ac6d9) breaks the osd-scrub-repair test. |
Add new types and make additions to data structures for the EC optimizations feature.
EC optimized pools do not always update every shard for every I/O. This complicates recovery (where peering uses the pg log and pg info structures to reconcile differences between the shards) and backfill (where peering uses the object_info_t structure to compare the object version numbers of different shards to reconcile differences).
The pg_log_entry_t, pg_info_t and pg_fast_info_t structures are extended to store extra information that will be required for recovery. The object_info_t structure is extended to store extra information that will be required for backfill and scrubbing.
EC optimized pools restrict the selection of the primary to shards that are always updated. The pg_pool_t structure is extended to track the set of shards that are not suitable to become the primary.