osd/PGLog.cc: Trim duplicates by number of entries (#45529)
src/osd/PGLog.cc (outdated diff):

```diff
 }

-while (!dups.empty()) {
+while (!dups.empty() && dups.size() > cct->_conf->osd_pg_log_dups_tracked) {
```
nit: there seems to be no need for the !dups.empty() check now; dups.size() > cct->_conf->osd_pg_log_dups_tracked would be enough.
This looks good. Will this PR also add dup trimming to ceph-objectstore-tool? (to fix OSDs which cannot be started)
(force-pushed 6416d56 to 9693c98)
@NitzanMordhai I think we should take care of the ceph-objectstore-tool change in this PR.
Working on it
@jdurgin / @neha-ojha please review the tools
(force-pushed e2d2d22 to 9efb207)
trociny left a comment:
Wouldn't the osd_pg_log_dups_tracked config setting make the ceph-objectstore-tool trim log result confusing for users in cases when dups are not trimmed because osd_pg_log_dups_tracked is large?
Probably we need to output some warning? Or maybe reset conf->osd_pg_log_dups_tracked to 0 in ceph-objectstore-tool before calling pg_log.trim?
```diff
 pg_log.head = info.last_update;
 pg_log.skip_can_rollback_to_to_head();

+ceph_assert(info.last_update.version > max_entries);
```
may we have this assert after we removed the if (info.last_update.version - info.log_tail.version <= max_entries) check? max_entries comes from the config param and may be as big as a user wants. I suppose we need to cap it if it is more than info.last_update.version.
@trociny We may have duplicate entries that we need to trim. The check I removed skipped the whole process if we knew in advance that we had nothing to trim, but in the case of duplicate entries we don't know that until we reconstruct the PGLog.
I understand why you removed the check. My point is that after its removal, the ceph_assert(info.last_update.version > max_entries) may fail. And we still need the condition info.last_update.version > max_entries to be valid (i.e. we can't just remove the assert), because we use info.last_update.version - max_entries below. So my idea is that we should set max_entries not just to conf->osd_max_pg_log_entries but to something like min(conf->osd_max_pg_log_entries, info.last_update.version - 1) (I have not checked whether last_update.version may be 0, though).
@trociny I did it differently, please review.
src/tools/ceph_objectstore_tool.cc (outdated diff):

```cpp
pg_log_dup_t dup;
try {
  dup.decode(bp);
  pg_log.dups.push_back(dup);
```
Don't we want to check if (pg_log.dups.size() >= trim_at_once)? Someone might want to do this on a cluster with millions of dups created when running an older version.
@trociny with the new change (the other commit in this PR for the PGLog trim) we will trim the duplicate entries by osd_pg_log_dups_tracked. I don't think trim_at_once (osd_pg_log_trim_max) is for duplicate entries; it is only for pg log entries.
But if we don't limit the number of duplicate entries trimmed in one transaction in any way (as we do for log entries), I expect we may have problems if their number is large?
We do, by osd_pg_log_dups_tracked
osd_pg_log_dups_tracked is the analog of osd_max_pg_log_entries (for log entries), i.e. it says how many entries we want to keep, while osd_pg_log_trim_max says how many entries we want to trim in one operation, so that there is a limit and we avoid a situation where the trim operation is too heavy.
@NitzanMordhai consider that your implementation here will only ever be needed for PGs with millions of dups, when the OSD cannot boot normally due to OOM.
One user showed an OSD with tens of PGs having 1.6 million dups each, and at least one PG with 3.2M dups.
Indeed, the implementation here needs to focus on that scenario: being low on memory and splitting the bluestore operations.
(force-pushed 9efb207 to d751600)
src/tools/ceph_objectstore_tool.cc (outdated diff):

```diff
-uint64_t max_entries = g_ceph_context->_conf->osd_max_pg_log_entries;
+uint64_t max_entries = std::min<uint64_t>(g_ceph_context->_conf->osd_max_pg_log_entrie,
+                                          g_ceph_context->_conf->osd_pg_log_trim_max);
```
I don't understand this. osd_max_pg_log_entries is how many entries we want to keep and osd_pg_log_trim_max is how many entries we want to trim in one iteration of the while (!done) loop below. How can you mix these in one variable?
src/tools/ceph_objectstore_tool.cc (outdated diff):

```cpp
ObjectStore::Transaction t;
if (!trimmed.empty()) {
  cout << "Removing keys " << *trimmed.begin() << " - " << *trimmed.rbegin() << std::endl;
```
nit: if we are changing this, I would prefer messages like this to go to cerr, not cout, because they are more like log messages. The same goes for the other messages below.
src/tools/ceph_objectstore_tool.cc (outdated diff):

```cpp
  break;
// We have enough pg logs entries to trim
if ((pg_log.log.size() + pg_log.dups.size()) >= trim_at_once) {
  done = true;
```
I don't think we want to set done here. I think we want pg_log.trim to be called multiple times, being reset after processing a chunk, until all the keys are processed.
Alternatively, we could avoid breaking here and instead collect all potential entries to trim into pg_log, then call pg_log.trim only once, obtaining the full lists of the trimmed keys, and then call t.omap_rmkeys multiple times in trim_at_once chunks. The problem with that scenario, though, is that pg_log.trim may build very large trimmed and trimmed_dups lists, which we probably cannot allow.
Just FYI, on one cluster we have a pg with 31 million dups. I think we will want to try ceph-objectstore-tool patched with this PR (when it is ready) to trim it.
Another question to raise: what will happen when a user that has a problem like ours, i.e. 30 million dups in a pg, but is not aware of it, upgrades to the version with the fixed pg_log trim, and it starts trimming? Am I right in understanding that with the current implementation the trim will build the full set of 30 million dups and will try to remove it in one transaction? And it looks like, if one is unlucky enough to have the debug osd level set to 6 or higher (even if it is just for the memory buffer), it will try to print the …
(force-pushed d751600 to bf283f0)
OK, I misunderstood osd_pg_log_trim_max; now I get it.
trociny left a comment:
@NitzanMordhai I have some questions about your ceph-objectstore-tool trim implementation (see the comments), but my main concern is still how the osd pg_log trim will behave after the upgrade when it needs to trim millions of dups. And if it is a real problem (as I currently think it is), then it should be resolved first, and then the ceph-objectstore-tool trim implementation might be updated accordingly.
src/tools/ceph_objectstore_tool.cc (outdated diff):

```diff
-uint64_t max_entries = g_ceph_context->_conf->osd_max_pg_log_entries;
+uint64_t max_entries = std::min<uint64_t>(g_ceph_context->_conf->osd_max_pg_log_entries,
+                                          info.last_update.version - 1);
```
I hope it is safe to assume that info.last_update.version will always be > 0 here...
src/tools/ceph_objectstore_tool.cc (outdated diff):

```cpp
pg_log.trim(g_ceph_context, trim_to, &trimmed, &trimmed_dups, &write_from_dups);
// Is there any trimming to do?
if (!trimmed.empty() || !trimmed_dups.empty()) {
  new_tail = pg_log.log.back().version;
```
Isn't it possible that pg_log.log may be empty here?
Yes, we can use pg_log.tail for that, since trim will set it.
src/tools/ceph_objectstore_tool.cc (outdated diff):

```cpp
e.decode_with_checksum(bp);
if (e.version > trim_to) {
  continue;
}
```
That looks weird. It seems the idea was to use pg_log.trim in order not to do the trimming manually, but we still need this.
I thought we could stop collecting entries, but yes, we can let the trim function do it for us.
src/tools/ceph_objectstore_tool.cc (outdated diff):

```cpp
if (!dry_run && new_tail != eversion_t()) {
  info.log_tail = new_tail;
  if (latest_dups.size() > 0) {
    info.log_tail = latest_dups.back().version;
```
This looks suspicious to me. Isn't it possible to have a situation when latest_dups.size() > 0 but info.log_tail should be set to new_tail?
We need to always set it to new_tail; it has nothing to do with dups. Fixed.
(force-pushed 7628ca8 to 2050279)
jenkins render docs
```diff
 ceph_assert(info.last_update.version > max_entries);
-version_t trim_to = info.last_update.version - max_entries;
+eversion_t trim_to(info.last_update.epoch, info.last_update.version - max_entries);
```
Using eversion_t like this will trim many more entries, since it's sorted by epoch first, then version. I don't think it matters in practice in this case, since this is an emergency recovery tool.
@jdurgin So do you think my concern [1, 2] about how pg log trim would behave after upgrading an osd with many millions of dups is not valid? Just for information, we have been running ceph-objectstore-tool built with this PR to trim a pg with 32 million dups. After it had been running for almost a day, we interrupted it to check the progress. 15 million still remained. We restarted it yesterday and it is still running. Unfortunately, we can't check how an upgraded osd with so many dups would behave on this cluster, as it is a production cluster. [1] #45529 (comment)
Are you sure? Some users have OSDs consuming a few GBs of RAM that are still running. If they upgrade, all OSDs hosting a PG will trim at the same time. If ::trim takes a long time, the OSDs and the PG will be marked down.
```cpp
);
if (info.last_update.version - info.log_tail.version <= max_entries) {
  cerr << "Log not larger than osd_max_pg_log_entries " << max_entries << std::endl;
  return 0;
```
On a PG with 18M entries, we get:

```
Log bounds are: (1314488'70894144,1314494'70896562]
Log not larger than osd_max_pg_log_entries 10000
Finished trimming pg log
```

Can we also check len(dups) before aborting the trim?
@NitzanMordhai We are testing using an octopus build from @trociny. It took ~10 minutes to trim 18M entries :-) This osd would OOM with > 350GB ram before trimming. Now it starts.
I see; in that case a trim max in online trimming is needed. Thanks for explaining! Trimming everything at load_pgs time would potentially delay startup too long and cause availability problems, so I think the trim-max approach is better. It does rely on I/O to trigger it, though.
@jdurgin I had a discussion with @neha-ojha today regarding that. Should we do it in another PR, or do we need it in this PR? There are a few users waiting for it, and if we want to include it in the next release we should get it out.
@NitzanMordhai as one of those users with large clusters (and already being greatly impacted by the issue, #45529 (comment)), I would like, if possible, for it to be released as soon as possible with the next patch.
We are waiting for this to be in master like water in the desert... Thank you for your great job!
(force-pushed 2050279 to 8fc4a63)
OK, I made a small change in this PR to support OSDs that would OOM during start: we won't load all the dup entries, only the latest ones up to osd_pg_log_dups_tracked.
(force-pushed 8fc4a63 to b40738f)
jdurgin left a comment:
Only loading the tracked dups off disk sounds good for the online solution. It means a small amount of space waste, but that should be negligible and can be worked around with the ceph-objectstore-tool if desired.
```cpp
}
dups.push_back(dup);
if (dups.size() >= g_ceph_context->_conf->osd_pg_log_dups_tracked) {
  dups.pop_front();
```
It is a bit unfortunate that on every start we will need to waste time (it was up to 10 minutes in our case) reading all dups, unless the user runs ceph-objectstore-tool to trim the log manually. And the user will not have any clue what is going on.
Maybe we should add a counter here and log at least a warning message, with a recommendation to trim the log manually, when the counter is large enough?
@trociny if the osd stops again, we will save the new duplicate log, so we will load the trimmed list next time.
So all the old dup keys will be removed from rocksdb on the save? I was not aware of that. Then I think there is no problem.
(force-pushed b40738f to d8a4c5e)
Adding duplicate entries trimming to the trim-pg-log operation. We will use the existing PGLog trim function to find the set of entries/dup entries that we are supposed to trim; to use it we need to build the PGLog from disk. Fixed: https://tracker.ceph.com/issues/53729 Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
src/osd/PGLog.h (outdated diff):

```cpp
  ceph_assert(dups.back().version < dup.version);
}
dups.push_back(dup);
if (dups.size() >= g_conf()->osd_pg_log_dups_tracked) {
```
PGLog needs to trim duplicates by the number of entries rather than the versions. That way, we prevent unbounded duplicate growth. Fixed: https://tracker.ceph.com/issues/53729 Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
(force-pushed d8a4c5e to 0d253bc)
jenkins test api
jenkins test windows
quincy: revert backport of #45529 Reviewed-by: Laura Flores <lflores@redhat.com>
PGLog needs to trim duplicates by the number of entries rather than the versions. That way, we prevent unbounded duplicate growth.
Fixed: https://tracker.ceph.com/issues/53729
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>