
osd/PGLog.cc: Trim duplicates by number of entries#45529

Merged
yuriw merged 2 commits into ceph:master from NitzanMordhai:wip-nitzan-pglog-dups-not-trimmed
May 11, 2022

Conversation

@NitzanMordhai
Contributor

@NitzanMordhai NitzanMordhai commented Mar 21, 2022

PGLog needs to trim duplicates by the number of entries rather than the versions. That way, we prevent unbounded duplicate growth.

Fixes: https://tracker.ceph.com/issues/53729
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

src/osd/PGLog.cc Outdated
}

-  while (!dups.empty()) {
+  while (!dups.empty() && dups.size() > cct->_conf->osd_pg_log_dups_tracked) {
Contributor

nit: there seems to be no need for the !dups.empty() check now; dups.size() > cct->_conf->osd_pg_log_dups_tracked would be enough.

Contributor Author

@trociny agree, done

@dvanders
Contributor

This looks good. Will this PR also add dup trimming to ceph-objectstore-tool? (to fix OSDs which cannot be started)

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pglog-dups-not-trimmed branch from 6416d56 to 9693c98 Compare March 22, 2022 08:42
@neha-ojha
Member

This looks good. Will this PR also add dup trimming to ceph-objectstore-tool? (to fix OSDs which cannot be started)

@NitzanMordhai I think we should take care of the ceph-objectstore-tool change in this PR

@NitzanMordhai
Contributor Author

@NitzanMordhai I think we should take care of the ceph-objectstore-tool change in this PR

Working on it

@NitzanMordhai
Contributor Author

@jdurgin \ @neha-ojha please review the tools

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pglog-dups-not-trimmed branch from e2d2d22 to 9efb207 Compare March 24, 2022 11:56
Contributor

@trociny trociny left a comment


Wouldn't the osd_pg_log_dups_tracked config setting make the ceph-objectstore-tool trim log result confusing for users in cases when dups are not trimmed because osd_pg_log_dups_tracked is large?

Should we probably output some warning? Or maybe reset conf->osd_pg_log_dups_tracked to 0 in ceph-objectstore-tool before calling pg_log.trim?

pg_log.head = info.last_update;
pg_log.skip_can_rollback_to_to_head();

ceph_assert(info.last_update.version > max_entries);
Contributor

Can we keep this assert after removing the if (info.last_update.version - info.log_tail.version <= max_entries) check? max_entries comes from a config param and may be as big as the user wants. I suppose we need to cap it if it is larger than info.last_update.version.

Contributor Author

@trociny We may have duplicate entries that we need to trim. The check I removed skipped the whole process when we knew in advance that we had nothing to trim, but in the case of duplicate entries we don't know that until we reconstruct the PGLog.

Contributor

I understand why you removed the check. My point is that after its removal, the ceph_assert(info.last_update.version > max_entries) may fail. And we still need the condition info.last_update.version > max_entries to be valid (i.e. we can't just remove the assert), because we use info.last_update.version - max_entries below. So my idea is that we should set max_entries not just to conf->osd_max_pg_log_entries but to something like min(conf->osd_max_pg_log_entries, info.last_update.version - 1) (I have not checked whether last_update.version may be 0 though).

Contributor Author

@trociny I did it differently, please review

pg_log_dup_t dup;
try {
dup.decode(bp);
pg_log.dups.push_back(dup);
Contributor

Don't we want to check if (pg_log.dups.size() >= trim_at_once)? Someone might want to do this on a cluster with millions of dups created while running an older version.

Contributor Author

@trociny with the new change (the other commit in this PR for PGLog trim) we will trim the duplicate entries by osd_pg_log_dups_tracked. I don't think trim_at_once (osd_pg_log_trim_max) is for duplicate entries; it is only for pg log entries.

Contributor

@trociny trociny Mar 24, 2022

But if we don't limit the number of duplicate entries trimmed in one transaction in any way (as we do for log entries), I expect we may have problems if their number is large?

Contributor Author

We do, by osd_pg_log_dups_tracked

Contributor

@trociny trociny Mar 24, 2022

osd_pg_log_dups_tracked is the analog of osd_max_pg_log_entries (for log entries), i.e. it says how many entries we want to keep, while osd_pg_log_trim_max says how many entries we want to trim in one operation, to have a limit and avoid a situation where the trim operation is too heavy.

Contributor

@dvanders dvanders Mar 24, 2022

@NitzanMordhai consider that your implementation here will only ever be needed for PGs with millions of dups, when the OSD cannot boot normally due to OOM.

One user showed an OSD with tens of PGs having 1.6 million dups each, and at least one PG with 3.2M dups.

Indeed the implementation here needs to focus on that scenario... stay low on memory and split the bluestore operations.

Contributor Author

@dvanders @trociny got it, I think I need to change the test for osd_pg_log_trim_max and move it, and compare it to pg_log.log.size() + pg_log.dups.size()

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pglog-dups-not-trimmed branch from 9efb207 to d751600 Compare March 24, 2022 19:15
@dvanders dvanders requested a review from ifed01 March 24, 2022 19:22

-uint64_t max_entries = g_ceph_context->_conf->osd_max_pg_log_entries;
+uint64_t max_entries = std::min<uint64_t>(g_ceph_context->_conf->osd_max_pg_log_entrie,
+                                          g_ceph_context->_conf->osd_pg_log_trim_max);
Contributor

I don't understand this. osd_max_pg_log_entrie is how many entries we want to keep and osd_pg_log_trim_max is how many entries we want to trim in one iteration of the while (!done) loop below. How can you mix these in one variable?


ObjectStore::Transaction t;
if (!trimmed.empty()) {
cout << "Removing keys " << *trimmed.begin() << " - " << *trimmed.rbegin() << std::endl;
Contributor

nit: if we are changing this, I would prefer messages like this to go to cerr, not cout, because they are more like log messages. The same goes for the other messages below.

break;
// We have enough pg logs entries to trim
if ((pg_log.log.size() + pg_log.dups.size()) >= trim_at_once) {
done = true;
Contributor

I don't think we want to set done here. I think we want pg_log.trim to be called multiple times, being reset after processing a chunk, until all the keys are processed.

Alternatively, we could not break here, collecting instead all potential entries to trim into pg_log. Then call pg_log.trim only once, obtaining the full lists of the trimmed keys, and then call t.omap_rmkeys multiple times in trim_at_once chunks. The problem though is that in this scenario pg_log.trim may build very large trimmed and trimmed_dups lists, and it seems we may not allow that.

@trociny
Contributor

trociny commented Mar 25, 2022

Just FYI, on one cluster we have a pg with 31 million dups:

root@one1-ceph2:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-20 --op log --pgid 2.88 > pg_log.2.88.json
root@one1-ceph2:~# jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)' < pg_log.2.88.json
3852
31282035

I think we will want to try ceph-objectstore-tool patched with this PR (when it is ready) to trim.

@trociny
Contributor

trociny commented Mar 26, 2022

Another question to raise. What will happen when a user that has a problem like ours, i.e. 30 million dups in a pg, but is not aware of it, upgrades to the version with the fixed pg_log trim, and it starts trimming? Am I right in understanding that with the current implementation the trim will build the full set of 30 million dups and will try to remove it in one transaction?

And it looks like if one is unlucky enough to have debug osd level set to 6 or higher (even if it is just for memory buffer) it will try to print the trimmed_dups set in PGLog::write_log_and_missing.

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pglog-dups-not-trimmed branch from d751600 to bf283f0 Compare March 28, 2022 11:37
@NitzanMordhai
Contributor Author

Ok, I misunderstood osd_pg_log_trim_max; now I got it correctly.
We don't know the number of dups in advance, and we need to keep the last N dup entries (osd_pg_log_dups_tracked). To do it in one pass over the on-disk log, I added a deque that fills until it reaches N entries; when a new entry is found we push it and pop the oldest one (which will be trimmed later).

Contributor

@trociny trociny left a comment

@NitzanMordhai I have some questions about your ceph-objectstore-tool trim implementation (see the comments), but my main concern is still how the osd pg_log trim will behave after the upgrade when it needs to trim millions of dups. And if it is a real problem (as I currently think), then it should be resolved first, and the ceph-objectstore-tool trim implementation might then be updated accordingly.


-uint64_t max_entries = g_ceph_context->_conf->osd_max_pg_log_entries;
+uint64_t max_entries = std::min<uint64_t>(g_ceph_context->_conf->osd_max_pg_log_entries,
+                                          info.last_update.version - 1);
Contributor

I hope it is safe to assume that info.last_update.version will always be > 0 here...

pg_log.trim(g_ceph_context, trim_to, &trimmed, &trimmed_dups, &write_from_dups);
// Is there any trimming to do?
if (!trimmed.empty() || !trimmed_dups.empty()) {
new_tail = pg_log.log.back().version;
Contributor

Isn't it possible that pg_log.log may be empty here?

Contributor Author

@NitzanMordhai NitzanMordhai Mar 29, 2022

Yes, we can use pg_log.tail for that, since trim will set it.

e.decode_with_checksum(bp);
if (e.version > trim_to) {
continue;
}
Contributor

That looks weird. It seems the idea was to use pg_log.trim in order not to do the trimming manually, but we still need this.

Contributor Author

I thought we could stop collecting entries, but yes, we can let the trim function do it for us.

if (!dry_run && new_tail != eversion_t()) {
info.log_tail = new_tail;
if (latest_dups.size() > 0) {
info.log_tail = latest_dups.back().version;
Contributor

This looks suspicious to me. Isn't a situation possible where latest_dups.size() > 0 but info.log_tail should be set to new_tail?

Contributor Author

We need to always set it to new_tail; it has nothing to do with dups. Fixed.

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pglog-dups-not-trimmed branch 2 times, most recently from 7628ca8 to 2050279 Compare March 29, 2022 08:00
@NitzanMordhai NitzanMordhai requested a review from trociny March 31, 2022 07:13
@NitzanMordhai
Contributor Author

jenkins render docs


 ceph_assert(info.last_update.version > max_entries);
-version_t trim_to = info.last_update.version - max_entries;
+eversion_t trim_to(info.last_update.epoch, info.last_update.version - max_entries);
Member

Using eversion_t like this will trim many more entries since it's sorted by epoch, then version. I don't think it matters in practice in this case, though, since this is an emergency recovery tool.

@trociny
Contributor

trociny commented Apr 6, 2022

@jdurgin So do you think my concern [1, 2] about how pg log trim would behave after upgrading an osd with many millions of dups is not valid?

Just as information, we have been running ceph-objectstore-tool built with this PR to trim a pg with 32 million dups. After it had been running for almost a day, we interrupted it to check the progress; 15 million still remained. We restarted it yesterday and it is still running. Setting --osd_pg_log_trim_max=100000 instead of the default 10000 could probably speed it up, but I just noticed that I made a mistake when restarting the tool (missed one 0 in that setting), so it is still running with the default. This is not checked yet; I suppose we will have an opportunity to check on other osds.

Unfortunately, we can't check how an upgraded osd with so many dups would behave on this cluster, as it is a production cluster.

[1] #45529 (comment)
[2] #45529 (review)

@dvanders
Contributor

dvanders commented Apr 8, 2022

@jdurgin So do you think my concern [1, 2] about how pg log trim would behave after upgrading an osd with many millions of dups is not valid?

I think any OSD that ends up hitting this will quickly go OOM, and need offline recovery, so it doesn't seem necessary to handle in the online trimming.

Are you sure? Some users have OSDs consuming a few GBs of RAM, still running. If they upgrade, all OSDs hosting a PG will trim at the same time. If ::trim takes a long time, the OSDs and PG will be marked down.
We might need a trim_max in ::trim, or to trim the PG log during load_pgs before boot.
Yeah, Mykola echoed the same.

);
if (info.last_update.version - info.log_tail.version <= max_entries) {
cerr << "Log not larger than osd_max_pg_log_entries " << max_entries << std::endl;
return 0;
Contributor

On a PG with 18M entries, we get

Log bounds are: (1314488'70894144,1314494'70896562]
Log not larger than osd_max_pg_log_entries 10000
Finished trimming pg log

Can we also check len(dups) before aborting the trim?


@dvanders
Copy link
Contributor

dvanders commented Apr 8, 2022

@NitzanMordhai We are testing using an octopus build from @trociny

Here is the output. It took ~10 minutes to trim 18M entries :-)

root@ceph-r6-5143:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-742/ --op log --pgid 3.10 | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'
2418
18131032

root@ceph-r6-5143:~# CEPH_ARGS="--osd_pg_log_trim_max=500000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=. ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-742 --op trim-pg-log --pgid 3.10
Log bounds are: (1314488'70894144,1314494'70896562]
Removing keys 0001314488.00000000000070894145 - 0001314493.00000000000070896510
Removing Dups dup_0001073877.00000000000092039960 - dup_0001097203.00000000000004724151
Removing Dups dup_0001097203.00000000000004724152 - dup_0001099439.00000000000005221099
Removing Dups dup_0001099441.00000000000005221100 - dup_0001101713.00000000000005718047
Removing Dups dup_0001101713.00000000000005718048 - dup_0001104044.00000000000006214995
Removing Dups dup_0001104044.00000000000006214996 - dup_0001104427.00000000000006711943
Removing Dups dup_0001104427.00000000000006711944 - dup_0001104427.00000000000007208891
Removing Dups dup_0001104427.00000000000007208892 - dup_0001104822.00000000000007705839
Removing Dups dup_0001104822.00000000000007705840 - dup_0001109927.00000000000008202787
Removing Dups dup_0001109927.00000000000008202788 - dup_0001115297.00000000000008699735
Removing Dups dup_0001115297.00000000000008699736 - dup_0001116223.00000000000009196683
Removing Dups dup_0001116223.00000000000009196684 - dup_0001116887.00000000000009693631
Removing Dups dup_0001116887.00000000000009693632 - dup_0001117277.00000000000010190579
Removing Dups dup_0001117277.00000000000010190580 - dup_0001118505.00000000000010687527
Removing Dups dup_0001118505.00000000000010687528 - dup_0001123919.00000000000011184475
Removing Dups dup_0001123919.00000000000011184476 - dup_0001126117.00000000000011681423
Removing Dups dup_0001126117.00000000000011681424 - dup_0001130231.00000000000012178371
Removing Dups dup_0001130231.00000000000012178372 - dup_0001131560.00000000000012675319
Removing Dups dup_0001131560.00000000000012675320 - dup_0001136888.00000000000013172267
Removing Dups dup_0001136888.00000000000013172268 - dup_0001142999.00000000000013669215
Removing Dups dup_0001142999.00000000000013669216 - dup_0001146940.00000000000014166163
Removing Dups dup_0001146940.00000000000014166164 - dup_0001149293.00000000000014663111
Removing Dups dup_0001149293.00000000000014663112 - dup_0001149704.00000000000015160059
Removing Dups dup_0001149704.00000000000015160060 - dup_0001150128.00000000000015657007
Removing Dups dup_0001150128.00000000000015657008 - dup_0001150952.00000000000016153955
Removing Dups dup_0001150952.00000000000016153956 - dup_0001151361.00000000000016650903
Removing Dups dup_0001151361.00000000000016650904 - dup_0001160712.00000000000017147851
Removing Dups dup_0001160712.00000000000017147852 - dup_0001165684.00000000000017644799
Removing Dups dup_0001165684.00000000000017644800 - dup_0001169427.00000000000018141747
Removing Dups dup_0001169427.00000000000018141748 - dup_0001171615.00000000000018638695
Removing Dups dup_0001171615.00000000000018638696 - dup_0001173129.00000000000019135643
Removing Dups dup_0001173129.00000000000019135644 - dup_0001179813.00000000000021718244
Removing Dups dup_0001179813.00000000000021718245 - dup_0001186083.00000000000022400383
Removing Dups dup_0001186083.00000000000022400384 - dup_0001186495.00000000000022897331
Removing Dups dup_0001186495.00000000000022897332 - dup_0001187291.00000000000023394279
Removing Dups dup_0001187291.00000000000023394280 - dup_0001190921.00000000000023929837
Removing Dups dup_0001190921.00000000000023929838 - dup_0001314034.00000000000070653874
Removing Dups dup_0001314034.00000000000070653875 - dup_0001314487.00000000000070891144
Finished trimming, now compacting...
Finished trimming pg log
root@ceph-r6-5143:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-742/ --op log --pgid 3.10 | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'
52
3000

This osd would OOM with > 350GB ram before trimming. Now it starts. (But we are still recovering other replicas so I don't yet know if this really fixes everything). HEALTH_OK!

@jdurgin
Member

jdurgin commented Apr 8, 2022

I think any OSD that ends up hitting this will quickly go OOM, and need offline recovery, so it doesn't seem necessary to handle in the online trimming.

Not necessary.

Just to make clear: we had two clusters with this issue. On the first cluster the problem (several million dups for many PGs) was detected when we changed pg_num, and the osds started to go OOM on pg split. In this scenario, yes, the way to proceed is either to remove the problem pg or fix it with ceph-objectstore-tool. We resolved it by removing, because we didn't have the patched ceph-objectstore-tool at that time yet.

On another cluster the problem (several tens of millions of dups per PG) was detected only because we started to investigate the cluster's strange behavior: long (up to 10 minutes) pg load on start, and instability due to the osds' higher memory usage when an osd is restarted. In this case, after understanding the root cause, we just waited for the ceph-objectstore-tool patch to be ready and started to fix the pgs by removing the dups. BTW, the largest dup count we have seen so far was 37M, and it took 2 hours for ceph-objectstore-tool (run with --osd_pg_log_trim_max=500000) to remove the dups.

Now imagine if we had not done this investigation and just upgraded the cluster with "fixed" osds. I suppose we would have had problems with our cluster after the upgrade. And I can imagine there are many clusters in the wild with this problem the users are not aware of. What will happen to them after the upgrade?

I see, in that case a trim max in online trimming is needed. Thanks for explaining!

Trimming everything at load_pgs time would potentially delay startup too long and cause availability problems, so I think the trim max approach is better. It does rely on I/O to trigger it though.

@NitzanMordhai
Contributor Author

Trimming everything at load_pgs time would potentially delay startup too long and cause availability problems, so I think the trim max approach is better. It does rely on I/O to trigger it though.

@jdurgin I had a discussion with @neha-ojha today regarding that. Should we do it in another PR, or do we need it in this PR? There are a few users waiting for it, and if we want to include it in the next release we should get it out.

@cervigni

@NitzanMordhai as one of those users with large clusters (and already greatly impacted by the issue, #45529 (comment)), I would like it, if possible, to be released as soon as possible with the next patch.

@gadLinux

gadLinux commented Apr 19, 2022

We are waiting for this to be in master like water in the desert... Thank you for your great work!
I'm just building the branch and giving it a try...

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pglog-dups-not-trimmed branch from 2050279 to 8fc4a63 Compare April 20, 2022 12:29
@NitzanMordhai
Contributor Author

@jdurgin I had a discussion with @neha-ojha today regarding that. Should we do it in another PR, or do we need it in this PR? There are a few users waiting for it, and if we want to include it in the next release we should get it out.

Ok, made a small change in this PR to support OSDs that would OOM during start: we won't load all the dup entries, only the latest ones up to osd_pg_log_dups_tracked.

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pglog-dups-not-trimmed branch from 8fc4a63 to b40738f Compare April 20, 2022 13:07
Member

@jdurgin jdurgin left a comment

Only loading the tracked dups off disk sounds good for the online solution. It means a small amount of space waste, but that should be negligible and can be worked around with the ceph-objectstore-tool if desired.

}
dups.push_back(dup);
if (dups.size() >= g_ceph_context->_conf->osd_pg_log_dups_tracked) {
dups.pop_front();
Contributor

It is a bit unfortunate that on every start we will need to waste time (it was up to 10 minutes in our case) reading all dups unless the user runs ceph-objectstore-tool to trim the log manually. And the user will not have any clue what is going on.

Maybe we should add a counter here and log at least a warning message, with a recommendation to trim the log manually, when the counter is large enough?

Contributor Author

@trociny if the osd is stopped again, we will save the new duplicate log, so we will load the trimmed list next time.

Contributor

So all the old dup keys will be removed from rocksdb on the save? I was not aware of it. Then I think there is no problem.

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pglog-dups-not-trimmed branch from b40738f to d8a4c5e Compare April 25, 2022 10:25
Adding duplicate entry trimming to the trim-pg-log operation. We will use the existing PGLog trim function to find the set of entries/dup entries that we are supposed to trim. To use it we need to build the PGLog from disk.

Fixes: https://tracker.ceph.com/issues/53729
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
src/osd/PGLog.h Outdated
ceph_assert(dups.back().version < dup.version);
}
dups.push_back(dup);
if (dups.size() >= g_conf()->osd_pg_log_dups_tracked) {
Member

Contributor Author

fixed

PGLog needs to trim duplicates by the number of entries rather than the versions. That way, we prevent unbounded duplicate growth.

Fixes: https://tracker.ceph.com/issues/53729
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pglog-dups-not-trimmed branch from d8a4c5e to 0d253bc Compare April 26, 2022 05:57
@ljflores
Member

jenkins test api

@ljflores
Member

jenkins test windows

@NitzanMordhai
Contributor Author

https://shaman.ceph.com/builds/ceph/wip-yuri5-testing-2022-04-28-1007/847220033f596a97c5c107497f89b7e0027da12a/
http://pulpito.front.sepia.ceph.com/yuriw-2022-04-29_15:44:49-rados-wip-yuri5-testing-2022-04-28-1007-distro-default-smithi
http://pulpito.front.sepia.ceph.com/yuriw-2022-05-04_21:19:19-rados-wip-yuri5-testing-2022-04-28-1007-distro-default-smithi

A few rados tests failed

Failures, unrelated:
opened new trackers:
6813955 - Bug #55606: [ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown - RADOS - Ceph
6813921 - Bug #55607: tests CmpOmap.cmp_vals_u64_empty_default\CmpOmap.cmp_rm_keys_u64_empty failed - RADOS - Ceph

updated trackers:
6813824 - Bug #49591: no active mgr (MGR_DOWN)" in cluster log - RADOS - Ceph
6813940 - Bug #45721: CommandFailedError: Command failed (workunit test rados/test_python.sh) FAIL: test_rados.TestWatchNotify.test - RADOS - Ceph
6814082 - Bug #48440: log [ERR] : scrub mismatch - RADOS - Ceph
6813848 - Bug #55531: octopus: No remote osd logs captured in dead jobs - Ceph - Ceph

Details:

  1. log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.y: unknown operation
  2. tests CmpOmap.cmp_vals_u64_empty_default\CmpOmap.cmp_rm_keys_u64_empty failed
  3. no active mgr (MGR_DOWN) - It seems that the monitor had trouble communicating with the manager: 2022-04-29T17:13:31.437489+0000 mon.a (mon.0) 87 : cluster [ERR] Health check failed: no active mgr (MGR_DOWN)
  4. test_rados.TestWatchNotify.test - known and unfixed issue with API test
  5. scrub mismatch after slow op.
  6. this looks like a Teuthology bug

