
osd, tools, kv: non-aggressive, on-line trimming of accumulated dups #47046

Merged
ljflores merged 8 commits into ceph:main from rzarzynski:wip-dup-trimming-test2
Aug 26, 2022
Conversation

@rzarzynski
Contributor

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

}

while (!dups.empty()) {
// we can hit an inflated `dups` b/c of https://tracker.ceph.com/issues/53729
Member


this approach looks good to me!

should update the osd_pg_log_dups_tracked docs, since the meaning of the option is a bit different now: we'll have either that many log entries or, if the max log entries is lower, that many dup entries in addition to the log entries.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@rzarzynski rzarzynski force-pushed the wip-dup-trimming-test2 branch from 0b81d15 to 511bb49 on August 1, 2022 14:45
@tchaikov
Contributor

tchaikov commented Aug 6, 2022

jenkins test make check

1 similar comment
@tchaikov
Contributor

tchaikov commented Aug 7, 2022

jenkins test make check

@tchaikov tchaikov removed the needs-qa label Aug 7, 2022
@tchaikov
Contributor

tchaikov commented Aug 7, 2022

@rzarzynski the change does not compile.

@tchaikov
Contributor

tchaikov commented Aug 7, 2022

/usr/bin/ccache /usr/bin/clang++-14  -DBOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION -DBOOST_ASIO_USE_TS_EXECUTOR_AS_DEFAULT -DBOOST_DATE_TIME_DYN_LINK -DBOOST_DATE_TIME_NO_LIB -DBOOST_IOSTREAMS_DYN_LINK -DBOOST_IOSTREAMS_NO_LIB -DBOOST_MPL_CFG_NO_PREPROCESSED_HEADERS -DBOOST_MPL_LIMIT_LIST_SIZE=30 -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_RANDOM_DYN_LINK -DBOOST_RANDOM_NO_LIB -DBOOST_SYSTEM_DYN_LINK -DBOOST_SYSTEM_NO_LIB -DBOOST_THREAD_DYN_LINK -DBOOST_THREAD_NO_LIB -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_REENTRANT -D_THREAD_SAFE -D__CEPH__ -D__STDC_FORMAT_MACROS -D__linux__ -Isrc/include -I../src -I../src/dmclock/src -I../src/dmclock/support/src -isystem /opt/ceph/include -isystem include -isystem ../src/xxHash -isystem ../src/rapidjson/include -isystem ../src/fmt/include -isystem ../src/rocksdb/include -isystem /usr/include/fuse -isystem ../src/jaegertracing/opentelemetry-cpp/api/include -isystem ../src/jaegertracing/opentelemetry-cpp/exporters/jaeger/include -isystem ../src/jaegertracing/opentelemetry-cpp/ext/include -isystem ../src/jaegertracing/opentelemetry-cpp/sdk/include -g -Werror -fPIE   -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -Wall -fno-strict-aliasing -fsigned-char -Wtype-limits -Wignored-qualifiers -Wpointer-arith -Werror=format-security -Winit-self -Wno-unknown-pragmas -Wnon-virtual-dtor -Wno-ignored-qualifiers -ftemplate-depth-1024 -Wpessimizing-move -Wredundant-move -Wno-inconsistent-missing-override -Wno-mismatched-tags -Wno-unused-private-field -Wno-address-of-packed-member -Wno-unused-function -Wno-unused-local-typedef -Wno-varargs -Wno-gnu-designator -Wno-missing-braces -Wno-parentheses -Wno-deprecated-register -DCEPH_DEBUG_MUTEX -D_GLIBCXX_ASSERTIONS -fdiagnostics-color=auto -std=c++2a -MD -MT src/tools/CMakeFiles/ceph-objectstore-tool.dir/ceph_objectstore_tool.cc.o -MF src/tools/CMakeFiles/ceph-objectstore-tool.dir/ceph_objectstore_tool.cc.o.d -o 
src/tools/CMakeFiles/ceph-objectstore-tool.dir/ceph_objectstore_tool.cc.o -c ../src/tools/ceph_objectstore_tool.cc
../src/tools/ceph_objectstore_tool.cc:3283:70: error: adding 'json_spirit::Value_type' to a string does not append to the string [-Werror,-Wstring-plus-int]
      throw std::runtime_error("JSON array/object not allowed type:" + obj.type());
                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
../src/tools/ceph_objectstore_tool.cc:3283:70: note: use array indexing to silence this warning
      throw std::runtime_error("JSON array/object not allowed type:" + obj.type());
                                                                     ^
                               &                                     [           ]
1 error generated.

@NitzanMordhai
Contributor

jenkins test make check

NitzanMordhai and others added 7 commits August 9, 2022 09:50
To be able to recreate and test PG log duplicate entries, a new option was
added to the COT: --op pg-log-inject-dups. We will also need to provide
--file with a JSON array of dups; it can take as many dups as need to be injected.

The JSON for each dup entry is in the following format:
{"reqid": "client.n.n:n", "version": "n'n", "user_version": n, "return_code": n}

Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
Fixes: https://tracker.ceph.com/issues/53729
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
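For reference, a dups file matching the format above might look like the following (all reqids, versions, and counts here are made-up illustrative values, not taken from a real cluster):

```json
[
  {"reqid": "client.4242.0:100", "version": "31'45", "user_version": 45, "return_code": 0},
  {"reqid": "client.4242.0:101", "version": "31'46", "user_version": 46, "return_code": 0}
]
```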
This commit brings back the changes to the `PGLogTrimTest` that were part
of the initial fix for the dups trimming issue (commit 0d253bc)
but got reverted in the meantime by 9bf0053.

Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
Will be needed for dealing with the conf.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/53729
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@rzarzynski rzarzynski force-pushed the wip-dup-trimming-test2 branch from 67d2c87 to 0a986bf on August 9, 2022 09:58
@neha-ojha
Member

jenkins test api

1 similar comment
@rzarzynski
Contributor Author

jenkins test api

@rzarzynski
Contributor Author

jenkins test api

1 similar comment
@tchaikov
Contributor

jenkins test api

@ljflores
Member

ljflores commented Aug 22, 2022

Tracking the API test failures here:

https://tracker.ceph.com/issues/57225
https://tracker.ceph.com/issues/57226

@epuertat FYI

@rzarzynski
Contributor Author

jenkins test api

rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Aug 23, 2022
… dups

This commit aggregates changes for multiple PRs:

* Offline: ceph#46630
* Online: ceph#47046

* Offline fix: ceph#46706
* Online fix: ceph#47688

* Offline fix: ceph#46631
* Online fix: ceph#47701

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Aug 23, 2022
… dups

This commit aggregates changes for multiple PRs:

main
----
* Offline: ceph#46630
* Online: ceph#47046

quincy
------
* Offline fix: ceph#46706
* Online fix: ceph#47688

pacific
-------
* Offline fix: ceph#46631
* Online fix: ceph#47701

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@ljflores
Member

ljflores commented Aug 23, 2022

@rzarzynski can you take a look at this failure? It may be related:

https://pulpito.ceph.com/yuriw-2022-08-22_20:21:58-rados-wip-yuri11-testing-2022-08-22-1005-distro-default-smithi/6986124/

2022-08-23T00:09:14.944 INFO:tasks.workunit.client.0.smithi185.stdout:osd.2: 64424509464
2022-08-23T00:09:14.955 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:47: test_log_size:  sleep 3
2022-08-23T00:09:17.957 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:48: test_log_size:  ceph pg 1.0 query
2022-08-23T00:09:17.957 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:48: test_log_size:  jq .info.stats.log_size
2022-08-23T00:09:18.118 INFO:tasks.workunit.client.0.smithi185.stdout:15
2022-08-23T00:09:18.121 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:49: test_log_size:  ceph pg 1.0 query
2022-08-23T00:09:18.122 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:49: test_log_size:  jq .info.stats.log_size
2022-08-23T00:09:18.123 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:49: test_log_size:  grep 15
2022-08-23T00:09:18.253 INFO:tasks.workunit.client.0.smithi185.stdout:15
2022-08-23T00:09:18.254 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:50: test_log_size:  ceph pg 1.0 query
2022-08-23T00:09:18.254 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:50: test_log_size:  jq .info.stats.log_dups_size
2022-08-23T00:09:18.399 INFO:tasks.workunit.client.0.smithi185.stdout:6
2022-08-23T00:09:18.401 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:51: test_log_size:  ceph pg 1.0 query
2022-08-23T00:09:18.401 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:51: test_log_size:  jq .info.stats.log_dups_size
2022-08-23T00:09:18.402 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:51: test_log_size:  grep 5
2022-08-23T00:09:18.542 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:172: TEST_trim_max_entries_with_dups:  return 1
2022-08-23T00:09:18.542 INFO:tasks.workunit.client.0.smithi185.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/repro_long_log.sh:34: run:  return 1

Looks like it first occurred in #46561, but then Nitzan corrected it. All other occurrences of this failure have happened when this PR was tested. @NitzanMordhai does this sound right to you?

@rzarzynski
Contributor Author

@ljflores: thanks for finding this!

TEST_trim_max_entries_with_dups was introduced pretty recently in 2aecf0d. I will dig in the morning. @NitzanMordhai, would you mind taking a look as well?

@NitzanMordhai
Contributor

TEST_trim_max_entries_with_dups was introduced pretty recently in 2aecf0d. I will dig in the morning. @NitzanMordhai, would you mind taking a look as well?

@rzarzynski the new for loop in that PR is a bit different from the original one I wrote when I added that test; it also checks osd-pg-log-trim-max. I need to fix the test accordingly for that dup trimming.

Change the number of dups trimmed according to the new loop.

Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
@rzarzynski
Contributor Author

[rzarzynski@o06 build]$ ../qa/run-standalone.sh repro_long_log.sh
...
../qa/standalone/ceph-helpers.sh:2269: main:  return 0
Elapsed 4:26.35 (266.35 seconds)
kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
ALL 1 TESTS PASSED
[rzarzynski@o06 build]$ echo $?
0

@ljflores
Member

Rados suite review:
https://pulpito.ceph.com/?branch=wip-yuri11-testing-2022-08-22-1005
http://pulpito.front.sepia.ceph.com/lflores-2022-08-25_17:56:48-rados-wip-yuri11-testing-2022-08-24-0658-distro-default-smithi/

Failures, unrelated:
1. https://tracker.ceph.com/issues/57122
2. https://tracker.ceph.com/issues/55986
3. https://tracker.ceph.com/issues/57270
4. https://tracker.ceph.com/issues/57165
5. https://tracker.ceph.com/issues/57207
6. https://tracker.ceph.com/issues/57268
7. https://tracker.ceph.com/issues/52321
8. https://tracker.ceph.com/issues/56573
9. https://tracker.ceph.com/issues/57163
10. https://tracker.ceph.com/issues/51282
11. https://tracker.ceph.com/issues/57310 -- opened a new Tracker for this; it's the first time this has appeared, but it doesn't seem related to the PR tested in this run.
12. https://tracker.ceph.com/issues/55853
13. https://tracker.ceph.com/issues/57311 -- opened a new Tracker for this; unrelated to PR tested in this run

Details:
1. test failure: rados:singleton-nomsgr librados_hello_world - Ceph - RADOS
2. cephadm: Test failure: test_cluster_set_reset_user_config (tasks.cephfs.test_nfs.TestNFS) - Ceph - Orchestrator
3. cephadm: RuntimeError: Failed command: apt-get update: E: The repository 'https://download.ceph.com/debian-octopus jammy Release' does not have a Release file. - Ceph - Orchestrator
4. expected valgrind issues and found none - Ceph - RADOS
5. AssertionError: Expected to find element: cd-modal .badge:not(script,style):cy-contains('/^foo$/'), cd-modal .badge[type='submit'][value~='/^foo$/'], but never found it. - Ceph - Mgr - Dashboard
6. rook: The CustomResourceDefinition "installations.operator.tigera.io" is invalid - Ceph - Orchestrator
7. qa/tasks/rook times out: 'check osd count' reached maximum tries (90) after waiting for 900 seconds - Ceph - Rook
8. test_cephadm.sh: KeyError: 'TYPE' - Ceph - Orchestrator
9. free(): invalid pointer - Ceph - RADOS
10. pybind/mgr/mgr_util: .mgr pool may be created too early causing spurious PG_DEGRADED warnings - Ceph - Mgr
11. StriperTest: The futex facility returned an unexpected error code - Ceph - RADOS
12. test_cls_rgw.sh: failures in 'cls_rgw.index_list' and 'cls_rgw.index_list_delimited` - Ceph - RADOS
13. rook: ensure CRDs are installed first - Ceph - Orchestrator

@ljflores ljflores merged commit 8ccd4e2 into ceph:main Aug 26, 2022
@nkshirsagar
Contributor

nkshirsagar commented Sep 8, 2022

@rzarzynski @ljflores, may I get an opinion on backporting this one to downstream Octopus?

To backport commit 0a986bf, we need to pull in the following from the RocksDBStore.h file:

  /// column families in use, name->handles
  struct prefix_shards {
    uint32_t hash_l;  //< first character to take for hash calc.
    uint32_t hash_h;  //< last character to take for hash calc.
    std::vector<rocksdb::ColumnFamilyHandle *> handles;
  };
  std::unordered_map<std::string, prefix_shards> cf_handles;
private:
  /// this iterator spans single cf
  rocksdb::Iterator* new_shard_iterator(rocksdb::ColumnFamilyHandle* cf);

Assuming the header file changes are pulled in to get it to compile, is there anything that comes to mind that may be a problem regarding the backport to Octopus? From what I see, the rest of the commits in https://github.com/ceph/ceph/pull/47046/commits can be backported to Octopus without much issue.

@rzarzynski
Contributor Author

@nkshirsagar: there is an alternative. The commit you mentioned (0a986bf) is solely about adding new debugs. If any of them brings problematic backport dependencies, we could skip it.

@nkshirsagar
Contributor

@rzarzynski thanks, I've got this backport to build on top of Octopus 15.2.17. (A quick note about this in src/tools/ceph_objectstore_tool.cc:

-      ret = tool.do_export(fs, coll, pgid, info, map_epoch, struct_ver, *superblock, past_intervals);
+      ret = tool.do_export(cct.get(), fs, coll, pgid, info, map_epoch, struct_ver, *superblock, past_intervals);

CephContext* being a boost intrusive pointer, I needed cct.get() as in the patch, but ObjectStore *fs is used as the raw pointer it is in Octopus (it's a smart pointer in the patch that went to main but a raw pointer in Octopus, so here I'm using just fs and not fs.get()).

Not many other changes apart from manual resolution of the other patches that did not cleanly cherry-pick from https://github.com/ceph/ceph/pull/47046/commits (the first 4 were clean cherry-picks, in fact).

Manually adding the debugs of 0a986bf was easier than cherry-picking it, because then I did not need to take in the header file changes to RocksDBStore.h. Please let me know what you think of this final patch, which I want to apply to downstream Octopus.


diff --git a/qa/standalone/osd/repro_long_log.sh b/qa/standalone/osd/repro_long_log.sh
index a95a5ee94e7..bae0a72f004 100755
--- a/qa/standalone/osd/repro_long_log.sh
+++ b/qa/standalone/osd/repro_long_log.sh
@@ -148,6 +148,44 @@ function TEST_trim_max_entries()
     test_log_size $PGID 3 || return 1
 }
 
+function TEST_trim_max_entries_with_dups()
+{
+    local dir=$1
+
+    setup_log_test $dir || return 1
+
+    ceph tell osd.\* injectargs -- --osd_target_pg_log_entries_per_osd 2 || return 1
+    ceph tell osd.\* injectargs -- --osd-min-pg-log-entries 2
+    ceph tell osd.\* injectargs -- --osd-pg-log-trim-min 2
+    ceph tell osd.\* injectargs -- --osd-pg-log-trim-max 4
+    ceph tell osd.\* injectargs -- --osd_pg_log_dups_tracked 20 || return 1
+
+    # adding log entries, should only trim 4 and add one each time
+    # dups should be trimmed to 1
+    rados -p test rm foo
+    test_log_size $PGID 18 2 || return 1
+    rados -p test rm foo
+    test_log_size $PGID 15 6 || return 1
+    rados -p test rm foo
+    test_log_size $PGID 12 10 || return 1
+    rados -p test rm foo
+    test_log_size $PGID 9 14 || return 1
+    rados -p test rm foo
+    test_log_size $PGID 6 18 || return 1
+    rados -p test rm foo
+    test_log_size $PGID 3 20 || return 1
+
+    # below trim_min
+    rados -p test rm foo
+    test_log_size $PGID 4 20 || return 1
+    rados -p test rm foo
+    test_log_size $PGID 3 20 || return 1
+    rados -p test rm foo
+    test_log_size $PGID 4 20 || return 1
+    rados -p test rm foo
+    test_log_size $PGID 3 20 || return 1
+}
+
 main repro-long-log "$@"
 
 # Local Variables:
diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc
index 328277e52a3..a58ecb1d32c 100644
--- a/src/kv/RocksDBStore.cc
+++ b/src/kv/RocksDBStore.cc
@@ -1008,6 +1008,8 @@ void RocksDBStore::RocksDBTransactionImpl::rm_range_keys(const string &prefix,
                                                          const string &start,
                                                          const string &end)
 {
+  ldout(db->cct, 10) << __func__ << " enter start=" << start
+                    << " end=" << end << dendl;
   auto cf = db->get_cf_handle(prefix);
 
   uint64_t cnt = db->delete_range_threshold;
@@ -1021,8 +1023,12 @@ void RocksDBStore::RocksDBTransactionImpl::rm_range_keys(const string &prefix,
     if (!cnt) {
       bat.RollbackToSavePoint();
       if (cf) {
+        ldout(db->cct, 10) << __func__ << " p_iter == end(), resorting to DeleteRange"
+                        << dendl;
         bat.DeleteRange(cf, rocksdb::Slice(start), rocksdb::Slice(end));
       } else {
+       ldout(db->cct, 10) << __func__ << " p_iter == end(), resorting to DeleteRange"
+                        << dendl;
         bat.DeleteRange(db->default_cf,
                         rocksdb::Slice(combine_strings(prefix, start)),
                         rocksdb::Slice(combine_strings(prefix, end)));
@@ -1038,6 +1044,7 @@ void RocksDBStore::RocksDBTransactionImpl::rm_range_keys(const string &prefix,
     --cnt;
   }
   bat.PopSavePoint();
+  ldout(db->cct, 10) << __func__ << " end" << dendl;
 }
 
 void RocksDBStore::RocksDBTransactionImpl::merge(
diff --git a/src/osd/PGLog.cc b/src/osd/PGLog.cc
index 7ed51488083..7ae4ea5fcf3 100644
--- a/src/osd/PGLog.cc
+++ b/src/osd/PGLog.cc
@@ -50,6 +50,7 @@ void PGLog::IndexedLog::trim(
   set<string>* trimmed_dups,
   eversion_t *write_from_dups)
 {
+  lgeneric_subdout(cct, osd, 10) << "IndexedLog::trim s=" << s << dendl;
   ceph_assert(s <= can_rollback_to);
   if (complete_to != log.end())
     lgeneric_subdout(cct, osd, 20) << " complete_to " << complete_to->version << dendl;
@@ -121,10 +122,18 @@ void PGLog::IndexedLog::trim(
     }
   }
 
-  while (!dups.empty()) {
+  // we can hit an inflated `dups` b/c of https://tracker.ceph.com/issues/53729
+  // the idea is to slowly trim them over a prolonged period of time and mix
+  // omap deletes with writes (if we're here, a new log entry got added) to
+  // neither: 1) blow size of single Transaction nor 2) generate-n-accumulate
+  // large amount of tombstones in BlueStore's RocksDB.
+  // if trimming immediately is a must, then the ceph-objectstore-tool is
+  // the way to go.
+  const size_t max_dups = cct->_conf->osd_pg_log_dups_tracked;
+  for (size_t max_dups_to_trim = cct->_conf->osd_pg_log_trim_max;
+       max_dups_to_trim > 0 && dups.size() > max_dups;
+       max_dups_to_trim--) {
     const auto& e = *dups.begin();
-    if (e.version.version >= earliest_dup_version)
-      break;
     lgeneric_subdout(cct, osd, 20) << "trim dup " << e << dendl;
     if (trimmed_dups)
       trimmed_dups->insert(e.get_key_name());
@@ -135,6 +144,10 @@ void PGLog::IndexedLog::trim(
   // raise tail?
   if (tail < s)
     tail = s;
+  lgeneric_subdout(cct, osd, 20) << "IndexedLog::trim after trim"
+                        << " dups.size()=" << dups.size()
+                        << " tail=" << tail
+                        << " s=" << s << dendl;
 }
 
 ostream& PGLog::IndexedLog::print(ostream& out) const
@@ -506,6 +519,9 @@ void PGLog::merge_log(pg_info_t &oinfo, pg_log_t &olog, pg_shard_t fromosd,
 
 // returns true if any changes were made to log.dups
 bool PGLog::merge_log_dups(const pg_log_t& olog) {
+  dout(5) << __func__
+         << " log.dups.size()=" << log.dups.size()
+         <<  "olog.dups.size()=" << olog.dups.size() << dendl;
   bool changed = false;
 
   if (!olog.dups.empty()) {
@@ -584,6 +600,9 @@ bool PGLog::merge_log_dups(const pg_log_t& olog) {
     }
   }
 
+  dout(5) << "end of " << __func__ << " changed=" << changed
+          << " log.dups.size()=" << log.dups.size()
+         << " olog.dups.size()=" << olog.dups.size() << dendl;
   return changed;
 }
 
@@ -645,7 +664,8 @@ void PGLog::write_log_and_missing(
       dirty_from_dups,
       write_from_dups,
       &may_include_deletes_in_missing_dirty,
-      (pg_log_debug ? &log_keys_debug : nullptr));
+      (pg_log_debug ? &log_keys_debug : nullptr),
+      this);
     undirty();
   } else {
     dout(10) << "log is not dirty" << dendl;
@@ -659,14 +679,15 @@ void PGLog::write_log_and_missing_wo_missing(
     pg_log_t &log,
     const coll_t& coll, const ghobject_t &log_oid,
     map<eversion_t, hobject_t> &divergent_priors,
-    bool require_rollback
+    bool require_rollback,
+    const DoutPrefixProvider *dpp
     )
 {
   _write_log_and_missing_wo_missing(
     t, km, log, coll, log_oid,
     divergent_priors, eversion_t::max(), eversion_t(), eversion_t(),
     true, true, require_rollback,
-    eversion_t::max(), eversion_t(), eversion_t(), nullptr);
+    eversion_t::max(), eversion_t(), eversion_t(), nullptr, dpp);
 }
 
 // static
@@ -678,7 +699,8 @@ void PGLog::write_log_and_missing(
     const ghobject_t &log_oid,
     const pg_missing_tracker_t &missing,
     bool require_rollback,
-    bool *may_include_deletes_in_missing_dirty)
+    bool *may_include_deletes_in_missing_dirty,
+    const DoutPrefixProvider *dpp)
 {
   _write_log_and_missing(
     t, km, log, coll, log_oid,
@@ -692,7 +714,7 @@ void PGLog::write_log_and_missing(
     eversion_t::max(),
     eversion_t(),
     eversion_t(),
-    may_include_deletes_in_missing_dirty, nullptr);
+    may_include_deletes_in_missing_dirty, nullptr, dpp);
 }
 
 // static
@@ -711,10 +733,14 @@ void PGLog::_write_log_and_missing_wo_missing(
   eversion_t dirty_to_dups,
   eversion_t dirty_from_dups,
   eversion_t write_from_dups,
-  set<string> *log_keys_debug
+  set<string> *log_keys_debug,
+  const DoutPrefixProvider *dpp
   )
 {
-  // dout(10) << "write_log_and_missing, clearing up to " << dirty_to << dendl;
+  ldpp_dout(dpp, 10) << "_write_log_and_missing_wo_missing, clearing up to " << dirty_to
+                    << " dirty_to_dups=" << dirty_to_dups
+                    << " dirty_from_dups=" << dirty_from_dups
+                    << " write_from_dups=" << write_from_dups << dendl;
   if (touch_log)
     t.touch(coll, log_oid);
   if (dirty_to != eversion_t()) {
@@ -765,6 +791,8 @@ void PGLog::_write_log_and_missing_wo_missing(
   if (dirty_to_dups != eversion_t()) {
     pg_log_dup_t min, dirty_to_dup;
     dirty_to_dup.version = dirty_to_dups;
+    ldpp_dout(dpp, 10) << __func__ << " remove dups min=" << min.get_key_name()
+              << " to dirty_to_dup=" << dirty_to_dup.get_key_name() << dendl;
     t.omap_rmkeyrange(
       coll, log_oid,
       min.get_key_name(), dirty_to_dup.get_key_name());
@@ -773,11 +801,16 @@ void PGLog::_write_log_and_missing_wo_missing(
     pg_log_dup_t max, dirty_from_dup;
     max.version = eversion_t::max();
     dirty_from_dup.version = dirty_from_dups;
+    ldpp_dout(dpp, 10) << __func__ << " remove dups dirty_from_dup="
+                      << dirty_from_dup.get_key_name()
+                      << " to max=" << max.get_key_name() << dendl;
     t.omap_rmkeyrange(
       coll, log_oid,
       dirty_from_dup.get_key_name(), max.get_key_name());
   }
 
+  ldpp_dout(dpp, 10) << __func__ << " going to encode log.dups.size()="
+            << log.dups.size() << dendl;
   for (const auto& entry : log.dups) {
     if (entry.version > dirty_to_dups)
       break;
@@ -786,6 +819,8 @@ void PGLog::_write_log_and_missing_wo_missing(
     (*km)[entry.get_key_name()].claim(bl);
   }
 
+  ldpp_dout(dpp, 10) << __func__ << " 1st round encoded log.dups.size()="
+            << log.dups.size() << dendl;
   for (list<pg_log_dup_t>::reverse_iterator p = log.dups.rbegin();
        p != log.dups.rend() &&
         (p->version >= dirty_from_dups || p->version >= write_from_dups) &&
@@ -796,8 +831,11 @@ void PGLog::_write_log_and_missing_wo_missing(
     (*km)[p->get_key_name()].claim(bl);
   }
 
+  ldpp_dout(dpp, 10) << __func__ << " 2nd round encoded log.dups.size()="
+            << log.dups.size() << dendl;
   if (dirty_divergent_priors) {
-    //dout(10) << "write_log_and_missing: writing divergent_priors" << dendl;
+    ldpp_dout(dpp, 10) << "write_log_and_missing: writing divergent_priors"
+              << dendl;
     encode(divergent_priors, (*km)["divergent_priors"]);
   }
   if (require_rollback) {
@@ -808,6 +846,7 @@ void PGLog::_write_log_and_missing_wo_missing(
       log.get_rollback_info_trimmed_to(),
       (*km)["rollback_info_trimmed_to"]);
   }
+  ldpp_dout(dpp, 10) << "end of " << __func__ << dendl;
 }
 
 // static
@@ -829,8 +868,14 @@ void PGLog::_write_log_and_missing(
   eversion_t dirty_from_dups,
   eversion_t write_from_dups,
   bool *may_include_deletes_in_missing_dirty, // in/out param
-  set<string> *log_keys_debug
+  set<string> *log_keys_debug,
+  const DoutPrefixProvider *dpp
   ) {
+  ldpp_dout(dpp, 10) << __func__ << " clearing up to " << dirty_to
+                    << " dirty_to_dups=" << dirty_to_dups
+                    << " dirty_from_dups=" << dirty_from_dups
+                    << " write_from_dups=" << write_from_dups
+                    << " trimmed_dups.size()=" << trimmed_dups.size() << dendl;
   set<string> to_remove;
   to_remove.swap(trimmed_dups);
   for (auto& t : trimmed) {
@@ -853,7 +898,8 @@ void PGLog::_write_log_and_missing(
     clear_up_to(log_keys_debug, dirty_to.get_key_name());
   }
   if (dirty_to != eversion_t::max() && dirty_from != eversion_t::max()) {
-    //   dout(10) << "write_log_and_missing, clearing from " << dirty_from << dendl;
+    ldpp_dout(dpp, 10) << "write_log_and_missing, clearing from "
+                      << dirty_from << dendl;
     t.omap_rmkeyrange(
       coll, log_oid,
       dirty_from.get_key_name(), eversion_t::max().get_key_name());
@@ -894,6 +940,8 @@ void PGLog::_write_log_and_missing(
   if (dirty_to_dups != eversion_t()) {
     pg_log_dup_t min, dirty_to_dup;
     dirty_to_dup.version = dirty_to_dups;
+    ldpp_dout(dpp, 10) << __func__ << " remove dups min=" << min.get_key_name()
+                      << " to dirty_to_dup=" << dirty_to_dup.get_key_name() << dendl;
     t.omap_rmkeyrange(
       coll, log_oid,
       min.get_key_name(), dirty_to_dup.get_key_name());
@@ -902,11 +950,16 @@ void PGLog::_write_log_and_missing(
     pg_log_dup_t max, dirty_from_dup;
     max.version = eversion_t::max();
     dirty_from_dup.version = dirty_from_dups;
+    ldpp_dout(dpp, 10) << __func__ << " remove dups dirty_from_dup="
+                      << dirty_from_dup.get_key_name()
+                      << " to max=" << max.get_key_name() << dendl;
     t.omap_rmkeyrange(
       coll, log_oid,
       dirty_from_dup.get_key_name(), max.get_key_name());
   }
 
+  ldpp_dout(dpp, 10) << __func__ << " going to encode log.dups.size()="
+                    << log.dups.size() << dendl;
   for (const auto& entry : log.dups) {
     if (entry.version > dirty_to_dups)
       break;
@@ -914,6 +967,8 @@ void PGLog::_write_log_and_missing(
     encode(entry, bl);
     (*km)[entry.get_key_name()].claim(bl);
   }
+  ldpp_dout(dpp, 10) << __func__ << " 1st round encoded log.dups.size()="
+            << log.dups.size() << dendl;
 
   for (list<pg_log_dup_t>::reverse_iterator p = log.dups.rbegin();
        p != log.dups.rend() &&
@@ -925,8 +980,11 @@ void PGLog::_write_log_and_missing(
     (*km)[p->get_key_name()].claim(bl);
   }
 
+  ldpp_dout(dpp, 10) << __func__ << " 2nd round encoded log.dups.size()="
+                    << log.dups.size() << dendl;
   if (clear_divergent_priors) {
-    //dout(10) << "write_log_and_missing: writing divergent_priors" << dendl;
+    ldpp_dout(dpp, 10) << "write_log_and_missing: writing divergent_priors"
+                      << dendl;
     to_remove.insert("divergent_priors");
   }
   // since we encode individual missing items instead of a whole
@@ -956,6 +1014,7 @@ void PGLog::_write_log_and_missing(
 
   if (!to_remove.empty())
     t.omap_rmkeys(coll, log_oid, to_remove);
+  ldpp_dout(dpp, 10) << "end of " << __func__ << dendl;
 }
 
 void PGLog::rebuild_missing_set_with_deletes(
diff --git a/src/osd/PGLog.h b/src/osd/PGLog.h
index 297e89d9a40..f36e3b81cd1 100644
--- a/src/osd/PGLog.h
+++ b/src/osd/PGLog.h
@@ -1274,8 +1274,9 @@ public:
     map<string,bufferlist>* km,
     pg_log_t &log,
     const coll_t& coll,
-    const ghobject_t &log_oid, map<eversion_t, hobject_t> &divergent_priors,
-    bool require_rollback);
+    const ghobject_t &log_oid, std::map<eversion_t, hobject_t> &divergent_priors,
+    bool require_rollback,
+    const DoutPrefixProvider *dpp = nullptr);
 
   static void write_log_and_missing(
     ObjectStore::Transaction& t,
@@ -1285,7 +1286,8 @@ public:
     const ghobject_t &log_oid,
     const pg_missing_tracker_t &missing,
     bool require_rollback,
-    bool *rebuilt_missing_set_with_deletes);
+    bool *rebuilt_missing_set_with_deletes,
+    const DoutPrefixProvider *dpp = nullptr);
 
   static void _write_log_and_missing_wo_missing(
     ObjectStore::Transaction& t,
@@ -1302,7 +1304,8 @@ public:
     eversion_t dirty_to_dups,
     eversion_t dirty_from_dups,
     eversion_t write_from_dups,
-    set<string> *log_keys_debug
+    std::set<std::string> *log_keys_debug,
+    const DoutPrefixProvider *dpp = nullptr
     );
 
   static void _write_log_and_missing(
@@ -1323,7 +1326,8 @@ public:
     eversion_t dirty_from_dups,
     eversion_t write_from_dups,
     bool *may_include_deletes_in_missing_dirty,
-    set<string> *log_keys_debug
+    std::set<std::string> *log_keys_debug,
+    const DoutPrefixProvider *dpp = nullptr
     );
 
   void read_log_and_missing(
@@ -1336,7 +1340,7 @@ public:
     bool debug_verify_stored_missing = false
     ) {
     return read_log_and_missing(
-      store, ch, pgmeta_oid, info,
+      cct, store, ch, pgmeta_oid, info,
       log, missing, oss,
       tolerate_divergent_missing_log,
       &clear_divergent_priors,
@@ -1347,6 +1351,7 @@ public:
 
   template <typename missing_type>
   static void read_log_and_missing(
+    CephContext *cct,
     ObjectStore *store,
     ObjectStore::CollectionHandle &ch,
     ghobject_t pgmeta_oid,
@@ -1360,9 +1365,10 @@ public:
     set<string> *log_keys_debug = nullptr,
     bool debug_verify_stored_missing = false
     ) {
-    ldpp_dout(dpp, 20) << "read_log_and_missing coll " << ch->cid
+    ldpp_dout(dpp, 10) << "read_log_and_missing coll " << ch->cid
                       << " " << pgmeta_oid << dendl;
 
+    size_t total_dups = 0;
     // legacy?
     struct stat st;
     int r = store->stat(ch, pgmeta_oid, &st);
@@ -1377,8 +1383,9 @@ public:
     map<eversion_t, hobject_t> divergent_priors;
     bool must_rebuild = false;
     missing.may_include_deletes = false;
-    list<pg_log_entry_t> entries;
-    list<pg_log_dup_t> dups;
+    std::list<pg_log_entry_t> entries;
+    std::list<pg_log_dup_t> dups;
+    const auto NUM_DUPS_WARN_THRESHOLD = 2*cct->_conf->osd_pg_log_dups_tracked;
     if (p) {
       for (p->seek_to_first(); p->valid() ; p->next()) {
        // non-log pgmeta_oid keys are prefixed with _; skip those
@@ -1409,11 +1416,20 @@ public:
          }
          missing.add(oid, std::move(item));
        } else if (p->key().substr(0, 4) == string("dup_")) {
+          ++total_dups;
          pg_log_dup_t dup;
          decode(dup, bp);
          if (!dups.empty()) {
            ceph_assert(dups.back().version < dup.version);
          }
+         if (dups.size() == NUM_DUPS_WARN_THRESHOLD) {
+           ldpp_dout(dpp, 0) << "read_log_and_missing WARN num of dups exceeded "
+                             << NUM_DUPS_WARN_THRESHOLD << "."
+                             << " You can be hit by THE DUPS BUG"
+                             << " https://tracker.ceph.com/issues/53729."
+                             << " Consider ceph-objectstore-tool --op trim-pg-log-dups"
+                             << dendl;
+         }
          dups.push_back(dup);
        } else {
          pg_log_entry_t e;
@@ -1591,7 +1607,9 @@ public:
        (*clear_divergent_priors) = false;
       missing.flush();
     }
-    ldpp_dout(dpp, 10) << "read_log_and_missing done" << dendl;
+    ldpp_dout(dpp, 10) << "read_log_and_missing done coll " << ch->cid
+                      << " total_dups=" << total_dups
+                      << " log.dups.size()=" << log.dups.size() << dendl;
   } // static read_log_and_missing
 
 #ifdef WITH_SEASTAR
diff --git a/src/osd/osd_types.cc b/src/osd/osd_types.cc
index 3f11eed8188..2aeabfb4e93 100644
--- a/src/osd/osd_types.cc
+++ b/src/osd/osd_types.cc
@@ -5042,8 +5042,9 @@ static void _handle_dups(CephContext* cct, pg_log_t &target, const pg_log_t &oth
 {
   auto earliest_dup_version =
                target.head.version < maxdups ? 0u : target.head.version - maxdups + 1;
-  lgeneric_subdout(cct, osd, 20) << "copy_up_to/copy_after earliest_dup_version " << earliest_dup_version << dendl;
 
+  lgeneric_subdout(cct, osd, 20) << __func__ << " earliest_dup_version "
+                                << earliest_dup_version << dendl;
   for (auto d = other.dups.cbegin(); d != other.dups.cend(); ++d) {
     if (d->version.version >= earliest_dup_version) {
       lgeneric_subdout(cct, osd, 20)
@@ -5072,7 +5073,9 @@ void pg_log_t::copy_after(CephContext* cct, const pg_log_t &other, eversion_t v)
   can_rollback_to = other.can_rollback_to;
   head = other.head;
   tail = other.tail;
-  lgeneric_subdout(cct, osd, 20) << __func__ << " v " << v << dendl;
+  lgeneric_subdout(cct, osd, 20) << __func__ << " v " << v
+                                << " dups.size()=" << dups.size()
+                                << " other.dups.size()=" << other.dups.size() << dendl;
   for (auto i = other.log.crbegin(); i != other.log.crend(); ++i) {
     ceph_assert(i->version > other.tail);
     if (i->version <= v) {
@@ -5084,6 +5087,9 @@ void pg_log_t::copy_after(CephContext* cct, const pg_log_t &other, eversion_t v)
     log.push_front(*i);
   }
   _handle_dups(cct, *this, other, cct->_conf->osd_pg_log_dups_tracked);
+  lgeneric_subdout(cct, osd, 20) << __func__ << " END v " << v
+                                << " dups.size()=" << dups.size()
+                                << " other.dups.size()=" << other.dups.size() << dendl;
 }
 
 void pg_log_t::copy_up_to(CephContext* cct, const pg_log_t &other, int max)
@@ -5093,6 +5099,9 @@ void pg_log_t::copy_up_to(CephContext* cct, const pg_log_t &other, int max)
   head = other.head;
   tail = other.tail;
   lgeneric_subdout(cct, osd, 20) << __func__ << " max " << max << dendl;
+  lgeneric_subdout(cct, osd, 20) << __func__ << " max " << max
+                               << " dups.size()=" << dups.size()
+                               << " other.dups.size()=" << other.dups.size() << dendl;
   for (auto i = other.log.crbegin(); i != other.log.crend(); ++i) {
     ceph_assert(i->version > other.tail);
     if (n++ >= max) {
@@ -5103,6 +5112,9 @@ void pg_log_t::copy_up_to(CephContext* cct, const pg_log_t &other, int max)
     log.push_front(*i);
   }
   _handle_dups(cct, *this, other, cct->_conf->osd_pg_log_dups_tracked);
+  lgeneric_subdout(cct, osd, 20) << __func__ << " END max " << max
+                                << " dups.size()=" << dups.size()
+                                << " other.dups.size()=" << other.dups.size() << dendl;
 }
 
 ostream& pg_log_t::print(ostream& out) const
diff --git a/src/test/osd/TestPGLog.cc b/src/test/osd/TestPGLog.cc
index 99e22f82df9..fc9fa250569 100644
--- a/src/test/osd/TestPGLog.cc
+++ b/src/test/osd/TestPGLog.cc
@@ -2739,8 +2739,8 @@ TEST_F(PGLogTrimTest, TestPartialTrim)
   EXPECT_EQ(eversion_t(19, 160), write_from_dups2);
   EXPECT_EQ(2u, log.log.size());
   EXPECT_EQ(1u, trimmed2.size());
-  EXPECT_EQ(2u, log.dups.size());
-  EXPECT_EQ(1u, trimmed_dups2.size());
+  EXPECT_EQ(3u, log.dups.size());
+  EXPECT_EQ(0u, trimmed_dups2.size());
 }
 
 
@@ -3023,7 +3023,7 @@ TEST_F(PGLogTrimTest, TestTrimDups) {
 
   EXPECT_EQ(eversion_t(20, 103), write_from_dups) << log;
   EXPECT_EQ(2u, log.log.size()) << log;
-  EXPECT_EQ(3u, log.dups.size()) << log;
+  EXPECT_EQ(4u, log.dups.size()) << log;
 }
 
 // This tests trim() to make copies of
@@ -3067,7 +3067,7 @@ TEST_F(PGLogTrimTest, TestTrimDups2) {
 
   EXPECT_EQ(eversion_t(10, 100), write_from_dups) << log;
   EXPECT_EQ(4u, log.log.size()) << log;
-  EXPECT_EQ(5u, log.dups.size()) << log;
+  EXPECT_EQ(6u, log.dups.size()) << log;
 }
 
 // This tests copy_up_to() to make copies of
diff --git a/src/tools/ceph_objectstore_tool.cc b/src/tools/ceph_objectstore_tool.cc
index 2f876f67181..fa663bacb15 100644
--- a/src/tools/ceph_objectstore_tool.cc
+++ b/src/tools/ceph_objectstore_tool.cc
@@ -14,8 +14,10 @@
 
 #include <boost/program_options/variables_map.hpp>
 #include <boost/program_options/parsers.hpp>
+#include <boost/algorithm/string.hpp>
 #include <boost/scoped_ptr.hpp>
 #include <boost/optional.hpp>
+#include <fstream>
 
 #include <stdlib.h>
 
@@ -430,7 +432,7 @@ static int get_fd_data(int fd, bufferlist &bl)
   return 0;
 }
 
-int get_log(ObjectStore *fs, __u8 struct_ver,
+int get_log(CephContext *cct, ObjectStore *fs, __u8 struct_ver,
            spg_t pgid, const pg_info_t &info,
            PGLog::IndexedLog &log, pg_missing_t &missing)
 {
@@ -442,7 +444,7 @@ int get_log(ObjectStore *fs, __u8 struct_ver,
     ostringstream oss;
     ceph_assert(struct_ver > 0);
     PGLog::read_log_and_missing(
-      fs, ch,
+      cct, fs, ch,
       pgid.make_pgmeta_oid(),
       info, log, missing,
       oss,
@@ -1068,7 +1070,8 @@ int add_osdmap(ObjectStore *store, metadata_section &ms)
   return get_osdmap(store, ms.map_epoch, ms.osdmap, ms.osdmap_bl);
 }
 
-int ObjectStoreTool::do_export(ObjectStore *fs, coll_t coll, spg_t pgid,
+int ObjectStoreTool::do_export(
+    CephContext *cct, ObjectStore *fs, coll_t coll, spg_t pgid,
     pg_info_t &info, epoch_t map_epoch, __u8 struct_ver,
     const OSDSuperblock& superblock,
     PastIntervals &past_intervals)
@@ -1078,7 +1081,7 @@ int ObjectStoreTool::do_export(ObjectStore *fs, coll_t coll, spg_t pgid,
 
   cerr << "Exporting " << pgid << " info " << info << std::endl;
 
-  int ret = get_log(fs, struct_ver, pgid, info, log, missing);
+  int ret = get_log(cct, fs, struct_ver, pgid, info, log, missing);
   if (ret > 0)
       return ret;
 
@@ -3147,6 +3150,136 @@ int dup(string srcpath, ObjectStore *src, string dstpath, ObjectStore *dst)
   return r;
 }
 
+
+const int ceph_entity_name_type(const string name)
+{
+  if (name == "mds") return CEPH_ENTITY_TYPE_MDS;
+  if (name == "osd") return CEPH_ENTITY_TYPE_OSD;
+  if (name == "mon") return CEPH_ENTITY_TYPE_MON;
+  if (name == "client") return CEPH_ENTITY_TYPE_CLIENT;
+  if (name == "mgr") return CEPH_ENTITY_TYPE_MGR;
+  if (name == "auth") return CEPH_ENTITY_TYPE_AUTH;
+  return -1;
+}
+
+eversion_t get_eversion_from_str(const string& s) {
+  eversion_t e;
+  vector<string> result;
+  boost::split(result, s, boost::is_any_of("'"));
+  if (result.size() != 2) {
+    cerr << "eversion_t: invalid format: '" << s << "'" << std::endl;
+    return e;
+  }
+  e.epoch   = atoi(result[0].c_str());
+  e.version = atoi(result[1].c_str());
+  return e;
+}
+
+osd_reqid_t get_reqid_from_str(const string& s) {
+  osd_reqid_t reqid;
+
+  vector<string> result;
+  boost::split(result, s, boost::is_any_of(".:"));
+  if (result.size() != 4) {
+    cerr << "reqid: invalid format " << s << std::endl;
+    return osd_reqid_t();
+  }
+  reqid.name._type = ceph_entity_name_type(result[0]);
+  reqid.name._num = atoi(result[1].c_str());
+
+  reqid.inc = atoi(result[2].c_str());
+  reqid.tid = atoi(result[3].c_str());
+  return reqid;
+}
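
For quick experimentation with the two string formats parsed above, the same logic can be sketched in Python (the helper names `parse_eversion` and `parse_reqid` are mine, not part of the patch): eversion strings look like `3'0` (`epoch'version`) and reqids like `client.4177.0:0` (`name.num.inc:tid`), mirroring the `boost::split` calls in the C++.

```python
import re

def parse_eversion(s):
    # eversion strings look like "epoch'version", e.g. "3'0"
    epoch, version = s.split("'")
    return int(epoch), int(version)

def parse_reqid(s):
    # reqid strings look like "name.num.inc:tid", e.g. "client.4177.0:0";
    # split on both '.' and ':' just like boost::is_any_of(".:")
    parts = re.split(r"[.:]", s)
    if len(parts) != 4:
        raise ValueError("reqid: invalid format " + s)
    name, num, inc, tid = parts
    return name, int(num), int(inc), int(tid)
```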
+
+void do_dups_inject_transction(ObjectStore *store, spg_t r_pgid, map<string,bufferlist> *new_dups)
+{
+  ObjectStore::Transaction t;
+  coll_t coll(r_pgid);
+  cerr << "injecting dups into pgid:" << r_pgid << " num of dups:" << new_dups->size() << std::endl;
+  t.omap_setkeys(coll, r_pgid.make_pgmeta_oid(), (*new_dups));
+  auto ch = store->open_collection(coll);
+  store->queue_transaction(ch, std::move(t));
+  new_dups->clear();
+}
+
+int do_dups_inject_object(ObjectStore *store, spg_t r_pgid, json_spirit::mObject &in_json_obj,
+                          map<string,bufferlist> *new_dups, bool debug) {
+  std::map<std::string, json_spirit::mValue>::const_iterator it = in_json_obj.find("generate");
+  int32_t generate = 0;
+  if (it != in_json_obj.end()) {
+    generate = atoi(it->second.get_str().c_str());
+  }
+
+  it = in_json_obj.find("reqid");
+  if (it == in_json_obj.end()) {
+    return 1;
+  }
+  osd_reqid_t reqid(get_reqid_from_str(it->second.get_str()));
+  it = in_json_obj.find("version");
+  if (it == in_json_obj.end()) {
+    return 1;
+  }
+  eversion_t version(get_eversion_from_str(it->second.get_str()));
+  it = in_json_obj.find("user_version");
+  if (it == in_json_obj.end()) {
+    return 1;
+  }
+  version_t user_version = atoi(it->second.get_str().c_str());
+  it = in_json_obj.find("return_code");
+  if (it == in_json_obj.end()) {
+    return 1;
+  }
+  int32_t return_code = atoi(it->second.get_str().c_str());
+  if (generate) {
+    for(auto i = 0; i < generate; ++i) {
+      version.version++;
+      if (debug) {
+        cout << "generate dups reqid " << reqid << " v=" << version << std::endl;
+      }
+      pg_log_dup_t tmp(version, user_version, reqid, return_code);
+      bufferlist bl;
+      encode(tmp, bl);
+      (*new_dups)[tmp.get_key_name()] = std::move(bl);
+      if (new_dups->size() > 50000) {
+        do_dups_inject_transction(store, r_pgid, new_dups);
+        cout << "inject of " << i << " dups into pgid:" << r_pgid << " done..." << std::endl;
+      }
+    }
+    return 0;
+  } else {
+    pg_log_dup_t tmp(version, user_version, reqid, return_code);
+    if (debug) {
+      cout << "adding dup: " << tmp << "into key:" << tmp.get_key_name() << std::endl;
+    }
+    bufferlist bl;
+    encode(tmp, bl);
+    (*new_dups)[tmp.get_key_name()] = std::move(bl);
+  }
+  return 0;
+}
+
+void do_dups_inject_from_json(ObjectStore *store, spg_t r_pgid, json_spirit::mValue &inJson, bool debug)
+{
+  map<string,bufferlist> new_dups;
+  const vector<json_spirit::mValue>& o = inJson.get_array();
+  for (const auto& obj : o) {
+    if (obj.type() == json_spirit::obj_type) {
+      json_spirit::mObject Mobj = obj.get_obj();
+      do_dups_inject_object(store, r_pgid, Mobj, &new_dups, debug);
+    } else {
+      throw std::runtime_error("JSON array/object not allowed type:" + std::to_string(obj.type()));
+      return;
+    }
+  }
+  if (new_dups.size() > 0) {
+    do_dups_inject_transction(store, r_pgid, &new_dups);
+  }
+
+
+  return ;
+}
+
 void usage(po::options_description &desc)
 {
     cerr << std::endl;
@@ -3480,7 +3613,7 @@ int main(int argc, char **argv)
     } else {
       file_fd = open(file.c_str(), O_WRONLY|O_CREAT|O_TRUNC, 0666);
     }
-  } else if (op == "import" || op == "dump-export" || op == "set-osdmap" || op == "set-inc-osdmap") {
+  } else if (op == "import" || op == "dump-export" || op == "set-osdmap" || op == "set-inc-osdmap" || op == "pg-log-inject-dups") {
     if (!vm.count("file") || file == "-") {
       if (isatty(STDIN_FILENO)) {
         cerr << "stdin is a tty and no --file filename specified" << std::endl;
@@ -3859,7 +3992,7 @@ int main(int argc, char **argv)
       || op == "export-remove" || op == "mark-complete"
       || op == "reset-last-complete"
       || op == "trim-pg-log"
-      || op == "trim-pg-log-dups") &&
+      || op == "pg-log-inject-dups") &&
       pgidstr.length() == 0) {
     cerr << "Must provide pgid" << std::endl;
     usage(desc);
@@ -4086,7 +4219,7 @@ int main(int argc, char **argv)
 
   // If not an object command nor any of the ops handled below, then output this usage
   // before complaining about a bad pgid
-  if (!vm.count("objcmd") && op != "export" && op != "export-remove" && op != "info" && op != "log" && op != "mark-complete" && op != "trim-pg-log" && op != "trim-pg-log-dups") {
+  if (!vm.count("objcmd") && op != "export" && op != "export-remove" && op != "info" && op != "log" && op != "mark-complete" && op != "trim-pg-log" && op != "trim-pg-log-dups" && op != "pg-log-inject-dups") {
     cerr << "Must provide --op (info, log, remove, mkfs, fsck, repair, export, export-remove, import, list, fix-lost, list-pgs, dump-journal, dump-super, meta-list, "
       "get-osdmap, set-osdmap, get-inc-osdmap, set-inc-osdmap, mark-complete, reset-last-complete, dump-export, trim-pg-log, trim-pg-log-dups statfs)"
         << std::endl;
@@ -4378,7 +4511,7 @@ int main(int argc, char **argv)
 
     if (op == "export" || op == "export-remove") {
       ceph_assert(superblock != nullptr);
-      ret = tool.do_export(fs, coll, pgid, info, map_epoch, struct_ver, *superblock, past_intervals);
+      ret = tool.do_export(cct.get(), fs, coll, pgid, info, map_epoch, struct_ver, *superblock, past_intervals);
       if (ret == 0) {
         cerr << "Export successful" << std::endl;
         if (op == "export-remove") {
@@ -4397,7 +4530,7 @@ int main(int argc, char **argv)
     } else if (op == "log") {
       PGLog::IndexedLog log;
       pg_missing_t missing;
-      ret = get_log(fs, struct_ver, pgid, info, log, missing);
+      ret = get_log(cct.get(), fs, struct_ver, pgid, info, log, missing);
       if (ret < 0)
           goto out;
 
@@ -4482,6 +4615,34 @@ int main(int argc, char **argv)
       }
       cout << "Reseting last_complete succeeded" << std::endl;
    
+    } else if (op == "pg-log-inject-dups") {
+        if (!vm.count("file") || file == "-") {
+          cerr << "Must provide file containing JSON dups entries" << std::endl;
+          ret = 1;
+          goto out;
+        }
+        if (debug)
+          cerr << "opening file " << file << std::endl;
+
+        ifstream json_file_stream(file , std::ifstream::in);
+        if (!json_file_stream.is_open()) {
+          cerr << "unable to open file " << file << std::endl;
+          ret = -1;
+          goto out;
+        }
+        json_spirit::mValue result;
+        try {
+          if (!json_spirit::read(json_file_stream, result))
+            throw std::runtime_error("unparseable JSON " + file);
+          if (result.type() != json_spirit::array_type) {
+            cerr << "result is not an array_type - type=" << result.type() << std::endl;
+            throw std::runtime_error("not JSON array_type " + file);
+          }
+          do_dups_inject_from_json(fs, pgid, result, debug);
+        } catch (const std::runtime_error &e) {
+          cerr << e.what() << std::endl;
+          return -1;
+        }
     } else {
       ceph_assert(!"Should have already checked for valid --op");
     }
diff --git a/src/tools/ceph_objectstore_tool.h b/src/tools/ceph_objectstore_tool.h
index aafe886ba69..82aa83e5d2c 100644
--- a/src/tools/ceph_objectstore_tool.h
+++ b/src/tools/ceph_objectstore_tool.h
@@ -27,7 +27,7 @@ class ObjectStoreTool : public RadosDump
     int dump_export(Formatter *formatter);
     int do_import(ObjectStore *store, OSDSuperblock& sb, bool force,
                  std::string pgidstr);
-    int do_export(ObjectStore *fs, coll_t coll, spg_t pgid,
+    int do_export(CephContext *cct, ObjectStore *fs, coll_t coll, spg_t pgid,
           pg_info_t &info, epoch_t map_epoch, __u8 struct_ver,
           const OSDSuperblock& superblock,
           PastIntervals &past_intervals);

@rzarzynski
Copy link
Contributor Author

@nkshirsagar: thanks, looks good!

@nkshirsagar
Copy link
Contributor

Hi @rzarzynski o/

I've been trying to test this patch in Octopus. I'm not sure whether I am getting the expected results. Please see this link for the detailed testing: https://pastebin.com/MNMWh01i (I used 500k dups in the JSON file instead of 50 million.)

@nkshirsagar
Copy link
Contributor


I have managed to validate this patch in Octopus. I injected dups using the JSON below and then started IO using rbd bench-write:

root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build/out# cat ../bin/dups.json 
[
    {"reqid": "client.4177.0:0", 
    "version": "3'0", 
    "user_version": "0", 
    "generate": "500000", 
    "return_code": "0"}
]
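
Purely as an illustration (the helper name `make_dups_spec`, the file name, and the counts are my own, not from the patch), a spec file like the one above could be generated with a short Python script; the tool's `generate` field then expands the single entry into that many consecutive dup entries:

```python
import json

def make_dups_spec(client_id, epoch, count, path):
    # Build a one-entry spec; "generate" tells the tool how many
    # consecutive dups to synthesize from this template entry.
    spec = [{
        "reqid": "client.%d.0:0" % client_id,
        "version": "%d'0" % epoch,
        "user_version": "0",
        "generate": str(count),
        "return_code": "0",
    }]
    with open(path, "w") as f:
        json.dump(spec, f, indent=4)
    return spec

spec = make_dups_spec(4177, 3, 500000, "dups.json")
```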

After a while, "trim dup" lines with the same client ID as in the JSON file appeared in the OSD logs (https://github.com/ceph/ceph/pull/47046/commits/aada08acde7a05ad769bb7a886ebcece628d522c#diff-b293fb673637ea53b5874bbb04f8f0638ca39cab009610e2cbc40a867bca4906L138):

root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build/out# cat osd.1.log | grep -i "trim dup "  | grep 4177 | more
2022-09-26T10:30:53.125+0000 7fdb72741700  1 trim dup log_dup(reqid=client.4177.0:0 v=3'1 uv=0 rc=0)
2022-09-26T10:30:53.125+0000 7fdb72741700  1 trim dup log_dup(reqid=client.4177.0:0 v=3'2 uv=0 rc=0)
2022-09-26T10:30:53.125+0000 7fdb72741700  1 trim dup log_dup(reqid=client.4177.0:0 v=3'3 uv=0 rc=0)
2022-09-26T10:30:53.125+0000 7fdb72741700  1 trim dup log_dup(reqid=client.4177.0:0 v=3'4 uv=0 rc=0)
2022-09-26T10:30:53.125+0000 7fdb72741700  1 trim dup log_dup(reqid=client.4177.0:0 v=3'5 uv=0 rc=0)
...
...
2022-09-26T10:30:53.125+0000 7fdb72741700  1 trim dup log_dup(reqid=client.4177.0:0 v=3'52 uv=0 rc=0)

# grep -ri "trim dup " *.log | grep 4177 | wc -l
390001

The output of ./bin/ceph-objectstore-tool --data-path dev/osd1/ --no-mon-config --pgid 2.1f --op log shows that the first 130k dups have already been trimmed:

 "dups": [
            {
                "reqid": "client.4177.0:0",
                "version": "3'130001",
                "user_version": "0",
                "return_code": "0"
            },
            {
                "reqid": "client.4177.0:0",
                "version": "3'130002",
                "user_version": "0",
                "return_code": "0"
            },

Hence, I will now submit this patch (#47046 (comment)) to our downstream Octopus package.
