Skip to content

osdc/Journaler: derr log when header update get (possibly) overwritten out-of-order#55265

Merged
vshankar merged 1 commit intoceph:mainfrom
vshankar:wip-58878
Jul 16, 2024
Merged

osdc/Journaler: derr log when header update get (possibly) overwritten out-of-order#55265
vshankar merged 1 commit intoceph:mainfrom
vshankar:wip-58878

Conversation

@vshankar
Copy link
Contributor

@vshankar vshankar commented Jan 22, 2024

See: https://tracker.ceph.com/issues/58878#note-4

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@vshankar vshankar added the core label Jan 22, 2024
@vshankar vshankar requested review from a team and rzarzynski January 22, 2024 06:17
@vshankar
Copy link
Contributor Author

jenkins test make check

@vshankar
Copy link
Contributor Author

vshankar commented Feb 8, 2024

@rzarzynski ping?

Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the mds is the only user of this Journaler (rbd uses src/journal). It's in our scope to maintain it, not RADOS.

if (wrote.write_pos < last_committed.write_pos ||
wrote.expire_pos < last_committed.expire_pos ||
wrote.trimmed_pos < last_committed.trimmed_pos) {
lderr(cct) << __func__ << ": not updating last_committed" << dendl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wary of just skipping the header update. Any objection to an actual assertion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough - I'll add the assert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@batrick why are you wary of skipping the header update here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@batrick why are you wary of skipping the header update here ?

Because the Journaler had previously, successfully written a header with last_committed beyond the header write that just succeeded. This may indicate that we have clobbered the newest update to head. We cannot happily continue after having done so.

Looking at this code again, there's little to prevent this kind of error. It really should be harnessing the smarts of the OSD by encoding a sequence code -- the current write_pos -- in the write (probably stored in the omap) which fails the op if the write's sequence code is behind what the OSD has.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this code again, there's little to prevent this kind of error. It really should be harnessing the smarts of the OSD by encoding a sequence code -- the current write_pos -- in the write (probably stored in the omap) which fails the op if the write's sequence code is behind what the OSD has.

I agree and IMO this is something we would want to add at least when flushing journal header.

@batrick batrick requested review from a team and removed request for a team and rzarzynski February 28, 2024 17:34
@batrick batrick added cephfs Ceph File System needs-review and removed core labels Feb 28, 2024
@vshankar
Copy link
Contributor Author

BTW, the mds is the only user of this Journaler (rbd uses src/journal). It's in our scope to maintain it, not RADOS.

In that case we should get the journaler sources in src/mds (just as an logical dependency).

Comment on lines +496 to +510
if (wrote.write_pos < last_committed.write_pos ||
wrote.expire_pos < last_committed.expire_pos ||
wrote.trimmed_pos < last_committed.trimmed_pos) {
lderr(cct) << __func__ << ": not updating last_committed: "
<< "(wrote.write_pos/last_committed.write_pos="
<< wrote.write_pos << "," << last_committed.write_pos < "), "
<< "(wrote.expire_pos/last_committed.expire_pos="
<< wrote.expire_pos << "," << last_committed.expire_pos < "), "
<< "(wrote.trimmed_pos/last_committed.trimmed_pos="
<< wrote.trimmed_pos << "," << last_committed.trimmed_pos < ")"
<< dendl;
ceph_abort();
} else {
last_committed = wrote;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, the old values are already written to the disk and we have no clue whether the write op is out of sync or the just the update notification or both. But the log should help as a reference for setting the header field values, using cephfs-journal-tool, when the assertion failure ceph_assert(trim_to >= trimming_pos) is hit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the assertion failure in MDLog::_trim() will not happen due to the ceph_abort() here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh, right. This is just a hunch and right now we have no idea what could have happened that caused the journal pointer values.

@vshankar vshankar requested a review from batrick March 12, 2024 01:34
@vshankar vshankar requested a review from mchangir April 12, 2024 06:18
@vshankar
Copy link
Contributor Author

jenkins retest this please

@vshankar
Copy link
Contributor Author

@ceph/cephfs gentle nudge on this...

@vshankar
Copy link
Contributor Author

vshankar commented Jun 5, 2024

@ceph/cephfs gentle nudge (needs approval to put this through testing)

@vshankar
Copy link
Contributor Author

vshankar commented Jun 5, 2024

jenkins retest this please

Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but we'll want a follow-up ticket for

#55265 (comment)

@batrick
Copy link
Member

batrick commented Jun 5, 2024

FAILED: src/mds/CMakeFiles/mds.dir/__/osdc/Journaler.cc.o 
/usr/bin/ccache /usr/bin/clang++-14 -DBOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION -DBOOST_ASIO_HAS_IO_URING -DBOOST_ASIO_NO_TS_EXECUTORS -DBOOST_DATE_TIME_DYN_LINK -DBOOST_DATE_TIME_NO_LIB -DBOOST_IOSTREAMS_DYN_LINK -DBOOST_IOSTREAMS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_RANDOM_DYN_LINK -DBOOST_RANDOM_NO_LIB -DBOOST_SYSTEM_DYN_LINK -DBOOST_SYSTEM_NO_LIB -DBOOST_THREAD_DYN_LINK -DBOOST_THREAD_NO_LIB -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_REENTRANT -D_THREAD_SAFE -D__CEPH__ -D__STDC_FORMAT_MACROS -D__linux__ -I/home/jenkins-build/build/workspace/ceph-pull-requests/build/src/include -I/home/jenkins-build/build/workspace/ceph-pull-requests/src -I/usr/include/lua5.3 -isystem /opt/ceph/include -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/build/include -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/src/jaegertracing/opentelemetry-cpp/api/include -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/src/jaegertracing/opentelemetry-cpp/exporters/jaeger/include -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/src/jaegertracing/opentelemetry-cpp/ext/include -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/src/jaegertracing/opentelemetry-cpp/sdk/include -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/src/xxHash -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/src/fmt/include -g -Werror -fPIC -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -DBOOST_PHOENIX_STL_TUPLE_H_ -Wall -fno-strict-aliasing -fsigned-char -Wtype-limits -Wignored-qualifiers -Wpointer-arith -Werror=format-security -Winit-self -Wno-unknown-pragmas -Wnon-virtual-dtor -Wno-ignored-qualifiers -ftemplate-depth-1024 -Wpessimizing-move -Wredundant-move -Wno-inconsistent-missing-override -Wno-mismatched-tags -Wno-unused-private-field -Wno-address-of-packed-member -Wno-unused-function -Wno-unused-local-typedef -Wno-varargs -Wno-gnu-designator -Wno-missing-braces -Wno-parentheses -Wno-deprecated-register -DCEPH_DEBUG_MUTEX -D_GLIBCXX_ASSERTIONS -fdiagnostics-color=auto -std=c++20 -MD -MT src/mds/CMakeFiles/mds.dir/__/osdc/Journaler.cc.o -MF src/mds/CMakeFiles/mds.dir/__/osdc/Journaler.cc.o.d -o src/mds/CMakeFiles/mds.dir/__/osdc/Journaler.cc.o -c /home/jenkins-build/build/workspace/ceph-pull-requests/src/osdc/Journaler.cc
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osdc/Journaler.cc:502:9: error: invalid operands to binary expression ('const char[4]' and 'const char[45]')
               << "(wrote.expire_pos/last_committed.expire_pos="
               ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osdc/Journaler.cc:504:9: error: invalid operands to binary expression ('const char[4]' and 'const char[47]')
               << "(wrote.trimmed_pos/last_committed.trimmed_pos="
               ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osdc/Journaler.cc:506:12: error: reference to overloaded function could not be resolved; did you mean to call it?
               << dendl;
                  ^~~~~
/home/jenkins-build/build/workspace/ceph-pull-requests/src/common/dout.h:216:15: note: expanded from macro 'dendl'
#define dendl dendl_impl
              ^~~~~~~~~~
/home/jenkins-build/build/workspace/ceph-pull-requests/src/common/dout.h:189:20: note: expanded from macro 'dendl_impl'
#define dendl_impl std::flush;                                          \
                   ^~~~~~~~~~
/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ostream:706:5: note: possible target for call
    flush(basic_ostream<_CharT, _Traits>& __os)
    ^

@vshankar
Copy link
Contributor Author

vshankar commented Jun 5, 2024

FAILED: src/mds/CMakeFiles/mds.dir/__/osdc/Journaler.cc.o 
/usr/bin/ccache /usr/bin/clang++-14 -DBOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION -DBOOST_ASIO_HAS_IO_URING -DBOOST_ASIO_NO_TS_EXECUTORS -DBOOST_DATE_TIME_DYN_LINK -DBOOST_DATE_TIME_NO_LIB -DBOOST_IOSTREAMS_DYN_LINK -DBOOST_IOSTREAMS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_RANDOM_DYN_LINK -DBOOST_RANDOM_NO_LIB -DBOOST_SYSTEM_DYN_LINK -DBOOST_SYSTEM_NO_LIB -DBOOST_THREAD_DYN_LINK -DBOOST_THREAD_NO_LIB -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_REENTRANT -D_THREAD_SAFE -D__CEPH__ -D__STDC_FORMAT_MACROS -D__linux__ -I/home/jenkins-build/build/workspace/ceph-pull-requests/build/src/include -I/home/jenkins-build/build/workspace/ceph-pull-requests/src -I/usr/include/lua5.3 -isystem /opt/ceph/include -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/build/include -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/src/jaegertracing/opentelemetry-cpp/api/include -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/src/jaegertracing/opentelemetry-cpp/exporters/jaeger/include -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/src/jaegertracing/opentelemetry-cpp/ext/include -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/src/jaegertracing/opentelemetry-cpp/sdk/include -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/src/xxHash -isystem /home/jenkins-build/build/workspace/ceph-pull-requests/src/fmt/include -g -Werror -fPIC -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -DBOOST_PHOENIX_STL_TUPLE_H_ -Wall -fno-strict-aliasing -fsigned-char -Wtype-limits -Wignored-qualifiers -Wpointer-arith -Werror=format-security -Winit-self -Wno-unknown-pragmas -Wnon-virtual-dtor -Wno-ignored-qualifiers -ftemplate-depth-1024 -Wpessimizing-move -Wredundant-move -Wno-inconsistent-missing-override -Wno-mismatched-tags -Wno-unused-private-field -Wno-address-of-packed-member -Wno-unused-function -Wno-unused-local-typedef -Wno-varargs -Wno-gnu-designator -Wno-missing-braces -Wno-parentheses -Wno-deprecated-register -DCEPH_DEBUG_MUTEX -D_GLIBCXX_ASSERTIONS -fdiagnostics-color=auto -std=c++20 -MD -MT src/mds/CMakeFiles/mds.dir/__/osdc/Journaler.cc.o -MF src/mds/CMakeFiles/mds.dir/__/osdc/Journaler.cc.o.d -o src/mds/CMakeFiles/mds.dir/__/osdc/Journaler.cc.o -c /home/jenkins-build/build/workspace/ceph-pull-requests/src/osdc/Journaler.cc
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osdc/Journaler.cc:502:9: error: invalid operands to binary expression ('const char[4]' and 'const char[45]')
               << "(wrote.expire_pos/last_committed.expire_pos="
               ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osdc/Journaler.cc:504:9: error: invalid operands to binary expression ('const char[4]' and 'const char[47]')
               << "(wrote.trimmed_pos/last_committed.trimmed_pos="
               ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osdc/Journaler.cc:506:12: error: reference to overloaded function could not be resolved; did you mean to call it?
               << dendl;
                  ^~~~~
/home/jenkins-build/build/workspace/ceph-pull-requests/src/common/dout.h:216:15: note: expanded from macro 'dendl'
#define dendl dendl_impl
              ^~~~~~~~~~
/home/jenkins-build/build/workspace/ceph-pull-requests/src/common/dout.h:189:20: note: expanded from macro 'dendl_impl'
#define dendl_impl std::flush;                                          \
                   ^~~~~~~~~~
/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ostream:706:5: note: possible target for call
    flush(basic_ostream<_CharT, _Traits>& __os)
    ^

I'll fix that up real quick.

…overwritten out-of-order

See: https://tracker.ceph.com/issues/58878#note-4

Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar
Copy link
Contributor Author

vshankar commented Jun 6, 2024

jenkins test make check

@vshankar
Copy link
Contributor Author

vshankar commented Jun 6, 2024

jenkins test make check arm64

@batrick
Copy link
Member

batrick commented Jun 10, 2024

jenkins test make check arm64

@vshankar
Copy link
Contributor Author

This PR is under test in https://tracker.ceph.com/issues/66521.

@vshankar
Copy link
Contributor Author

@vshankar vshankar merged commit f4ac1f4 into ceph:main Jul 16, 2024
NitzanMordhai pushed a commit to NitzanMordhai/ceph that referenced this pull request Aug 1, 2024
osdc/Journaler: derr log when header update get (possibly) overwritten out-of-order

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Dhairya Parmar <dparmar@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cephfs Ceph File System needs-qa

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants