Bug #54741

closed

crash: MDSTableClient::got_journaled_ack(unsigned long)

Added by Telemetry Bot about 4 years ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Venky Shankar
Category:
Correctness/Safety
Target version:
v21.0.0
% Done:

0%

Source:
Telemetry
Backport:
tentacle
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
62554
Tags (freeform):
Fixed In:
v20.3.0-1890-g742cb59571
Released In:
Upkeep Timestamp:
2025-08-12T05:09:42+00:00

Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=366ab44cb3c1d002359d6d1dbe2b4287e261f13517807620ef5a83625e3c0a2b

Sanitized backtrace:

    MDSTableClient::got_journaled_ack(unsigned long)
    ETableClient::replay(MDSRank*)
    MDLog::_replay_thread()
    MDLog::ReplayThread::entry()

Crash dump sample:
{
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12c20) [0x7f7500e67c20]",
        "(std::_Hashtable<unsigned long, unsigned long, std::allocator<unsigned long>, std::__detail::_Identity, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true> >::_M_erase(std::integral_constant<bool, true>, unsigned long const&)+0x37) [0x55651af0bb97]",
        "(MDSTableClient::got_journaled_ack(unsigned long)+0x179) [0x55651af09159]",
        "(ETableClient::replay(MDSRank*)+0x1b3) [0x55651afd8d03]",
        "(MDLog::_replay_thread()+0xcd1) [0x55651af688e1]",
        "(MDLog::ReplayThread::entry()+0x11) [0x55651ac6a311]",
        "/lib64/libpthread.so.0(+0x817a) [0x7f7500e5d17a]",
        "clone()" 
    ],
    "ceph_version": "16.2.7",
    "crash_id": "2022-03-01T10:31:29.990280Z_af75c1cc-080f-4364-9410-565aa6812923",
    "entity_name": "mds.79d0a07c1449cab02f6cf00fb829f33d1b9f49e6",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mds",
    "stack_sig": "c009c98d5387ea202da11cd36585a1c06b8d48ee740496c707f1171a5332598c",
    "timestamp": "2022-03-01T10:31:29.990280Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-74-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021" 
}


Related issues 3 (1 open, 2 closed)

Related to CephFS - Bug #62962: mds: standby-replay daemon crashes on replay (Duplicate; assignee: Milind Changire)
Related to CephFS - Bug #59119: mds: segmentation fault during replay of snaptable updates (Pending Backport; assignee: Venky Shankar)
Related to CephFS - Bug #70723: qa: AddressSanitizer reports heap-use-after-free in mds-log-replay thread (Resolved; assignee: Milind Changire)
Actions #1

Updated by Telemetry Bot about 4 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v16.2.4, v16.2.7 added
Actions #2

Updated by Telemetry Bot almost 4 years ago

  • Project changed from RADOS to CephFS
  • Crash signature (v1) updated (diff)
Actions #3

Updated by Milind Changire over 2 years ago

  • Assignee set to Xiubo Li
Actions #4

Updated by Venky Shankar over 2 years ago

  • Assignee changed from Xiubo Li to Venky Shankar
  • Target version set to v19.0.0
  • Backport set to quincy,reef
  • Component(FS) MDS added
  • Labels (FS) crash added

Xiubo, I'm taking this one. This has been hit in a downstream build lately.

Actions #5

Updated by Venky Shankar over 2 years ago

BTW, this crash is seen in the standby-replay MDS daemon.

Actions #6

Updated by Xiubo Li over 2 years ago

Venky Shankar wrote:

BTW, this crash is seen in the standby-replay MDS daemon.

Sure Venky, I haven't gotten a chance to check it yet.

Actions #7

Updated by Venky Shankar over 2 years ago

... and we have a core to inspect. Hopefully things go well.

Actions #8

Updated by Xiubo Li over 2 years ago

Venky Shankar wrote:

... and we have a core to inspect. Hopefully things go well.

Sounds good.

Actions #9

Updated by Venky Shankar over 2 years ago

The LogSegment has an empty `pending_commit_tids` map

(gdb) p pending_commit
$9 = std::map with 1 element = {[174037] = 0x55919f128300}
(gdb) p (LogSegment *)0x55919f128300
$10 = (LogSegment *) 0x55919f128300
(gdb) p *(LogSegment *)0x55919f128300
$11 = {seq = 94081258618881, offset = 94083927409456, end = 0, num_events = 0, dirty_dirfrags = {_head = {_prev = 0x2a0b, _next = 0x0}, item_offset = 94081258618881}, new_dirfrags = {_head = {_prev = 0x55919f1282d0, _next = 0x55919f128300},
    item_offset = 94083927409504}, dirty_inodes = {_head = {_prev = 0x2a0c, _next = 0x0}, item_offset = 94081258618881}, dirty_dentries = {_head = {_prev = 0x55919f128330, _next = 0x0}, item_offset = 0}, open_files = {_head = {_prev = 0x2a0d, _next = 0x0},
    item_offset = 94081258618881}, dirty_parent_inodes = {_head = {_prev = 0x55919f128690, _next = 0x55919f128210}, item_offset = 94083927409936}, dirty_dirfrag_dir = {_head = {_prev = 0x2a0e, _next = 0x0}, item_offset = 94081258618881}, dirty_dirfrag_nest = {_head = {
      _prev = 0x55919f1283f0, _next = 0x0}, item_offset = 0}, dirty_dirfrag_dirfragtree = {_head = {_prev = 0x2a0f, _next = 0x0}, item_offset = 94081258618881},
  truncating_inodes = std::set with 94081258618881 elements<error reading variable: Cannot access memory at address 0x2a28>, purging_inodes = {_size = 94083927409648,
    m = std::map with 94083927409936 elements<error reading variable: Cannot access memory at address 0x18>}, purged_cb = 0x55919f1283f0, pending_commit_tids = std::map with 0 elements,
  uncommitted_leaders = std::set with 94083927409792 elements<error reading variable: Cannot access memory at address 0x400010>, uncommitted_peers = std::set with 0 elements,
  uncommitted_fragments = std::set with 94083927409744 elements<error reading variable: Cannot access memory at address 0x400010>, last_client_tids = std::map with 0 elements,
  touched_sessions = std::set with 94083927409984 elements<error reading variable: Cannot access memory at address 0x400010>, inotablev = 94083927410080, sessionmapv = 10776,
  tablev = std::map with 10777 elements<error reading variable: Cannot access memory at address 0x18>, expiry_waiters = std::vector of length 0, capacity 11760490926242 = {<error reading variable>
(gdb) p ((LogSegment *)0x55919f128300)->pending_commit_tids
$12 = std::map with 0 elements

but `MDSTableClient::got_journaled_ack` tries to remove an element from the map value.

void MDSTableClient::got_journaled_ack(version_t tid)
{
  dout(10) << "got_journaled_ack " << tid << dendl;
  if (pending_commit.count(tid)) {
    pending_commit[tid]->pending_commit_tids[table].erase(tid);  // <-- crashes here
    pending_commit.erase(tid);
  }
}
Actions #10

Updated by Venky Shankar over 2 years ago

  • Related to Bug #62962: mds: standby-replay daemon crashes on replay added
Actions #11

Updated by Venky Shankar over 2 years ago

So, this happens in a standby-replay daemon

(gdb) bt
#0  0x00007f6a667c854c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007f6a6677bd46 in raise () from /lib64/libc.so.6
#2  0x0000558ffa73187a in reraise_fatal (signum=11) at /usr/src/debug/ceph-17.2.6-148.el9cp.x86_64/src/global/signal_handler.cc:88
#3  handle_oneshot_fatal_signal (signum=11) at /usr/src/debug/ceph-17.2.6-148.el9cp.x86_64/src/global/signal_handler.cc:363
#4  <signal handler called>
#5  0x00007f6a66aee0d3 in std::_Rb_tree_decrement(std::_Rb_tree_node_base*) () from /lib64/libstdc++.so.6
#6  0x0000558ffa4078e5 in std::_Rb_tree_iterator<std::pair<int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::operator-- (this=<synthetic pointer>) at /usr/include/c++/11/bits/stl_tree.h:302
#7  std::_Rb_tree<int, std::pair<int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::_Select1st<std::pair<int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<int>, std::allocator<std::pair<int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_M_get_insert_unique_pos (this=<optimized out>, __k=@0x559077906bc0: 1) at /usr/include/c++/11/bits/stl_tree.h:2080
#8  0x0000558ffa7ccea2 in std::_Rb_tree<int, std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > >, std::_Select1st<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > >, std::less<int>, std::allocator<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > > >::_M_get_insert_hint_unique_pos (__k=@0x559077906bc0: 1, __position={first = 0, second = std::unordered_set with 94081258618881 elements}, this=0x55919f128468) at /usr/include/c++/11/bits/stl_tree.h:2209
#9  std::_Rb_tree<int, std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > >, std::_Select1st<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > >, std::less<int>, std::allocator<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<int const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > >, std::piecewise_construct_t const&, std::tuple<int const&>&&, std::tuple<>&&) [clone .constprop.0] [clone .isra.0] (this=0x55919f128468, __pos={first = 0, second = std::unordered_set with 94081258618881 elements}) at /usr/include/c++/11/bits/stl_tree.h:2435
#10 0x0000558ffa67d9fc in std::map<int, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> >, std::less<int>, std::allocator<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > > >::operator[] (__k=@0x558ffb691450: 1, this=<optimized out>) at /usr/include/c++/11/bits/stl_tree.h:350
#11 MDSTableClient::got_journaled_ack (this=0x558ffb691440, tid=<optimized out>) at /usr/src/debug/ceph-17.2.6-148.el9cp.x86_64/src/mds/MDSTableClient.cc:222
#12 0x0000558ffa6c8923 in MDLog::_replay_thread (this=0x558ffc40c300) at /usr/src/debug/ceph-17.2.6-148.el9cp.x86_64/src/mds/MDLog.cc:1436
#13 0x0000558ffa43ff41 in MDLog::ReplayThread::entry (this=<optimized out>) at /usr/src/debug/ceph-17.2.6-148.el9cp.x86_64/src/mds/MDLog.h:192
#14 0x00007f6a667c6802 in start_thread () from /lib64/libc.so.6
#15 0x00007f6a66766450 in clone3 () from /lib64/libc.so.6
(gdb) f 12
#12 0x0000558ffa6c8923 in MDLog::_replay_thread (this=0x558ffc40c300) at /usr/src/debug/ceph-17.2.6-148.el9cp.x86_64/src/mds/MDLog.cc:1436
1436            le->replay(mds);
(gdb) p mds->state
$36 = MDSMap::STATE_STANDBY_REPLAY

EMetaBlob::table_tids is persisted in the mdlog and holds the commit tids; however, the standby-replay daemon does not appear to have replayed that log segment (perhaps it was trimmed by the active MDS?). If the next log segment contains the `TABLESERVER_OP_ACK` log event, this crash will happen.

Actions #12

Updated by Patrick Donnelly over 2 years ago

  • Related to Bug #59119: mds: segmentation fault during replay of snaptable updates added
Actions #13

Updated by Patrick Donnelly almost 2 years ago

  • Target version changed from v19.0.0 to v20.0.0
Actions #14

Updated by Venky Shankar almost 2 years ago

/a/vshankar-2024-05-10_15:47:15-fs-wip-vshankar-testing-20240509.114851-debug-testing-default-smithi/7701383

The underlying cause is described in https://tracker.ceph.com/issues/59119: the active MDS trims a log segment from the journal, but the standby-replay daemon still thinks it can continue replaying the log segments, which is incorrect and leads to this crash.

Actions #15

Updated by Venky Shankar over 1 year ago

So, the series of log segment replays that leads to this crash is as follows (in the standby-replay daemon):

A journal read is done by the replay thread

2024-05-10T22:03:01.148+0000 7fdf6e200700 20 mds.0.0 _standby_replay_restart_finish: old_read_pos=42766129 trimmed_pos=37748736
2024-05-10T22:03:01.148+0000 7fdf6e200700 10 mds.0.log standby_trim_segments
2024-05-10T22:03:01.148+0000 7fdf6e200700 10 mds.0.log  expire_pos=42319376
2024-05-10T22:03:01.148+0000 7fdf6e200700 10 mds.0.log  maybe trim LogSegment(10205/0x285be10 events=21)
2024-05-10T22:03:01.148+0000 7fdf6e200700 10 mds.0.log  won't remove, not expired!
2024-05-10T22:03:01.148+0000 7fdf6e200700 20 mds.0.log  removed no segments!
2024-05-10T22:03:01.148+0000 7fdf6e200700  2 mds.0.0 Booting: 2: replaying mds log
2024-05-10T22:03:01.148+0000 7fdf6e200700 10 mds.0.log replay start, from 42766129 to 43141352
2024-05-10T22:03:01.149+0000 7fdf6c400700 10 mds.0.log _replay_thread start

This log segment has a mksnap and an rmsnap event, though the sub-events are spread across subsequent log segments.

OP_PREPARE (snap server):

2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.log _replay 42766129~121 / 43141352 2024-05-10T22:03:00.678242+0000: ETableServer snaptable prepare reqid 34 mds.0 tid 81 version 81 mutation=57 bytes
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal  ETableServer.replay snaptable prepare event 81 - 1 == table 80
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.snap prepare v81 create snap(22 0x1 'scheduled-2024-05-10-22_03_00_UTC' 2024-05-10T22:03:00.678077+0000)

MKSNAP:

2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.log _replay 42766270~4226 / 43141352 2024-05-10T22:03:00.684379+0000: EUpdate mksnap [metablob table_tids=[^A,81]]
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal EUpdate::replay
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal EMetaBlob.replay 0 dirlumps by unknown.0
2024-05-10T22:03:01.153+0000 7fdf6c400700 20 mds.0.cache.ino(0x1) decode_snap_blob snaprealm(0x1 seq 22 lc 22 cr 22 cps 1 snaps={16=snap(16 0x1 'scheduled-2024-05-10-21_57_00_UTC' 2024-05-10T21:57:00.411295+0000),18=snap(18 0x1 'scheduled-2024-05-10-21_58_00_UTC' 2024-\
05-10T21:58:00.795387+0000),1a=snap(1a 0x1 'scheduled-2024-05-10-21_59_00_UTC' 2024-05-10T21:59:00.287682+0000),1c=snap(1c 0x1 'scheduled-2024-05-10-22_00_00_UTC' 2024-05-10T22:00:00.625804+0000),1e=snap(1e 0x1 'scheduled-2024-05-10-22_01_00_UTC' 2024-05-10T22:01:00.81\
2167+0000),20=snap(20 0x1 'scheduled-2024-05-10-22_02_00_UTC' 2024-05-10T22:02:00.140519+0000),22=snap(22 0x1 'scheduled-2024-05-10-22_03_00_UTC' 2024-05-10T22:03:00.477972+0000)} last_modified 2024-05-10T22:03:00.477972+0000 change_attr 33 0x557a01ab2000)
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal EMetaBlob.replay  updated root [inode 0x1 [...22,head] / auth v170 snaprealm=0x557a01ab2000 f(v0 m2024-05-10T21:40:01.354241+0000 1=0+1) n(v4 rc2024-05-10T22:03:00.477972+0000 b1286564962 rs7 66878=63038+3840)/\
n(v0 rc2024-05-10T21:38:23.018285+0000 1=0+1) old_inodes=6 | dirfrag=1 dirty=1 0x557a01aa7080]
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal EMetaBlob.replay noting snaptable transaction 81
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.tableclient(snaptable) got_journaled_agree 81
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.snapclient notify_commit tid 81
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal EMetaBlob.replay request client.15360:236 trim_to 235

OP_COMMIT (snap server):

2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.log _replay 42770516~64 / 43141352 2024-05-10T22:03:00.689268+0000: ETableServer snaptable commit tid 81 version 82
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal  ETableServer.replay snaptable commit event 82 - 1 == table 81
2024-05-10T22:03:01.153+0000 7fdf6c400700  7 mds.0.snap commit 81 create snap(22 0x1 'scheduled-2024-05-10-22_03_00_UTC' 2024-05-10T22:03:00.678077+0000)

OP_ACK (snap client):

2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.log _replay 42770704~40 / 43141352 2024-05-10T22:03:00.888532+0000: ETableClient snaptable ack tid 81
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal  ETableClient.replay snaptable op ack tid 81
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.tableclient(snaptable) got_journaled_ack 81

This is fine. Now, let's see the rmsnap event.

OP_PREPARE (snap server):

2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.log _replay 42770600~84 / 43141352 2024-05-10T22:03:00.884034+0000: ETableServer snaptable prepare reqid 35 mds.0 tid 83 version 83 mutation=20 bytes
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal  ETableServer.replay snaptable prepare event 83 - 1 == table 82
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.snap prepare v83 destroy 16 seq 23

RMSNAP:

2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.log _replay 42770764~4143 / 43141352 2024-05-10T22:03:00.889489+0000: EUpdate rmsnap [metablob table_tids=[^A,83]]
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal EUpdate::replay
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal EMetaBlob.replay 0 dirlumps by unknown.0
2024-05-10T22:03:01.153+0000 7fdf6c400700 20 mds.0.cache.ino(0x1) decode_snap_blob snaprealm(0x1 seq 23 lc 22 cr 22 cps 1 snaps={18=snap(18 0x1 'scheduled-2024-05-10-21_58_00_UTC' 2024-05-10T21:58:00.795387+0000),1a=snap(1a 0x1 'scheduled-2024-05-10-21_59_00_UTC' 2024-\
05-10T21:59:00.287682+0000),1c=snap(1c 0x1 'scheduled-2024-05-10-22_00_00_UTC' 2024-05-10T22:00:00.625804+0000),1e=snap(1e 0x1 'scheduled-2024-05-10-22_01_00_UTC' 2024-05-10T22:01:00.812167+0000),20=snap(20 0x1 'scheduled-2024-05-10-22_02_00_UTC' 2024-05-10T22:02:00.14\
0519+0000),22=snap(22 0x1 'scheduled-2024-05-10-22_03_00_UTC' 2024-05-10T22:03:00.477972+0000)} last_modified 2024-05-10T22:03:00.876094+0000 change_attr 34 0x557a01ab2000)
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal EMetaBlob.replay  updated root [inode 0x1 [...22,head] / auth v171 snaprealm=0x557a01ab2000 f(v0 m2024-05-10T21:40:01.354241+0000 1=0+1) n(v4 rc2024-05-10T22:03:00.876094+0000 b1286564962 rs6 66878=63038+3840)/\
n(v0 rc2024-05-10T21:38:23.018285+0000 1=0+1) old_inodes=6 | dirfrag=1 dirty=1 0x557a01aa7080]
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal EMetaBlob.replay noting snaptable transaction 83
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.tableclient(snaptable) got_journaled_agree 83
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.snapclient notify_commit tid 83
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal EMetaBlob.replay request client.15360:248 trim_to 247

OP_COMMIT (snap server):

2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.log _replay 42774927~64 / 43141352 2024-05-10T22:03:00.896226+0000: ETableServer snaptable commit tid 83 version 84
2024-05-10T22:03:01.153+0000 7fdf6c400700 10 mds.0.journal  ETableServer.replay snaptable commit event 84 - 1 == table 83
2024-05-10T22:03:01.153+0000 7fdf6c400700  7 mds.0.snap commit 83 destroy 16 seq 23

That's the end of the log segment (note: no OP_ACK yet). Later, log segment LogSegment(10205/0x285be10 events=21) is trimmed, since the expire position has moved past the end of that segment. From MDLog::standby_trim_segments():

    if (ls->end > expire_pos) {
      dout(10) << " won't remove, not expired!" << dendl;
      break;
    }

    if (segments.size() == 1) {
      dout(10) << " won't remove, last segment!" << dendl;
      break;
    }

    dout(10) << " removing segment" << dendl;
    mds->mdcache->standby_trim_segment(ls);
    remove_oldest_segment();
    if (pre_segments_size > 0) {
      --pre_segments_size;
    }
    removed_segment = true;
  }

So, the log segment is removed and trimmed:

2024-05-10T22:03:03.159+0000 7fdf6e200700 20 mds.0.0 _standby_replay_restart_finish: old_read_pos=43141352 trimmed_pos=41943040
2024-05-10T22:03:03.159+0000 7fdf6e200700 10 mds.0.log standby_trim_segments
2024-05-10T22:03:03.159+0000 7fdf6e200700 10 mds.0.log  expire_pos=42775011
2024-05-10T22:03:03.159+0000 7fdf6e200700 10 mds.0.log  maybe trim LogSegment(10205/0x285be10 events=28)
2024-05-10T22:03:03.159+0000 7fdf6e200700 10 mds.0.log  removing segment

and the journal is read again to replay events:

2024-05-10T22:03:03.159+0000 7fdf6e200700 10 mds.0.log  maybe trim LogSegment(10233/0x28cb1e3 events=1)
2024-05-10T22:03:03.159+0000 7fdf6e200700 10 mds.0.log  won't remove, not expired!
2024-05-10T22:03:03.159+0000 7fdf6e200700 20 mds.0.log  calling mdcache->trim!
2024-05-10T22:03:03.159+0000 7fdf6e200700  5 mds.0.cache trim bytes_used=2MB limit=4GB reservation=0.05% count=0
2024-05-10T22:03:03.159+0000 7fdf6e200700  7 mds.0.cache trim_lru trimming 0 items from LRU size=541 mid=158 pintail=0 pinned=290
2024-05-10T22:03:03.159+0000 7fdf6e200700 20 mds.0.cache bottom_lru: 0 items, 0 top, 0 bot, 0 pintail, 0 pinned
2024-05-10T22:03:03.159+0000 7fdf6e200700 20 mds.0.cache lru: 541 items, 158 top, 383 bot, 0 pintail, 290 pinned
2024-05-10T22:03:03.159+0000 7fdf6e200700  7 mds.0.cache trim_lru trimmed 0 items
2024-05-10T22:03:03.159+0000 7fdf6e200700  2 mds.0.0 Booting: 2: replaying mds log

The OP_ACK event (from RMSNAP) is in this log segment:

2024-05-10T22:03:03.160+0000 7fdf6c400700 10 mds.0.log _replay 43141352~40 / 43141412 2024-05-10T22:03:00.994052+0000: ETableClient snaptable ack tid 83
2024-05-10T22:03:03.160+0000 7fdf6c400700 10 mds.0.journal  ETableClient.replay snaptable op ack tid 83
2024-05-10T22:03:03.160+0000 7fdf6c400700 10 mds.0.tableclient(snaptable) got_journaled_ack 83

However, the log segment which noted the tid (83) via got_journaled_agree has already been removed and trimmed. This should be addressed by the MDLog refactoring change https://github.com/ceph/ceph/pull/53086, which makes LogSegment a shared pointer. Interestingly, that's what I'm reviewing this week :)

Actions #16

Updated by Konstantin Shalygin about 1 year ago

  • Backport changed from quincy,reef to reef
Actions #17

Updated by Patrick Donnelly 9 months ago

  • Target version deleted (v20.0.0)
Actions #18

Updated by Venky Shankar 7 months ago

  • Category set to Correctness/Safety
  • Status changed from New to Resolved
  • Target version set to v21.0.0
  • Backport changed from reef to tentacle
  • Pull request ID set to 62554
Actions #19

Updated by Venky Shankar 7 months ago

  • Related to Bug #70723: qa: AddressSanitizer reports heap-use-after-free in mds-log-replay thread added
Actions #20

Updated by Upkeep Bot 7 months ago

  • Merge Commit set to 742cb595719d5cc4c96edeefdff5d19009b9e506
  • Fixed In set to v20.3.0-1890-g742cb59571
  • Upkeep Timestamp set to 2025-08-12T05:09:42+00:00