mds: batch backtrace updates by pool-id when expiring a log segment #55421

Merged
vshankar merged 4 commits into ceph:main from vshankar:wip-63259
Sep 17, 2024

Conversation

@vshankar
Contributor

@vshankar vshankar commented Feb 2, 2024

Otherwise, a backtrace update failure due to a removed data pool would cause the entire batch to be considered a failed backtrace update (depending on when the first failure happens), thereby causing the MDS to go read-only when the error (-ENOENT) is trickled up for a backtrace update for the metadata pool or an undeleted data pool.

Fixes: http://tracker.ceph.com/issues/63259

NOTE - this hasn't been tested yet - will get to that in a while.

@lxbsz - this change does away with the vector resize bits; I think we can bring that back in. Question: was that (vector resize) done because you expected the MDS to spend much time in that loop (holding the mds_lock), so preallocating space would lower the time spent?
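For reference, the core idea - grouping commit operations by the data pool they target so that each per-pool batch gets its own completion - can be sketched minimally. `CommitOp` and `batch_by_pool` below are simplified stand-ins, not the actual Ceph types:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Simplified stand-in for CInodeCommitOperations (not the real Ceph type).
struct CommitOp {
  int64_t pool_id;
};

// Group commit operations by the data pool they target. Each per-pool
// batch can then be dispatched with its own completion context, so an
// -ENOENT from a removed pool no longer marks the whole batch failed.
std::map<int64_t, std::vector<CommitOp>>
batch_by_pool(const std::vector<CommitOp>& ops) {
  std::map<int64_t, std::vector<CommitOp>> batches;
  for (const auto& op : ops)
    batches[op.pool_id].push_back(op);
  return batches;
}
```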


@vshankar vshankar added the cephfs Ceph File System label Feb 2, 2024
@vshankar vshankar requested a review from a team February 2, 2024 05:03
@lxbsz
Member

lxbsz commented Feb 21, 2024

Otherwise, a backtrace update failure due to a removed data pool would cause the entire batch to be considered a failed backtrace update (depending on when the first failure happens), thereby causing the MDS to go read-only when the error (-ENOENT) is trickled up for a backtrace update for the metadata pool or an undeleted data pool.

Fixes: http://tracker.ceph.com/issues/63259

NOTE - this hasn't been tested yet - will get to that in a while.

@lxbsz - this change does away with the vector resize bits; I think we can bring that back in. Question: was that (vector resize) done because you expected the MDS to spend much time in that loop (holding the mds_lock), so preallocating space would lower the time spent?

@vshankar I think the resize() could be removed. We have already reserved enough space before the loop.

Member

@lxbsz lxbsz left a comment


LGTM.

@vshankar vshankar requested a review from a team February 21, 2024 09:25
Member

@batrick batrick left a comment


It's not clear to me from reading the patch how "thereby causing the MDS
to go read-only when the error (-ENOENT) is trickled up for a backtrace
update for the metadata pool or an undeleted data pool." is avoided by this change. Could you explain in a comment?

@vshankar
Contributor Author

It's not clear to me from reading the patch how "thereby causing the MDS to go read-only when the error (-ENOENT) is trickled up for a backtrace update for the metadata pool or an undeleted data pool." is avoided by this change. Could you explain in a comment?

Sure. Will update the change explaining the fix.

@vshankar
Contributor Author

Dropped from testing till #55421 (comment) gets fixed.

@vshankar
Contributor Author

jenkins test api

Member

@batrick batrick left a comment


I think a test should be somewhat trivial to synthesize, no?

  • create dir with ceph.dir.layout.pool == some-new-pool
  • create empty file in that dir
  • flush mds journal
  • set ceph.file.layout.pool == some-new-pool2 on the file
  • restore default layout for dir
  • delete some-new-pool
  • flush the mds log (would fail before but now does not)

? I don't think a failover is required?

You can use ceph-dencoder of course to verify the backtrace updates.

// dispatch separate ops for backtrace updates for old pools
in->store_backtrace(ops_vec_map[pool_id].back(), op_prio, true);
for (auto p : in->get_inode()->old_pools) {
in->store_backtrace(ops_vec_map[p].back(), op_prio, true);
Member


Suggested change
in->store_backtrace(ops_vec_map[p].back(), op_prio, true);
ops_vec_map[p].push_back(CInodeCommitOperations());
in->store_backtrace(ops_vec_map[p].back(), op_prio, true);

?

Contributor Author


I see what you mean to suggest. It's done this way for the current pool - it can be the same for old_pools.
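For illustration, the append-then-fill pattern from the suggestion can be shown with a stand-alone sketch. `CommitOp` and `add_backtrace_op` are simplified stand-ins for the real types:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for CInodeCommitOperations.
struct CommitOp {
  int64_t pool = -1;
};

// Append a fresh element first, then fill it through back(). Without the
// push_back, every old pool would overwrite the same (last) element.
void add_backtrace_op(std::vector<CommitOp>& ops, int64_t pool) {
  ops.push_back(CommitOp());
  ops.back().pool = pool;
}
```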

vshankar added a commit to vshankar/ceph that referenced this pull request Apr 12, 2024
* refs/pull/55421/head:
	mds: batch backtrace updates by pool-id when expiring a log segment

Reviewed-by: Xiubo Li <xiubli@redhat.com>
@vshankar
Contributor Author

I think a test should be somewhat trivial to synthesize, no?

Makes sense. I didn't bother adding a test since this issue was pretty much getting reproduced on fs suite test branches. I'll add a test and update.

  • create dir with ceph.dir.layout.pool == some-new-pool
  • create empty file in that dir
  • flush mds journal
  • set ceph.file.layout.pool == some-new-pool2 on the file
  • restore default layout for dir
  • delete some-new-pool
  • flush the mds log (would fail before but now does not)

? I don't think a failover is required?

Correct.

You can use ceph-dencoder of course to verify the backtrace updates.

vshankar added a commit to vshankar/ceph that referenced this pull request Apr 30, 2024
* refs/pull/55421/head:
	mds: batch backtrace updates by pool-id when expiring a log segment

Reviewed-by: Xiubo Li <xiubli@redhat.com>
@github-actions github-actions bot added the tests label May 9, 2024
@vshankar
Contributor Author

This change is ready for re-review. I changed the implementation a bit since the old implementation was buggy - backtrace updates to old pools were not dispatched. So now, journal.cc prepares a separate CInodeCommitOperations() instance for each old pool of an inode and uses the backtrace generated for the default data pool. Also, I added a check for STATE_DIRTYPOOL (as done in CInode::_store_backtrace) since layout and backtraces to old pools should not be updated under certain circumstances (parent changing, etc.).

Note: The idea of dispatching per-pool backtrace updates still remains the same.
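A rough model of the described scheme, with hypothetical names (the real logic lives in journal.cc and uses CInode/CInodeCommitOperations):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Simplified stand-ins, not the real Ceph types.
struct CommitOp {
  int64_t pool;
  uint64_t ino;
};

struct InodeInfo {
  uint64_t ino;
  int64_t pool;                    // default data pool
  std::vector<int64_t> old_pools;
  bool dirty_pool;                 // models STATE_DIRTYPOOL
};

// Prepare one CommitOp per pool for this inode: always for the default
// pool, and additionally for each old pool - but only when the inode is
// flagged dirty-pool, mirroring the check in CInode::_store_backtrace.
void prepare_ops(const InodeInfo& in,
                 std::map<int64_t, std::vector<CommitOp>>& ops_map) {
  ops_map[in.pool].push_back({in.pool, in.ino});
  if (in.dirty_pool) {
    for (int64_t p : in.old_pools)
      ops_map[p].push_back({p, in.ino});
  }
}
```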

@vshankar
Contributor Author

vshankar commented Jun 5, 2024

@lxbsz fixed and updated.

@vshankar
Contributor Author

vshankar commented Jun 6, 2024

jenkins test make check

@vshankar
Contributor Author

vshankar commented Jun 6, 2024

jenkins test make check arm64

Member

@batrick batrick left a comment


base for this branch is weird, it's not on main?

otherwise lgtm

@vshankar
Contributor Author

base for this branch is weird, it's not on main?

It should be. I fixed and refreshed the change.

@vshankar
Contributor Author

This PR is under test in https://tracker.ceph.com/issues/66521.

@vshankar
Contributor Author

vshankar commented Jul 4, 2024

@vshankar
Contributor Author

vshankar commented Jul 8, 2024

This change seems to be a bit buggy and is causing failures like https://pulpito.ceph.com/vshankar-2024-06-30_16:43:29-fs-wip-vshankar-testing-20240628.170835-debug-testing-default-smithi/7779994/

I take this back. The issue is in another change. See: #54725 (comment)

(back into test branch)

@vshankar
Contributor Author

I see one failed test job with a similar effect - the MDS going read-only. Deferring the merge till it's investigated.

/a/vshankar-2024-07-08_07:21:13-fs-wip-vshankar-testing-20240705.150505-debug-testing-default-smithi/)/7791798

@vshankar
Contributor Author

I see one failed test job with a similar effect - the MDS going read-only. Deferring the merge till it's investigated.

/a/vshankar-2024-07-08_07:21:13-fs-wip-vshankar-testing-20240705.150505-debug-testing-default-smithi/)/7791798

Some osd_op's are returning -2 (ENOENT) for commit operations sent by the MDS.

./remote/smithi135/log/ec2cb1ec-3e28-11ef-bcac-c7b262605968/ceph-osd.3.log.gz:2024-07-09T19:51:18.690+0000 7f15b3c00640 20 osd.3 pg_epoch: 102 pg[2.d( v 102'772 (0'0,102'772] local-lis/les=75/76 n=93 ec=75/75 lis/c=75/75 les/c/f=76/76/0 sis=75) [5,10,3] r=2 lpr=75 luod=0'0 lua=101'771 crt=102'772 lcod 101'771 mlcod 96'754 active mbc={}] rollforward: entry=101'771 (0'0) error    2:b5eee495:::100000058fa.00000000:head by mds.0.14:129792 0.000000 -2 [r=-2+0b] ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
./remote/smithi135/log/ec2cb1ec-3e28-11ef-bcac-c7b262605968/ceph-osd.3.log.gz:2024-07-09T19:51:18.693+0000 7f15b5000640 20 osd.3 pg_epoch: 102 pg[4.1bs1( v 102'2886 (0'0,102'2886] local-lis/les=84/85 n=1323 ec=84/84 lis/c=84/84 les/c/f=85/85/0 sis=84) [4,3,9,5]p4(0) r=1 lpr=84 luod=0'0 lua=101'2885 crt=101'2884 lcod 101'2885 mlcod 96'2882 active mbc={}] rollforward: entry=101'2884 (0'0) error    4:dbab607a:dirns::10000005908.00000019:head by mds.0.14:129838 0.000000 -2 [r=-2+0b] ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
./remote/smithi135/log/ec2cb1ec-3e28-11ef-bcac-c7b262605968/ceph-osd.3.log.gz:2024-07-09T19:51:18.697+0000 7f15b0000640 20 osd.3 pg_epoch: 102 pg[4.1bs1( v 102'2886 (0'0,102'2886] local-lis/les=84/85 n=1323 ec=84/84 lis/c=84/84 les/c/f=85/85/0 sis=84) [4,3,9,5]p4(0) r=1 lpr=84 luod=0'0 crt=102'2886 mlcod 101'2884 active mbc={}] rollforward: entry=101'2885 (0'0) error    4:dae0eac8:dirns::10000005908.00000020:head by mds.0.14:129845 0.000000 -2 [r=-2+0b] ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
./remote/smithi135/log/ec2cb1ec-3e28-11ef-bcac-c7b262605968/ceph-osd.9.log.gz:2024-07-09T19:51:18.692+0000 7fc2ac800640 20 osd.9 pg_epoch: 102 pg[4.1bs2( v 102'2886 (0'0,102'2886] local-lis/les=84/85 n=1323 ec=84/84 lis/c=84/84 les/c/f=85/85/0 sis=84) [4,3,9,5]p4(0) r=2 lpr=84 luod=0'0 lua=101'2885 crt=101'2884 lcod 101'2885 mlcod 96'2882 active mbc={}] rollforward: entry=101'2884 (0'0) error    4:dbab607a:dirns::10000005908.00000019:head by mds.0.14:129838 0.000000 -2 [r=-2+0b] ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
./remote/smithi135/log/ec2cb1ec-3e28-11ef-bcac-c7b262605968/ceph-osd.9.log.gz:2024-07-09T19:51:18.697+0000 7fc2a7800640 20 osd.9 pg_epoch: 102 pg[4.1bs2( v 102'2886 (0'0,102'2886] local-lis/les=84/85 n=1323 ec=84/84 lis/c=84/84 les/c/f=85/85/0 sis=84) [4,3,9,5]p4(0) r=2 lpr=84 luod=0'0 crt=102'2886 mlcod 101'2884 active mbc={}] rollforward: entry=101'2885 (0'0) error    4:dae0eac8:dirns::10000005908.00000020:head by mds.0.14:129845 0.000000 -2 [r=-2+0b] ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
./remote/smithi163/log/ec2cb1ec-3e28-11ef-bcac-c7b262605968/ceph-osd.5.log.gz:2024-07-09T19:51:18.690+0000 7f02deb59640 20 osd.5 pg_epoch: 102 pg[2.d( v 102'772 (0'0,102'772] local-lis/les=75/76 n=93 ec=75/75 lis/c=75/75 les/c/f=76/76/0 sis=75) [5,10,3] r=0 lpr=75 luod=101'771 lua=101'771 crt=102'772 lcod 96'770 mlcod 96'770 active+clean] rollforward: entry=101'771 (0'0) error    2:b5eee495:::100000058fa.00000000:head by mds.0.14:129792 0.000000 -2 [r=-2+0b] ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
./remote/smithi163/log/ec2cb1ec-3e28-11ef-bcac-c7b262605968/ceph-osd.5.log.gz:2024-07-09T19:51:18.694+0000 7f02dfb5b640 20 osd.5 pg_epoch: 102 pg[4.1bs3( v 102'2886 (0'0,102'2886] local-lis/les=84/85 n=1323 ec=84/84 lis/c=84/84 les/c/f=85/85/0 sis=84) [4,3,9,5]p4(0) r=3 lpr=84 luod=0'0 lua=101'2885 crt=101'2884 lcod 101'2885 mlcod 96'2882 active mbc={}] rollforward: entry=101'2884 (0'0) error    4:dbab607a:dirns::10000005908.00000019:head by mds.0.14:129838 0.000000 -2 [r=-2+0b] ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
./remote/smithi163/log/ec2cb1ec-3e28-11ef-bcac-c7b262605968/ceph-osd.5.log.gz:2024-07-09T19:51:18.698+0000 7f02dbb53640 20 osd.5 pg_epoch: 102 pg[4.1bs3( v 102'2886 (0'0,102'2886] local-lis/les=84/85 n=1323 ec=84/84 lis/c=84/84 les/c/f=85/85/0 sis=84) [4,3,9,5]p4(0) r=3 lpr=84 luod=0'0 crt=102'2886 mlcod 101'2884 active mbc={}] rollforward: entry=101'2885 (0'0) error    4:dae0eac8:dirns::10000005908.00000020:head by mds.0.14:129845 0.000000 -2 [r=-2+0b] ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
./remote/smithi160/log/ec2cb1ec-3e28-11ef-bcac-c7b262605968/ceph-osd.4.log.gz:2024-07-09T19:51:18.691+0000 7f236c2c9640 20 osd.4 pg_epoch: 102 pg[4.1bs0( v 102'2886 (0'0,102'2886] local-lis/les=84/85 n=1323 ec=84/84 lis/c=84/84 les/c/f=85/85/0 sis=84) [4,3,9,5]p4(0) r=0 lpr=84 luod=101'2885 lua=101'2885 crt=101'2884 lcod 101'2884 mlcod 101'2884 active+clean] rollforward: entry=101'2884 (0'0) error    4:dbab607a:dirns::10000005908.00000019:head by mds.0.14:129838 0.000000 -2 [r=-2+0b] ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
./remote/smithi160/log/ec2cb1ec-3e28-11ef-bcac-c7b262605968/ceph-osd.4.log.gz:2024-07-09T19:51:18.698+0000 7f236c2c9640 20 osd.4 pg_epoch: 102 pg[4.1bs0( v 102'2886 (0'0,102'2886] local-lis/les=84/85 n=1323 ec=84/84 lis/c=84/84 les/c/f=85/85/0 sis=84) [4,3,9,5]p4(0) r=0 lpr=84 crt=102'2886 lcod 101'2885 mlcod 101'2885 active+clean] rollforward: entry=101'2885 (0'0) error    4:dae0eac8:dirns::10000005908.00000020:head by mds.0.14:129845 0.000000 -2 [r=-2+0b] ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
./remote/smithi160/log/ec2cb1ec-3e28-11ef-bcac-c7b262605968/ceph-osd.10.log.gz:2024-07-09T19:51:18.690+0000 7fd231dac640 20 osd.10 pg_epoch: 102 pg[2.d( v 102'772 (0'0,102'772] local-lis/les=75/76 n=93 ec=75/75 lis/c=75/75 les/c/f=76/76/0 sis=75) [5,10,3] r=1 lpr=75 luod=0'0 lua=101'771 crt=102'772 lcod 101'771 mlcod 96'754 active mbc={}] rollforward: entry=101'771 (0'0) error    2:b5eee495:::100000058fa.00000000:head by mds.0.14:129792 0.000000 -2 [r=-2+0b] ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false

@vshankar
Contributor Author

vshankar commented Aug 5, 2024

I see one failed test job with a similar effect - the MDS going read-only. Deferring the merge till it's investigated.
/a/vshankar-2024-07-08_07:21:13-fs-wip-vshankar-testing-20240705.150505-debug-testing-default-smithi/)/7791798

Some osd_op's are returning -2 (ENOENT) for commit operations sent by the MDS.


This looks like a new bug and is unrelated to this change. I don't see any backtrace-related errors during segment expiry.

➜  7791798 pwd
/a/vshankar-2024-07-08_07:21:13-fs-wip-vshankar-testing-20240705.150505-debug-testing-default-smithi/7791798
➜  7791798 find . -name "ceph-mds*" | xargs zgrep "store backtrace error"

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
LogSegment::try_to_expire() batches backtrace updates for inodes in the
dirty_parent_inodes list. If a backtrace update operation fails for one
inode due to a missing (removed) data pool, that operation is specially
handled by treating it as a success; however, the errno (-ENOENT) is
stored by the gather context and passed on as the return value to
subsequent operations (even for successful backtrace update operations
in the same gather context).

Fixes: http://tracker.ceph.com/issues/63259
Signed-off-by: Venky Shankar <vshankar@redhat.com>
…a pool

Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar
Contributor Author

This PR is under test in https://tracker.ceph.com/issues/67711.
