Skip to content

mds: drop locks and retry when lock set changes#52522

Merged
vshankar merged 1 commit intoceph:mainfrom
batrick:i62052
Aug 31, 2023
Merged

mds: drop locks and retry when lock set changes#52522
vshankar merged 1 commit intoceph:mainfrom
batrick:i62052

Conversation

@batrick
Copy link
Member

@batrick batrick commented Jul 18, 2023

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@batrick batrick force-pushed the i62052 branch 2 times, most recently from 5d7949c to 6df48d8 Compare July 19, 2023 16:06
An optimization was added to avoid an unnecessary gather on the inode
filelock when the client can safely get the file size without also
getting issued the requested caps. However, if a retry of getattr
is necessary, this conditional inclusion of the inode filelock
can cause lock-order violations resulting in deadlock.

So, if we've already acquired some of the inode's locks then we must
drop locks and retry.

Fixes: https://tracker.ceph.com/issues/62052
Fixes: c822b3e
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick
Copy link
Member Author

batrick commented Jul 19, 2023

Pulled debugging commits out into: #52547

@batrick batrick requested a review from a team July 19, 2023 19:01
@batrick
Copy link
Member Author

batrick commented Aug 3, 2023

jenkins test make check arm64

@batrick
Copy link
Member Author

batrick commented Aug 3, 2023

jenkins test windows

@vshankar
Copy link
Contributor

vshankar commented Aug 9, 2023

jenkins test make check

@vshankar
Copy link
Contributor

@vshankar
Copy link
Contributor

jenkins test make check

@vshankar
Copy link
Contributor

Test runs in ~2h: https://shaman.ceph.com/builds/ceph/wip-vshankar-testing-20230814.052859/

Build did not go well.

@vshankar
Copy link
Contributor

@batrick
Copy link
Member Author

batrick commented Aug 16, 2023

jenkins test make check arm64

@vshankar
Copy link
Contributor

@batrick
Copy link
Member Author

batrick commented Aug 17, 2023

https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/

The run was messy - lot many failures due to POOL_APP_NOT_ENABLED warnings (26/52 failed jobs). Some fs related test failure which needs to be looked at are

#53037

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369749 ("test_filesystem_sync_stuck_for_around_5s")

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369756 ("test_volume_rm")

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369694 ("test_snapshot_remove") -- this should have been fixed by [mds: remove calculating caps after adding revokes back #52176](https://github.com/ceph/ceph/pull/52176), but it reappears.

Are you asking me to look at those more closely or ... ?

@vshankar
Copy link
Contributor

https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/

The run was messy - lot many failures due to POOL_APP_NOT_ENABLED warnings (26/52 failed jobs). Some fs related test failure which needs to be looked at are

#53037

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369749 ("test_filesystem_sync_stuck_for_around_5s")

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369756 ("test_volume_rm")

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369694 ("test_snapshot_remove") -- this should have been fixed by [mds: remove calculating caps after adding revokes back #52176](https://github.com/ceph/ceph/pull/52176), but it reappears.

Are you asking me to look at those more closely or ... ?

I'll have a look maybe tomorrow or next week, so I just updated here with the details so that I don't loose them. Feel free to peek and debug if time permits.

@vshankar
Copy link
Contributor

https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/

The run was messy - lot many failures due to POOL_APP_NOT_ENABLED warnings (26/52 failed jobs). Some fs related test failure which needs to be looked at are

#53037

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369749 ("test_filesystem_sync_stuck_for_around_5s")

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369756 ("test_volume_rm")

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369694 ("test_snapshot_remove") -- this should have been fixed by [mds: remove calculating caps after adding revokes back #52176](https://github.com/ceph/ceph/pull/52176), but it reappears.

Are you asking me to look at those more closely or ... ?

I'll have a look maybe tomorrow or next week, so I just updated here with the details so that I don't loose them. Feel free to peek and debug if time permits.

@batrick - I couldn't get to reviewing this last Friday. Would you be able to look at the failures (which seem a bit high as compared to other fs suite runs and this is the only PR in the branch)?

@batrick
Copy link
Member Author

batrick commented Aug 29, 2023

I will try to have a look tomorrow @vshankar

@batrick
Copy link
Member Author

batrick commented Aug 30, 2023

https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/

The run was messy - lot many failures due to POOL_APP_NOT_ENABLED warnings (26/52 failed jobs). Some fs related test failure which needs to be looked at are

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369749 ("test_filesystem_sync_stuck_for_around_5s")

Infrastructure noise I think.

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369756 ("test_volume_rm")

https://tracker.ceph.com/issues/62648

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369694 ("test_snapshot_remove") -- this should have been fixed by [mds: remove calculating caps after adding revokes back #52176](https://github.com/ceph/ceph/pull/52176), but it reappears.

https://tracker.ceph.com/issues/62580

I think it's a testing kernel bug.

Do you think this is good to merge otherwise?

batrick added a commit to batrick/ceph that referenced this pull request Aug 30, 2023
* refs/pull/52522/head:
	mds: drop locks and retry when lock set changes
@vshankar
Copy link
Contributor

https://tracker.ceph.com/issues/62580

I think it's a testing kernel bug.

ACK. Spoke to @lxbsz too regarding this.

@lxbsz
Copy link
Member

lxbsz commented Aug 31, 2023

https://tracker.ceph.com/issues/62580
I think it's a testing kernel bug.

ACK. Spoke to @lxbsz too regarding this.

Venky, it's a known issue and I have fixed it but not backport to Pacific yet.

@lxbsz
Copy link
Member

lxbsz commented Aug 31, 2023

jenkins test make check arm64

@vshankar
Copy link
Contributor

https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/

The run was messy - lot many failures due to POOL_APP_NOT_ENABLED warnings (26/52 failed jobs). Some fs related test failure which needs to be looked at are

* https://pulpito.ceph.com/vshankar-2023-08-16_10:55:44-fs-wip-vshankar-testing-20230816.054905-testing-default-smithi/7369749 ("test_filesystem_sync_stuck_for_around_5s")

Infrastructure noise I think.

yeh - ConnectionError: Failed to reconnect to smithi193

@lxbsz
Copy link
Member

lxbsz commented Aug 31, 2023

https://tracker.ceph.com/issues/62580
I think it's a testing kernel bug.

ACK. Spoke to @lxbsz too regarding this.

Venky, it's a known issue and I have fixed it but not backport to Pacific yet.

Sorry, It's just a similar issue, not the exactly the same one, but I don't think it's related to this PR. I will update the PR later for my finding.

@lxbsz
Copy link
Member

lxbsz commented Aug 31, 2023

https://tracker.ceph.com/issues/62580
I think it's a testing kernel bug.

ACK. Spoke to @lxbsz too regarding this.

Venky, it's a known issue and I have fixed it but not backport to Pacific yet.

Sorry, It's just a similar issue, not the exactly the same one, but I don't think it's related to this PR. I will update the PR later for my finding.

Fixed it in MDS side #53238.

@vshankar
Copy link
Contributor

Merging this in the interest of time - will prepare run wiki asap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants