Conversation
Force-pushed 9620f7b to d323023
case IMPORT_ACKING:
  ceph_assert(dir);
  // hrm. make this an ambiguous import, and wait for exporter recovery to disambiguate
this line says that the exporter will do the disambiguation. You mentioned "should reverse the process to clear the importing state if the exporter restarts at this time, but it doesn't do that actually". So an exporter once restarted, doesn't perform recovery and thus no disambiguation, is it?
Yes, it is. import_reverse can reverse the process when the state is IMPORT_ACKING. But import_reverse has a bug and we need to fix it.
There was a problem hiding this comment.
If that is the case then I feel rather than changing the way we handle the cases, we should address the bug in import_reverse.
There was a problem hiding this comment.
If that is the case then I feel rather than changing the way we handle the cases, we should address the bug in
import_reverse.
The bug in import_reverse is that it contains the code to handle state IMPORT_ACKING but it will never be executed because the state is modified to IMPORT_ABORTING at the beginning. Move stat.state = IMPORT_ABORTING to the end of import_reverse so that it can handle the state IMPORT_ACKING. And then we call import_reverse when the state is IMPORT_ACKING and the process needs to be reversed.
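To make that concrete, here is a minimal self-contained C++ sketch (hypothetical names, not the actual Migrator code) of the control-flow problem: assigning the new state at the top turns the IMPORT_ACKING branch into dead code, while moving the assignment to the end lets it run.

#include <iostream>

// Simplified stand-ins for the real import state machine.
enum ImportState { IMPORT_DISCOVERED, IMPORT_ACKING, IMPORT_ABORTING };

struct ImportStat { ImportState state = IMPORT_ACKING; };

// Buggy shape: the state is overwritten up front, so the switch below can
// never see IMPORT_ACKING and the ACKING-specific cleanup is unreachable.
void import_reverse_buggy(ImportStat &stat) {
  stat.state = IMPORT_ABORTING;   // clobbers the old state too early
  switch (stat.state) {
  case IMPORT_ACKING:
    std::cout << "clear importing session state\n";   // never reached
    break;
  default:
    break;
  }
}

// Fixed shape: inspect the old state first, then mark the import as aborting.
void import_reverse_fixed(ImportStat &stat) {
  switch (stat.state) {
  case IMPORT_ACKING:
    std::cout << "clear importing session state\n";   // now reachable
    break;
  default:
    break;
  }
  stat.state = IMPORT_ABORTING;   // moved to the end, as the fix proposes
}

int main() {
  ImportStat a, b;
  import_reverse_buggy(a);   // prints nothing
  import_reverse_fixed(b);   // prints the cleanup message
  return 0;
}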
I do not believe this change is correct. The comment is correct. It depends if EExport is logged by the exporter after failover to mark the importer as auth. If that event is not logged, then the exporter retains auth. This is cleaned up in:
Lines 3270 to 3272 in d146d2b
Note that the import may be finished otherwise! So yes, the subtree needs to be marked ambiguous here.
Your other change to fix import_reverse is correct though!
Yes, you are right. I'm sorry I didn't read this part of the code carefully before. And we only need to modify import_reverse to fix this bug, so I have removed this part of the change.
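To illustrate the dependency described above, here is a rough, self-contained C++ sketch (hypothetical types, not the real MDCache/Migrator API) of the resolve-time disambiguation: whether the importer keeps authority hinges on whether the exporter journaled EExport before it failed.

#include <iostream>

// exporter_journaled_eexport models whether EExport reached the exporter's
// journal before the failure.
struct AmbiguousImport { bool exporter_journaled_eexport; };

void resolve_ambiguous_import(const AmbiguousImport &imp) {
  if (imp.exporter_journaled_eexport) {
    // EExport made it to the exporter's journal: the export effectively
    // finished, so the importer is confirmed as the new authority.
    std::cout << "finish import: importer becomes auth\n";
  } else {
    // EExport was never journaled: the exporter retains authority after
    // recovery and the importer must reverse the ambiguous import.
    std::cout << "import_reverse: exporter retains auth\n";
  }
}

int main() {
  resolve_ambiguous_import({true});
  resolve_ambiguous_import({false});
  return 0;
}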
batrick left a comment
Few recommended changes:
- Test case should be a parent commit (i.e. before) the fix. This makes testing the fix before and after easier.
- `Add importing_count for session dump.` should be `mds: add importing_count to session dump`. This can go before the `qa:` commit so the test works.
qa/tasks/cephfs/test_exports.py
Outdated
def test_clear_session(self):
    source = 0
    target = 1
    kill = 9
Importantly to confirm this fix: also test kill = 11.
comment this magic number
I used more human readable variable names.
I think patrick meant to test both 9 and 11? I see it's now changed to 11.
I've added a new test case for it, but it cannot trigger this bug.
Line 1928 in 227ea20
The event will not be persisted yet by the time this assert is hit (the assert really should be before the journal entry is submitted, otherwise it's racy). Since EExport will not be logged by the exporter, the importer should be forced to reverse the import during recovery.
I thought so too until I tested it in my test environment. I found the replay log of EExport in the exporter's log after the test was done, which means that EExport had been logged. I tested it multiple times and got the same result every time. After I added sleep code in MDLog::_submit_thread and tested again, the EExport replay log no longer appeared in the exporter's log.
Right, that's what I meant by it being racy. The assert should really be before the journal entry submit.
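As a rough illustration of that race (a toy C++ simulation with hypothetical names, not the MDS code): once the event has been handed to a background submit thread, a kill point placed afterwards cannot guarantee the event is absent from the journal, whereas evaluating the kill point before the hand-off can.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<bool> journaled{false};

// Stand-in for MDLog's background submit thread flushing a queued event.
void background_flush() {
  std::this_thread::sleep_for(std::chrono::milliseconds(1));
  journaled = true;
}

int main() {
  // Racy ordering (the current code): the event is queued for the journal
  // first and the kill point fires afterwards, so the background thread may
  // already have made the event durable by then.
  std::thread flusher(background_flush);
  std::this_thread::sleep_for(std::chrono::milliseconds(2));   // "assert"/abort happens here
  flusher.join();
  std::cout << "event durable despite the kill point: " << journaled << "\n";

  // Safe ordering (the proposed fix): evaluate the kill point before handing
  // the event to the journal, so nothing can be durable when the process dies.
  journaled = false;
  bool killed_at_kill_point = true;
  if (!killed_at_kill_point) {
    std::thread t(background_flush);
    t.join();
  }
  std::cout << "event durable with the reordered kill point: " << journaled << "\n";
  return 0;
}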
(You should fix that in this PR!)
OK, I've added a new commit to fix it.
Force-pushed d323023 to 27008e3
Your third commit is empty, the code to fix …
OK, I've modified them.
Force-pushed c396967 to aa84a75
I have fixed them, thanks!
Minor nit: the actual fix commit is after the testcase, mind shuffling it?
I adjusted the order because of #51766 (review), or do I have a wrong understanding of this?
Force-pushed 67c67f7 to 84f15a1
qa/tasks/cephfs/test_exports.py
Outdated
        self.fs.rank_asok(['export', 'dir', path, str(target)], rank=source, status=self.status)

    def _wait_laggy(self, rank):
        self.wait_until_true(lambda: "laggy_since" in self.fs.get_rank(rank=rank), timeout=self.fs.beacon_timeout)

Suggested change:
-        self.wait_until_true(lambda: "laggy_since" in self.fs.get_rank(rank=rank), timeout=self.fs.beacon_timeout)
+        self.wait_until_true(lambda: self.fs.status().hadfailover(self.status), timeout=self.fs.beacon_timeout)
If the rank is replaced then it won't be laggy.
Yes, it's a better way and I will use it.
qa/tasks/cephfs/test_exports.py
Outdated
        client_id = self.mount_a.get_global_id()
        self.mount_a.remount()

        # timeout if buggy

Suggested change:
-        # timeout if buggy
+        # timeout if buggy; s/evicted/removed/
the session is not actually evicted, just removed from the importer's session list.
qa/tasks/cephfs/test_exports.py
Outdated
        self.mount_a.remount()

        # timeout if buggy
        self.wait_until_evicted(client_id, importer_rank)
In my own tests, I get:
2023-06-20 15:41:32,451.451 INFO:__main__:======================================================================
2023-06-20 15:41:32,451.451 INFO:__main__:ERROR: test_clear_session (tasks.cephfs.test_exports.TestKillExports)
2023-06-20 15:41:32,451.451 INFO:__main__:----------------------------------------------------------------------
2023-06-20 15:41:32,451.451 INFO:__main__:Traceback (most recent call last):
2023-06-20 15:41:32,452.452 INFO:__main__: File "/home/pdonnell/scm/ceph/qa/tasks/cephfs/test_exports.py", line 643, in test_clear_session
2023-06-20 15:41:32,452.452 INFO:__main__: self.wait_until_evicted(client_id, importer_rank)
2023-06-20 15:41:32,452.452 INFO:__main__: File "/home/pdonnell/scm/ceph/qa/tasks/cephfs/cephfs_test_case.py", line 270, in wait_until_evicted
2023-06-20 15:41:32,452.452 INFO:__main__: self.wait_until_true(is_client_evicted, timeout)
2023-06-20 15:41:32,452.452 INFO:__main__: File "/home/pdonnell/scm/ceph/qa/tasks/ceph_test_case.py", line 223, in wait_until_true
2023-06-20 15:41:32,452.452 INFO:__main__: raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2023-06-20 15:41:32,452.452 INFO:__main__:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
Importer MDS shows:
2023-06-20T15:40:34.962-0400 7fbcf4389700 7 mds.1.mig import_reverse [dir 0x10000000000 /test/ [2,head] auth{0=1} v=3 cv=0/0 dir_auth=1,0 state=1610876931|complete|frozentree|importing f(v0 m2023-06-20T15:40:12.691087-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 1=0+1) hs=1+0,ss=0+0 dirty=1 | ptrwaiter=0 child=1 frozen=1 subtree=1 importing=1 replicated=1 dirty=1 0x55a9aa3b1b00]
2023-06-20T15:40:34.962-0400 7fbcf4389700 7 mds.1.cache adjust_subtree_auth 1,0 -> 0,-2 on [dir 0x10000000000 /test/ [2,head] auth{0=1} v=3 cv=0/0 dir_auth=1,0 state=1610874883|complete|frozentree f(v0 m2023-06-20T15:40:12.691087-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 1=0+1) hs=1+0,ss=0+0 dirty=1 | ptrwaiter=0 child=1 frozen=1 subtree=1 importing=0 replicated=1 dirty=1 0x55a9aa3b1b00]
2023-06-20T15:40:34.962-0400 7fbcf4389700 15 mds.1.cache show_subtrees
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache |____ 1 auth [dir 0x101 ~mds1/ [2,head] auth v=1 cv=1/1 dir_auth=1 state=1073741825|complete f(v0 10=0+10) n(v0 rc2023-06-20T15:40:08.896877-0400 10=0+10) hs=10+0,ss=0+0 | child=1 subtree=1 dirty=0 authpin=0 0x55a9a5526d80]
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache |_.__ 0 rep [dir 0x1 / [2,head] rep@0.1 REP dir_auth=0 state=0 f(v0 m2023-06-20T15:40:12.645693-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 2=0+2)/n(v0 rc2023-06-20T15:40:12.645693-0400 1=0+1) hs=1+0,ss=0+0 | child=1 subtree=1 0x55a9aa3b1200]
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache |__ 1, 0 auth [dir 0x10000000000 /test/ [2,head] auth{0=1} v=3 cv=0/0 dir_auth=1,0 state=1610874883|complete|frozentree f(v0 m2023-06-20T15:40:12.691087-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 1=0+1) hs=1+0,ss=0+0 dirty=1 | ptrwaiter=0 child=1 frozen=1 subtree=1 importing=0 replicated=1 dirty=1 0x55a9aa3b1b00]
2023-06-20T15:40:34.962-0400 7fbcf4389700 7 mds.1.cache current root is [dir 0x10000000000 /test/ [2,head] auth{0=1} v=3 cv=0/0 dir_auth=1,0 state=1610874883|complete|frozentree f(v0 m2023-06-20T15:40:12.691087-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 1=0+1) hs=1+0,ss=0+0 dirty=1 | ptrwaiter=0 child=1 frozen=1 subtree=1 importing=0 replicated=1 dirty=1 0x55a9aa3b1b00]
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache.dir(0x10000000000) setting dir_auth=0,-2 from 1,0 on [dir 0x10000000000 /test/ [2,head] auth{0=1} v=3 cv=0/0 dir_auth=1,0 state=1610874883|complete|frozentree f(v0 m2023-06-20T15:40:12.691087-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 1=0+1) hs=1+0,ss=0+0 dirty=1 | ptrwaiter=0 child=1 frozen=1 subtree=1 importing=0 replicated=1 dirty=1 0x55a9aa3b1b00]
2023-06-20T15:40:34.962-0400 7fbcf4389700 15 mds.1.cache show_subtrees
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache |____ 1 auth [dir 0x101 ~mds1/ [2,head] auth v=1 cv=1/1 dir_auth=1 state=1073741825|complete f(v0 10=0+10) n(v0 rc2023-06-20T15:40:08.896877-0400 10=0+10) hs=10+0,ss=0+0 | child=1 subtree=1 dirty=0 authpin=0 0x55a9a5526d80]
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache |_.__ 0 rep [dir 0x1 / [2,head] rep@0.1 REP dir_auth=0 state=0 f(v0 m2023-06-20T15:40:12.645693-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 2=0+2)/n(v0 rc2023-06-20T15:40:12.645693-0400 1=0+1) hs=1+0,ss=0+0 | child=1 subtree=1 0x55a9aa3b1200]
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache |__ 0 auth [dir 0x10000000000 /test/ [2,head] auth{0=1} v=3 cv=0/0 dir_auth=0 state=1610874883|complete|frozentree f(v0 m2023-06-20T15:40:12.691087-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 1=0+1) hs=1+0,ss=0+0 dirty=1 | ptrwaiter=0 child=1 frozen=1 subtree=1 importing=0 replicated=1 dirty=1 0x55a9aa3b1b00]
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache.dir(0x10000000000) mark_clean [dir 0x10000000000 /test/ [2,head] rep@0.1 dir_auth=0 state=537133059|complete|frozentree f(v0 m2023-06-20T15:40:12.691087-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 1=0+1) hs=1+0,ss=0+0 dirty=1 | ptrwaiter=0 child=1 frozen=1 subtree=1 importing=0 replicated=0 dirty=1 0x55a9aa3b1b00] version 3
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache.den(0x10000000000 export) mark_clean [dentry #0x1/test/export [2,head] rep@0.1 (dversion lock) v=2 ino=0x10000000001 state=536870912 | inodepin=1 replicated=0 dirty=1 0x55a9aa386500]
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache.ino(0x10000000001) mark_clean [inode 0x10000000001 [...2,head] /test/export/ rep@0.1 v2 DIRTYPARENT f() n(v0 1=0+1) (iauth excl) (inest lock) (ifile excl) (ixattr excl) (iversion lock) caps={5264=-/-@0},l=5264 | importingcaps=1 dirfrag=1 caps=1 dirtyparent=1 replicated=0 dirty=1 0x55a9aa3ebb80]
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache.ino(0x10000000001) clear_dirty_parent
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache.dir(0x10000000001) mark_clean [dir 0x10000000001 /test/export/ [2,head] rep@0.1 state=536870913|complete f() n() hs=0+0,ss=0+0 | replicated=0 dirty=1 0x55a9aa3b2400] version 1
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache.ino(0x10000000001) remove_client_cap last cap, leaving realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} last_modified 2023-06-20T15:39:57.664552-0400 change_attr 0 0x55a9aa281440)
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.openfiles remove_dirfrag [dir 0x10000000000 /test/ [2,head] rep@0.1 dir_auth=0 state=262147|complete|frozentree f(v0 m2023-06-20T15:40:12.691087-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 1=0+1) hs=1+0,ss=0+0 | ptrwaiter=0 child=1 frozen=1 subtree=1 importing=0 replicated=0 dirty=0 waiter=1 0x55a9aa3b1b00]
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache create_subtree_map 3 subtrees, 1 fullauth
2023-06-20T15:40:34.962-0400 7fbcf4389700 15 mds.1.cache show_subtrees
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache |____ 1 auth [dir 0x101 ~mds1/ [2,head] auth v=1 cv=1/1 dir_auth=1 state=1073741825|complete f(v0 10=0+10) n(v0 rc2023-06-20T15:40:08.896877-0400 10=0+10) hs=10+0,ss=0+0 | child=1 subtree=1 dirty=0 authpin=0 0x55a9a5526d80]
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache |_.__ 0 rep [dir 0x1 / [2,head] rep@0.1 REP dir_auth=0 state=0 f(v0 m2023-06-20T15:40:12.645693-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 2=0+2)/n(v0 rc2023-06-20T15:40:12.645693-0400 1=0+1) hs=1+0,ss=0+0 | child=1 subtree=1 0x55a9aa3b1200]
2023-06-20T15:40:34.962-0400 7fbcf4389700 10 mds.1.cache |__ 0 rep [dir 0x10000000000 /test/ [2,head] rep@0.1 dir_auth=0 state=3|complete|frozentree f(v0 m2023-06-20T15:40:12.691087-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 1=0+1) hs=1+0,ss=0+0 | ptrwaiter=0 child=1 frozen=1 subtree=1 importing=0 replicated=0 dirty=0 waiter=1 0x55a9aa3b1b00]
2023-06-20T15:40:34.962-0400 7fbcf4389700 15 mds.1.cache auth subtree [dir 0x101 ~mds1/ [2,head] auth v=1 cv=1/1 dir_auth=1 state=1073741825|complete f(v0 10=0+10) n(v0 rc2023-06-20T15:40:08.896877-0400 10=0+10) hs=10+0,ss=0+0 | child=1 subtree=1 dirty=0 authpin=0 0x55a9a5526d80]
2023-06-20T15:40:34.962-0400 7fbcf4389700 20 mds.1.journal EMetaBlob::add_dir_context final:
2023-06-20T15:40:34.962-0400 7fbcf4389700 15 mds.1.cache subtrees {0x101=[]}
2023-06-20T15:40:34.962-0400 7fbcf4389700 15 mds.1.cache ambiguous_subtrees
2023-06-20T15:40:34.962-0400 7fbcf4389700 7 mds.1.cache trim bytes_used=58kB limit=4GB reservation=0.05% count=1
2023-06-20T15:40:34.962-0400 7fbcf4389700 7 mds.1.cache trim_lru trimming 1 items from LRU size=12 mid=0 pintail=0 pinned=11
2023-06-20T15:40:34.962-0400 7fbcf4389700 12 mds.1.cache trim_dentry [dentry #0x1/test/export [2,head] rep@0.1 (dversion lock) v=2 ino=0x10000000001 state=0 0x55a9aa386500]
2023-06-20T15:40:34.962-0400 7fbcf4389700 12 mds.1.cache in container [dir 0x10000000000 /test/ [2,head] rep@0.1 dir_auth=0 state=3|complete|frozentree f(v0 m2023-06-20T15:40:12.691087-0400 1=0+1) n(v0 rc2023-06-20T15:40:12.691087-0400 1=0+1) hs=1+0,ss=0+0 | ptrwaiter=0 child=1 frozen=1 subtree=1 importing=0 replicated=0 dirty=0 waiter=1 0x55a9aa3b1b00]
2023-06-20T15:40:34.962-0400 7fbcf4389700 15 mds.1.cache trim_inode [inode 0x10000000001 [...2,head] /test/export/ rep@0.1 v2 f() n(v0 1=0+1) (iauth excl) (inest lock) (ifile excl) (ixattr excl) (iversion lock) 0x55a9aa3ebb80]
2023-06-20T15:40:34.962-0400 7fbcf4389700 7 mds.1.locker rdlock_try on (idft sync) on [inode 0x10000000001 [...2,head] /test/export/ rep@0.1 v2 f() n(v0 1=0+1) (iauth excl) (inest lock) (ifile excl) (ixattr excl) (iversion lock) 0x55a9aa3ebb80]
2023-06-20T15:40:34.962-0400 7fbcf4389700 15 mds.1.cache trim_dirfrag [dir 0x10000000001 /test/export/ [2,head] rep@0.1 state=1|complete f() n() hs=0+0,ss=0+0 0x55a9aa3b2400]
2023-06-20T15:40:34.962-0400 7fbcf4389700 12 mds.1.cache sending expire to mds.0 on [dir 0x10000000001 /test/export/ [2,head] rep@0.1 state=1|complete f() n() hs=0+0,ss=0+0 0x55a9aa3b2400]
2023-06-20T15:40:34.962-0400 7fbcf4389700 14 mds.1.cache.ino(0x10000000001) close_dirfrag *
2023-06-20T15:40:34.962-0400 7fbcf4389700 12 mds.1.cache.dir(0x10000000001) remove_null_dentries [dir 0x10000000001 /test/export/ [2,head] rep@0.1 state=1|complete f() n() hs=0+0,ss=0+0 0x55a9aa3b2400]
2023-06-20T15:40:34.962-0400 7fbcf4389700 12 mds.1.cache sending expire to mds.0 on [inode 0x10000000001 [...2,head] /test/export/ rep@0.1 v2 f() n(v0 1=0+1) (iauth excl) (inest lock) (ifile excl) (ixattr excl) (iversion lock) 0x55a9aa3ebb80]
2023-06-20T15:40:34.962-0400 7fbcf4389700 12 mds.1.cache.dir(0x10000000000) unlink_inode [dentry #0x1/test/export [2,head] rep@0.1 (dversion lock) v=2 ino=0x10000000001 state=0 0x55a9aa386500] [inode 0x10000000001 [...2,head] /test/export/ rep@0.1 v2 f() n(v0 1=0+1) (iauth excl) (inest lock) (ifile excl) (ixattr excl) (iversion lock) 0x55a9aa3ebb80]
2023-06-20T15:40:34.962-0400 7fbcf4389700 14 mds.1.cache remove_inode [inode 0x10000000001 [...2,head] #10000000001/ rep@-2.1 v2 f() n(v0 1=0+1) (iauth excl) (inest lock) (ifile excl) (ixattr excl) (iversion lock) 0x55a9aa3ebb80]
2023-06-20T15:40:34.962-0400 7fbcedb7c700 5 mds.1.log _submit_thread 4199216~41 : EImportFinish 0x10000000000 failed
So your patch worked but the session is not removed. Why did this work for you?
I only tested with the kernel client but forgot about the ceph-fuse client, and the failure was caused by another bug.
The importer will force open a session provided by the exporter but the client does not know about the new session until the exporter notifies it, and the notification cannot be sent if the exporter is interrupted. Since the client does not know about the new session in the importer, it does not send a message to the importer to close the session during unmounting. The client will be evicted by the importer after session_autoclose seconds (300 seconds by default) even if not unmounted.
I think the sessions forced open by the importer should be closed in import_reverse, and I'm trying to fix the bug in this way.
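A minimal sketch of that proposed cleanup (hypothetical types, not the real SessionMap/Migrator API): remember which sessions the import force-opened and drop them when the import is reversed, instead of letting them linger until session_autoclose evicts the client.

#include <iostream>
#include <map>
#include <set>

struct Session { /* placeholder for per-client session state */ };

std::map<int, Session> sessions;   // client id -> session on the importer
std::set<int> import_opened;       // sessions force-opened for this import

// The importer opens a session on the exporter's behalf; the client never
// learns about it if the exporter dies before sending the notification.
void force_open_session(int client) {
  sessions[client];
  import_opened.insert(client);
}

// Cleanup to run from import_reverse: close every session this import
// force-opened, since the client will never close them itself.
void import_reverse_close_sessions() {
  for (int client : import_opened) {
    sessions.erase(client);
    std::cout << "closed force-opened session for client " << client << "\n";
  }
  import_opened.clear();
}

int main() {
  force_open_session(5264);   // client id borrowed from the log above
  import_reverse_close_sessions();
  std::cout << "remaining sessions: " << sessions.size() << "\n";
  return 0;
}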
Force-pushed 84f15a1 to b50d16b
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
@zhsgao sorry this fell by the wayside. Are you willing to reopen?
Yes, I'm willing to reopen it.
done!
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
Force-pushed b50d16b to 3a02bb0
rebase
Force-pushed 3a02bb0 to 6ed6174
Appreciate you working on this PR @zhsgao; it unfortunately didn't get attention after a round of review from @batrick and @dparmar18. Will be revisiting this soon.
qa/tasks/cephfs/test_exports.py
Outdated
    def tearDown(self):
        super().tearDown()

    def _kill_export(self, rank, kill):
nit: suggest naming this as _kill_export_as (_kill_export could be a bit misleading).
I have modified it, thanks!
Force-pushed 6ed6174 to d578a1d
This PR is under test in https://tracker.ceph.com/issues/68744.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
@zhsgao This is ready for merging. Can you rebase the change please?
Signed-off-by: Zhansong Gao <zhsgao@hotmail.com>
…t's racy Signed-off-by: Zhansong Gao <zhsgao@hotmail.com>
…ask is interrupted Signed-off-by: Zhansong Gao <zhsgao@hotmail.com>
…btree task is interrupted while the state of importer is acking

The related sessions in the importer are in the importing state (`Session::is_importing` returns true) when the importer's state is `acking`. `Migrator::import_reverse`, called by `MDCache::handle_resolve`, should reverse the process and clear the importing state if the exporter restarts at this time, but it doesn't actually do that because of a bug. As a result, these sessions are not cleared when the client is unmounted (evicted or timed out) until the mds is restarted.

The bug in `import_reverse` is that it contains code to handle the `IMPORT_ACKING` state, but that code is never executed because the state is changed to `IMPORT_ABORTING` at the beginning. Move `stat.state = IMPORT_ABORTING` to the end of `import_reverse` so that it can handle the `IMPORT_ACKING` state.

Fixes: https://tracker.ceph.com/issues/61459

Signed-off-by: Zhansong Gao <zhsgao@hotmail.com>
…t subtree task is interrupted Signed-off-by: Zhansong Gao <zhsgao@hotmail.com>
The importer will force open some sessions provided by the exporter, but the client does not know about the new sessions until the exporter notifies it, and those notifications cannot be sent if the exporter is interrupted. The client does not regularly renew sessions it does not know about, so it will be evicted by the importer after `session_autoclose` seconds (300 seconds by default). The sessions that were force-opened in the importer need to be closed when the import process is reversed.

Signed-off-by: Zhansong Gao <zhsgao@hotmail.com>
Force-pushed d578a1d to 00b0711
OK, it's done.
The related sessions in the importer are in the importing state (`Session::is_importing` returns true) when the importer's state is `acking`. `Migrator::import_reverse`, called by `MDCache::handle_resolve`, should reverse the process and clear the importing state if the exporter restarts at this time, but it doesn't actually do that because of a bug. As a result, these sessions are not cleared when the client is unmounted (evicted or timed out) until the mds is restarted.

Fixes: https://tracker.ceph.com/issues/61459

Signed-off-by: Zhansong Gao zhsgao@hotmail.com