Revert "quincy: ceph_fs.h: add separate owner_{u,g}id fields" #54108
Conversation
Force-pushed from df24f7a to cca2217
This seems to be causing test failures:
https://pulpito.ceph.com/yuriw-2023-10-19_16:09:31-smoke-quincy-release-distro-default-smithi/7432399
Where the MDS crashes with
In function 'CInode* Server::prepare_new_inode(MDRequestRef&, CDir*, inodeno_t, unsigned int, const file_layout_t*)'
3449: FAILED ceph_assert(_inode->gid != (unsigned)-1)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x7f991631c89e]
2: /usr/lib/ceph/libceph-common.so.2(+0x27eab0) [0x7f991631cab0]
3: (Server::prepare_new_inode(boost::intrusive_ptr<MDRequestImpl>&, CDir*, inodeno_t, unsigned int, file_layout_t const*)+0x2791) [0x55563bec4831]
4: (Server::handle_client_mkdir(boost::intrusive_ptr<MDRequestImpl>&)+0x1ec) [0x55563bede88c]
Revert this change till the failure is fixed.
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Force-pushed from cca2217 to 71a526c
Test runs in ~3hrs.
ugh. So this failure is hard to reproduce and shows up only after many iterations, right? This assertion was added in my PR; before that, we had no analogous assertion in this place. I will look closely at the test code and try to understand what happens here.
jenkins test make check
Not really. We saw this in the smoke tests and not in the fs suite. So there is something in the smoke tests that's not in the fs suite tests - I'm going to audit that. We probably didn't see this in the main branch run since smoke tests are not run :/
ACK. Thx!
Ok, I have analyzed the ceph code and test code too. At this point I haven't found anything suspicious that can make But, there is one serious thing about this Probably we need to revert #53138 for now too. Speaking about smoke test failure, I wasn't able to understand how As it follows from test logs we have
Correct.
See my comment below - this failure isn't seen in main branch smoke test.
We have a core. With some effort we can set up an env and gdb the core to examine frames.
@mihalicyn I haven't looked at the failures closely, but the very recent smoke run on the main branch doesn't have this failure - https://pulpito.ceph.com/?branch=wip-vshankar-testing-20231018.065603 - it's possible that it's something in quincy that's tripping it off... Just FYI - I'm on vacation next week, so I won't be able to look much into this.
Ok, then I have to send a new PR to fix that and replace So we can replace Do I need to create a new issue in the ceph tracker, or can I send a PR and reuse the old https://tracker.ceph.com/issues/62217?
Great! It would be awesome to analyze the core.
It's very very interesting information...
Yep, it's possible. Buffer overflow somewhere, or something else. Ugh. :( By the way, how many smoke test runs did we have for the quincy branch? Only one? Or more?
Have a great vacation, Venky! I'll start by replacing the unsafe
Having ceph_assert calls in Server::prepare_new_inode is unsafe
when the assertion depends on client-provided data:
a malicious client can easily crash the MDS on purpose.
Let's replace the two ceph_assert calls for the owner_{u,g}id
fields with log messages.
See also:
ceph#54108
Fixes: commit 46cb244 ("ceph_fs.h: add separate owner_{u,g}id fields")
Fixes: https://tracker.ceph.com/issues/62217
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
I meant to say that the assert will obviously crash the mds, but I'm not sure whether the assert was incorrectly placed - that would require carefully looking at the original PR. I can do that when I'm back.
You could run this through smoke test in quincy and see what shows up - that would be interesting.
The automatic cron jobs that trigger smoke runs were disabled until it was run manually, and that's when we saw this. Those cron jobs have been enabled now, so we can see if the crash is reproducible.
Reviewed-by: Venky Shankar vshankar@redhat.com
mchangir left a comment
Reviewed-by: Milind Changire mchangir@redhat.com
I think I found where the issue is. The issue is that two recent protocol extensions, cbd7e30 and 46cb244, do not handle very old ceph clients. Let me show that. We have (https://github.com/ceph/ceph/blob/c672d7aab0240cefa8a3decdffd498e0ade3d5be/src/messages/MClientRequest.h#L230C13-L230C13): cc @lxbsz I'm ready to prepare fixes for both The reproducer in the smoke test includes Linux kernel 5.4, which uses a very old cephfs protocol with
Fixed in #54149
Reverts #53139