Bug #66581

open

client: if write call is waiting for caps it can starve subsequent read call(s)

Added by Dhairya Parmar almost 2 years ago. Updated 9 months ago.

Status:
Triaged
Priority:
High
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Development
Backport:
squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client, ceph-fuse
Labels (FS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

So this came up during a demonstration of the quiesce operation: a quiesce call on the root of the fs was initiated while a script was writing to a file in the root. As expected, the quiesce successfully blocked the writes, but it starved the subsequent read call too. To reproduce this (I'll follow the way Leonid did it): run a script that keeps writing to a file that will be quiesced, then in another terminal window run the quiesce op and try to read the file. It should go as follows:

terminal 0:

$ while date | tee -a ./mnt/foo ; do sleep 1 ; done

terminal 1:

$ ./bin/ceph tell mds.0 quiesce db / --timeout 5 --expiration 50
{
    "epoch": 5,
    "leader": 4249,
    "set_version": 13,
    "sets": {
        "ae2d6bc1": {
            "version": 13,
            "age_ref": 0.0,
            "state": {
                "name": "QUIESCING",
                "age": 0.0
            },
            "timeout": 5.0,
            "expiration": 50.0,
            "members": {
                "file:/": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 0.0
                    }
                }
            }
        }
    }
}
$ cat ./mnt/foo  # -----> stuck

If we check the client log, the client is waiting for caps:

client.admin.1377984.log:87629:2024-06-20T20:02:11.447+0530 7f8f62a006c0 10 client.4289 waiting for caps 0x10000000000.head(faked_ino=0 nref=12 ll_ref=727 cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=0,2=1} mode=100644 size=26030/4194304 nlink=1 btime=2024-06-20T19:39:29.065482+0530 mtime=2024-06-20T20:02:10.417351+0530 ctime=2024-06-20T20:02:10.417351+0530 change_attr=690 caps=pAsLsXsFscrb(0=pAsLsXsFscrb) objectset[0x10000000000 ts 0/0 objects 1 dirty_or_tx 0] parents=0x1.head["foo"] 0x7f8f34009aa0) need AsFw want Fb

So what happened is: when the root is quiesced, the client is deprived of the ability to write to any inode under the root, so the writes should stall, which is good. But the write call has acquired the client lock [0] and is sleeping while waiting for caps here [1], so when we try to read the file, the read call sees that the client lock (which I think is a shared/global lock) is already acquired and waits for it to be released. Technically this shouldn't happen, since the intention when writing this feature was always to allow reads to go through. The downstream QE team runs these kinds of tests daily and they pass, which suggests that kclient is fine and that it is only the fuse client that has this "sleep while locked" behavior. We need to bail the client out of this so it can perform the reads.

Therefore this has nothing to do with quiesce; quiesce is just a very easy way of creating this scenario.

[0] https://github.com/ceph/ceph/blob/ef2084806c6ba05662cfdabf93a8ac4b505fdbf9/src/client/Client.cc#L11449
[1] https://github.com/ceph/ceph/blob/ef2084806c6ba05662cfdabf93a8ac4b505fdbf9/src/client/Client.cc#L11535


Files

client.log.tar.gz (66.5 KB) client.log.tar.gz Dhairya Parmar, 08/29/2024 01:04 PM
active_mds.log.tar.gz (240 KB) active_mds.log.tar.gz Dhairya Parmar, 08/29/2024 01:05 PM
Actions #1

Updated by Venky Shankar over 1 year ago

  • Target version set to v20.0.0
  • Backport set to squid
  • Component(FS) ceph-fuse added
Actions #2

Updated by Venky Shankar over 1 year ago

  • Status changed from New to Triaged
  • Assignee set to Dhairya Parmar
  • Priority changed from Normal to High
Actions #3

Updated by Patrick Donnelly over 1 year ago

Dhairya Parmar wrote:

So this came up while quiesce operation demonstration where a quiesce call on the root of fs was initiated while a script was writing to a file in the root which as expected successfully blocks the writes but it starved the subsequent read call too. So to reproduce this(I'll follow the way leonid did it), run a script to keep writing to a file that will be quiseced, in another terminal window run the quiesce op and then try to read the file. It should be as follows:

terminal 0:
[...]

terminal 1:
[...]

if we check the client log, it is waiting for the caps:
[...]

so what happened is when the root is quiesced, client is deprived from doing any writes to the inodes under root, so the writes should stall which is good but the write call has acquired the client lock [0] and is sleeping on it while waiting for caps here [1] so when trying to read the file, the read call sees that the client lock(which I think is a shared/global lock) is already acquired, it waits for it be released.

write does not, nor does any typical operation, sleep with the client_lock held.

Technically this shouldn't happen since the intention when writing this feature was always to allow the reads to go through. Downstream QE team run these kind of tests daily, and they are running fine which asserts that kclient is good therefore it is only the fuse client that has this "sleep while locked" happening. We need to bail out client from this to perform the reads.

A read is not guaranteed to immediately process even with the recent enhancements to the quiesce protocol. Consider:

client A: has pFrwbc (and is doing buffered writes)
client B: has p
mds quiesces
client A: now has pFrbc
client B: getattr pFrc -> blocks on flush of client A's Fb

Without forcing client A to flush its buffers as part of the quiesce protocol (which would be expensive in the general case), we cannot permit client B to read the file.

For a single client case, this should not be an issue unless a write requesting Frwbc caps is somehow blocking a subsequent read on the same client. That's probably the issue you're encountering.

Actions #4

Updated by Venky Shankar over 1 year ago

Dhairya Parmar wrote:

So this came up while quiesce operation demonstration where a quiesce call on the root of fs was initiated while a script was writing to a file in the root which as expected successfully blocks the writes but it starved the subsequent read call too. So to reproduce this(I'll follow the way leonid did it), run a script to keep writing to a file that will be quiseced, in another terminal window run the quiesce op and then try to read the file. It should be as follows:

terminal 0:
[...]

terminal 1:
[...]

if we check the client log, it is waiting for the caps:
[...]

so what happened is when the root is quiesced, client is deprived from doing any writes to the inodes under root, so the writes should stall which is good but the write call has acquired the client lock [0] and is sleeping on it while waiting for caps here [1]

When get_caps() waits for a cap from the MDS, it'll call wait_on_context_list() which would wait on a condvar (on stack) dropping client_lock.

so when trying to read the file, the read call sees that the client lock(which I think is a shared/global lock) is already acquired, it waits for it be released.

I don't think this is what's causing the read call to block. Can you attach client debug logs please? @Dhairya Parmar

Actions #5

Updated by Dhairya Parmar over 1 year ago

Patrick Donnelly wrote in #note-3:

Dhairya Parmar wrote:

So this came up while quiesce operation demonstration where a quiesce call on the root of fs was initiated while a script was writing to a file in the root which as expected successfully blocks the writes but it starved the subsequent read call too. So to reproduce this(I'll follow the way leonid did it), run a script to keep writing to a file that will be quiseced, in another terminal window run the quiesce op and then try to read the file. It should be as follows:

terminal 0:
[...]

terminal 1:
[...]

if we check the client log, it is waiting for the caps:
[...]

so what happened is when the root is quiesced, client is deprived from doing any writes to the inodes under root, so the writes should stall which is good but the write call has acquired the client lock [0] and is sleeping on it while waiting for caps here [1] so when trying to read the file, the read call sees that the client lock(which I think is a shared/global lock) is already acquired, it waits for it be released.

write does not , nor does any typical operation, sleep with the client_lock.

By "sleep" I meant that the write acquired the lock, quiesce revoked the write cap, and so the client was waiting for caps at `get_cap_ref` here https://github.com/ceph/ceph/blob/ef2084806c6ba05662cfdabf93a8ac4b505fdbf9/src/client/Client.cc#L11535 and that is why the subsequent read stalled.

Technically this shouldn't happen since the intention when writing this feature was always to allow the reads to go through. Downstream QE team run these kind of tests daily, and they are running fine which asserts that kclient is good therefore it is only the fuse client that has this "sleep while locked" happening. We need to bail out client from this to perform the reads.

A read is not guaranteed to immediately process even with the recent enhancements to the quiesce protocol. Consider:

Oh, from Leonid's syncup he was pretty sure that this is an issue, i.e. he never mentioned that reads are not guaranteed 0_o

client A: has pFrwbc (and is doing buffered writes)
client B: has p
mds quiesces
client A: now has pFrbc
client B: getattr pFrc -> blocks on flush of client A's Fb

Without forcing client A to flush its buffers as part of the quiesce protocol (which would be expensive in the general case), we cannot permit client B to read the file.

For a single client case, this should not be an issue unless a write requesting Frwbc caps is somehow blocking a subsequent read on the same client. That's probably the issue you're encountering.

Right, it's a single-client case.

Actions #6

Updated by Dhairya Parmar over 1 year ago

Venky Shankar wrote in #note-4:

Dhairya Parmar wrote:

So this came up while quiesce operation demonstration where a quiesce call on the root of fs was initiated while a script was writing to a file in the root which as expected successfully blocks the writes but it starved the subsequent read call too. So to reproduce this(I'll follow the way leonid did it), run a script to keep writing to a file that will be quiseced, in another terminal window run the quiesce op and then try to read the file. It should be as follows:

terminal 0:
[...]

terminal 1:
[...]

if we check the client log, it is waiting for the caps:
[...]

so what happened is when the root is quiesced, client is deprived from doing any writes to the inodes under root, so the writes should stall which is good but the write call has acquired the client lock [0] and is sleeping on it while waiting for caps here [1]

When get_caps() waits for a cap from the MDS, it'll call wait_on_context_list() which would wait on a condvar (on stack) dropping client_lock.

void Client::wait_on_context_list(list<Context*>& ls)
{
  ceph::condition_variable cond;
  bool done = false;
  int r;
  ls.push_back(new C_Cond(cond, &done, &r));
  std::unique_lock l{client_lock, std::adopt_lock};
  cond.wait(l, [&done] { return done;});
  l.release();
}

The l.release() would mean releasing ownership of the client_lock without unlocking it (i.e. retaining its locked state), no?

so when trying to read the file, the read call sees that the client lock(which I think is a shared/global lock) is already acquired, it waits for it be released.

I don't think this is what's causing the read call to block. Can you attach client debug logs please? @Dhairya Parmar

The client log says this:

client.admin.1377984.log:87629:2024-06-20T20:02:11.447+0530 7f8f62a006c0 10 client.4289 waiting for caps 0x10000000000.head(faked_ino=0 nref=12 ll_ref=727 cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=0,2=1} mode=100644 size=26030/4194304 nlink=1 btime=2024-06-20T19:39:29.065482+0530 mtime=2024-06-20T20:02:10.417351+0530 ctime=2024-06-20T20:02:10.417351+0530 change_attr=690 caps=pAsLsXsFscrb(0=pAsLsXsFscrb) objectset[0x10000000000 ts 0/0 objects 1 dirty_or_tx 0] parents=0x1.head["foo"] 0x7f8f34009aa0) need AsFw want Fb

Actions #7

Updated by Patrick Donnelly over 1 year ago

Dhairya Parmar wrote in #note-6:

Venky Shankar wrote in #note-4:

Dhairya Parmar wrote:

So this came up while quiesce operation demonstration where a quiesce call on the root of fs was initiated while a script was writing to a file in the root which as expected successfully blocks the writes but it starved the subsequent read call too. So to reproduce this(I'll follow the way leonid did it), run a script to keep writing to a file that will be quiseced, in another terminal window run the quiesce op and then try to read the file. It should be as follows:

terminal 0:
[...]

terminal 1:
[...]

if we check the client log, it is waiting for the caps:
[...]

so what happened is when the root is quiesced, client is deprived from doing any writes to the inodes under root, so the writes should stall which is good but the write call has acquired the client lock [0] and is sleeping on it while waiting for caps here [1]

When get_caps() waits for a cap from the MDS, it'll call wait_on_context_list() which would wait on a condvar (on stack) dropping client_lock.

[...]

the l.release() would mean releasing the ownership of the client_lock without unlocking(i.e. retaining its state), no?

Right, because the caller expects the client_lock to be locked. The point of the unique_lock here is to hold ownership of the lock for the call to cond.wait. While waiting in cond.wait, the lock is unlocked and reacquired when the condvar is notified. Because we have the lock after the wait completes (and the predicate is true), we need to release ownership of the unique_lock (so the "ownership" passes back to the caller).

so when trying to read the file, the read call sees that the client lock(which I think is a shared/global lock) is already acquired, it waits for it be released.

I don't think this is what's causing the read call to block. Can you attach client debug logs please? @Dhairya Parmar

client log say this:
[...]

Actions #8

Updated by Dhairya Parmar over 1 year ago

Patrick Donnelly wrote in #note-7:

Dhairya Parmar wrote in #note-6:

Venky Shankar wrote in #note-4:

Dhairya Parmar wrote:

So this came up while quiesce operation demonstration where a quiesce call on the root of fs was initiated while a script was writing to a file in the root which as expected successfully blocks the writes but it starved the subsequent read call too. So to reproduce this(I'll follow the way leonid did it), run a script to keep writing to a file that will be quiseced, in another terminal window run the quiesce op and then try to read the file. It should be as follows:

terminal 0:
[...]

terminal 1:
[...]

if we check the client log, it is waiting for the caps:
[...]

so what happened is when the root is quiesced, client is deprived from doing any writes to the inodes under root, so the writes should stall which is good but the write call has acquired the client lock [0] and is sleeping on it while waiting for caps here [1]

When get_caps() waits for a cap from the MDS, it'll call wait_on_context_list() which would wait on a condvar (on stack) dropping client_lock.

[...]

the l.release() would mean releasing the ownership of the client_lock without unlocking(i.e. retaining its state), no?

Right, because the caller expects the client_lock to be locked. The point of the unique_lock here is to hold ownership of the lock for the call to cond.wait. While waiting in cond.wait, the lock is unlocked and reacquired when the condvar is notified. Because we have the lock after the wait completes (and the predicate is true), we need to release ownership of the unique_lock (so the "ownership" passes back to the caller).

So that means the write is not actually "sleeping" with the client_lock held, but waiting for caps at wait_on_context_list(in->waitfor_caps). Which makes me wonder: why didn't the read op go through? Why did it get stalled? The read call also checks for caps before doing any read. Does that mean that, since the write is waiting for some caps, the read call is queued behind it to get caps (because obviously it can't just skip the line; that would hurt stability and could lead to data corruption), and so until the write op is served its caps, the read op's get_caps is blocked?

Actions #9

Updated by Venky Shankar over 1 year ago

Dhairya Parmar wrote in #note-8:

Patrick Donnelly wrote in #note-7:

Dhairya Parmar wrote in #note-6:

Venky Shankar wrote in #note-4:

Dhairya Parmar wrote:

So this came up while quiesce operation demonstration where a quiesce call on the root of fs was initiated while a script was writing to a file in the root which as expected successfully blocks the writes but it starved the subsequent read call too. So to reproduce this(I'll follow the way leonid did it), run a script to keep writing to a file that will be quiseced, in another terminal window run the quiesce op and then try to read the file. It should be as follows:

terminal 0:
[...]

terminal 1:
[...]

if we check the client log, it is waiting for the caps:
[...]

so what happened is when the root is quiesced, client is deprived from doing any writes to the inodes under root, so the writes should stall which is good but the write call has acquired the client lock [0] and is sleeping on it while waiting for caps here [1]

When get_caps() waits for a cap from the MDS, it'll call wait_on_context_list() which would wait on a condvar (on stack) dropping client_lock.

[...]

the l.release() would mean releasing the ownership of the client_lock without unlocking(i.e. retaining its state), no?

Right, because the caller expects the client_lock to be locked. The point of the unique_lock here is to hold ownership of the lock for the call to cond.wait. While waiting in cond.wait, the lock is unlocked and reacquired when the condvar is notified. Because we have the lock after the wait completes (and the predicate is true), we need to release ownership of the unique_lock (so the "ownership" passes back to the caller).

so that means write is not actually "sleeping" with the clint_lock but waiting for the caps at wait_on_context_list(in->waitfor_caps) which makes me think then why didn't the read op go through? why did it get stalled? read call too is checking for caps before making any read, does that mean that since the write is waiting for some caps, the read call is in the queue to get caps (because obviously it can't just skip the line it would haunt the stability and lead to potential data corruption?) and so until the write op is served with the caps, read op's get_caps is blocked?

If the client has the required "read" caps, the operation would have gone through. You need to check what caps the client has and whether it requested Frc caps from the MDS to satisfy the read.

Actions #10

Updated by Venky Shankar over 1 year ago

Dhairya Parmar wrote in #note-6:

Venky Shankar wrote in #note-4:

Dhairya Parmar wrote:

So this came up while quiesce operation demonstration where a quiesce call on the root of fs was initiated while a script was writing to a file in the root which as expected successfully blocks the writes but it starved the subsequent read call too. So to reproduce this(I'll follow the way leonid did it), run a script to keep writing to a file that will be quiseced, in another terminal window run the quiesce op and then try to read the file. It should be as follows:

terminal 0:
[...]

terminal 1:
[...]

if we check the client log, it is waiting for the caps:
[...]

so what happened is when the root is quiesced, client is deprived from doing any writes to the inodes under root, so the writes should stall which is good but the write call has acquired the client lock [0] and is sleeping on it while waiting for caps here [1]

When get_caps() waits for a cap from the MDS, it'll call wait_on_context_list() which would wait on a condvar (on stack) dropping client_lock.

[...]

the l.release() would mean releasing the ownership of the client_lock without unlocking(i.e. retaining its state), no?

so when trying to read the file, the read call sees that the client lock(which I think is a shared/global lock) is already acquired, it waits for it be released.

I don't think this is what's causing the read call to block. Can you attach client debug logs please? @Dhairya Parmar

client log say this:
[...]

Just by looking at that line, it's hard to say what's going on.

Actions #11

Updated by Dhairya Parmar over 1 year ago

Venky Shankar wrote in #note-9:

Dhairya Parmar wrote in #note-8:

Patrick Donnelly wrote in #note-7:

Dhairya Parmar wrote in #note-6:

Venky Shankar wrote in #note-4:

Dhairya Parmar wrote:

So this came up while quiesce operation demonstration where a quiesce call on the root of fs was initiated while a script was writing to a file in the root which as expected successfully blocks the writes but it starved the subsequent read call too. So to reproduce this(I'll follow the way leonid did it), run a script to keep writing to a file that will be quiseced, in another terminal window run the quiesce op and then try to read the file. It should be as follows:

terminal 0:
[...]

terminal 1:
[...]

if we check the client log, it is waiting for the caps:
[...]

so what happened is when the root is quiesced, client is deprived from doing any writes to the inodes under root, so the writes should stall which is good but the write call has acquired the client lock [0] and is sleeping on it while waiting for caps here [1]

When get_caps() waits for a cap from the MDS, it'll call wait_on_context_list() which would wait on a condvar (on stack) dropping client_lock.

[...]

the l.release() would mean releasing the ownership of the client_lock without unlocking(i.e. retaining its state), no?

Right, because the caller expects the client_lock to be locked. The point of the unique_lock here is to hold ownership of the lock for the call to cond.wait. While waiting in cond.wait, the lock is unlocked and reacquired when the condvar is notified. Because we have the lock after the wait completes (and the predicate is true), we need to release ownership of the unique_lock (so the "ownership" passes back to the caller).

so that means write is not actually "sleeping" with the clint_lock but waiting for the caps at wait_on_context_list(in->waitfor_caps) which makes me think then why didn't the read op go through? why did it get stalled? read call too is checking for caps before making any read, does that mean that since the write is waiting for some caps, the read call is in the queue to get caps (because obviously it can't just skip the line it would haunt the stability and lead to potential data corruption?) and so until the write op is served with the caps, read op's get_caps is blocked?

If the client has the required "read" caps, the operation would have gone through. You need to check what caps the client has and if it requested a Frc cap from the MDS to satisfy the read.

I'll check it, but just looking at the consequence, it looks like the read call is blocked on getting caps, which makes me wonder: are caps allocated in a synchronised manner? I'm thinking that since there's just one client in the picture, the same client is not allowed to get read caps while the write call is waiting for caps, and vice versa?

Actions #12

Updated by Venky Shankar over 1 year ago

Dhairya Parmar wrote in #note-11:

Venky Shankar wrote in #note-9:

Dhairya Parmar wrote in #note-8:

Patrick Donnelly wrote in #note-7:

Dhairya Parmar wrote in #note-6:

Venky Shankar wrote in #note-4:

Dhairya Parmar wrote:

So this came up while quiesce operation demonstration where a quiesce call on the root of fs was initiated while a script was writing to a file in the root which as expected successfully blocks the writes but it starved the subsequent read call too. So to reproduce this(I'll follow the way leonid did it), run a script to keep writing to a file that will be quiseced, in another terminal window run the quiesce op and then try to read the file. It should be as follows:

terminal 0:
[...]

terminal 1:
[...]

if we check the client log, it is waiting for the caps:
[...]

so what happened is when the root is quiesced, client is deprived from doing any writes to the inodes under root, so the writes should stall which is good but the write call has acquired the client lock [0] and is sleeping on it while waiting for caps here [1]

When get_caps() waits for a cap from the MDS, it'll call wait_on_context_list() which would wait on a condvar (on stack) dropping client_lock.

[...]

the l.release() would mean releasing the ownership of the client_lock without unlocking(i.e. retaining its state), no?

Right, because the caller expects the client_lock to be locked. The point of the unique_lock here is to hold ownership of the lock for the call to cond.wait. While waiting in cond.wait, the lock is unlocked and reacquired when the condvar is notified. Because we have the lock after the wait completes (and the predicate is true), we need to release ownership of the unique_lock (so the "ownership" passes back to the caller).

so that means write is not actually "sleeping" with the clint_lock but waiting for the caps at wait_on_context_list(in->waitfor_caps) which makes me think then why didn't the read op go through? why did it get stalled? read call too is checking for caps before making any read, does that mean that since the write is waiting for some caps, the read call is in the queue to get caps (because obviously it can't just skip the line it would haunt the stability and lead to potential data corruption?) and so until the write op is served with the caps, read op's get_caps is blocked?

If the client has the required "read" caps, the operation would have gone through. You need to check what caps the client has and if it requested a Frc cap from the MDS to satisfy the read.

I'll check it but just from looking at the consequence, it looks like the read call is blocked to get the caps which makes me wonder that are caps allocated in synchronised manner? I'm thinking that since it's just one client in picture, the same client is not allowed get read caps if the write call is waiting for caps and vice verse?

The client should be able to get a read cap even if the MDS has disallowed handing out the write cap due to the quiesce in progress.

Actions #14

Updated by Dhairya Parmar over 1 year ago

I'm seeing:

2024-09-06T19:31:14.696+0530 7f2e18a006c0  1 -- [v2:127.0.0.1:6830/4194777954,v1:127.0.0.1:6831/4194777954] >> 127.0.0.1:0/1733762733 conn(0x564bc7c3a480 msgr2=0x564bc7c4db80 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2024-09-06T19:31:14.696+0530 7f2e18a006c0  1 --2- [v2:127.0.0.1:6830/4194777954,v1:127.0.0.1:6831/4194777954] >> 127.0.0.1:0/1733762733 conn(0x564bc7c3a480 0x564bc7c4db80 crc :-1 s=READY pgs=4 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1
Actions #15

Updated by Venky Shankar over 1 year ago

Dhairya Parmar wrote in #note-14:

I'm seeing:

[...]

Are you suspecting a network issue as the reason the client is stuck in the read call?

Actions #16

Updated by Venky Shankar over 1 year ago

@Dhairya Parmar You had pinged me on slack regarding the following

2024-09-06T19:31:14.713+0530 7effb7e006c0 10 client.4280 check_caps on 0x10000000000.head(faked_ino=0 nref=4 ll_ref=14 cap_refs={4=0,1024=0,4096=0,8192=0} open={2=0} mode=100644 size=492/4194304 nlink=1 btime=2024-09-06T19:31:03.000210+0530 mtime=2024-09-06T19:31:14.077860+0530 ctime=2024-09-06T19:31:14.077860+0530 change_attr=12 caps=pAsLsXsFcb(0=pAsLsXsFcb) dirty_caps=Fw objectset[0x10000000000 ts 0/0 objects 1 dirty_or_tx 0] parents=0x1.head["foo"] 0x7eff7c0078c0) wanted - used Fc issued pAsLsXsFcb revoking AxXxFsxrw flags=1

Yeah, so the client has AxXxFsxrw as revoking (in Client::check_caps). The MDS has asked the client to drop those caps and those caps are currently under revocation by the MDS.

Actions #17

Updated by Dhairya Parmar over 1 year ago · Edited

Venky Shankar wrote in #note-16:

@Dhairya Parmar You had pinged me on slack regarding the following

[...]

Yeh, so the client AxXxFsxrw as revoking (in Client::check_caps). The MDS has asked the client to drop those caps and those caps are currently under revocation by the MDS.

Yeah, but I wanted to get a better understanding of what the caps in AxXxFsxrw are.

(from my naive understanding, it looks to be revoking all read/write/execute perms??)

Actions #18

Updated by Dhairya Parmar over 1 year ago · Edited

RCA (so far):

Writing to a file in parallel (cephfs mounted using ceph-fuse):

$ while date | tee -a foo ; do sleep 1 ; done
Thursday 19 September 2024 05:23:37 PM IST
Thursday 19 September 2024 05:23:38 PM IST
Thursday 19 September 2024 05:23:39 PM IST
Thursday 19 September 2024 05:23:40 PM IST
Thursday 19 September 2024 05:23:41 PM IST
Thursday 19 September 2024 05:23:42 PM IST  # quiesce kicks in
Thursday 19 September 2024 05:24:13 PM IST  # quiesce ends
Thursday 19 September 2024 05:24:14 PM IST
Thursday 19 September 2024 05:24:15 PM IST

Quiesce cmd:

$ ./bin/ceph tell mds.0 quiesce db / --timeout 5 --expiration 30

The read only goes through once quiesce ends:

2024-09-19T17:24:12.143+0530 7f4bc16006c0  3 client.4290 ll_read 0x7f4b68024150 0x10000000000  0~4096
2024-09-19T17:24:12.143+0530 7f4bc16006c0 10 client.4290 get_caps 0x10000000000.head(faked_ino=0 nref=8 ll_ref=9 cap_refs={4=0,1024=1,4096=0,8192=1} open={1=1,2=1} mode=100644 size=258/4194304 nlink=1 btime=2024-09-19T17:23:37.608561+0530 mtime=2024-09-19T17:24:12.143596+0530 ctime=2024-09-19T17:24:12.143596+0530 change_attr=6 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw objectset[0x10000000000 ts 0/0 objects 1 dirty_or_tx 43] parents=0x1.head["foo"] 0x7f4b8800a110) have pAsxLsXsxFsxcrwb need Fr want Fc revoking -
2024-09-19T17:24:12.143+0530 7f4bc16006c0 10 client.4290 _read_async 0x10000000000.head(faked_ino=0 nref=8 ll_ref=9 cap_refs={4=0,1024=1,2048=1,4096=0,8192=1} open={1=1,2=1} mode=100644 size=258/4194304 nlink=1 btime=2024-09-19T17:23:37.608561+0530 mtime=2024-09-19T17:24:12.143596+0530 ctime=2024-09-19T17:24:12.143596+0530 change_attr=6 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw objectset[0x10000000000 ts 0/0 objects 1 dirty_or_tx 43] parents=0x1.head["foo"] 0x7f4b8800a110) 0~4096
2024-09-19T17:24:12.143+0530 7f4bc20006c0  3 client.4290 ll_release (fh)0x7f4b5c0264f0 0x10000000000 
2024-09-19T17:24:12.143+0530 7f4bc16006c0 10 client.4290  min_bytes=131072 max_bytes=16777216 max_periods=4
2024-09-19T17:24:12.143+0530 7f4bc16006c0  3 client.4290 ll_read 0x7f4b68024150 0~4096 = 258
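As a quick sanity check on the get_caps line above (have pAsxLsXsxFsxcrwb, need Fr, want Fc): the needed bit is already contained in the held cap string, so once this call actually runs it can proceed immediately, which is consistent with the read completing in the same millisecond it appears. A throwaway checker, assuming the shorthand is just a family letter followed by its bit letters:

```python
def cap_bits(s: str) -> set:
    """Expand 'Fr'-style shorthand into a set of (family, bit) pairs.
    Assumes capital letters (p/A/L/X/F) name the family and lowercase
    letters that follow are that family's bits."""
    fams = set("pALXF")
    bits, fam = set(), None
    for ch in s:
        if ch in fams:
            fam = ch
        else:
            bits.add((fam, ch))
    return bits

have = cap_bits("pAsxLsXsxFsxcrwb")
need = cap_bits("Fr")
print(need <= have)  # True: the read needs nothing the client doesn't hold
```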

When quiesce kicks in, we can see the MDS revoking AxXxFsxrw while the client is dropping Fb:

2024-09-19T17:23:42.108+0530 7f4bce0006c0 10 client.4290 send_cap 0x10000000000.head(faked_ino=0 nref=10 ll_ref=7 cap_refs={4=0,1024=0,4096=0,8192=0} open={2=0} mode=100644 size=215/4194304 nlink=1 btime=2024-09-19T17:23:37.608561+0530 mtime=2024-09-19T17:23:41.705906+0530 ctime=2024-09-19T17:23:41.705906+0530 change_attr=5 caps=pAsLsXsFcb(0=pAsLsXsFcb) flushing_caps=Fw objectset[0x10000000000 ts 0/0 objects 1 dirty_or_tx 0] parents=0x1.head["foo"] 0x7f4b8800a110) mds.0 seq 2 used Fc want - flush Fw retain pAsLsXsFc held pAsxLsXsxFsxcrwb revoking AxXxFsxrw dropping Fb

From mds.c logs, quiesce db kicks in at:

2024-09-19T17:23:42.106+0530 7f037be006c0  1 -- [v2:127.0.0.1:6830/1938766956,v1:127.0.0.1:6831/1938766956] <== client.4310 127.0.0.1:0/547372204 1 ==== command(tid 0: {"prefix": "quiesce db", "roots": ["/"], "timeout": 5.0, "expiration": 30.0}) ==== 100+0+0 (crc 0 0 0) 0x564ef830e1a0 con 0x564ef822b600
2024-09-19T17:23:42.107+0530 7f037fa006c0  1 mds.c asok_command: quiesce db {expiration=30,prefix=quiesce db,roots=[/],timeout=5} (starting...)
2024-09-19T17:23:42.107+0530 7f037fa006c0 10 mds.0.4 handle_asok_command: quiesce db

This is the last log line containing the allowed caps before quiesce ends:
2024-09-19T17:23:42.714+0530 7f037be006c0  7 mds.0.locker get_allowed_caps loner client.4290 allowed=pAsLsXsFscrb, xlocker allowed=pAsLsXsFscrb, others allowed=pAsLsXs on [inode 0x10000000000 [2,head] /foo auth v12 ap=2 DIRTYPARENT s=215 n(v0 rc2024-09-19T17:23:41.705906+0530 b215 1=1+0) (ifile excl) (iquiesce lock x=1 by request(mds.0:3 nref=3)) caps={4290=pAsLsXsFscr/pAsxXsxFxwb@3},l=4290 | ptrwaiter=0 request=2 lock=3 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x564ef81e0000]

Client log after seven seconds (I suspect this is the time when I called read):

2024-09-19T17:23:49.315+0530 7f4bc20006c0 10 client.4290  cap mds.0 issued pAsLsXsFscrb implemented pAsLsXsFscrb revoking -

After ^^, the client logs are dead until quiesce ends, and the same goes for the MDS logs; all I see is ticks.

Greg and I discussed that, since the read call is seen only at the time quiesce ends, this has to be something happening at an upper layer (libfuse) and not in cephfs.

Actions #19

Updated by Venky Shankar over 1 year ago

@Dhairya Parmar - could you please update here with the way forward with what we discussed on Friday?

Actions #20

Updated by Dhairya Parmar over 1 year ago

Venky Shankar wrote in #note-19:

@Dhairya Parmar - could you please update here with the way forward with what we discussed on Friday?

writes from a bash terminal:

$ while date | tee -a foo ; do sleep 1 ; done
Friday 20 September 2024 12:55:35 AM IST
Friday 20 September 2024 12:55:36 AM IST
Friday 20 September 2024 12:55:37 AM IST
Friday 20 September 2024 12:55:39 AM IST
Friday 20 September 2024 12:55:40 AM IST # quiesce kicks in (30 secs)
Friday 20 September 2024 12:56:10 AM IST # quiesce ends
Friday 20 September 2024 12:56:11 AM IST 
Friday 20 September 2024 12:56:12 AM IST

read req from another bash terminal:

[dparmar 20240920-00:55:46]$ cat foo

the important time frames are:
1) when quiesce kicks in: 2024-09-20T00:55:39.674+0530
2) when read is requested (CLI): 2024-09-20T00:55:46
3) when read is actually processed: 2024-09-20T00:56:09.736+0530

so it looks like this:

Friday 20 September 2024 12:55:35 AM IST
Friday 20 September 2024 12:55:36 AM IST
Friday 20 September 2024 12:55:37 AM IST
Friday 20 September 2024 12:55:39 AM IST
Friday 20 September 2024 12:55:40 AM IST  # quiesce starts (30 secs)
Friday 20 September 2024 12:55:46 AM IST  # read requested on CLI

# actual read processed
Friday 20 September 2024 12:56:09.736 AM IST ll_read 0x7fba58028380 0x10000000000  0~4096
Friday 20 September 2024 12:56:09.736 AM IST ll_read 0x7fba58028380 0~4096 = 205

Friday 20 September 2024 12:56:10 AM IST  # quiesce ends
Friday 20 September 2024 12:56:11 AM IST
Friday 20 September 2024 12:56:12 AM IST

So what actually happened during this window? Narrowing our search to 00:55:46 through 00:56:09...

from client log, we see:

2024-09-20T00:55:48.668+0530 7fbac0c006c0 10 client.4285  cap mds.0 issued pAsLsXsFscrb implemented pAsLsXsFscrb revoking -

So what happened on the MDS's side? Let's search for pAsLsXsFscrb:

2024-09-20T00:55:40.019+0530 7f31cb2006c0  7 mds.0.locker get_allowed_caps loner client.4285 allowed=pAsLsXsFscrb, xlocker allowed=pAsLsXsFscrb, others allowed=pAsLsXs on [inode 0x10000000000 [2,head] /foo auth v10 ap=2 DIRTYPARENT s=164 n(v0 rc2024-09-20T00:55:39.006954+0530 b164 1=1+0) (ifile excl) (iquiesce lock x=1 by request(mds.0:3 nref=3)) caps={4285=pAsLsXsFscr/pAsxXsxFxwb@3},l=4285 | ptrwaiter=0 request=2 lock=3 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x555c4b5ca000]
2024-09-20T00:55:40.019+0530 7f31cb2006c0 20 mds.0.locker  client.4285 pending pAsLsXsFscr allowed pAsLsXsFscrb wanted pAsxXsxFxwb

i.e. as soon as quiesce started, the MDS allowed pAsLsXsFscrb, and there is no trace of the Fb cap actually being dropped/revoked; the client just says it is dropping it. It looks like the looping write call raced with quiesce, got blocked on the buffer flush (and the Fb cap), and is holding the client lock, which would explain why the read call is seen only when quiesce ends.

Just FYI:
- The same test case with a multi-client setup works fine.
- The case where the write is stopped -> quiesce initiated -> read call works.
- If a read call is initiated in parallel, i.e. ./bin/ceph tell mds.0 quiesce db / --timeout 5 --expiration 30 ; cat ./cephfs1/foo, it works as well, though it must be noted that a later read call stalls (an indication that the first read arrived before quiesce muted the racy write call?).

This means the only failing case is a loner client with continuous writes.
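The hypothesis above (a writer parked waiting for caps while holding the client lock, with the reader stuck behind it) can be sketched with a toy model. This is only the shape of the suspected starvation; whether Client.cc really keeps the lock held across the wait is exactly what needs verifying, and all names here (client_lock, caps_granted) are illustrative, not the real symbols:

```python
# Toy model of the suspected starvation: if the writer blocks while still
# holding the shared client lock, a later reader cannot even start until
# the writer is woken. Names are illustrative, not actual Client.cc symbols.
import threading
import time

client_lock = threading.Lock()
caps_granted = threading.Event()   # set when "quiesce ends"
timeline = []

def writer():
    with client_lock:              # writer enters "get_caps" holding the lock
        timeline.append("write: waiting for caps")
        caps_granted.wait()        # blocks WITHOUT releasing client_lock
        timeline.append("write: got caps")

def reader():
    time.sleep(0.1)                # read arrives while the writer is parked
    with client_lock:              # starved until the writer drops the lock
        timeline.append("read: done")

w = threading.Thread(target=writer)
r = threading.Thread(target=reader)
w.start(); r.start()
time.sleep(0.3)
caps_granted.set()                 # "quiesce ends": writer wakes, lock freed
w.join(); r.join()
print(timeline)                    # the read only completes after the write
```

If the real code path instead released the lock while waiting (as a condition-variable wait normally does), the reader would complete during the quiesce window rather than after it, so this toy model is one way to frame what to look for in the Client.cc wait path.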

Actions #21

Updated by Dhairya Parmar over 1 year ago

The backtrace of the thread performing the write at the time quiesce is called:

Thread ID: 26
#0  0x00007f144fca5919 in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1  0x00007f144fca82e9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2  0x00007f14510ab7a6 in ceph::condition_variable_debug::wait (this=this@entry=0x7f14255fdb00, lock=...) at /home/dparmar/CephRepoForRunningTests/ceph/src/common/condition_variable_debug.cc:28
#3  0x000055a81249b976 in ceph::condition_variable_debug::wait<Client::wait_on_context_list(std::vector<Context*>&)::<lambda()> > (this=0x7f14255fdb00, lock=..., pred=...) at /home/dparmar/CephRepoForRunningTests/ceph/src/common/condition_variable_debug.h:34
#4  Client::wait_on_context_list (this=this@entry=0x55a82d29a790, ls=std::vector of length 1, capacity 1 = {...}) at /home/dparmar/CephRepoForRunningTests/ceph/src/client/Client.cc:4276
#5  0x000055a8124e493f in Client::get_caps (this=this@entry=0x55a82d29a790, fh=fh@entry=0x7f13c8025530, need=need@entry=4100, want=8192, phave=phave@entry=0x7f14255fdd04, endoff=endoff@entry=968) at /home/dparmar/CephRepoForRunningTests/ceph/src/client/Client.cc:3761
#6  0x000055a8124eaddf in Client::_write (this=this@entry=0x55a82d29a790, f=f@entry=0x7f13c8025530, offset=offset@entry=924, size=size@entry=44, buf=buf@entry=0x7f13c8002c50 "Wednesday 25 September 2024 08:50:56 PM IST\n", iov=iov@entry=0x0, iovcnt=0, onfinish=0x0, do_fsync=false, syncdataonly=false) at /home/dparmar/CephRepoForRunningTests/ceph/src/client/Client.cc:11576
#7  0x000055a8124ecebb in Client::ll_write (this=0x55a82d29a790, fh=0x7f13c8025530, off=off@entry=924, len=len@entry=44, data=data@entry=0x7f13c8002c50 "Wednesday 25 September 2024 08:50:56 PM IST\n") at /home/dparmar/CephRepoForRunningTests/ceph/src/client/Client.cc:16001
#8  0x000055a81247f62d in fuse_ll_write (req=0x7f13c0003b90, ino=<optimized out>, buf=0x7f13c8002c50 "Wednesday 25 September 2024 08:50:56 PM IST\n", size=44, off=924, fi=0x7f14255fdef0) at /home/dparmar/CephRepoForRunningTests/ceph/src/client/fuse_ll.cc:906
#9  0x00007f1451a85701 in do_write (req=<optimized out>, nodeid=<optimized out>, inarg=<optimized out>) at /usr/src/debug/fuse-2.9.9-21.fc40.x86_64/lib/fuse_lowlevel.c:1258
#10 0x00007f1451a91c49 in fuse_ll_process_buf (data=0x55a82d2c7680, buf=0x7f14255fe0b0, ch=<optimized out>) at /usr/src/debug/fuse-2.9.9-21.fc40.x86_64/lib/fuse_lowlevel.c:2443
#11 0x00007f1451a81ef3 in fuse_do_work (data=0x7f13c8002bc0) at /usr/src/debug/fuse-2.9.9-21.fc40.x86_64/lib/fuse_loop_mt.c:117
#12 0x00007f144fca91b7 in start_thread () from /lib64/libc.so.6
#13 0x00007f144fd2b3cc in clone3 () from /lib64/libc.so.6

Actions #22

Updated by Patrick Donnelly 9 months ago

  • Target version deleted (v20.0.0)