mds/Server: mark a cap acquisition throttle event in the request#52676
I need to find a way to run the test. For now I can't run it locally, and I don't have a VPN connection to the sepia environment to run tests remotely. This will have to wait until I can verify that the test passes, and fix it if it doesn't.
batrick left a comment
Please make sure to update the ticket to "Fix under review" and the PR field with 52676. You should have permissions now after I edited them when you brought it up in standup.
jenkins test make check
Integration branch: https://shaman.ceph.com/builds/ceph/wip-vshankar-testing-20230808.093601/ The test run takes ~3 hrs.

@vshankar is there anything else needed here?
At times, tests fail due to infrastructure issues with sepia and have to be rerun. A run verification is then done and recorded in the run wiki. If nothing stands out related to the change, the PR is merged. The fs suite (~200 tests, at times many more) takes around a day to finish running (the tests are scheduled with a certain priority), and then it takes around half a day to verify the results. More infra issues mean more wait time, as we have experienced in the past.
    std::cout << "error while parsing dump_historic_ops: " << e.what() << std::endl;
  }

  ASSERT_TRUE(seen_cap_throttle_in_recent_op_events);
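The assertion above checks whether the throttle event was found while scanning the dump_historic_ops output. As a standalone illustration of that scan, here is a minimal Python sketch; the JSON shape and the event name "cap_acquisition_throttle" are assumptions for illustration, not the exact MDS output schema:

```python
import json

# Hypothetical, trimmed dump_historic_ops output; the real dump carries
# more fields, but each op records a list of timeline events.
dump = json.loads("""
{"ops": [
  {"description": "client_request(...)",
   "type_data": {"events": [
     {"event": "throttled"},
     {"event": "cap_acquisition_throttle"}
   ]}}
]}
""")

# Scan every event of every op for the throttle marker, mirroring what
# the C++ test does before asserting seen_cap_throttle_in_recent_op_events.
seen = any(ev.get("event") == "cap_acquisition_throttle"
           for op in dump["ops"]
           for ev in op.get("type_data", {}).get("events", []))
print(seen)  # → True
```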
@leonid-s-usov PTAL test failures here - https://pulpito.ceph.com/vshankar-2023-08-09_05:51:38-fs-wip-vshankar-testing-20230808.093601-testing-default-smithi/7364204/
2023-08-09T06:50:26.512 INFO:tasks.workunit.client.0.smithi002.stdout:/build/ceph-18.0.0-5396-ge318c197/src/test/libcephfs/snapdiff.cc:655: Failure
2023-08-09T06:50:26.513 INFO:tasks.workunit.client.0.smithi002.stdout:Value of: seen_cap_throttle_in_recent_op_events
2023-08-09T06:50:26.514 INFO:tasks.workunit.client.0.smithi002.stdout: Actual: false
2023-08-09T06:50:26.514 INFO:tasks.workunit.client.0.smithi002.stdout:Expected: true
The throttle didn't get hit and caused the assert. I haven't looked at the MDS logs to infer why. In case you don't have access to sepia yet, the (MDS) logs can be accessed via: http://qa-proxy.ceph.com/teuthology/vshankar-2023-08-09_05:51:38-fs-wip-vshankar-testing-20230808.093601-testing-default-smithi/7364204/remote/
I see evidence of the throttle being activated; look at the timestamps:
2023-08-09T06:50:24.477 INFO:tasks.workunit.client.0.smithi002.stdout:---------snap2 vs. snap1 diff listing verification for /dirC
2023-08-09T06:50:25.485 INFO:tasks.workunit.client.0.smithi002.stdout:---------snap1 vs. snap2 diff listing verification for /dirD
Is this test running against a shared ceph cluster? Could it be that the historic ops list has overflowed by the time we issue the dump_historic_ops command?
Found the MDS logs; we can see the evidence there too:
2023-08-09T06:50:24.477+0000 7fe25c929700 20 mds.0.server snapdiff throttled. max_caps_per_client: 1 num_caps: 11 session_cap_acquistion: 1.98895 cap_acquisition_throttle: 1
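Reading that log line, the throttle fires when both the client's cap count and its cap-acquisition rate exceed the configured limits. A minimal sketch of that decision, using the field names printed in the log (the real MDS logic may combine these values differently):

```python
def snapdiff_throttled(num_caps: int, max_caps_per_client: int,
                       session_cap_acquisition: float,
                       cap_acquisition_throttle: float) -> bool:
    """Hypothetical reconstruction of the throttle check, based only on
    the values the MDS prints in the quoted debug line."""
    return (num_caps > max_caps_per_client and
            session_cap_acquisition > cap_acquisition_throttle)

# Values from the quoted log line: 11 caps vs. a limit of 1, and an
# acquisition rate of 1.98895 vs. a threshold of 1 — so it throttles.
print(snapdiff_throttled(11, 1, 1.98895, 1.0))  # → True
```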
Is this test running against a shared ceph cluster? Could it be that the historic ops are overflown by the time we're issuing the dump_historic_ops command?
# Max number of completed ops to track
- name: mds_op_history_size
type: uint
level: advanced
desc: maximum size for list of historical operations
default: 20
services:
- mds
with_legacy: true
You are probably correct.
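The bounded history explains how the event can be missing by the time dump_historic_ops is queried: with mds_op_history_size at its default of 20, every newly completed op evicts the oldest entry. A small simulation of that eviction behavior (not MDS code, just the bounded-list idea):

```python
from collections import deque

# mds_op_history_size defaults to 20: only the 20 most recently
# completed ops are retained for dump_historic_ops.
history = deque(maxlen=20)

history.append("op-with-cap-throttle-event")   # the op we want to observe
for i in range(20):                            # 20 later ops on a busy cluster
    history.append(f"op-{i}")

# The interesting op has already been evicted before we dump the history.
print("op-with-cap-throttle-event" in history)  # → False
```

This is why, on a shared or busy cluster, the test can miss the throttle event even though the MDS log shows it fired.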
@vshankar I've added some debug info. Could you please help me restart the testing?
Fixes: https://tracker.ceph.com/issues/59067
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
* refs/pull/52676/head: mds/Server: mark a cap acquisition throttle event in the request
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Kotresh Hiremath Ravishankar <khiremat@redhat.com>
(will review when the run finishes)
@vshankar do I understand correctly that the problem we saw once hasn't reproduced since? If that's the case, is there anything else that needs to be done here? UPD: sorry, I hadn't noticed that it was merged already. Thanks!
Fixes: https://tracker.ceph.com/issues/59067
Also, extend the client limits test to verify that the event has been recorded in the recent ops.