
mds/Server: mark a cap acquisition throttle event in the request #52676

Merged
vshankar merged 1 commit into ceph:main from leonid-s-usov:cap_throttle_event on Aug 24, 2023

Conversation

@leonid-s-usov
Contributor

Fixes: https://tracker.ceph.com/issues/59067

Also, add to the client limits test to verify that the event has been recorded on the recent ops


@github-actions github-actions bot added cephfs Ceph File System tests labels Jul 27, 2023
@leonid-s-usov leonid-s-usov changed the title [mds] Server: mark a cap acquisition throttle event in the request mds/Server: mark a cap acquisition throttle event in the request Jul 27, 2023
@vshankar vshankar requested a review from a team July 28, 2023 10:19
Contributor

@vshankar vshankar left a comment


Otherwise LGTM

@leonid-s-usov
Contributor Author

I need to find a way to run the test. For now, I can't run it locally, and I don't have a VPN connection to the sepia environment to run tests remotely. This will have to wait until I verify that the test passes and fix it if it doesn't.

Member

@batrick batrick left a comment


Please make sure to update the ticket to "Fix under review" and set its PR field to 52676. You should have the permissions now; I edited them after you brought it up in standup.

@leonid-s-usov leonid-s-usov force-pushed the cap_throttle_event branch 2 times, most recently from 78aeed3 to 2cd8eda on July 29, 2023 10:24
@leonid-s-usov
Contributor Author

jenkins test make check

@leonid-s-usov leonid-s-usov force-pushed the cap_throttle_event branch 3 times, most recently from 74fb5d5 to 426e2cd on July 31, 2023 19:47
@leonid-s-usov leonid-s-usov force-pushed the cap_throttle_event branch 6 times, most recently from 46db51b to b67820c on August 2, 2023 12:03
@leonid-s-usov leonid-s-usov requested a review from vshankar August 2, 2023 12:06
@vshankar
Contributor

vshankar commented Aug 8, 2023

Integration branch - https://shaman.ceph.com/builds/ceph/wip-vshankar-testing-20230808.093601/

Test runs in ~3 hrs.

@leonid-s-usov
Contributor Author

@vshankar is there anything else needed here?

@vshankar
Contributor

vshankar commented Aug 9, 2023

> @vshankar is there anything else needed here?

At times, tests fail due to infrastructure issues with sepia. These tests have to be rerun. Then a run verification is done and recorded in the run wiki. If nothing stands out related to the change, the PR is merged. The fs suite tests (~200 tests, sometimes many more) take around a day to finish running (the tests are scheduled with a certain priority), and then it takes around half a day to verify the results. More infra issues result in more wait time, which we have experienced in the past.

```cpp
std::cout << "error while parsing dump_historic_ops: " << e.what() << std::endl;
}

ASSERT_TRUE(seen_cap_throttle_in_recent_op_events);
```
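The assertion above checks the output of `dump_historic_ops` for the new throttle event. A minimal sketch of that check in Python, against a hypothetical sample payload (the field names and JSON layout here are assumptions for illustration, not the real dump schema):

```python
import json

# Hypothetical sample of `ceph daemon mds.<id> dump_historic_ops` output;
# the real layout may differ.
sample = json.dumps({
    "ops": [
        {
            "description": "client_request(... readdir ...)",
            "type_data": {
                "events": [
                    {"time": "2023-08-09T06:50:24.477",
                     "event": "cap_acquisition_throttle"},
                    {"time": "2023-08-09T06:50:25.485", "event": "done"},
                ]
            },
        }
    ]
})

def saw_throttle_event(dump_json: str) -> bool:
    """Return True if any recent op recorded a cap acquisition throttle event."""
    ops = json.loads(dump_json).get("ops", [])
    return any(
        ev.get("event") == "cap_acquisition_throttle"
        for op in ops
        for ev in op.get("type_data", {}).get("events", [])
    )

print(saw_throttle_event(sample))  # True
```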
Contributor


@leonid-s-usov PTAL test failures here - https://pulpito.ceph.com/vshankar-2023-08-09_05:51:38-fs-wip-vshankar-testing-20230808.093601-testing-default-smithi/7364204/

```
2023-08-09T06:50:26.512 INFO:tasks.workunit.client.0.smithi002.stdout:/build/ceph-18.0.0-5396-ge318c197/src/test/libcephfs/snapdiff.cc:655: Failure
2023-08-09T06:50:26.513 INFO:tasks.workunit.client.0.smithi002.stdout:Value of: seen_cap_throttle_in_recent_op_events
2023-08-09T06:50:26.514 INFO:tasks.workunit.client.0.smithi002.stdout:  Actual: false
2023-08-09T06:50:26.514 INFO:tasks.workunit.client.0.smithi002.stdout:Expected: true
```

The throttle didn't get hit and caused the assert. I haven't looked at the MDS logs to infer why. In case you don't have access to sepia yet, the (MDS) logs can be accessed via: http://qa-proxy.ceph.com/teuthology/vshankar-2023-08-09_05:51:38-fs-wip-vshankar-testing-20230808.093601-testing-default-smithi/7364204/remote/

Contributor Author

@leonid-s-usov leonid-s-usov Aug 14, 2023


I see evidence of the throttle being activated, look at the timestamps:

```
2023-08-09T06:50:24.477 INFO:tasks.workunit.client.0.smithi002.stdout:---------snap2 vs. snap1 diff listing verification for /dirC
2023-08-09T06:50:25.485 INFO:tasks.workunit.client.0.smithi002.stdout:---------snap1 vs. snap2 diff listing verification for /dirD
```

Is this test running against a shared ceph cluster? Could it be that the historic ops have overflowed by the time we issue the dump_historic_ops command?

Contributor Author


Found the MDS logs; the evidence is visible there too:

```
2023-08-09T06:50:24.477+0000 7fe25c929700 20 mds.0.server snapdiff throttled. max_caps_per_client: 1 num_caps: 11 session_cap_acquistion: 1.98895 cap_acquisition_throttle: 1
```
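The log line reports the inputs to the throttle decision. A simplified paraphrase of that condition (inferred from the logged fields, not the exact MDS code):

```python
def should_throttle(num_caps: int,
                    session_cap_acquisition: float,
                    max_caps_per_client: int = 1,
                    cap_acquisition_throttle: float = 1.0) -> bool:
    # Throttle when the session already holds more caps than allowed AND
    # its recent cap-acquisition rate is at or above the throttle threshold.
    return (num_caps > max_caps_per_client
            and session_cap_acquisition >= cap_acquisition_throttle)

# Values taken from the quoted log line:
print(should_throttle(num_caps=11, session_cap_acquisition=1.98895))  # True
```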

Contributor


> Is this test running against a shared ceph cluster? Could it be that the historic ops are overflown by the time we're issuing the dump_historic_ops command?

```yaml
# Max number of completed ops to track
- name: mds_op_history_size
  type: uint
  level: advanced
  desc: maximum size for list of historical operations
  default: 20
  services:
  - mds
  with_legacy: true
```

You are probably correct.
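With `mds_op_history_size` defaulting to 20, the historic-ops list behaves like a bounded buffer that evicts the oldest completed ops first, so an op that completes early can be gone before the test dumps the history. A small sketch of that effect (the op records are hypothetical):

```python
from collections import deque

# The historic-ops list keeps at most mds_op_history_size entries
# (default 20), evicting the oldest completed ops first.
history = deque(maxlen=20)

# The throttled op completes early in the run...
history.append({"op": "readdir", "events": ["cap_acquisition_throttle"]})

# ...and 25 later ops push it out before the test dumps the history.
for i in range(25):
    history.append({"op": f"getattr-{i}", "events": []})

seen = any("cap_acquisition_throttle" in op["events"] for op in history)
print(seen)  # False: the event was already evicted
```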

@leonid-s-usov
Contributor Author

@vshankar I've added some debug info. Could you please help me restart the testing?
Also, has this test passed in other instances?

vshankar added a commit to vshankar/ceph that referenced this pull request Aug 22, 2023
* refs/pull/52676/head:
	mds/Server: mark a cap acquisition throttle event in the request

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Kotresh Hiremath Ravishankar <khiremat@redhat.com>
@vshankar vshankar left a comment


@vshankar vshankar merged commit 726e5d7 into ceph:main Aug 24, 2023
@leonid-s-usov
Contributor Author

leonid-s-usov commented Aug 24, 2023

https://tracker.ceph.com/projects/cephfs/wiki/Main#24-Aug-2023

@vshankar do I understand correctly that the problem we saw once hasn't reproduced since? If that's the case, is there anything else that needs to be done here?

UPD: sorry, I hadn't noticed that it was merged already. Thanks!

