
global, osd/crimson: improve handling of the crimson-osd instance duplication #35337

Merged
tchaikov merged 3 commits into ceph:master from rzarzynski:wip-crimson-pidfile-err-handling
Jun 3, 2020

@rzarzynski
Contributor

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard backend
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@rzarzynski rzarzynski requested a review from a team as a code owner June 1, 2020 11:51
@rzarzynski
Contributor Author

@rzarzynski rzarzynski requested review from ideepika and tchaikov June 1, 2020 11:52
  local_conf().parse_config_files(conf_file_list).get();
  local_conf().parse_argv(ceph_args).get();
- pidfile_write(local_conf()->pid_file);
+ if (auto ret = pidfile_write(local_conf()->pid_file);
Contributor Author

Should be const auto.

@rzarzynski rzarzynski force-pushed the wip-crimson-pidfile-err-handling branch from c1decc1 to bdb1fc5 Compare June 1, 2020 12:07
@tchaikov
Contributor

tchaikov commented Jun 1, 2020

jenkins test crimson perf

@tchaikov
Contributor

tchaikov commented Jun 1, 2020

write: 1 20855212~5213803
discard: 22158662~2606901
write: 2 41710424~5213803
discard: 43013874~2606901
write: 3 62565636~5213803
discard: 63869086~2606901
write: 4 83420848~5213803
discard: 84724298~2606901
write: 5 104276061~5213803
discard: 105579511~2606901
migration_commit
after commit snap: user0, block 71303168~4194304 differs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_Migration.cc:153: Failure
Value of: src_bl.contents_equal(dst_bl)
  Actual: false
Expected: true
after commit snap: user0, block 75497472~4194304 differs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_Migration.cc:153: Failure
Value of: src_bl.contents_equal(dst_bl)
  Actual: false
Expected: true
after commit snap: user0, block 79691776~4194304 differs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_Migration.cc:153: Failure
Value of: src_bl.contents_equal(dst_bl)
  Actual: false
Expected: true
after commit snap: user1, block 71303168~4194304 differs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_Migration.cc:153: Failure
Value of: src_bl.contents_equal(dst_bl)
  Actual: false
Expected: true
after commit snap: user1, block 79691776~4194304 differs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_Migration.cc:153: Failure
Value of: src_bl.contents_equal(dst_bl)
  Actual: false
Expected: true
after commit snap: user2, block 71303168~4194304 differs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_Migration.cc:153: Failure
Value of: src_bl.contents_equal(dst_bl)
  Actual: false
Expected: true
after commit snap: user2, block 79691776~4194304 differs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_Migration.cc:153: Failure
Value of: src_bl.contents_equal(dst_bl)
  Actual: false
Expected: true
after commit snap: user3, block 71303168~4194304 differs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_Migration.cc:153: Failure
Value of: src_bl.contents_equal(dst_bl)
  Actual: false
Expected: true
after commit snap: user3, block 79691776~4194304 differs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_Migration.cc:153: Failure
Value of: src_bl.contents_equal(dst_bl)
  Actual: false
Expected: true
after commit snap: null, block 71303168~4194304 differs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_Migration.cc:153: Failure
Value of: src_bl.contents_equal(dst_bl)
  Actual: false
Expected: true
after commit snap: null, block 79691776~4194304 differs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_Migration.cc:153: Failure
Value of: src_bl.contents_equal(dst_bl)
  Actual: false
Expected: true
[  FAILED  ] TestMigration.StressLive (6345 ms)

tracked by https://tracker.ceph.com/issues/45694

@tchaikov
Contributor

tchaikov commented Jun 1, 2020

jenkins test make check

@tchaikov
Contributor

tchaikov commented Jun 1, 2020

15:53:43 - DEBUG    - cbt      - Nodes : incerta04.front.sepia.ceph.com
15:53:43 - WARNING  - cbt      - prefill/incerta04.front.sepia.ceph.com/0: bandwidth: (or (greater) (near 0.05)):: 24.5147/29.7948  => rejected
15:53:43 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/0: cpu_cycles_per_op: (or (less) (near 0.05)):: 0.0/0.0  => accepted
15:53:43 - WARNING  - cbt      - prefill/incerta04.front.sepia.ceph.com/0: iops_avg: (or (greater) (near 0.05)):: 6275.0/7627.0  => rejected
15:53:43 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/0: iops_stddev: (or (less) (near 2.00)):: 286.172/416.184  => accepted
15:53:43 - WARNING  - cbt      - prefill/incerta04.front.sepia.ceph.com/0: latency_avg: (or (less) (near 0.05)):: 0.00252878/0.00208343  => rejected
15:53:43 - WARNING  - cbt      - prefill/incerta04.front.sepia.ceph.com/1: bandwidth: (or (greater) (near 0.05)):: 24.3361/30.7752  => rejected
15:53:43 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/1: cpu_cycles_per_op: (or (less) (near 0.05)):: 0.0/0.0  => accepted
15:53:43 - WARNING  - cbt      - prefill/incerta04.front.sepia.ceph.com/1: iops_avg: (or (greater) (near 0.05)):: 6230.0/7878.0  => rejected
15:53:43 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/1: iops_stddev: (or (less) (near 2.00)):: 257.979/400.17  => accepted
15:53:43 - WARNING  - cbt      - prefill/incerta04.front.sepia.ceph.com/1: latency_avg: (or (less) (near 0.05)):: 0.00255697/0.0020186  => rejected
15:53:43 - WARNING  - cbt      - rand/incerta04.front.sepia.ceph.com/0: bandwidth: (or (greater) (near 0.05)):: 167.518/251.517  => rejected
15:53:43 - INFO     - cbt      - rand/incerta04.front.sepia.ceph.com/0: cpu_cycles_per_op: (or (less) (near 0.05)):: 0.0/0.0  => accepted
15:53:43 - WARNING  - cbt      - rand/incerta04.front.sepia.ceph.com/0: iops_avg: (or (greater) (near 0.05)):: 42884.0/64388.0  => rejected
15:53:43 - INFO     - cbt      - rand/incerta04.front.sepia.ceph.com/0: iops_stddev: (or (less) (near 2.00)):: 2038.53/2919.26  => accepted
15:53:43 - WARNING  - cbt      - rand/incerta04.front.sepia.ceph.com/0: latency_avg: (or (less) (near 0.05)):: 0.000369593/0.000245246  => rejected
15:53:43 - WARNING  - cbt      - rand/incerta04.front.sepia.ceph.com/1: bandwidth: (or (greater) (near 0.05)):: 167.122/253.748  => rejected
15:53:43 - INFO     - cbt      - rand/incerta04.front.sepia.ceph.com/1: cpu_cycles_per_op: (or (less) (near 0.05)):: 0.0/0.0  => accepted
15:53:43 - WARNING  - cbt      - rand/incerta04.front.sepia.ceph.com/1: iops_avg: (or (greater) (near 0.05)):: 42783.0/64959.0  => rejected
15:53:43 - INFO     - cbt      - rand/incerta04.front.sepia.ceph.com/1: iops_stddev: (or (less) (near 2.00)):: 2414.23/2916.38  => accepted
15:53:43 - WARNING  - cbt      - rand/incerta04.front.sepia.ceph.com/1: latency_avg: (or (less) (near 0.05)):: 0.000370556/0.000243136  => rejected
15:53:43 - WARNING  - cbt      - 12 tests failed out of 20

this test failed, but it proves ceph/ceph-build#1576.
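
For readers of the CBT output above: each line appears to compare a measured value against a baseline using a rule such as (or (greater) (near 0.05)). The following is a minimal sketch of one reading of that rule that agrees with the accepted/rejected verdicts shown in the logs of this conversation. It is not taken from CBT's source. The names accepted and Direction, the assumption that the first number is the measurement and the second the baseline, and the choice to scale the tolerance by the baseline are all assumptions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical names; this only illustrates one reading of the log format.
enum class Direction { Greater, Less };

// Assumed semantics of "(or (greater) (near T))" and "(or (less) (near T))":
// accept when the measured value beats the baseline in the wanted direction,
// or when it differs from the baseline by at most T times the baseline.
bool accepted(double measured, double baseline, Direction dir, double tolerance) {
  assert(tolerance >= 0.0);
  const bool better = (dir == Direction::Greater) ? measured > baseline
                                                  : measured < baseline;
  const bool close_enough =
      std::fabs(measured - baseline) <= tolerance * std::fabs(baseline);
  return better || close_enough;
}
```

Under this reading, the bandwidth pair 24.5147/29.7948 is about 18% below its baseline and is rejected, while a pair such as 0.0/0.0 for cpu_cycles_per_op is trivially within tolerance.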

@tchaikov
Contributor

tchaikov commented Jun 2, 2020

jenkins test crimson perf

@rzarzynski
Contributor Author

09:25:10 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/0: bandwidth: (or (greater) (near 0.05)):: 29.8286/30.6839  => accepted
09:25:10 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/0: cpu_cycles_per_op: (or (less) (near 0.05)):: 0.0/0.0  => accepted
09:25:10 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/0: iops_avg: (or (greater) (near 0.05)):: 7636.0/7855.0  => accepted
09:25:10 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/0: iops_stddev: (or (less) (near 2.00)):: 142.042/280.541  => accepted
09:25:10 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/0: latency_avg: (or (less) (near 0.05)):: 0.00208058/0.00201177  => accepted
09:25:10 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/1: bandwidth: (or (greater) (near 0.05)):: 29.3752/29.8126  => accepted
09:25:10 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/1: cpu_cycles_per_op: (or (less) (near 0.05)):: 0.0/0.0  => accepted
09:25:10 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/1: iops_avg: (or (greater) (near 0.05)):: 7520.0/7632.0  => accepted
09:25:10 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/1: iops_stddev: (or (less) (near 2.00)):: 97.3721/286.658  => accepted
09:25:10 - INFO     - cbt      - prefill/incerta04.front.sepia.ceph.com/1: latency_avg: (or (less) (near 0.05)):: 0.00212226/0.00207735  => accepted
09:25:10 - INFO     - cbt      - rand/incerta04.front.sepia.ceph.com/0: bandwidth: (or (greater) (near 0.05)):: 240.402/252.817  => accepted
09:25:10 - INFO     - cbt      - rand/incerta04.front.sepia.ceph.com/0: cpu_cycles_per_op: (or (less) (near 0.05)):: 0.0/0.0  => accepted
09:25:10 - INFO     - cbt      - rand/incerta04.front.sepia.ceph.com/0: iops_avg: (or (greater) (near 0.05)):: 61542.0/64721.0  => accepted
09:25:10 - INFO     - cbt      - rand/incerta04.front.sepia.ceph.com/0: iops_stddev: (or (less) (near 2.00)):: 1850.56/1824.72  => accepted
09:25:10 - WARNING  - cbt      - rand/incerta04.front.sepia.ceph.com/0: latency_avg: (or (less) (near 0.05)):: 0.0002568/0.000244263  => rejected
09:25:10 - WARNING  - cbt      - rand/incerta04.front.sepia.ceph.com/1: bandwidth: (or (greater) (near 0.05)):: 240.773/258.788  => rejected
09:25:10 - INFO     - cbt      - rand/incerta04.front.sepia.ceph.com/1: cpu_cycles_per_op: (or (less) (near 0.05)):: 0.0/0.0  => accepted
09:25:10 - WARNING  - cbt      - rand/incerta04.front.sepia.ceph.com/1: iops_avg: (or (greater) (near 0.05)):: 61637.0/66249.0  => rejected
09:25:10 - INFO     - cbt      - rand/incerta04.front.sepia.ceph.com/1: iops_stddev: (or (less) (near 2.00)):: 2022.82/1552.39  => accepted
09:25:10 - WARNING  - cbt      - rand/incerta04.front.sepia.ceph.com/1: latency_avg: (or (less) (near 0.05)):: 0.000256431/0.000238528  => rejected
09:25:10 - WARNING  - cbt      - 4 tests failed out of 20

@rzarzynski
Contributor Author

jenkins test crimson perf

Contributor

@ronen-fr ronen-fr left a comment

one minor comment

  ceph_abort_msg(
    "likely there is another crimson-osd instance with the same id");
} else if (ret < 0) {
  ceph_abort_msg(fmt::format("pidfile_write failed with {}", ret));
Contributor

Can we use cpp_strerror(-ret) here?

Contributor Author

Sure, will switch in a moment.

Contributor Author

Switched.
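
To illustrate the shape of the check discussed in this review thread, here is a minimal, self-contained sketch. It is not the code merged in this PR: pidfile_diagnostic is a hypothetical helper, it returns the message instead of aborting so that it can be exercised in isolation, std::strerror stands in for Ceph's cpp_strerror, and treating -EAGAIN and -EACCES as "lock held by someone else" is an assumption about what the writer returns.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <functional>
#include <string>

// Hypothetical helper, not part of Ceph: runs a pidfile writer that returns 0
// on success or a negative errno value, and maps the result to a diagnostic.
// An empty string means success.
std::string pidfile_diagnostic(const std::function<int()>& write_pidfile) {
  assert(write_pidfile);  // an empty std::function would throw when called
  // C++17 if-with-initializer: ret stays in scope for the whole if/else chain.
  if (const auto ret = write_pidfile();
      ret == -EAGAIN || ret == -EACCES) {
    // Assumption: these are the codes a failed non-blocking lock reports.
    return "likely there is another crimson-osd instance with the same id";
  } else if (ret < 0) {
    // std::strerror expects a positive errno value, hence the negation.
    return std::string("pidfile_write failed with ") + std::strerror(-ret);
  }
  return {};
}
```

Passing the negated return code to the strerror-style call is the point of the review suggestion: the message then names the error rather than printing a bare number.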

The motivation is to assert with a human-readable error when spawning
a crimson-osd with an already occupied ID.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>

The goal is to never implicitly ignore errors that the function can
return, particularly the failure to lock the pidfile because the file
is held by another instance. This problem happened recently in
crimson-osd.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
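
The commit messages above describe two ideas: a pidfile lock that fails when another instance already holds it, and a return value that callers must not silently drop. The sketch below shows both under stated assumptions. try_lock_pidfile is a hypothetical stand-in, not Ceph's pidfile_write; it uses flock(2) purely for illustration, and the real function may lock the file differently. The C++17 [[nodiscard]] attribute is one way to get a compiler warning when the result is ignored; whether the PR uses it is not stated in this conversation.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Hypothetical stand-in for a pidfile writer. Returns an open, locked file
// descriptor (>= 0) on success or a negative errno value on failure.
// [[nodiscard]] asks the compiler to warn when a caller drops the result.
[[nodiscard]] int try_lock_pidfile(const std::string& path) {
  assert(!path.empty());
  const int fd = ::open(path.c_str(), O_CREAT | O_RDWR | O_CLOEXEC, 0644);
  if (fd < 0) {
    return -errno;
  }
  // Non-blocking exclusive lock: fails at once if another holder exists.
  if (::flock(fd, LOCK_EX | LOCK_NB) < 0) {
    const int err = errno;  // save before close() can overwrite it
    ::close(fd);
    return -err;
  }
  // Record our PID so an operator can see who owns the file.
  const std::string pid = std::to_string(::getpid()) + "\n";
  if (::ftruncate(fd, 0) < 0) {
    const int err = errno;
    ::close(fd);
    return -err;
  }
  const ssize_t n = ::write(fd, pid.data(), pid.size());
  if (n != static_cast<ssize_t>(pid.size())) {
    const int err = (n < 0) ? errno : EIO;  // a short write sets no errno
    ::close(fd);
    return -err;
  }
  return fd;  // the lock is held for as long as fd stays open
}
```

A second call for the same path while the first descriptor is still open is expected to fail with -EWOULDBLOCK on Linux, which is the situation the new abort message is meant to explain to the operator.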
@tchaikov
Contributor

tchaikov commented Jun 3, 2020

14:46:25 - INFO     - cbt      - prefill/incerta02.front.sepia.ceph.com/0: bandwidth: (or (greater) (near 0.05)):: 30.7213/28.8815  => accepted
14:46:25 - INFO     - cbt      - prefill/incerta02.front.sepia.ceph.com/0: cpu_cycles_per_op: (or (less) (near 0.05)):: 0.0/0.0  => accepted
14:46:25 - INFO     - cbt      - prefill/incerta02.front.sepia.ceph.com/0: iops_avg: (or (greater) (near 0.05)):: 7864.0/7393.0  => accepted
14:46:25 - INFO     - cbt      - prefill/incerta02.front.sepia.ceph.com/0: iops_stddev: (or (less) (near 2.00)):: 314.332/289.403  => accepted
14:46:25 - INFO     - cbt      - prefill/incerta02.front.sepia.ceph.com/0: latency_avg: (or (less) (near 0.05)):: 0.00200481/0.00214704  => accepted
14:46:25 - INFO     - cbt      - prefill/incerta02.front.sepia.ceph.com/1: bandwidth: (or (greater) (near 0.05)):: 29.7184/30.7389  => accepted
14:46:25 - INFO     - cbt      - prefill/incerta02.front.sepia.ceph.com/1: cpu_cycles_per_op: (or (less) (near 0.05)):: 0.0/0.0  => accepted
14:46:25 - INFO     - cbt      - prefill/incerta02.front.sepia.ceph.com/1: iops_avg: (or (greater) (near 0.05)):: 7607.0/7869.0  => accepted
14:46:25 - INFO     - cbt      - prefill/incerta02.front.sepia.ceph.com/1: iops_stddev: (or (less) (near 2.00)):: 284.018/294.575  => accepted
14:46:25 - INFO     - cbt      - prefill/incerta02.front.sepia.ceph.com/1: latency_avg: (or (less) (near 0.05)):: 0.00208709/0.00201448  => accepted
14:46:25 - WARNING  - cbt      - rand/incerta02.front.sepia.ceph.com/0: bandwidth: (or (greater) (near 0.05)):: 243.257/266.367  => rejected
14:46:25 - INFO     - cbt      - rand/incerta02.front.sepia.ceph.com/0: cpu_cycles_per_op: (or (less) (near 0.05)):: 0.0/0.0  => accepted
14:46:25 - WARNING  - cbt      - rand/incerta02.front.sepia.ceph.com/0: iops_avg: (or (greater) (near 0.05)):: 62273.0/68190.0  => rejected
14:46:25 - INFO     - cbt      - rand/incerta02.front.sepia.ceph.com/0: iops_stddev: (or (less) (near 2.00)):: 1254.55/735.366  => accepted
14:46:25 - WARNING  - cbt      - rand/incerta02.front.sepia.ceph.com/0: latency_avg: (or (less) (near 0.05)):: 0.000253665/0.000231696  => rejected
14:46:25 - INFO     - cbt      - rand/incerta02.front.sepia.ceph.com/1: bandwidth: (or (greater) (near 0.05)):: 257.363/264.736  => accepted
14:46:25 - INFO     - cbt      - rand/incerta02.front.sepia.ceph.com/1: cpu_cycles_per_op: (or (less) (near 0.05)):: 0.0/0.0  => accepted
14:46:25 - INFO     - cbt      - rand/incerta02.front.sepia.ceph.com/1: iops_avg: (or (greater) (near 0.05)):: 65884.0/67772.0  => accepted
14:46:25 - INFO     - cbt      - rand/incerta02.front.sepia.ceph.com/1: iops_stddev: (or (less) (near 2.00)):: 950.992/2252.34  => accepted
14:46:25 - INFO     - cbt      - rand/incerta02.front.sepia.ceph.com/1: latency_avg: (or (less) (near 0.05)):: 0.000239981/0.000233186  => accepted
14:46:25 - WARNING  - cbt      - 3 tests failed out of 20

@tchaikov tchaikov merged commit 6adbce2 into ceph:master Jun 3, 2020