Skip to content

osd/PeeringState: Add logs around can_serve_replica_read() / last_complete_ondisk()#49380

Merged
yuriw merged 5 commits intoceph:mainfrom
Matan-B:wip-matanb-lrod-logs
Mar 22, 2023
Merged

osd/PeeringState: Add logs around can_serve_replica_read() / last_complete_ondisk()#49380
yuriw merged 5 commits intoceph:mainfrom
Matan-B:wip-matanb-lrod-logs

Conversation

@Matan-B
Copy link
Contributor

@Matan-B Matan-B commented Dec 11, 2022

Few commits which were moved out of #49116 into a separate PR. (For easier QA/CR)

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@Matan-B Matan-B requested a review from a team as a code owner December 11, 2022 14:03
@github-actions github-actions bot added the core label Dec 11, 2022
Copy link
Contributor

@ronen-fr ronen-fr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Matan-B Matan-B mentioned this pull request Dec 11, 2022
20 tasks
@Matan-B Matan-B force-pushed the wip-matanb-lrod-logs branch from 6978ad1 to b5f5102 Compare December 12, 2022 10:54
@Matan-B Matan-B requested a review from ronen-fr December 12, 2022 10:54
@Matan-B Matan-B force-pushed the wip-matanb-lrod-logs branch from b5f5102 to 6c0478d Compare December 12, 2022 11:37
@Matan-B Matan-B force-pushed the wip-matanb-lrod-logs branch from 6c0478d to 647b275 Compare December 13, 2022 09:45
@Matan-B Matan-B requested a review from athanatos December 13, 2022 10:18
@Matan-B Matan-B force-pushed the wip-matanb-lrod-logs branch from 647b275 to 051e564 Compare December 14, 2022 11:32
@Matan-B Matan-B requested a review from athanatos December 14, 2022 11:34
Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than switching the logging from 10->20. Those lines are printed at a rate proportional to incoming IO, so they need to be at 20. As a suggestion for the future, your commits combine moving a method between two files, updating the loop structure, and adding logging. Separating those three things into distinct commits makes it much simpler to review since I wouldn't have to go line-by-line to identify your changes.

@Matan-B Matan-B force-pushed the wip-matanb-lrod-logs branch from 051e564 to 85c0101 Compare December 15, 2022 10:13
@Matan-B Matan-B requested a review from athanatos December 15, 2022 10:15
Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a rados suite test run.

@ljflores
Copy link
Member

@Matan-B I see two repeated failures from the testing batch that I believe are related to this PR. Can you take a look?

https://pulpito.ceph.com/yuriw-2023-02-07_16:11:40-rados-wip-yuri10-testing-2023-02-06-0846-distro-default-smithi/7154278/

2023-02-10T11:48:28.167 INFO:tasks.workunit.client.0.smithi072.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-stats.sh:99: check:  grep 'pg[[]1.0.*backfilling.*update_calc_stats degraded ' td/osd-backfill-stats/osd.1.log
2023-02-10T11:48:28.167 INFO:tasks.workunit.client.0.smithi072.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-stats.sh:99: check:  grep -F ' [1,2]/[1] '
2023-02-10T11:48:28.168 INFO:tasks.workunit.client.0.smithi072.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-stats.sh:99: check:  grep -v est
2023-02-10T11:48:28.168 INFO:tasks.workunit.client.0.smithi072.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-stats.sh:99: check:  head -1
2023-02-10T11:48:28.168 INFO:tasks.workunit.client.0.smithi072.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-stats.sh:99: check:  sed 's/.* \([0-9]*\)$/\1/'
2023-02-10T11:48:28.175 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-stats.sh:99: check:  FIRST=
2023-02-10T11:48:28.176 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-stats.sh:100: check:  '[' -n '' ']'
2023-02-10T11:48:28.176 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-stats.sh:100: check:  return 1
2023-02-10T11:48:28.176 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-stats.sh:316: TEST_backfill_down_out:  return 1

https://pulpito.ceph.com/yuriw-2023-02-07_16:11:40-rados-wip-yuri10-testing-2023-02-06-0846-distro-default-smithi/7154355/

2023-02-10T11:17:54.535 INFO:tasks.workunit.client.0.smithi052.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:89: check:  grep 'pg[[]1.0.*recovering.*update_calc_stats degraded ' td/osd-recovery-stats/osd.1.log
2023-02-10T11:17:54.535 INFO:tasks.workunit.client.0.smithi052.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:89: check:  grep -F ' [1,3,2]/[1,NONE,2]p'
2023-02-10T11:17:54.535 INFO:tasks.workunit.client.0.smithi052.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:89: check:  grep -v est
2023-02-10T11:17:54.535 INFO:tasks.workunit.client.0.smithi052.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:89: check:  head -1
2023-02-10T11:17:54.535 INFO:tasks.workunit.client.0.smithi052.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:89: check:  sed 's/.* \([0-9]*\)$/\1/'
2023-02-10T11:17:54.554 INFO:tasks.workunit.client.0.smithi052.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:89: check:  FIRST=
2023-02-10T11:17:54.555 INFO:tasks.workunit.client.0.smithi052.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:90: check:  below_margin 200
2023-02-10T11:17:54.555 INFO:tasks.workunit.client.0.smithi052.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:46: below_margin:  local -i check=200
2023-02-10T11:17:54.555 INFO:tasks.workunit.client.0.smithi052.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:47: below_margin:  shift
2023-02-10T11:17:54.555 INFO:tasks.workunit.client.0.smithi052.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:48: below_margin:  local -i target=
2023-02-10T11:17:54.555 INFO:tasks.workunit.client.0.smithi052.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:50: below_margin:  return 1
2023-02-10T11:17:54.555 INFO:tasks.workunit.client.0.smithi052.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:90: check:  return 1
2023-02-10T11:17:54.555 INFO:tasks.workunit.client.0.smithi052.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:162: do_recovery_out1:  return 1
2023-02-10T11:17:54.555 INFO:tasks.workunit.client.0.smithi052.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-recovery-stats.sh:177: TEST_recovery_erasure_out1:  return 1

There are two different workunits that failed here, but they failed in a very similar way. It's not super obvious how it might be caused by this PR, as I see you are pretty much only adding extra logging. But both workunits seem to use update_calc_stats, which I see has been modified in this PR.

@ljflores
Copy link
Member

I have asked @yuriw to remove the PR from the batch and rerun those two tests to verify. If they turn out unrelated, I'll let you know.

This is the Trello card BTW: https://trello.com/c/FzMz7O3S/1683-wip-yuri10-testing-2023-02-06-0846-old-wip-yuri10-testing-2023-02-06-0809

@Matan-B
Copy link
Contributor Author

Matan-B commented Feb 16, 2023

Thank you Laura!
Looks like osd-recovery-stats.sh/osd-backfill-stats.sh should be adjusted as well since specfic log lines (which were changed here) are being grepped. I'll address it.

FIND_UPACT='grep "pg[[]${PG}.*recovering.*update_calc_stats " $log | tail -1 | sed "s/.*[)] \([[][^ p]*\).*$/\1/"'
FIND_FIRST='grep "pg[[]${PG}.*recovering.*update_calc_stats $which " $log | grep -F " ${UPACT}${addp}" | grep -v est | head -1 | sed "s/.* \([0-9]*\)$/\1/"'
FIND_LAST='grep "pg[[]${PG}.*recovering.*update_calc_stats $which " $log | tail -1 | sed "s/.* \([0-9]*\)$/\1/"'

Edit:
The failure causing lines have a warning // NOTE: Tests use these messages to verify this code. I will edit the commit to leave those log lines intact.

@Matan-B
Copy link
Contributor Author

Matan-B commented Feb 19, 2023

@ljflores
Copy link
Member

@Matan-B message me in gchat if you want it back in one of Yuri's batches

@Matan-B Matan-B force-pushed the wip-matanb-lrod-logs branch 3 times, most recently from 82f557b to 273a0ff Compare March 7, 2023 12:23
Matan-B added 3 commits March 8, 2023 13:23
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Minor refactor

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
@Matan-B Matan-B force-pushed the wip-matanb-lrod-logs branch from 273a0ff to 7484c82 Compare March 8, 2023 14:09
Matan-B added 2 commits March 8, 2023 14:17
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
@Matan-B Matan-B force-pushed the wip-matanb-lrod-logs branch from 7484c82 to 14a1819 Compare March 8, 2023 14:19
@Matan-B
Copy link
Contributor Author

Matan-B commented Mar 9, 2023

@ljflores
Copy link
Member

https://pulpito.ceph.com/?branch=wip-yuri4-testing-2023-03-15-1418

Failures, unrelated:
1. https://tracker.ceph.com/issues/58946
2. https://tracker.ceph.com/issues/56393
3. https://tracker.ceph.com/issues/59123
4. https://tracker.ceph.com/issues/58585
5. https://tracker.ceph.com/issues/58560
6. https://tracker.ceph.com/issues/59127

Details:
1. cephadm: KeyError: 'osdspec_affinity' - Ceph - Orchestrator
2. thrash-erasure-code-big: failed to complete snap trimming before timeout - Ceph - RADOS
3. Timeout opening channel - Infrastructure
4. rook: failed to pull kubelet image - Ceph - Orchestrator
5. test_envlibrados_for_rocksdb.sh failed to subscribe to repo - Infrastructure
6. Job that normally complete much sooner last almost 12 hours - Infrastructure

@yuriw yuriw merged commit 2042e95 into ceph:main Mar 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants