Skip to content

log: concatenate thread names and print once per thread#60464

Merged
batrick merged 4 commits intoceph:mainfrom
batrick:i68691
Nov 13, 2024
Merged

log: concatenate thread names and print once per thread#60464
batrick merged 4 commits intoceph:mainfrom
batrick:i68691

Conversation

@batrick
Copy link
Member

@batrick batrick commented Oct 24, 2024

Fixes: https://tracker.ceph.com/issues/68691

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@batrick
Copy link
Member Author

batrick commented Oct 24, 2024

This "improves" the situation with:

--- pthread ID / name mapping for recent threads ---
  7fa9b7b26640 /
  7fa9b8327640 /
  7fa9b8b28640 /
  7fa9b9b2a640 /
  7fa9ba32b640 /
  7fa9bab2c640 /
  7fa9bb32d640 / , md_log_replay, md_submit
  7fa9bc32f640 /
  7fa9bcb30640 /
  7fa9bd331640 /
  7fa9bdb32640 /
  7fa9bfb36640 /
  7fa9c0337640 /
  7fa9c0b38640 /
  7fa9c1b3a640 / , MR_Finisher, OpHistorySvc, quiesce.agt, quiesce_db_mgr, safe_timer
  7fa9c233b640 /
  7fa9c2b3c640 /
  7fa9c3b3e640 /
  7fa9c433f640 /
  7fa9c4b40640 /
  7fa9c5341640 /
  7fa9c5b42640 /
  7fa9c7322c00 / , log, ms_local, safe_timer, service, signal_handler

but the thread names are on the same pthread_t id. Not sure why and it's too late for me to dig further.

@mchangir Did you not simulate a core dump / assertion to check this? The code was blatantly wrong:

https://github.com/ceph/ceph/pull/60058/files#diff-12955fa50220234ca6a6ac45287d69871dc082c446b54ad544af715195308e86R496

what's worse, the const char* references memory in an array that will be freed.

I'm confess to being disappointed this was merged. @ljflores @rzarzynski RADOS QA should not have been considered adequate for testing.

@batrick
Copy link
Member Author

batrick commented Oct 24, 2024

Now looks right in my testing:

--- pthread ID / name mapping for recent threads ---
  7fe82d1c3640 / quiesce_db_mgr
  7fe82e1c5640 / ms_dispatch
  7fe82e9c6640 / ms_dispatch
  7fe82f1c7640 / md_submit
  7fe82f9c8640 / MR_Finisher
  7fe8319cc640 / ms_dispatch
  7fe8321cd640 / ms_dispatch
  7fe8339d0640 / safe_timer
  7fe8341d1640 / beacon
  7fe8349d2640 / safe_timer
  7fe8351d3640 / safe_timer
  7fe8359d4640 / ms_local
  7fe8361d5640 / ms_dispatch
  7fe8371d7640 / io_context_pool
  7fe8381d9640 / ms_dispatch, msgr-worker-2
  7fe8389da640 / msgr-worker-2
  7fe8391db640 / admin_socket, msgr-worker-1
  7fe8399dc640 / msgr-worker-0
  7fe83a1dd640 / msgr-worker-1
  7fe83ae5ccc0 / ceph-mds
  max_recent     10000
  max_new         1000
  log_file /home/pdonnell/ceph/build/out/mds.b.log
--- end dump of recent events ---

(note some thread ids were reused.)

@batrick
Copy link
Member Author

batrick commented Oct 24, 2024

however:

--- pthread ID / name mapping for recent threads ---
  7f861b38c640 / md_submit
  7f861c38e640 / MR_Finisher
  7f8622b9b640 / ms_dispatch
  7f8627c2ccc0 / ceph-mds
  max_recent     10000
  max_new         1000
  log_file /home/pdonnell/ceph/build/out/mds.e.log
--- end dump of recent events ---

I think we need to cache recent threads even if there's no debug messages for the thread recently.

@mchangir
Copy link
Contributor

@batrick I realized my folly about the dangling pointers after you commented on my PR

@batrick
Copy link
Member Author

batrick commented Oct 24, 2024

however:

--- pthread ID / name mapping for recent threads ---
  7f861b38c640 / md_submit
  7f861c38e640 / MR_Finisher
  7f8622b9b640 / ms_dispatch
  7f8627c2ccc0 / ceph-mds
  max_recent     10000
  max_new         1000
  log_file /home/pdonnell/ceph/build/out/mds.e.log
--- end dump of recent events ---

I think we need to cache recent threads even if there's no debug messages for the thread recently.

done.

some thread names need updated, going tof ix that next

@batrick
Copy link
Member Author

batrick commented Oct 24, 2024

Now looks like:

--- pthread ID / name mapping for recent threads ---
  7ffa743c1640 / mds-q-db
  7ffa74bc2640 / mds-q-agt
  7ffa753c3640 / mds-ping
  7ffa75bc4640 / mds-metrics
  7ffa763c5640 / mds-log-replay
  7ffa76bc6640 / mds-log-submit
  7ffa773c7640 / mds-log-recvr
  7ffa77bc8640 / mds-rank-fin
  7ffa78bca640 / mds-pq-fin
  7ffa793cb640 / mds-rank-progr
  7ffa79bcc640 / mds-log-trim
  7ffa7a3cd640 / mds-cache-trim
  7ffa7bbd0640 / safe_timer
  7ffa7c3d1640 / mds-beacon
  7ffa7cbd2640 / safe_timer
  7ffa7d3d3640 / safe_timer
  7ffa7dbd4640 / ms_local
  7ffa7e3d5640 / ms_dispatch
  7ffa7ebd6640 / io_context_pool
  7ffa7f3d7640 / io_context_pool
  7ffa803d9640 / msgr-worker-2, ms_dispatch
  7ffa80bda640 / msgr-worker-2
  7ffa813db640 / admin_socket, msgr-worker-1
  7ffa81bdc640 / msgr-worker-0
  7ffa823dd640 / msgr-worker-1
  7ffa8306acc0 / ceph-mds
  max_recent     10000
  max_new         1000
  log_file /home/pdonnell/ceph/build/out/mds.e.log
--- end dump of recent events ---

I consider this PR complete.

@batrick batrick requested a review from mchangir October 24, 2024 19:00
@batrick
Copy link
Member Author

batrick commented Oct 25, 2024

windows failure is legitimate. Looking into it now.

Fixes: 0be8d01
Fixes: https://tracker.ceph.com/issues/68691
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Copy link
Contributor

@mchangir mchangir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

This provides common ceph entrypoints for the pthread_[gs]name functions which
will also cache a thread_local copy. This also removes the pthread_t parameter
which precipitated the bug i50743.

Obviously, the overall goal here is to avoid system calls.

See-also: https://tracker.ceph.com/issues/50743
Fixes: 0be8d01
Fixes: https://tracker.ceph.com/issues/68691
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
In general, a pthread implementation reuses thread ids routinely so the
circular_buffer is there to help us see what the thread name was historically
(capped at 4 entries).

However, to guard against this map from growing without bound, discard entries
that are more than a day old. This would happen if a thread logged an Entry and
has since disappeared for more than a day.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
To be consistent and sensical.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
@batrick
Copy link
Member Author

batrick commented Oct 25, 2024

jenkins test make check

@batrick
Copy link
Member Author

batrick commented Oct 27, 2024

[ RUN      ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/run-rbd-unit-tests.sh: line 20: 1377238 Segmentation fault      RBD_FEATURES=$i unittest_librbd

@batrick
Copy link
Member Author

batrick commented Oct 27, 2024

jenkins test make check

1 similar comment
@vshankar
Copy link
Contributor

jenkins test make check

@batrick
Copy link
Member Author

batrick commented Oct 28, 2024

	
CTest Failure

2 tests failed out of 302

Total Test time (real) = 1897.31 sec

The following tests FAILED:
	  4 - run-tox-mgr (Failed)
	 22 - run-tox-cephadm (Failed)

@batrick
Copy link
Member Author

batrick commented Oct 28, 2024

jenkins test make check

@batrick
Copy link
Member Author

batrick commented Oct 31, 2024

This PR is under test in https://tracker.ceph.com/issues/68786.

@batrick
Copy link
Member Author

batrick commented Nov 6, 2024

This PR is under test in https://tracker.ceph.com/issues/68859.

@batrick
Copy link
Member Author

batrick commented Nov 13, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants