mgr/prometheus: Use RLock to fix deadlock in HealthHistory#66571

Merged
NitzanMordhai merged 3 commits into ceph:main from NitzanMordhai:wip-nitzan-prometheus-HealthHistory-deadlock
Mar 5, 2026

Conversation

@NitzanMordhai
Contributor

@NitzanMordhai NitzanMordhai commented Dec 9, 2025

This PR fixes a few issues:

  1. The HealthHistory.check() method acquires the lock and then calls HealthHistory.save(), which also tries to acquire the same lock. With a regular Lock(), the same thread blocks trying to re-acquire it (deadlock). Switch to RLock to allow nested acquisition by the same thread. The locks were added by PR #65245 ("mgr: fix PyObject* refcounting in TTLCache and cleanup logic").
  2. Added ThreadSafeLRUCacheDict to prevent mutation during iteration.
  3. Restored the missing Content-Type: text/plain; charset=utf-8 header for the standby module.

Fixes: https://tracker.ceph.com/issues/74148
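The deadlock in item 1 and its fix can be illustrated with a minimal, self-contained sketch (method bodies here are simplified stand-ins for the real module code):

```python
import threading

class HealthHistory:
    """Minimal sketch of the deadlock described above (simplified)."""

    def __init__(self):
        # With a plain threading.Lock() here, check() below would hang:
        # it holds the lock while calling save(), which tries to acquire
        # the same lock from the same thread. RLock permits that nested
        # (re-entrant) acquisition.
        self.lock = threading.RLock()
        self.healthcheck = {}

    def save(self):
        with self.lock:  # nested acquisition: fine with RLock
            pass  # would persist self.healthcheck to the KV store

    def check(self, status):
        with self.lock:
            self.healthcheck.update(status)
            self.save()  # re-enters the lock held above

h = HealthHistory()
h.check({"OSD_DOWN": "warning"})  # completes instead of deadlocking
```

With `threading.Lock()` in place of `threading.RLock()`, the `h.check(...)` call at the bottom would block forever inside `save()`.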

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

You must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.

Contributor

@rzarzynski rzarzynski left a comment


FWIW LGTM. The final word goes to the Dashboard Folks as, AFAIK, they maintain the module.

     def __init__(self, mgr: MgrModule):
         self.mgr = mgr
-        self.lock = threading.Lock()
+        self.lock = threading.RLock()
Contributor

@rzarzynski rzarzynski Dec 11, 2025


The other users of the lock property are:

    def reset(self) -> None:
        """Reset the healthcheck history."""
        with self.lock:
            self.mgr.set_store(self.kv_name, "{}")
            self.healthcheck = {}

    def save(self) -> None:
        """Save the current in-memory healthcheck history to the KV store."""
        with self.lock:
            self.mgr.set_store(self.kv_name, self.as_json())

Please note that RLock in Python is not the read-write lock, it's the reentrant lock.
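The distinction matters here: a read-write lock would admit multiple concurrent readers, while RLock only lets the same thread nest acquisitions and still excludes every other thread. A quick sketch of both properties:

```python
import threading

rlock = threading.RLock()
results = {}

with rlock:
    # Re-entrant: the SAME thread may acquire the lock again.
    results["same_thread"] = rlock.acquire(blocking=False)
    if results["same_thread"]:
        rlock.release()

    # A DIFFERENT thread is still fully excluded -- RLock is not a
    # read-write lock, so there is no shared "read" mode.
    def attempt():
        results["other_thread"] = rlock.acquire(blocking=False)
        if results["other_thread"]:
            rlock.release()

    t = threading.Thread(target=attempt)
    t.start()
    t.join()

print(results)  # {'same_thread': True, 'other_thread': False}
```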

@ljflores ljflores requested a review from bluikko December 11, 2025 21:15
Member

@ljflores ljflores left a comment


@bluikko @ceph/dashboard can you have a look?

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-prometheus-HealthHistory-deadlock branch from 4530b73 to 1017ede on December 18, 2025 12:36
Contributor

@tchaikov tchaikov left a comment


lgtm

@tchaikov
Contributor

tchaikov commented Jan 1, 2026

jenkins test make check

@tchaikov
Contributor

tchaikov commented Jan 1, 2026

jenkins test api

@tchaikov
Contributor

tchaikov commented Jan 4, 2026

jenkins test windows

Contributor

@tchaikov tchaikov left a comment


might want to fix the flake8 issues.

return yaml.safe_dump(self.as_dict(), explicit_start=True, default_flow_style=False)


class ThreadSafeLRUCacheDict(LRUCacheDict[K, V], Generic[K, V]):
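The class shown in the diff hunk above could look roughly like this. The real LRUCacheDict base lives in the Ceph mgr code, so a minimal stand-in is included here; everything besides the two class names is illustrative:

```python
import threading
from collections import OrderedDict
from typing import Generic, TypeVar

K = TypeVar("K")
V = TypeVar("V")

class LRUCacheDict(Generic[K, V]):
    """Stand-in for the Ceph-internal base class (assumed shape)."""

    def __init__(self, maxsize: int = 128):
        self.maxsize = maxsize
        self._data: "OrderedDict[K, V]" = OrderedDict()

    def __setitem__(self, key: K, value: V) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used

    def __getitem__(self, key: K) -> V:
        self._data.move_to_end(key)
        return self._data[key]

class ThreadSafeLRUCacheDict(LRUCacheDict[K, V], Generic[K, V]):
    """Wraps every access in a lock so one thread cannot mutate the
    cache while another iterates over it (the bug described above)."""

    def __init__(self, maxsize: int = 128):
        super().__init__(maxsize)
        self._lock = threading.RLock()

    def __setitem__(self, key, value):
        with self._lock:
            super().__setitem__(key, value)

    def __getitem__(self, key):
        with self._lock:
            return super().__getitem__(key)

    def items(self):
        with self._lock:
            return list(self._data.items())  # snapshot, safe to iterate
```

Returning a snapshot list from `items()` means callers iterate over a copy, so concurrent inserts and evictions cannot invalidate the iteration.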
Contributor


@NitzanMordhai could you please fix the lint issues reported by flake8?

flake8: install_deps /ceph/src/pybind/mgr> python -I -m pip install flake8
flake8: commands[0] /ceph/src/pybind/mgr> flake8 --config=tox.ini alerts balancer cephadm cli_api crash devicehealth diskprediction_local hello iostat localpool mgr_module.py mgr_util.py nfs object_format.py orchestrator prometheus rbd_support rgw selftest smb
prometheus/module.py:372:5: E301 expected 1 blank line, found 0
prometheus/module.py:375:5: E301 expected 1 blank line, found 0
prometheus/module.py:378:5: E301 expected 1 blank line, found 0
prometheus/module.py:381:5: E301 expected 1 blank line, found 0
prometheus/module.py:384:5: E301 expected 1 blank line, found 0
prometheus/module.py:387:5: E301 expected 1 blank line, found 0
prometheus/module.py:390:5: E301 expected 1 blank line, found 0
prometheus/module.py:393:5: E301 expected 1 blank line, found 0
prometheus/module.py:397:1: E302 expected 2 blank lines, found 1
8     E301 expected 1 blank line, found 0
1     E302 expected 2 blank lines, found 1

see https://jenkins.ceph.com/job/ceph-pull-requests/172275/consoleFull#-533983586e840cee4-f4a4-4183-81dd-42855615f2c1

Contributor Author


@tchaikov fixed, thanks!

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-prometheus-HealthHistory-deadlock branch 3 times, most recently from 870c36b to 3efd5f1 on January 6, 2026 08:37
@NitzanMordhai
Contributor Author

@afreen23 @Pegonzal can you take a look?

@NitzanMordhai
Contributor Author

jenkins test make check

@NitzanMordhai NitzanMordhai requested a review from tchaikov January 7, 2026 10:55
name,
str(info.severity))
)
with self.health_history.lock:
Contributor


With the builtin lock in self.health_history.healthcheck, we don't need to hold self.health_history.lock when accessing it anymore.

Contributor Author


Right, removed

The HealthHistory.check() method acquires the lock and then calls
HealthHistory.save(), which also tries to acquire the same lock.
With a regular Lock(), the same thread blocks trying to re-acquire it (deadlock).
Switch to RLock to allow nested acquisition by the same thread.
PR ceph#65245 added the locks.

Fixes: https://tracker.ceph.com/issues/74148
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-prometheus-HealthHistory-deadlock branch from 36e2d1e to 0d971b2 on February 2, 2026 13:53
@NitzanMordhai
Copy link
Contributor Author

@NitzanMordhai was trying this out locally, but the prometheus target is not coming up in the prometheus UI with an error; the metrics were successfully exported and can be seen in the /metrics endpoint. But since the target is down, we couldn't get the alerts.

fixed

@NitzanMordhai
Contributor Author

I added unit-tests as well

@nizamial09
Member

@NitzanMordhai was trying this out locally, but the prometheus target is not coming up in the prometheus UI with an error; the metrics were successfully exported and can be seen in the /metrics endpoint. But since the target is down, we couldn't get the alerts.

fixed

@NitzanMordhai I just tested it again and still see the same error in the ceph target:

~/projects/ceph-dev (main*) » curl http://192.168.100.100:9095/api/v1/targets | jq .                                                                                                                                   
{
  "status": "success",
  "data": {
    "activeTargets": [
      {
        "discoveredLabels": {
          "__address__": "ceph-node-00:9283",
          "__meta_url": "http://192.168.100.100:8765/sd/prometheus/sd-config?service=ceph",
          "__metrics_path__": "/metrics",
          "__scheme__": "http",
          "__scrape_interval__": "10s",
          "__scrape_timeout__": "10s",
          "job": "ceph"
        },
        "labels": {
          "cluster": "9ee8994a-010f-11f1-91e4-525400c337c6",
          "instance": "ceph_cluster",
          "job": "ceph"
        },
        "scrapePool": "ceph",
        "scrapeUrl": "http://ceph-node-00:9283/metrics",
        "globalUrl": "http://ceph-node-00:9283/metrics",
        "lastError": "received unsupported Content-Type \"text/html;charset=utf-8\" and no fallback_scrape_protocol specified for target",
        "lastScrape": "2026-02-03T14:57:04.930802312Z",
        "lastScrapeDuration": 0.002269357,
        "health": "down",
        "scrapeInterval": "10s",
        "scrapeTimeout": "10s"
      },

@nizamial09
Member

Saw this thread: prometheus/prometheus#15777, so maybe it's more about adding the version along with the header, like cherrypy.response.headers['Content-Type'] = 'text/plain; version=0.0.4; charset=utf-8', or adding a fallback_scrape_protocol to the prometheus.yml template.
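The exporter itself runs on CherryPy, but the effect of the header suggestion above can be shown with a plain WSGI handler (the handler name and metric line are placeholders, not the real module code):

```python
def metrics_app(environ, start_response):
    # Advertise the Prometheus text exposition format explicitly.
    # Without the version in the Content-Type, newer Prometheus versions
    # report 'received unsupported Content-Type ... and no
    # fallback_scrape_protocol specified' and mark the target down.
    body = b"ceph_health_status 0\n"  # placeholder metric
    start_response("200 OK", [
        ("Content-Type", "text/plain; version=0.0.4; charset=utf-8"),
    ])
    return [body]
```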

@nizamial09
Member

okay, what worked for me locally is adding the fallback:

index 2afbf606af2..2f584bbceb2 100644
--- a/src/pybind/mgr/cephadm/templates/services/prometheus/prometheus.yml.j2
+++ b/src/pybind/mgr/cephadm/templates/services/prometheus/prometheus.yml.j2
@@ -45,6 +45,7 @@ scrape_configs:
 {% for service, urls in service_discovery_cfg.items() %}
  {% if service != 'alertmanager' %}
   - job_name: '{{ service }}'
+    fallback_scrape_protocol: PrometheusText0.0.4
     relabel_configs:
     - source_labels: [__address__]

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-prometheus-HealthHistory-deadlock branch from 0d971b2 to b31df46 on February 5, 2026 12:09
@NitzanMordhai NitzanMordhai requested a review from a team as a code owner February 5, 2026 12:09
@NitzanMordhai
Contributor Author

okay, what worked for me locally is adding the fallback

@nizamial09 can you please review it again? I just made that change

Member

@nizamial09 nizamial09 left a comment


@nizamial09
Member

nizamial09 commented Feb 11, 2026

@NitzanMordhai I see that you are adding cherrypy.response.headers['Content-Type'] = 'text/plain; charset=utf-8' to the StandbyModule but not to the active one. I think we should add it to the active one as well; otherwise it could fail on non-cephadm deployments unless people add the fallback_scrape_protocol to their prometheus.yml configuration.

If we add that header to both active and standby, then we can safely remove the fallback_scrape_protocol from the prometheus.yml.j2 file, and that would actually fix https://tracker.ceph.com/issues/74819, because from what I see it's failing on scraping the /metrics endpoint with a 400 error, similar to what we see if that header is missing while the root prefix and server itself are working:

2026-02-10T13:43:33.363 DEBUG:tasks.mgr.mgr_test_case:Found prometheus at http://10.20.193.21:7789/ (daemon x/5411)
2026-02-10T13:43:33.366 INFO:tasks.mgr.test_prometheus:/: 200 (176 bytes)
2026-02-10T13:43:33.368 INFO:tasks.mgr.test_prometheus:/metrics: 400 (50 bytes)

NitzanMordhai and others added 2 commits February 11, 2026 09:08
PR ceph#65245 dropped the header set for the standby module;
we should still have it.

Fixes: https://tracker.ceph.com/issues/74149
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-prometheus-HealthHistory-deadlock branch from b31df46 to e3de8c9 on February 11, 2026 09:20
@NitzanMordhai
Contributor Author

@nizamial09 please see my changes, thanks!

@nizamial09
Member

@NitzanMordhai looks good to me. we can pick up the new change for the branch of the day runs and see if the test_urls failure is gone.


@SrinivasaBharath
Contributor

jenkins test make check

@SrinivasaBharath
Contributor

@NitzanMordhai -The RADOS test execution has been completed and approved. Kindly let me know if retesting is required.

@NitzanMordhai
Contributor Author

@tchaikov hey! You still have a pending change request that I've already addressed, and it's blocking this PR from merging. Can you please respond?

Contributor

@tchaikov tchaikov left a comment


lgtm.

@NitzanMordhai NitzanMordhai merged commit 7d9f8f3 into ceph:main Mar 5, 2026
14 checks passed
@NitzanMordhai NitzanMordhai deleted the wip-nitzan-prometheus-HealthHistory-deadlock branch March 5, 2026 06:48
@github-actions

github-actions bot commented Mar 5, 2026

This is an automated message by src/script/redmine-upkeep.py.

I have resolved the following tracker ticket due to the merge of this PR:

No backports are pending for the ticket. If this is incorrect, please update the tracker
ticket and reset to Pending Backport state.

Update Log: https://github.com/ceph/ceph/actions/runs/22705980464
