mgr/prometheus: Use RLock to fix deadlock in HealthHistory#66571

Merged
NitzanMordhai merged 3 commits into ceph:main from NitzanMordhai:wip-nitzan-prometheus-HealthHistory-deadlock
Mar 5, 2026

Conversation

@NitzanMordhai
Contributor

@NitzanMordhai NitzanMordhai commented Dec 9, 2025

This PR fixes a few issues:

  1. The HealthHistory.check() method acquires the lock and then calls HealthHistory.save(), which also tries to acquire the same lock. With a regular Lock(), the same thread blocks trying to re-acquire it (deadlock). Switch to RLock to allow nested acquisition by the same thread. The locks were added by PR #65245 ("mgr: fix PyObject* refcounting in TTLCache and cleanup logic").
  2. Added ThreadSafeLRUCacheDict to prevent mutation during iteration.
  3. Restored the missing Content-Type: text/plain; charset=utf-8 header for the standby module.

Fixes: https://tracker.ceph.com/issues/74148
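The deadlock in item 1 and its fix can be illustrated with a minimal, self-contained sketch (method bodies here are simplified stand-ins for the real module code):

```python
import threading

class HealthHistory:
    """Minimal sketch of the deadlock described above (simplified)."""

    def __init__(self):
        # With a plain threading.Lock() here, check() below would hang:
        # it holds the lock while calling save(), which tries to acquire
        # the same lock from the same thread. RLock permits that nested
        # (re-entrant) acquisition.
        self.lock = threading.RLock()
        self.healthcheck = {}

    def save(self):
        with self.lock:  # nested acquisition: fine with RLock
            pass  # would persist self.healthcheck to the KV store

    def check(self, status):
        with self.lock:
            self.healthcheck.update(status)
            self.save()  # re-enters the lock held above

h = HealthHistory()
h.check({"OSD_DOWN": "warning"})  # completes instead of deadlocking
```

With `threading.Lock()` in place of `threading.RLock()`, the `h.check(...)` call at the bottom would block forever inside `save()`.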

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

You must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.

Contributor

@rzarzynski rzarzynski left a comment


FWIW LGTM. The final word goes to the Dashboard Folks as, AFAIK, they maintain the module.

     def __init__(self, mgr: MgrModule):
         self.mgr = mgr
-        self.lock = threading.Lock()
+        self.lock = threading.RLock()
Contributor

@rzarzynski rzarzynski Dec 11, 2025


The other users of the lock property are:

    def reset(self) -> None:
        """Reset the healthcheck history."""
        with self.lock:
            self.mgr.set_store(self.kv_name, "{}")
            self.healthcheck = {}

    def save(self) -> None:
        """Save the current in-memory healthcheck history to the KV store."""
        with self.lock:
            self.mgr.set_store(self.kv_name, self.as_json())

Please note that RLock in Python is not the read-write lock, it's the reentrant lock.
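The distinction matters here: a read-write lock would admit multiple concurrent readers, while RLock only lets the same thread nest acquisitions and still excludes every other thread. A quick sketch of both properties:

```python
import threading

rlock = threading.RLock()
results = {}

with rlock:
    # Re-entrant: the SAME thread may acquire the lock again.
    results["same_thread"] = rlock.acquire(blocking=False)
    if results["same_thread"]:
        rlock.release()

    # A DIFFERENT thread is still fully excluded -- RLock is not a
    # read-write lock, so there is no shared "read" mode.
    def attempt():
        results["other_thread"] = rlock.acquire(blocking=False)
        if results["other_thread"]:
            rlock.release()

    t = threading.Thread(target=attempt)
    t.start()
    t.join()

print(results)  # {'same_thread': True, 'other_thread': False}
```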

@ljflores ljflores requested a review from bluikko December 11, 2025 21:15
Member

@ljflores ljflores left a comment


@bluikko @ceph/dashboard can you have a look?

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-prometheus-HealthHistory-deadlock branch from 4530b73 to 1017ede on December 18, 2025 12:36
Contributor

@tchaikov tchaikov left a comment


lgtm

@tchaikov
Contributor

tchaikov commented Jan 1, 2026

jenkins test make check

@tchaikov
Contributor

tchaikov commented Jan 1, 2026

jenkins test api

@tchaikov
Contributor

tchaikov commented Jan 4, 2026

jenkins test windows

Contributor

@tchaikov tchaikov left a comment


might want to fix the flake8 issues.

return yaml.safe_dump(self.as_dict(), explicit_start=True, default_flow_style=False)


class ThreadSafeLRUCacheDict(LRUCacheDict[K, V], Generic[K, V]):
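The class shown in the diff hunk above could look roughly like this. The real LRUCacheDict base lives in the Ceph mgr code, so a minimal stand-in is included here; everything besides the two class names is illustrative:

```python
import threading
from collections import OrderedDict
from typing import Generic, TypeVar

K = TypeVar("K")
V = TypeVar("V")

class LRUCacheDict(Generic[K, V]):
    """Stand-in for the Ceph-internal base class (assumed shape)."""

    def __init__(self, maxsize: int = 128):
        self.maxsize = maxsize
        self._data: "OrderedDict[K, V]" = OrderedDict()

    def __setitem__(self, key: K, value: V) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used

    def __getitem__(self, key: K) -> V:
        self._data.move_to_end(key)
        return self._data[key]

class ThreadSafeLRUCacheDict(LRUCacheDict[K, V], Generic[K, V]):
    """Wraps every access in a lock so one thread cannot mutate the
    cache while another iterates over it (the bug described above)."""

    def __init__(self, maxsize: int = 128):
        super().__init__(maxsize)
        self._lock = threading.RLock()

    def __setitem__(self, key, value):
        with self._lock:
            super().__setitem__(key, value)

    def __getitem__(self, key):
        with self._lock:
            return super().__getitem__(key)

    def items(self):
        with self._lock:
            return list(self._data.items())  # snapshot, safe to iterate
```

Returning a snapshot list from `items()` means callers iterate over a copy, so concurrent inserts and evictions cannot invalidate the iteration.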
Contributor


@NitzanMordhai could you please fix the lint issues reported by flake8?

flake8: install_deps /ceph/src/pybind/mgr> python -I -m pip install flake8
flake8: commands[0] /ceph/src/pybind/mgr> flake8 --config=tox.ini alerts balancer cephadm cli_api crash devicehealth diskprediction_local hello iostat localpool mgr_module.py mgr_util.py nfs object_format.py orchestrator prometheus rbd_support rgw selftest smb
prometheus/module.py:372:5: E301 expected 1 blank line, found 0
prometheus/module.py:375:5: E301 expected 1 blank line, found 0
prometheus/module.py:378:5: E301 expected 1 blank line, found 0
prometheus/module.py:381:5: E301 expected 1 blank line, found 0
prometheus/module.py:384:5: E301 expected 1 blank line, found 0
prometheus/module.py:387:5: E301 expected 1 blank line, found 0
prometheus/module.py:390:5: E301 expected 1 blank line, found 0
prometheus/module.py:393:5: E301 expected 1 blank line, found 0
prometheus/module.py:397:1: E302 expected 2 blank lines, found 1
8     E301 expected 1 blank line, found 0
1     E302 expected 2 blank lines, found 1

see https://jenkins.ceph.com/job/ceph-pull-requests/172275/consoleFull#-533983586e840cee4-f4a4-4183-81dd-42855615f2c1

Contributor Author


@tchaikov fixed, thanks!

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-prometheus-HealthHistory-deadlock branch 3 times, most recently from 870c36b to 3efd5f1 on January 6, 2026 08:37
@NitzanMordhai
Contributor Author

@afreen23 @Pegonzal can you take a look?

@NitzanMordhai
Contributor Author

jenkins test make check

@NitzanMordhai NitzanMordhai requested a review from tchaikov January 7, 2026 10:55
name,
str(info.severity))
)
with self.health_history.lock:
Contributor


With the builtin lock in self.health_history.healthcheck, we don't need to hold self.health_history.lock when accessing it anymore.

Contributor Author


Right, removed

The HealthHistory.check() method acquires the lock and then calls
HealthHistory.save(), which also tries to acquire the same lock.
With a regular Lock(), the same thread blocks trying to re-acquire it (deadlock).
Switch to RLock to allow nested acquisition by the same thread.
PR ceph#65245 added the locks.

Fixes: https://tracker.ceph.com/issues/74148
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-prometheus-HealthHistory-deadlock branch from 36e2d1e to 0d971b2 on February 2, 2026 13:53
@NitzanMordhai
Copy link
Contributor Author

@NitzanMordhai was trying this out locally, but the prometheus target is not coming up in the prometheus UI with an error; the metrics were successfully exported and can be seen in the /metrics endpoint. But since the target is down, we couldn't get the alerts.

fixed

@NitzanMordhai
Contributor Author

I added unit-tests as well

@nizamial09
Member

@NitzanMordhai was trying this out locally, but the prometheus target is not coming up in the prometheus UI with an error; the metrics were successfully exported and can be seen in the /metrics endpoint. But since the target is down, we couldn't get the alerts.

fixed

@NitzanMordhai I just tested it again and still see the same error in the ceph target:

~/projects/ceph-dev (main*) » curl http://192.168.100.100:9095/api/v1/targets | jq .                                                                                                                                   
{
  "status": "success",
  "data": {
    "activeTargets": [
      {
        "discoveredLabels": {
          "__address__": "ceph-node-00:9283",
          "__meta_url": "http://192.168.100.100:8765/sd/prometheus/sd-config?service=ceph",
          "__metrics_path__": "/metrics",
          "__scheme__": "http",
          "__scrape_interval__": "10s",
          "__scrape_timeout__": "10s",
          "job": "ceph"
        },
        "labels": {
          "cluster": "9ee8994a-010f-11f1-91e4-525400c337c6",
          "instance": "ceph_cluster",
          "job": "ceph"
        },
        "scrapePool": "ceph",
        "scrapeUrl": "http://ceph-node-00:9283/metrics",
        "globalUrl": "http://ceph-node-00:9283/metrics",
        "lastError": "received unsupported Content-Type \"text/html;charset=utf-8\" and no fallback_scrape_protocol specified for target",
        "lastScrape": "2026-02-03T14:57:04.930802312Z",
        "lastScrapeDuration": 0.002269357,
        "health": "down",
        "scrapeInterval": "10s",
        "scrapeTimeout": "10s"
      },

@nizamial09
Member

Saw this thread: prometheus/prometheus#15777, so maybe it's more about adding the version along with the header, like cherrypy.response.headers['Content-Type'] = 'text/plain; version=0.0.4; charset=utf-8', or adding a fallback_scrape_protocol to the prometheus.yml template.
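The exporter itself runs on CherryPy, but the effect of the header suggestion above can be shown with a plain WSGI handler (the handler name and metric line are placeholders, not the real module code):

```python
def metrics_app(environ, start_response):
    # Advertise the Prometheus text exposition format explicitly.
    # Without the version in the Content-Type, newer Prometheus versions
    # report 'received unsupported Content-Type ... and no
    # fallback_scrape_protocol specified' and mark the target down.
    body = b"ceph_health_status 0\n"  # placeholder metric
    start_response("200 OK", [
        ("Content-Type", "text/plain; version=0.0.4; charset=utf-8"),
    ])
    return [body]
```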

@nizamial09
Member

okay, what worked for me locally is adding the fallback:

index 2afbf606af2..2f584bbceb2 100644
--- a/src/pybind/mgr/cephadm/templates/services/prometheus/prometheus.yml.j2
+++ b/src/pybind/mgr/cephadm/templates/services/prometheus/prometheus.yml.j2
@@ -45,6 +45,7 @@ scrape_configs:
 {% for service, urls in service_discovery_cfg.items() %}
  {% if service != 'alertmanager' %}
   - job_name: '{{ service }}'
+    fallback_scrape_protocol: PrometheusText0.0.4
     relabel_configs:
     - source_labels: [__address__]

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-prometheus-HealthHistory-deadlock branch from 0d971b2 to b31df46 on February 5, 2026 12:09
@NitzanMordhai NitzanMordhai requested a review from a team as a code owner February 5, 2026 12:09
@NitzanMordhai
Contributor Author

okay, what worked for me locally is adding the fallback

@nizamial09 can you please review it again? I just made that change

Member

@nizamial09 nizamial09 left a comment


@nizamial09
Member

nizamial09 commented Feb 11, 2026

@NitzanMordhai I see that you are adding cherrypy.response.headers['Content-Type'] = 'text/plain; charset=utf-8' to the StandbyModule but not to the active one. I think we should add it to the active one as well; otherwise it could fail on non-cephadm deployments unless people add the fallback_scrape_protocol to their prometheus.yml configuration.

If we add that header to both active and standby, then we can safely remove the fallback_scrape_protocol from the prometheus.yml.j2 file, and that would actually fix https://tracker.ceph.com/issues/74819, because from what I see it's failing on scraping the /metrics endpoint with a 400 error, similar to what we see if that header is missing while the root prefix and server itself are working:

2026-02-10T13:43:33.363 DEBUG:tasks.mgr.mgr_test_case:Found prometheus at http://10.20.193.21:7789/ (daemon x/5411)
2026-02-10T13:43:33.366 INFO:tasks.mgr.test_prometheus:/: 200 (176 bytes)
2026-02-10T13:43:33.368 INFO:tasks.mgr.test_prometheus:/metrics: 400 (50 bytes)

NitzanMordhai and others added 2 commits February 11, 2026 09:08
PR ceph#65245 dropped the header set for the standby module;
we should still have it.

Fixes: https://tracker.ceph.com/issues/74149
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-prometheus-HealthHistory-deadlock branch from b31df46 to e3de8c9 on February 11, 2026 09:20
@NitzanMordhai
Contributor Author

@nizamial09 please see my changes, thanks!

@nizamial09
Member

@NitzanMordhai looks good to me. we can pick up the new change for the branch of the day runs and see if the test_urls failure is gone.


@SrinivasaBharath
Contributor

jenkins test make check

@SrinivasaBharath
Contributor

@NitzanMordhai -The RADOS test execution has been completed and approved. Kindly let me know if retesting is required.

@NitzanMordhai
Contributor Author

@tchaikov hey! You still have a pending change request that I've already addressed, and it's blocking this PR from merging. Can you please respond?

Contributor

@tchaikov tchaikov left a comment


lgtm.

@NitzanMordhai NitzanMordhai merged commit 7d9f8f3 into ceph:main Mar 5, 2026
14 checks passed
@NitzanMordhai NitzanMordhai deleted the wip-nitzan-prometheus-HealthHistory-deadlock branch March 5, 2026 06:48
@github-actions

github-actions bot commented Mar 5, 2026

This is an automated message by src/script/redmine-upkeep.py.

I have resolved the following tracker ticket due to the merge of this PR:

No backports are pending for the ticket. If this is incorrect, please update the tracker
ticket and reset to Pending Backport state.

Update Log: https://github.com/ceph/ceph/actions/runs/22705980464
