mgr/prometheus: Use RLock to fix deadlock in HealthHistory#66571
Conversation
cf00dc7 to
4530b73
Compare
rzarzynski
left a comment
There was a problem hiding this comment.
FWIW LGTM. The final word goes to the Dashboard Folks as, AFAIK, they maintain the module.
| def __init__(self, mgr: MgrModule): | ||
| self.mgr = mgr | ||
| self.lock = threading.Lock() | ||
| self.lock = threading.RLock() |
There was a problem hiding this comment.
The other users of the lock property are:
def reset(self) -> None:
"""Reset the healthcheck history."""
with self.lock:
self.mgr.set_store(self.kv_name, "{}")
self.healthcheck = {}
def save(self) -> None:
"""Save the current in-memory healthcheck history to the KV store."""
with self.lock:
self.mgr.set_store(self.kv_name, self.as_json())Please note that RLock in Python is not the read-write lock, it's the reentrant lock.
4530b73 to
1017ede
Compare
|
jenkins test make check |
|
jenkins test api |
|
jenkins test windows |
tchaikov
left a comment
There was a problem hiding this comment.
might want to fix the flake8 issues.
src/pybind/mgr/prometheus/module.py
Outdated
| return yaml.safe_dump(self.as_dict(), explicit_start=True, default_flow_style=False) | ||
|
|
||
|
|
||
| class ThreadSafeLRUCacheDict(LRUCacheDict[K, V], Generic[K, V]): |
There was a problem hiding this comment.
@NitzanMordhai could you please fix the lint issues reported by flake8?
flake8: install_deps /ceph/src/pybind/mgr> python -I -m pip install flake8
flake8: commands[0] /ceph/src/pybind/mgr> flake8 --config=tox.ini alerts balancer cephadm cli_api crash devicehealth diskprediction_local hello iostat localpool mgr_module.py mgr_util.py nfs object_format.py orchestrator prometheus rbd_support rgw selftest smb
prometheus/module.py:372:5: E301 expected 1 blank line, found 0
prometheus/module.py:375:5: E301 expected 1 blank line, found 0
prometheus/module.py:378:5: E301 expected 1 blank line, found 0
prometheus/module.py:381:5: E301 expected 1 blank line, found 0
prometheus/module.py:384:5: E301 expected 1 blank line, found 0
prometheus/module.py:387:5: E301 expected 1 blank line, found 0
prometheus/module.py:390:5: E301 expected 1 blank line, found 0
prometheus/module.py:393:5: E301 expected 1 blank line, found 0
prometheus/module.py:397:1: E302 expected 2 blank lines, found 1
8 E301 expected 1 blank line, found 0
1 E302 expected 2 blank lines, found 1
870c36b to
3efd5f1
Compare
|
jenkins test make check |
src/pybind/mgr/prometheus/module.py
Outdated
| name, | ||
| str(info.severity)) | ||
| ) | ||
| with self.health_history.lock: |
There was a problem hiding this comment.
with the builtin lock in self.health_history.healthcheck, we don't need to hold self.health_history.lock when accessing it anymore.
There was a problem hiding this comment.
Right, removed
The HealthHistory.check() method acquires the lock and then calls HealthHistory.save(), which also tries to acquire the same lock. With a regular Lock(), the same thread blocks trying to re-acquire it (deadlock). Switch to RLock to allow nested acquisition by the same thread. PR ceph#65245 added the locks. Fixes: https://tracker.ceph.com/issues/74148 Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
36e2d1e to
0d971b2
Compare
fixed |
|
I added unit-tests as well |
@NitzanMordhai i just tested it again and still see the same error in the ceph target |
|
Saw this thread: prometheus/prometheus#15777, so maybe its more about adding the version along with header like |
|
okay, what worked for me local is adding the |
0d971b2 to
b31df46
Compare
@nizamial09 can you please review it again? i just made that change |
|
@NitzanMordhai I see that you are adding If we add that header to both active and standby then we can safely remove the |
PR ceph#65245 drop the header set for standby module, we should still have it. Fixes: https://tracker.ceph.com/issues/74149 Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
Fixes: https://tracker.ceph.com/issues/74149 Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
b31df46 to
e3de8c9
Compare
|
@nizamial09 please see my changes, thanks! |
|
@NitzanMordhai looks good to me. we can pick up the new change for the branch of the day runs and see if the |
|
jenkins test make check |
|
@NitzanMordhai -The RADOS test execution has been completed and approved. Kindly let me know if retesting is required. |
|
@tchaikov hey! you still have pending request of change that i done and its blocking that PR from merge, can you please respond? |
|
This is an automated message by src/script/redmine-upkeep.py. I have resolved the following tracker ticket due to the merge of this PR: No backports are pending for the ticket. If this is incorrect, please update the tracker Update Log: https://github.com/ceph/ceph/actions/runs/22705980464 |
This PR fixes a few issues:
ThreadSafeLRUCacheDictto prevent mutation during iterationContent-Type: text/plain; charsetheader for standby moduleFixes: https://tracker.ceph.com/issues/74148
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins test classic perfJenkins Job | Jenkins Job Definitionjenkins test crimson perfJenkins Job | Jenkins Job Definitionjenkins test signedJenkins Job | Jenkins Job Definitionjenkins test make checkJenkins Job | Jenkins Job Definitionjenkins test make check arm64Jenkins Job | Jenkins Job Definitionjenkins test submodulesJenkins Job | Jenkins Job Definitionjenkins test dashboardJenkins Job | Jenkins Job Definitionjenkins test dashboard cephadmJenkins Job | Jenkins Job Definitionjenkins test apiJenkins Job | Jenkins Job Definitionjenkins test docsReadTheDocs | Github Workflow Definitionjenkins test ceph-volume allJenkins Jobs | Jenkins Jobs Definitionjenkins test windowsJenkins Job | Jenkins Job Definitionjenkins test rook e2eJenkins Job | Jenkins Job DefinitionYou must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.