Make prometheus TLS config work with Rook orchestrator by Sunnatillo · Pull Request #61468 · ceph/ceph

Sunnatillo · 2025-01-21T13:53:37Z

This commit is introduced to generate self signed certificated for Prometheus and dashboard. #58402
In above mentioned commit tls for prometheus works only with cephadm orchestrator. And it also generates selfsigned certificate with cephadm-root CA that is genereated every time function is called.

However when using rook orch certificates are not generated. This PR implements getting certificates for rook orch.

Solution is following:

Certificates are read from rook namespace via kubernetes api client.
Name of the secrets are provided by following parameters:

prometheus_tls_secret_name (default secret name "rook-ceph-prometheus-server-tls")
dashboard_tls_secret_name (default secret name "rook-ceph-dashboard-server-tls")

In simple words when cephadm is used it generates silf signed certificates, when rook used it reads certififcates from kubernetes client api in rook_env.namespace.

Tracker issue: https://tracker.ceph.com/issues/69610

Contribution Guidelines

To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e

Sunnatillo · 2025-03-05T11:36:37Z

jenkins retest this please

Sunnatillo · 2025-03-05T13:49:57Z

@travisn I have updated the PR description a bit. What do you think of the solution?

@rkachach PTAL when you have time

travisn · 2025-03-05T17:45:11Z

@Sunnatillo Help me understand a few things:

Rook users need to create these secrets for dashboard and prometheus? Are they optional? It looks like errors are raised if they are not found.
Have you tested the changes with Rook?
We would need documentation updates in Rook, right? For example, to the dashboard and prometheus docs?
How can we ensure this feature works with Rook in the long term? We don't have good testing integration for the rook mgr module.
Could this change instead be implemented in Rook? Then we have better control over the cert generation and we can test in the Rook CI.

I haven't reviewed deeply if this change is the right approach or not, my questions are really based on our history that maintaining the rook mgr module has been a problem due to testing end to end with Rook.

Sunnatillo · 2025-03-06T12:11:38Z

@Sunnatillo Help me understand a few things:

Rook users need to create these secrets for dashboard and prometheus? Are they optional? It looks like errors are raised if they are not found.

Have you tested the changes with Rook?

We would need documentation updates in Rook, right? For example, to the dashboard and prometheus docs?

How can we ensure this feature works with Rook in the long term? We don't have good testing integration for the rook mgr module.

Could this change instead be implemented in Rook? Then we have better control over the cert generation and we can test in the Rook CI.

I haven't reviewed deeply if this change is the right approach or not, my questions are really based on our history that maintaining the rook mgr module has been a problem due to testing end to end with Rook.

Thank you for your questions:

yes, Rook users need to create these secrets for dashboard and prometheus. Yes, they are optional. For me if user sets metrics_tls_enabled to true errors should be thrown. Even it fails it will go back to http connection.

 if self.get_module_option('metrics_tls_enabled'):
            try:
                self.setup_tls_config(server_addr, server_port)
                return
            except Exception as e:
                self.log.exception(f'Failed to setup orchestrator based secure monitoring stack: {e}\n',
                                    'Falling back to default configuration')
        # In any error fallback to plain http mode
        self.setup_default_config(server_addr, server_port)

yes, we have been testing it heavily for last one month. And it did not show any issues.
yes, we need to make doc updates to dashboard and prometheus docs.
I can't answer this question due to lack of experience in here.
Another option would be mount certificates to mgr to defined location, and we can think through the way of naming and location of of the certificates etc.

The reason we implemented this ways is to have the same as cephadm orch. It uses the same "orch certmgr generate-certificates" command and it will work for rook with current implementation.

rkachach · 2025-03-07T06:17:06Z

@Sunnatillo thank you for working on this solution. Before reviewing the PR in deep and in order to try giving an informed opinion I'd like to understand how is your monitoring stack deployed:

Are you consuming Ceph monitoring stack?
If not, how are you deploying your own monitoring stack?

.. from the changes it seems to me that in fact what you need is to basically enable SSL/TLS for the Prometheus metrics end-point. The PR #58402 you linked is introducing mTLS because it's needed for a better security as par of the PR #57535. I had to implement that in cephadm because we don't have a platform as in k8s ... but in rook probably you can get a similar/equivalent functionality to the cephadm mgmt-gatway out-of-the-box by using k8s ingress...

Sunnatillo · 2025-03-10T09:37:53Z

@Sunnatillo thank you for working on this solution. Before reviewing the PR in deep and in order to try giving an informed opinion I'd like to understand how is your monitoring stack deployed:

Are you consuming Ceph monitoring stack?

If not, how are you deploying your own monitoring stack?

.. from the changes it seems to me that in fact what you need is to basically enable SSL/TLS for the Prometheus metrics end-point. The PR #58402 you linked is introducing mTLS because it's needed for a better security as par of the PR #57535. I had to implement that in cephadm because we don't have a platform as in k8s ... but in rook probably you can get a similar/equivalent functionality to the cephadm mgmt-gatway out-of-the-box by using k8s ingress...

Thank you for your comment @rkachach.

No we are not consuming ceph monitoring stack.
We have our our monitoring stack with Prometheus on k8s. We want to have TLS on Prometheus endpoint for internal communication inside cluster. Yes we can get similar functionality with ingress, but we still need to get certificates from somewhere or generate it for internal/external communication. As it is better to manage certificates with k8s, in this PR we try to get certificates from k8s api, instead of self generating like in cephadm case.

I would say, it is better to get certificates from k8s api since we can manage certificates easily in k8s cluster, for example with cert-manager.

parth-gr · 2025-03-12T06:44:09Z

lets use what rgw tls uses, i think rgw should be using server certificate provided by k8s in internal mode, cc @thotz @BlaineEXE

BlaineEXE · 2025-03-12T20:50:46Z

I'm not a ceph developer, but I'm not sure I would want to have K8s APIs coded into Ceph. Likely, a more flexible architecture for Ceph is to accept a certificate file for the prometheus server instantiation (does this exist already?) and have Rook connect that together in what I would presume to be the mgr pod and/or metrics collectors.

The rook orch is and has been broken for many years, and I've repeatedly advocated for removing it from Ceph entirely to avoid confusing users who keep thinking that it should be functional. No company or user has ever had consistent resources available to dedicate to getting the Rook orch into shape and keeping it working.

Sunnatillo · 2025-03-17T09:03:01Z

I'm not a ceph developer, but I'm not sure I would want to have K8s APIs coded into Ceph. Likely, a more flexible architecture for Ceph is to accept a certificate file for the prometheus server instantiation (does this exist already?) and have Rook connect that together in what I would presume to be the mgr pod and/or metrics collectors.

The rook orch is and has been broken for many years, and I've repeatedly advocated for removing it from Ceph entirely to avoid confusing users who keep thinking that it should be functional. No company or user has ever had consistent resources available to dedicate to getting the Rook orch into shape and keeping it working.

@BlaineEXE accepting certificates prometheus server instantiation does not exist. TLS configuration is introduced as a part of making monitoring stack HA here #57535. It was introduced in this PR #58402. But this works only when orch is cephadm.
In this PR we have introduced almost similar way of initializing certificates as in this previous PR #58402.

Slightly disagree on the part of reading certificates from k8s api. Because it makes very easy to maintain certificates,
We will be using this and we are willing maintain this functionality.

BlaineEXE · 2025-03-20T17:27:12Z

Like I said, I'm not a Ceph dev, so I can't (and won't) make this decision for the Ceph project. However, I still don't believe this PR as written today is a good fit for a Ceph contribution.

Even reading a certificate using the K8s API is coding the K8s API into Ceph. It makes a k8s API call, so it counts. Ceph operates without k8s present, and so I would be hesitant to add k8s functionality in. Configuring Ceph in a k8s environment is most appropriate for the Rook project IMO. A more generic orchestrator-independent configuration in Ceph would be a more flexible architecture for all users.

ceph-jenkins · 2025-03-20T17:27:15Z

Can one of the admins verify this patch?

src/pybind/mgr/rook/module.py

src/pybind/mgr/prometheus/module.py

rkachach · 2025-03-21T13:54:47Z

src/pybind/mgr/prometheus/module.py

-        verify_tls_files(self.cert_file.name, self.key_file.name)
-        cert_file_path, key_file_path = self.cert_file.name, self.key_file.name


plz don't remove this check as it's validating that the cert/key pair are valid specially in this case as they are provided by the user

@rkachach
This function verify_tls_files function causing an issue for our case. We are running it on suse based OS and facing following error:
"ImportError: PyO3 modules may only be initialized once per interpreter process".
pyca/bcrypt#694
PyO3/pyo3#3451

As it mentioned above issues, the issue is not fixed, for our case workaround did not work which was done for RHEL based OS.

Can I add separate function for rook case temporarily until thy pyo3 issue fixed or can we remove verify_tls_files function temporarily?

At least for rook case we can remove it temporarily.

Hello, I'm in the talking related to addressing the PyO3 limitations related to single GIL, and the conversation popped up here once mentioned and got my attention.

I want to understand the issue, in order to help as possible and maybe give some insights to contribute to the workaround being done to get around this limitation that is also affecting other projects as well.

Deep diving into the ceph code right now, I found that it is relying on cryptography which is using Rust in newer versions, depending on PyO3 to do the interface, as most projects that need Rust modules running on Python depend on it.

Make sure you do not do importlib.reload(...) or anything that tries to reload cryptography, OpenSSL, or their dependencies. And if possible, only initialize it once. If you're requiring any multithreading, try to use processes and isolate each calling in its own environment. To give some ideas, you can see this workaround to the same problem but in rust and try to adapt for your needs in Python with a similar idea: https://github.com/letalboy/RustPyNet

@Sunnatillo if verify_tls_files is causing issues we can go without it. It's a sanity check to make sure the cert/key are OKay. In cephadm the certs are self-signed so the odds of getting bad certs is really low... however in your case since the cert/key is provided by the user the probability of getting bad values is higher.

Hello, I'm in the talking related to addressing the PyO3 limitations related to single GIL, and the conversation popped up here once mentioned and got my attention.

I want to understand the issue, in order to help as possible and maybe give some insights to contribute to the workaround being done to get around this limitation that is also affecting other projects as well.

Deep diving into the ceph code right now, I found that it is relying on cryptography which is using Rust in newer versions, depending on PyO3 to do the interface, as most projects that need Rust modules running on Python depend on it.

Make sure you do not do importlib.reload(...) or anything that tries to reload cryptography, OpenSSL, or their dependencies. And if possible, only initialize it once. If you're requiring any multithreading, try to use processes and isolate each calling in its own environment. To give some ideas, you can see this workaround to the same problem but in rust and try to adapt for your needs in Python with a similar idea: https://github.com/letalboy/RustPyNet

RHEL/centos have just done WA for it where they remove the warning that prevents the loading. As the library set it mandatory to only run it once, but ceph runs it on multiple processess. Or it do not run it, but it loads it multiple times so the code things it is ran multiple times. For suse based os it fails everytime.

@Sunnatillo if verify_tls_files is causing issues we can go without it. It's a sanity check to make sure the cert/key are OKay. In cephadm the certs are self-signed so the odds of getting bad certs is really low... however in your case since the cert/key is provided by the user the probability of getting bad values is higher.

I will comment this function, we can take in to use after PyO3 issue is fixed.

@Sunnatillo While I'm not fully familiar with every detail of the code, simply suppressing the warning doesn't seem like a proper solution, even knowing it's a workaround. Doing so may leave open sessions and bindings in the interpreter. Depending on what they are doing, they could be exploited. I wanted to highlight this as an important point for further consideration, just for safety if needed.

rkachach

@Sunnatillo

rkachach

@Sunnatillo Overall I think the idea behind the change makes sense as it's trying to mimic in rook Orchestrator what's already implemented in cephadm. Before going with these changes I'll encourage you to explore other alternatives to secure the Prometheus in your k8s env (not sure if creating some dedicated service as front for prometheus metrics!).

rkachach

I left some more comment. In general please, be careful to do not break cephadm support.

src/pybind/mgr/prometheus/module.py

src/pybind/mgr/rook/module.py

So far we have been using some indirect methods for that. Let's be more explicit about the check and use the new API offered by the cephadm for that. Fixes: https://tracker.ceph.com/issues/71599 Signed-off-by: Redouane Kachach <rkachach@ibm.com>

Removing the call to verify_tls_files as in this case case certs files are generated by cephadm internally and we can trust them. In the worst case (bad files) the prometheus module will fail to start. Fixes: https://tracker.ceph.com/issues/71599 Signed-off-by: Redouane Kachach <rkachach@ibm.com>

This commit extends tls to work with Rook orch that has been deployed for cephadm. Certificates are read from rook namespace via kubernetes api client. Name of the secrets are provided by following parameter: prometheus_tls_secret_name (default secret name "rook-ceph-prometheus-server-tls") When cephadm is used it generates silf signed certificates, when rook used it reads certififcates from kubernetes client api in rook_env.namespace. Signed-off-by: Sunnatillo <sunnat.samadov@est.tech>

Sunnatillo · 2025-06-13T10:40:09Z

@rkachach I have rebased PR on top of your changes. Failing tests seem to be not relevant to my changes.
@adk3798 Please tests this PR as well.

Sunnatillo · 2025-06-24T06:59:07Z

@adk3798 @rkachach Could you please push forward this PR? It has been quite long in the waiting

adk3798 · 2025-07-01T13:42:51Z

https://pulpito.ceph.com/adking-2025-06-25_15:21:10-orch:cephadm-wip-adk-testing-2025-06-24-1732-distro-default-smithi/

okay for merging other than unexpected monitoring stack test failure

issues addressed

rkachach

LGTM

ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Signed-off-by: Nizamudeen A <nia@redhat.com>

nizamial09 · 2025-07-08T13:41:37Z

@Sunnatillo we saw a regression with http metric service which failed to start after this PR. Raised a fix here: #64385

ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. adding a basic test to catch these issues Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Signed-off-by: Nizamudeen A <nia@redhat.com>

ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. adding a basic test to catch these issues Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Signed-off-by: Nizamudeen A <nia@redhat.com> (cherry picked from commit 0f2dbbd)

ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. adding a basic test to catch these issues Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Fixes: https://tracker.ceph.com/issues/72012 Signed-off-by: Nizamudeen A <nia@redhat.com>

ljflores · 2025-08-01T21:13:29Z

@Sunnatillo @nizamial09 PTAL at this regression that I believe was caused in this PR: https://tracker.ceph.com/issues/72380

Sunnatillo · 2025-08-04T07:55:14Z

@Sunnatillo @nizamial09 PTAL at this regression that I believe was caused in this PR: https://tracker.ceph.com/issues/72380

I will take a look

ceph/ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. adding a basic test to catch these issues Regression from ceph/ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Fixes: https://tracker.ceph.com/issues/72012 Signed-off-by: Nizamudeen A <nia@redhat.com>

Sunnatillo · 2025-08-14T13:12:17Z

@Sunnatillo @nizamial09 PTAL at this regression that I believe was caused in this PR: https://tracker.ceph.com/issues/72380

regression fix PR is open here:
#65033

Sorry, I could not open Fix PR earlier.

ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. adding a basic test to catch these issues Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Fixes: https://tracker.ceph.com/issues/72012 Signed-off-by: Nizamudeen A <nia@redhat.com>

Sunnatillo requested a review from a team as a code owner January 21, 2025 13:53

github-actions bot added monitoring orchestrator pybind rook labels Jan 21, 2025

travisn requested a review from rkachach January 21, 2025 17:41

Sunnatillo changed the title ~~WIP: Make prometheus tls config work with Rook orch~~ Make prometheus tls config work with Rook orch Jan 27, 2025

Sunnatillo force-pushed the fix-69610-sunnat branch from b90d070 to 3d4932e Compare February 27, 2025 08:19

Sunnatillo changed the title ~~Make prometheus tls config work with Rook orch~~ Make prometheus TLS config work with Rook orchestrator Mar 5, 2025

Sunnatillo force-pushed the fix-69610-sunnat branch from 3d4932e to 3aa7dad Compare March 5, 2025 13:53

Sunnatillo force-pushed the fix-69610-sunnat branch from 3aa7dad to 5f8dd39 Compare March 6, 2025 08:35

Sunnatillo force-pushed the fix-69610-sunnat branch from 5f8dd39 to c5add5d Compare March 6, 2025 12:47

rkachach reviewed Mar 21, 2025

View reviewed changes

src/pybind/mgr/rook/module.py Outdated Show resolved Hide resolved

rkachach reviewed Mar 21, 2025

View reviewed changes

src/pybind/mgr/prometheus/module.py Show resolved Hide resolved

rkachach reviewed Mar 21, 2025

View reviewed changes

rkachach requested changes Mar 21, 2025

View reviewed changes

src/pybind/mgr/prometheus/module.py Outdated Show resolved Hide resolved

src/pybind/mgr/prometheus/module.py Outdated Show resolved Hide resolved

src/pybind/mgr/rook/module.py Outdated Show resolved Hide resolved

Sunnatillo force-pushed the fix-69610-sunnat branch from c5add5d to a1f9675 Compare March 27, 2025 12:23

rkachach added 2 commits June 9, 2025 14:25

Sunnatillo force-pushed the fix-69610-sunnat branch from 888872d to 55651aa Compare June 13, 2025 07:20

Sunnatillo force-pushed the fix-69610-sunnat branch from 55651aa to 64f590c Compare June 13, 2025 08:17

adk3798 removed the wip-adk-testing label Jun 21, 2025

adk3798 added the wip-adk-testing label Jun 24, 2025

rkachach approved these changes Jul 1, 2025

View reviewed changes

adk3798 merged commit 2cbd2cd into ceph:main Jul 1, 2025
12 of 14 checks passed

This was referenced Jul 7, 2025

tentacle: orch: Pass TLS certificates when Rook orchestrator is used #64371

Merged

squid: orch: Pass TLS certificates when Rook orchestrator is used #64373

Open

nizamial09 mentioned this pull request Jul 8, 2025

mgr/prometheus: fix metrics service not coming up #64385

Merged

14 tasks

Sunnatillo mentioned this pull request Aug 14, 2025

mgr/prometheus: Handle empty JSON from orch get-security-config #65033

Merged

14 tasks

Sunnatillo deleted the fix-69610-sunnat branch August 14, 2025 13:12

		verify_tls_files(self.cert_file.name, self.key_file.name)
		cert_file_path, key_file_path = self.cert_file.name, self.key_file.name

Conversation

Sunnatillo commented Jan 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Contribution Guidelines

Checklist

Uh oh!

Sunnatillo commented Mar 5, 2025

Uh oh!

Sunnatillo commented Mar 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

travisn commented Mar 5, 2025

Uh oh!

Sunnatillo commented Mar 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rkachach commented Mar 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Sunnatillo commented Mar 10, 2025

Uh oh!

parth-gr commented Mar 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BlaineEXE commented Mar 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Sunnatillo commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BlaineEXE commented Mar 20, 2025

Uh oh!

ceph-jenkins commented Mar 20, 2025

Uh oh!

Uh oh!

Uh oh!

rkachach Mar 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sunnatillo Mar 27, 2025

Choose a reason for hiding this comment

Uh oh!

Sunnatillo Apr 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sunnatillo Apr 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cristiancmrg Apr 4, 2025

Choose a reason for hiding this comment

Uh oh!

rkachach Apr 8, 2025

Choose a reason for hiding this comment

Uh oh!

Sunnatillo Apr 8, 2025

Choose a reason for hiding this comment

Uh oh!

Sunnatillo Apr 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cristiancmrg Apr 8, 2025

Choose a reason for hiding this comment

Uh oh!

rkachach left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rkachach left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rkachach left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Sunnatillo commented Jan 21, 2025 •

edited

Loading

Sunnatillo commented Mar 5, 2025 •

edited

Loading

Sunnatillo commented Mar 6, 2025 •

edited

Loading

rkachach commented Mar 7, 2025 •

edited

Loading

parth-gr commented Mar 12, 2025 •

edited

Loading

BlaineEXE commented Mar 12, 2025 •

edited

Loading

Sunnatillo commented Mar 17, 2025 •

edited

Loading

rkachach Mar 21, 2025 •

edited

Loading

Sunnatillo Apr 4, 2025 •

edited

Loading

Sunnatillo Apr 4, 2025 •

edited

Loading

Sunnatillo Apr 8, 2025 •

edited

Loading

rkachach left a comment •

edited

Loading

rkachach left a comment •

edited

Loading