Skip to content

Make prometheus TLS config work with Rook orchestrator#61468

Merged
adk3798 merged 3 commits intoceph:mainfrom
Nordix:fix-69610-sunnat
Jul 1, 2025
Merged

Make prometheus TLS config work with Rook orchestrator#61468
adk3798 merged 3 commits intoceph:mainfrom
Nordix:fix-69610-sunnat

Conversation

@Sunnatillo
Copy link
Contributor

@Sunnatillo Sunnatillo commented Jan 21, 2025

This commit is introduced to generate self signed certificated for Prometheus and dashboard. #58402
In above mentioned commit tls for prometheus works only with cephadm orchestrator. And it also generates selfsigned certificate with cephadm-root CA that is genereated every time function is called.

However when using rook orch certificates are not generated. This PR implements getting certificates for rook orch.

Solution is following:

Certificates are read from rook namespace via kubernetes api client.
Name of the secrets are provided by following parameters:

  • prometheus_tls_secret_name (default secret name "rook-ceph-prometheus-server-tls")
  • dashboard_tls_secret_name (default secret name "rook-ceph-dashboard-server-tls")

In simple words when cephadm is used it generates silf signed certificates, when rook used it reads certififcates from kubernetes client api in rook_env.namespace.

Tracker issue: https://tracker.ceph.com/issues/69610

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@Sunnatillo Sunnatillo requested a review from a team as a code owner January 21, 2025 13:53
@travisn travisn requested a review from rkachach January 21, 2025 17:41
@Sunnatillo Sunnatillo changed the title WIP: Make prometheus tls config work with Rook orch Make prometheus tls config work with Rook orch Jan 27, 2025
@Sunnatillo
Copy link
Contributor Author

jenkins retest this please

@Sunnatillo Sunnatillo changed the title Make prometheus tls config work with Rook orch Make prometheus TLS config work with Rook orchestrator Mar 5, 2025
@Sunnatillo
Copy link
Contributor Author

Sunnatillo commented Mar 5, 2025

@travisn I have updated the PR description a bit. What do you think of the solution?

@rkachach PTAL when you have time

@travisn
Copy link
Member

travisn commented Mar 5, 2025

@Sunnatillo Help me understand a few things:

  1. Rook users need to create these secrets for dashboard and prometheus? Are they optional? It looks like errors are raised if they are not found.
  2. Have you tested the changes with Rook?
  3. We would need documentation updates in Rook, right? For example, to the dashboard and prometheus docs?
  4. How can we ensure this feature works with Rook in the long term? We don't have good testing integration for the rook mgr module.
  5. Could this change instead be implemented in Rook? Then we have better control over the cert generation and we can test in the Rook CI.

I haven't reviewed deeply if this change is the right approach or not, my questions are really based on our history that maintaining the rook mgr module has been a problem due to testing end to end with Rook.

@Sunnatillo
Copy link
Contributor Author

Sunnatillo commented Mar 6, 2025

@Sunnatillo Help me understand a few things:

  1. Rook users need to create these secrets for dashboard and prometheus? Are they optional? It looks like errors are raised if they are not found.
  2. Have you tested the changes with Rook?
  3. We would need documentation updates in Rook, right? For example, to the dashboard and prometheus docs?
  4. How can we ensure this feature works with Rook in the long term? We don't have good testing integration for the rook mgr module.
  5. Could this change instead be implemented in Rook? Then we have better control over the cert generation and we can test in the Rook CI.

I haven't reviewed deeply if this change is the right approach or not, my questions are really based on our history that maintaining the rook mgr module has been a problem due to testing end to end with Rook.

Thank you for your questions:

  1. yes, Rook users need to create these secrets for dashboard and prometheus. Yes, they are optional. For me if user sets metrics_tls_enabled to true errors should be thrown. Even it fails it will go back to http connection.
 if self.get_module_option('metrics_tls_enabled'):
            try:
                self.setup_tls_config(server_addr, server_port)
                return
            except Exception as e:
                self.log.exception(f'Failed to setup orchestrator based secure monitoring stack: {e}\n',
                                    'Falling back to default configuration')
        # In any error fallback to plain http mode
        self.setup_default_config(server_addr, server_port)

  1. yes, we have been testing it heavily for last one month. And it did not show any issues.
  2. yes, we need to make doc updates to dashboard and prometheus docs.
  3. I can't answer this question due to lack of experience in here.
  4. Another option would be mount certificates to mgr to defined location, and we can think through the way of naming and location of of the certificates etc.

The reason we implemented this ways is to have the same as cephadm orch. It uses the same "orch certmgr generate-certificates" command and it will work for rook with current implementation.

@rkachach
Copy link
Contributor

rkachach commented Mar 7, 2025

@Sunnatillo thank you for working on this solution. Before reviewing the PR in deep and in order to try giving an informed opinion I'd like to understand how is your monitoring stack deployed:

  1. Are you consuming Ceph monitoring stack?
  2. If not, how are you deploying your own monitoring stack?

.. from the changes it seems to me that in fact what you need is to basically enable SSL/TLS for the Prometheus metrics end-point. The PR #58402 you linked is introducing mTLS because it's needed for a better security as par of the PR #57535. I had to implement that in cephadm because we don't have a platform as in k8s ... but in rook probably you can get a similar/equivalent functionality to the cephadm mgmt-gatway out-of-the-box by using k8s ingress...

@Sunnatillo
Copy link
Contributor Author

@Sunnatillo thank you for working on this solution. Before reviewing the PR in deep and in order to try giving an informed opinion I'd like to understand how is your monitoring stack deployed:

  1. Are you consuming Ceph monitoring stack?
  2. If not, how are you deploying your own monitoring stack?

.. from the changes it seems to me that in fact what you need is to basically enable SSL/TLS for the Prometheus metrics end-point. The PR #58402 you linked is introducing mTLS because it's needed for a better security as par of the PR #57535. I had to implement that in cephadm because we don't have a platform as in k8s ... but in rook probably you can get a similar/equivalent functionality to the cephadm mgmt-gatway out-of-the-box by using k8s ingress...

Thank you for your comment @rkachach.

  1. No we are not consuming ceph monitoring stack.
  2. We have our our monitoring stack with Prometheus on k8s. We want to have TLS on Prometheus endpoint for internal communication inside cluster. Yes we can get similar functionality with ingress, but we still need to get certificates from somewhere or generate it for internal/external communication. As it is better to manage certificates with k8s, in this PR we try to get certificates from k8s api, instead of self generating like in cephadm case.

I would say, it is better to get certificates from k8s api since we can manage certificates easily in k8s cluster, for example with cert-manager.

@parth-gr
Copy link
Contributor

parth-gr commented Mar 12, 2025

lets use what rgw tls uses, i think rgw should be using server certificate provided by k8s in internal mode, cc @thotz @BlaineEXE

@BlaineEXE
Copy link
Collaborator

BlaineEXE commented Mar 12, 2025

I'm not a ceph developer, but I'm not sure I would want to have K8s APIs coded into Ceph. Likely, a more flexible architecture for Ceph is to accept a certificate file for the prometheus server instantiation (does this exist already?) and have Rook connect that together in what I would presume to be the mgr pod and/or metrics collectors.

The rook orch is and has been broken for many years, and I've repeatedly advocated for removing it from Ceph entirely to avoid confusing users who keep thinking that it should be functional. No company or user has ever had consistent resources available to dedicate to getting the Rook orch into shape and keeping it working.

@Sunnatillo
Copy link
Contributor Author

Sunnatillo commented Mar 17, 2025

I'm not a ceph developer, but I'm not sure I would want to have K8s APIs coded into Ceph. Likely, a more flexible architecture for Ceph is to accept a certificate file for the prometheus server instantiation (does this exist already?) and have Rook connect that together in what I would presume to be the mgr pod and/or metrics collectors.

The rook orch is and has been broken for many years, and I've repeatedly advocated for removing it from Ceph entirely to avoid confusing users who keep thinking that it should be functional. No company or user has ever had consistent resources available to dedicate to getting the Rook orch into shape and keeping it working.

@BlaineEXE accepting certificates prometheus server instantiation does not exist. TLS configuration is introduced as a part of making monitoring stack HA here #57535. It was introduced in this PR #58402. But this works only when orch is cephadm.
In this PR we have introduced almost similar way of initializing certificates as in this previous PR #58402.

Slightly disagree on the part of reading certificates from k8s api. Because it makes very easy to maintain certificates,
We will be using this and we are willing maintain this functionality.

@BlaineEXE
Copy link
Collaborator

Like I said, I'm not a Ceph dev, so I can't (and won't) make this decision for the Ceph project. However, I still don't believe this PR as written today is a good fit for a Ceph contribution.

Even reading a certificate using the K8s API is coding the K8s API into Ceph. It makes a k8s API call, so it counts. Ceph operates without k8s present, and so I would be hesitant to add k8s functionality in. Configuring Ceph in a k8s environment is most appropriate for the Rook project IMO. A more generic orchestrator-independent configuration in Ceph would be a more flexible architecture for all users.

@ceph-jenkins
Copy link
Collaborator

Can one of the admins verify this patch?

Comment on lines -1805 to -1818
verify_tls_files(self.cert_file.name, self.key_file.name)
cert_file_path, key_file_path = self.cert_file.name, self.key_file.name
Copy link
Contributor

@rkachach rkachach Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz don't remove this check as it's validating that the cert/key pair are valid specially in this case as they are provided by the user

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Contributor Author

@Sunnatillo Sunnatillo Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rkachach
This function verify_tls_files function causing an issue for our case. We are running it on suse based OS and facing following error:
"ImportError: PyO3 modules may only be initialized once per interpreter process".
pyca/bcrypt#694
PyO3/pyo3#3451

As it mentioned above issues, the issue is not fixed, for our case workaround did not work which was done for RHEL based OS.

Can I add separate function for rook case temporarily until thy pyo3 issue fixed or can we remove verify_tls_files function temporarily?

Copy link
Contributor Author

@Sunnatillo Sunnatillo Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least for rook case we can remove it temporarily.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, I'm in the talking related to addressing the PyO3 limitations related to single GIL, and the conversation popped up here once mentioned and got my attention.

I want to understand the issue, in order to help as possible and maybe give some insights to contribute to the workaround being done to get around this limitation that is also affecting other projects as well.

Deep diving into the ceph code right now, I found that it is relying on cryptography which is using Rust in newer versions, depending on PyO3 to do the interface, as most projects that need Rust modules running on Python depend on it.

Make sure you do not do importlib.reload(...) or anything that tries to reload cryptography, OpenSSL, or their dependencies. And if possible, only initialize it once. If you're requiring any multithreading, try to use processes and isolate each calling in its own environment. To give some ideas, you can see this workaround to the same problem but in rust and try to adapt for your needs in Python with a similar idea: https://github.com/letalboy/RustPyNet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sunnatillo if verify_tls_files is causing issues we can go without it. It's a sanity check to make sure the cert/key are OKay. In cephadm the certs are self-signed so the odds of getting bad certs is really low... however in your case since the cert/key is provided by the user the probability of getting bad values is higher.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, I'm in the talking related to addressing the PyO3 limitations related to single GIL, and the conversation popped up here once mentioned and got my attention.

I want to understand the issue, in order to help as possible and maybe give some insights to contribute to the workaround being done to get around this limitation that is also affecting other projects as well.

Deep diving into the ceph code right now, I found that it is relying on cryptography which is using Rust in newer versions, depending on PyO3 to do the interface, as most projects that need Rust modules running on Python depend on it.

Make sure you do not do importlib.reload(...) or anything that tries to reload cryptography, OpenSSL, or their dependencies. And if possible, only initialize it once. If you're requiring any multithreading, try to use processes and isolate each calling in its own environment. To give some ideas, you can see this workaround to the same problem but in rust and try to adapt for your needs in Python with a similar idea: https://github.com/letalboy/RustPyNet

RHEL/centos have just done WA for it where they remove the warning that prevents the loading. As the library set it mandatory to only run it once, but ceph runs it on multiple processess. Or it do not run it, but it loads it multiple times so the code things it is ran multiple times. For suse based os it fails everytime.

Copy link
Contributor Author

@Sunnatillo Sunnatillo Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sunnatillo if verify_tls_files is causing issues we can go without it. It's a sanity check to make sure the cert/key are OKay. In cephadm the certs are self-signed so the odds of getting bad certs is really low... however in your case since the cert/key is provided by the user the probability of getting bad values is higher.

I will comment this function, we can take in to use after PyO3 issue is fixed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sunnatillo While I'm not fully familiar with every detail of the code, simply suppressing the warning doesn't seem like a proper solution, even knowing it's a workaround. Doing so may leave open sessions and bindings in the interpreter. Depending on what they are doing, they could be exploited. I wanted to highlight this as an important point for further consideration, just for safety if needed.

Copy link
Contributor

@rkachach rkachach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@rkachach rkachach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sunnatillo Overall I think the idea behind the change makes sense as it's trying to mimic in rook Orchestrator what's already implemented in cephadm. Before going with these changes I'll encourage you to explore other alternatives to secure the Prometheus in your k8s env (not sure if creating some dedicated service as front for prometheus metrics!).

Copy link
Contributor

@rkachach rkachach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some more comment. In general please, be careful to do not break cephadm support.

rkachach added 2 commits June 9, 2025 14:25
So far we have been using some indirect methods for that. Let's be
more explicit about the check and use the new API offered by the
cephadm for that.

Fixes: https://tracker.ceph.com/issues/71599

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
Removing the call to verify_tls_files as in this case case certs files
are generated by cephadm internally and we can trust them. In the
worst case (bad files) the prometheus module will fail to start.

Fixes: https://tracker.ceph.com/issues/71599

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
This commit extends tls to work with Rook orch that has been deployed for cephadm.

Certificates are read from rook namespace via kubernetes api client.
Name of the secrets are provided by following parameter:

prometheus_tls_secret_name (default secret name "rook-ceph-prometheus-server-tls")

When cephadm is used it generates silf signed certificates, when
rook used it reads certififcates from kubernetes client api in rook_env.namespace.

Signed-off-by: Sunnatillo <sunnat.samadov@est.tech>
@Sunnatillo
Copy link
Contributor Author

@rkachach I have rebased PR on top of your changes. Failing tests seem to be not relevant to my changes.
@adk3798 Please tests this PR as well.

@Sunnatillo
Copy link
Contributor Author

@adk3798 @rkachach Could you please push forward this PR? It has been quite long in the waiting

@adk3798
Copy link
Contributor

adk3798 commented Jul 1, 2025

https://pulpito.ceph.com/adking-2025-06-25_15:21:10-orch:cephadm-wip-adk-testing-2025-06-24-1732-distro-default-smithi/

okay for merging other than unexpected monitoring stack test failure

@adk3798 adk3798 dismissed their stale review July 1, 2025 13:55

issues addressed

Copy link
Contributor

@rkachach rkachach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@adk3798 adk3798 merged commit 2cbd2cd into ceph:main Jul 1, 2025
12 of 14 checks passed
nizamial09 added a commit to rhcs-dashboard/ceph that referenced this pull request Jul 8, 2025
ceph#61468 unintentionally broke the http
metric service while it removed the code that starts the metrics. adding
them back up.

Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f

Signed-off-by: Nizamudeen A <nia@redhat.com>
@nizamial09
Copy link
Member

@Sunnatillo we saw a regression with http metric service which failed to start after this PR. Raised a fix here: #64385

nizamial09 added a commit to rhcs-dashboard/ceph that referenced this pull request Jul 8, 2025
ceph#61468 unintentionally broke the http
metric service while it removed the code that starts the metrics. adding
them back up.

adding a basic test to catch these issues

Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f

Signed-off-by: Nizamudeen A <nia@redhat.com>
Sunnatillo pushed a commit to Nordix/ceph that referenced this pull request Jul 10, 2025
ceph#61468 unintentionally broke the http
metric service while it removed the code that starts the metrics. adding
them back up.

adding a basic test to catch these issues

Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f

Signed-off-by: Nizamudeen A <nia@redhat.com>
(cherry picked from commit 0f2dbbd)
Sunnatillo pushed a commit to Nordix/ceph that referenced this pull request Jul 10, 2025
ceph#61468 unintentionally broke the http
metric service while it removed the code that starts the metrics. adding
them back up.

adding a basic test to catch these issues

Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f

Signed-off-by: Nizamudeen A <nia@redhat.com>
(cherry picked from commit 0f2dbbd)
nizamial09 added a commit to rhcs-dashboard/ceph that referenced this pull request Jul 10, 2025
ceph#61468 unintentionally broke the http
metric service while it removed the code that starts the metrics. adding
them back up.

adding a basic test to catch these issues

Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f

Fixes: https://tracker.ceph.com/issues/72012
Signed-off-by: Nizamudeen A <nia@redhat.com>
avanthakkar pushed a commit to avanthakkar/ceph that referenced this pull request Jul 14, 2025
ceph#61468 unintentionally broke the http
metric service while it removed the code that starts the metrics. adding
them back up.

adding a basic test to catch these issues

Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f

Fixes: https://tracker.ceph.com/issues/72012
Signed-off-by: Nizamudeen A <nia@redhat.com>
@ljflores
Copy link
Member

ljflores commented Aug 1, 2025

@Sunnatillo @nizamial09 PTAL at this regression that I believe was caused in this PR: https://tracker.ceph.com/issues/72380

@Sunnatillo
Copy link
Contributor Author

@Sunnatillo @nizamial09 PTAL at this regression that I believe was caused in this PR: https://tracker.ceph.com/issues/72380

I will take a look

zmc pushed a commit to ceph/ceph-ci that referenced this pull request Aug 5, 2025
ceph/ceph#61468 unintentionally broke the http
metric service while it removed the code that starts the metrics. adding
them back up.

adding a basic test to catch these issues

Regression from ceph/ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f

Fixes: https://tracker.ceph.com/issues/72012
Signed-off-by: Nizamudeen A <nia@redhat.com>
@Sunnatillo
Copy link
Contributor Author

@Sunnatillo @nizamial09 PTAL at this regression that I believe was caused in this PR: https://tracker.ceph.com/issues/72380

regression fix PR is open here:
#65033

Sorry, I could not open Fix PR earlier.

@Sunnatillo Sunnatillo deleted the fix-69610-sunnat branch August 14, 2025 13:12
abitdrag pushed a commit to abitdrag/ceph that referenced this pull request Oct 21, 2025
ceph#61468 unintentionally broke the http
metric service while it removed the code that starts the metrics. adding
them back up.

adding a basic test to catch these issues

Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f

Fixes: https://tracker.ceph.com/issues/72012
Signed-off-by: Nizamudeen A <nia@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants