Make prometheus TLS config work with Rook orchestrator#61468
Make prometheus TLS config work with Rook orchestrator#61468
Conversation
b90d070 to
3d4932e
Compare
|
jenkins retest this please |
3d4932e to
3aa7dad
Compare
|
@Sunnatillo Help me understand a few things:
I haven't reviewed deeply if this change is the right approach or not, my questions are really based on our history that maintaining the rook mgr module has been a problem due to testing end to end with Rook. |
3aa7dad to
5f8dd39
Compare
Thank you for your questions:
The reason we implemented this ways is to have the same as cephadm orch. It uses the same "orch certmgr generate-certificates" command and it will work for rook with current implementation. |
5f8dd39 to
c5add5d
Compare
|
@Sunnatillo thank you for working on this solution. Before reviewing the PR in deep and in order to try giving an informed opinion I'd like to understand how is your monitoring stack deployed:
.. from the changes it seems to me that in fact what you need is to basically enable SSL/TLS for the Prometheus metrics end-point. The PR #58402 you linked is introducing |
Thank you for your comment @rkachach.
I would say, it is better to get certificates from k8s api since we can manage certificates easily in k8s cluster, for example with |
|
lets use what rgw tls uses, i think rgw should be using server certificate provided by k8s in internal mode, cc @thotz @BlaineEXE |
|
I'm not a ceph developer, but I'm not sure I would want to have K8s APIs coded into Ceph. Likely, a more flexible architecture for Ceph is to accept a certificate file for the prometheus server instantiation (does this exist already?) and have Rook connect that together in what I would presume to be the mgr pod and/or metrics collectors. The rook orch is and has been broken for many years, and I've repeatedly advocated for removing it from Ceph entirely to avoid confusing users who keep thinking that it should be functional. No company or user has ever had consistent resources available to dedicate to getting the Rook orch into shape and keeping it working. |
@BlaineEXE accepting certificates prometheus server instantiation does not exist. TLS configuration is introduced as a part of making monitoring stack HA here #57535. It was introduced in this PR #58402. But this works only when orch is cephadm. Slightly disagree on the part of reading certificates from k8s api. Because it makes very easy to maintain certificates, |
|
Like I said, I'm not a Ceph dev, so I can't (and won't) make this decision for the Ceph project. However, I still don't believe this PR as written today is a good fit for a Ceph contribution. Even reading a certificate using the K8s API is coding the K8s API into Ceph. It makes a k8s API call, so it counts. Ceph operates without k8s present, and so I would be hesitant to add k8s functionality in. Configuring Ceph in a k8s environment is most appropriate for the Rook project IMO. A more generic orchestrator-independent configuration in Ceph would be a more flexible architecture for all users. |
|
Can one of the admins verify this patch? |
| verify_tls_files(self.cert_file.name, self.key_file.name) | ||
| cert_file_path, key_file_path = self.cert_file.name, self.key_file.name |
There was a problem hiding this comment.
plz don't remove this check as it's validating that the cert/key pair are valid specially in this case as they are provided by the user
There was a problem hiding this comment.
@rkachach
This function verify_tls_files function causing an issue for our case. We are running it on suse based OS and facing following error:
"ImportError: PyO3 modules may only be initialized once per interpreter process".
pyca/bcrypt#694
PyO3/pyo3#3451
As it mentioned above issues, the issue is not fixed, for our case workaround did not work which was done for RHEL based OS.
Can I add separate function for rook case temporarily until thy pyo3 issue fixed or can we remove verify_tls_files function temporarily?
There was a problem hiding this comment.
At least for rook case we can remove it temporarily.
There was a problem hiding this comment.
Hello, I'm in the talking related to addressing the PyO3 limitations related to single GIL, and the conversation popped up here once mentioned and got my attention.
I want to understand the issue, in order to help as possible and maybe give some insights to contribute to the workaround being done to get around this limitation that is also affecting other projects as well.
Deep diving into the ceph code right now, I found that it is relying on cryptography which is using Rust in newer versions, depending on PyO3 to do the interface, as most projects that need Rust modules running on Python depend on it.
Make sure you do not do importlib.reload(...) or anything that tries to reload cryptography, OpenSSL, or their dependencies. And if possible, only initialize it once. If you're requiring any multithreading, try to use processes and isolate each calling in its own environment. To give some ideas, you can see this workaround to the same problem but in rust and try to adapt for your needs in Python with a similar idea: https://github.com/letalboy/RustPyNet
There was a problem hiding this comment.
@Sunnatillo if verify_tls_files is causing issues we can go without it. It's a sanity check to make sure the cert/key are OKay. In cephadm the certs are self-signed so the odds of getting bad certs is really low... however in your case since the cert/key is provided by the user the probability of getting bad values is higher.
There was a problem hiding this comment.
Hello, I'm in the talking related to addressing the PyO3 limitations related to single GIL, and the conversation popped up here once mentioned and got my attention.
I want to understand the issue, in order to help as possible and maybe give some insights to contribute to the workaround being done to get around this limitation that is also affecting other projects as well.
Deep diving into the ceph code right now, I found that it is relying on
cryptographywhich is using Rust in newer versions, depending on PyO3 to do the interface, as most projects that need Rust modules running on Python depend on it.Make sure you do not do
importlib.reload(...)or anything that tries to reloadcryptography,OpenSSL, or their dependencies. And if possible, only initialize it once. If you're requiring any multithreading, try to use processes and isolate each calling in its own environment. To give some ideas, you can see this workaround to the same problem but in rust and try to adapt for your needs in Python with a similar idea: https://github.com/letalboy/RustPyNet
RHEL/centos have just done WA for it where they remove the warning that prevents the loading. As the library set it mandatory to only run it once, but ceph runs it on multiple processess. Or it do not run it, but it loads it multiple times so the code things it is ran multiple times. For suse based os it fails everytime.
There was a problem hiding this comment.
@Sunnatillo if
verify_tls_filesis causing issues we can go without it. It's a sanity check to make sure the cert/key are OKay. In cephadm the certs are self-signed so the odds of getting bad certs is really low... however in your case since the cert/key is provided by the user the probability of getting bad values is higher.
I will comment this function, we can take in to use after PyO3 issue is fixed.
There was a problem hiding this comment.
@Sunnatillo While I'm not fully familiar with every detail of the code, simply suppressing the warning doesn't seem like a proper solution, even knowing it's a workaround. Doing so may leave open sessions and bindings in the interpreter. Depending on what they are doing, they could be exploited. I wanted to highlight this as an important point for further consideration, just for safety if needed.
There was a problem hiding this comment.
@Sunnatillo Overall I think the idea behind the change makes sense as it's trying to mimic in rook Orchestrator what's already implemented in cephadm. Before going with these changes I'll encourage you to explore other alternatives to secure the Prometheus in your k8s env (not sure if creating some dedicated service as front for prometheus metrics!).
rkachach
left a comment
There was a problem hiding this comment.
I left some more comment. In general please, be careful to do not break cephadm support.
c5add5d to
a1f9675
Compare
So far we have been using some indirect methods for that. Let's be more explicit about the check and use the new API offered by the cephadm for that. Fixes: https://tracker.ceph.com/issues/71599 Signed-off-by: Redouane Kachach <rkachach@ibm.com>
Removing the call to verify_tls_files as in this case case certs files are generated by cephadm internally and we can trust them. In the worst case (bad files) the prometheus module will fail to start. Fixes: https://tracker.ceph.com/issues/71599 Signed-off-by: Redouane Kachach <rkachach@ibm.com>
888872d to
55651aa
Compare
This commit extends tls to work with Rook orch that has been deployed for cephadm. Certificates are read from rook namespace via kubernetes api client. Name of the secrets are provided by following parameter: prometheus_tls_secret_name (default secret name "rook-ceph-prometheus-server-tls") When cephadm is used it generates silf signed certificates, when rook used it reads certififcates from kubernetes client api in rook_env.namespace. Signed-off-by: Sunnatillo <sunnat.samadov@est.tech>
55651aa to
64f590c
Compare
|
okay for merging other than unexpected monitoring stack test failure |
ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Signed-off-by: Nizamudeen A <nia@redhat.com>
|
@Sunnatillo we saw a regression with http metric service which failed to start after this PR. Raised a fix here: #64385 |
ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. adding a basic test to catch these issues Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Signed-off-by: Nizamudeen A <nia@redhat.com>
ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. adding a basic test to catch these issues Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Signed-off-by: Nizamudeen A <nia@redhat.com> (cherry picked from commit 0f2dbbd)
ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. adding a basic test to catch these issues Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Signed-off-by: Nizamudeen A <nia@redhat.com> (cherry picked from commit 0f2dbbd)
ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. adding a basic test to catch these issues Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Fixes: https://tracker.ceph.com/issues/72012 Signed-off-by: Nizamudeen A <nia@redhat.com>
ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. adding a basic test to catch these issues Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Fixes: https://tracker.ceph.com/issues/72012 Signed-off-by: Nizamudeen A <nia@redhat.com>
|
@Sunnatillo @nizamial09 PTAL at this regression that I believe was caused in this PR: https://tracker.ceph.com/issues/72380 |
I will take a look |
ceph/ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. adding a basic test to catch these issues Regression from ceph/ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Fixes: https://tracker.ceph.com/issues/72012 Signed-off-by: Nizamudeen A <nia@redhat.com>
regression fix PR is open here: Sorry, I could not open Fix PR earlier. |
ceph#61468 unintentionally broke the http metric service while it removed the code that starts the metrics. adding them back up. adding a basic test to catch these issues Regression from ceph@64f590c#diff-031e09c4297d84a407cf55f8981d38764efc3c37e9827e12e638521f69284e1f Fixes: https://tracker.ceph.com/issues/72012 Signed-off-by: Nizamudeen A <nia@redhat.com>
This commit is introduced to generate self signed certificated for Prometheus and dashboard. #58402
In above mentioned commit tls for prometheus works only with cephadm orchestrator. And it also generates selfsigned certificate with cephadm-root CA that is genereated every time function is called.
However when using rook orch certificates are not generated. This PR implements getting certificates for rook orch.
Solution is following:
Certificates are read from rook namespace via kubernetes api client.
Name of the secrets are provided by following parameters:
In simple words when cephadm is used it generates silf signed certificates, when rook used it reads certififcates from kubernetes client api in rook_env.namespace.
Tracker issue: https://tracker.ceph.com/issues/69610
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windowsjenkins test rook e2e