Adding support for ceph mgmt-gateway by rkachach · Pull Request #57535 · ceph/ceph

rkachach · 2024-05-17T13:20:52Z

TODO

~~review default SSL/TLS configuration~~
unify cephadm root CA management
(addressed by mgr/cephadm: adding mTLS for ceph mgmt-gateway and backend services communication #58402)
fix prometheus/alertmanager access when secure_monitoring_stack is enabled
(addressed by mgr/cephadm: adding mTLS for ceph mgmt-gateway and backend services communication #58402)
~~fix the nginx image~~ (now image from quay is used)

This pull request introduces a new design for Ceph applications based on a modular, service-based architecture. It adds a new cephadm service, mgmt-gateway, based on nginx, an open-source, high-performance web server known for its scalability, efficiency, and versatility. The service will act as the new front-end and single entry point to the cluster, providing unified access to all Ceph applications, including the Dashboard and monitoring applications. In addition, nginx enhances security and simplifies access management due to its robust community and high security standards.

Benefits of the new service

Unified access: Centralizing access through nginx improves security and provides a single entry point to cluster management.
Improved user experience: Users no longer need to care where each application is running (ip/host).
High availability for dashboard: nginx HA mechanisms are used to provide high availability for the ceph dashboard.
High availability for monitoring: nginx HA mechanisms are used to provide high availability for monitoring.

High availability enhancements

The current cephadm/dashboard implementation lacks HA when it comes to monitoring services. Although cephadm can deploy N instances of services such as grafana, prometheus or alertmanager, when configuring the dashboard (using the dashboard set-<service>-api-host API) it just picks the last configured daemon. If this daemon goes down, there is no automated fail-over to a redundant healthy instance. The following diagram reflects the current architecture (notice that the dashboard is configured to access the different monitoring services directly).

This problem is solved by using the upstream HA features provided by nginx. The proposed solution makes use of a dedicated internal server that acts as a reverse proxy for the monitoring services. The dashboard is configured to use nginx end-points instead of directly using the ip/host of the monitoring daemons. Following is a diagram of the new architecture:

As we can see in the above diagram, in the new architecture there are two servers:

External server: this server is responsible for handling and routing external user requests. The idea is to also use this server for any extra processing we would like to perform for external users, such as authentication, authorization, etc. This server relies on the nginx upstream feature to group the monitoring applications (by category). The HA mechanism is implemented by selecting one of the available healthy servers.
Internal server: this server is responsible for handling and routing internal requests only. As in the external case, this server relies on the nginx upstream feature to provide monitoring HA, this time for internal services. This server uses its own self-signed certificates to secure the communication with other internal clients.
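To make the upstream grouping concrete, here is a minimal sketch of how a group of equivalent backends could be rendered into an nginx `upstream` block. It is not the template code from this PR: the helper name `render_upstream`, the host names, the ports, and the `max_fails`/`fail_timeout` values are assumptions chosen for illustration. Only the upstream name `grafana_servers` comes from the generated config shown further down.

```python
from typing import List


def render_upstream(name: str, endpoints: List[str],
                    max_fails: int = 3, fail_timeout: str = "10s") -> str:
    """Render an nginx `upstream` block for a group of equivalent backends.

    `endpoints` are "host:port" strings. `max_fails` and `fail_timeout` are
    nginx's passive health-check parameters; the defaults are illustrative.
    """
    if not endpoints:
        raise ValueError(f"upstream {name!r} needs at least one endpoint")
    lines = [f"upstream {name} {{"]
    for endpoint in endpoints:
        # One `server` line per daemon instance; nginx temporarily skips a
        # server after `max_fails` failed attempts within `fail_timeout`.
        lines.append(
            f"    server {endpoint} max_fails={max_fails} fail_timeout={fail_timeout};"
        )
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical hosts and port, for illustration only.
    print(render_upstream("grafana_servers",
                          ["ceph-node-0:3000", "ceph-node-1:3000"]))
```

With a block like this in place, a `proxy_pass https://grafana_servers;` directive keeps working as long as at least one listed instance is reachable, which is the fail-over behaviour the old "last configured daemon" approach lacked.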

Usage

cephadm:
ceph orch apply mgmt-gateway --placement=<your-destination-node>

Or by providing a detailed spec file (e.g. for custom certificates):

service_type: mgmt-gateway
placement:
  hosts:
    - ceph-node-1
spec:
 port: 9443
 ssl_protocols:
   - TLSv1.2
   - TLSv1.3
 ssl_ciphers:
   - AES128-SHA
   - AES256-SHA
   - RC4-SHA
 ssl_certificate: |
   -----BEGIN CERTIFICATE-----
   < YOUR CERT DATA HERE >
   -----END CERTIFICATE-----
 ssl_certificate_key: |
   -----BEGIN RSA PRIVATE KEY-----
   < YOUR PRIV KEY DATA HERE >
   -----END RSA PRIVATE KEY-----
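As a rough sketch of how the list-valued TLS fields in this spec relate to the generated nginx configuration: nginx expects `ssl_protocols` as a space-separated list and `ssl_ciphers` as a colon-separated list. The helper below is hypothetical (it is not the cephadm implementation) and assumes the spec has already been parsed into a plain dictionary.

```python
from typing import Dict, List


def render_tls_directives(spec: Dict[str, List[str]]) -> List[str]:
    """Turn list-valued TLS spec fields into nginx directive lines.

    Fields that are missing or empty are skipped, so the caller can fall
    back to whatever defaults the deployment uses.
    """
    directives = []
    protocols = spec.get("ssl_protocols")
    if protocols:
        # nginx syntax: ssl_protocols TLSv1.2 TLSv1.3;
        directives.append("ssl_protocols " + " ".join(protocols) + ";")
    ciphers = spec.get("ssl_ciphers")
    if ciphers:
        # nginx syntax: ssl_ciphers CIPHER1:CIPHER2;
        directives.append("ssl_ciphers " + ":".join(ciphers) + ";")
    return directives


if __name__ == "__main__":
    example = {
        "ssl_protocols": ["TLSv1.2", "TLSv1.3"],
        "ssl_ciphers": ["AES128-SHA", "AES256-SHA"],
    }
    for line in render_tls_directives(example):
        print(line)
```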

Example of the generated nginx config:



[root@ceph-node-0 ~]# cat /var/lib/ceph/bd861cc8-28ae-11ef-92e9-525400cfad85/mgmt-gateway.ceph-node-0/etc/nginx_external_server.conf 

server {
    listen                    443 ssl;
    listen                    [::]:443 ssl;
    ssl_certificate            /etc/nginx/ssl/nginx.crt;
    ssl_certificate_key /etc/nginx/ssl/nginx.key;
    ssl_protocols            TLSv1.2 TLSv1.3;
    # from:  https://ssl-config.mozilla.org/#server=nginx
    ssl_ciphers              ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-CHACHA20-POLY1305;

    # Only return Nginx in server header, no extra info will be provided
    server_tokens             off;

    # Perfect Forward Secrecy(PFS) is frequently compromised without this
    ssl_prefer_server_ciphers on;

    # Enable SSL session caching for improved performance
    ssl_session_tickets       off;
    ssl_session_timeout       1d;
    ssl_session_cache         shared:SSL:10m;

    # OCSP stapling
    ssl_stapling              on;
    ssl_stapling_verify       on;
    resolver_timeout 5s;

    # Security headers
    ## X-Content-Type-Options: avoid MIME type sniffing
    add_header X-Content-Type-Options nosniff;
    ## Strict Transport Security (HSTS): Yes
    add_header Strict-Transport-Security "max-age=31536000; includeSubdomains; preload";
    ## Enables the Cross-site scripting (XSS) filter in browsers.
    add_header X-XSS-Protection "1; mode=block";
    ## Content-Security-Policy (CSP): FIXME
    # add_header Content-Security-Policy "default-src 'self'; script-src 'self'; object-src 'none'; base-uri 'none'; require-trusted-types-for 'script'; frame-ancestors 'self';";



    location / {
        proxy_pass https://dashboard_servers;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
    }

    location /grafana {
        rewrite ^/grafana/(.*) /$1 break;
        proxy_pass https://grafana_servers;
    }

    location /prometheus {
        proxy_pass http://prometheus_servers;
    }

    location /alertmanager {
        proxy_pass http://alertmanager_servers;
    }
}
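The `rewrite ^/grafana/(.*) /$1 break;` rule in the `/grafana` location strips the prefix before the request is proxied, so Grafana sees paths relative to its own root. The sketch below reproduces the same substitution with Python's `re` module purely for illustration; the function name and sample paths are made up, and nginx itself uses PCRE against the request URI without the query string.

```python
import re

# Same pattern as the nginx rule: capture everything after "/grafana/".
GRAFANA_REWRITE = re.compile(r"^/grafana/(.*)")


def strip_grafana_prefix(uri: str) -> str:
    """Return the URI as it would look after the rewrite rule is applied.

    URIs that do not match the pattern (for example "/grafana" without a
    trailing slash) are returned unchanged, as nginx would leave them.
    """
    return GRAFANA_REWRITE.sub(r"/\1", uri, count=1)


if __name__ == "__main__":
    for path in ("/grafana/d/abc", "/grafana/", "/grafana"):
        print(path, "->", strip_grafana_prefix(path))
```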

phlogistonjohn · 2024-05-17T17:46:49Z

I will not review the patch but I will take this opportunity to bike-shed the name a little bit: cluster-gateway doesn't make it clear that this is mainly for stuff like the dashboard and grafana, etc. ui-gateway might be clearer but too narrow. How about admin-gateway? Or management-gateway?

rkachach · 2024-05-20T15:21:26Z

> I will not review the patch but I will take this opportunity to bike-shed the name a little bit: cluster-gateway doesn't make it clear that this is mainly for stuff like the dashboard and grafana, etc. ui-gateway might be clearer but too narrow. How about admin-gateway? Or management-gateway?

@phlogistonjohn no worries, and I think it's a good time to discuss the service name :)

The rev-proxy includes the monitoring stack (prometheus, alertmanager, grafana, ..). In addition, it can include any service app that we would run for cluster mgmt in the future. The ideal name I think could be "ingress" (to be aligned with k8s) but it's already in use as you know. I'm OK with going with another name as long as it better describes the purpose of the service 👍

rkachach · 2024-06-04T14:43:44Z

jenkins retest this please

rkachach · 2024-06-26T07:31:40Z

> I might be missing it, but I don’t see a note of which release introduces this functionality

@anthonyeleven I used 19.x.x as we don't know the exact release yet (as this will be determined when backporting).

rkachach · 2024-06-26T12:21:52Z

jenkins test rook e2e

rkachach · 2024-06-26T13:08:36Z

jenkins test rook e2e

rkachach · 2024-06-27T07:45:18Z

jenkins test rook e2e

doc/cephadm/services/mgmt-gateway.rst

src/cephadm/cephadmlib/daemons/mgmt_gateway.py

src/python-common/ceph/deployment/service_spec.py

anthonyeleven

Nice. Approval is for docs only. I've made a number of nitpicky wording requests.

doc/cephadm/services/mgmt-gateway.rst

github-actions · 2024-07-01T20:09:39Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

rkachach · 2024-07-02T16:10:50Z

jenkins test rook e2e

rkachach · 2024-07-02T18:20:06Z

jenkins test rook e2e

adk3798

Always tough to review PRs of this size, but given you already addressed my last set of review comments and nothing jumped out at me as an issue when checking over again, I approve

adding mgmt-gateway, a new cephadm service based on nginx, to act as the front-end and single entry point to the cluster. This gateway offers unified access to all Ceph applications, including the Ceph dashboard and monitoring tools (Prometheus, Grafana, ..), while enhancing security and simplifying access management through nginx.

Fixes: https://tracker.ceph.com/issues/66095
Signed-off-by: Redouane Kachach <rkachach@ibm.com>

adk3798 · 2024-07-10T14:27:43Z

https://pulpito.ceph.com/adking-2024-07-09_19:40:48-orch:cephadm-wip-adk-testing-2024-07-09-0905-distro-default-smithi/

Failures:

mds_upgrade_sequence, known issue
staggered upgrade with agent, known issue
a couple of "in cluster logs" type failures that don't look bad
one failure in staggered upgrade that seemed to be a mismatch between what ceph versions and cephadm believed to be the version for a single RGW daemon. Needs investigation, but unrelated to PRs in the run
thrash and rgw-ingress tests timing out, known issue

Generally a good run, nothing to block merging most of the PRs

github-actions bot added cephadm pybind labels May 17, 2024

github-actions bot added monitoring orchestrator labels May 20, 2024

rkachach force-pushed the fix_issue_66095 branch 2 times, most recently from ee4f57f to cab9d77 Compare May 21, 2024 08:44

rkachach changed the title ~~[WIP] [DO NOT REVIEW] adding new cephadm service cluster-gateway~~ [WIP] [DO NOT REVIEW] adding support for ceph admin-gateway May 21, 2024

rkachach force-pushed the fix_issue_66095 branch 5 times, most recently from 0495dd3 to 9a01460 Compare May 28, 2024 11:13

rkachach force-pushed the fix_issue_66095 branch 4 times, most recently from 6cd2f1f to 3f7b6f0 Compare May 31, 2024 08:01

github-actions bot added the mgr label May 31, 2024

rkachach force-pushed the fix_issue_66095 branch 5 times, most recently from 08633bc to 0a371f3 Compare June 4, 2024 08:20

rkachach force-pushed the fix_issue_66095 branch from ed0c369 to 7d696be Compare June 6, 2024 14:06

rkachach changed the title ~~[WIP] [DO NOT REVIEW] adding support for ceph admin-gateway~~ [WIP] [DO NOT REVIEW] adding support for ceph mgmt-gateway Jun 12, 2024

rkachach changed the title ~~[WIP] [DO NOT REVIEW] adding support for ceph mgmt-gateway~~ Adding support for ceph mgmt-gateway Jun 12, 2024

rkachach force-pushed the fix_issue_66095 branch 2 times, most recently from 8eb1d1b to f375a8e Compare June 12, 2024 13:18

rkachach force-pushed the fix_issue_66095 branch 5 times, most recently from c03a1e8 to fc7a01b Compare June 25, 2024 13:14

rkachach mentioned this pull request Jun 25, 2024

adding mTLS support for ceph mgmt backend services communication #58047

Closed

15 tasks

rkachach force-pushed the fix_issue_66095 branch from e3dee23 to 8351612 Compare June 25, 2024 22:08

epuertat reviewed Jun 27, 2024

doc/cephadm/services/mgmt-gateway.rst

src/cephadm/cephadmlib/daemons/mgmt_gateway.py

src/python-common/ceph/deployment/service_spec.py

rkachach force-pushed the fix_issue_66095 branch from 37c1a94 to 7b83cc9 Compare June 27, 2024 10:13

rkachach requested a review from anthonyeleven June 27, 2024 10:20

anthonyeleven approved these changes Jun 27, 2024

rkachach force-pushed the fix_issue_66095 branch 3 times, most recently from a6698e7 to e3beb1a Compare June 28, 2024 12:57

adk3798 approved these changes Jul 3, 2024

rkachach added 3 commits July 9, 2024 15:27

mgr/cephadm: adding documentation for cephadm mgmt-gateway service

a093ba7

Signed-off-by: Redouane Kachach <rkachach@ibm.com>

mgr/cephadm: introducing nobody/nogroup constants

11aaee1

Signed-off-by: Redouane Kachach <rkachach@ibm.com>

rkachach mentioned this pull request Jul 11, 2024

mgr/cephadm: adding mTLS for ceph mgmt-gateway and backend services communication #58402

Merged

20 tasks

rkachach mentioned this pull request Aug 5, 2024

adding support for SSO based on oauth2-proxy #58460

Merged

18 tasks

rkachach mentioned this pull request Oct 9, 2024

Adding HA support for mgmt-gateway and oauth2-proxy services #59982

Merged

14 tasks

rkachach mentioned this pull request Mar 7, 2025

Make prometheus TLS config work with Rook orchestrator #61468

Merged

14 tasks
