@singh-kalpana
Contributor

What this PR does / why we need it?
Install the Prometheus Operator and Grafana Operator instead of a standalone Prometheus and Grafana instance.
The standalone instances do not include important CRDs such as ServiceMonitor (prometheus-community/helm-charts#3010), which are needed to monitor user-defined applications without directly modifying the Prometheus configuration. The absence of these CRDs also limits flexibility and prevents us from migrating monitoring resources to a production environment.

Changes:

  1. Install the Prometheus Operator (kube-prometheus-stack) with the Grafana deployment set to false, because the stack otherwise installs a standalone Grafana instance alongside it.
    PR #2172 suggests including the Grafana Operator in the kube-prometheus-stack instead of a Grafana instance, but the Prometheus community recommends disabling Grafana in the stack and installing the Grafana Operator separately.
  2. Install the Grafana Operator following this doc: https://github.com/grafana/grafana-operator
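
For illustration, change 1 could look roughly like the following fragment of the values file passed to Helm. This is a minimal sketch, not the exact file from this PR: it assumes the chart's `grafana.enabled` toggle is the only setting needed to skip the bundled Grafana instance.

```yaml
# kube-prometheus-stack.values (fragment, illustrative only)
# Skip the Grafana instance bundled with kube-prometheus-stack;
# Grafana is installed separately through the Grafana Operator.
grafana:
  enabled: false
```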

type: prometheus
uid: prometheus
access: proxy
url: http://kube-prometheus-stack-1735-prometheus.monitoring.svc.cluster.local:9090

Is this really a fixed URL?

Collaborator

I guess it will change, but we need to test.

Contributor Author

Thank you for pointing this out.

Initially, I had trouble setting the URL, so I followed this recommendation, and it worked.

But yes, you're right: the URL can change because of the suffix (1735). I used the --generate-name option when installing kube-prometheus-stack, and this option adds a unique suffix to the release name.

To avoid this, we can use the fixed release name "kube-prometheus-stack" instead of --generate-name. This creates a consistent service name: kube-prometheus-stack-prometheus

helm install kube-prometheus-stack --version {{ prometheus_stack }} prometheus-community/kube-prometheus-stack --create-namespace --namespace monitoring --values {{ ansible_user_dir }}/kube-prometheus-stack.values

With this, the URL will always be:
http://kube-prometheus-stack-prometheus.monitoring:9090 (as also described in this kube-prometheus-stack Grafana datasource file)

I tried this approach, and it worked.
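
As a rough sketch of how the datasource could look with the fixed URL when managed through the Grafana Operator: the `type`, `uid`, `access`, and port come from the snippet under review, while the resource name, namespace, datasource display name, and the `instanceSelector` labels are assumptions for illustration and would need to match the actual Grafana instance.

```yaml
# Illustrative GrafanaDatasource for the Grafana Operator (v1beta1 API).
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus-datasource      # hypothetical name
  namespace: monitoring            # assumed: same namespace as the stack
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana          # hypothetical label; must match the Grafana CR
  datasource:
    name: Prometheus               # assumed display name
    type: prometheus
    uid: prometheus
    access: proxy
    # Stable because the release name is fixed to "kube-prometheus-stack"
    # rather than generated with --generate-name.
    url: http://kube-prometheus-stack-prometheus.monitoring:9090
```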

@ghost ghost merged commit ef96395 into NVIDIA:master Feb 4, 2025

@singh-kalpana I'm not sure why you removed the additional scrape config; because of that, GPU metrics are not visible on the Grafana dashboard. I will fix that, but this is an FYI.

Contributor Author

We don't need the additional scrape config, because the ServiceMonitor for the DCGM exporter is enabled, which allows Prometheus to scrape metrics from the DCGM exporter. It takes about a minute for metrics to show up on the Grafana dashboard.
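
For reference, a ServiceMonitor that replaces a hand-written scrape config generally has the shape below. This is only a sketch of the mechanism, not the resource created in this PR: the resource name, namespaces, label selector, port name, and interval are all assumptions and depend on how the DCGM exporter Service is deployed.

```yaml
# Illustrative ServiceMonitor; every name and label here is hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter              # hypothetical name
  namespace: monitoring            # assumed namespace
  # Note: the Prometheus resource only picks up ServiceMonitors whose
  # labels match its serviceMonitorSelector, so labels may be needed here.
spec:
  namespaceSelector:
    matchNames:
      - nvidia-gpu-operator        # assumed namespace of the exporter Service
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter    # assumed label on the exporter Service
  endpoints:
    - port: gpu-metrics            # assumed name of the Service's metrics port
      interval: 15s                # assumed scrape interval
```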

This pull request was closed.
3 participants