[Docs] Add ServiceMonitor section and make some step optional in Grafana & Promethus page #53474

Merged
edoakes merged 36 commits into ray-project:master from owenowenisme:kuberay/update-grafana-prom-docs on Jun 18, 2025

@owenowenisme
Member

@owenowenisme owenowenisme commented Jun 2, 2025

Why are these changes needed?

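This PR updates the docs to reflect the latest changes in KubeRay.
Changes include:
- A page listing the existing KubeRay metrics:
  https://anyscale-ray--53474.com.readthedocs.build/en/53474/cluster/kubernetes/k8s-ecosystem/metrics-references.html#kuberay-metrics-references
- Instructions for the ServiceMonitor in:
  https://anyscale-ray--53474.com.readthedocs.build/en/53474/cluster/kubernetes/k8s-ecosystem/prometheus-grafana.html#step-7-collect-kuberay-metrics-with-servicemonitor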

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@owenowenisme owenowenisme changed the title from "add ServiceMonitor section and make some step optional" to "[Docs] Add ServiceMonitor section and make some step optional in Grafana & Promethus page" Jun 2, 2025
@owenowenisme
Member Author

@troychiu @win5923 This PR adds a ServiceMonitor section. Would you mind checking whether I'm missing anything?

@owenowenisme owenowenisme force-pushed the kuberay/update-grafana-prom-docs branch from 75e443e to 66b002c Compare June 3, 2025 02:47
@owenowenisme owenowenisme marked this pull request as ready for review June 3, 2025 05:12
@owenowenisme owenowenisme requested review from a team, kevin85421 and pcmoritz as code owners June 3, 2025 05:12
@owenowenisme owenowenisme requested a review from win5923 June 3, 2025 05:12
Member

We can add a new step to display the KubeRay Operator Dashboard once this PR is merged. Thanks!

Member Author

Just added

Member

@win5923 win5923 left a comment

Overall LGTM. Thanks Owen!

Contributor

@troychiu troychiu left a comment

Can we also mention that KubeRay provides a ServiceMonitor in its Helm chart (https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/templates/servicemonitor.yaml)? Maybe in Step 3: Install a KubeRay operator.

Contributor

Maybe simply KubeRay Metrics?

Member Author

I think "KubeRay Metrics" should cover both Controller Runtime Metrics and KubeRay Custom Metrics. The name "KubeRay Custom Metrics" emphasizes "custom" so that we can differentiate them from the natively provided metrics (Controller Runtime). Does that make sense to you?

Contributor

Sure!

@owenowenisme
Member Author

> Can we also mention that KubeRay provides the service monitor in helm chart https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/templates/servicemonitor.yaml? Maybe in Step 3: Install a KubeRay operator

Is this for production purposes?
We already deploy the ServiceMonitor with install.sh, as mentioned in Step 7.

@troychiu
Contributor

troychiu commented Jun 7, 2025

> > Can we also mention that KubeRay provides the service monitor in helm chart https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/templates/servicemonitor.yaml? Maybe in Step 3: Install a KubeRay operator
>
> Is this for production purpose? Cause we already deploy ServiceMonitor with install.sh as mentioned in Step 7.

Yes, they serve different purposes. The install.sh script is mainly for users to quickly try out the Grafana dashboard, while the ServiceMonitor in the Helm chart is intended for production use.
I just want to make sure users are aware of it, so it doesn't need to be a long explanation.

@owenowenisme
Member Author

@troychiu Then I suppose I should also create Helm charts for the other components in install.sh, such as the PodMonitor?

@owenowenisme owenowenisme requested review from troychiu and win5923 June 11, 2025 07:32
owenowenisme and others added 9 commits June 12, 2025 21:49
Signed-off-by: owenowenisme <mses010108@gmail.com>
Signed-off-by: owenowenisme <mses010108@gmail.com>
Signed-off-by: owenowenisme <mses010108@gmail.com>
Signed-off-by: owenowenisme <mses010108@gmail.com>
Co-authored-by: Jun-Hao Wan <ken89@kimo.com>
Signed-off-by: Owen Lin (You-Cheng Lin) <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: owenowenisme <mses010108@gmail.com>
Signed-off-by: owenowenisme <mses010108@gmail.com>
Signed-off-by: Owen Lin (You-Cheng Lin) <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Troy Chiu <114708546+troychiu@users.noreply.github.com>
Signed-off-by: Owen Lin (You-Cheng Lin) <106612301+owenowenisme@users.noreply.github.com>
@owenowenisme
Member Author

@kevin85421 PTAL

@kevin85421
Member

I’ve requested that the docs team review this PR. @owenowenisme, if you haven’t installed Vale yet, please do so and ensure its rules are being followed.

@owenowenisme
Member Author

Thanks for the reminder. I already ran Vale and made sure there are no new errors.

Contributor

@angelinalg angelinalg left a comment

Just some style and grammatical fixes. Our style guide says to use sentence case for titles. Unfortunately, the titles you haven't changed in this PR aren't all sentence case. If it's not too much work for you to fix them in this PR, please do. Otherwise, I can follow up with a PR after yours merges to fix them. Thanks for updating the docs!

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Owen Lin (You-Cheng Lin) <106612301+owenowenisme@users.noreply.github.com>
@owenowenisme owenowenisme force-pushed the kuberay/update-grafana-prom-docs branch from ca7c630 to 181ab2e Compare June 16, 2025 19:20
@owenowenisme
Member Author

Thanks @angelinalg I just applied your suggestions.

Contributor

@angelinalg angelinalg left a comment

I made a typo in one of my suggestions. Could you accept it? Also, are the corrections accurate? I want to verify that you meant RayService and not Ray Service for those metrics titles. @owenowenisme

## Step 3: Install a KubeRay operator

* Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator via Helm repository.
* You can enable the ServiceMonitor when installing the KubeRay operator with helm. See [Step 7](#step-7-collect-kuberay-metrics-with-servicemonitor) for more details.
Member

Remove this and provide the instruction to install KubeRay operator with ServiceMonitor instead.

helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0 --set metrics.serviceMonitor.enabled=true
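
A fuller sketch of that install flow, assuming the `kuberay` Helm repository hasn't been added yet. The repository URL is the one published for the KubeRay Helm charts. The `DRY_RUN` variable and the `run` helper are hypothetical names, added only so the commands can be previewed without a cluster.

```sh
#!/bin/sh
# Hypothetical wrapper: DRY_RUN=1 (the default here) prints each command
# instead of executing it. Set DRY_RUN=0 to run against a real cluster.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Add the KubeRay chart repository, then install the operator with the
# ServiceMonitor enabled through the chart value named in the comment above.
run helm repo add kuberay https://ray-project.github.io/kuberay-helm/
run helm repo update
run helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0 \
  --set metrics.serviceMonitor.enabled=true
```

The ServiceMonitor resource only applies if the Prometheus Operator CRDs already exist in the cluster, so this is presumably run after the Prometheus installation step.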

* KubeRay provides an [install.sh script](https://github.com/ray-project/kuberay/blob/master/install/prometheus/install.sh) to:
* Install the [kube-prometheus-stack v48.2.1](https://github.com/prometheus-community/helm-charts/tree/kube-prometheus-stack-48.2.1/charts/kube-prometheus-stack) chart and related custom resources, including **PodMonitor** and **PrometheusRule**, in the namespace `prometheus-system` automatically.
* Import Ray Dashboards [Grafana JSON files](https://github.com/ray-project/kuberay/tree/master/config/grafana) into Grafana using the `--auto-load-dashboard true` flag. If the flag isn't set, the following step also provides instructions for manual import.
* Install the [kube-prometheus-stack v48.2.1](https://github.com/prometheus-community/helm-charts/tree/kube-prometheus-stack-48.2.1/charts/kube-prometheus-stack) chart and related custom resources, including **PodMonitor**, **ServiceMonitor**, and **PrometheusRule**, in the namespace `prometheus-system` automatically.
Member

Please remove ServiceMonitor for KubeRay operator from the install.sh.

## Step 7: Collect custom metrics with Recording Rules
## Step 7: Collect KubeRay metrics with ServiceMonitor

Starting with KubeRay 1.4.0, KubeRay provides a [ServiceMonitor](https://github.com/ray-project/kuberay/blob/master/config/prometheus/serviceMonitor.yaml) to help Prometheus discover and scrape KubeRay metrics.
Member

We just need to tell users that the ServiceMonitor is created when the KubeRay operator is installed, and then provide an instruction to help users verify it (something like kubectl get servicemonitor).
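
A minimal sketch of such a verification step. It assumes `kubectl` points at the target cluster, that the Prometheus Operator CRDs (which define the ServiceMonitor kind) are installed, and that the operator was installed in the `default` namespace. `NAMESPACE` and `check_servicemonitor` are hypothetical names used for illustration.

```sh
# Assumption: the KubeRay operator was installed in the "default" namespace.
NAMESPACE="${NAMESPACE:-default}"

# Hypothetical helper: list the ServiceMonitor resources in one namespace.
check_servicemonitor() {
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found" >&2
    return 1
  fi
  # An error about an unknown resource type suggests the CRD is missing;
  # an empty list suggests no ServiceMonitor was created in this namespace.
  kubectl get servicemonitor -n "$1"
}
```

Calling `check_servicemonitor "$NAMESPACE"` after the Helm install should list a ServiceMonitor for the operator if the chart created one.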

## Step 13: Embed Grafana panels in Ray Dashboard
## Step 14: View the KubeRay operator dashboard

Once the KubeRay Operator dashboard is imported into Grafana, you can monitor metrics from the KubeRay operator. The dashboard provides a dropdown menu to filter and view controller runtime metrics for specific Ray CRs:`RayCluster`, `RayJob`, `RayService`.
Member

We only need to include one picture here; four pictures are too many.

Member Author

Left only one.

# KubeRay metrics references

## Controller runtime metrics
KubeRay is built with Controller Runtime, which natively exposes metrics that KubeRay includes in its metrics. These metrics include:
Member

Check Vale

Member Author

❯ vale doc/source/cluster/kubernetes/k8s-ecosystem/metrics-references.md

 doc/source/cluster/kubernetes/k8s-ecosystem/metrics-references.md
 1:1     suggestion  Use parentheses judiciously.    Google.Parens   
 1:2     error       Use 'KubeRay' instead of        Google.WordList 
                     'kuberay'.                                      
 6:38    error       Use 'Kubernetes' instead of     Vale.Terms      
                     'kubernetes'.                                   
 8:57    error       Use 'Kubernetes' instead of     Vale.Terms      
                     'kubernetes'.                                   
 32:95   suggestion  In general, use active voice    Google.Passive  
                     instead of passive voice ('is                   
                     provisioned').                                  
 33:158  suggestion  Use parentheses judiciously.    Google.Parens   
 49:313  suggestion  In general, use active voice    Google.Passive  
                     instead of passive voice ('is                   
                     enabled').                                      

I suppose that we can ignore these.

# KubeRay metrics references

## Controller runtime metrics
KubeRay is built with Controller Runtime, which natively exposes metrics that KubeRay includes in its metrics. These metrics include:
Member

It's too verbose.

Please remove:

KubeRay is built with Controller Runtime ...
...
- Go runtime stats like Goroutines and GC duration

and use the following line instead:

"KubeRay exposes metrics provided by kubernetes-sigs/controller-runtime, including information about reconciliation, work queues, and more, to help users operate the KubeRay operator in production environments."
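
For a rough idea of what those metrics look like, here is a sketch that filters Prometheus text-format output down to the reconciliation and work queue series. The metric names come from controller-runtime, but the label values and numbers below are invented for illustration and were not captured from a real operator.

```sh
# Illustrative sample only: these values are made up, not real operator output.
sample_metrics='controller_runtime_reconcile_total{controller="raycluster",result="success"} 12
workqueue_depth{name="raycluster"} 0
go_goroutines 42'

# Keep only the reconciliation and work queue series; drop Go runtime stats.
filtered=$(printf '%s\n' "$sample_metrics" | grep -E '^(controller_runtime_|workqueue_)')
printf '%s\n' "$filtered"
```

The same `grep` pattern could be applied to the output of the operator's metrics endpoint once it is reachable, for example through a port-forward.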

Member Author

Done.

owenowenisme and others added 2 commits June 18, 2025 01:08
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Owen Lin (You-Cheng Lin) <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
owenowenisme and others added 2 commits June 18, 2025 02:36
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
```sh
helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0 \
--set metrics.serviceMonitor.enabled=true
```
Member

Add an instruction to verify whether the ServiceMonitor is created correctly.

## Step 7: Collect custom metrics with Recording Rules
## Step 7: Collect KubeRay metrics with ServiceMonitor

Installing the KubeRay operator automatically creates a ServiceMonitor to help Prometheus discover and scrape KubeRay metrics. You can verify the ServiceMonitor creation with:
Member

Remove it.

```yaml
honorLabels: true
```
* Same as PodMonitor, the **install.sh** script also creates the [serviceMonitor.yaml](https://github.com/ray-project/kuberay/blob/master/config/prometheus/serviceMonitor.yaml) shown above, so you don't need to create it manually.
Member

Delete L271 - L276

owenowenisme and others added 3 commits June 18, 2025 10:15
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: Owen Lin (You-Cheng Lin) <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: Owen Lin (You-Cheng Lin) <106612301+owenowenisme@users.noreply.github.com>
Member

@kevin85421 kevin85421 left a comment

Nice

@kevin85421 kevin85421 added the "go" label (add ONLY when ready to merge, run all tests) Jun 18, 2025
@kevin85421
Member

@owenowenisme please ping me when all CI tests pass.

@owenowenisme
Member Author

@kevin85421 CI passed.

@kevin85421
Member

cc @jjyao @edoakes would you mind merging this PR? Thanks!

@edoakes edoakes merged commit cf5eeb8 into ray-project:master Jun 18, 2025
6 checks passed
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
…ana & Promethus page (ray-project#53474)

This PR updates the docs to reflect the latest changes in KubeRay.
Changes include:
- A page listing the existing KubeRay metrics:
  https://anyscale-ray--53474.com.readthedocs.build/en/53474/cluster/kubernetes/k8s-ecosystem/metrics-references.html#kuberay-metrics-references
- Instructions for the ServiceMonitor in:
  https://anyscale-ray--53474.com.readthedocs.build/en/53474/cluster/kubernetes/k8s-ecosystem/prometheus-grafana.html#step-7-collect-kuberay-metrics-with-servicemonitor

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: owenowenisme <mses010108@gmail.com>
Signed-off-by: Owen Lin (You-Cheng Lin) <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Co-authored-by: Jun-Hao Wan <ken89@kimo.com>
Co-authored-by: Troy Chiu <114708546+troychiu@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
…ana & Promethus page (#53474)

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>