Add bastion monitoring with node-exporter and registry traffic tracking#836
Conversation
|
Skipping CI for Draft Pull Request. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (15)
💤 Files with no reviewable changes (3)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (10)
📝 WalkthroughWalkthroughThis PR adds comprehensive Prometheus/Grafana-based monitoring for hypervisors and bastion hosts. It introduces two pre-built Grafana dashboards, integrates bastion node-exporter into the metrics pod, and adds optional registry traffic collection via iptables accounting and systemd timers. The Ansible role is updated to orchestrate pod deployment, directory creation, and collector script execution. ChangesHV Metrics Monitoring Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
8383dcb to
a35627c
Compare
fc52f0f to
f011c8e
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
ansible/roles/hv-metrics-server/tasks/main.yaml (1)
118-120: ⚡ Quick winMake the one-shot initialization idempotent.
ansible.builtin.commandalways reportschanged, breaking idempotency on re-runs. Since the script producesregistry_traffic.prom, gate it withcreates(or setchanged_when).♻️ Proposed change
- name: Run collector once to initialize iptables rules and metrics file ansible.builtin.command: cmd: "{{ hv_metrics_server_work_dir }}/registry-traffic-collector.sh" + creates: "{{ hv_metrics_server_bastion_exporter_textfile_dir }}/registry_traffic.prom"As per coding guidelines: "Idempotency: Tasks should be idempotent (can be run multiple times safely)."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ansible/roles/hv-metrics-server/tasks/main.yaml` around lines 118 - 120, The one-shot init task using ansible.builtin.command (cmd: "{{ hv_metrics_server_work_dir }}/registry-traffic-collector.sh") is non-idempotent; update the task to avoid always reporting changed by gating the command with creates pointing to the generated file (registry_traffic.prom) or by adding changed_when: false so re-runs are no-op; ensure you reference the same hv_metrics_server_work_dir and registry-traffic-collector.sh invocation when adding the creates/changed_when condition.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ansible/roles/hv-metrics-server/files/dashboards/bastion-dashboard.json`:
- Around line 496-503: The "Network Interfaces" row currently has "title":
"Network Interfaces" with "y": 54 causing a large gap after Disk I/O; update
this row's "y" from 54 to the correct value (e.g., 36) and then shift all
dependent panels that follow (the Network panels and the subsequent "Mirror
Registry" row and its panels) up by the same delta so the Dashboard layout is
contiguous; locate the JSON objects with "title": "Network Interfaces" and
"title": "Mirror Registry" and adjust their "gridPos.y" (and any panels whose
gridPos.y falls between the old and new positions) to remove the gap.
In `@ansible/roles/hv-metrics-server/tasks/main.yaml`:
- Around line 91-120: The registry traffic monitoring block runs
registry-traffic-collector.sh which expects
hv_metrics_server_bastion_exporter_textfile_dir to exist but that directory is
only created when hv_metrics_server_enable_bastion_exporter is true; either
ensure the directory is created inside the Registry traffic monitoring block or
gate the block on hv_metrics_server_enable_bastion_exporter as well.
Specifically, update the "Registry traffic monitoring block" (tasks that render
registry-traffic-collector.sh, registry-traffic-collector.service,
registry-traffic-collector.timer and run the collector) to either add a task
that creates the directory hv_metrics_server_bastion_exporter_textfile_dir (use
file/mode/owner/group) before rendering templates and running the script, or add
the additional when: hv_metrics_server_enable_bastion_exporter condition to the
block to prevent running without the Bastion exporter enabled.
In
`@ansible/roles/hv-metrics-server/templates/registry-traffic-collector.timer.j2`:
- Line 2: Description text in the registry-traffic-collector.timer unit is
inconsistent with the timer interval; either update the Description to reflect
the configured OnUnitActiveSec=5s or change OnUnitActiveSec to 15s if 15s was
intended. Edit the registry-traffic-collector.timer.j2 template: locate the
Description=... line and the OnUnitActiveSec=... setting and make them
consistent (e.g., set Description="Collect mirror registry traffic metrics every
5 seconds" when keeping OnUnitActiveSec=5s, or change OnUnitActiveSec to 15s to
match the current description).
In `@docs/hv-metrics.md`:
- Line 24: Update the table row describing hv-metrics-server to fix the grammar
mismatch "a node-exporter containers" by changing it to either "a node-exporter
container" (singular) or "node-exporter containers" (drop the article) so the
description for hv-metrics-server / Bastion reads consistently with existing
docs formatting.
---
Nitpick comments:
In `@ansible/roles/hv-metrics-server/tasks/main.yaml`:
- Around line 118-120: The one-shot init task using ansible.builtin.command
(cmd: "{{ hv_metrics_server_work_dir }}/registry-traffic-collector.sh") is
non-idempotent; update the task to avoid always reporting changed by gating the
command with creates pointing to the generated file (registry_traffic.prom) or
by adding changed_when: false so re-runs are no-op; ensure you reference the
same hv_metrics_server_work_dir and registry-traffic-collector.sh invocation
when adding the creates/changed_when condition.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 82df00a6-c732-4997-9af4-8aa4fe4f9bb7
📒 Files selected for processing (16)
README.mdansible/roles/hv-metrics-server/defaults/main/bastion-registry.ymlansible/roles/hv-metrics-server/defaults/main/main.yamlansible/roles/hv-metrics-server/files/dashboards/bastion-dashboard.jsonansible/roles/hv-metrics-server/files/dashboards/hv-metrics-dashboard.jsonansible/roles/hv-metrics-server/files/dashboards/hypervisors-dashboard.jsonansible/roles/hv-metrics-server/tasks/main.yamlansible/roles/hv-metrics-server/templates/hv-metrics-net.network.j2ansible/roles/hv-metrics-server/templates/hv-metrics.kube.j2ansible/roles/hv-metrics-server/templates/hv-metrics.yaml.j2ansible/roles/hv-metrics-server/templates/prometheus.yaml.j2ansible/roles/hv-metrics-server/templates/registry-traffic-collector.service.j2ansible/roles/hv-metrics-server/templates/registry-traffic-collector.sh.j2ansible/roles/hv-metrics-server/templates/registry-traffic-collector.timer.j2ansible/vars/.gitignoredocs/hv-metrics.md
💤 Files with no reviewable changes (3)
- ansible/roles/hv-metrics-server/templates/hv-metrics-net.network.j2
- ansible/roles/hv-metrics-server/files/dashboards/hv-metrics-dashboard.json
- ansible/roles/hv-metrics-server/templates/hv-metrics.kube.j2
| "title": "Network Interfaces", | ||
| "type": "row", | ||
| "gridPos": { | ||
| "h": 1, | ||
| "w": 24, | ||
| "x": 0, | ||
| "y": 54 | ||
| }, |
There was a problem hiding this comment.
Network Interfaces row y leaves a large empty gap. Disk I/O panels end at y:36 (y:28 + h:8), but this row starts at y:54, leaving ~18 grid units of blank space. The Hypervisors dashboard places the equivalent row at y:36 with no gap.
🎨 Proposed fix (shift Network + Registry rows up)
- "gridPos": {
- "h": 1,
- "w": 24,
- "x": 0,
- "y": 54
- },
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 36
+ },Adjust the dependent panel y values (and the subsequent Mirror Registry row/panels) accordingly so the layout is contiguous.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "title": "Network Interfaces", | |
| "type": "row", | |
| "gridPos": { | |
| "h": 1, | |
| "w": 24, | |
| "x": 0, | |
| "y": 54 | |
| }, | |
| "title": "Network Interfaces", | |
| "type": "row", | |
| "gridPos": { | |
| "h": 1, | |
| "w": 24, | |
| "x": 0, | |
| "y": 36 | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ansible/roles/hv-metrics-server/files/dashboards/bastion-dashboard.json`
around lines 496 - 503, The "Network Interfaces" row currently has "title":
"Network Interfaces" with "y": 54 causing a large gap after Disk I/O; update
this row's "y" from 54 to the correct value (e.g., 36) and then shift all
dependent panels that follow (the Network panels and the subsequent "Mirror
Registry" row and its panels) up by the same delta so the Dashboard layout is
contiguous; locate the JSON objects with "title": "Network Interfaces" and
"title": "Mirror Registry" and adjust their "gridPos.y" (and any panels whose
gridPos.y falls between the old and new positions) to remove the gap.
Extends the hv-metrics-server role to monitor the bastion machine itself. Adds node-exporter as a container in the hv-metrics pod for system metrics (CPU, memory, disk, network) and adds iptables-based accounting for mirror registry traffic via a textfile collector script on a 5s systemd timer. Registry traffic monitoring activates automatically when setup_bastion_registry is true, using the existing registry_port variable via the defaults symlink pattern. Switches the hv-metrics pod to host networking to match all other Jetlag pods and enable accurate host network interface metrics from node-exporter. Removes the custom pod network, PublishPort directives, and related dead variables. The registry collector script also gathers container-level CPU and memory usage from cgroup files and active connection count via ss, covering both IPv4 and IPv6 traffic paths. Includes new Grafana dashboards for bastion metrics and improvements to the existing HV metrics dashboard: memory utilization panel, load averages, dynamic network interfaces, collapsed network row, stable UID, and legend tables with last/min/max on all timeseries panels. Adds docs/hv-metrics.md documentation page linked from the README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/lgtm |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mcornea The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
4e98b63
into
redhat-performance:main
Extends the hv-metrics-server role to monitor the bastion machine itself.
Adds node-exporter as a container in the hv-metrics pod for system metrics
(CPU, memory, disk, network) and adds iptables-based accounting for mirror
registry traffic via a textfile collector script on a 5s systemd timer.
Registry traffic monitoring activates automatically when setup_bastion_registry
is true, using the existing registry_port variable via the defaults symlink
pattern. Includes a new Grafana dashboard for bastion resource metrics and
per-service registry bandwidth breakdown.
Summary by CodeRabbit
New Features
Documentation