Skip to content

Add bastion monitoring with node-exporter and registry traffic tracking#836

Merged
openshift-merge-bot[bot] merged 1 commit into
redhat-performance:mainfrom
akrzos:monitor_bastion
Jun 4, 2026
Merged

Add bastion monitoring with node-exporter and registry traffic tracking#836
openshift-merge-bot[bot] merged 1 commit into
redhat-performance:mainfrom
akrzos:monitor_bastion

Conversation

@akrzos

@akrzos akrzos commented May 29, 2026

Copy link
Copy Markdown
Member

Extends the hv-metrics-server role to monitor the bastion machine itself.
Adds node-exporter as a container in the hv-metrics pod for system metrics
(CPU, memory, disk, network) and adds iptables-based accounting for mirror
registry traffic via a textfile collector script on a 5s systemd timer.
Registry traffic monitoring activates automatically when setup_bastion_registry
is true, using the existing registry_port variable via the defaults symlink
pattern. Includes a new Grafana dashboard for bastion resource metrics and
per-service registry bandwidth breakdown.

Summary by CodeRabbit

  • New Features

    • New Grafana dashboards for bastion and hypervisors tracking CPU, memory, disk, network, and I/O metrics
    • Registry traffic monitoring with connection counts and performance metrics via systemd timer
  • Documentation

    • Added HV Metrics monitoring documentation with setup instructions and available metrics

@openshift-ci

openshift-ci Bot commented May 29, 2026

Copy link
Copy Markdown

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented May 29, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e1a56e56-bac8-429f-a80b-daa92caf2a6f

📥 Commits

Reviewing files that changed from the base of the PR and between f011c8e and 176b332.

📒 Files selected for processing (15)
  • README.md
  • ansible/roles/hv-metrics-server/defaults/main/bastion-registry.yml
  • ansible/roles/hv-metrics-server/defaults/main/main.yaml
  • ansible/roles/hv-metrics-server/files/dashboards/bastion-dashboard.json
  • ansible/roles/hv-metrics-server/files/dashboards/hv-metrics-dashboard.json
  • ansible/roles/hv-metrics-server/files/dashboards/hypervisors-dashboard.json
  • ansible/roles/hv-metrics-server/tasks/main.yaml
  • ansible/roles/hv-metrics-server/templates/hv-metrics-net.network.j2
  • ansible/roles/hv-metrics-server/templates/hv-metrics.kube.j2
  • ansible/roles/hv-metrics-server/templates/hv-metrics.yaml.j2
  • ansible/roles/hv-metrics-server/templates/prometheus.yaml.j2
  • ansible/roles/hv-metrics-server/templates/registry-traffic-collector.service.j2
  • ansible/roles/hv-metrics-server/templates/registry-traffic-collector.sh.j2
  • ansible/roles/hv-metrics-server/templates/registry-traffic-collector.timer.j2
  • docs/hv-metrics.md
💤 Files with no reviewable changes (3)
  • ansible/roles/hv-metrics-server/files/dashboards/hv-metrics-dashboard.json
  • ansible/roles/hv-metrics-server/templates/hv-metrics.kube.j2
  • ansible/roles/hv-metrics-server/templates/hv-metrics-net.network.j2
✅ Files skipped from review due to trivial changes (2)
  • README.md
  • docs/hv-metrics.md
🚧 Files skipped from review as they are similar to previous changes (10)
  • ansible/roles/hv-metrics-server/defaults/main/bastion-registry.yml
  • ansible/roles/hv-metrics-server/templates/registry-traffic-collector.service.j2
  • ansible/roles/hv-metrics-server/templates/prometheus.yaml.j2
  • ansible/roles/hv-metrics-server/templates/registry-traffic-collector.timer.j2
  • ansible/roles/hv-metrics-server/templates/hv-metrics.yaml.j2
  • ansible/roles/hv-metrics-server/files/dashboards/hypervisors-dashboard.json
  • ansible/roles/hv-metrics-server/tasks/main.yaml
  • ansible/roles/hv-metrics-server/defaults/main/main.yaml
  • ansible/roles/hv-metrics-server/templates/registry-traffic-collector.sh.j2
  • ansible/roles/hv-metrics-server/files/dashboards/bastion-dashboard.json

📝 Walkthrough

Walkthrough

This PR adds comprehensive Prometheus/Grafana-based monitoring for hypervisors and bastion hosts. It introduces two pre-built Grafana dashboards, integrates bastion node-exporter into the metrics pod, and adds optional registry traffic collection via iptables accounting and systemd timers. The Ansible role is updated to orchestrate pod deployment, directory creation, and collector script execution.

Changes

HV Metrics Monitoring Implementation

Layer / File(s) Summary
Configuration schema and documentation
README.md, ansible/roles/hv-metrics-server/defaults/main/bastion-registry.yml, ansible/roles/hv-metrics-server/defaults/main/main.yaml, docs/hv-metrics.md
Role defaults declare bastion exporter enablement, container name/image, and textfile collector directory; bastion-registry.yml becomes a symlink; comprehensive documentation describes architecture, configuration variables, setup steps, dashboard features, and manual counter-reset procedures.
Grafana monitoring dashboards
ansible/roles/hv-metrics-server/files/dashboards/bastion-dashboard.json, ansible/roles/hv-metrics-server/files/dashboards/hypervisors-dashboard.json
Bastion Dashboard provides system metrics (CPU utilization/cores/load, memory, disk space, disk I/O, network interfaces) and optional Mirror Registry Traffic panels (container CPU/memory, active connections, bandwidth, throughput, packets). Hypervisors Dashboard offers per-host CPU, memory, disk, and network panels with template variables for interval, job, interface, and disk filtering. Both use PromQL queries and consistent renameByRegex transformations.
Pod deployment: hostNetwork, exporter container, and Prometheus scrape config
ansible/roles/hv-metrics-server/templates/hv-metrics.yaml.j2, ansible/roles/hv-metrics-server/templates/prometheus.yaml.j2, ansible/roles/hv-metrics-server/templates/hv-metrics.kube.j2
Pod spec enables hostNetwork: true and conditionally adds bastion-metrics-exporter container with host rootfs/procfs/sysfs mounts and port 9100 exposure; Pod volumes mount / as host-rootfs and textfile directory. Prometheus config adds bastion-system scrape job targeting the exporter. Kube quadlet template removes prior Network= and PublishPort= directives for Prometheus/Grafana.
Registry traffic monitoring collection script
ansible/roles/hv-metrics-server/templates/registry-traffic-collector.sh.j2
Bash script defines setup_rules() to idempotently create iptables/ip6tables accounting chains keyed by registry port, read_counter() to extract RX/TX bytes and packets from RETURN-marked rules. Script locates registry container via podman, reads CPU usage (cpu.stat) and memory usage from cgroups, maps unlimited memory (max) to zero, and counts established port connections with ss for both IPv4 and IPv6. Outputs Prometheus-format samples (network counters, CPU seconds, memory/limit gauges, connection count) to temporary file, then atomically moves to registry_traffic.prom.
Systemd service and timer orchestration
ansible/roles/hv-metrics-server/templates/registry-traffic-collector.service.j2, ansible/roles/hv-metrics-server/templates/registry-traffic-collector.timer.j2
Service template configures Type=oneshot to execute the collector script from the role work directory. Timer template triggers collection every 5 seconds after boot with 1-second accuracy, installed under timers.target.
Ansible task coordination
ansible/roles/hv-metrics-server/tasks/main.yaml
Tasks remove the unused hv-metrics network config rendering. Conditionally create bastion exporter textfile collector directory with defined permissions when hv_metrics_server_enable_bastion_exporter is enabled. Conditionally set up registry traffic monitoring (template rendering, initial collector script run to initialize iptables chains and metrics file, systemd daemon-reload, timer start/enable) when setup_bastion_registry is true.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

lgtm, approved

Suggested reviewers

  • mcornea

Poem

🐰 Metrics bloom in Prometheus fields,
Two dashboards shine with steady yields,
Registry traffic, counted with care,
Iptables accounting in the air,
Grafana paints the bastion's tale! 📊

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically describes the main changes: adding bastion monitoring with node-exporter and registry traffic tracking, which aligns with the core objectives.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@akrzos akrzos force-pushed the monitor_bastion branch 7 times, most recently from 8383dcb to a35627c Compare May 29, 2026 21:00
@akrzos akrzos requested a review from agurenko June 1, 2026 14:05
@akrzos akrzos force-pushed the monitor_bastion branch 8 times, most recently from fc52f0f to f011c8e Compare June 2, 2026 15:38
@akrzos akrzos marked this pull request as ready for review June 2, 2026 15:38
@openshift-ci openshift-ci Bot requested review from mcornea and rsevilla87 June 2, 2026 15:38
@akrzos akrzos removed the request for review from rsevilla87 June 2, 2026 15:38

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
ansible/roles/hv-metrics-server/tasks/main.yaml (1)

118-120: ⚡ Quick win

Make the one-shot initialization idempotent.

ansible.builtin.command always reports changed, breaking idempotency on re-runs. Since the script produces registry_traffic.prom, gate it with creates (or set changed_when).

♻️ Proposed change
     - name: Run collector once to initialize iptables rules and metrics file
       ansible.builtin.command:
         cmd: "{{ hv_metrics_server_work_dir }}/registry-traffic-collector.sh"
+        creates: "{{ hv_metrics_server_bastion_exporter_textfile_dir }}/registry_traffic.prom"

As per coding guidelines: "Idempotency: Tasks should be idempotent (can be run multiple times safely)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ansible/roles/hv-metrics-server/tasks/main.yaml` around lines 118 - 120, The
one-shot init task using ansible.builtin.command (cmd: "{{
hv_metrics_server_work_dir }}/registry-traffic-collector.sh") is non-idempotent;
update the task to avoid always reporting changed by gating the command with
creates pointing to the generated file (registry_traffic.prom) or by adding
changed_when: false so re-runs are no-op; ensure you reference the same
hv_metrics_server_work_dir and registry-traffic-collector.sh invocation when
adding the creates/changed_when condition.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ansible/roles/hv-metrics-server/files/dashboards/bastion-dashboard.json`:
- Around line 496-503: The "Network Interfaces" row currently has "title":
"Network Interfaces" with "y": 54 causing a large gap after Disk I/O; update
this row's "y" from 54 to the correct value (e.g., 36) and then shift all
dependent panels that follow (the Network panels and the subsequent "Mirror
Registry" row and its panels) up by the same delta so the Dashboard layout is
contiguous; locate the JSON objects with "title": "Network Interfaces" and
"title": "Mirror Registry" and adjust their "gridPos.y" (and any panels whose
gridPos.y falls between the old and new positions) to remove the gap.

In `@ansible/roles/hv-metrics-server/tasks/main.yaml`:
- Around line 91-120: The registry traffic monitoring block runs
registry-traffic-collector.sh which expects
hv_metrics_server_bastion_exporter_textfile_dir to exist but that directory is
only created when hv_metrics_server_enable_bastion_exporter is true; either
ensure the directory is created inside the Registry traffic monitoring block or
gate the block on hv_metrics_server_enable_bastion_exporter as well.
Specifically, update the "Registry traffic monitoring block" (tasks that render
registry-traffic-collector.sh, registry-traffic-collector.service,
registry-traffic-collector.timer and run the collector) to either add a task
that creates the directory hv_metrics_server_bastion_exporter_textfile_dir (use
file/mode/owner/group) before rendering templates and running the script, or add
the additional when: hv_metrics_server_enable_bastion_exporter condition to the
block to prevent running without the Bastion exporter enabled.

In
`@ansible/roles/hv-metrics-server/templates/registry-traffic-collector.timer.j2`:
- Line 2: Description text in the registry-traffic-collector.timer unit is
inconsistent with the timer interval; either update the Description to reflect
the configured OnUnitActiveSec=5s or change OnUnitActiveSec to 15s if 15s was
intended. Edit the registry-traffic-collector.timer.j2 template: locate the
Description=... line and the OnUnitActiveSec=... setting and make them
consistent (e.g., set Description="Collect mirror registry traffic metrics every
5 seconds" when keeping OnUnitActiveSec=5s, or change OnUnitActiveSec to 15s to
match the current description).

In `@docs/hv-metrics.md`:
- Line 24: Update the table row describing hv-metrics-server to fix the grammar
mismatch "a node-exporter containers" by changing it to either "a node-exporter
container" (singular) or "node-exporter containers" (drop the article) so the
description for hv-metrics-server / Bastion reads consistently with existing
docs formatting.

---

Nitpick comments:
In `@ansible/roles/hv-metrics-server/tasks/main.yaml`:
- Around line 118-120: The one-shot init task using ansible.builtin.command
(cmd: "{{ hv_metrics_server_work_dir }}/registry-traffic-collector.sh") is
non-idempotent; update the task to avoid always reporting changed by gating the
command with creates pointing to the generated file (registry_traffic.prom) or
by adding changed_when: false so re-runs are no-op; ensure you reference the
same hv_metrics_server_work_dir and registry-traffic-collector.sh invocation
when adding the creates/changed_when condition.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 82df00a6-c732-4997-9af4-8aa4fe4f9bb7

📥 Commits

Reviewing files that changed from the base of the PR and between e37a447 and f011c8e.

📒 Files selected for processing (16)
  • README.md
  • ansible/roles/hv-metrics-server/defaults/main/bastion-registry.yml
  • ansible/roles/hv-metrics-server/defaults/main/main.yaml
  • ansible/roles/hv-metrics-server/files/dashboards/bastion-dashboard.json
  • ansible/roles/hv-metrics-server/files/dashboards/hv-metrics-dashboard.json
  • ansible/roles/hv-metrics-server/files/dashboards/hypervisors-dashboard.json
  • ansible/roles/hv-metrics-server/tasks/main.yaml
  • ansible/roles/hv-metrics-server/templates/hv-metrics-net.network.j2
  • ansible/roles/hv-metrics-server/templates/hv-metrics.kube.j2
  • ansible/roles/hv-metrics-server/templates/hv-metrics.yaml.j2
  • ansible/roles/hv-metrics-server/templates/prometheus.yaml.j2
  • ansible/roles/hv-metrics-server/templates/registry-traffic-collector.service.j2
  • ansible/roles/hv-metrics-server/templates/registry-traffic-collector.sh.j2
  • ansible/roles/hv-metrics-server/templates/registry-traffic-collector.timer.j2
  • ansible/vars/.gitignore
  • docs/hv-metrics.md
💤 Files with no reviewable changes (3)
  • ansible/roles/hv-metrics-server/templates/hv-metrics-net.network.j2
  • ansible/roles/hv-metrics-server/files/dashboards/hv-metrics-dashboard.json
  • ansible/roles/hv-metrics-server/templates/hv-metrics.kube.j2

Comment on lines +496 to +503
"title": "Network Interfaces",
"type": "row",
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 54
},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Network Interfaces row y leaves a large empty gap. Disk I/O panels end at y:36 (y:28 + h:8), but this row starts at y:54, leaving ~18 grid units of blank space. The Hypervisors dashboard places the equivalent row at y:36 with no gap.

🎨 Proposed fix (shift Network + Registry rows up)
-      "gridPos": {
-        "h": 1,
-        "w": 24,
-        "x": 0,
-        "y": 54
-      },
+      "gridPos": {
+        "h": 1,
+        "w": 24,
+        "x": 0,
+        "y": 36
+      },

Adjust the dependent panel y values (and the subsequent Mirror Registry row/panels) accordingly so the layout is contiguous.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"title": "Network Interfaces",
"type": "row",
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 54
},
"title": "Network Interfaces",
"type": "row",
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 36
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ansible/roles/hv-metrics-server/files/dashboards/bastion-dashboard.json`
around lines 496 - 503, The "Network Interfaces" row currently has "title":
"Network Interfaces" with "y": 54 causing a large gap after Disk I/O; update
this row's "y" from 54 to the correct value (e.g., 36) and then shift all
dependent panels that follow (the Network panels and the subsequent "Mirror
Registry" row and its panels) up by the same delta so the Dashboard layout is
contiguous; locate the JSON objects with "title": "Network Interfaces" and
"title": "Mirror Registry" and adjust their "gridPos.y" (and any panels whose
gridPos.y falls between the old and new positions) to remove the gap.

Comment thread ansible/roles/hv-metrics-server/tasks/main.yaml Outdated
Comment thread ansible/roles/hv-metrics-server/templates/registry-traffic-collector.timer.j2 Outdated
Comment thread docs/hv-metrics.md Outdated
Extends the hv-metrics-server role to monitor the bastion machine itself.
Adds node-exporter as a container in the hv-metrics pod for system metrics
(CPU, memory, disk, network) and adds iptables-based accounting for mirror
registry traffic via a textfile collector script on a 5s systemd timer.
Registry traffic monitoring activates automatically when setup_bastion_registry
is true, using the existing registry_port variable via the defaults symlink
pattern.

Switches the hv-metrics pod to host networking to match all other Jetlag
pods and enable accurate host network interface metrics from node-exporter.
Removes the custom pod network, PublishPort directives, and related dead
variables.

The registry collector script also gathers container-level CPU and memory
usage from cgroup files and active connection count via ss, covering both
IPv4 and IPv6 traffic paths.

Includes new Grafana dashboards for bastion metrics and improvements to the
existing HV metrics dashboard: memory utilization panel, load averages,
dynamic network interfaces, collapsed network row, stable UID, and legend
tables with last/min/max on all timeseries panels.

Adds docs/hv-metrics.md documentation page linked from the README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@akrzos akrzos force-pushed the monitor_bastion branch from f011c8e to 176b332 Compare June 3, 2026 21:05
@mcornea

mcornea commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

/lgtm

@mcornea

mcornea commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

/approve

@openshift-ci

openshift-ci Bot commented Jun 4, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mcornea

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label Jun 4, 2026
@openshift-merge-bot openshift-merge-bot Bot merged commit 4e98b63 into redhat-performance:main Jun 4, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants