Implement vLLM VM Monitoring and Metrics Export Pipeline by leehyeoklee · Pull Request #2331 · cloud-barista/cb-tumblebug

leehyeoklee · 2026-02-26T12:36:01Z

Implement vLLM VM Monitoring and Metrics Export Pipeline

Overview

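This PR introduces a monitoring and metrics export pipeline for GPU VMs that serve models via vLLM. It consists of three scripts: a sensor setup script for each target VM, a setup script for the central monitoring VM, and a tool that exports collected metrics to CSV.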

Pipeline Architecture & Features

1. Target VM Setup (setup_gpu_sensor.sh)

This script is executed on the target GPU VMs to set up the data collection endpoints.

Auto-GPU Detection & Profiling: Automatically detects the GPU vendor on the VM and deploys the matching official exporter:

NVIDIA: Uses DCGM Exporter for deep GPU insights.
AMD: Uses ROCm Device Metrics Exporter for AMD-specific metrics.

System Metrics: Deploys node-exporter to track host-level information (CPU, Memory, Network).

vLLM Metrics: Exposes native serving and inference metrics provided by the vLLM engine.

Telegraf Aggregation: Uses Telegraf to collect and aggregate these three data sources (System, GPU, vLLM) and make them available to the central monitoring VM.
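The flow on a target VM can be sketched as follows. This is a minimal illustration in Python, not the contents of setup_gpu_sensor.sh: the function names, the detection method (checking for `nvidia-smi` or `rocm-smi` on the PATH), and every port number are assumptions, and the sketch further assumes Telegraf re-exposes the aggregated metrics through its `prometheus_client` output so the central VM can scrape a single endpoint.

```python
import shutil
from typing import Callable, Optional

# Assumed ports (hypothetical for this sketch); verify against each
# exporter's actual configuration on the VM.
NODE_EXPORTER_PORT = 9100
GPU_EXPORTER_PORTS = {"nvidia": 9400, "amd": 5000}


def detect_gpu_vendor(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return 'nvidia', 'amd', or None based on which vendor CLI is on PATH."""
    if which("nvidia-smi"):
        return "nvidia"
    if which("rocm-smi"):
        return "amd"
    return None


def render_telegraf_config(
    vendor: Optional[str], vllm_port: int = 8000, listen_port: int = 9273
) -> str:
    """Render a Telegraf config that scrapes the local exporters and
    re-exposes everything on one Prometheus-compatible endpoint."""
    urls = [f"http://localhost:{NODE_EXPORTER_PORT}/metrics"]
    if vendor in GPU_EXPORTER_PORTS:
        # Only add a GPU source when a supported vendor was detected.
        urls.append(f"http://localhost:{GPU_EXPORTER_PORTS[vendor]}/metrics")
    urls.append(f"http://localhost:{vllm_port}/metrics")

    url_list = ", ".join(f'"{u}"' for u in urls)
    return (
        "[[inputs.prometheus]]\n"
        f"  urls = [{url_list}]\n"
        "\n"
        "[[outputs.prometheus_client]]\n"
        f'  listen = ":{listen_port}"\n'
    )


if __name__ == "__main__":
    print(render_telegraf_config(detect_gpu_vendor()))
```

Passing `which` as a parameter keeps the detection logic testable on machines without a GPU.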

2. Central Monitoring Setup (setup_monitoring.sh)
This script sets up the central monitoring VM, which gathers data from the target VMs.

Multi-Node Support: Collects metrics from multiple GPU VMs at the same time, identified by their IP addresses.
Prometheus Integration: Periodically scrapes and stores the time-series data from the target VMs.
Grafana Dashboards: Enables real-time visualization and monitoring of the aggregated metrics.
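Multi-node scraping of this kind usually comes down to one Prometheus job with a static target per VM. The sketch below renders such a config from a list of IPs. It is illustrative only: the function name, job name, scrape interval, and port 9273 (assumed here to be the port each target VM exposes its aggregated metrics on) are hypothetical and not taken from setup_monitoring.sh.

```python
from typing import Iterable


def render_scrape_config(
    target_ips: Iterable[str],
    port: int = 9273,          # assumed metrics port on each target VM
    interval: str = "15s",     # assumed scrape interval
    job_name: str = "gpu-vms", # hypothetical job name
) -> str:
    """Render a minimal prometheus.yml with one static target per GPU VM."""
    targets = ", ".join(f"'{ip}:{port}'" for ip in target_ips)
    return (
        "global:\n"
        f"  scrape_interval: {interval}\n"
        "scrape_configs:\n"
        f"  - job_name: '{job_name}'\n"
        "    static_configs:\n"
        f"      - targets: [{targets}]\n"
    )


if __name__ == "__main__":
    # Example IPs are placeholders.
    print(render_scrape_config(["10.0.0.1", "10.0.0.2"]))
```

Adding a VM then only requires appending its IP to the list and reloading Prometheus.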

3. Data Extraction Tool (export_metrics.sh)
A custom extraction tool for data analysis and reporting.

Config-Driven Exports: Uses a simple configuration file (.conf) to customize the extraction.
Granular Control: Users can specify exactly which metrics to extract, which specific VMs (IPs) to target, and the desired time duration.
CSV Output: Automatically formats and saves the requested time-series data into clean CSV files.
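To make the CSV step concrete, the sketch below flattens a Prometheus range-query response (the `matrix` result returned by `/api/v1/query_range`) into CSV rows. It is an assumption about how such an export could work, not the logic of export_metrics.sh: the function name, the column layout, and the sample values are hypothetical.

```python
import csv
import io


def matrix_to_csv(response: dict) -> str:
    """Flatten a Prometheus query_range (matrix) response into CSV text."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["timestamp", "metric", "instance", "value"])

    # Each series carries its labels plus a list of [timestamp, "value"] pairs.
    for series in response["data"]["result"]:
        labels = series["metric"]
        for timestamp, value in series["values"]:
            writer.writerow(
                [timestamp, labels.get("__name__", ""), labels.get("instance", ""), value]
            )
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up sample shaped like a query_range response.
    sample = {
        "status": "success",
        "data": {
            "resultType": "matrix",
            "result": [
                {
                    "metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL", "instance": "10.0.0.1:9273"},
                    "values": [[1700000000, "42.5"], [1700000015, "43"]],
                }
            ],
        },
    }
    print(matrix_to_csv(sample))
```

One row per sample keeps the file easy to filter by metric or VM in a spreadsheet or with pandas.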

seokho-son · 2026-02-26T12:44:49Z

/approve

leehyeoklee added 3 commits February 26, 2026 21:04

Add scripts for GPU VM monitoring setup

bdd745b

Enhance GPU monitoring setup scripts to support AMD/NVIDIA GPU

6e41cc7

Add Prometheus metrics CSV exporter script

01dcd3c

leehyeoklee requested review from seokho-son and yunkon-kim as code owners February 26, 2026 12:36

github-actions bot approved these changes Feb 26, 2026

github-actions bot added the approved This PR is approved and will be merged soon. label Feb 26, 2026

cb-github-robot merged commit d02d1ba into cloud-barista:main Feb 26, 2026
1 of 2 checks passed

cb-github-robot merged 3 commits into cloud-barista:main from leehyeoklee:gpu-telemetry
