
Implement vLLM VM Monitoring and Metrics Export Pipeline#2331

Merged
cb-github-robot merged 3 commits into cloud-barista:main from leehyeoklee:gpu-telemetry
Feb 26, 2026


@leehyeoklee
Contributor

Implement vLLM VM Monitoring and Metrics Export Pipeline

Overview

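This PR adds a monitoring and metrics export pipeline for GPU VMs that serve models with vLLM. It covers three parts: a sensor setup script for the target VMs, a setup script for the central monitoring VM, and a tool that exports the collected metrics to CSV.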

Pipeline Architecture & Features

1. Target VM Setup (setup_gpu_sensor.sh)

This script is executed on the target GPU VMs to set up the data collection endpoints.

Auto-GPU Detection & Profiling: Automatically detects the GPU vendor on the VM and deploys that vendor's official exporter:

  • NVIDIA: Uses DCGM Exporter for deep GPU insights.
  • AMD: Uses ROCm Device Metrics Exporter for AMD-specific metrics.
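The vendor selection above can be sketched as follows. This is a minimal illustration, not the logic in setup_gpu_sensor.sh: probing for `nvidia-smi` and `rocm-smi` on the PATH is an assumption, and `detect_gpu_exporter` and `VENDOR_PROBES` are hypothetical names.

```python
import shutil
from typing import Callable, Optional, Tuple

# Hypothetical mapping from a vendor CLI tool to (vendor, exporter).
# Probing for these tools is an assumption; the actual script may detect
# the GPU differently (for example via lspci).
VENDOR_PROBES = {
    "nvidia-smi": ("nvidia", "DCGM Exporter"),
    "rocm-smi": ("amd", "ROCm Device Metrics Exporter"),
}


def detect_gpu_exporter(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[Tuple[str, str]]:
    """Return (vendor, exporter) for the first vendor tool found on PATH."""
    for tool, choice in VENDOR_PROBES.items():
        if which(tool) is not None:
            return choice
    return None  # no supported GPU tooling found


if __name__ == "__main__":
    print(detect_gpu_exporter())
```

Passing `which` as a parameter keeps the check testable on machines without a GPU.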

System Metrics: Deploys node-exporter to track host-level information (CPU, Memory, Network).

vLLM Metrics: Exposes native serving and inference metrics provided by the vLLM engine.
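To illustrate what these metrics look like on the wire, here is a small sketch that filters vLLM samples out of a Prometheus text exposition payload. The parser is deliberately simplified (it ignores escapes inside label values beyond basic quoting), the model name `demo` is made up, and `parse_metrics` is a hypothetical helper rather than part of this PR.

```python
import re
from typing import Dict, Iterator, Tuple

# Matches: name{labels} value   (the label block is optional).
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Illustrative payload; the model name is an assumption.
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 2.0
node_load1 0.5
"""


def parse_metrics(
    text: str, prefix: str = "vllm:"
) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (name, labels, value) for samples whose name starts with prefix."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None or not match.group(1).startswith(prefix):
            continue
        labels = dict(_LABEL.findall(match.group(2) or ""))
        yield match.group(1), labels, float(match.group(3))


if __name__ == "__main__":
    for sample in parse_metrics(SAMPLE):
        print(sample)
```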

Telegraf Aggregation: Uses Telegraf to collect these three data sources (System, GPU, vLLM) on each target VM and route them to the central monitoring VM.
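One way the aggregation could be configured is sketched below. It renders a Telegraf config that scrapes three local endpoints with `inputs.prometheus` and re-exposes them on a single port with `outputs.prometheus_client`. The ports (9100, 9400, 8000, 9273) are common defaults assumed for illustration, and both the output plugin choice and `render_telegraf_conf` are assumptions, not taken from this PR.

```python
from typing import Dict, Optional

# Assumed local endpoints; the real script may use different ports,
# and the AMD exporter listens on a different port than DCGM Exporter.
DEFAULT_SOURCES = {
    "system": "http://localhost:9100/metrics",  # node-exporter
    "gpu": "http://localhost:9400/metrics",     # DCGM Exporter
    "vllm": "http://localhost:8000/metrics",    # vLLM server
}


def render_telegraf_conf(
    sources: Optional[Dict[str, str]] = None, listen: str = ":9273"
) -> str:
    """Render a minimal Telegraf config that merges the given endpoints."""
    sources = DEFAULT_SOURCES if sources is None else sources
    urls = ", ".join(f'"{url}"' for url in sources.values())
    return (
        "[[inputs.prometheus]]\n"
        f"  urls = [{urls}]\n"
        "\n"
        "[[outputs.prometheus_client]]\n"
        f'  listen = "{listen}"\n'
    )


if __name__ == "__main__":
    print(render_telegraf_conf())
```

With this layout the central VM only needs one scrape target per GPU VM instead of three.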

2. Central Monitoring Setup (setup_monitoring.sh)
This script sets up the central monitoring VM, which gathers data from the target VMs.

  • Multi-Node Support: Easily connects and gathers information from multiple GPU VMs simultaneously using their IP addresses.
  • Prometheus Integration: Periodically scrapes and stores the time-series data from the target VMs.
  • Grafana Dashboards: Enables real-time visualization and monitoring of the aggregated metrics.

3. Data Extraction Tool (export_metrics.sh)
A custom extraction tool for data analysis and reporting.

  • Config-Driven Exports: Uses a simple configuration file (.conf) to customize the extraction.
  • Granular Control: Users can specify exactly which metrics to extract, which specific VMs (IPs) to target, and the desired time duration.
  • CSV Output: Automatically formats and saves the requested time-series data into clean CSV files.
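A rough sketch of the export flow described above: read a key=value config, then flatten a Prometheus range-query result into CSV rows. The config keys (`METRICS`, `TARGETS`, `DURATION`) and the helper names are hypothetical, since the actual .conf format is not shown here. The response shape follows the Prometheus HTTP API `query_range` matrix result.

```python
import csv
import io
from datetime import datetime, timezone
from typing import Dict, List


def parse_conf(text: str) -> Dict[str, List[str]]:
    """Parse KEY=a,b,c lines into lists; '#' starts a comment."""
    conf: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return conf


def matrix_to_csv(response: dict) -> str:
    """Flatten a query_range matrix response into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timestamp", "metric", "instance", "value"])
    for series in response["data"]["result"]:
        labels = series["metric"]
        for ts, value in series["values"]:
            stamp = datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat()
            writer.writerow(
                [stamp, labels.get("__name__", ""), labels.get("instance", ""), value]
            )
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical config contents; the IP and metric name are made up.
    print(parse_conf("METRICS=vllm:num_requests_running\nTARGETS=10.0.0.5\nDURATION=1h"))
```

Keeping the CSV conversion separate from the HTTP call makes it possible to test the formatting against a saved response.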

@seokho-son
Member

/approve

@github-actions bot added the "approved" label (This PR is approved and will be merged soon.) on Feb 26, 2026
@cb-github-robot merged commit d02d1ba into cloud-barista:main on Feb 26, 2026
1 of 2 checks passed
