Sonar is a tool to profile usage of HPC resources by regularly sampling processes, jobs, accelerators, nodes, queues, and clusters.
Sonar examines /proc and /sys and/or runs diagnostic programs, filters and groups the
information, and prints it to stdout, stores it in a local directory tree, or sends it to a remote
collector.
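The sample, filter, group, and print cycle described above can be illustrated with a minimal Python sketch. This is not Sonar's code or its output format: the function names (`parse_stat`, `sample_processes`, `filter_and_group`), the `min_cpu_ticks` threshold, and the JSON layout are all assumptions made for illustration. The only thing taken from Linux itself is the field layout of `/proc/<pid>/stat`.

```python
import json
import os
from collections import defaultdict


def parse_stat(text):
    """Parse the contents of one /proc/<pid>/stat file into a sample record."""
    # The command name is parenthesized and may itself contain spaces or
    # parentheses, so split around the first "(" and the last ")".
    lparen = text.index("(")
    rparen = text.rindex(")")
    rest = text[rparen + 1:].split()  # rest[0] is field 3 (state)
    return {
        "pid": int(text[:lparen]),
        "cmd": text[lparen + 1:rparen],
        "cpu_ticks": int(rest[11]) + int(rest[12]),  # utime + stime (fields 14, 15)
        "rss_pages": int(rest[21]),                  # resident set size (field 24)
    }


def sample_processes(proc_root="/proc"):
    """Take one sample of every process currently visible under proc_root."""
    samples = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                samples.append(parse_stat(f.read()))
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the file was unreadable
    return samples


def filter_and_group(samples, min_cpu_ticks=1):
    """Drop idle processes (hypothetical threshold) and aggregate by command."""
    groups = defaultdict(lambda: {"cpu_ticks": 0, "rss_pages": 0, "pids": []})
    for s in samples:
        if s["cpu_ticks"] < min_cpu_ticks:
            continue
        g = groups[s["cmd"]]
        g["cpu_ticks"] += s["cpu_ticks"]
        g["rss_pages"] += s["rss_pages"]
        g["pids"].append(s["pid"])
    return dict(groups)


if __name__ == "__main__":
    # Print one grouped sample to stdout, as the simplest of the three sinks.
    print(json.dumps(filter_and_group(sample_processes()), indent=2))
```

Keeping the parsing separate from the directory scan means the parser can be exercised on captured text without a live `/proc`, which is also how the sketch would be tested on a non-Linux machine.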
Sonar proper is licensed under GPL-3, but some side components that are crucial for interaction with other tools, which might not be GPL-licensed, carry the MIT license.
Image: Midjourney, CC BY-NC 4.0
Start by reading the user manual, which explains most of what Sonar can do and how to make it do it.
For a deeper dive into how it works, try the design document.
To build it, or to modify it, try the developer document.
A sample deployment of Sonar on a cluster and a data aggregator on a backend is outlined in doc/HOWTO-DEPLOY.md.
Sonar's output data are rigorously specified and you can build your own data collectors, post-processors and analyses, but you can also use these existing tools (both under active development):
- JobAnalyzer allows Sonar logs to be queried and analyzed, and provides dashboards, interactive and batch queries, and reporting of system activity, policy violations, hung jobs, and more.
- Slurm-monitor is complementary to JobAnalyzer: it focuses on managing and analyzing Slurm queues and clusters, and includes a benchmarking facility and other tools for job placement.
- Radovan Bast
- Mathias Bockwoldt
- Lars T. Hansen
- Henrik Rojas Nagel
- Thomas Roehr
Sonar's original vision was to be a very simple, lightweight tool that did some basic things fairly cheaply and produced easy-to-process output for subsequent scripting. Sonar is no longer that: with GPU integration, SLURM integration, Kafka exfiltration, memory-resident modes, structured output, a continual focus on performance, and several elaborate backends, it is becoming as complex as the tools it was intended to replace or compete with.
Here are some of those tools:
- Trailblazing Turtle, SLURM-specific but similar to Sonar.
- Scaphandre, for energy monitoring.
- Sysstat and SAR, for monitoring a lot of things.
- seff, SLURM-specific.
- TACC Remora
- appusage, a reference implementation which serves as inspiration: https://github.com/UNINETTSigma2/appusage
- TACC Stats
- Ganglia Monitoring System
