Hedera Network Monitor
Production-Grade Blockchain Monitoring in Go
What It Does
The Hedera Network Monitor is a comprehensive monitoring and alerting system for the Hedera distributed ledger network. If you’re running applications on Hedera, you need to know when account balances drop below thresholds, when transactions fail, or when network nodes become unavailable. This tool handles all of that with real-time alerts and a clean REST API.
The monitor is a blockchain watchdog: it runs 24/7, collecting metrics, evaluating alert conditions, and notifying you through webhooks (Slack, Discord, PagerDuty, etc.) when something needs attention.
Core Features
Account Monitoring
Track multiple Hedera accounts simultaneously
Monitor balance changes in real-time
Alert when balances fall below configurable thresholds
Historical balance tracking via REST API
Network Health
Monitor Hedera node availability
Track network status across testnet and mainnet
Alert on network degradation
Flexible Alerting
Webhook notifications to your service (Slack, Discord, custom endpoints)
Configurable alert rules with multiple severity levels
Alert cooldowns to prevent notification spam
Exponential backoff with retry logic for failed webhooks
Developer-Friendly Tooling
REST API for programmatic access to metrics
CLI tool for quick queries and configuration
Clean configuration via YAML files
Comprehensive logging for debugging
Why I Built This
The Hedera ecosystem has excellent developer tools for building applications, but its monitoring infrastructure is lacking.
I recently contributed to the Hedera protocol itself, achieving a major performance improvement in its protocol buffer implementation. That work made clear that production teams need robust monitoring to operate their nodes with confidence.
Most blockchain monitoring solutions are either:
Expensive SaaS platforms with vendor lock-in
Generic tools that require extensive customization
Built for Ethereum and poorly adapted to other chains
I wanted something purpose-built for Hedera that teams could self-host, customize, and trust in production.
Architecture Overview
The monitor runs as a daemon service with several concurrent components:
Monitor Service
├── Collectors (gather metrics from Hedera)
│ ├── Account Collector (balances, transactions)
│ └── Network Collector (node status)
├── Alert Manager (evaluates rules, dispatches webhooks)
├── Storage (in-memory for MVP, PostgreSQL/InfluxDB planned)
└── REST API Server (exposes metrics and configuration)
Everything runs concurrently using Go’s goroutines, coordinated through channels and context-based cancellation. When you shut down the service (Ctrl+C), all components receive the signal and clean up gracefully. No orphaned connections or lost data.
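To make the wiring concrete, here’s a sketch of the collector abstraction this architecture implies. The exact interface is illustrative, not the project’s verbatim code, but it mirrors the Collect call used in the errgroup example later in this post:

// Illustrative sketch - the real interface may differ.
type Collector interface {
    // Collect runs until ctx is cancelled, writing gathered metrics to
    // store and handing them to the alert manager for rule evaluation.
    Collect(ctx context.Context, store Storage, alerts *Manager) error
}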
Real-World Usage Example
Here’s how you’d use this in production. Copy config/config.example.yaml to config/config.yaml and customize it as appropriate.
1. Configure your monitoring:
network:
  name: mainnet
  operator_id: "0.0.1234"
  operator_key: "your-private-key"

accounts:
  - id: "0.0.5000"
    label: "Main Treasury"
  - id: "0.0.5001"
    label: "Fee Account"

alerting:
  enabled: true
  webhooks:
    - "https://hooks.slack.com/services/YOUR/WEBHOOK"
  rules:
    - name: "Treasury Balance Critical"
      metric_name: "account_balance"
      condition: "<"
      threshold: 50000000000 # 500 HBAR in tinybars
      severity: "critical"
    - name: "Fee Account Low"
      metric_name: "account_balance"
      condition: "<"
      threshold: 10000000000 # 100 HBAR in tinybars
      severity: "warning"
2. Start the service:
./monitor --config config/config.yaml
3. Query metrics via command-line interface (CLI):
# Check current balance
./hmon account balance 0.0.5000
# List active alert rules
./hmon alerts list
# Get network status
./hmon network status
4. Query metrics via API:
# Get recent metrics (quote the URL so the shell doesn't treat & as a job separator)
curl "http://localhost:8080/api/v1/metrics?name=account_balance&limit=10"
# Get metrics for a specific account
curl "http://localhost:8080/api/v1/metrics/account?key=account_id&value=0.0.5000" | jq '.metrics[-5:]'
When your treasury balance drops below 500 HBAR, you get a Slack notification immediately. No manual checking, no missed alerts.
What I Learned Building This in Go
This was my first production-scale Go project. It taught me why Go has become the standard for infrastructure tooling. Here are my key insights.
1. Goroutines Make Concurrency Easy
In Java and C#, threading is expensive and complex. Go’s goroutines changed everything.
In this project, we run:
1 goroutine for the API server
2 goroutines for collectors (account and network monitoring)
1 goroutine for the alert manager
N goroutines for webhook dispatch (one per webhook, fired in parallel)
In Java, managing 5+ threads would require careful thread pool configuration, executor services, and significant overhead. In Go, it’s just:
eg, ctx := errgroup.WithContext(context.Background())

// Start API server
eg.Go(func() error {
    return server.Start(ctx)
})

// Start collectors
for _, collector := range collectors {
    c := collector // Capture loop variable!
    eg.Go(func() error {
        return c.Collect(ctx, store, alertManager)
    })
}

// Wait for all to complete or first error
if err := eg.Wait(); err != nil {
    log.Fatal(err)
}
Each goroutine is only about 2KB of stack space. You could spawn 10,000 if needed. The Go runtime handles multiplexing them onto OS threads automatically.
Critical lesson: The loop variable capture (c := collector) is essential. Without it, every closure would reference the same loop variable and you’d get a data race. This is one of Go’s classic pitfalls to learn. (Go 1.22 changed the semantics so each iteration gets its own variable, but the explicit capture still works and reads clearly on older toolchains.)
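Here’s a standalone toy example of the pitfall, independent of the monitor’s code, that you can run directly:

package main

import (
    "fmt"
    "sync"
)

func main() {
    items := []string{"a", "b", "c"}
    var wg sync.WaitGroup
    for _, item := range items {
        item := item // Per-iteration copy; required before Go 1.22
        wg.Add(1)
        go func() {
            defer wg.Done()
            fmt.Println(item) // Prints a, b, c (in some order)
        }()
    }
    wg.Wait()
}

Drop the capture line on a pre-1.22 toolchain and you’ll typically see "c" printed three times instead.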
2. Context-Based Cancellation Is Elegant
Graceful shutdown in distributed systems is notoriously difficult. Go’s context package makes it straightforward, and the defer keyword schedules cleanup to run when the enclosing function returns, much like Java’s try-finally.
// Create cancellable context
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

// Listen for shutdown signals
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

go func() {
    <-sigChan
    log.Println("Shutting down gracefully...")
    cancel() // Signal all goroutines to stop
}()
Every goroutine checks ctx.Done() in its main loop to detect when the process is shutting down:
for {
    select {
    case <-ctx.Done():
        log.Println("Received shutdown signal")
        return ctx.Err()
    case metric := <-metricChan:
        // Process metric
    }
}
When you press Ctrl+C:
1. The signal handler calls cancel()
2. All contexts receive the cancellation signal
3. Each goroutine cleans up and exits
4. Main waits for all goroutines via errgroup.Wait()
5. The process exits cleanly
No zombie processes, no leaked connections, no corruption. Just clean shutdown.
3. Interfaces Enable True Modularity
Go’s interface system is minimal and powerful. Go uses implicit satisfaction, which differs from Java’s explicit implements and from Python’s duck typing (it’s checked at compile time).
If your type has the right methods, it implements the interface automatically. This made my storage layer flexible:
type Storage interface {
    StoreMetric(metric Metric) error
    GetMetrics(name string, limit int) ([]Metric, error)
    Close() error
}
The MVP uses in-memory storage:
var store Storage = storage.NewMemoryStorage()
For production we can implement and swap in PostgreSQL:
var store Storage = storage.NewPostgresStorage(connString)
The rest of the codebase doesn’t know or care. The collectors just call store.StoreMetric(), and it works regardless of the backend.
Key insight: Interface-driven design helps you build systems that can evolve without massive refactoring. And it’s very useful for testing, as the sketch below shows.
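A unit test can swap in an in-memory test double. A minimal sketch, assuming the Metric struct shown later in this post (mockStorage is an illustrative name, not the project’s actual test helper):

// Illustrative test double satisfying the Storage interface.
type mockStorage struct {
    mu      sync.Mutex
    metrics []Metric
}

func (m *mockStorage) StoreMetric(metric Metric) error {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.metrics = append(m.metrics, metric)
    return nil
}

func (m *mockStorage) GetMetrics(name string, limit int) ([]Metric, error) {
    m.mu.Lock()
    defer m.mu.Unlock()
    var out []Metric
    for _, mt := range m.metrics {
        if mt.Name == name {
            out = append(out, mt)
        }
    }
    if limit > 0 && len(out) > limit {
        out = out[len(out)-limit:] // Keep the most recent entries
    }
    return out, nil
}

func (m *mockStorage) Close() error { return nil }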
4. Testing Async Code Requires Discipline
The hardest part of this project wasn’t writing the code. It was testing the asynchronous behavior properly.
The problem: Async tests are slow. Testing webhook retry logic with exponential backoff takes real time:
First attempt: immediate
Retry 1: 10ms delay
Retry 2: 20ms delay
Retry 3: 40ms delay
Retry 4: 80ms delay
A single test can take 5+ seconds. If you have 10 such tests, you’re waiting nearly a minute before every commit. That breaks the fast feedback loop that developers need.
My solution: Split tests by speed using build tags.
Fast tests (run on every commit, ~4 seconds):
// No build tag - runs by default
func TestAlertCondition_GreaterThan(t *testing.T) {
    // Pure logic, no waiting
    result := EvaluateCondition(100, ">", 50)
    if !result {
        t.Error("Expected true")
    }
}
Slow tests (run before pushing, ~30-60 seconds):
//go:build integration
// +build integration

func TestEndToEndWebhookRetry(t *testing.T) {
    // Actually waits for retries, tests real async behavior
    manager.CheckMetric(metric)
    time.Sleep(5 * time.Second)
    // Verify webhook was called after retries
}
In the Makefile:
test-unit:
go test ./... # Fast, no integration tag
test-integration:
go test -tags integration ./... # Slow, full validation
This way we get instant feedback (<5s) on every commit, and comprehensive validation before pushing.
5. Implementing Robust Webhook Delivery
When alerts fire, you need confidence they reach you. Networks fail, services restart, rate limits kick in. The solution is exponential backoff.
The pattern is simple: after each failed attempt, wait twice as long before retrying.
func (m *Manager) sendWebhookWithRetry(webhook string, alert Alert) error {
    maxRetries := 4
    baseDelay := 10 * time.Millisecond

    for attempt := 0; attempt <= maxRetries; attempt++ {
        err := m.sendWebhook(webhook, alert)
        if err == nil {
            return nil // Success!
        }
        if attempt < maxRetries {
            delay := baseDelay * (1 << attempt) // 10ms, 20ms, 40ms, 80ms
            time.Sleep(delay)
        }
    }
    return fmt.Errorf("webhook failed after %d retries", maxRetries)
}

Why this works:
Transient failures recover - if a network blip lasts 50ms, the second attempt succeeds.
No endpoint hammering - a backed-up or restarting server gets breathing room to recover.
Rate limit friendly - backing off gives rate-limit windows time to reset.
Fast when possible - the first retry comes after only 10ms, and most failures resolve quickly.
The full sequence of four retries waits only 150ms in total (10+20+40+80), so genuine failures fail fast while temporary issues get the chance to self-resolve.
Testing Exponential Backoff Without Flakiness
A testing challenge was verifying the exponential backoff timing. You can’t hardcode “expect exactly 10ms delay,” because system load, CI runners, and scheduler jitter make that flaky.
The wrong approach:
delay := timestamps[1].Sub(timestamps[0])
if delay != 10*time.Millisecond { // Fails randomly!
    t.Fatal("Expected 10ms")
}
The right approach - test the ratio with tolerance:
var delays []time.Duration
for i := 1; i < len(timestamps); i++ {
    delays = append(delays, timestamps[i].Sub(timestamps[i-1]))
}

// Verify exponential growth: each delay ~2x the previous
tolerance := 0.5 // Allow 50% variance
for i := 1; i < len(delays); i++ {
    ratio := float64(delays[i]) / float64(delays[i-1])
    if ratio < (2.0-tolerance) || ratio > (2.0+tolerance) {
        t.Errorf("Delay %d: ratio %.2f, expected ~2.0", i, ratio)
    }
}
This verifies the exponential property (each retry is ~2x longer) while tolerating system variance. It catches real bugs like linear backoff instead of exponential without false failures.
6. JSON Serialization Has Sharp Edges
Go’s JSON encoding has a critical rule: only exported (capitalized) fields are serialized.
This bit me during API development:
type Metric struct {
    Name      string  // Becomes "Name" in JSON
    Value     float64 // Becomes "Value" in JSON
    timestamp int64   // NOT EXPORTED - invisible to JSON!
}
To control JSON field names, use struct tags:
type Metric struct {
    Name      string  `json:"name"`      // "name" in JSON
    Value     float64 `json:"value"`     // "value" in JSON
    Timestamp int64   `json:"timestamp"` // "timestamp" in JSON
}
Second gotcha: Don’t encode JSON strings, encode structs.
Wrong:
jsonStr := `{"name":"test"}`
json.NewEncoder(w).Encode(jsonStr) // Double-wraps it!
// Output: "{\"name\":\"test\"}" - escaped quotes!
Right:
data := MyStruct{Name: "test"}
json.NewEncoder(w).Encode(data)
// Output: {"name":"test"} - clean JSON
7. Channels Are Not Just Queues
Coming from Java’s BlockingQueue or Python’s Queue, I initially thought Go channels were just thread-safe queues. They’re much more powerful.
Channels with select enable elegant multiplexing:
for {
    select {
    case <-ctx.Done():
        return ctx.Err() // Shutdown signal
    case alert := <-alertQueue:
        processAlert(alert) // New alert
    case <-time.After(30 * time.Second):
        log.Println("Health check") // Periodic task
    }
}
One select statement handles three completely different event sources. In Java, you’d need separate threads or a complex event loop. In Go, it’s a language primitive.
8. Mutexes Are Necessary But Simpler
Despite channels being the idiomatic Go approach, sometimes you need good old-fashioned mutexes for shared state:
type Manager struct {
    rules      []AlertRule
    ruleMutex  sync.RWMutex // Protects rules
    lastAlerts map[string]time.Time
    alertMutex sync.Mutex // Protects lastAlerts
}

func (m *Manager) GetRules() []AlertRule {
    m.ruleMutex.RLock() // Multiple readers OK
    defer m.ruleMutex.RUnlock()
    return append([]AlertRule{}, m.rules...) // Return copy
}

func (m *Manager) AddRule(rule AlertRule) {
    m.ruleMutex.Lock() // Exclusive access
    defer m.ruleMutex.Unlock()
    m.rules = append(m.rules, rule)
}
Key insight: Use RWMutex when you have many readers and few writers. The read lock allows concurrent reads, while the write lock is exclusive.
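The lastAlerts map above is also what powers the alert cooldowns mentioned earlier. A minimal sketch of how that check could look (shouldFire and the cooldown parameter are illustrative names, not necessarily the project’s real API):

// Sketch: suppress an alert if the same rule fired within the cooldown window.
func (m *Manager) shouldFire(ruleName string, cooldown time.Duration) bool {
    m.alertMutex.Lock()
    defer m.alertMutex.Unlock()

    if last, ok := m.lastAlerts[ruleName]; ok && time.Since(last) < cooldown {
        return false // Still cooling down - skip the duplicate notification
    }
    m.lastAlerts[ruleName] = time.Now()
    return true
}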
Testing Infrastructure
The test suite is comprehensive and fast:
Statistics:
Currently 197 unit tests (run in ~2.6 seconds on my laptop)
11 integration tests (run in ~30-40 seconds)
Zero linter warnings
Coverage of all critical paths
What’s tested:
Alert condition evaluation (9 tests covering all operators)
Alert manager core functionality (16 tests)
Webhook retry logic with exponential backoff (11 tests)
End-to-end workflows (10 integration tests)
CLI commands (13 tests)
Storage operations
Configuration loading
Hiero SDK interactions (mocked)
Pre-push verification workflow:
# Fast checks on every commit
./scripts/check-offline.sh # Format, lint, unit tests, build (~10-20s)
# Full verification before push
./monitor --config config.yaml # Start service
./scripts/check-online.sh # API health, metrics, alerts (~30-60s)
This gives developers instant feedback while ensuring production quality before code reaches CI.
Production Readiness Features
Graceful Shutdown
Handles SIGTERM and SIGINT signals
Stops accepting new requests
Waits for in-flight operations to complete
Closes all connections cleanly
Error Handling
No silent failures - all errors are logged or returned
Webhook retry with exponential backoff (10ms → 20ms → 40ms → 80ms)
Alert cooldowns prevent notification spam
Configurable severity levels
Observability
Structured logging throughout
Configurable log levels (debug, info, warn, error)
REST API exposes health checks and metrics (example after this list)
Clear error messages with context
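A quick liveness probe might look like this (the /api/v1/health path is an assumed route for illustration; the README documents the actual endpoints):

# Hypothetical health-check route
curl http://localhost:8080/api/v1/health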
Configuration
YAML-based configuration with validation
Environment variable fallback (see the sketch after this list)
Inline documentation in example config
No hardcoded credentials
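A minimal sketch of the environment-variable fallback idea, assuming a Config type with a Network.OperatorKey field and an os import; HEDERA_OPERATOR_KEY is an illustrative name, not necessarily what the project reads:

// Prefer the YAML value; fall back to the environment if it's unset.
func operatorKey(cfg *Config) string {
    if cfg.Network.OperatorKey != "" {
        return cfg.Network.OperatorKey
    }
    return os.Getenv("HEDERA_OPERATOR_KEY")
}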
What’s Next
The MVP focuses on core monitoring and alerting. Potential future enhancements include:
Storage Backends
PostgreSQL for persistent storage
InfluxDB for time-series optimization
Prometheus metrics export
Advanced Features
Web UI dashboard with Grafana integration
Transaction history tracking and analysis
Custom alerting domain-specific language (DSL) for complex conditions
Rate limiting and cost analysis
WebSocket API for real-time metrics
Multi-network monitoring (mainnet + testnet simultaneously)
Enterprise Features
Kubernetes deployment manifests
High availability setup
User authentication and multi-tenancy
Integration with enterprise alerting platforms (PagerDuty, Opsgenie)
Key Takeaways for Go Development
Here’s what I’d tell someone new to Go:
Embrace goroutines - Don’t think in terms of thread pools. Spawn goroutines liberally and let the runtime handle scheduling.
Context is not optional - For any long-running operation accept a context. It’s how you build composable, cancellable systems.
Interfaces after implementation - You don’t need to design interfaces upfront. Build concrete implementations first, extract interfaces when you need flexibility.
Test behavior, not implementation - Go’s simplicity encourages you to test what code does, not how it does it.
Channels for ownership transfer - If data ownership moves between goroutines, use channels. For shared state accessed by multiple goroutines, use mutexes.
Error handling is verbose but explicit - Yes, you write if err != nil constantly, but you always know when operations can fail and how to handle it.
The standard library is phenomenal - HTTP servers, JSON encoding, testing, benchmarking: it’s all built-in and production-grade.
Tooling matters - go fmt, go vet, golangci-lint: use them on every commit. The ecosystem’s emphasis on consistency pays dividends.
Why This Project Matters
Beyond learning Go, this project demonstrates the skills I’m building for protocol engineering and distributed systems:
Concurrent System Design
Coordinating multiple independent components
Managing shared state safely
Implementing graceful shutdown
Production Quality Code
Comprehensive testing (unit + integration)
Error handling without silent failures
Observability and debugging support
Real-World Problem Solving
Exponential backoff for transient failures
Alert cooldowns to prevent notification spam
Configurable behavior without code changes
Developer Experience
Clean APIs (REST + CLI)
Thorough documentation
Easy local development and testing
For teams running production applications on Hedera, this tool provides the observability foundation they need. It also aims to show engineers evaluating my work that I can build production-grade infrastructure, with appropriate testing, documentation, and operational considerations.
Try It Yourself
The project is open source on GitHub and ready to run. If you’re familiar with the command line, GitHub, and Go, you can get set up within ten minutes.
Quick Start:
git clone https://github.com/kaldun-tech/hedera-network-monitor.git
cd hedera-network-monitor
cp config/config.example.yaml config/config.yaml
# Edit config.yaml with your Hedera credentials
make build
./monitor
What You Need:
Go 1.21+
Hedera testnet account - free from portal.hedera.com
The README includes complete configuration examples, API documentation, and troubleshooting guides. The test suite demonstrates best practices for testing concurrent Go code.
Built by Taras Smereka
Distributed Systems Engineer specializing in protocol optimization and production infrastructure