Hedera Network Monitor
Production-Grade Blockchain Monitoring in Go
What It Does
The Hedera Network Monitor is a comprehensive monitoring and alerting system for the Hedera distributed ledger network. If you’re running applications on Hedera, you need to know when account balances drop below thresholds, when transactions fail, or when network nodes become unavailable. This tool handles all of that with real-time alerts and a clean REST API.
The monitor is a blockchain watchdog: it runs 24/7, collecting metrics, evaluating alert conditions, and notifying you through webhooks (Slack, Discord, PagerDuty, etc.) when something needs attention.
Core Features
Account Monitoring
Track multiple Hedera accounts simultaneously
Monitor balance changes in real-time
Alert when balances fall below configurable thresholds
Historical balance tracking via REST API
Network Health
Monitor Hedera node availability
Track network status across testnet and mainnet
Alert on network degradation
Flexible Alerting
Webhook notifications to your service (Slack, Discord, custom endpoints)
Configurable alert rules with multiple severity levels
Alert cooldowns to prevent notification spam
Exponential backoff with retry logic for failed webhooks
Developer-Friendly Tooling
REST API for programmatic access to metrics
CLI tool for quick queries and configuration
Clean configuration via YAML files
Comprehensive logging for debugging
Why I Built This
The Hedera ecosystem has excellent developer tools for building applications, but its monitoring infrastructure is lacking.
I recently contributed to the Hedera protocol itself, achieving a major performance improvement in its protocol buffer implementation. That work made clear that production teams need robust monitoring to operate their nodes with confidence.
Most blockchain monitoring solutions are either:
Expensive SaaS platforms with vendor lock-in
Generic tools that require extensive customization
Built for Ethereum and poorly adapted to other chains
I wanted something purpose-built for Hedera that teams could self-host, customize, and trust in production.
Architecture Overview
The monitor runs as a daemon service with several concurrent components:
Monitor Service
├── Collectors (gather metrics from Hedera)
│ ├── Account Collector (balances, transactions)
│ └── Network Collector (node status)
├── Alert Manager (evaluates rules, dispatches webhooks)
├── Storage (in-memory for MVP, PostgreSQL/InfluxDB planned)
└── REST API Server (exposes metrics and configuration)
Everything runs concurrently using Go’s goroutines, coordinated through channels and context-based cancellation. When you shut down the service (Ctrl+C), all components receive the signal and clean up gracefully. No orphaned connections or lost data.
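To make the wiring concrete, here’s a sketch of the collector abstraction this architecture implies. The exact interface is illustrative, not the project’s verbatim code, but it mirrors the Collect call used in the errgroup example later in this post:

// Illustrative sketch - the real interface may differ.
type Collector interface {
    // Collect runs until ctx is cancelled, writing gathered metrics to
    // store and handing them to the alert manager for rule evaluation.
    Collect(ctx context.Context, store Storage, alerts *Manager) error
}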
Real-World Usage Example
Here’s how you’d use this in production. Copy config/config.example.yaml to config/config.yaml and customize it as appropriate.
1. Configure your monitoring:
network:
  name: mainnet
  operator_id: "0.0.1234"
  operator_key: "your-private-key"

accounts:
  - id: "0.0.5000"
    label: "Main Treasury"
  - id: "0.0.5001"
    label: "Fee Account"

alerting:
  enabled: true
  webhooks:
    - "https://hooks.slack.com/services/YOUR/WEBHOOK"
  rules:
    - name: "Treasury Balance Critical"
      metric_name: "account_balance"
      condition: "<"
      threshold: 50000000000 # 500 HBAR in tinybars
      severity: "critical"
    - name: "Fee Account Low"
      metric_name: "account_balance"
      condition: "<"
      threshold: 10000000000 # 100 HBAR in tinybars
      severity: "warning"
2. Start the service:
./monitor --config config/config.yaml
3. Query metrics via command-line interface (CLI):
# Check current balance
./hmon account balance 0.0.5000
# List active alert rules
./hmon alerts list
# Get network status
./hmon network status
4. Query metrics via API:
# Get recent metrics (quote the URL so the shell doesn't treat & as a job separator)
curl "http://localhost:8080/api/v1/metrics?name=account_balance&limit=10"
# Get metrics for a specific account
curl "http://localhost:8080/api/v1/metrics/account?key=account_id&value=0.0.5000" | jq '.metrics[-5:]'
When your treasury balance drops below 500 HBAR, you get a Slack notification immediately. No manual checking, no missed alerts.
What I Learned Building This in Go
This was my first production-scale Go project. It taught me why Go has become the standard for infrastructure tooling. Here are my key insights.
1. Goroutines Make Concurrency Easy
In Java and C#, threading is expensive and complex. Go’s goroutines changed everything.
In this project, we run:
1 goroutine for the API server
2 goroutines for collectors (account and network monitoring)
1 goroutine for the alert manager
N goroutines for webhook dispatch (one per webhook, fired in parallel)
In Java, managing 5+ threads would require careful thread pool configuration, executor services, and significant overhead. In Go, it’s just:
eg, ctx := errgroup.WithContext(context.Background())

// Start API server
eg.Go(func() error {
    return server.Start(ctx)
})

// Start collectors
for _, collector := range collectors {
    c := collector // Capture loop variable!
    eg.Go(func() error {
        return c.Collect(ctx, store, alertManager)
    })
}

// Wait for all to complete or first error
if err := eg.Wait(); err != nil {
    log.Fatal(err)
}
Each goroutine is only about 2KB of stack space. You could spawn 10,000 if needed. The Go runtime handles multiplexing them onto OS threads automatically.
Critical lesson: The loop variable capture (c := collector) is essential. Without it, every closure would reference the same loop variable and you’d get a data race. This is one of Go’s classic pitfalls to learn. (Go 1.22 changed the semantics so each iteration gets its own variable, but the explicit capture still works and reads clearly on older toolchains.)
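Here’s a standalone toy example of the pitfall, independent of the monitor’s code, that you can run directly:

package main

import (
    "fmt"
    "sync"
)

func main() {
    items := []string{"a", "b", "c"}
    var wg sync.WaitGroup
    for _, item := range items {
        item := item // Per-iteration copy; required before Go 1.22
        wg.Add(1)
        go func() {
            defer wg.Done()
            fmt.Println(item) // Prints a, b, c (in some order)
        }()
    }
    wg.Wait()
}

Drop the capture line on a pre-1.22 toolchain and you’ll typically see "c" printed three times instead.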
2. Context-Based Cancellation Is Elegant
Graceful shutdown in distributed systems is notoriously difficult. Go’s context package makes it straightforward, and the defer keyword schedules cleanup to run when the enclosing function returns, much like Java’s try-finally.
// Create cancellable context
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

// Listen for shutdown signals
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

go func() {
    <-sigChan
    log.Println("Shutting down gracefully...")
    cancel() // Signal all goroutines to stop
}()
Every goroutine checks ctx.Done() in its main loop to detect when the process is shutting down:
for {
    select {
    case <-ctx.Done():
        log.Println("Received shutdown signal")
        return ctx.Err()
    case metric := <-metricChan:
        // Process metric
    }
}
When you press Ctrl+C:
1. The signal handler calls cancel()
2. All contexts receive the cancellation signal
3. Each goroutine cleans up and exits
4. Main waits for all goroutines via errgroup.Wait()
5. The process exits cleanly
No zombie processes, no leaked connections, no corruption. Just clean shutdown.
3. Interfaces Enable True Modularity
Go’s interface system is minimal and powerful. Go uses implicit satisfaction, which differs from Java’s explicit implements and from Python’s duck typing (it’s checked at compile time).
If your type has the right methods, it implements the interface automatically. This made my storage layer flexible:
type Storage interface {
    StoreMetric(metric Metric) error
    GetMetrics(name string, limit int) ([]Metric, error)
    Close() error
}
The MVP uses in-memory storage:
var store Storage = storage.NewMemoryStorage()
For production we can implement and swap in PostgreSQL:
var store Storage = storage.NewPostgresStorage(connString)
The rest of the codebase doesn’t know or care. The collectors just call store.StoreMetric(), and it works regardless of the backend.
Key insight: Interface-driven design helps you build systems that can evolve without massive refactoring. And it’s very useful for testing, as the sketch below shows.
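A unit test can swap in an in-memory test double. A minimal sketch, assuming the Metric struct shown later in this post (mockStorage is an illustrative name, not the project’s actual test helper):

// Illustrative test double satisfying the Storage interface.
type mockStorage struct {
    mu      sync.Mutex
    metrics []Metric
}

func (m *mockStorage) StoreMetric(metric Metric) error {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.metrics = append(m.metrics, metric)
    return nil
}

func (m *mockStorage) GetMetrics(name string, limit int) ([]Metric, error) {
    m.mu.Lock()
    defer m.mu.Unlock()
    var out []Metric
    for _, mt := range m.metrics {
        if mt.Name == name {
            out = append(out, mt)
        }
    }
    if limit > 0 && len(out) > limit {
        out = out[len(out)-limit:] // Keep the most recent entries
    }
    return out, nil
}

func (m *mockStorage) Close() error { return nil }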
4. Testing Async Code Requires Discipline
The hardest part of this project wasn’t writing the code. It was testing the asynchronous behavior properly.
The problem: Async tests are slow. Testing webhook retry logic with exponential backoff takes real time:
First attempt: immediate
Retry 1: 10ms delay
Retry 2: 20ms delay
Retry 3: 40ms delay
Retry 4: 80ms delay
A single test can take 5+ seconds. If you have 10 such tests, you’re waiting nearly a minute before every commit. That breaks the fast feedback loop that developers need.
My solution: Split tests by speed using build tags.
Fast tests (run on every commit, ~4 seconds):
// No build tag - runs by default
func TestAlertCondition_GreaterThan(t *testing.T) {
    // Pure logic, no waiting
    result := EvaluateCondition(100, ">", 50)
    if !result {
        t.Error("Expected true")
    }
}
Slow tests (run before pushing, ~30-60 seconds):
//go:build integration
// +build integration

func TestEndToEndWebhookRetry(t *testing.T) {
    // Actually waits for retries, tests real async behavior
    manager.CheckMetric(metric)
    time.Sleep(5 * time.Second)
    // Verify webhook was called after retries
}
In the Makefile:
test-unit:
go test ./... # Fast, no integration tag
test-integration:
go test -tags integration ./... # Slow, full validation
This way we get instant feedback (<5s) on every commit, and comprehensive validation before pushing.
5. Implementing Robust Webhook Delivery
When alerts fire, you need confidence they reach you. Networks fail, services restart, rate limits kick in. The solution is exponential backoff.
The pattern is simple: after each failed attempt, wait twice as long before retrying.
func (m *Manager) sendWebhookWithRetry(webhook string, alert Alert) error {
    maxRetries := 4
    baseDelay := 10 * time.Millisecond

    for attempt := 0; attempt <= maxRetries; attempt++ {
        err := m.sendWebhook(webhook, alert)
        if err == nil {
            return nil // Success!
        }
        if attempt < maxRetries {
            delay := baseDelay * (1 << attempt) // 10ms, 20ms, 40ms, 80ms
            time.Sleep(delay)
        }
    }
    return fmt.Errorf("webhook failed after %d retries", maxRetries)
}

Why this works:
Transient failures recover - if a network blip lasts 50ms, the second attempt succeeds.
No endpoint hammering - a backed-up or restarting server gets breathing room to recover.
Rate limit friendly - backing off gives rate-limit windows time to reset.
Fast when possible - the first retry comes after only 10ms, and most failures resolve quickly.
The full sequence of four retries waits only 150ms in total (10+20+40+80), so genuine failures fail fast while temporary issues get the chance to self-resolve.
Testing Exponential Backoff Without Flakiness
A testing challenge was verifying the exponential backoff timing. You can’t hardcode “expect exactly 10ms delay,” because system load, CI runners, and scheduler jitter make that flaky.
The wrong approach:
delay := timestamps[1].Sub(timestamps[0])
if delay != 10*time.Millisecond { // Fails randomly!
    t.Fatal("Expected 10ms")
}
The right approach - test the ratio with tolerance:
var delays []time.Duration
for i := 1; i < len(timestamps); i++ {
    delays = append(delays, timestamps[i].Sub(timestamps[i-1]))
}

// Verify exponential growth: each delay ~2x the previous
tolerance := 0.5 // Allow 50% variance
for i := 1; i < len(delays); i++ {
    ratio := float64(delays[i]) / float64(delays[i-1])
    if ratio < (2.0-tolerance) || ratio > (2.0+tolerance) {
        t.Errorf("Delay %d: ratio %.2f, expected ~2.0", i, ratio)
    }
}
This verifies the exponential property (each retry is ~2x longer) while tolerating system variance. It catches real bugs like linear backoff instead of exponential without false failures.
6. JSON Serialization Has Sharp Edges
Go’s JSON encoding has a critical rule: only exported (capitalized) fields are serialized.
This bit me during API development:
type Metric struct {
    Name      string  // Becomes "Name" in JSON
    Value     float64 // Becomes "Value" in JSON
    timestamp int64   // NOT EXPORTED - invisible to JSON!
}
To control JSON field names, use struct tags:
type Metric struct {
    Name      string  `json:"name"`      // "name" in JSON
    Value     float64 `json:"value"`     // "value" in JSON
    Timestamp int64   `json:"timestamp"` // "timestamp" in JSON
}
Second gotcha: Don’t encode JSON strings, encode structs.
Wrong:
jsonStr := `{"name":"test"}`
json.NewEncoder(w).Encode(jsonStr) // Double-wraps it!
// Output: "{\"name\":\"test\"}" - escaped quotes!
Right:
data := MyStruct{Name: "test"}
json.NewEncoder(w).Encode(data)
// Output: {"name":"test"} - clean JSON
7. Channels Are Not Just Queues
Coming from Java’s BlockingQueue or Python’s Queue, I initially thought Go channels were just thread-safe queues. They’re much more powerful.
Channels with select enable elegant multiplexing:
for {
    select {
    case <-ctx.Done():
        return ctx.Err() // Shutdown signal
    case alert := <-alertQueue:
        processAlert(alert) // New alert
    case <-time.After(30 * time.Second):
        log.Println("Health check") // Periodic task
    }
}
One select statement handles three completely different event sources. In Java, you’d need separate threads or a complex event loop. In Go, it’s a language primitive.
8. Mutexes Are Necessary But Simpler
Despite channels being the idiomatic Go approach, sometimes you need good old-fashioned mutexes for shared state:
type Manager struct {
    rules      []AlertRule
    ruleMutex  sync.RWMutex // Protects rules
    lastAlerts map[string]time.Time
    alertMutex sync.Mutex // Protects lastAlerts
}

func (m *Manager) GetRules() []AlertRule {
    m.ruleMutex.RLock() // Multiple readers OK
    defer m.ruleMutex.RUnlock()
    return append([]AlertRule{}, m.rules...) // Return copy
}

func (m *Manager) AddRule(rule AlertRule) {
    m.ruleMutex.Lock() // Exclusive access
    defer m.ruleMutex.Unlock()
    m.rules = append(m.rules, rule)
}
Key insight: Use RWMutex when you have many readers and few writers. The read lock allows concurrent reads, while the write lock is exclusive.
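The lastAlerts map above is also what powers the alert cooldowns mentioned earlier. A minimal sketch of how that check could look (shouldFire and the cooldown parameter are illustrative names, not necessarily the project’s real API):

// Sketch: suppress an alert if the same rule fired within the cooldown window.
func (m *Manager) shouldFire(ruleName string, cooldown time.Duration) bool {
    m.alertMutex.Lock()
    defer m.alertMutex.Unlock()

    if last, ok := m.lastAlerts[ruleName]; ok && time.Since(last) < cooldown {
        return false // Still cooling down - skip the duplicate notification
    }
    m.lastAlerts[ruleName] = time.Now()
    return true
}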
Testing Infrastructure
The test suite is comprehensive and fast:
Statistics:
Currently 197 unit tests (run in ~2.6 seconds on my laptop)
11 integration tests (run in ~30-40 seconds)
Zero linter warnings
Coverage of all critical paths
What’s tested:
Alert condition evaluation (9 tests covering all operators)
Alert manager core functionality (16 tests)
Webhook retry logic with exponential backoff (11 tests)
End-to-end workflows (10 integration tests)
CLI commands (13 tests)
Storage operations
Configuration loading
Hiero SDK interactions (mocked)
Pre-push verification workflow:
# Fast checks on every commit
./scripts/check-offline.sh # Format, lint, unit tests, build (~10-20s)
# Full verification before push
./monitor --config config.yaml # Start service
./scripts/check-online.sh # API health, metrics, alerts (~30-60s)
This gives developers instant feedback while ensuring production quality before code reaches CI.
Production Readiness Features
Graceful Shutdown
Handles SIGTERM and SIGINT signals
Stops accepting new requests
Waits for in-flight operations to complete
Closes all connections cleanly
Error Handling
No silent failures - all errors are logged or returned
Webhook retry with exponential backoff (10ms → 20ms → 40ms → 80ms)
Alert cooldowns prevent notification spam
Configurable severity levels
Observability
Structured logging throughout
Configurable log levels (debug, info, warn, error)
REST API exposes health checks and metrics (example after this list)
Clear error messages with context
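A quick liveness probe might look like this (the /api/v1/health path is an assumed route for illustration; the README documents the actual endpoints):

# Hypothetical health-check route
curl http://localhost:8080/api/v1/health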
Configuration
YAML-based configuration with validation
Environment variable fallback (see the sketch after this list)
Inline documentation in example config
No hardcoded credentials
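A minimal sketch of the environment-variable fallback idea, assuming a Config type with a Network.OperatorKey field and an os import; HEDERA_OPERATOR_KEY is an illustrative name, not necessarily what the project reads:

// Prefer the YAML value; fall back to the environment if it's unset.
func operatorKey(cfg *Config) string {
    if cfg.Network.OperatorKey != "" {
        return cfg.Network.OperatorKey
    }
    return os.Getenv("HEDERA_OPERATOR_KEY")
}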
What’s Next
The MVP focuses on core monitoring and alerting. Potential future enhancements include:
Storage Backends
PostgreSQL for persistent storage
InfluxDB for time-series optimization
Prometheus metrics export
Advanced Features
Web UI dashboard with Grafana integration
Transaction history tracking and analysis
Custom alerting domain-specific language (DSL) for complex conditions
Rate limiting and cost analysis
WebSocket API for real-time metrics
Multi-network monitoring (mainnet + testnet simultaneously)
Enterprise Features
Kubernetes deployment manifests
High availability setup
User authentication and multi-tenancy
Integration with enterprise alerting platforms (PagerDuty, Opsgenie)
Key Takeaways for Go Development
Here’s what I’d tell someone new to Go:
Embrace goroutines - Don’t think in terms of thread pools. Spawn goroutines liberally and let the runtime handle scheduling.
Context is not optional - For any long-running operation accept a context. It’s how you build composable, cancellable systems.
Interfaces after implementation - You don’t need to design interfaces upfront. Build concrete implementations first, extract interfaces when you need flexibility.
Test behavior, not implementation - Go’s simplicity encourages you to test what code does, not how it does it.
Channels for ownership transfer - If data ownership moves between goroutines, use channels. For shared state accessed by multiple goroutines, use mutexes.
Error handling is verbose but explicit - Yes, you write if err != nil constantly, but you always know when operations can fail and how to handle it.
The standard library is phenomenal - HTTP servers, JSON encoding, testing, benchmarking: it’s all built-in and production-grade.
Tooling matters - go fmt, go vet, golangci-lint: use them on every commit. The ecosystem’s emphasis on consistency pays dividends.
Why This Project Matters
Beyond learning Go, this project demonstrates the skills I’m building for protocol engineering and distributed systems:
Concurrent System Design
Coordinating multiple independent components
Managing shared state safely
Implementing graceful shutdown
Production Quality Code
Comprehensive testing (unit + integration)
Error handling without silent failures
Observability and debugging support
Real-World Problem Solving
Exponential backoff for transient failures
Alert cooldowns to prevent notification spam
Configurable behavior without code changes
Developer Experience
Clean APIs (REST + CLI)
Thorough documentation
Easy local development and testing
For teams running production applications on Hedera, this tool provides the observability foundation they need. It also aims to show engineers evaluating my work that I can build production-grade infrastructure, with appropriate testing, documentation, and operational considerations.
Try It Yourself
The project is open source on GitHub and ready to run. If you’re familiar with the command line, GitHub, and Go, you can get set up within ten minutes.
Quick Start:
git clone https://github.com/kaldun-tech/hedera-network-monitor.git
cd hedera-network-monitor
cp config/config.example.yaml config/config.yaml
# Edit config.yaml with your Hedera credentials
make build
./monitor
What You Need:
Go 1.21+
Hedera testnet account - free from portal.hedera.com
The README includes complete configuration examples, API documentation, and troubleshooting guides. The test suite demonstrates best practices for testing concurrent Go code.
Built by Taras Smereka
Distributed Systems Engineer specializing in protocol optimization and production infrastructure