feat(helm): Phase 3.5 - Observability Stack #7

Merged
tamingchaos merged 10 commits into k8s-research from feature/phase3.5-observability on Dec 18, 2025
@tamingchaos (Owner)

Summary

  • Add integrated Prometheus/Grafana monitoring stack via kube-prometheus-stack dependency
  • Add NATS prometheus-exporter sidecar for JetStream metrics
  • Add Makefile targets for Helm/K8s operations
  • Add development values and local secrets template

Changes

Monitoring

  • NATS prometheus-exporter sidecar container in sequencer StatefulSet
  • ServiceMonitors for sequencer (erigon + NATS) and RPC
  • Grafana dashboards: cdk-erigon-performance, cdk-erigon-sync
  • Grafana NodePort (30300) for local access

Developer Experience

  • make docker-build / make docker-build-all - Build images
  • make helm-deps - Update chart dependencies
  • make helm-crds - Install Prometheus Operator CRDs
  • values-dev.yaml - Development environment settings
  • values-local.yaml.example - Template for local secrets (API keys)
  • Remove values-local.yaml from tracking (now gitignored)

Documentation

  • NATS stream/KV mismatch troubleshooting guide

Related Issues

  • RD-593: NATS Prometheus metrics
  • RD-594: ServiceMonitors
  • RD-595: Grafana dashboards
  • RD-613, RD-614: NATS datastream issues (documented, not fixed in this PR)

Copilot AI review requested due to automatic review settings December 18, 2025 12:09
@tamingchaos tamingchaos changed the base branch from k8s-baseline to k8s-research December 18, 2025 12:13

Copilot AI left a comment

Pull request overview

This PR adds a comprehensive observability stack to the CDK Erigon Kubernetes deployment, including:

  • Integrated Prometheus/Grafana monitoring via kube-prometheus-stack dependency
  • NATS prometheus-exporter sidecar for JetStream metrics
  • ServiceMonitors for automated scrape configuration
  • Grafana dashboards for NATS and cdk-erigon performance
  • Development tooling including Makefile targets and validation scripts
  • Extensive documentation for security, shutdown procedures, and conventions

Reviewed changes

Copilot reviewed 42 out of 44 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `k8s/l1-proxy/Dockerfile` | L1 RPC cache proxy container image |
| `k8s/scripts/*.sh` | Build, validation, and testing automation scripts |
| `k8s/helm/values*.yaml` | Helm configuration for various environments |
| `k8s/helm/templates/**/*.yaml` | Kubernetes resource templates |
| `k8s/helm/tests/**/*.yaml` | Helm unit tests |
| `k8s/helm/dashboards/*.json` | Grafana dashboard definitions |
| `k8s/helm/Chart.yaml` | Helm chart metadata with kube-prometheus-stack dependency |
| `k8s/docs/**/*.md` | Comprehensive documentation |

tamingchaos and others added 10 commits December 18, 2025 12:27
- Add NATS http_port config (8222) to sequencer
- Expose monitoring port in StatefulSet and Service
- Create ServiceMonitor for Prometheus scraping
- Add 6 helm-unittest tests (105/105 passing)
- Document Prometheus Operator prerequisite

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add global monitoring values schema
- Add erigon metrics port (6060) to sequencer and RPC
- Expose metrics ports in Services
- Create ServiceMonitor for sequencer erigon metrics
- Create ServiceMonitor for RPC erigon metrics
- Add 14 helm-unittest tests (119/119 passing)
- Update monitoring README with new ServiceMonitors
- Add .helmignore file

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
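
For readers unfamiliar with ServiceMonitors, the erigon ServiceMonitor described in this commit might look roughly like the sketch below. The resource name, labels, port name, and scrape interval are illustrative assumptions, not values taken from the chart. The `/debug/metrics/prometheus` path is the endpoint Erigon conventionally serves on its metrics port and should be checked against the deployed binary.

```yaml
# Hypothetical ServiceMonitor for sequencer erigon metrics.
# All names and labels below are illustrative, not copied from the chart.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cdk-erigon-sequencer-erigon
  labels:
    release: cdk-erigon          # assumed label the Prometheus instance selects on
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: sequencer   # must match the Service's labels
  endpoints:
    - port: metrics              # named Service port assumed to map to 6060
      path: /debug/metrics/prometheus
      interval: 15s              # assumed scrape interval
```

The `port` field refers to a named port on the Service, not a number, which is why the commit also exposes the metrics ports in the Services.
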
…595)

- Add NATS JetStream dashboard (from official repo)
- Create custom cdk-erigon performance dashboard
- Create Grafana ConfigMap for dashboard provisioning
- Add 5 helm-unittest tests (124/124 passing)
- Document dashboard usage and configuration

Dashboards included:
- NATS JetStream: stream metrics, storage, consumer lag, throughput
- cdk-erigon Performance: block sync, memory/CPU, RPC rate, DB size

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add kube-prometheus-stack as optional subchart for one-command deployment:

- Add prometheus dependency in Chart.yaml (v65.0.0)
- Configure Grafana sidecar for auto-dashboard discovery
- Enable single-command monitoring deployment
- Add .gitignore for generated charts/
- Update README with integrated setup instructions

Usage:
  helm install cdk-erigon . \
    --set monitoring.enabled=true \
    --set monitoring.prometheus.enabled=true \
    --set monitoring.grafana.dashboards.enabled=true

Dashboards auto-load into Grafana at http://localhost:3000

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
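
As a rough sketch of what an optional subchart dependency of this kind usually looks like in `Chart.yaml`: the `condition` key below is an assumption inferred from the `--set monitoring.prometheus.enabled=true` flag in the usage example, and the actual chart may use an alias or a different key.

```yaml
# Chart.yaml (excerpt): optional kube-prometheus-stack subchart.
dependencies:
  - name: kube-prometheus-stack
    version: "65.0.0"            # version named in this commit message
    repository: https://prometheus-community.github.io/helm-charts
    # Assumed toggle: the subchart is rendered only when this parent value is true.
    condition: monitoring.prometheus.enabled
```

After changing dependencies, `helm dependency update` fetches the subchart into `charts/`, which is why the commit also gitignores that directory.
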
- Upgrade kube-prometheus-stack v65→v80 for built-in CRD installation
- Enable prometheus.crds.upgradeJob for idempotent CRD handling
- Remove custom prometheus-crds-install.yaml (upstream handles it)
- Add comprehensive README.md with quick start guide
- Consolidate monitoring docs into main README

The upgradeJob runs as a Helm pre-install/pre-upgrade hook, using
kubectl apply --server-side to install Prometheus Operator CRDs
before ServiceMonitor resources are created. This enables seamless
deployment on fresh clusters without manual CRD installation.

Tests: 124/124 helm-unittest passing
The L1 proxy doesn't have a dedicated /health endpoint that works
without query parameters. Switch to TCP socket probe which only
verifies the port is listening.
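
A TCP socket probe of the kind described above could be declared as in the following sketch. The port name and timing values are hypothetical, since the PR text does not state them.

```yaml
# Container spec excerpt for the L1 proxy: probes check only that the port accepts connections.
livenessProbe:
  tcpSocket:
    port: http                   # hypothetical named container port
  initialDelaySeconds: 5         # assumed timings
  periodSeconds: 10
readinessProbe:
  tcpSocket:
    port: http
  periodSeconds: 5
```

The trade-off is that a TCP probe cannot detect a process that is listening but unable to serve requests; it only avoids false failures from an HTTP endpoint that requires query parameters.
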
Helm validates templates before running hooks, so CRDs must be
installed before helm install/upgrade when monitoring is enabled.
The kubectl apply commands are idempotent and safe to re-run.
Replace ${DS_PROMETHEUS} and ${DS__NATS-PROMETHEUS} placeholders with
the actual datasource UID 'prometheus' at template render time using
Helm's replace function. This enables provisioned dashboards to work
correctly since Grafana doesn't process __inputs for sidecar-loaded
dashboards.
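
The render-time substitution described above can be sketched as a ConfigMap template like the one below. The ConfigMap name and the dashboard filename are hypothetical; `grafana_dashboard: "1"` is the default label the Grafana sidecar in kube-prometheus-stack watches for, and it may be configured differently in this chart.

```yaml
# templates/grafana-dashboards.yaml (hypothetical path and names)
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-grafana-dashboards
  labels:
    grafana_dashboard: "1"       # label the Grafana sidecar discovers by default
data:
  # Load the dashboard JSON from the chart and rewrite both datasource
  # placeholders to the fixed UID "prometheus" before indenting it under the key.
  nats-jetstream.json: |-
{{ .Files.Get "dashboards/nats-jetstream.json" | replace "${DS_PROMETHEUS}" "prometheus" | replace "${DS__NATS-PROMETHEUS}" "prometheus" | indent 4 }}
```

`replace` takes the old string, the new string, and then the piped input, so chaining two calls handles both placeholders in one pass over the file.
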
- Add Helm/K8s make targets (docker-build, helm-deps, helm-crds, etc.)
- Enable NATS HTTP monitoring on 0.0.0.0:8222 in backend.go
- Add prometheus-nats-exporter sidecar container to sequencer
- Add cdk-erigon-sync.json Grafana dashboard
- Add values-dev.yaml for development environment settings
- Add values-local.yaml.example as template for local secrets
- Remove values-local.yaml from tracking (now gitignored)
- Add NATS stream/KV mismatch troubleshooting guide
- Configure Grafana NodePort (30300) for local access
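
For illustration, a prometheus-nats-exporter sidecar that reads the NATS monitoring endpoint on port 8222 might be declared roughly as follows. The container name, image tag, chosen flags, and port name are assumptions; 7777 is the exporter's default listen port.

```yaml
# StatefulSet pod spec excerpt: only the exporter sidecar is shown.
containers:
  - name: nats-exporter                              # hypothetical container name
    image: natsio/prometheus-nats-exporter:latest    # pin a specific tag in practice
    args:
      - -varz                    # general server metrics
      - -jsz=all                 # JetStream metrics
      - http://localhost:8222    # NATS HTTP monitoring endpoint in the same pod
    ports:
      - name: nats-metrics       # hypothetical port name for a ServiceMonitor to reference
        containerPort: 7777
```

Because the sidecar shares the pod's network namespace, it can reach the NATS monitoring port over `localhost` and re-expose the data in Prometheus format.
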
@tamingchaos tamingchaos force-pushed the feature/phase3.5-observability branch from fc1d9f9 to 78b8d04 Compare December 18, 2025 12:27
@tamingchaos tamingchaos merged commit 3bb5f25 into k8s-research Dec 18, 2025
@tamingchaos tamingchaos deleted the feature/phase3.5-observability branch December 18, 2025 12:40

3 participants