Production-ready observability stack deployed to your AWS VPC with distributed tracing, metrics, and alerting
- Overview
- Architecture
- Prerequisites
- Installation
- Deployment
- Services
- Authentication
- Configuration
- Monitoring & Alerts
- Troubleshooting
- Customization
- Cleanup
Retriever is a cloud-native observability platform that deploys directly into your AWS account. Built on battle-tested open-source tools, it provides distributed tracing, spanmetrics collection, and alerting with Slack notifications, all running inside your VPC. A built-in MCP server lets you investigate traces and metrics from a single environment such as Cursor.
- 📊 Distributed Tracing via Jaeger for request flow visualization
- 📈 Metrics Collection via Prometheus for performance monitoring
- 🚨 Smart Alerting via AlertManager with Slack integration
- 💾 Persistent Storage via OpenSearch for long-term trace retention
- 🤖 AI Integration via MCP Server for AI-powered observability analysis
- 🔒 Secure by Default with JWT authentication and TLS encryption
- ☁️ Cloud-Native deployed to AWS ECS Fargate (serverless containers)
- 🏗️ Infrastructure as Code using Terraform for reproducible deployments
- 🚀 One-Command Deploy via CLI - no manual AWS console configuration needed
Traditional observability platforms require:
- Complex manual setup and configuration
- Expensive SaaS subscriptions with per-GB pricing
- Vendor lock-in and data residency concerns
- Limited customization options
Retriever provides:
- ✅ Self-hosted in your AWS account (you control your data)
- ✅ One-command deployment via CLI
- ✅ Automated infrastructure provisioning with Terraform
- ✅ Full source code access for customization
- ✅ Pay only for AWS infrastructure (no per-GB fees)
- ✅ Production-ready with TLS, authentication, and auto-scaling
| Resource | Purpose | Details |
|---|---|---|
| VPC | Network isolation | Uses your existing VPC |
| ECS Cluster | Container orchestration | Fargate (serverless) |
| 7 ECS Services | Observability components | Query, Collector, Prometheus, etc. |
| Application Load Balancer | HTTPS ingress | TLS termination with ACM |
| ACM Certificate | TLS/SSL | Auto-validated via DNS |
| Service Connect | Service mesh | Inter-service DNS and discovery |
| Secrets Manager | Secrets storage | JWT secret, Slack webhook |
| S3 Bucket | Terraform state | Account-isolated state storage |
| Security Groups | Network policies | Least-privilege access |
| Component | Purpose | Access |
|---|---|---|
| Auth Proxy | JWT authentication gateway | All traffic flows through here |
| Jaeger Query | Trace visualization UI | https://your-domain.com/ |
| Jaeger Collector | OTLP trace ingestion | Port 4317 (gRPC), 4318 (HTTP) |
| OpenSearch | Long-term trace storage | Internal only |
| Prometheus | Metrics aggregation | https://your-domain.com/prometheus |
| AlertManager | Alert routing and deduplication | https://your-domain.com/alertmanager |
| MCP Server | AI integration API | https://your-domain.com/mcp |
- AWS Account
  - Admin or sufficient IAM permissions
  - Account must support Fargate in your region
- AWS CLI configured with credentials
  aws configure
  # or use environment variables:
  # export AWS_ACCESS_KEY_ID=...
  # export AWS_SECRET_ACCESS_KEY=...
  # export AWS_REGION=us-east-1
- Existing VPC Infrastructure
  - VPC with at least 2 public subnets (for ALB)
  - 1 private subnet (for ECS tasks)
  - Internet Gateway attached
  - NAT Gateway for private subnet
- Domain Name (for TLS)
  - You own a domain (e.g., example.com)
  - Ability to add DNS records
  - Domain can be hosted anywhere (AWS Route53, DigitalOcean, Cloudflare, etc.)
- Slack Workspace (for alert notifications)
  - Webhook URL for posting messages
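Before installing, it can help to confirm that the command-line tools used in this guide are on your PATH. The sketch below is not part of the Retriever CLI: `missing_tools` is a hypothetical helper, and the tool list (git, npm, aws, terraform) is inferred from the commands that appear in this README rather than from an official requirements list.

```shell
# Hypothetical helper (not part of Retriever): print each named tool
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tool list inferred from the commands used in this README.
missing="$(missing_tools git npm aws terraform)"
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools found"
fi
```

An empty result means every listed tool was found; anything printed should be installed before running the steps that follow.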
# Clone the repository
git clone https://github.com/TeamRetriever/retriever.git
cd retriever/cli
# Install dependencies
npm install
# Build the CLI
npm run build
# Link globally (makes 'retriever' command available)
npm link
Verify installation:
retriever --version
Run the interactive setup wizard:
retriever init
The CLI will prompt you for:
- AWS Configuration
  - Region (e.g., us-east-1)
  - VPC ID
  - Public Subnet IDs (2 required for ALB high availability)
  - Private Subnet ID (for ECS tasks)
- TLS Certificate Setup
  - Domain name (e.g., observability.example.com)
  - Creates ACM certificate automatically
  - Provides DNS validation record to add
- DNS Validation
  - Add the CNAME record to your DNS provider
  - CLI waits for validation to complete (~5 minutes)
- JWT Authentication
  - Generates cryptographically secure JWT secret
  - Stores in AWS Secrets Manager
  - Creates initial access token (valid for 10 years)
- Slack Integration (Optional)
  - Prompt to configure Slack webhook
  - Skip if you don't want Slack notifications
Configuration is saved to .retriever-config.json:
{
  "region": "us-east-1",
  "vpcId": "vpc-xxxxx",
  "publicSubnetId1": "subnet-xxxxx",
  "publicSubnetId2": "subnet-xxxxx",
  "privateSubnetId": "subnet-xxxxx",
  "certificateArn": "arn:aws:acm:...",
  "domain": "observability.example.com",
  "jwtToken": "eyJhbGciOiJIUzI1NiIs..."
}
retriever deploy
What happens:
- ✅ Validates AWS credentials and configuration
- ✅ Checks/creates ECS task execution role
- ✅ Sets up S3 backend for Terraform state
- ✅ Initializes Terraform
- ✅ Shows deployment plan (resources to be created)
- ❓ Asks for confirmation
- 🚀 Deploys all infrastructure (~10-15 minutes)
  - Creates ECS cluster
  - Launches 7 Fargate services
  - Configures Application Load Balancer
  - Sets up Service Connect mesh
  - Configures security groups
- ✅ Verifies deployment health
Output:
✅ Deployment Complete!
Your Retriever observability platform is now running!
Load Balancer DNS: retriever-alb-xxxxxxxxx.us-east-1.elb.amazonaws.com
Next steps:
1. Point your DNS A record for observability.example.com to:
retriever-alb-xxxxxxxxx.us-east-1.elb.amazonaws.com
2. Access Retriever at: https://observability.example.com
3. Configure your applications to send traces to the collector
━━━ Access Information ━━━
Your JWT Access Token:
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
This token is required to:
• Log in to the web UI (paste when prompted)
• Access the MCP server (use as Bearer token)
• Valid for 10 years from generation
Token also saved in .retriever-config.json
URL: https://your-domain.com/
View distributed traces, analyze service dependencies, and debug performance issues.
Features:
- Search traces by service, operation, tags
- Visualize trace spans and timing
- Analyze service dependencies
- Compare trace performance
URL: https://your-domain.com/prometheus
Query metrics, visualize data, and view active alerts.
Useful Queries:
# Request rate by service
rate(calls_total[1m])
# Error rate
rate(calls_total{http_status_code=~"5.."}[1m])
# P95 Latency
histogram_quantile(0.95, rate(duration_milliseconds_bucket[5m]))
# Request rate by HTTP method
sum by(http_method) (rate(calls_total[1m]))
# Top services by request volume
topk(5, sum by(service_name) (rate(calls_total[5m])))
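The queries above can also be run from a terminal through Prometheus's HTTP API. The following is a minimal sketch, assuming the API is reachable at `/prometheus/api/v1/query` behind the Auth Proxy and accepts the JWT as a Bearer token; `RETRIEVER_URL`, `RETRIEVER_TOKEN`, and `prom_query` are hypothetical names chosen for illustration.

```shell
# Hypothetical variable names; set them to your own domain and token.
RETRIEVER_URL="${RETRIEVER_URL:-https://your-domain.com}"
RETRIEVER_TOKEN="${RETRIEVER_TOKEN:-YOUR_JWT_TOKEN}"

# Run an instant PromQL query through the authenticated endpoint.
# -G sends the data as URL query parameters; --data-urlencode escapes
# characters such as braces and quotes in the expression.
prom_query() {
  curl -sG "${RETRIEVER_URL}/prometheus/api/v1/query" \
    -H "Authorization: Bearer ${RETRIEVER_TOKEN}" \
    --data-urlencode "query=$1"
}

# Example: request rate per service over the last 5 minutes
# prom_query 'sum by(service_name) (rate(calls_total[5m]))'
```

URL-encoding matters here because PromQL expressions routinely contain quotes, braces, and brackets that would otherwise be mangled in transit.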
URL: https://your-domain.com/alertmanager
View active alerts, silences, and notification history.
Features:
- View firing alerts
- Create silences to suppress notifications
- View alert routing and grouping
- Check Slack notification status
API endpoint for AI integration with Claude Desktop.
Integration: Configure Claude Desktop to connect:
{
  "mcpServers": {
    "retriever": {
      "url": "https://your-domain.com/mcp",
      "headers": {
        "Authorization": "Bearer YOUR_JWT_TOKEN"
      }
    }
  }
}
All services are protected by JWT authentication via the Auth Proxy.
- Navigate to https://your-domain.com
- You'll be prompted for a token
- Paste your JWT token (from .retriever-config.json)
- Token is stored in your browser session
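If a login is rejected, it may be worth inspecting what the token actually contains. The sketch below decodes the payload segment of a JWT without verifying its signature. `jwt_payload` is a hypothetical helper, and the claims inside a Retriever token (for example a standard `exp` expiry claim) are an assumption; this README only states that the token is valid for 10 years.

```shell
# Hypothetical helper: print the decoded JSON payload of a JWT.
# A JWT is three base64url segments: header.payload.signature.
# This only decodes; it does NOT verify the signature.
jwt_payload() {
  payload="${1#*.}"          # drop the header segment
  payload="${payload%%.*}"   # drop the signature segment
  # Convert base64url to standard base64 and restore '=' padding.
  payload="$(printf '%s' "$payload" | tr '_-' '/+')"
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  # Older macOS releases may need 'base64 -D' instead of '-d'.
  printf '%s' "$payload" | base64 -d
}

# Usage (replace with your token from .retriever-config.json):
# jwt_payload "YOUR_JWT_TOKEN"
```

If the output includes an `exp` field, it is a Unix timestamp in seconds that can be compared against `date +%s`.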
Use your JWT token as a Bearer token:
curl https://your-domain.com/prometheus/api/v1/query \
-H "Authorization: Bearer YOUR_JWT_TOKEN" \
-d 'query=up'
# Generate a new token using existing secret
retriever generate-token
# Regenerate secret (invalidates all existing tokens)
retriever generate-token --regenerate-secret
The Jaeger collector automatically transforms traces into metrics:
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100us, 1ms, 2ms, 6ms, 10ms, 100ms, 250ms]
    dimensions:
      - name: http.method
      - name: http.status_code
      - name: service_name
Generated Metrics:
- calls_total - Request counter (by service, method, status)
- duration_milliseconds - Latency histogram
Configured in terraform/infrastructure/prometheus/alert_rules.yml:
| Alert | Condition | Threshold | Duration | Severity |
|---|---|---|---|---|
| ServiceError | Any 5xx errors | > 0 req/sec | 30s | critical |
| HighErrorRate | Error percentage | > 5% | 2m | warning |
| HighLatency | P95 latency | > 100ms | 5m | warning |
| CollectorDown | Collector unreachable | N/A | 1m | critical |
| HighRequestRate | Request spike | > 1000 req/sec | 2m | info |
Update Slack webhook in AWS Secrets Manager:
aws secretsmanager update-secret \
--secret-id retriever-slack-webhookurl \
--secret-string "https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
--region us-east-1
retriever deploy --force-recreate
Your App → Collector → OpenSearch (traces)
                     └→ Spanmetrics → Prometheus (metrics)
- Applications send traces to Collector via OTLP (ports 4317/4318)
- Collector processes traces:
- Stores in OpenSearch for long-term retention
- Generates metrics via spanmetrics connector
- Prometheus scrapes metrics from Collector (port 8889)
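To feed this pipeline, an application instrumented with an OpenTelemetry SDK can usually be pointed at the Collector through the standard OTLP environment variables. This is a sketch under assumptions: the service name is a placeholder, and the `https` scheme on the collector ports is a guess, since this README lists only the port numbers. If your setup requires the JWT on ingestion, the standard `OTEL_EXPORTER_OTLP_HEADERS` variable is the usual place to supply it.

```shell
# Standard OpenTelemetry SDK environment variables.
# The service name is a placeholder; the endpoint scheme is an assumption.
export OTEL_SERVICE_NAME="my-service"
export OTEL_TRACES_EXPORTER="otlp"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"   # OTLP over HTTP, port 4318
export OTEL_EXPORTER_OTLP_ENDPOINT="https://your-domain.com:4318"

# For gRPC instead, switch the protocol and port:
# export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
# export OTEL_EXPORTER_OTLP_ENDPOINT="https://your-domain.com:4317"
```

Setting these in the application's environment avoids hard-coding the endpoint, so the same build can target different collectors.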
# View ECS service status
aws ecs list-services --cluster retriever --region us-east-1
# Check if services are running
aws ecs describe-services \
--cluster retriever \
--services rvr_query rvr_auth_proxy rvr_collector \
--region us-east-1 \
--query 'services[*].{Name:serviceName,Running:runningCount,Desired:desiredCount}'
Cause: DNS validation CNAME not added or not propagated
Check validation status:
aws acm describe-certificate \
--certificate-arn your-cert-arn \
--region us-east-1 \
--query 'Certificate.Status'
Fix: Add the CNAME record shown by retriever init to your DNS provider.
Cause: Invalid or expired JWT token
Fix:
# Generate new token
retriever generate-token
# Token is displayed - copy and paste into UI
Verify collector is reachable:
# Test gRPC endpoint with an empty export request
# (grpcurl needs server reflection, or the OTLP .proto files via -proto)
grpcurl -d '{}' \
your-domain.com:4317 \
opentelemetry.proto.collector.trace.v1.TraceService/Export
# Check collector logs for errors
aws logs tail /ecs/rvr_collector --region us-east-1 --since 5m
Check Prometheus is scraping:
# View targets in Prometheus UI
open https://your-domain.com/prometheus/targets
# Check alert evaluation
open https://your-domain.com/prometheus/alerts
Verify AlertManager config:
# View AlertManager config
aws logs tail /ecs/rvr-test-alertmanager --region us-east-1 --since 5m | grep "config"
- Edit terraform/infrastructure/prometheus/alert_rules.yml
- Change threshold values:
  - alert: HighLatency
    expr: histogram_quantile(0.95, rate(duration_milliseconds_bucket[5m])) > 200  # Changed from 100ms
    for: 10m  # Changed from 5m
Edit terraform/infrastructure/prometheus/alert_rules.yml:
  - alert: LowRequestRate
    expr: sum(rate(calls_total[5m])) < 10
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Request rate is unusually low"
      description: "Only {{ $value }} req/sec for 10 minutes"
For different latency profiles, edit terraform/infrastructure/collector/config.yml:
connectors:
  spanmetrics:
    histogram:
      explicit:
        # High-performance APIs (microseconds)
        buckets: [10us, 50us, 100us, 500us, 1ms, 5ms, 10ms]
        # OR typical web APIs (milliseconds); keep only one buckets line active
        # buckets: [10ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]
Edit terraform/infrastructure/collector/config.yml:
connectors:
  spanmetrics:
    dimensions:
      - name: http.method
      - name: http.status_code
      - name: service_name
      - name: http.route  # Add request path
      - name: http.host  # Add hostname
      - name: deployment.environment  # Add environment
⚠️ Warning: Avoid high-cardinality dimensions like user_id, request_id, or trace_id. Each added dimension multiplies the number of metric series by its count of distinct values, which drives up Prometheus memory usage.
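A rough back-of-the-envelope calculation shows why. The sketch below multiplies the number of distinct values per dimension to get an upper bound on label combinations; `series_estimate` is a hypothetical helper and the counts are illustrative, not measurements from a real deployment.

```shell
# Hypothetical helper: multiply the number of distinct values per
# dimension to get an upper bound on label combinations.
series_estimate() {
  total=1
  for n in "$@"; do
    total=$(( total * n ))
  done
  echo "$total"
}

# 10 services x 5 HTTP methods x 6 status codes
series_estimate 10 5 6          # prints 300
# The same, plus a user_id dimension with 10,000 distinct users
series_estimate 10 5 6 10000    # prints 3000000
```

For the latency histogram, each combination is further multiplied by the number of buckets, so the real series count is higher still.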
# Navigate to infrastructure directory
cd terraform/infrastructure
# Destroy all AWS resources
terraform destroy
# Confirm with 'yes' when prompted
What gets deleted:
- ✅ All ECS services and tasks
- ✅ Load balancer and target groups
- ✅ Security groups
- ✅ Service Connect configuration
- ❌ VPC, subnets (not managed by Retriever)
- ❌ ACM certificate (manual deletion required)
- ❌ S3 state bucket (kept for safety)
- ❌ Secrets Manager secrets (kept for safety)
Delete ACM certificate:
aws acm delete-certificate \
--certificate-arn your-cert-arn \
--region us-east-1
Delete Secrets Manager secrets:
# JWT secret
aws secretsmanager delete-secret \
--secret-id retriever/jwt-secret \
--force-delete-without-recovery \
--region us-east-1
# Slack webhook
aws secretsmanager delete-secret \
--secret-id retriever-slack-webhookurl \
--force-delete-without-recovery \
--region us-east-1
Delete S3 state bucket:
# Empty bucket first
aws s3 rm s3://retriever-tfstate-YOUR-ACCOUNT-ID --recursive
# Delete bucket
aws s3 rb s3://retriever-tfstate-YOUR-ACCOUNT-ID
- Jaeger Documentation
- Prometheus Documentation
- OpenTelemetry Specification
- AWS ECS Best Practices
- Terraform AWS Provider Docs
This project is open source and available under the MIT License.
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
For questions or support:
- Open an issue on GitHub
- Check the Troubleshooting section
- Review Terraform logs: terraform/infrastructure/terraform.log
Built with ❤️ for production observability on AWS
