Building Async Notifications Without Breaking Your Monolith
A security-critical feature was missing something obvious
When admins reset users’ authentication methods, those users received no notification.
They’d discover their security settings changed only when they tried to log in.
You know that moment when you realize a “simple” notification has turned into a distributed systems problem? This was mine.
In this issue, I’ll take you through building an event-driven notification system that bridges services without coupling them.
No distributed systems theory—just real code, actual trade-offs, and lessons learned from production.
What you’ll learn:
How to design async notification systems using message queues
Why synchronous notification calls create cascading failures
Patterns for producer-consumer event architectures
Handling failures gracefully without blocking critical operations
The Problem
Here’s what we were dealing with—a security gap that seemed simple to fix:
# In auth-service: admin resets user's 2FA
class TwoFactorResetController < ApplicationController
  def reset
    user = User.find(params[:user_id])
    user.two_factor_methods.destroy_all

    # TODO: notify user their 2FA was reset
    # But how?
    # - Can't make sync HTTP call (might timeout, block critical operation)
    # - Can't access email service directly (different service boundary)
    # - Can't skip notification (security requirement)

    render json: { success: true }
  end
end
Warning Signs:
Critical security operations lacking user notifications
Synchronous HTTP calls blocking admin operations
Service boundaries preventing direct email access
No failure recovery if notifications fail
Compliance requirements for security event notifications
From Synchronous Blocking to Async Events
Why Synchronous Notifications Fail
The naive approach: just make an HTTP call to the notification service.
Before (Synchronous HTTP - WRONG):
class TwoFactorResetController < ApplicationController
  def reset
    user = User.find(params[:user_id])
    user.two_factor_methods.destroy_all

    # Synchronous HTTP call (BAD)
    NotificationClient.send_email(
      to: user.email,
      template: 'two_factor_reset',
      data: { name: user.name }
    )

    render json: { success: true }
  rescue NotificationClient::Error => e
    # What do we do here?
    # - Rollback 2FA reset? (security operation already completed)
    # - Fail the whole request? (admin operation blocked by email issue)
    # - Log and continue? (user never gets notified)
  end
end
Why This Fails:
Admin operations blocked by email service latency/failures
2FA reset and notification tightly coupled
No retry mechanism if email fails
Creates runtime dependency between services
Timeout in notification service breaks security operations
Impact:
Rejected this approach after simulating a notification service outage
Admin operations must never be blocked by notification failures
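Here's roughly what that simulation looked like, as a spec sketch (RSpec assumed; the User attributes are hypothetical). The sync controller rescues the client error but never renders success, so the admin sees a failure even though the 2FA reset already happened:
# Spec sketch: notification timeout fails the whole admin request
RSpec.describe TwoFactorResetController, type: :controller do
  let(:user) { User.create!(email: 'user@example.com', name: 'Jane') }

  it 'fails the admin request when the email service times out' do
    allow(NotificationClient).to receive(:send_email)
      .and_raise(NotificationClient::Error, 'connect timeout')

    post :reset, params: { user_id: user.id }

    # The reset already happened, but the admin never sees success:
    # the worst of both worlds
    expect(user.reload.two_factor_methods).to be_empty
    expect(response).not_to have_http_status(:ok)
  end
end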
Designing the Event-Driven Architecture
The solution: decouple the security operation from notification delivery using events.
The Architecture:
┌─────────────────┐ ┌──────────────┐ ┌─────────────────┐
│ Auth Service │ │ Message Bus │ │ Email Service │
│ │ │ (Kafka) │ │ │
│ Reset 2FA │────────▶│ │────────▶│ Send Email │
│ Publish Event │ │ Topic: │ │ (SendGrid) │
│ │ │ security- │ │ │
└─────────────────┘ │ events │ └─────────────────┘
│ │
│ DLQ: │
│ security- │
│ events-dlq │
└──────────────┘
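One prerequisite before any of this flows: the topics have to exist. If you manage topics in code, ruby-kafka exposes create_topic; here's a setup sketch where the partition and replication counts are placeholders, not recommendations:
# One-time setup sketch: create the main topic and its DLQ
kafka = Kafka.new(
  seed_brokers: ENV['KAFKA_BROKERS'].split(','),
  client_id: 'topic-setup'
)
%w[security_events security_events_dlq].each do |name|
  # environment prefix matches the topics module defined below
  kafka.create_topic("#{Rails.env}_#{name}", num_partitions: 3, replication_factor: 3)
end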
The Producer (Auth Service):
# Define event types
module Events
  module Security
    TWO_FACTOR_RESET = "security.two_factor.reset"
  end
end

# Define Kafka topics
module MessageBus
  module Topics
    def self.by_environment(base_name)
      "#{Rails.env}_#{base_name}"
    end

    SECURITY_EVENTS = by_environment("security_events")
    SECURITY_EVENTS_DLQ = by_environment("security_events_dlq")
  end
end

# Producer for 2FA reset events
class TwoFactorResetProducer
  def self.publish(user_identifier:, email:, role:, name:)
    event = {
      event_type: Events::Security::TWO_FACTOR_RESET,
      timestamp: Time.current.iso8601,
      resource: {
        user_identifier: user_identifier,
        email: email,
        role: role,
        name: name
      }
    }

    kafka_producer.produce(
      event.to_json,
      topic: MessageBus::Topics::SECURITY_EVENTS,
      key: user_identifier.to_s # Kafka message keys must be strings
    )
    true
  rescue => e
    # Log but don't fail the operation
    ErrorLogger.log(
      error: e,
      context: "Failed to publish 2FA reset event",
      user_identifier: user_identifier
    )
    # Return false but don't raise - notification failure
    # shouldn't block security operation
    false
  end

  def self.kafka_producer
    # The async producer buffers messages and delivers them from a
    # background thread; delivery_threshold/delivery_interval make it
    # flush on its own instead of waiting for deliver_messages
    @kafka_producer ||= Kafka.new(
      seed_brokers: ENV['KAFKA_BROKERS'].split(','),
      client_id: 'auth-service'
    ).async_producer(
      delivery_threshold: 100, # flush after 100 buffered messages...
      delivery_interval: 10    # ...or every 10 seconds
    )
  end
  private_class_method :kafka_producer
end
# Updated controller using producer
class TwoFactorResetController < ApplicationController
  def reset
    user = User.find(params[:user_id])
    user.two_factor_methods.destroy_all

    # Publish event asynchronously
    TwoFactorResetProducer.publish(
      user_identifier: user.id,
      email: user.email,
      role: user.role,
      name: user.display_name
    )

    # Admin operation succeeds regardless of notification
    render json: { success: true }
  end
end
Key Design Decisions:
Fire-and-forget publishing: Admin operation never blocks
Graceful degradation: Log failures but continue
Event enrichment: Include all data consumer needs
Environment-specific topics: Prevent dev/prod mixing
Dead letter queue: Capture failed events for investigation
Impact:
Admin operations now take 50ms instead of 200-500ms
Zero coupling between security ops and email delivery
Can retry failed notifications independently
Email service outages don’t affect security operations
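That graceful-degradation guarantee is worth pinning down in a test. A spec sketch (RSpec assumed): a dead broker becomes a false return value, never an exception that would abort the reset:
# Spec sketch: broker outage degrades to `false`, never an exception
RSpec.describe TwoFactorResetProducer do
  it 'returns false instead of raising when Kafka is unreachable' do
    allow(TwoFactorResetProducer).to receive(:kafka_producer)
      .and_raise(Kafka::ConnectionError, 'no brokers available')
    allow(ErrorLogger).to receive(:log)

    result = TwoFactorResetProducer.publish(
      user_identifier: '123',
      email: 'user@example.com',
      role: 'staff',
      name: 'Jane'
    )

    expect(result).to be(false)
    expect(ErrorLogger).to have_received(:log)
  end
end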
Building the Consumer (Email Service)
The consumer transforms security events into email notifications.
# Event handler for 2FA reset
class SecurityEventHandler
  # SendGrid dynamic template IDs, keyed by user role
  ROLE_TEMPLATES = {
    'staff' => '774f9baf-4250-483b-9ebb-4945c4c3b4c6',
    'parent' => '59ce8688-8c83-4caa-b2ac-94b9ad5722ee',
    'student' => 'a1b2c3d4-5678-90ab-cdef-1234567890ab'
  }.freeze
  DEFAULT_TEMPLATE = ROLE_TEMPLATES['staff']

  def handle_two_factor_reset(event)
    resource = event['resource']
    template_id = ROLE_TEMPLATES.fetch(
      resource['role'],
      DEFAULT_TEMPLATE
    )

    EmailSender.send_template(
      to: resource['email'],
      template_id: template_id,
      substitutions: {
        user_name: resource['name']
      }
    )

    AuditLog.create!(
      event_type: 'two_factor_reset_notification_sent',
      user_identifier: resource['user_identifier'],
      email: resource['email'],
      timestamp: Time.current
    )
  rescue => e
    ErrorLogger.log(
      error: e,
      context: "Failed to send 2FA reset email",
      event: event
    )
    # Re-raise to send to DLQ
    raise
  end
end

# Kafka consumer
class SecurityEventsConsumer
  def initialize
    @kafka = Kafka.new(
      seed_brokers: ENV['KAFKA_BROKERS'].split(','),
      client_id: 'email-service'
    )
    @consumer = @kafka.consumer(
      group_id: 'email-service-security-events'
    )
    @consumer.subscribe(MessageBus::Topics::SECURITY_EVENTS)
  end

  def consume
    @consumer.each_message do |message|
      event = JSON.parse(message.value)

      case event['event_type']
      when Events::Security::TWO_FACTOR_RESET
        SecurityEventHandler.new.handle_two_factor_reset(event)
      else
        Rails.logger.warn("Unknown event type: #{event['event_type']}")
      end
    rescue => e # block-level rescue (Ruby 2.6+) keeps the consumer loop alive
      ErrorLogger.log(
        error: e,
        message_offset: message.offset,
        partition: message.partition
      )
      # Send to DLQ
      dlq_producer.produce(
        message.value,
        topic: MessageBus::Topics::SECURITY_EVENTS_DLQ,
        key: message.key
      )
    end
  end

  private

  def dlq_producer
    # delivery_interval makes the async producer flush buffered DLQ
    # messages from its background thread
    @dlq_producer ||= @kafka.async_producer(delivery_interval: 10)
  end
end
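The consumer needs a long-running process to live in. One way to boot it, as a hypothetical rake task (one process per consumer-group member, run under your process supervisor):
# lib/tasks/kafka.rake (hypothetical path)
namespace :kafka do
  desc 'Run the security events consumer (blocks until killed)'
  task security_events: :environment do
    SecurityEventsConsumer.new.consume
  end
end
Starting more processes with the same group_id spreads partitions across them.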
Consumer Features:
Role-based templates: Different emails for different user types
Graceful error handling: Failed messages go to DLQ, not lost
Audit logging: Track notification delivery
Unknown event handling: Log but don’t fail on unexpected events
Consumer groups: Multiple instances process in parallel
Impact:
Email delivery decoupled from security operations
Failed emails can be retried without re-executing security ops
Different services can consume same events for different purposes
Consumer can be scaled independently based on event volume
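The role-template fallback deserves a test too, since an unmapped role should never crash the consumer. A spec sketch (RSpec assumed):
# Spec sketch: an unmapped role gets the default template, not a KeyError
RSpec.describe SecurityEventHandler do
  it 'falls back to the default template for unknown roles' do
    event = {
      'resource' => {
        'user_identifier' => '123',
        'email' => 'user@example.com',
        'role' => 'contractor', # not in ROLE_TEMPLATES
        'name' => 'Jane'
      }
    }
    allow(AuditLog).to receive(:create!)
    expect(EmailSender).to receive(:send_template)
      .with(hash_including(template_id: SecurityEventHandler::DEFAULT_TEMPLATE))

    SecurityEventHandler.new.handle_two_factor_reset(event)
  end
end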
Handling the Feature Flag Dance
The unexpected complexity: coordinating feature flags across services.
# In auth-service: respect Kafka production flag
class TwoFactorResetProducer
  def self.publish(user_identifier:, email:, role:, name:)
    return false unless feature_enabled?
    # ... publish event
  end

  def self.feature_enabled?
    Settings.kafka.event_production_enabled
  end
  # `private` alone doesn't hide class methods, hence private_class_method
  private_class_method :feature_enabled?
end

# In email-service: respect consumer flag
class SecurityEventsConsumer
  def consume
    return unless feature_enabled?
    # ... consume events
  end

  private

  def feature_enabled?
    Settings.kafka.event_consumption_enabled
  end
end
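For reference, the flag config might look like this, assuming the config gem backs that Settings object (each service reads only its own flag; the defaults here are mine):
# config/settings/production.yml (hypothetical)
kafka:
  event_production_enabled: false   # auth-service, flipped in step 4 below
  event_consumption_enabled: false  # email-service, flipped in step 2 below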
Deployment Strategy:
1. Deploy consumer first (deployed, flag off)
2. Enable consumer flag (consuming but no events yet)
3. Deploy producer (deployed, flag off)
4. Enable producer flag (events now flow through system)
5. Monitor for 24 hours
6. Remove flags after stability confirmed
Why This Order Matters:
If you deploy the producer first and enable it, events pile up with no consumer. When the consumer eventually deploys, it has to chew through a huge backlog, potentially causing email storms.
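To catch a forming backlog early, the stock Kafka CLI reports per-partition lag for the consumer group used above:
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group email-service-security-events
A LAG column that only grows means events are being produced but not consumed, which is exactly the failure mode this deploy order prevents.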
The breakthrough: async notifications aren’t about making things faster—they’re about making critical operations reliable by decoupling them from ancillary concerns.
The principle: Security operations succeed or fail based on security logic, not email delivery status.
Some Numbers From This Implementation
After rolling out event-driven notifications:
Admin operation latency: 200-500ms → 50ms (4-10x faster)
Operation success rate: 97% → 99.9% (email failures no longer block)
Email delivery reliability: Unchanged (the problem moved to the consumer; it wasn't eliminated)
System complexity: Increased (added message bus, consumers, DLQ handling)
Debugging difficulty: Higher (distributed traces across services)
Retry capability: None → Full (can replay events from Kafka)
Async vs Sync
Use synchronous notifications when:
Notification failure should block the operation
User needs immediate confirmation
Operations are low-frequency
Services share same deployment boundary
Use async events when:
Operation success independent of notification
Need retry capability
High-frequency operations
Multiple consumers need same event
Services deployed independently
The Complete Flow
1. Admin clicks “Reset 2FA” in UI
2. Auth service resets 2FA methods in database
3. Auth service publishes event to Kafka (fire-and-forget)
4. Admin sees success immediately (operation completed)
5. Kafka persists event to topic
6. Email service consumer picks up event
7. Email service sends notification via SendGrid
8. If email fails: event goes to DLQ for investigation
9. If email succeeds: audit log created
Key Properties:
Steps 1-4 complete in 50ms
Steps 5-9 happen asynchronously
Failures in steps 5-9 don't affect steps 1-4
Can replay/retry steps 5-9 independently
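Here's what that replay looks like in practice, as a minimal one-off script (a sketch, not a hardened tool) that pushes DLQ messages back onto the main topic once the underlying email problem is fixed:
# One-off replay sketch: DLQ messages back onto the main topic
kafka = Kafka.new(
  seed_brokers: ENV['KAFKA_BROKERS'].split(','),
  client_id: 'dlq-replay'
)
producer = kafka.producer # sync producer: replay volume is low
consumer = kafka.consumer(group_id: 'dlq-replay')
consumer.subscribe(MessageBus::Topics::SECURITY_EVENTS_DLQ)

consumer.each_message do |message|
  producer.produce(
    message.value,
    topic: MessageBus::Topics::SECURITY_EVENTS,
    key: message.key
  )
  producer.deliver_messages
end
# each_message blocks forever; stop the script once DLQ lag reaches zero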
Monday Morning Action Items
Quick Wins (5-Minute Changes)
List all synchronous notification calls in critical paths
Identify operations blocked by notification failures
Check if your message bus has dead letter queues configured
Next Steps
Implement one async notification using events
Add dead letter queue handling for failed events
Create monitoring for consumer lag
Document event schemas for team (see the example below)
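For this system, the schema to document is just the shape the producer emits; the example values here are made up:
{
  "event_type": "security.two_factor.reset",
  "timestamp": "2025-01-06T14:32:11Z",
  "resource": {
    "user_identifier": "12345",
    "email": "user@example.com",
    "role": "staff",
    "name": "Jane Doe"
  }
}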
Your Turn!
The Notification Audit Challenge
Analyze your notification patterns:
# Questions to answer
- Which operations block on notification delivery?
- What’s your notification failure rate?
- How do you retry failed notifications?
- Can users recover from missed notifications?
- What’s your average notification latency?
- Do notification failures affect user operations?
Discussion Prompts:
How do you currently handle notification failures?
Have notification issues ever caused operation failures?
What’s your strategy for notification retries?
What’s Next?
Next week: “Dead Letter Queues: The Error Handling Pattern You’re Missing” - Why successful systems need a place to put things that failed, and how to design DLQ processing that actually works.
Useful Resources:
Kafka: The Definitive Guide - Comprehensive Kafka patterns
Enterprise Integration Patterns - Message-based integration patterns
AWS SQS Dead Letter Queues - DLQ concepts
Found this useful? Share it with someone building notification systems. Reply with your async notification patterns—I’m curious what message buses people are actually using in production.
Happy coding!
Tips and Notes:
Note: Async notifications trade immediacy for reliability—make sure that’s the right trade-off for your use case
Pro Tip: Always deploy consumers before producers to avoid event backlogs
Remember: Dead letter queues are not optional—failed messages need somewhere to go for investigation
