Building Async Notifications Without Breaking Your Monolith
A security-critical feature was missing something obvious
When admins reset users’ authentication methods, those users received no notification.
They’d discover their security settings changed only when they tried to log in.
You know that moment when you realize a “simple” notification has turned into a distributed systems problem? This was mine.
In this issue, I’ll take you through building an event-driven notification system that bridges services without coupling them.
No distributed systems theory—just real code, actual trade-offs, and lessons learned from production.
What you’ll learn:
How to design async notification systems using message queues
Why synchronous notification calls create cascading failures
Patterns for producer-consumer event architectures
Handling failures gracefully without blocking critical operations
The Problem
Here’s what we were dealing with—a security gap that seemed simple to fix:
# In auth-service: admin resets user's 2FA
class TwoFactorResetController < ApplicationController
  def reset
    user = User.find(params[:user_id])
    user.two_factor_methods.destroy_all

    # TODO: notify user their 2FA was reset
    # But how?
    # - Can't make sync HTTP call (might timeout, block critical operation)
    # - Can't access email service directly (different service boundary)
    # - Can't skip notification (security requirement)

    render json: { success: true }
  end
end
Warning Signs:
Critical security operations lacking user notifications
Synchronous HTTP calls blocking admin operations
Service boundaries preventing direct email access
No failure recovery if notifications fail
Compliance requirements for security event notifications
From Synchronous Blocking to Async Events
Why Synchronous Notifications Fail
The naive approach: just make an HTTP call to the notification service.
Before (Synchronous HTTP - WRONG):
class TwoFactorResetController < ApplicationController
  def reset
    user = User.find(params[:user_id])
    user.two_factor_methods.destroy_all

    # Synchronous HTTP call (BAD)
    NotificationClient.send_email(
      to: user.email,
      template: 'two_factor_reset',
      data: { name: user.name }
    )

    render json: { success: true }
  rescue NotificationClient::Error => e
    # What do we do here?
    # - Rollback 2FA reset? (security operation already completed)
    # - Fail the whole request? (admin operation blocked by email issue)
    # - Log and continue? (user never gets notified)
  end
end
Why This Fails:
Admin operations blocked by email service latency/failures
2FA reset and notification tightly coupled
No retry mechanism if email fails
Creates runtime dependency between services
Timeout in notification service breaks security operations
Impact:
Rejected this approach after simulating a notification service outage
Admin operations must never be blocked by notification failures
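Here's roughly what that simulation looked like, as a spec sketch (RSpec assumed; the User attributes are hypothetical). The sync controller rescues the client error but never renders success, so the admin sees a failure even though the 2FA reset already happened:
# Spec sketch: notification timeout fails the whole admin request
RSpec.describe TwoFactorResetController, type: :controller do
  let(:user) { User.create!(email: 'user@example.com', name: 'Jane') }

  it 'fails the admin request when the email service times out' do
    allow(NotificationClient).to receive(:send_email)
      .and_raise(NotificationClient::Error, 'connect timeout')

    post :reset, params: { user_id: user.id }

    # The reset already happened, but the admin never sees success:
    # the worst of both worlds
    expect(user.reload.two_factor_methods).to be_empty
    expect(response).not_to have_http_status(:ok)
  end
end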
Designing the Event-Driven Architecture
The solution: decouple the security operation from notification delivery using events.
The Architecture:
┌─────────────────┐ ┌──────────────┐ ┌─────────────────┐
│ Auth Service │ │ Message Bus │ │ Email Service │
│ │ │ (Kafka) │ │ │
│ Reset 2FA │────────▶│ │────────▶│ Send Email │
│ Publish Event │ │ Topic: │ │ (SendGrid) │
│ │ │ security- │ │ │
└─────────────────┘ │ events │ └─────────────────┘
│ │
│ DLQ: │
│ security- │
│ events-dlq │
└──────────────┘
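One prerequisite before any of this flows: the topics have to exist. If you manage topics in code, ruby-kafka exposes create_topic; here's a setup sketch where the partition and replication counts are placeholders, not recommendations:
# One-time setup sketch: create the main topic and its DLQ
kafka = Kafka.new(
  seed_brokers: ENV['KAFKA_BROKERS'].split(','),
  client_id: 'topic-setup'
)
%w[security_events security_events_dlq].each do |name|
  # environment prefix matches the topics module defined below
  kafka.create_topic("#{Rails.env}_#{name}", num_partitions: 3, replication_factor: 3)
end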
The Producer (Auth Service):
# Define event types
module Events
  module Security
    TWO_FACTOR_RESET = "security.two_factor.reset"
  end
end

# Define Kafka topics
module MessageBus
  module Topics
    def self.by_environment(base_name)
      "#{Rails.env}_#{base_name}"
    end

    SECURITY_EVENTS = by_environment("security_events")
    SECURITY_EVENTS_DLQ = by_environment("security_events_dlq")
  end
end

# Producer for 2FA reset events
class TwoFactorResetProducer
  def self.publish(user_identifier:, email:, role:, name:)
    event = {
      event_type: Events::Security::TWO_FACTOR_RESET,
      timestamp: Time.current.iso8601,
      resource: {
        user_identifier: user_identifier,
        email: email,
        role: role,
        name: name
      }
    }

    kafka_producer.produce(
      event.to_json,
      topic: MessageBus::Topics::SECURITY_EVENTS,
      key: user_identifier.to_s # Kafka message keys must be strings
    )
    true
  rescue => e
    # Log but don't fail the operation
    ErrorLogger.log(
      error: e,
      context: "Failed to publish 2FA reset event",
      user_identifier: user_identifier
    )
    # Return false but don't raise - notification failure
    # shouldn't block security operation
    false
  end

  def self.kafka_producer
    # The async producer buffers messages and delivers them from a
    # background thread; delivery_threshold/delivery_interval make it
    # flush on its own instead of waiting for deliver_messages
    @kafka_producer ||= Kafka.new(
      seed_brokers: ENV['KAFKA_BROKERS'].split(','),
      client_id: 'auth-service'
    ).async_producer(
      delivery_threshold: 100, # flush after 100 buffered messages...
      delivery_interval: 10    # ...or every 10 seconds
    )
  end
  private_class_method :kafka_producer
end
# Updated controller using producer
class TwoFactorResetController < ApplicationController
  def reset
    user = User.find(params[:user_id])
    user.two_factor_methods.destroy_all

    # Publish event asynchronously
    TwoFactorResetProducer.publish(
      user_identifier: user.id,
      email: user.email,
      role: user.role,
      name: user.display_name
    )

    # Admin operation succeeds regardless of notification
    render json: { success: true }
  end
end
Key Design Decisions:
Fire-and-forget publishing: Admin operation never blocks
Graceful degradation: Log failures but continue
Event enrichment: Include all data consumer needs
Environment-specific topics: Prevent dev/prod mixing
Dead letter queue: Capture failed events for investigation
Impact:
Admin operations now take 50ms instead of 200-500ms
Zero coupling between security ops and email delivery
Can retry failed notifications independently
Email service outages don’t affect security operations
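That graceful-degradation guarantee is worth pinning down in a test. A spec sketch (RSpec assumed): a dead broker becomes a false return value, never an exception that would abort the reset:
# Spec sketch: broker outage degrades to `false`, never an exception
RSpec.describe TwoFactorResetProducer do
  it 'returns false instead of raising when Kafka is unreachable' do
    allow(TwoFactorResetProducer).to receive(:kafka_producer)
      .and_raise(Kafka::ConnectionError, 'no brokers available')
    allow(ErrorLogger).to receive(:log)

    result = TwoFactorResetProducer.publish(
      user_identifier: '123',
      email: 'user@example.com',
      role: 'staff',
      name: 'Jane'
    )

    expect(result).to be(false)
    expect(ErrorLogger).to have_received(:log)
  end
end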
Building the Consumer (Email Service)
The consumer transforms security events into email notifications.
# Event handler for 2FA reset
class SecurityEventHandler
  # SendGrid dynamic template IDs, keyed by user role
  ROLE_TEMPLATES = {
    'staff' => '774f9baf-4250-483b-9ebb-4945c4c3b4c6',
    'parent' => '59ce8688-8c83-4caa-b2ac-94b9ad5722ee',
    'student' => 'a1b2c3d4-5678-90ab-cdef-1234567890ab'
  }.freeze
  DEFAULT_TEMPLATE = ROLE_TEMPLATES['staff']

  def handle_two_factor_reset(event)
    resource = event['resource']
    template_id = ROLE_TEMPLATES.fetch(
      resource['role'],
      DEFAULT_TEMPLATE
    )

    EmailSender.send_template(
      to: resource['email'],
      template_id: template_id,
      substitutions: {
        user_name: resource['name']
      }
    )

    AuditLog.create!(
      event_type: 'two_factor_reset_notification_sent',
      user_identifier: resource['user_identifier'],
      email: resource['email'],
      timestamp: Time.current
    )
  rescue => e
    ErrorLogger.log(
      error: e,
      context: "Failed to send 2FA reset email",
      event: event
    )
    # Re-raise to send to DLQ
    raise
  end
end

# Kafka consumer
class SecurityEventsConsumer
  def initialize
    @kafka = Kafka.new(
      seed_brokers: ENV['KAFKA_BROKERS'].split(','),
      client_id: 'email-service'
    )
    @consumer = @kafka.consumer(
      group_id: 'email-service-security-events'
    )
    @consumer.subscribe(MessageBus::Topics::SECURITY_EVENTS)
  end

  def consume
    @consumer.each_message do |message|
      event = JSON.parse(message.value)

      case event['event_type']
      when Events::Security::TWO_FACTOR_RESET
        SecurityEventHandler.new.handle_two_factor_reset(event)
      else
        Rails.logger.warn("Unknown event type: #{event['event_type']}")
      end
    rescue => e # block-level rescue (Ruby 2.6+) keeps the consumer loop alive
      ErrorLogger.log(
        error: e,
        message_offset: message.offset,
        partition: message.partition
      )
      # Send to DLQ
      dlq_producer.produce(
        message.value,
        topic: MessageBus::Topics::SECURITY_EVENTS_DLQ,
        key: message.key
      )
    end
  end

  private

  def dlq_producer
    # delivery_interval makes the async producer flush buffered DLQ
    # messages from its background thread
    @dlq_producer ||= @kafka.async_producer(delivery_interval: 10)
  end
end
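The consumer needs a long-running process to live in. One way to boot it, as a hypothetical rake task (one process per consumer-group member, run under your process supervisor):
# lib/tasks/kafka.rake (hypothetical path)
namespace :kafka do
  desc 'Run the security events consumer (blocks until killed)'
  task security_events: :environment do
    SecurityEventsConsumer.new.consume
  end
end
Starting more processes with the same group_id spreads partitions across them.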
Consumer Features:
Role-based templates: Different emails for different user types
Graceful error handling: Failed messages go to DLQ, not lost
Audit logging: Track notification delivery
Unknown event handling: Log but don’t fail on unexpected events
Consumer groups: Multiple instances process in parallel
Impact:
Email delivery decoupled from security operations
Failed emails can be retried without re-executing security ops
Different services can consume same events for different purposes
Consumer can be scaled independently based on event volume
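The role-template fallback deserves a test too, since an unmapped role should never crash the consumer. A spec sketch (RSpec assumed):
# Spec sketch: an unmapped role gets the default template, not a KeyError
RSpec.describe SecurityEventHandler do
  it 'falls back to the default template for unknown roles' do
    event = {
      'resource' => {
        'user_identifier' => '123',
        'email' => 'user@example.com',
        'role' => 'contractor', # not in ROLE_TEMPLATES
        'name' => 'Jane'
      }
    }
    allow(AuditLog).to receive(:create!)
    expect(EmailSender).to receive(:send_template)
      .with(hash_including(template_id: SecurityEventHandler::DEFAULT_TEMPLATE))

    SecurityEventHandler.new.handle_two_factor_reset(event)
  end
end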
Handling the Feature Flag Dance
The unexpected complexity: coordinating feature flags across services.
# In auth-service: respect Kafka production flag
class TwoFactorResetProducer
  def self.publish(user_identifier:, email:, role:, name:)
    return false unless feature_enabled?
    # ... publish event
  end

  def self.feature_enabled?
    Settings.kafka.event_production_enabled
  end
  # `private` alone doesn't hide class methods, hence private_class_method
  private_class_method :feature_enabled?
end

# In email-service: respect consumer flag
class SecurityEventsConsumer
  def consume
    return unless feature_enabled?
    # ... consume events
  end

  private

  def feature_enabled?
    Settings.kafka.event_consumption_enabled
  end
end
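For reference, the flag config might look like this, assuming the config gem backs that Settings object (each service reads only its own flag; the defaults here are mine):
# config/settings/production.yml (hypothetical)
kafka:
  event_production_enabled: false   # auth-service, flipped in step 4 below
  event_consumption_enabled: false  # email-service, flipped in step 2 below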
Deployment Strategy:
1. Deploy consumer first (deployed, flag off)
2. Enable consumer flag (consuming but no events yet)
3. Deploy producer (deployed, flag off)
4. Enable producer flag (events now flow through system)
5. Monitor for 24 hours
6. Remove flags after stability confirmed
Why This Order Matters:
If you deploy the producer first and enable it, events pile up with no consumer. When the consumer eventually deploys, it has to chew through a huge backlog, potentially causing email storms.
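To catch a forming backlog early, the stock Kafka CLI reports per-partition lag for the consumer group used above:
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group email-service-security-events
A LAG column that only grows means events are being produced but not consumed, which is exactly the failure mode this deploy order prevents.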
The breakthrough: async notifications aren’t about making things faster—they’re about making critical operations reliable by decoupling them from ancillary concerns.
The principle: Security operations succeed or fail based on security logic, not email delivery status.
Some Numbers From This Implementation
After rolling out event-driven notifications:
Admin operation latency: 200-500ms → 50ms (4-10x faster)
Operation success rate: 97% → 99.9% (email failures no longer block)
Email delivery reliability: Unchanged (the problem moved to the consumer; it wasn't eliminated)
System complexity: Increased (added message bus, consumers, DLQ handling)
Debugging difficulty: Higher (distributed traces across services)
Retry capability: None → Full (can replay events from Kafka)
Async vs Sync
Use synchronous notifications when:
Notification failure should block the operation
User needs immediate confirmation
Operations are low-frequency
Services share same deployment boundary
Use async events when:
Operation success independent of notification
Need retry capability
High-frequency operations
Multiple consumers need same event
Services deployed independently
The Complete Flow
1. Admin clicks “Reset 2FA” in UI
2. Auth service resets 2FA methods in database
3. Auth service publishes event to Kafka (fire-and-forget)
4. Admin sees success immediately (operation completed)
5. Kafka persists event to topic
6. Email service consumer picks up event
7. Email service sends notification via SendGrid
8. If email fails: event goes to DLQ for investigation
9. If email succeeds: audit log created
Key Properties:
Steps 1-4 complete in 50ms
Steps 5-9 happen asynchronously
Failures in steps 5-9 don't affect steps 1-4
Can replay/retry steps 5-9 independently
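Here's what that replay looks like in practice, as a minimal one-off script (a sketch, not a hardened tool) that pushes DLQ messages back onto the main topic once the underlying email problem is fixed:
# One-off replay sketch: DLQ messages back onto the main topic
kafka = Kafka.new(
  seed_brokers: ENV['KAFKA_BROKERS'].split(','),
  client_id: 'dlq-replay'
)
producer = kafka.producer # sync producer: replay volume is low
consumer = kafka.consumer(group_id: 'dlq-replay')
consumer.subscribe(MessageBus::Topics::SECURITY_EVENTS_DLQ)

consumer.each_message do |message|
  producer.produce(
    message.value,
    topic: MessageBus::Topics::SECURITY_EVENTS,
    key: message.key
  )
  producer.deliver_messages
end
# each_message blocks forever; stop the script once DLQ lag reaches zero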
Monday Morning Action Items
Quick Wins (5-Minute Changes)
List all synchronous notification calls in critical paths
Identify operations blocked by notification failures
Check if your message bus has dead letter queues configured
Next Steps
Implement one async notification using events
Add dead letter queue handling for failed events
Create monitoring for consumer lag
Document event schemas for team (see the example below)
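For this system, the schema to document is just the shape the producer emits; the example values here are made up:
{
  "event_type": "security.two_factor.reset",
  "timestamp": "2025-01-06T14:32:11Z",
  "resource": {
    "user_identifier": "12345",
    "email": "user@example.com",
    "role": "staff",
    "name": "Jane Doe"
  }
}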
Your Turn!
The Notification Audit Challenge
Analyze your notification patterns:
# Questions to answer
- Which operations block on notification delivery?
- What’s your notification failure rate?
- How do you retry failed notifications?
- Can users recover from missed notifications?
- What’s your average notification latency?
- Do notification failures affect user operations?
Discussion Prompts:
How do you currently handle notification failures?
Have notification issues ever caused operation failures?
What’s your strategy for notification retries?
What’s Next?
Next week: “Dead Letter Queues: The Error Handling Pattern You’re Missing” - Why successful systems need a place to put things that failed, and how to design DLQ processing that actually works.
Useful Resources:
Kafka: The Definitive Guide - Comprehensive Kafka patterns
Enterprise Integration Patterns - Message-based integration patterns
AWS SQS Dead Letter Queues - DLQ concepts
Found this useful? Share it with someone building notification systems. Reply with your async notification patterns—I’m curious what message buses people are actually using in production.
Happy coding!
Tips and Notes:
Note: Async notifications trade immediacy for reliability—make sure that’s the right trade-off for your use case
Pro Tip: Always deploy consumers before producers to avoid event backlogs
Remember: Dead letter queues are not optional—failed messages need somewhere to go for investigation
