
feat: implement durable queue for redis queues that needs persistence in PG #5380

Merged
maidul98 merged 17 commits into main from feat/durable-queue on Feb 10, 2026

Conversation

@akhilmhdh
Member

Context

This PR implements Postgres backup and recovery for Redis queues whose jobs must survive a Redis wipe. The queues that currently need this are:

  1. Dynamic secret leasing
  2. PAM session expiration

This PR completely removes the pg-boss dependency and migrates these jobs back to Redis queues.

There are two new queue jobs:

  1. Recovery job: runs on container boot-up. It checks whether every pending or failed job is present in Redis. Failed jobs are included so that they can be retried.
  2. Reconcile job: stops a job that has been running too long and re-queues it. As of now, no queue has jobs that run that long.

Every BullMQ cron job has a key parameter. This ensures we can alter the timing window later.
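
The selection step of the recovery job described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `PersistedJob` shape, the status names, and the `selectJobsToRecover` helper are hypothetical, and the set of Redis job IDs is assumed to have been fetched beforehand.

```typescript
// Hypothetical shapes: the real queue_jobs schema is not shown in this description.
type PersistedStatus = "pending" | "failed" | "completed" | "dead";

interface PersistedJob {
  id: string;
  status: PersistedStatus;
}

// Pick the pending/failed jobs recorded in Postgres that Redis no longer holds.
// Failed jobs are included so that they get another attempt.
function selectJobsToRecover(
  persistedJobs: PersistedJob[],
  redisJobIds: Set<string>
): PersistedJob[] {
  return persistedJobs.filter(
    (job) =>
      (job.status === "pending" || job.status === "failed") &&
      !redisJobIds.has(job.id)
  );
}

// Example: Redis was wiped and only "lease-3" has been re-added so far.
const persistedJobs: PersistedJob[] = [
  { id: "lease-1", status: "pending" },
  { id: "lease-2", status: "failed" },
  { id: "lease-3", status: "pending" },
  { id: "lease-4", status: "completed" },
];
const jobsToRecover = selectJobsToRecover(persistedJobs, new Set(["lease-3"]));
console.log(jobsToRecover.map((job) => job.id));
```

Completed jobs and jobs already present in Redis are skipped, so running the selection more than once does not enqueue duplicates.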

Screenshots

Steps to verify the change

  1. Create a dynamic secret and generate a lease
  2. Wipe Redis and boot up again
  3. The lease removal job should still run as scheduled
  4. Likewise for PAM session expiration

Type

  • Fix
  • Feature
  • Improvement
  • Breaking
  • Docs
  • Chore

Checklist

  • Title follows the conventional commit format: type(scope): short description (scope is optional, e.g., fix: prevent crash on sync or fix(api): handle null response).
  • Tested locally
  • Updated docs (if needed)
  • Read the contributing guide

@greptile-apps
Contributor

greptile-apps bot commented Feb 5, 2026

Greptile Overview

Greptile Summary

This PR implements Postgres-backed recovery for Redis queues, migrating away from pg-boss to pure BullMQ with a persistence layer. The implementation adds a queue_jobs table to track critical jobs (dynamic secret revocation, PAM session expiration) that need to survive Redis data loss.

Key Changes:

  • Added queue_jobs table with status tracking, heartbeat monitoring, and retry logic
  • Implemented startup recovery job (runs 2 minutes after boot) to restore pending/failed jobs from Postgres
  • Added reconciliation job (runs every 2 minutes) to detect and re-queue stuck jobs
  • Migrated dynamic secret revocation and PAM session expiration queues to use persistence
  • Added queue job pruning to daily cleanup routine
  • Removed all pg-boss dependencies and related code

Architecture:
The system maintains dual state: Redis (active execution) and Postgres (persistence/recovery). Jobs are written to Postgres before Redis, with lifecycle updates tracked via event listeners. Recovery checks for jobs missing from Redis and re-queues them with recalculated delays.
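
The write ordering and the recalculated delay mentioned above can be sketched as follows. This is an illustration under stated assumptions: the `DurableJob`, `JobStore`, and `JobQueue` interfaces and both function names are hypothetical stand-ins, not the PR's API or BullMQ's.

```typescript
// Hypothetical job record: runAt is the time the job was originally scheduled for.
interface DurableJob {
  id: string;
  runAt: Date;
}

// Hypothetical stand-ins for the Postgres DAL and the Redis-backed queue.
interface JobStore {
  insert(job: DurableJob): Promise<void>;
}
interface JobQueue {
  add(job: DurableJob, delayMs: number): Promise<void>;
}

// Remaining delay for a (re-)queued job. It is clamped at zero so that a job
// whose scheduled time has already passed runs immediately.
function recalculateDelayMs(runAt: Date, now: Date): number {
  return Math.max(0, runAt.getTime() - now.getTime());
}

// Persist first, then enqueue: if the process dies between the two awaits,
// the Postgres row still exists and recovery can re-queue the job later.
async function enqueueDurably(
  store: JobStore,
  queue: JobQueue,
  job: DurableJob,
  now: Date = new Date()
): Promise<void> {
  await store.insert(job);
  await queue.add(job, recalculateDelayMs(job.runAt, now));
}
```

Computing the delay from the stored `runAt` rather than reusing the original delay keeps a recovered job on its original schedule instead of restarting the full wait.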

Issues Found:

  • Migration not fully idempotent - could fail on retry with duplicate key errors
  • 2-minute startup recovery delay could impact security-critical credential revocations
  • No alerting when stuck jobs are marked as dead
  • Sequential recovery could be slow for large job backlogs

Confidence Score: 4/5

  • Safe to merge with minor concerns around migration idempotency and recovery timing for security-critical operations
  • The implementation is well-structured with proper persistence mechanisms for critical queues. The main concern is the migration's lack of full idempotency which could cause issues on retry. The 2-minute recovery delay could impact security-critical operations like credential revocation. Otherwise, the queue recovery and reconciliation logic is solid.
  • backend/src/db/migrations/20260203141935_durable-queue.ts requires fixing for full idempotency before deployment to production

Important Files Changed

Filename: Overview

  • backend/src/db/migrations/20260203141935_durable-queue.ts: Creates queue_jobs table and migrates pgboss jobs. Migration not fully idempotent - uses hasTable check but doesn't handle duplicate inserts or cleanup existing records.
  • backend/src/queue/queue-service.ts: Implements postgres-backed queue recovery with startup recovery (2 min delay) and reconciliation (every 2 min). Adds persistence layer with heartbeat tracking for long-running jobs.
  • backend/src/queue/queue-jobs-dal.ts: Data access layer for queue jobs with stuck job detection using COALESCE(lastHeartBeat, updatedAt) and pruning of completed/dead jobs in batches of 10k.
  • backend/src/ee/services/dynamic-secret-lease/dynamic-secret-lease-queue.ts: Converted to use persistence-backed queue for lease revocation. Implements retry logic with exponential backoff and email alerts after max retries.
  • backend/src/services/pam-session-expiration/pam-session-expiration-queue.ts: Converted to use persistence-backed queue for session expiration. Simple job that expires PAM sessions at scheduled time.
  • backend/src/services/resource-cleanup/resource-cleanup-queue.ts: Migrated from pg-boss to BullMQ. Added queue jobs pruning to daily cleanup. Removed pg-boss specific code and simplified queue scheduling.
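
The stuck-job check noted for queue-jobs-dal.ts, COALESCE(lastHeartBeat, updatedAt), can be expressed in application code roughly as follows. This is a sketch only: the `RunningJob` shape, the `isStuck` helper, and the threshold value are assumptions, and the real check runs as a SQL query in the DAL.

```typescript
// Hypothetical row shape. lastHeartBeat stays null until a worker reports progress.
interface RunningJob {
  id: string;
  lastHeartBeat: Date | null;
  updatedAt: Date;
}

// A job counts as stuck when its most recent sign of life is older than the
// threshold. The `??` fallback mirrors COALESCE(lastHeartBeat, updatedAt).
function isStuck(job: RunningJob, now: Date, thresholdMs: number): boolean {
  const lastSeen = job.lastHeartBeat ?? job.updatedAt;
  return now.getTime() - lastSeen.getTime() > thresholdMs;
}

// Example with an assumed 5-minute threshold (not a value taken from the PR).
const FIVE_MINUTES_MS = 5 * 60 * 1000;
const checkTime = new Date("2026-02-05T12:10:00Z");
const neverBeat: RunningJob = {
  id: "job-1",
  lastHeartBeat: null,
  updatedAt: new Date("2026-02-05T12:00:00Z"),
};
console.log(isStuck(neverBeat, checkTime, FIVE_MINUTES_MS));
```

Falling back to `updatedAt` means a job that never sent a heartbeat is still measured from the last time its row changed, rather than being skipped.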

Contributor

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 7 comments


@maidul98
Collaborator

maidul98 commented Feb 5, 2026

Snyk checks have passed. No issues have been found so far.

Scanner: Open Source Security | Critical: 0 | High: 0 | Medium: 0 | Low: 0 | Total: 0 issues


@akhilmhdh akhilmhdh requested a review from fangpenlin February 5, 2026 20:16
Contributor

@fangpenlin fangpenlin left a comment

I haven't had a chance to run it locally and go through it carefully yet, but I'd like to share some feedback and questions from my preliminary review.

@akhilmhdh akhilmhdh force-pushed the feat/durable-queue branch 2 times, most recently from bd4fa28 to 326f387 Compare February 7, 2026 07:02
fangpenlin previously approved these changes Feb 9, 2026
@maidul98 maidul98 changed the title from "feat: implemented queue postgres recovery for redis queues that needs persistence" to "feat: implement durable queue for redis queues that needs persistence in PG" Feb 10, 2026
@maidul98 maidul98 merged commit 59af9f3 into main Feb 10, 2026
11 of 13 checks passed

3 participants