
serializedPoll in PostgresMessageQueue.listen() permanently stalls if handler hangs #595

@dahlia

Description


In PostgresMessageQueue.listen(), the serializedPoll mechanism chains every poll() invocation onto a single promise (pollLock). If a poll() call never resolves—because the message handler hangs indefinitely on a network request or other I/O—then all subsequent poll() invocations are chained onto the pending promise and also never execute. This permanently halts all message processing for that instance, even though the process remains alive and healthy.

Relevant code

https://github.com/fedify-dev/fedify/blob/2.0-maintenance/packages/postgres/src/mq.ts#L321-L326

```typescript
let pollLock: Promise<void> = Promise.resolve();
const serializedPoll = () => {
  // Chain this poll() behind whatever poll() is already pending…
  const next = pollLock.then(poll);
  // …and swallow rejections so an *error* cannot poison the chain.
  // A poll() that never settles, however, blocks the chain forever.
  pollLock = next.catch(() => {});
  return next;
};
```
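The failure mode can be reproduced in isolation. The following sketch uses the same chaining logic, parameterized so that stand-in poll() functions can be injected (the real serializedPoll takes no arguments); a single never-settling poll() blocks every later call:

```typescript
let pollLock: Promise<void> = Promise.resolve();
const ran: number[] = [];

// poll() stand-ins: each records its id; a "hung" one never settles,
// simulating a handler stuck on a network request.
function makePoll(id: number, hang: boolean): () => Promise<void> {
  return () => {
    ran.push(id);
    return hang ? new Promise<void>(() => {}) : Promise.resolve();
  };
}

// Same chaining as mq.ts, parameterized for the demo.
const serializedPoll = (poll: () => Promise<void>) => {
  const next = pollLock.then(poll);
  pollLock = next.catch(() => {});
  return next;
};

void serializedPoll(makePoll(1, false)); // runs and completes
void serializedPoll(makePoll(2, true));  // runs, then never settles
void serializedPoll(makePoll(3, false)); // chained behind 2: never runs

setTimeout(() => console.log(ran.join(",")), 100); // prints "1,2"
```

Poll 3 is enqueued and healthy, but because it is chained onto poll 2's forever-pending promise, it never executes; this is exactly the production stall, minus the database.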

Inside poll(), the handler is awaited with no timeout:

```typescript
// Line 215 (messages without ordering key):
await handler(row.message);

// Line 288 (messages with ordering key):
await handler(row.message);
```

The handler (processQueuedTask in Fedify's federation middleware) performs network I/O such as sending activities to remote inboxes and fetching ActivityPub documents. A single hung network request blocks the entire queue permanently.

Impact observed in production

We run 3 parallel instances of our application (Hackers' Pub), all sharing the same PostgreSQL message queue. The following sequence occurred:

  1. All three instances' poll() functions hung at different times (likely on unresponsive remote servers)
  2. The serializedPoll chain in each instance became permanently blocked
  3. For 3.5 hours, no messages were processed while ~2,000 messages accumulated in the queue
  4. The processes remained alive — health checks (/nodeinfo/2.1) passed, HTTP requests were served, and PostgreSQL LISTEN connections stayed active
  5. Processing only resumed after the processes crashed due to a separate bug (#594, "Unhandled Temporal.Duration.from() error in NOTIFY callback crashes the process") and were restarted by the container runtime

Additional consequences when handler hangs

  • The advisory lock acquired on the ordering key (lines 270–275) is held indefinitely, blocking other instances from processing messages with the same ordering key
  • The reserved database connection (line 267, this.#sql.reserve()) is held indefinitely, reducing the available connection pool

Suggested fixes

Option A: Add a timeout to handler execution

```typescript
const HANDLER_TIMEOUT_MS = 60_000; // or make configurable

// Inside poll():
let timer: ReturnType<typeof setTimeout> | undefined;
try {
  await Promise.race([
    handler(row.message),
    new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error("Handler timed out")),
        HANDLER_TIMEOUT_MS,
      );
    }),
  ]);
} finally {
  clearTimeout(timer); // avoid leaking a pending timer on the happy path
}
```

This ensures that a hung handler no longer blocks the queue: after the timeout, poll() completes with an error (caught by safeSerializedPoll) and the next poll cycle can proceed.
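One caveat: Promise.race only stops awaiting the handler; the underlying network request keeps running. A more complete fix would thread an AbortSignal into the handler so the timeout actually cancels the I/O. This would require a handler API change (hypothetical, not the current Fedify signature), but the mechanism itself is standard:

```typescript
// Standalone sketch: AbortSignal.timeout() cancels a hung operation.
// `hangForever` stands in for a handler stuck on a network request;
// a real fix would pass the signal into the handler's fetch() calls.
function hangForever(signal: AbortSignal): Promise<void> {
  return new Promise((_, reject) => {
    signal.addEventListener("abort", () => reject(signal.reason), {
      once: true,
    });
  });
}

async function run(): Promise<string> {
  const signal = AbortSignal.timeout(50); // aborts after 50 ms
  try {
    await hangForever(signal);
    return "completed";
  } catch {
    return "timed out";
  }
}

run().then((result) => console.log(result)); // prints "timed out"
```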

Option B: Replace promise chaining with a mutex that supports timeout/cancellation

Instead of chaining promises, use a lock mechanism that allows the interval-based poll to proceed independently if the previous poll is still running beyond a threshold.
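One possible shape for such a lock (names and semantics are illustrative, not an existing Fedify or library API): a gate that admits a new poll when none is running, or when the running poll has exceeded a staleness threshold and is presumed stalled.

```typescript
// Illustrative sketch of Option B semantics: a stale-aware poll gate.
class PollGate {
  #busySince: number | null = null;
  constructor(private readonly staleMs: number) {}

  /** Try to start a poll at time `now`; false if a fresh poll is in flight. */
  tryEnter(now: number = Date.now()): boolean {
    if (this.#busySince !== null && now - this.#busySince < this.staleMs) {
      return false; // a recent poll is still running
    }
    this.#busySince = now; // fresh start, or takeover of a presumed-stalled poll
    return true;
  }

  /** Mark the current poll as finished. */
  exit(): void {
    this.#busySince = null;
  }
}

// A poll stuck for longer than staleMs no longer blocks new polls.
const gate = new PollGate(60_000);
console.log(gate.tryEnter(0));      // true  — first poll starts
console.log(gate.tryEnter(10_000)); // false — previous poll still fresh
console.log(gate.tryEnter(70_000)); // true  — previous poll presumed stalled
```

Unlike the promise chain, the worst case here is a delayed poll, not a permanently halted one.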

Option C: Decouple the LISTEN-triggered poll from the interval poll

Allow the interval poll (lines 359–371) to bypass the serializedPoll chain when it detects that the chain has been stalled for too long.
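The stall check itself could be quite small. In this illustrative sketch (names are assumptions, not existing code), `chainBusySince` would be recorded when a chained poll() starts and cleared when it settles; the interval callback then runs an unchained poll() whenever the chain is stuck:

```typescript
// Illustrative sketch of Option C's stall detection.
const STALL_THRESHOLD_MS = 120_000;

function chainIsStalled(
  chainBusySince: number | null, // when the in-flight poll() started, or null
  now: number,
): boolean {
  return chainBusySince !== null && now - chainBusySince > STALL_THRESHOLD_MS;
}

// In the interval callback (pseudocode):
//   if (chainIsStalled(chainBusySince, Date.now())) void poll();   // bypass
//   else void serializedPoll();                                    // normal path

console.log(chainIsStalled(null, 0));    // false — nothing in flight
console.log(chainIsStalled(0, 60_000));  // false — within threshold
console.log(chainIsStalled(0, 180_000)); // true  — stalled, bypass the chain
```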

Environment

  • @fedify/postgres used with Fedify 2.0.2
  • PostgreSQL 17
  • Deno runtime
  • 3 application instances behind a Caddy load balancer, sharing the same database
