Description
In PostgresMessageQueue.listen(), the serializedPoll mechanism chains every poll() invocation onto a single promise (pollLock). If a poll() call never resolves—because the message handler hangs indefinitely on a network request or other I/O—then all subsequent poll() invocations are chained onto the pending promise and also never execute. This permanently halts all message processing for that instance, even though the process remains alive and healthy.
Relevant code
https://github.com/fedify-dev/fedify/blob/2.0-maintenance/packages/postgres/src/mq.ts#L321-L326
```typescript
let pollLock: Promise<void> = Promise.resolve();
const serializedPoll = () => {
  const next = pollLock.then(poll);
  pollLock = next.catch(() => {});
  return next;
};
```

Inside `poll()`, the handler is awaited with no timeout:
```typescript
// Line 215 (messages without ordering key):
await handler(row.message);

// Line 288 (messages with ordering key):
await handler(row.message);
```

The handler (`processQueuedTask` in Fedify's federation middleware) performs network I/O such as sending activities to remote inboxes and fetching ActivityPub documents. A single hung network request blocks the entire queue permanently.
Impact observed in production
We run 3 parallel instances of our application (Hackers' Pub), all sharing the same PostgreSQL message queue. The following sequence occurred:
- All three instances' `poll()` functions hung at different times (likely on unresponsive remote servers)
- The `serializedPoll` chain in each instance became permanently blocked
- For 3.5 hours, no messages were processed despite ~2,000 messages accumulating in the queue
- The processes remained alive: health checks (`/nodeinfo/2.1`) passed, HTTP requests were served, and PostgreSQL `LISTEN` connections stayed active
- Processing only resumed after the processes crashed due to a separate bug (see "Unhandled `Temporal.Duration.from()` error in `NOTIFY` callback crashes the process" #594) and were restarted by the container runtime
Additional consequences when handler hangs
- The advisory lock acquired on the ordering key (lines 270–275) is held indefinitely, blocking other instances from processing messages with the same ordering key
- The reserved database connection (line 267, `this.#sql.reserve()`) is held indefinitely, reducing the available connection pool
Suggested fixes
Option A: Add a timeout to handler execution
```typescript
const HANDLER_TIMEOUT_MS = 60_000; // or make configurable

// Inside poll():
let timer: ReturnType<typeof setTimeout>;
try {
  await Promise.race([
    handler(row.message),
    new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error("Handler timed out")),
        HANDLER_TIMEOUT_MS,
      );
    }),
  ]);
} finally {
  clearTimeout(timer!); // avoid leaking the timer when the handler wins the race
}
```

Note that `Promise.race` does not cancel the underlying handler; a hung request keeps running in the background. But `poll()` stops waiting for it after the timeout, so it completes (with an error caught by `safeSerializedPoll`) and the next poll cycle can proceed.
Option B: Replace promise chaining with a mutex that supports timeout/cancellation
Instead of chaining promises, use a lock mechanism that allows the interval-based poll to proceed independently if the previous poll is still running beyond a threshold.
Option C: Decouple the LISTEN-triggered poll from the interval poll
Allow the interval poll (lines 359–371) to bypass the serializedPoll chain when it detects the chain has been stalled for too long.
Environment
- `@fedify/postgres` used with Fedify 2.0.2
- PostgreSQL 17
- Deno runtime
- 3 application instances behind a Caddy load balancer, sharing the same database