Description
Problem
The delivery queue stores failed deliveries as individual JSON files in a delivery-queue/failed/ subdirectory. Successful deliveries are deleted entirely (unlink). This design has several concrete problems:
Silent message loss is undetectable
When a delivery succeeds, the queue file is deleted, leaving no record that it ever existed. If the AI generates a message but the delivery fails without ever entering the queue (e.g. the enqueueDelivery call itself fails, and the error is caught and swallowed by .catch(() => null) in deliver.ts), the message is lost with zero trace. The operator has no way to know that messages are being dropped.
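To illustrate the alternative to swallowing the error, here is a minimal sketch of a wrapper that reports an enqueue failure instead of discarding it. Only enqueueDelivery and the .catch(() => null) pattern come from the code described above; the `Delivery` type, `enqueueWithTrace`, and the `onFailure` callback are hypothetical names used for illustration.

```typescript
// Hypothetical shapes: the real types in deliver.ts may differ.
type Delivery = { id: string; channel: string; payload: string };

type EnqueueResult =
  | { ok: true; id: string }
  | { ok: false; id: string; error: string };

// Wraps an enqueue function so that a failure is reported to the caller
// and to an observer callback, rather than being replaced with null.
async function enqueueWithTrace(
  delivery: Delivery,
  enqueue: (d: Delivery) => Promise<void>,
  onFailure: (failure: { id: string; error: string }) => void,
): Promise<EnqueueResult> {
  try {
    await enqueue(delivery);
    return { ok: true, id: delivery.id };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    onFailure({ id: delivery.id, error });
    return { ok: false, id: delivery.id, error };
  }
}
```

The `onFailure` hook is where a log line, metric, or durable record could be emitted, so a dropped message at least leaves a trace.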
Failed deliveries accumulate with no visibility
The failed/ directory grows without bound, one JSON file per failed delivery. There is no TTL, no pruning, and no summary view. An operator cannot answer "how many messages failed to deliver this week?" or "which channel has the most failures?" without writing custom scripts to parse the directory. In practice, operators don't check this directory, so failures accumulate silently.
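To make the gap concrete, this is roughly the kind of ad-hoc script an operator would have to write today to count failures per channel. It is a sketch under assumptions: the fields of the failed-delivery JSON files are not listed here, so the `channel` field and the `countFailuresByChannel` name are hypothetical.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical record shape: `channel` is an assumed field name.
interface FailedEntry {
  channel?: string;
}

// Scans a failed/ directory and counts JSON files per channel.
// Files that cannot be parsed are counted under "unparseable".
function countFailuresByChannel(failedDir: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const name of fs.readdirSync(failedDir)) {
    if (!name.endsWith(".json")) continue;
    let channel = "unknown";
    try {
      const raw = fs.readFileSync(path.join(failedDir, name), "utf8");
      const entry = JSON.parse(raw) as FailedEntry;
      if (typeof entry.channel === "string") channel = entry.channel;
    } catch {
      channel = "unparseable";
    }
    counts.set(channel, (counts.get(channel) ?? 0) + 1);
  }
  return counts;
}
```

Every such query requires a full directory scan and a parse of each file, which is part of why an indexed store is preferable.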
No correlation between inbound and outbound
There is no link between a user's inbound message and the outbound delivery attempt(s) it triggered. When a user reports "I sent a message and never got a reply," the operator must: (1) grep structured logs for the inbound message, (2) find the corresponding AI run, (3) check if a delivery was enqueued, (4) check if the queue file still exists or was moved to failed/, (5) read the JSON file to see the error. This manual multi-step process makes diagnosing delivery issues impractical.
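A correlation key carried from the inbound message to every delivery attempt would collapse the five manual steps above into a single lookup. The following is a minimal in-memory sketch of that idea; `inboundMessageId`, `DeliveryAttempt`, and `CorrelationIndex` are hypothetical names, not part of the existing code.

```typescript
// All names here are hypothetical; no schema is defined in this issue.
type DeliveryStatus = "pending" | "delivered" | "failed";

interface DeliveryAttempt {
  inboundMessageId: string; // correlation key copied from the inbound message
  deliveryId: string;
  status: DeliveryStatus;
  error?: string;
  attemptedAt: number; // epoch milliseconds
}

class CorrelationIndex {
  private byInbound = new Map<string, DeliveryAttempt[]>();

  // Stores an attempt under the inbound message that triggered it.
  record(attempt: DeliveryAttempt): void {
    const list = this.byInbound.get(attempt.inboundMessageId) ?? [];
    list.push(attempt);
    this.byInbound.set(attempt.inboundMessageId, list);
  }

  // Returns all attempts for one inbound message, oldest first.
  // An empty result means no delivery was ever recorded for it.
  attemptsFor(inboundMessageId: string): DeliveryAttempt[] {
    const list = this.byInbound.get(inboundMessageId) ?? [];
    return [...list].sort((a, b) => a.attemptedAt - b.attemptedAt);
  }
}
```

With such an index, "I sent a message and never got a reply" becomes one query: either attempts exist with a status and error, or none were recorded at all, which is itself a diagnostic signal.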
Retry state doesn't survive crashes
The queue file stores retryCount and lastAttemptAt as fields in a JSON file that is read-modify-written on each attempt. If the gateway crashes between reading the file and writing the updated version, the retry count is lost and the entry is replayed from its previous state — potentially re-delivering a message or resetting the backoff.
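If the file-based queue were kept, one way to narrow this window would be to persist the incremented retry count before the send is attempted, and to write it via a temporary file plus rename so that a crash leaves either the old or the new version on disk, never a partial one. This is a sketch, not the existing implementation: `QueueEntry`, `writeEntryAtomic`, and `recordAttempt` are hypothetical, and the atomicity of rename assumes a POSIX filesystem with both paths on the same volume.

```typescript
import * as fs from "node:fs";

// Hypothetical shape; only retryCount and lastAttemptAt are named in this issue.
interface QueueEntry {
  id: string;
  retryCount: number;
  lastAttemptAt: number | null; // epoch milliseconds
}

// Writes to a temp file next to the target, flushes it to disk,
// then renames it over the target in a single step.
function writeEntryAtomic(filePath: string, entry: QueueEntry): void {
  const tmpPath = `${filePath}.${process.pid}.tmp`;
  const fd = fs.openSync(tmpPath, "w");
  try {
    fs.writeSync(fd, JSON.stringify(entry));
    fs.fsyncSync(fd);
  } finally {
    fs.closeSync(fd);
  }
  fs.renameSync(tmpPath, filePath);
}

// Call this BEFORE attempting the send, so a crash during the send
// still leaves the attempt counted and the backoff intact.
function recordAttempt(filePath: string, now: number): QueueEntry {
  const current = JSON.parse(fs.readFileSync(filePath, "utf8")) as QueueEntry;
  const next: QueueEntry = {
    ...current,
    retryCount: current.retryCount + 1,
    lastAttemptAt: now,
  };
  writeEntryAtomic(filePath, next);
  return next;
}
```

This preserves the backoff state across a crash but does not by itself prevent a duplicate send; that would need an idempotency key or a transactional store.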
Expected behavior
Delivery attempts and their outcomes should be stored in a queryable, indexed store with structured fields (status, error text, retry count, timestamps). Successful deliveries should retain a record (not be deleted) so operators can audit the full delivery history. The store should be prunable by age to prevent unbounded growth.
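The expected behavior can be sketched as a small store interface. The version below is an in-memory stand-in meant only to show the record shape and operations; a real implementation would presumably sit on an indexed database. All names (`DeliveryRecord`, `DeliveryStore`, and its methods) are hypothetical.

```typescript
type Status = "pending" | "delivered" | "failed";

interface DeliveryRecord {
  id: string;
  channel: string;
  status: Status;
  errorText: string | null;
  retryCount: number;
  createdAt: number; // epoch milliseconds
  updatedAt: number; // epoch milliseconds
}

class DeliveryStore {
  private records = new Map<string, DeliveryRecord>();

  enqueue(id: string, channel: string, now: number): void {
    this.records.set(id, {
      id, channel, status: "pending", errorText: null,
      retryCount: 0, createdAt: now, updatedAt: now,
    });
  }

  // Success updates the record instead of deleting it.
  markDelivered(id: string, now: number): void {
    const r = this.mustGet(id);
    r.status = "delivered";
    r.errorText = null;
    r.updatedAt = now;
  }

  markFailed(id: string, errorText: string, now: number): void {
    const r = this.mustGet(id);
    r.status = "failed";
    r.errorText = errorText;
    r.retryCount += 1;
    r.updatedAt = now;
  }

  get(id: string): DeliveryRecord | undefined {
    return this.records.get(id);
  }

  // Answers "which channel has the most failures since time T?"
  failuresByChannel(since: number): Map<string, number> {
    const counts = new Map<string, number>();
    this.records.forEach((r) => {
      if (r.status === "failed" && r.updatedAt >= since) {
        counts.set(r.channel, (counts.get(r.channel) ?? 0) + 1);
      }
    });
    return counts;
  }

  // Removes finished records last updated before `cutoff`; pending ones are kept.
  pruneOlderThan(cutoff: number): number {
    let removed = 0;
    this.records.forEach((r, id) => {
      if (r.status !== "pending" && r.updatedAt < cutoff) {
        this.records.delete(id);
        removed += 1;
      }
    });
    return removed;
  }

  private mustGet(id: string): DeliveryRecord {
    const r = this.records.get(id);
    if (!r) throw new Error(`unknown delivery id: ${id}`);
    return r;
  }
}
```

Keeping pending records out of pruning is a design choice in this sketch, made so that age-based cleanup cannot drop a delivery that has not yet reached a final state.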