Per-prefix tracking for Ordered S3Queue #71161
Description
Use case
Using S3Queue in Ordered mode with prefixed objects is inconvenient. When prefixes reflect logical subsets of data (for instance, with hive-style partitioning), objects are commonly added to different prefixes independently. Ordered mode skips any object whose key sorts lexicographically before the last key it processed, so objects added later to a lexicographically smaller prefix can be missed.
Example: each row is an object added to S3 a few minutes after the one above it
my-bucket/city=amsterdam/2024-01-01.csv
my-bucket/city=amsterdam/2024-01-02.csv
my-bucket/city=amsterdam/2024-01-03.csv
my-bucket/city=berlin___/2024-01-03.csv
my-bucket/city=amsterdam/2024-01-04.csv # ignored!
my-bucket/city=berlin___/2024-01-04.csv
Describe the solution you'd like
Optionally allow the "last" object ingested to be tracked on a per-prefix basis
The prefix is the entire object key except the object name (the part after the last /)
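To make the request concrete, here is a rough sketch in Python that models the two behaviours on the example above. It is not how S3Queue is implemented; the function names (`prefix_of`, `ingest_global`, `ingest_per_prefix`) are made up for illustration, and the "global" variant is only a simplified model of Ordered mode keeping a single last-processed key.

```python
from typing import Dict, Iterable, List


def prefix_of(key: str) -> str:
    """Return the key minus the object name, i.e. everything up to the last '/'."""
    return key[: key.rfind("/") + 1]


def ingest_global(keys: Iterable[str]) -> List[str]:
    """Simplified model of today's behaviour: one last-processed key for the queue."""
    last = ""
    accepted = []
    for key in keys:
        if key > last:  # plain lexicographic comparison on the full key
            accepted.append(key)
            last = key
    return accepted


def ingest_per_prefix(keys: Iterable[str]) -> List[str]:
    """Proposed behaviour: one last-processed key per prefix."""
    last_by_prefix: Dict[str, str] = {}
    accepted = []
    for key in keys:
        prefix = prefix_of(key)
        if key > last_by_prefix.get(prefix, ""):
            accepted.append(key)
            last_by_prefix[prefix] = key
    return accepted


# Arrival order from the example above.
ARRIVALS = [
    "my-bucket/city=amsterdam/2024-01-01.csv",
    "my-bucket/city=amsterdam/2024-01-02.csv",
    "my-bucket/city=amsterdam/2024-01-03.csv",
    "my-bucket/city=berlin___/2024-01-03.csv",
    "my-bucket/city=amsterdam/2024-01-04.csv",
    "my-bucket/city=berlin___/2024-01-04.csv",
]

skipped = [k for k in ARRIVALS if k not in ingest_global(ARRIVALS)]
print("skipped with global tracking:", skipped)
print("accepted with per-prefix tracking:", len(ingest_per_prefix(ARRIVALS)))
```

With a single tracked key, the amsterdam object for 2024-01-04 is dropped because it sorts before the berlin object that was already processed. With one tracked key per prefix, all six objects are accepted.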
Describe alternatives you've considered
Create an S3Queue for each prefix. This is not possible when the prefixes are not known in advance, and it is inadvisable when there is a large number (many thousands) of prefixes.
Additional context
The alternative I use in practice is to stick a creation timestamp at the beginning of the prefix. It feels like a weird hack.
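A small sketch of that workaround, under assumptions of my own: the timestamp is a fixed-width ISO 8601 UTC string, and it is placed as the first path segment of the key. The timestamps and the `timestamped_key` helper are hypothetical, not taken from a real setup.

```python
def timestamped_key(created_at: str, key: str) -> str:
    """Put the creation timestamp in front so key order follows arrival order."""
    return f"{created_at}/{key}"


# (creation time, original key) pairs in arrival order; values are made up.
arrivals = [
    ("2024-01-03T10:00:00Z", "city=berlin___/2024-01-03.csv"),
    ("2024-01-04T09:00:00Z", "city=amsterdam/2024-01-04.csv"),
    ("2024-01-04T09:05:00Z", "city=berlin___/2024-01-04.csv"),
]

keys = [timestamped_key(ts, key) for ts, key in arrivals]

# Fixed-width ISO 8601 timestamps sort lexicographically in time order,
# so the amsterdam object no longer sorts before the earlier berlin one.
print(keys == sorted(keys))
```

This keeps Ordered mode happy, but the partition columns are no longer the leading part of the key, which is part of why it feels like a hack.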