Per-prefix tracking for Ordered S3Queue #71161

@ekpdt

Use case

Using S3Queue in Ordered mode with prefixed objects is inconvenient. When prefixes reflect logical subsets of data (for instance, with Hive-style partitioning), objects are commonly added to different prefixes independently. In Ordered mode, this can cause objects in lexicographically smaller prefixes to be missed: once an object under a later-sorting prefix has been ingested, a new object whose full key sorts before it is skipped.

Example: each row is an object added to S3, a few minutes after the previous one.

my-bucket/city=amsterdam/2024-01-01.csv
my-bucket/city=amsterdam/2024-01-02.csv
my-bucket/city=amsterdam/2024-01-03.csv
my-bucket/city=berlin___/2024-01-03.csv
my-bucket/city=amsterdam/2024-01-04.csv # ignored!
my-bucket/city=berlin___/2024-01-04.csv
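
To make the failure concrete, here is a minimal Python sketch that replays the arrivals above against a single global "last key" marker. It is a simplified model of the behavior described in this issue, not ClickHouse's actual implementation; the function name `ingest_ordered` is made up for illustration.

```python
def ingest_ordered(keys):
    """Simplified model: one global high-water mark across all prefixes.

    A key is processed only if it sorts strictly after the largest key
    processed so far; anything else is skipped.
    """
    last = None
    processed, skipped = [], []
    for key in keys:
        if last is None or key > last:
            processed.append(key)
            last = key
        else:
            skipped.append(key)
    return processed, skipped


# Objects in arrival order, copied from the example above.
arrivals = [
    "my-bucket/city=amsterdam/2024-01-01.csv",
    "my-bucket/city=amsterdam/2024-01-02.csv",
    "my-bucket/city=amsterdam/2024-01-03.csv",
    "my-bucket/city=berlin___/2024-01-03.csv",
    "my-bucket/city=amsterdam/2024-01-04.csv",
    "my-bucket/city=berlin___/2024-01-04.csv",
]

processed, skipped = ingest_ordered(arrivals)
print(skipped)  # ['my-bucket/city=amsterdam/2024-01-04.csv']
```

The amsterdam object for 2024-01-04 is dropped because its full key sorts before the berlin key that was already ingested.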

Describe the solution you'd like

Optionally allow the "last" ingested object to be tracked on a per-prefix basis.
The prefix is the entire object key except the object name, where the object name is the part after the last /.
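
A rough Python sketch of the proposed behavior, assuming the prefix is everything before the last `/` as defined above. The names `split_key` and `ingest_per_prefix` are hypothetical and only illustrate the idea; they do not correspond to any existing S3Queue setting or API.

```python
def split_key(key):
    """Split an object key into (prefix, object name) at the last '/'."""
    prefix, _, name = key.rpartition("/")
    return prefix, name


def ingest_per_prefix(keys):
    """Hypothetical per-prefix tracking: one "last object name" per prefix."""
    last_by_prefix = {}
    processed, skipped = [], []
    for key in keys:
        prefix, name = split_key(key)
        last = last_by_prefix.get(prefix)
        # Compare only against objects already seen under the same prefix.
        if last is None or name > last:
            processed.append(key)
            last_by_prefix[prefix] = name
        else:
            skipped.append(key)
    return processed, skipped


# Same arrival order as in the use case.
arrivals = [
    "my-bucket/city=amsterdam/2024-01-01.csv",
    "my-bucket/city=amsterdam/2024-01-02.csv",
    "my-bucket/city=amsterdam/2024-01-03.csv",
    "my-bucket/city=berlin___/2024-01-03.csv",
    "my-bucket/city=amsterdam/2024-01-04.csv",
    "my-bucket/city=berlin___/2024-01-04.csv",
]

processed, skipped = ingest_per_prefix(arrivals)
print(skipped)  # []
```

With one marker per prefix, all six objects are ingested. The trade-off is that the tracked state grows with the number of distinct prefixes.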

Describe alternatives you've considered

Create an S3Queue for each prefix. This is not possible when the prefixes are not known in advance, and it is inadvisable when there is a large number (many thousands) of prefixes.
 
Additional context

The alternative I use in practice is to stick a creation timestamp at the beginning of the prefix. It feels like a weird hack.
