[C++] Performance issue listing files over S3 #34213

@westonpace

Describe the enhancement requested

The current GetFileInfo implementation (ignoring paging/continuation) in S3 is roughly:

def list_dir(path, results=[]):
  # one list request per directory: "/" makes S3 group deeper keys into common prefixes
  rsp = s3_list_objects(prefix=path, delimiter="/")
  parallel for common_prefix in rsp:
    list_dir(common_prefix, results)
  for file in rsp:
    results.append(file)

The prefix and delimiter are S3 constructs, described in more detail in the S3 docs.

Since we use "/" as the delimiter, this yields an experience that is very similar to "directory walking": we issue one HTTP request for every single directory.

Alternatively, one could simply do:

def list_dir(path, results=[]):
  rsp = s3_list_objects(prefix=path)
  for file in rsp:
    results.append(file)

This would guarantee one HTTP request (ignoring paging) regardless of how many directories there are.
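To make the difference in request counts concrete, here is a minimal sketch that runs both strategies against an in-memory stand-in for a bucket. `FakeBucket`, `list_dir_walk`, and `list_dir_flat` are hypothetical names made up for this illustration, not Arrow or AWS SDK APIs. Paging is ignored, as in the pseudocode above, and the walk is sequential rather than parallel.

```python
class FakeBucket:
    """In-memory stand-in for an S3 bucket that counts list requests."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.requests = 0

    def list_objects(self, prefix="", delimiter=None):
        """Return (files, common_prefixes), mimicking one un-paged list call."""
        self.requests += 1
        files, common_prefixes = [], set()
        for key in self.keys:
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if delimiter and delimiter in rest:
                # Roll everything below the next delimiter into one common prefix.
                common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
            else:
                files.append(key)
        return files, sorted(common_prefixes)


def list_dir_walk(bucket, path, results):
    """Delimiter-based listing: one request per directory."""
    files, common_prefixes = bucket.list_objects(prefix=path, delimiter="/")
    for common_prefix in common_prefixes:  # sequential here, parallel in the issue's pseudocode
        list_dir_walk(bucket, common_prefix, results)
    results.extend(files)
    return results


def list_dir_flat(bucket, path):
    """Prefix-only listing: one request for the whole subtree."""
    files, _ = bucket.list_objects(prefix=path)
    return files


# A small partitioned dataset: 2 years x 12 months, one file per partition.
keys = [
    f"data/year={y}/month={m}/part-0.parquet"
    for y in (2021, 2022)
    for m in range(1, 13)
]
walk_bucket = FakeBucket(keys)
flat_bucket = FakeBucket(keys)
walked = list_dir_walk(walk_bucket, "data/", [])
flat = list_dir_flat(flat_bucket, "data/")
print(walk_bucket.requests, flat_bucket.requests)
```

Both calls return the same 24 keys, but the walk needs one request for `data/`, two for the year directories, and 24 for the month directories, while the flat listing needs a single request.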

On the face of it, it would seem like this delimiter feature is always a bad idea (why would you want more HTTP requests?). However, from reading some documentation, it seems that the point of the delimiter feature is to allow concurrent list objects calls. That only pays off when there are very many files (the examples seem to consider millions) and they are more or less evenly distributed across prefixes (e.g. hundreds of thousands of files per prefix). Furthermore, this appears to be geared towards the "container in the same datacenter as the S3 bucket" situation, where the per-request latency is very small.
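The trade-off can be sketched with a deliberately crude cost model. It assumes that flat listing pages are fetched one after another (each page needs the previous page's continuation token), that a list response holds at most 1000 keys, that each directory level must complete before the next one starts, and that all files sit in the leaf directories. The function names, the latency, and the parallelism figures are made-up inputs for illustration, not measurements.

```python
import math

PAGE_SIZE = 1000  # S3 list responses return at most 1000 keys per page


def flat_listing_seconds(n_files, latency_s):
    """Prefix-only listing: pages are sequential, one round trip each."""
    return max(1, math.ceil(n_files / PAGE_SIZE)) * latency_s


def walk_listing_seconds(dirs_per_level, files_per_leaf, latency_s, max_parallel):
    """Delimiter-based walk: levels run in sequence, directories in parallel."""
    rounds = 0
    for depth, n_dirs in enumerate(dirs_per_level):
        is_leaf = depth == len(dirs_per_level) - 1
        # Only leaf directories hold files in this model, so only they page.
        pages = max(1, math.ceil(files_per_leaf / PAGE_SIZE)) if is_leaf else 1
        rounds += math.ceil(n_dirs / max_parallel) * pages
    return rounds * latency_s


# Hypothetical inputs: 100 ms per request, 8 concurrent requests.
latency_s, max_parallel = 0.1, 8

# Small dataset: 24 partitions with one file each.
small_flat = flat_listing_seconds(24, latency_s)
small_walk = walk_listing_seconds([1, 2, 24], 1, latency_s, max_parallel)

# Huge dataset: 100 leaf prefixes with 500,000 files each.
large_flat = flat_listing_seconds(100 * 500_000, latency_s)
large_walk = walk_listing_seconds([1, 10, 100], 500_000, latency_s, max_parallel)

print(small_flat, small_walk, large_flat, large_walk)
```

Under these assumptions the walk is several times slower for the small dataset and several times faster for the huge one, which matches the intuition that delimiters only help when each prefix holds many pages of keys.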

Some users are either using non-S3 technologies (e.g. minio) or they are downloading data from outside EC2 or they simply don't have very many parquet files per partition folder.

This leads to very slow (20-25x slower in #34145) performance when discovering datasets in S3.

I believe we should make the delimiter a property of the S3 filesystem, and the default should be "no delimiter". This ought to speed up what is probably the normal case while still making it possible to optimize for the case where a user has structured their dataset to benefit from delimiters.

Component(s)

C++
