
filesystem_cache with s3Cluster #72816

@jurajmasar

Description

Remote file caching for S3, deployed in 24.10 by @kssenii, works great with the s3() function, but unfortunately it seems to be unusable with multiple hosts in a cluster via s3Cluster().

s3Cluster() appears to distribute each file to a randomly chosen host in the cluster, so if a single data source is queried repeatedly, the cache ends up fully duplicated on every node.

Could we please adjust s3Cluster() so that the distribution of files among hosts is stable, i.e. it keeps asking the same host in the cluster for the same S3 file?

Sample query to reproduce this issue

clickhouse_config.xml

<clickhouse>
  <filesystem_caches>
    <s3_cache>
      <path>/var/lib/clickhouse/s3_cache</path>
      <max_size>10Gi</max_size>
    </s3_cache>
  </filesystem_caches>
</clickhouse>
SELECT *
FROM s3Cluster(primary, 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames')
FORMAT NULL
SETTINGS filesystem_cache_name = 's3_cache', enable_filesystem_cache = 1

Current behavior

After repeated runs, the cache directory on each host ends up containing the entire dataset.

Proposed behavior

s3Cluster() would distribute files among hosts the same way across multiple queries, so that the dataset ends up partitioned across the hosts' caches rather than duplicated on each one, even after multiple query runs.
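A minimal sketch of one way such a stable assignment could work is rendezvous (highest-random-weight) hashing: every node independently hashes (host, file key) pairs and the highest score wins, so the same S3 object is always routed to the same host, and removing a host only remaps that host's files. The host names and file keys below are hypothetical, and this is not how s3Cluster() is actually implemented today.

```python
import hashlib

def pick_host(file_key: str, hosts: list[str]) -> str:
    """Rendezvous hashing: deterministically pick one host per file key.

    All nodes compute the same winner, so cache entries are never
    duplicated across the cluster."""
    def score(host: str) -> int:
        digest = hashlib.sha256(f"{host}|{file_key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(hosts, key=score)

hosts = ["ch-node-1", "ch-node-2", "ch-node-3"]  # hypothetical cluster
files = [f"nyc-taxi/trips_{i}.gz" for i in range(6)]

# The mapping is deterministic: a repeated "query" yields the same assignment.
first = {f: pick_host(f, hosts) for f in files}
second = {f: pick_host(f, hosts) for f in files}
assert first == second

# Removing one host only remaps the files that were assigned to it;
# every other file stays on its original host (minimal cache churn).
remaining = [h for h in hosts if h != "ch-node-2"]
for f, h in first.items():
    if h != "ch-node-2":
        assert pick_host(f, remaining) == h
```

With an assignment like this, repeated runs of the sample query would warm each host's cache with only its own slice of the glob, instead of every host eventually caching all of trips_*.gz.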

Thank you! 🙏
