Description
Remote file caching for S3, deployed in 24.10 by @kssenii, works great with the s3() function, but unfortunately it seems to be unusable across multiple hosts in a cluster with s3Cluster().
s3Cluster() appears to distribute each file to a randomly chosen host in the cluster, so if a single data source is queried repeatedly, the cache ends up being fully duplicated on every node.
Could we please adjust s3Cluster() so that the distribution of files among hosts is stable, i.e. it keeps asking the same host in the cluster for the same S3 file?
Sample query to replicate this issue
clickhouse_config.xml
<clickhouse>
    <filesystem_caches>
        <s3_cache>
            <path>/var/lib/clickhouse/s3_cache</path>
            <max_size>10Gi</max_size>
        </s3_cache>
    </filesystem_caches>
</clickhouse>
SELECT *
FROM s3Cluster(primary, 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames')
FORMAT NULL
SETTINGS filesystem_cache_name = 's3_cache', enable_filesystem_cache = 1
Current behavior
After repeated runs, the cache directory on each host ends up with the entire data set.
Proposed behavior
s3Cluster() would distribute files among hosts the same way across multiple queries, so that the cached dataset stays partitioned across all hosts rather than being re-fetched and duplicated on every run.
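For illustration, the stable assignment described above could be achieved with rendezvous (highest-random-weight) hashing over (file, host) pairs: the same file always maps to the same host, and removing a host only remaps that host's files. This is just a minimal sketch of the idea, not ClickHouse's actual task-distribution code; the host names and file keys are hypothetical.

```python
import hashlib

def pick_host(key: str, hosts: list[str]) -> str:
    """Rendezvous hashing: score every (host, key) pair deterministically
    and pick the host with the highest score. The same key always wins on
    the same host, so repeated queries reuse the same node's cache."""
    def score(host: str) -> int:
        digest = hashlib.sha256(f"{host}\0{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(hosts, key=score)

# Hypothetical cluster hosts and S3 object keys.
hosts = ["node1", "node2", "node3"]
files = [f"trips_{i}.gz" for i in range(10)]
assignment = {f: pick_host(f, hosts) for f in files}

# The mapping is stable: recomputing it yields the identical assignment.
assert assignment == {f: pick_host(f, hosts) for f in files}
```

A nice property of this scheme over plain modulo hashing is that if one host leaves the cluster, only the files that were assigned to it move; every other file keeps hitting its warm cache.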
Thank you! 🙏