The following was run on Ubuntu 20 on an e2-highcpu-32 GCP VM with 32 GB of RAM and 32 vCPUs. I used ClickHouse 22.13.1.1361 (official build).
I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet. The dataset has ~11M records.
$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
| jq -c '.properties * {geom: .geometry|tostring}' \
> California.jsonl
$ head -n1 California.jsonl | jq .
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}
The following completes in around 19 seconds.
$ clickhouse local \
--input-format JSONEachRow \
-q "SELECT *
FROM table
FORMAT Parquet" \
< California.jsonl \
> ch.snappy.pq
I can see only 4 cores being used on my 32-core VM. htop reports that one of ClickHouse's processes is using 190% CPU, but that's it.
Is there any way to utilise the other 28 cores on this system and speed up the above task?
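One workaround that might be worth measuring, as a rough sketch rather than a confirmed fix: split the JSONL file on line boundaries and run one clickhouse local process per chunk. This does not make a single process use more cores, and it produces several Parquet files instead of one. The function name convert_chunks, the CONVERT override variable and the chunk_ file prefix are hypothetical names chosen for illustration; split -n l/N and xargs -P assume GNU coreutils and findutils.

```shell
# convert_chunks INPUT JOBS
# Hypothetical helper (not part of ClickHouse): splits INPUT into JOBS
# line-aligned chunks and converts each chunk in its own process.
convert_chunks() {
    local input="$1" jobs="$2"

    # Default converter is the command from the question; CONVERT is an
    # assumed override hook so another command can be substituted.
    local default_cmd='clickhouse local --input-format JSONEachRow -q "SELECT * FROM table FORMAT Parquet"'
    local convert="${CONVERT:-$default_cmd}"

    # l/N splits into N pieces without breaking a line in the middle.
    split -n "l/${jobs}" -d --additional-suffix=.jsonl "$input" chunk_

    # Run up to $jobs converters at once, one per chunk:
    # chunk_00.jsonl -> chunk_00.pq, chunk_01.jsonl -> chunk_01.pq, ...
    printf '%s\0' chunk_*.jsonl |
        CONVERT="$convert" xargs -0 -P "$jobs" -n 1 \
            sh -c 'eval "$CONVERT" < "$1" > "${1%.jsonl}.pq"' _
}

# Example call (commented out so that sourcing this file runs nothing):
# convert_chunks California.jsonl 32
```

Whether this is actually faster than the 19-second single-process run is not something the sketch establishes; it would need to be timed on the same VM, including the cost of the split step.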