Take advantage of more CPU cores when converting JSONL to Parquet? #45014

@marklit


The following was run on Ubuntu 20 on an e2-highcpu-32 GCP VM with 32 GB of RAM and 32 vCPUs. I used ClickHouse 22.13.1.1361 (official build).

I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet. The dataset has ~11M records.

$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}

The following completes in around 19 seconds.

$ clickhouse local \
          --input-format JSONEachRow \
          -q "SELECT *
              FROM table
              FORMAT Parquet" \
    < California.jsonl \
    > ch.snappy.pq

I can see only 4 cores being used on my 32-core VM. htop shows one of ClickHouse's processes at 190% CPU, but that's it.

Is there any way to utilise the other 28 cores on this system and speed up the above task?
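
One possible workaround, independent of any ClickHouse setting, is to split the JSONL file on line boundaries and run several `clickhouse local` processes side by side. The sketch below is untested against this dataset and makes several assumptions: GNU coreutils (`split -n l/N`), an `xargs` that supports `-P`, and that one Parquet file per chunk is acceptable instead of a single file. The function names `split_jsonl` and `convert_chunks` and the `chunk_` prefix are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: split a JSONL file on line boundaries, then convert the chunks
# with several independent `clickhouse local` processes.
# Assumes GNU coreutils (split -n l/N) and xargs with -P support.

# split_jsonl INPUT CHUNKS OUTDIR
# Writes OUTDIR/chunk_00, chunk_01, ... without breaking any line in two.
split_jsonl() {
    local input=$1 chunks=$2 outdir=$3
    mkdir -p "$outdir"
    split -n "l/$chunks" -d "$input" "$outdir/chunk_"
}

# convert_chunks OUTDIR JOBS
# Runs the same conversion as above on every chunk, JOBS processes at a time.
# Each chunk_NN produces a chunk_NN.pq next to it.
convert_chunks() {
    local outdir=$1 jobs=$2
    find "$outdir" -name 'chunk_*' ! -name '*.pq' -print0 \
        | xargs -0 -P "$jobs" -I{} sh -c '
            clickhouse local \
                --input-format JSONEachRow \
                -q "SELECT * FROM table FORMAT Parquet" \
                < "$1" > "$1.pq"' _ {}
}
```

Usage would look like `split_jsonl California.jsonl 32 chunks` followed by `convert_chunks chunks 32`. Note that each process infers its schema from its own chunk, so the resulting files are only guaranteed to share a schema if every chunk looks alike, and the extra pass over the file for splitting eats into any gain.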

Labels: performance, st-fixed (The problem is fixed, but pending a test.)