Take advantage of more CPU cores when converting JSONL to Parquet? #45014

The following was run on Ubuntu 20 on an `e2-highcpu-32` GCP VM with 32 GB of RAM and 32 vCPUs. I used ClickHouse 22.13.1.1361 (official build).

I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet. The dataset has ~11M records.

```bash
$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
```

```json
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}
```

The following completes in around 19 seconds.

```bash
$ clickhouse local \
          --input-format JSONEachRow \
          -q "SELECT *
              FROM table
              FORMAT Parquet" \
    < California.jsonl \
    > ch.snappy.pq
```

I can see 4 cores being used on my 32-core VM. htop reports that one of ClickHouse's processes is using 190% of one CPU, but that's it.

Is there any way to utilise the other 28 cores on this system and speed up the above task?
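One workaround I can think of, sketched below, is to split the JSONL file on line boundaries and run one `clickhouse local` process per chunk. This is only a sketch under assumptions: it relies on GNU `split` and `xargs`, the function names `convert_chunk` and `parallel_convert` and the `chunk_` prefix are made up for illustration, the output directory should start out empty, and I have not measured whether it is actually faster on this dataset. It also produces one Parquet file per chunk rather than a single file.

```bash
#!/usr/bin/env bash
# Sketch: split a JSONL file into N line-aligned chunks and convert them in parallel.
# convert_chunk and parallel_convert are hypothetical helper names.

convert_chunk() {
    # Convert one JSONL chunk to Parquet using the same query as the single-process run.
    clickhouse local \
        --input-format JSONEachRow \
        -q "SELECT * FROM table FORMAT Parquet" \
        < "$1" \
        > "$1.pq"
}

parallel_convert() {
    local input=$1 jobs=$2 outdir=$3
    mkdir -p "$outdir"
    # "l/N" asks GNU split for N pieces without breaking any line in two.
    split -n "l/$jobs" "$input" "$outdir/chunk_"
    # Make the function visible to the bash processes started by xargs.
    export -f convert_chunk
    # Run up to $jobs conversions at once, one chunk per process.
    printf '%s\0' "$outdir"/chunk_* \
        | xargs -0 -P "$jobs" -n 1 bash -c 'convert_chunk "$1"' _
}
```

On the VM above this would be called as `parallel_convert California.jsonl 32 parts`, leaving files such as `parts/chunk_aa.pq`. It does not make a single ClickHouse process use more cores, so the original question still stands.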
