Speed up JSON reading? (Formerly: Parquet writing) #2091

@marklit

Version of Awkward Array

2.0.5

Description and code to reproduce

The following was run on Ubuntu 20 on an e2-highcpu-32 GCP VM with 32 GB of RAM and 32 vCPUs.

I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints, converted it from JSONL into Parquet with pyarrow, and attempted to do the same with fastparquet.

$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}
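
For readers who don't use jq, the filter above can be sketched in Python. This is only an illustration of the transformation, not part of the benchmark. The `flatten_feature` helper is a hypothetical name, and the sketch assumes each input line is a GeoJSON feature with `properties` and `geometry` keys and that no property is already called `geom` (jq's `*` merges recursively, which only differs from a plain dict merge when keys collide).

```python
import json


def flatten_feature(feature):
    """Merge a feature's properties with its geometry stored as a JSON string.

    Hypothetical helper mirroring: .properties * {geom: .geometry|tostring}
    """
    # jq's tostring emits compact JSON, so drop the spaces after separators.
    geom = json.dumps(feature["geometry"], separators=(",", ":"))
    return {**feature["properties"], "geom": geom}


# First building of the dataset, rebuilt from the sample row shown above.
feature = {
    "type": "Feature",
    "properties": {"release": 1, "capture_dates_range": ""},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [-114.127454, 34.265674],
            [-114.127476, 34.265839],
            [-114.127588, 34.265829],
            [-114.127565, 34.265663],
            [-114.127454, 34.265674],
        ]],
    },
}

row = flatten_feature(feature)
print(json.dumps(row))
```

The `geom` column is therefore a string that itself contains JSON, which is why the Awkward command below parses the data twice.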

Awkward is able to produce a 947 MB Parquet file in 64.60 seconds.

/usr/bin/time -v \
    python3 -c "import awkward as ak; f = open('California.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); narr = arr.geom.layout.content.to_numpy(); arr2 = ak.from_json(narr.tobytes(), line_delimited=True); ak.to_parquet(arr2, 'awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"

With ClickHouse I'm able to complete the same task in 18.26 seconds. Its resulting file size is 794 MB.

$ /usr/bin/time -v \
    clickhouse local \
          --input-format JSONEachRow \
          -q "SELECT *
              FROM table
              FORMAT Parquet" \
    < California.jsonl \
    > ch.snappy.pq

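The resulting Awkward Parquet file closely matches ClickHouse's: both have 306 row groups and both use snappy compression. The first metadata block below is from the Awkward file (written through pyarrow), the second from the ClickHouse file.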

<pyarrow._parquet.FileMetaData object at 0x7fb89c696d10>
  created_by: parquet-cpp-arrow version 10.0.1
  num_columns: 2
  num_rows: 11542912
  num_row_groups: 306
  format_version: 2.6
  serialized_size: 73744
<pyarrow._parquet.FileMetaData object at 0x7f0926d54860>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 306
  format_version: 1.0
  serialized_size: 228389
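
As a quick sanity check on the figures above, the arithmetic can be worked through in a few lines. All inputs are the numbers reported in this issue. The sketch assumes that `row_group_size` is the maximum number of rows per row group and that the wall-clock times cover the whole command (JSON reading plus Parquet writing).

```python
import math

# Figures reported in this issue.
rows = 11_542_912
awkward_seconds, clickhouse_seconds = 64.60, 18.26
awkward_mb, clickhouse_mb = 947, 794
row_group_size = 37_738

# How much slower and larger the Awkward run was.
speed_ratio = awkward_seconds / clickhouse_seconds
size_ratio = awkward_mb / clickhouse_mb

# Row groups expected if each group holds at most row_group_size rows.
expected_row_groups = math.ceil(rows / row_group_size)

print(f"time ratio: {speed_ratio:.2f}x")
print(f"size ratio: {size_ratio:.2f}x")
print(f"rows/s: Awkward {rows / awkward_seconds:,.0f}, ClickHouse {rows / clickhouse_seconds:,.0f}")
print(f"expected row groups: {expected_row_groups}")
```

Under these assumptions the Awkward run takes roughly 3.5 times as long and produces a file about 19% larger, and the chosen `row_group_size` accounts for the 306 row groups seen in both files.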

Below is a flame graph from Awkward's execution.

[Image: flame graph, "parquet awkward snappy"]

I ran a 10-line version of the above file through both Awkward and ClickHouse. This is what strace and perf reported.

$ sudo su
$ source .pq/bin/activate
$ strace -wc \
    python3 -c "import awkward as ak; f = open('cali10.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); narr = arr.geom.layout.content.to_numpy(); arr2 = ak.from_json(narr.tobytes(), line_delimited=True); ak.to_parquet(arr2, 'awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 61.01    0.158111         261       604        50 openat
 10.15    0.026304          14      1760       146 stat
  6.44    0.016697          17       937           read
  4.57    0.011833          11       992           fstat
  3.98    0.010303          11       876         5 lseek
  2.84    0.007351          13       563           close
  2.61    0.006760          16       411           mmap
  2.17    0.005612          12       455       438 ioctl
  1.26    0.003253          27       117           munmap
  1.15    0.002983          90        33           clone
  0.81    0.002101          19       110           mprotect
  0.80    0.002083          22        92           getdents64
  0.60    0.001552          19        80           futex
  0.56    0.001448          14       102           getcwd
  0.35    0.000901          16        56           brk
  0.26    0.000671           9        68           rt_sigaction
  0.09    0.000224         224         1           execve
  0.08    0.000213          11        18           pread64
  0.06    0.000159          15        10           write
  0.03    0.000069          13         5         2 readlink
  0.03    0.000067          11         6           getpid
  0.02    0.000051          12         4           getrandom
  0.02    0.000044          14         3           uname
  0.02    0.000042          21         2           open
  0.01    0.000037          18         2           pipe2
  0.01    0.000034          16         2           madvise
  0.01    0.000032          10         3           sigaltstack
  0.01    0.000031          10         3           rt_sigprocmask
  0.01    0.000030          10         3           dup
  0.01    0.000028          28         1           wait4
  0.01    0.000023          11         2           sched_getaffinity
  0.01    0.000022          11         2         1 arch_prctl
  0.01    0.000014          14         1           sysinfo
  0.01    0.000014          13         1         1 access
  0.00    0.000011          11         1           fcntl
  0.00    0.000011          11         1           prlimit64
  0.00    0.000011          11         1           gettid
  0.00    0.000009           9         1           set_tid_address
  0.00    0.000009           8         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.259150                  7330       643 total
$ perf stat -dd \
    python3 -c "import awkward as ak; f = open('cali10.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); narr = arr.geom.layout.content.to_numpy(); arr2 = ak.from_json(narr.tobytes(), line_delimited=True); ak.to_parquet(arr2, 'awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"
  4,150.43 msec task-clock                #   11.326 CPUs utilized
       105      context-switches          #   25.299 /sec
         2      cpu-migrations            #    0.482 /sec
    12,034      page-faults               #    2.899 K/sec                  

ClickHouse's syscall counts were much lower for most calls (gettid and poll being exceptions):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 29.52    0.019018        1584        12           futex
 21.15    0.013625          63       214           gettid
 11.19    0.007209         514        14           mprotect
 11.06    0.007123         791         9         4 stat
  8.72    0.005617         108        52           close
  5.16    0.003327        1109         3           poll
  2.19    0.001412          23        60           mmap
  2.09    0.001344          39        34         1 openat
  1.27    0.000816          18        44           read
...
  0.15    0.000098          48         2           write

As were context switch and page fault counts.

  44      context-switches          #  372.955 /sec
4997      page-faults               #   42.356 K/sec
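
To compare the two `strace -c` summaries call by call rather than by eye, a small parser along these lines could be used. It is a sketch under the assumption that the summary keeps the column layout shown above (% time, seconds, usecs/call, calls, optional errors, syscall). `parse_strace_summary` and `compare_calls` are hypothetical helper names, and the sample text holds only a few rows copied from the tables in this issue.

```python
def parse_strace_summary(text):
    """Return {syscall: calls} parsed from `strace -c` summary text."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields, or 6 when the errors column is filled in.
        if len(fields) not in (5, 6):
            continue
        try:
            float(fields[0])  # % time
            float(fields[1])  # seconds
            calls = int(fields[3])
        except ValueError:
            continue  # header or separator line
        name = fields[-1]
        if name != "total":
            counts[name] = calls
    return counts


def compare_calls(a, b):
    """Return {syscall: (calls_a, calls_b)} for syscalls present in both runs."""
    return {name: (a[name], b[name]) for name in a if name in b}


awkward_text = """
 61.01    0.158111         261       604        50 openat
 10.15    0.026304          14      1760       146 stat
  6.44    0.016697          17       937           read
  0.00    0.000011          11         1           gettid
100.00    0.259150                  7330       643 total
"""
clickhouse_text = """
 21.15    0.013625          63       214           gettid
 11.06    0.007123         791         9         4 stat
  2.09    0.001344          39        34         1 openat
  1.27    0.000816          18        44           read
"""

awkward = parse_strace_summary(awkward_text)
clickhouse = parse_strace_summary(clickhouse_text)
for name, (a, c) in sorted(compare_calls(awkward, clickhouse).items()):
    print(f"{name:10s} awkward={a:5d} clickhouse={c:5d}")
```

On this excerpt the Python process makes many more `openat`, `stat` and `read` calls, while `gettid` is the one call that ClickHouse issues more often. On a 10-line input most of those calls likely reflect interpreter startup and module imports rather than the conversion itself.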

These are the versions of software involved:

  • awkward-2.0.5-py3-none-any.whl (541 kB)
  • awkward_cpp-6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • ClickHouse 22.13.1.1361 (official build)

Labels: performance (Works, but not fast enough or uses too much memory)
