Version of Awkward Array
2.0.5
Description and code to reproduce
The following was run on Ubuntu 20 on an e2-highcpu-32 GCP VM with 32 GB of RAM and 32 vCPUs.
I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with pyarrow. I also attempted to do the same with fastparquet.
$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
| jq -c '.properties * {geom: .geometry|tostring}' \
> California.jsonl
$ head -n1 California.jsonl | jq .
{
"release": 1,
"capture_dates_range": "",
"geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}
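For readers who don't use jq, the filter above can be sketched in Python. This is only an illustration of what `.properties * {geom: .geometry|tostring}` does to one feature; the `flatten_feature` helper and the shortened sample polygon are hypothetical and not part of the original pipeline. jq's `*` is a recursive merge, which for these flat properties behaves like a plain dictionary update.

```python
import json


def flatten_feature(feature: dict) -> dict:
    """Mimic the jq filter `.properties * {geom: .geometry|tostring}`."""
    # Start from a copy of the feature's properties.
    row = dict(feature["properties"])
    # jq's tostring emits compact JSON, so use compact separators here too.
    row["geom"] = json.dumps(feature["geometry"], separators=(",", ":"))
    return row


# Hypothetical, shortened GeoJSON feature in the shape ogr2ogr emits.
feature = {
    "type": "Feature",
    "properties": {"release": 1, "capture_dates_range": ""},
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [[-114.127454, 34.265674], [-114.127476, 34.265839], [-114.127454, 34.265674]]
        ],
    },
}

row = flatten_feature(feature)
print(json.dumps(row))
```

The geometry ends up as a JSON string nested inside each JSONL record, which is why the Awkward command below parses JSON twice.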
Awkward is able to produce a 947 MB Parquet file in 64.60 seconds.
$ /usr/bin/time -v \
python3 -c "import awkward as ak; f = open('California.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); narr = arr.geom.layout.content.to_numpy(); arr2 = ak.from_json(narr.tobytes(), line_delimited=True); ak.to_parquet(arr2, 'awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"
With ClickHouse I'm able to complete the same task in 18.26 seconds. Its resulting file size is 794 MB.
$ /usr/bin/time -v \
clickhouse local \
--input-format JSONEachRow \
-q "SELECT *
FROM table
FORMAT Parquet" \
< California.jsonl \
> ch.snappy.pq
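To put the two runs side by side, here is the arithmetic on the figures quoted above, as a small sketch. The variable names are mine; the numbers are the reported wall-clock times and file sizes.

```python
# Reported figures from the two runs above.
awkward_seconds, awkward_mb = 64.60, 947
clickhouse_seconds, clickhouse_mb = 18.26, 794

# How many times longer Awkward took, and how much larger its file is.
speed_ratio = awkward_seconds / clickhouse_seconds
size_ratio = awkward_mb / clickhouse_mb

print(f"Awkward took {speed_ratio:.2f}x as long as ClickHouse")
print(f"Awkward's file is {size_ratio:.2f}x the size of ClickHouse's")
```

Keep in mind that the two outputs do not hold identical columns (the metadata below shows 2 versus 3), so the size ratio is only a rough comparison.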
The resulting Awkward Parquet almost matches ClickHouse in terms of row groups and using snappy compression.
<pyarrow._parquet.FileMetaData object at 0x7fb89c696d10>
created_by: parquet-cpp-arrow version 10.0.1
num_columns: 2
num_rows: 11542912
num_row_groups: 306
format_version: 2.6
serialized_size: 73744
<pyarrow._parquet.FileMetaData object at 0x7f0926d54860>
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 3
num_rows: 11542912
num_row_groups: 306
format_version: 1.0
serialized_size: 228389
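The matching row group count follows from the `row_group_size=37738` argument passed to `ak.to_parquet`. A quick check, assuming the writer fills every row group to that size except the last:

```python
import math

num_rows = 11_542_912      # num_rows from the metadata above
row_group_size = 37_738    # value passed to ak.to_parquet

# Every group is full except possibly the last one (assumption).
num_row_groups = math.ceil(num_rows / row_group_size)
last_group_rows = num_rows - (num_row_groups - 1) * row_group_size

print(num_row_groups, last_group_rows)
```

This gives 306 row groups, the same count both files report.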
Below is a flame graph from Awkward's execution.

I ran a 10-line version of the above file through both Awkward and ClickHouse. This is what strace and perf reported.
$ sudo su
$ source .pq/bin/activate
$ strace -wc \
python3 -c "import awkward as ak; f = open('cali10.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); narr = arr.geom.layout.content.to_numpy(); arr2 = ak.from_json(narr.tobytes(), line_delimited=True); ak.to_parquet(arr2, 'awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
61.01 0.158111 261 604 50 openat
10.15 0.026304 14 1760 146 stat
6.44 0.016697 17 937 read
4.57 0.011833 11 992 fstat
3.98 0.010303 11 876 5 lseek
2.84 0.007351 13 563 close
2.61 0.006760 16 411 mmap
2.17 0.005612 12 455 438 ioctl
1.26 0.003253 27 117 munmap
1.15 0.002983 90 33 clone
0.81 0.002101 19 110 mprotect
0.80 0.002083 22 92 getdents64
0.60 0.001552 19 80 futex
0.56 0.001448 14 102 getcwd
0.35 0.000901 16 56 brk
0.26 0.000671 9 68 rt_sigaction
0.09 0.000224 224 1 execve
0.08 0.000213 11 18 pread64
0.06 0.000159 15 10 write
0.03 0.000069 13 5 2 readlink
0.03 0.000067 11 6 getpid
0.02 0.000051 12 4 getrandom
0.02 0.000044 14 3 uname
0.02 0.000042 21 2 open
0.01 0.000037 18 2 pipe2
0.01 0.000034 16 2 madvise
0.01 0.000032 10 3 sigaltstack
0.01 0.000031 10 3 rt_sigprocmask
0.01 0.000030 10 3 dup
0.01 0.000028 28 1 wait4
0.01 0.000023 11 2 sched_getaffinity
0.01 0.000022 11 2 1 arch_prctl
0.01 0.000014 14 1 sysinfo
0.01 0.000014 13 1 1 access
0.00 0.000011 11 1 fcntl
0.00 0.000011 11 1 prlimit64
0.00 0.000011 11 1 gettid
0.00 0.000009 9 1 set_tid_address
0.00 0.000009 8 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 0.259150 7330 643 total
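If it helps with triage, a summary like the one above can be parsed to see which calls fail most often. This is a rough sketch: `parse_strace_summary` is a hypothetical helper, and it assumes the column layout shown here, where the errors column is simply absent for calls that never failed.

```python
def parse_strace_summary(text):
    """Parse the data rows of an `strace -c` summary into dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields, or 6 when the errors column is filled in.
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            seconds = float(fields[1])
            calls = int(fields[3])
            errors = int(fields[4]) if len(fields) == 6 else 0
        except ValueError:
            continue  # header or separator line
        rows.append(
            {"syscall": fields[-1], "seconds": seconds, "calls": calls, "errors": errors}
        )
    return rows


# A few rows copied from the summary above.
sample = """\
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
61.01 0.158111 261 604 50 openat
10.15 0.026304 14 1760 146 stat
6.44 0.016697 17 937 read
2.17 0.005612 12 455 438 ioctl
------ ----------- ----------- --------- --------- ----------------
100.00 0.259150 7330 643 total
"""

rows = parse_strace_summary(sample)
for r in rows:
    rate = r["errors"] / r["calls"]
    print(f'{r["syscall"]:8s} calls={r["calls"]:5d} failed={rate:.1%}')
```

In this excerpt nearly all `ioctl` calls fail, while `openat` accounts for most of the time spent in system calls.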
$ perf stat -dd \
python3 -c "import awkward as ak; f = open('cali10.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); narr = arr.geom.layout.content.to_numpy(); arr2 = ak.from_json(narr.tobytes(), line_delimited=True); ak.to_parquet(arr2, 'awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"
4,150.43 msec task-clock # 11.326 CPUs utilized
105 context-switches # 25.299 /sec
2 cpu-migrations # 0.482 /sec
12,034 page-faults # 2.899 K/sec
ClickHouse's syscall counts were much lower for most calls (gettid being an exception):
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
29.52 0.019018 1584 12 futex
21.15 0.013625 63 214 gettid
11.19 0.007209 514 14 mprotect
11.06 0.007123 791 9 4 stat
8.72 0.005617 108 52 close
5.16 0.003327 1109 3 poll
2.19 0.001412 23 60 mmap
2.09 0.001344 39 34 1 openat
1.27 0.000816 18 44 read
...
0.15 0.000098 48 2 write
As were context switch and page fault counts.
44 context-switches # 372.955 /sec
4997 page-faults # 42.356 K/sec
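The ClickHouse task-clock line is not shown, but it can be estimated from the counts and rates, since perf reports each rate as the count divided by task-clock seconds. A small sketch using only the figures quoted above (the helper name is mine):

```python
def task_clock_seconds(count, rate_per_second):
    """Invert perf's rate column: rate = count / task-clock seconds."""
    return count / rate_per_second


# Awkward: check the method against the reported 4,150.43 msec task-clock.
awkward_clock = task_clock_seconds(105, 25.299)

# ClickHouse: two independent estimates of its CPU time.
clickhouse_from_cs = task_clock_seconds(44, 372.955)       # context switches
clickhouse_from_pf = task_clock_seconds(4997, 42.356e3)    # page faults

# Awkward's elapsed time implied by "11.326 CPUs utilized".
awkward_wall = 4.15043 / 11.326

print(f"Awkward task-clock:    {awkward_clock:.3f} s")
print(f"ClickHouse task-clock: {clickhouse_from_cs:.3f} s / {clickhouse_from_pf:.3f} s")
print(f"Awkward elapsed:       {awkward_wall:.3f} s")
```

Both ClickHouse estimates agree at roughly 0.118 s of CPU time, against about 4.15 s for the Awkward run on the same 10-line input.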
These are the versions of software involved:
- awkward-2.0.5-py3-none-any.whl (541 kB)
- awkward_cpp-6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- ClickHouse 22.13.1.1361 (official build)