parquet quasi-binary import #52
Description
In the parquet dataset tmdb-celeb-10k there is a field "image". When using anyquery (Windows) with the query
CREATE TABLE parq_import AS SELECT * FROM read_parquet('train-00000-of-00001-d95dffd623223e73.parquet')
the field ends up as JSON containing a "bytes" property that looks like a unicode-escaped string. For the first row it starts with
{"bytes":"\ufffd\ufffd\ufffd\ufffd\u0000\u0010JFIF ...
The JSON is correct and parsable, but evidently every byte in the upper half of the ANSI range (> 127) has been replaced with the Unicode replacement character (\ufffd), so the conversion is lossy.
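To illustrate why the output above cannot be reversed, here is a minimal sketch (not from the issue itself; the sample bytes are just the JPEG header quoted above). Decoding raw bytes as UTF-8 with replacement maps every invalid byte to U+FFFD, and that mapping destroys the original values:

```python
# Start of a JPEG file: FF D8 FF E0 00 10 "JFIF"
raw = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# This is effectively what a text/JSON conversion with replacement does:
decoded = raw.decode("utf-8", errors="replace")
print(decoded)  # the four non-UTF-8 bytes all become U+FFFD

# All distinct invalid bytes collapse to the same replacement character,
# so re-encoding cannot recover the original data:
assert decoded[:4] == "\ufffd" * 4
assert decoded.encode("utf-8") != raw
```

This matches the `\ufffd\ufffd\ufffd\ufffd\u0000\u0010JFIF` prefix seen in the exported JSON: the first four bytes of the JPEG header are not valid UTF-8, so each was replaced.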
My final destination is a SQLite file, so previously I tried DuckDB's CSV export with something like this (Python):
import duckdb
duckdb.sql("""
COPY (SELECT * FROM 'train-00000-of-00001-d95dffd623223e73.parquet')
TO 'intermediate.csv' (HEADER, FORMAT 'csv')
""")
There the field was rendered as non-standard JSON; for the same row the start looks like
{'bytes': \xFF\xD8\xFF\xE0\x00\x10JFIF\
That conversion is also broken with respect to escaping, but at least it preserves the byte values.
Is it possible to somehow keep this "binary" field intact, without lossy conversion?
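For reference, one possible route to the SQLite destination (a sketch, not a confirmed fix; the table and column names are hypothetical, and the rows would in practice come from reading the parquet file with e.g. pyarrow or DuckDB's Python client): SQLite BLOB columns store bytes verbatim, so inserting the raw bytes directly via a parameterized query avoids any text-encoding step entirely:

```python
import sqlite3

# Placeholder for the raw image bytes read from the parquet file
raw = b"\xff\xd8\xff\xe0\x00\x10JFIF"

con = sqlite3.connect(":memory:")  # or a file path for the real target
con.execute("CREATE TABLE celeb (id INTEGER PRIMARY KEY, image BLOB)")

# Binding a bytes object to a parameter stores it as a BLOB, untouched
con.execute("INSERT INTO celeb (image) VALUES (?)", (raw,))

(roundtrip,) = con.execute("SELECT image FROM celeb").fetchone()
assert roundtrip == raw  # bytes survive the round trip unchanged
```

The key point is that the bytes never pass through a CSV or JSON text representation, which is where the \ufffd replacement happened.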