parquet quasi-binary import #52

@Codereamp

In the parquet dataset tmdb-celeb-10k there's a field "image". When using anyquery (Windows) with the query

CREATE TABLE parq_import AS SELECT * FROM read_parquet('train-00000-of-00001-d95dffd623223e73.parquet')

the field ends up as JSON containing a "bytes" property that looks like a Unicode-escaped string. For the first row, the start looks like

{"bytes":"\ufffd\ufffd\ufffd\ufffd\u0000\u0010JFIF ...

The JSON is valid and parsable, but it looks like every byte above the ASCII range (>= 128) has been replaced with the replacement character (U+FFFD), so the process is lossy.
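The loss can be reproduced in isolation. The following is a minimal Python sketch, under the assumption (not confirmed for anyquery) that the importer decodes the raw bytes as UTF-8 and substitutes U+FFFD for invalid sequences. The sample bytes are the JPEG header prefix visible in the CSV output quoted in this issue.

```python
import json

# First bytes of the JPEG header as shown in the DuckDB CSV output.
raw = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# Assumption: the importer treats the bytes as UTF-8 text and replaces
# every invalid sequence with U+FFFD.
text = raw.decode("utf-8", errors="replace")
encoded = json.dumps({"bytes": text})
print(encoded)

# Encoding the string back does not restore the original bytes,
# because each U+FFFD no longer records which byte it replaced.
restored = json.loads(encoded)["bytes"].encode("utf-8")
print(restored == raw)
```

Under that assumption the printed JSON starts with four `\ufffd` escapes followed by `\u0000\u0010JFIF`, and the final comparison prints `False`.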

My final destination is a SQLite file, so I previously tried exporting to CSV with DuckDB, using something like this (Python):

import duckdb

duckdb.sql("""
    COPY (SELECT * FROM 'train-00000-of-00001-d95dffd623223e73.parquet')
    TO 'intermediate.csv' (HEADER, FORMAT 'csv')
""")

There the field was converted to non-standard JSON; for the same row, the start looks like

{'bytes': \xFF\xD8\xFF\xE0\x00\x10JFIF\

The escaping in that conversion was also broken, but at least the converter preserved the bytes.

Is there a way to import this "binary" field correctly, without losing data?
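One possible workaround outside anyquery, sketched here rather than offered as a confirmed fix, is to skip text formats entirely and write the bytes into a SQLite BLOB column from Python. The helper names `store_images` and `read_image_rows` are hypothetical, as is the table layout. The loader assumes that the `image` column is a struct with a `bytes` field (as the JSON output above suggests) and that the `duckdb` package returns it as Python `bytes`.

```python
import sqlite3


def store_images(db_path, rows):
    """Insert (id, image_bytes) pairs into a table with a BLOB column."""
    con = sqlite3.connect(db_path)
    with con:  # commits on success
        con.execute(
            "CREATE TABLE IF NOT EXISTS parq_import "
            "(id INTEGER PRIMARY KEY, image BLOB)"
        )
        con.executemany(
            "INSERT INTO parq_import (id, image) VALUES (?, ?)", rows
        )
    return con


def read_image_rows(parquet_path):
    """Hypothetical loader: assumes image is a struct with a 'bytes' field."""
    import duckdb  # imported lazily so the rest runs without it

    result = duckdb.execute(
        "SELECT image['bytes'] FROM read_parquet(?)", [parquet_path]
    ).fetchall()
    return [(i, row[0]) for i, row in enumerate(result, start=1)]


# Round trip with the JPEG header prefix quoted in this issue.
sample = [(1, b"\xff\xd8\xff\xe0\x00\x10JFIF")]
con = store_images(":memory:", sample)
stored = con.execute(
    "SELECT image FROM parq_import WHERE id = 1"
).fetchone()[0]
print(stored == sample[0][1])
```

With the real dataset, the call would look like `store_images("out.sqlite", read_image_rows("train-00000-of-00001-d95dffd623223e73.parquet"))`. Python's `sqlite3` module binds `bytes` values as BLOBs, so no escaping or text encoding is involved at any step.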

Labels

bug: Something isn't working
