parquet quasi-binary import #52

The parquet dataset [tmdb-celeb-10k](https://huggingface.co/datasets/ashraq/tmdb-celeb-10k) has a field "image". When I use anyquery (Windows) with the query

`CREATE TABLE parq_import AS SELECT * FROM read_parquet('train-00000-of-00001-d95dffd623223e73.parquet')`

the field ends up as JSON containing a "bytes" property that looks like a unicode-escaped string. For the first row, the start looks like
 
 ` {"bytes":"\ufffd\ufffd\ufffd\ufffd\u0000\u0010JFIF ... `

The JSON is valid and parsable, but it looks like every byte in the upper half of the range (128 and above) was replaced with the Unicode replacement character (U+FFFD), so the conversion is lossy.
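A small sketch of one plausible explanation, not a confirmed description of what anyquery does internally: if the raw bytes are decoded as UTF-8 with invalid sequences replaced, the first bytes of a JPEG file (`FF D8 FF E0 00 10 JFIF`) produce exactly this kind of output, and the original values cannot be recovered afterwards.

```python
import json

# First bytes of a JPEG/JFIF file.
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# Assumption: the importer decodes the bytes as UTF-8 and replaces
# anything that is not valid UTF-8 with U+FFFD.
as_text = jpeg_header.decode("utf-8", errors="replace")
as_json = json.dumps({"bytes": as_text})
print(as_json)

# Encoding the text back does not restore the original bytes,
# because all four high bytes collapsed into the same character.
round_trip = as_text.encode("utf-8")
print(round_trip == jpeg_header)
```

The four high bytes all map to the same character, which is why the result cannot be reversed.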

My final destination is a SQLite file, so I previously tried DuckDB's CSV export from Python, with something like

```
import duckdb

# Export the whole parquet file to an intermediate CSV file.
duckdb.sql("""
    COPY (SELECT * FROM 'train-00000-of-00001-d95dffd623223e73.parquet')
    TO 'intermediate.csv' (HEADER, FORMAT 'csv')
""")
```

and the field was converted to non-standard JSON. For the same row, the start looks like

`{'bytes': \xFF\xD8\xFF\xE0\x00\x10JFIF\`

The escaping in this output is also broken, but at least the converter preserved the byte values.

Is it possible to somehow correctly keep this "binary" field?
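For reference, a minimal sketch of what I would like to end up with: the raw bytes stored in a SQLite BLOB column and read back unchanged. It uses only Python's standard `sqlite3` module. The helper names `store_blobs` and `load_blob`, the `id` column, and the sample data are hypothetical. How the bytes get out of the parquet file is left open here; with DuckDB's Python API it might be something like selecting the `bytes` member of the `image` struct and calling `fetchall()`, but I have not verified that.

```python
import os
import sqlite3
import tempfile

def store_blobs(db_path, rows):
    """Write (row_id, image_bytes) pairs into a table with a BLOB column."""
    con = sqlite3.connect(db_path)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS parq_import "
            "(id INTEGER PRIMARY KEY, image BLOB)"
        )
        # Python bytes objects are bound as BLOB values, with no text decoding.
        con.executemany(
            "INSERT INTO parq_import (id, image) VALUES (?, ?)", rows
        )
        con.commit()
    finally:
        con.close()

def load_blob(db_path, row_id):
    """Read one image back as bytes, or None if the row does not exist."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT image FROM parq_import WHERE id = ?", (row_id,)
        ).fetchone()
        return row[0] if row else None
    finally:
        con.close()

# Hypothetical sample: the start of a JPEG file instead of real dataset rows.
sample = b"\xff\xd8\xff\xe0\x00\x10JFIF"
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
store_blobs(db_path, [(1, sample)])
restored = load_blob(db_path, 1)
print(restored == sample)
```

The point is that the bytes never pass through a text or JSON representation, so nothing has to be escaped or decoded along the way.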




