[Go][Parquet] A uint16 number written to parquet file not parseable by DuckDB #209
Description
Hi @zeroshade, I came across the closed issue #38616 and can still reproduce it when writing Arrow data to a Parquet file with pqarrow.
Here is the code that writes the Parquet file. It is based on one of your examples:
arrChan := make(chan arrow.Record, 10)
go func(ch <-chan arrow.Record) {
	first_rec := <-ch
	f, err := os.OpenFile("./test.parquet", os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	// ...
	// we'll use the default writer properties, but you could easily pass
	// properties to customize the writer
	props := parquet.NewWriterProperties()
	writer, err := pqarrow.NewFileWriter(first_rec.Schema(), f, props,
		pqarrow.DefaultWriterProps())
	if err != nil {
		panic(err)
	}
	defer writer.Close()
	fmt.Println("here")
	if err := writer.Write(first_rec); err != nil {
		fmt.Println(err)
		panic(err)
	}
	// first_rec.Release()
	for rec := range ch {
		if err := writer.Write(rec); err != nil {
			panic(err)
		}
		// rec.Release()
	}
}(arrChan)

The Arrow records are released outside this function.
This code writes out a test.parquet file and when I read it using DuckDB, I get this error:
Error: Invalid Input Error: Failed to cast value: Type UINT32 with value 4294967295 can't be cast because the value is out of range for the destination type UINT16
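For context, 4294967295 is 2^32 - 1: the largest UINT32 value, and also the unsigned reading of an int32 holding -1. The sketch below illustrates the kind of range-checked narrowing that the error message describes. The helper castToUint16 is hypothetical and written only for illustration; it is not DuckDB's or arrow-go's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// castToUint16 is a hypothetical helper (not a DuckDB or arrow-go API).
// It narrows a uint32 to uint16 and fails when the value does not fit.
func castToUint16(v uint32) (uint16, error) {
	if v > math.MaxUint16 {
		return 0, fmt.Errorf("value %d is out of range for UINT16 (max %d)", v, math.MaxUint16)
	}
	return uint16(v), nil
}

func main() {
	// An int32 holding -1 has the same bit pattern as the uint32 4294967295.
	minusOne := int32(-1)
	fmt.Println(uint32(minusOne)) // 4294967295

	// The statistic value from the file cannot be narrowed to 16 bits.
	if _, err := castToUint16(4294967295); err != nil {
		fmt.Println("cast failed:", err)
	}

	// A value that fits converts without error.
	v, err := castToUint16(65535)
	fmt.Println(v, err)
}
```

Any reader that validates values or statistics against the declared 16-bit logical type would be expected to reject this value in a similar way.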
Here is the output from the parquet-cli tool, similar to what's in #38616:
$ parquet pages test.parquet

Column: id
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B
  0-1    data  _ R  1       3.00 B     3 B                 0       "0" / "0"

Column: resource.id
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B
  0-1    data  _ R  1       9.00 B     9 B                 0       "4294967295" / "0"

Column: resource.schema_url
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       43.00 B    43 B
  0-1    data  _ R  1       9.00 B     9 B                 0       "https://opentelemetry.io/..." / "https://opentelemetry.io/..."

Column: scope.id
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B
  0-1    data  _ R  1       9.00 B     9 B                 0       "4294967295" / "0"

Column: metric_type
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B
  0-1    data  _ R  1       3.00 B     3 B                 0       "1" / "1"

Column: name
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       7.00 B     7 B
  0-1    data  _ R  1       3.00 B     3 B                         "gen" / "gen"
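The resource.id and scope.id columns have incorrect min values: 4294967295 is larger than the reported max of 0 and is out of range for a 16-bit unsigned integer.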
$ parquet meta test.parquet

File path:  test.parquet
Created by: parquet-go version 18.0.0-SNAPSHOT
Properties: (none)
Schema:
message schema {
  required int32 id (INTEGER(16,false));
  required group resource {
    optional int32 id (INTEGER(16,false));
    optional binary schema_url (STRING);
  }
  required group scope {
    optional int32 id (INTEGER(16,false));
  }
  required int32 metric_type (INTEGER(8,false));
  required binary name (STRING);
}

Row group 0:  count: 1  464.00 B records  start: 4  total(compressed): 464 B total(uncompressed):464 B
--------------------------------------------------------------------------------
                     type      encodings count     avg size   nulls   min / max
id                   INT32     _ _ R     1         56.00 B    0       "0" / "0"
resource.id          INT32     _ _ R     1         62.00 B    0       "4294967295" / "0"
resource.schema_url  BINARY    _ _ R     1         171.00 B   0       "https://opentelemetry.io/..." / "https://opentelemetry.io/..."
scope.id             INT32     _ _ R     1         62.00 B    0       "4294967295" / "0"
metric_type          INT32     _ _ R     1         56.00 B    0       "1" / "1"
name                 BINARY    _ _ R     1         57.00 B            "gen" / "gen"
I hope these reproduction details are sufficient. If any details are missing, please let me know and I will provide them as soon as possible. Thank you!
GOARCH='amd64'
GOOS='linux'
GOVERSION='go1.23.4'
Component(s)
Parquet