[Go][Parquet] A uint16 number written to parquet file not parseable by DuckDB #209

@venkat-oss

Hi @zeroshade, I came across the closed issue #38616 and can still reproduce it when writing Arrow data to a Parquet file with pqarrow.

Here is the code that's writing to parquet file, I'm using one of your examples:

arrChan := make(chan arrow.Record, 10)

go func(ch <-chan arrow.Record) {
	firstRec := <-ch
	f, err := os.OpenFile("./test.parquet", os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	// ...
	// we'll use the default writer properties, but you could easily pass
	// properties to customize the writer
	props := parquet.NewWriterProperties()
	writer, err := pqarrow.NewFileWriter(firstRec.Schema(), f, props,
		pqarrow.DefaultWriterProps())
	if err != nil {
		panic(err)
	}
	defer writer.Close()
	fmt.Println("here")

	if err := writer.Write(firstRec); err != nil {
		fmt.Println(err)
		panic(err)
	}
	// firstRec.Release()

	for rec := range ch {
		if err := writer.Write(rec); err != nil {
			panic(err)
		}
		// rec.Release()
	}
}(arrChan)

The Arrow records are released outside this function.

This code writes out a test.parquet file and when I read it using DuckDB, I get this error:

Error: Invalid Input Error: Failed to cast value: Type UINT32 with value 4294967295 can't be cast because the value is out of range for the destination type UINT16
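For context, the value in that error is exactly the largest 32-bit unsigned integer, which is also the bit pattern of the int32 value -1 when reinterpreted as unsigned. The standard-library-only sketch below just checks that arithmetic. It does not claim to show where the writer produces the value, and `fitsUint16` is a hypothetical helper name, not part of pqarrow or DuckDB.

```go
package main

import (
	"fmt"
	"math"
)

// fitsUint16 reports whether v can be represented as a uint16.
// Hypothetical helper used only for this illustration.
func fitsUint16(v uint32) bool {
	return v <= math.MaxUint16
}

func main() {
	// Value copied from the DuckDB error message.
	const reported uint32 = 4294967295

	// Is it the maximum uint32?
	fmt.Println(reported == math.MaxUint32)

	// Is it what int32(-1) looks like when reinterpreted as unsigned?
	// A variable is used because Go rejects this as a constant conversion.
	minusOne := int32(-1)
	fmt.Println(uint32(minusOne) == reported)

	// Does it fit in the UINT16 logical type DuckDB tries to cast to?
	fmt.Println(fitsUint16(reported))
}
```

If the first two checks hold and the last one fails, the failure DuckDB reports is consistent with a signed 32-bit value ending up where an unsigned 16-bit one was expected, though that is only an assumption about the cause.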

Here is the output from the parquet-cli tool, similar to what's in #38616:

$ parquet pages test.parquet

Column: id
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B       
  0-1    data  _ R  1       3.00 B     3 B                 0       "0" / "0"


Column: resource.id
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B       
  0-1    data  _ R  1       9.00 B     9 B                 0       "4294967295" / "0"


Column: resource.schema_url
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       43.00 B    43 B      
  0-1    data  _ R  1       9.00 B     9 B                 0       "https://opentelemetry.io/..." / "https://opentelemetry.io/..."


Column: scope.id
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B       
  0-1    data  _ R  1       9.00 B     9 B                 0       "4294967295" / "0"


Column: metric_type
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B       
  0-1    data  _ R  1       3.00 B     3 B                 0       "1" / "1"


Column: name
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       7.00 B     7 B       
  0-1    data  _ R  1       3.00 B     3 B                         "gen" / "gen"

The columns resource.id and scope.id have incorrect min values: the reported min (4294967295) is greater than the max (0) and does not fit in a uint16.
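This kind of statistic can be flagged mechanically. The following is a minimal sketch, assuming the min/max statistics arrive as the raw 4-byte little-endian values that Parquet uses for INT32 columns, and that a column annotated INTEGER(16,false) should be compared as unsigned. `checkUint16Stats` is a hypothetical helper, not a function from pqarrow, parquet-cli, or DuckDB, and the byte values in `main` are simply ones that decode to the numbers shown above.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// checkUint16Stats is a hypothetical helper that validates raw INT32
// min/max statistics for a column whose logical type is UINT16.
func checkUint16Stats(minRaw, maxRaw []byte) error {
	if len(minRaw) != 4 || len(maxRaw) != 4 {
		return fmt.Errorf("expected 4-byte INT32 statistics, got %d and %d bytes",
			len(minRaw), len(maxRaw))
	}
	// Decode as unsigned, since the logical type is unsigned.
	lo := binary.LittleEndian.Uint32(minRaw)
	hi := binary.LittleEndian.Uint32(maxRaw)

	if lo > math.MaxUint16 || hi > math.MaxUint16 {
		return fmt.Errorf("statistic out of UINT16 range: min=%d max=%d", lo, hi)
	}
	if lo > hi {
		return fmt.Errorf("min %d is greater than max %d", lo, hi)
	}
	return nil
}

func main() {
	// Bytes that decode to min=4294967295, max=0, as in resource.id and scope.id.
	badMin := []byte{0xff, 0xff, 0xff, 0xff}
	zero := []byte{0x00, 0x00, 0x00, 0x00}

	fmt.Println(checkUint16Stats(badMin, zero)) // expected to report an error
	fmt.Println(checkUint16Stats(zero, zero))   // expected to pass, as for the id column
}
```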

$ parquet meta test.parquet

File path:  test.parquet
Created by: parquet-go version 18.0.0-SNAPSHOT
Properties: (none)
Schema:
message schema {
  required int32 id (INTEGER(16,false));
  required group resource {
    optional int32 id (INTEGER(16,false));
    optional binary schema_url (STRING);
  }
  required group scope {
    optional int32 id (INTEGER(16,false));
  }
  required int32 metric_type (INTEGER(8,false));
  required binary name (STRING);
}


Row group 0:  count: 1  464.00 B records  start: 4  total(compressed): 464 B total(uncompressed):464 B 
--------------------------------------------------------------------------------
                     type      encodings count     avg size   nulls   min / max
id                   INT32     _ _ R     1         56.00 B    0       "0" / "0"
resource.id          INT32     _ _ R     1         62.00 B    0       "4294967295" / "0"
resource.schema_url  BINARY    _ _ R     1         171.00 B   0       "https://opentelemetry.io/..." / "https://opentelemetry.io/..."
scope.id             INT32     _ _ R     1         62.00 B    0       "4294967295" / "0"
metric_type          INT32     _ _ R     1         56.00 B    0       "1" / "1"
name                 BINARY    _ _ R     1         57.00 B            "gen" / "gen"

I hope these reproduction details are sufficient. If any details are missing, please let me know and I will provide them as soon as possible. Thank you!

GOARCH='amd64'
GOOS='linux'
GOVERSION='go1.23.4'

Component(s)

Parquet
