[Go][Parquet] A uint16 number written to parquet file not parseable by DuckDB #38616

@eest

Describe the bug, including details regarding any error messages, version, and platform.

Hello,

When I write a parquet file containing a uint16 column, the file is written without complaint, but DuckDB is then unable to read it. The error looks like this:

$ duckdb -c 'select * from test.parquet'
Error: Invalid Input Error: Failed to cast value: Type UINT32 with value 4294967295 can't be cast because the value is out of range for the destination type UINT16

The file can be generated with this code:

package main

import (
	"log"
	"os"

	"github.com/apache/arrow/go/v14/arrow"
	"github.com/apache/arrow/go/v14/arrow/array"
	"github.com/apache/arrow/go/v14/arrow/memory"
	"github.com/apache/arrow/go/v14/parquet/pqarrow"
)

func main() {
	schema := arrow.NewSchema(
		[]arrow.Field{{Name: "port", Type: arrow.PrimitiveTypes.Uint16, Nullable: true}},
		nil,
	)

	pool := memory.NewGoAllocator()

	rb := array.NewRecordBuilder(pool, schema)
	defer rb.Release()

	port := rb.Field(0).(*array.Uint16Builder)
	defer port.Release()

	port.Append(12)

	record := rb.NewRecord()

	outFile, err := os.Create("test.parquet")
	if err != nil {
		log.Fatalf("unable to create output file: %s", err)
	}

	parquetWriter, err := pqarrow.NewFileWriter(schema, outFile, nil, pqarrow.DefaultWriterProps())
	if err != nil {
		log.Fatalf("unable to create parquet writer: %s", err)
	}

	err = parquetWriter.Write(record)
	if err != nil {
		log.Fatalf("unable to write parquet file: %s", err)
	}

	err = parquetWriter.Close()
	if err != nil {
		log.Fatalf("unable to close parquet file: %s", err)
	}
}

Since the test code only appends the number 12, I was unsure where that large number came from, so I inspected the file using the parquet-cli tool, with the following results:

$ parquet pages test.parquet

Column: port
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4,00 B     4 B
  0-1    data  _ R  1       9,00 B     9 B                 0       "4294967295" / "12"

While max reflects the number actually appended to the column, min is for some reason the large number that DuckDB fails to cast to a uint16. The number is also visible in the output of the meta command:

$ parquet meta test.parquet

File path:  test.parquet
Created by: parquet-go version 14.0.0
Properties: (none)
Schema:
message schema {
  optional int32 port (INTEGER(16,false));
}


Row group 0:  count: 1  62,00 B records  start: 4  total(compressed): 62 B total(uncompressed):62 B
--------------------------------------------------------------------------------
      type      encodings count     avg size   nulls   min / max
port  INT32     _ _ R     1         62,00 B    0       "4294967295" / "12"

It seems strange to me that min is not also 12, since a count of 1 should indicate that only one value is present.

Component(s)

Go
