1 vote · 1 answer · 51 views

col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())
table = pa.Table.from_arrays(
    [col_a, col_b],
    schema=pa.schema([ ...
asked by Sze Yu Sim
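A minimal runnable sketch of the pattern in this excerpt; the field names "a" and "b" are assumptions, since the original schema is cut off:

import pyarrow as pa

col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())

# Field names below are assumptions; the excerpt truncates the schema.
schema = pa.schema([
    pa.field("a", pa.int32()),
    pa.field("b", pa.string()),
])
table = pa.Table.from_arrays([col_a, col_b], schema=schema)
print(table)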
0 votes · 0 answers · 45 views

I am loading data from Parquet into Azure SQL Database using this pipeline: Parquet → PyArrow → CSV (Azure Blob) → BULK INSERT. One column in the Parquet file is binary (hashed passwords). PyArrow CSV ...
asked by mysin
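One common workaround, sketched under the assumption that the raw binary bytes get mangled on the CSV leg: hex-encode the binary column before writing, then decode on the SQL side (for example with CONVERT(varbinary(max), col, 2); untested here). The table and column names are hypothetical:

import pyarrow as pa
import pyarrow.csv as pacsv

# Hypothetical table; "pwd_hash" stands in for the hashed-password column.
table = pa.table({
    "id": pa.array([1, 2], pa.int32()),
    "pwd_hash": pa.array([b"\x01\xaf", b"\xde\xad"], pa.binary()),
})

# Replace the binary column with hex strings so the CSV round-trips losslessly.
idx = table.schema.get_field_index("pwd_hash")
hex_col = pa.array([v.as_py().hex() for v in table["pwd_hash"]], pa.string())
table = table.set_column(idx, "pwd_hash", hex_col)

pacsv.write_csv(table, "out.csv")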
0 votes · 1 answer · 101 views

I'm creating a new venv (using virtualenv) with Python 3.12. The only two packages I'm installing are libsumo and pyarrow. When I run only this line: import libsumo, or only this line: import pyarrow ...
asked by Godzy
1 vote · 0 answers · 63 views

I'm trying to create a Parquet file from a heavily normalized SQL database with a snowflake schema. Some of the dimensions have very long text attributes, so simply running a big set of joins to ...
asked by Davor Cubranic
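If the concern is repeated long strings after denormalizing, Parquet's dictionary encoding usually absorbs most of that repetition. A small sketch with a hypothetical fact table (note that use_dictionary is already True by default in pyarrow; it is spelled out here for clarity):

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical denormalized output: the long dimension text repeats per fact row.
table = pa.table({
    "fact_id": pa.array(range(6), pa.int64()),
    "dim_text": pa.array(["very long description A"] * 3
                         + ["very long description B"] * 3),
})

# Dictionary encoding stores each distinct string once per row group,
# so the repeated join output stays small on disk.
pq.write_table(table, "fact.parquet", use_dictionary=True)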
1 vote · 0 answers · 31 views

If I have
import pyarrow as pa
ca = pa.chunked_array([[1, 2, 3]])
and then do t = pa.table({'a': ca}), then was the pa.table operation a zero-copy one? I would expect it to be, but is there any way to ...
asked by ignoring_gravity
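Two ways to probe this, sketched below: watch the allocator counter across the call, and compare the underlying buffer addresses directly:

import pyarrow as pa

ca = pa.chunked_array([[1, 2, 3]])

before = pa.total_allocated_bytes()
t = pa.table({'a': ca})
after = pa.total_allocated_bytes()
print(after - before)  # 0 suggests no new buffers were allocated

# Stronger check: the table's data buffer is literally the input's buffer.
same = (t.column('a').chunk(0).buffers()[1].address
        == ca.chunk(0).buffers()[1].address)
print(same)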
1 vote · 1 answer · 300 views

I have the following Python code that uses PySpark to mock a fraud detection system for credit cards:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, ...
asked by Marco Filippozzi
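The excerpt cuts off at the imports. A self-contained sketch of the from_json pattern it appears to use, with a hypothetical transaction schema:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("fraud-demo").getOrCreate()

# Hypothetical schema; the real one is cut off in the excerpt.
schema = StructType([
    StructField("card_id", StringType()),
    StructField("amount", DoubleType()),
])

raw = spark.createDataFrame([('{"card_id": "c1", "amount": 42.0}',)], ["value"])
parsed = raw.select(from_json(col("value"), schema).alias("tx")).select("tx.*")
parsed.show()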
5 votes · 1 answer · 768 views

I have the following Python statement, which I cannot execute in Jupyter Notebook or the Python REPL:
import tensorflow
Python 3.11.10 (main, Sep 20 2024, 14:23:57) [Clang 16.0.0 (clang-1600.0.26.3)] on ...
asked by Mikko Ohtamaa
3 votes · 1 answer · 259 views

I am having difficulties with this: (aws-lambda-python-alpha): Failed to install numpy 2.3.0 with Python 3.11 or lower. My Dockerfile:
FROM public.ecr.aws/lambda/python:3.11
# Install
RUN pip install '...
asked by Flo
3 votes · 1 answer · 410 views

We're running a FastAPI service that fetches data from Trino, processes it using PyArrow and Polars, and uploads the result to AWS S3 in Parquet format. However, we're facing a persistent issue where ...
asked by DonOfDen
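A diagnostic sketch, assuming the truncated issue is the common one of process memory staying high after large Arrow allocations; if RSS drops after release_unused(), the pool (not a leak) was holding the memory:

import pyarrow as pa

pool = pa.default_memory_pool()
print(pool.backend_name, pool.bytes_allocated())

# Ask the allocator to hand unused memory back to the OS.
pool.release_unused()
print(pool.bytes_allocated())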
1 vote · 2 answers · 103 views

Say I have
data = {'a': [1, 1, 2], 'b': [4, 5, 6]}
and I'd like to get a cumulative count (1-indexed) per group. In pandas, I can do:
import pandas as pd
pd.DataFrame(data).groupby('a').cumcount() + 1
...
asked by ignoring_gravity
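The pandas version runs as shown. One possible Polars translation (an assumption about what the truncated question is after) uses a 0-based range per group, shifted to 1-based:

import pandas as pd
import polars as pl

data = {'a': [1, 1, 2], 'b': [4, 5, 6]}

# pandas: groupby.cumcount() is 0-indexed, hence the +1
print((pd.DataFrame(data).groupby('a').cumcount() + 1).tolist())  # [1, 2, 1]

# Polars sketch: per-group row index plus one
out = pl.DataFrame(data).with_columns(
    (pl.int_range(pl.len()).over('a') + 1).alias('cumcount')
)
print(out)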
0 votes · 1 answer · 159 views

I am loading a large Parquet file with pyarrow and then converting it to a pandas DataFrame. Since this can be very memory-intensive, I need to see if loading the entire file in one go can fit into the ...
asked by TheLegs
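A sketch of estimating the footprint from the Parquet footer before committing to a full read; the filename is a placeholder:

import pyarrow.parquet as pq

pf = pq.ParquetFile("big.parquet")  # placeholder path
meta = pf.metadata

# Uncompressed Arrow-side size from row-group metadata; the pandas DataFrame
# usually needs more than this (object columns, index, conversion copies).
uncompressed = sum(meta.row_group(i).total_byte_size
                   for i in range(meta.num_row_groups))
print(meta.num_rows, uncompressed)

# Calibrate on a single row group instead of the whole file.
sample = pf.read_row_group(0)
print(sample.nbytes, sample.to_pandas().memory_usage(deep=True).sum())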
1 vote · 0 answers · 204 views

I'm using DuckDB to process data stored in Parquet files, organized in a Hive-style directory structure partitioned by year, month, day, and hour. Each Parquet file contains around 150 columns, and I ...
asked by Deepank Dhillon
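A sketch of the projection-plus-partition-filter pattern that lets DuckDB prune both files and columns on such a layout; the path and column names are assumptions:

import duckdb

duckdb.sql("""
    SELECT col1, col2                     -- project only the columns you need
    FROM read_parquet('data/**/*.parquet', hive_partitioning = true)
    WHERE year = 2024 AND month = 6       -- partition filters prune whole files
""").df()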
0 votes · 2 answers · 271 views

I'm experiencing timestamp precision issues when reading Delta tables created by an Azure Data Factory CDC dataflow. The pipeline extracts data from Azure SQL Database (using native CDC enabled on the ...
asked by neal301
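A small sketch, assuming the precision issue is a nanosecond-versus-microsecond unit mismatch (a frequent cause with Delta, which stores microsecond timestamps):

import pyarrow as pa

arr = pa.array([1_700_000_000_123_456_789], pa.timestamp('ns'))

# safe=False permits the lossy truncation of sub-microsecond digits;
# the default safe cast would raise instead of silently losing data.
us = arr.cast(pa.timestamp('us'), safe=False)
print(arr[0], '->', us[0])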
0 votes · 0 answers · 28 views

I'm encountering an issue in Modin (v0.32.0) where I can access .cat.codes on a categorical column before a groupby, but not after grouping.
import modin.pandas as pd
df = pd.read_parquet(path="....
asked by Sumukha G C
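A possible workaround sketch, assuming the failure only appears on the grouped result: materialize the codes as a plain column before grouping, so nothing needs the .cat accessor afterwards:

import modin.pandas as pd

df = pd.DataFrame({'cat': ['x', 'y', 'x'], 'v': [1, 2, 3]})
df['cat'] = df['cat'].astype('category')

# Take .cat.codes while the column is still an ordinary Series.
df['code'] = df['cat'].cat.codes
print(df.groupby('code')['v'].sum())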
0 votes · 0 answers · 56 views

I would like to use Modin to read a partitioned Parquet file. The Parquet file has a single partition key of type int. When I run it, Modin automatically switches to the default pandas implementation with the ...
asked by MarcelloDG
