1 vote · 1 answer · 51 views

col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())
table = pa.Table.from_arrays(
    [col_a, col_b],
    schema=pa.schema([ ...
asked by Sze Yu Sim
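A minimal runnable sketch of the pattern in this excerpt; the field names "a" and "b" are assumptions, since the original schema is cut off:

import pyarrow as pa

col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())

# Field names below are assumptions; the excerpt truncates the schema.
schema = pa.schema([
    pa.field("a", pa.int32()),
    pa.field("b", pa.string()),
])
table = pa.Table.from_arrays([col_a, col_b], schema=schema)
print(table)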
0 votes · 0 answers · 45 views

I am loading data from Parquet into Azure SQL Database using this pipeline: Parquet → PyArrow → CSV (Azure Blob) → BULK INSERT. One column in the Parquet file is binary (hashed passwords). PyArrow CSV ...
asked by mysin
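One common workaround, sketched under the assumption that the raw binary bytes get mangled on the CSV leg: hex-encode the binary column before writing, then decode on the SQL side (for example with CONVERT(varbinary(max), col, 2); untested here). The table and column names are hypothetical:

import pyarrow as pa
import pyarrow.csv as pacsv

# Hypothetical table; "pwd_hash" stands in for the hashed-password column.
table = pa.table({
    "id": pa.array([1, 2], pa.int32()),
    "pwd_hash": pa.array([b"\x01\xaf", b"\xde\xad"], pa.binary()),
})

# Replace the binary column with hex strings so the CSV round-trips losslessly.
idx = table.schema.get_field_index("pwd_hash")
hex_col = pa.array([v.as_py().hex() for v in table["pwd_hash"]], pa.string())
table = table.set_column(idx, "pwd_hash", hex_col)

pacsv.write_csv(table, "out.csv")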
0 votes · 1 answer · 101 views

I'm creating a new venv (using virtualenv) with Python 3.12. The only two packages I'm installing are libsumo and pyarrow. When I run only this line: import libsumo, or only this line: import pyarrow ...
asked by Godzy
1 vote · 0 answers · 63 views

I'm trying to create a Parquet file from a heavily normalized SQL database with a snowflake schema. Some of the dimensions have very long text attributes, so simply running a big set of joins to ...
asked by Davor Cubranic
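If the concern is repeated long strings after denormalizing, Parquet's dictionary encoding usually absorbs most of that repetition. A small sketch with a hypothetical fact table (note that use_dictionary is already True by default in pyarrow; it is spelled out here for clarity):

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical denormalized output: the long dimension text repeats per fact row.
table = pa.table({
    "fact_id": pa.array(range(6), pa.int64()),
    "dim_text": pa.array(["very long description A"] * 3
                         + ["very long description B"] * 3),
})

# Dictionary encoding stores each distinct string once per row group,
# so the repeated join output stays small on disk.
pq.write_table(table, "fact.parquet", use_dictionary=True)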
1 vote · 0 answers · 31 views

If I have
import pyarrow as pa
ca = pa.chunked_array([[1, 2, 3]])
and then do t = pa.table({'a': ca}), then was the pa.table operation a zero-copy one? I would expect it to be, but is there any way to ...
asked by ignoring_gravity
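Two ways to probe this, sketched below: watch the allocator counter across the call, and compare the underlying buffer addresses directly:

import pyarrow as pa

ca = pa.chunked_array([[1, 2, 3]])

before = pa.total_allocated_bytes()
t = pa.table({'a': ca})
after = pa.total_allocated_bytes()
print(after - before)  # 0 suggests no new buffers were allocated

# Stronger check: the table's data buffer is literally the input's buffer.
same = (t.column('a').chunk(0).buffers()[1].address
        == ca.chunk(0).buffers()[1].address)
print(same)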
1 vote · 1 answer · 300 views

I have the following Python code that uses PySpark to mock a fraud detection system for credit cards:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, ...
asked by Marco Filippozzi
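The excerpt cuts off at the imports. A self-contained sketch of the from_json pattern it appears to use, with a hypothetical transaction schema:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("fraud-demo").getOrCreate()

# Hypothetical schema; the real one is cut off in the excerpt.
schema = StructType([
    StructField("card_id", StringType()),
    StructField("amount", DoubleType()),
])

raw = spark.createDataFrame([('{"card_id": "c1", "amount": 42.0}',)], ["value"])
parsed = raw.select(from_json(col("value"), schema).alias("tx")).select("tx.*")
parsed.show()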
5 votes · 1 answer · 768 views

I have the following Python statement, which I cannot execute in Jupyter Notebook or the Python REPL:
import tensorflow
Python 3.11.10 (main, Sep 20 2024, 14:23:57) [Clang 16.0.0 (clang-1600.0.26.3)] on ...
asked by Mikko Ohtamaa
3 votes · 1 answer · 259 views

I am having difficulties with this: (aws-lambda-python-alpha): Failed to install numpy 2.3.0 with Python 3.11 or lower. My Dockerfile:
FROM public.ecr.aws/lambda/python:3.11
# Install
RUN pip install '...
asked by Flo
3 votes · 1 answer · 410 views

We're running a FastAPI service that fetches data from Trino, processes it using PyArrow and Polars, and uploads the result to AWS S3 in Parquet format. However, we're facing a persistent issue where ...
asked by DonOfDen
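A diagnostic sketch, assuming the truncated issue is the common one of process memory staying high after large Arrow allocations; if RSS drops after release_unused(), the pool (not a leak) was holding the memory:

import pyarrow as pa

pool = pa.default_memory_pool()
print(pool.backend_name, pool.bytes_allocated())

# Ask the allocator to hand unused memory back to the OS.
pool.release_unused()
print(pool.bytes_allocated())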
1 vote · 2 answers · 103 views

Say I have
data = {'a': [1, 1, 2], 'b': [4, 5, 6]}
and I'd like to get a cumulative count (1-indexed) per group. In pandas, I can do:
import pandas as pd
pd.DataFrame(data).groupby('a').cumcount() + 1
...
asked by ignoring_gravity
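The pandas version runs as shown. One possible Polars translation (an assumption about what the truncated question is after) uses a 0-based range per group, shifted to 1-based:

import pandas as pd
import polars as pl

data = {'a': [1, 1, 2], 'b': [4, 5, 6]}

# pandas: groupby.cumcount() is 0-indexed, hence the +1
print((pd.DataFrame(data).groupby('a').cumcount() + 1).tolist())  # [1, 2, 1]

# Polars sketch: per-group row index plus one
out = pl.DataFrame(data).with_columns(
    (pl.int_range(pl.len()).over('a') + 1).alias('cumcount')
)
print(out)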
0 votes · 1 answer · 159 views

I am loading a large Parquet file with pyarrow and then converting it to a pandas DataFrame. Since this can be very memory-intensive, I need to see if loading the entire file in one go can fit into the ...
asked by TheLegs
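A sketch of estimating the footprint from the Parquet footer before committing to a full read; the filename is a placeholder:

import pyarrow.parquet as pq

pf = pq.ParquetFile("big.parquet")  # placeholder path
meta = pf.metadata

# Uncompressed Arrow-side size from row-group metadata; the pandas DataFrame
# usually needs more than this (object columns, index, conversion copies).
uncompressed = sum(meta.row_group(i).total_byte_size
                   for i in range(meta.num_row_groups))
print(meta.num_rows, uncompressed)

# Calibrate on a single row group instead of the whole file.
sample = pf.read_row_group(0)
print(sample.nbytes, sample.to_pandas().memory_usage(deep=True).sum())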
1 vote · 0 answers · 204 views

I'm using DuckDB to process data stored in Parquet files, organized in a Hive-style directory structure partitioned by year, month, day, and hour. Each Parquet file contains around 150 columns, and I ...
asked by Deepank Dhillon
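A sketch of the projection-plus-partition-filter pattern that lets DuckDB prune both files and columns on such a layout; the path and column names are assumptions:

import duckdb

duckdb.sql("""
    SELECT col1, col2                     -- project only the columns you need
    FROM read_parquet('data/**/*.parquet', hive_partitioning = true)
    WHERE year = 2024 AND month = 6       -- partition filters prune whole files
""").df()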
0 votes · 2 answers · 271 views

I'm experiencing timestamp precision issues when reading Delta tables created by an Azure Data Factory CDC dataflow. The pipeline extracts data from Azure SQL Database (using native CDC enabled on the ...
asked by neal301
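A small sketch, assuming the precision issue is a nanosecond-versus-microsecond unit mismatch (a frequent cause with Delta, which stores microsecond timestamps):

import pyarrow as pa

arr = pa.array([1_700_000_000_123_456_789], pa.timestamp('ns'))

# safe=False permits the lossy truncation of sub-microsecond digits;
# the default safe cast would raise instead of silently losing data.
us = arr.cast(pa.timestamp('us'), safe=False)
print(arr[0], '->', us[0])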
0 votes · 0 answers · 28 views

I'm encountering an issue in Modin (v0.32.0) where I can access .cat.codes on a categorical column before a groupby, but not after grouping.
import modin.pandas as pd
df = pd.read_parquet(path="....
asked by Sumukha G C
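A possible workaround sketch, assuming the failure only appears on the grouped result: materialize the codes as a plain column before grouping, so nothing needs the .cat accessor afterwards:

import modin.pandas as pd

df = pd.DataFrame({'cat': ['x', 'y', 'x'], 'v': [1, 2, 3]})
df['cat'] = df['cat'].astype('category')

# Take .cat.codes while the column is still an ordinary Series.
df['code'] = df['cat'].cat.codes
print(df.groupby('code')['v'].sum())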
0 votes · 0 answers · 56 views

I would like to use Modin to read a partitioned Parquet file. The Parquet file has a single partition key of type int. When I run it, Modin automatically switches to the default pandas implementation with the ...
asked by MarcelloDG
