Closed
Description
(Comes from #1978 (comment))
What happened:
$ ipython
Python 3.8.3 (default, May 20 2020, 12:50:54)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.15.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: # coding: utf-8
...: from dask import dataframe as dd
...: import pandas as pd
...: from distributed import Client
...: client = Client()
...: df = dd.read_csv("../data/yellow_tripdata_2019-*.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])
   ...: payment_types = {
   ...:     1: "Credit Card",
   ...:     2: "Cash",
   ...:     3: "No Charge",
   ...:     4: "Dispute",
   ...:     5: "Unknown",
   ...:     6: "Voided trip"
   ...: }
   ...: payment_names = pd.Series(
   ...:     payment_types, name="payment_name"
   ...: ).to_frame()
   ...: df2 = df.merge(
   ...:     payment_names, left_on="payment_type", right_index=True
   ...: )
   ...: op = df2.groupby("payment_name")["tip_amount"].mean()
   ...: client.compute(op)
   ...:
Out[1]: <Future: pending, key: finalize-85edcc1f23785545f628c932abd19768>
In [2]: distributed.worker - WARNING - Compute Failed
Function: _apply_chunk
args: ( VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag ... mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge payment_name
0 1 2019-01-04 14:08:46 2019-01-04 14:18:10 1 1.70 1 N ... 0.5 0.0 0.00 0.3 9.30 NaN Cash
1 1 2019-01-04 14:20:33 2019-01-04 14:25:10 1 0.90 1 N ... 0.5 0.0 0.00 0.3 6.30 NaN Cash
13 2 2019-01-04 14:14:45 2019-01-04 14:26:00 5 1.63 1 N ... 0.5 0.0 0.00 0.3 9.80 NaN Cash
15 2 2019-01-04 14:49:45 2019-01-04 15:0
kwargs: {'chunk': <methodcaller: sum>, 'columns': 'tip_amount'}
Exception: ValueError('buffer source array is read-only')
In [2]:
In [2]: client
Out[2]: <Client: 'tcp://127.0.0.1:33689' processes=4 threads=4, memory=16.70 GB>
In [3]: _1
Out[3]: <Future: error, key: finalize-85edcc1f23785545f628c932abd19768>
What you expected to happen: The operation finishes without error.
Minimal Complete Verifiable Example:
# coding: utf-8
from dask import dataframe as dd
import pandas as pd
from distributed import Client
client = Client()
df = dd.read_csv("../data/yellow_tripdata_2019-*.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])
payment_types = {
    1: "Credit Card",
    2: "Cash",
    3: "No Charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip"
}
payment_names = pd.Series(
    payment_types, name="payment_name"
).to_frame()
df2 = df.merge(
    payment_names, left_on="payment_type", right_index=True
)
op = df2.groupby("payment_name")["tip_amount"].mean()
client.compute(op)
Data:
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv
Anything else we need to know?: I managed to avoid this error by reducing the number of files, but the error then reappeared at a later point in the workflow. I suspect this behavior depends on the available RAM.
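For context, the exception text matches what NumPy/Cython raise when code attempts a write (or a writable memoryview) on a read-only buffer; distributed can deserialize incoming frames into such read-only memory without copying. A minimal NumPy-only sketch of that buffer state (the `setflags` call simulates zero-copy deserialization; this is an assumption about the root cause, not a confirmed diagnosis):

```python
import numpy as np

arr = np.arange(5.0)
arr.setflags(write=False)  # simulate a buffer received without a copy

try:
    arr[0] = 1.0  # any write to the read-only buffer is rejected
except ValueError as exc:
    print(exc)

writable = arr.copy()  # an explicit copy yields writable memory again
writable[0] = 1.0
```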
Environment:
- Dask version: 2.18.1
- Python version: 3.8.3
- Operating System: Linux Mint 19.3
- Install method (conda, pip, source): pip