Closed
Labels
dataframe, good second issue (Clearly described, educational, but less trivial than "good first issue".)
Description
This article seems decently well done. It has a nice notebook at the end with simple workflows on which Dask DataFrame apparently didn't perform very well. I think it would be a useful exercise for someone to go through it, produce a performance report for each section, and present them here for analysis. I suspect that we could find some opportunities for performance optimization.
Edit: lessons learned
- make len(df) be len(df.index), and that this actually triggers the existing columns optimization for parquet (Dispatch iloc calls to getitem #6355)
- specifically for len(df) from parquet, use metadata nrows value rather than loading (Use parquet metadata to get length #6387)
- combine column selections and also pass down to parquet (and csv or others!) (When combining complex column selections on dataframes, join the selection and push down #6388)
- gather_statistics in read_parquet: revisit when this is triggered at all, and quit early if failing, preventing the load of metadata even if gather_statistics=True (Revisit gather_statistics in read_parquet #6389)
- revisit memory limits for default local cluster (Revise memory limits on the default LocalCluster distributed#3956)
- investigate a way to monitor and report on GIL contention or thread lock waits (Investigate GIL monitoring #6391)
- find more examples to base future profile investigations on (Continue to make profile investigations #6392)
- why didn't the profile tab show up for me??
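
On the GIL monitoring item above, here is a rough sketch of one possible heuristic, using only the standard library. It is not what Dask does and not a proposal for #6391, just an illustration: a sampling thread sleeps for a short interval and records how late it wakes up, on the assumption that late wake-ups grow when other threads hold the GIL. The names `sample_wakeup_delays`, `busy_loop` and `measure` are made up for this example, and the numbers also include ordinary OS scheduling noise.

```python
import threading
import time


def sample_wakeup_delays(interval=0.005, samples=20):
    """Sleep `samples` times and record how late each wake-up is (seconds).

    A late wake-up means this thread had to wait before running again,
    either for the GIL or for the OS scheduler. Heuristic only.
    """
    delays = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval)
        elapsed = time.perf_counter() - start
        delays.append(max(0.0, elapsed - interval))
    return delays


def busy_loop(stop_event):
    # Pure-Python CPU work, so the thread holds the GIL while it runs.
    while not stop_event.is_set():
        sum(i * i for i in range(10_000))


def measure(n_busy_threads):
    """Mean wake-up delay while `n_busy_threads` CPU-bound threads run."""
    stop = threading.Event()
    workers = [
        threading.Thread(target=busy_loop, args=(stop,))
        for _ in range(n_busy_threads)
    ]
    for worker in workers:
        worker.start()
    try:
        delays = sample_wakeup_delays()
    finally:
        # Always stop the busy threads, even if sampling fails.
        stop.set()
        for worker in workers:
            worker.join()
    return sum(delays) / len(delays)


idle_delay = measure(0)
contended_delay = measure(4)
print(f"mean wake-up delay, idle: {idle_delay * 1e3:.3f} ms")
print(f"mean wake-up delay, 4 busy threads: {contended_delay * 1e3:.3f} ms")
```

On a standard (GIL-enabled) CPython build I would expect the second number to be noticeably larger than the first, but that depends on the machine and interpreter, so treat the comparison as indicative only.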