@dask/maintenance this is ready for review. |
enforce_alignment to map_partitions to allow passing a dataframe as an arg directly
dask/dataframe/core.py
Outdated
| applying the function. | ||
| pandas) will be repartitioned to align (if necessary) before | ||
| applying the function (see ``enforce_alignment`` to control). | ||
| enforce_metadata : bool |
While you're at it, could you document transform_divisions as well?
dask/dataframe/core.py
Outdated
| Whether or not to enforce the structure of the metadata at runtime. | ||
| This will rename and reorder columns for each partition, | ||
| and will raise an error if this doesn't work or types don't match. | ||
| enforce_alignment : bool |
nit: I'd slightly prefer something like align_dataframes or align_inputs instead, since it's less "enforcement" in the way that enforce_metadata is runtime enforcement. Plus, that matches with align_arrays in Array.map_blocks.
yeah that makes total sense as a rename. I was never very happy with this name.
|
@gjoseph92 is it enough for you to have this switch or are you hoping to change the default behavior? |
|
I think we should have this switch either way. It's also handy when you have unknown divisions, but happen to know when it's okay to skip alignment anyway. I still think broadcasting single-partition DataFrames would be more intuitive, but surely that will break other things, so with the docs this is maybe sufficient for now. |
|
ok great! I'll fix this up and get it in. |
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
…into enforce-alignment
enforce_alignment to map_partitions to allow passing a dataframe as an arg directly
align_dataframes to map_partitions to allow passing a dataframe as an arg directly
black dask / flake8 dask
When dataframes are passed to map_partitions, by default they are aligned according to the partitions of the first dataframe. This can be surprising when the user intended to broadcast them instead. This PR adds align_dataframes as a map_partitions kwarg (just like align_arrays in map_blocks) so you can work around this when necessary (without going the hacky route of using kwargs when you intend to broadcast).
Not sure if this is a good idea or not since map_partitions accepts arbitrary user kwargs.