I don't believe this can be easily triggered by an end user, but if `meta` is improperly constructed internally (e.g. in dask-expr during optimization) and passed to the extension, this can cause severe data loss.
The relevant section is this (`distributed/shuffle/_shuffle.py`, lines 521 to 531 at a5a6e99):

```python
def _get_output_partition(
    self,
    partition_id: int,
    key: Key,
    **kwargs: Any,
) -> pd.DataFrame:
    try:
        data = self._read_from_disk((partition_id,))
        return convert_shards(data, self.meta)
    except KeyError:
        return self.meta.copy()
```
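To make the failure mode concrete, here is a minimal, hypothetical sketch (stand-in names, not distributed's actual implementation) of how the broad `except KeyError` can turn a `meta`/payload column mismatch into a silently empty partition:

```python
import pandas as pd

def convert_shards_sketch(df: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    # Simplified stand-in for convert_shards: touch each column that
    # meta declares; a column missing from the payload raises KeyError.
    for column in meta.columns:
        actual = df[column].dtype  # KeyError if meta has extra columns
    return df

def get_output_partition_sketch(data: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    try:
        return convert_shards_sketch(data, meta)
    except KeyError:
        # Intended to mean "no shards for this partition", but it also
        # swallows the column-mismatch KeyError raised above.
        return meta.copy()

payload = pd.DataFrame({"a": [1, 2, 3]})                 # 3 rows of real data
meta = pd.DataFrame({"a": pd.Series(dtype="int64"),
                     "b": pd.Series(dtype="float64")})   # extra column "b"

out = get_output_partition_sketch(payload, meta)
# out is meta.copy(): an empty frame, so the payload rows are silently lost
```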
In the example I am currently debugging, the `data` here looks like

while `meta` includes additional columns

`convert_shards` then raises a `KeyError` here (`distributed/shuffle/_arrow.py`, line 78 at b03efee):

```python
actual = df[column].dtype
```
which causes `_get_output_partition` to return an empty dataframe. I haven't spent time figuring out why this exception is handled the way it is, but if this logic is required, the bare minimum we have to do is verify the columns of `meta` and the payload data before transmission.
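An eager check would surface the mismatch at shuffle time instead of as an empty output partition. A minimal sketch of such a safeguard (`verify_shard_columns` is a hypothetical name, not an existing distributed function):

```python
import pandas as pd

def verify_shard_columns(df: pd.DataFrame, meta: pd.DataFrame) -> None:
    # Fail loudly if a shard's columns differ from meta, rather than
    # letting the mismatch be swallowed later by `except KeyError`.
    if list(df.columns) != list(meta.columns):
        missing = sorted(set(meta.columns) - set(df.columns))
        extra = sorted(set(df.columns) - set(meta.columns))
        raise ValueError(
            f"Shard columns do not match meta: missing={missing}, extra={extra}"
        )

meta = pd.DataFrame({"a": pd.Series(dtype="int64"),
                     "b": pd.Series(dtype="float64")})
ok = pd.DataFrame({"a": [1], "b": [2.0]})
bad = pd.DataFrame({"a": [1]})  # the mismatch from this report: "b" is missing

verify_shard_columns(ok, meta)  # passes silently
try:
    verify_shard_columns(bad, meta)
except ValueError as e:
    err = str(e)
```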
cc @hendrikmakait