Test failures with explicit comms #549

@jakirkham

While working on PR #546, we noticed some test failures that started cropping up recently. The exact cause is unclear, but they may be related to PR dask/distributed#4531. Details from the log are copied below.

Seeing this in the log:

10:58:40   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 628, in recv
10:58:40     ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
10:58:40 ucp.exceptions.UCXMsgTruncated: <[Recv #112] ep: 0x7f0e25cde0d8, tag: 0xfa85496d273cdec2, nbytes: 260, type: <class 'numpy.ndarray'>>: length mismatch: 16 (got) != 260 (expected)

Also seeing this:

11:05:40   File "/var/lib/jenkins/workspace/rapidsai/gpuci/dask-cuda/prb/dask-cuda-gpu-test/CUDA/10.1/GPU_LABEL/gpu-t4||gpu/OS/ubuntu16.04/PYTHON/3.7/dask_cuda/explicit_comms/dataframe/shuffle.py", line 196, in local_shuffle
11:05:40     out_parts[i] = None
11:05:40 TypeError: 'tuple' object does not support item assignment
11:05:40 FAILED
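
The second traceback suggests that `out_parts` was a tuple at the point where `local_shuffle` tried to clear an entry. Below is a minimal sketch of that failure mode and one possible workaround, assuming the container only needs to be mutable. The helper `release_parts` and the sample data are hypothetical and not taken from `shuffle.py`.

```python
def release_parts(out_parts):
    """Hypothetical helper: drop references to partitions once they are consumed."""
    # Assumption: copying into a list is acceptable here. A list supports
    # item assignment, whereas a tuple raises TypeError.
    out_parts = list(out_parts)
    for i in range(len(out_parts)):
        out_parts[i] = None  # clear the reference so the data can be freed
    return out_parts


# Stand-in data; the real partitions would be dataframes.
parts = ("part-0", "part-1")

# Reproduce the error class seen in the log.
try:
    parts[0] = None
except TypeError as exc:
    error_message = str(exc)

# The same input works once it has been converted to a list.
released = release_parts(parts)
```

Whether converting to a list is the right fix depends on where the tuple comes from, which the log alone does not show.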

cc @rjzamora @madsbk
