[Datasets] [Bug] workers and actors leaking in trivial dataset use #21999

@disruptek

Description

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

I have a stock cluster and I'm testing datasets. After running the code below, an actor named ray::_DesignatedBlockOwner persists, along with three ray::IDLE workers. Over repeated invocations, these leaked processes will cumulatively consume all of the memory in the cluster.

I will also eventually receive warnings such as these:

The actor '_DesignatedBlockOwner' has been exported 100 times. It's possible that this warning is accidental, but this may indicate that the same remote function is being defined repeatedly from within many tasks and exported to all of the workers. This can be a performance issue and can be resolved by defining the remote function on the driver instead. See https://github.com/ray-project/ray/issues/6240 for more discussion.
The remote function 'ray.data.read_api.remote_read' has been exported 100 times. It's possible that this warning is accidental, but this may indicate that the same remote function is being defined repeatedly from within many tasks and exported to all of the workers. This can be a performance issue and can be resolved by defining the remote function on the driver instead. See https://github.com/ray-project/ray/issues/6240 for more discussion.
The remote function '__main__.foo' has been exported 100 times. It's possible that this warning is accidental, but this may indicate that the same remote function is being defined repeatedly from within many tasks and exported to all of the workers. This can be a performance issue and can be resolved by defining the remote function on the driver instead. See https://github.com/ray-project/ray/issues/6240 for more discussion.

Versions / Dependencies

ray-1.9.2
python-3.7.7 (and 3.9.5)
pyarrow-6.0.1
ubuntu-21.04 (default ray containers)
AMI ami-029536273cb04d4d9 (Deep Learning AMI (Ubuntu) Version 55)

Reproduction script

import ray

# Creating a Dataset from inside a remote task triggers the leak.
@ray.remote
def foo():
    return ray.data.range_arrow(10000, parallelism=1)

ray.init(address="auto")
ref = foo.remote()
print(ray.get(ref))
ray.shutdown()

Anything else

Each invocation of the script against the cluster leaks three workers and one actor.
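To quantify the leak across runs, one way is to count cluster processes whose command line matches the leaked names. The helper below is a hypothetical sketch (the function name `count_leaked` and the pattern list are my own, not part of Ray); the process names would come from something like `ps -eo comm` on a cluster node.

```python
def count_leaked(proc_names,
                 patterns=("ray::IDLE", "ray::_DesignatedBlockOwner")):
    """Count process names matching each leak pattern.

    proc_names: iterable of process command names, e.g. parsed from
    `ps -eo comm` output on a worker node (assumption: leaked workers
    and actors show up under these ray:: names, as observed above).
    """
    counts = {p: 0 for p in patterns}
    for name in proc_names:
        for p in patterns:
            if p in name:
                counts[p] += 1
    return counts


# Example with the process mix observed after one invocation:
sample = ["ray::IDLE", "ray::IDLE", "ray::IDLE",
          "ray::_DesignatedBlockOwner", "python"]
print(count_leaked(sample))
# → {'ray::IDLE': 3, 'ray::_DesignatedBlockOwner': 1}
```

Run after each invocation of the repro script, the counts should grow by three idle workers and one actor if the leak is present.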

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Labels

  • P1: Issue that should be fixed within a few weeks
  • bug: Something that is supposed to be working, but isn't
