[data] New executor backend [2/n] --- Add allocation tracing util #31283

Merged: ericl merged 3 commits into ray-project:master from ericl:mem-trace on Dec 23, 2022

Conversation

@ericl ericl commented Dec 22, 2022

Why are these changes needed?

Add a utility class for tracing object allocation / freeing. This makes it a lot easier to debug memory allocation / freeing issues.

This is split out from #30903

Sample output:

```text
[mem_tracing] ===== Leaked objects =====
[mem_tracing] Leaked object, created at test1, size 8388742, skipped dealloc at test3: ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000004000000)
[mem_tracing] Leaked object, created at test5, size 8388742: ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000006000000)
[mem_tracing] ===== End leaked objects =====
[mem_tracing] ===== Freed objects =====
[mem_tracing] Freed object from test2 at test4, size 8388742: ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000005000000)
[mem_tracing] ===== End freed objects =====
[mem_tracing] Peak size bytes 25166226
[mem_tracing] Current size bytes 16777484
```
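The bookkeeping behind this report can be sketched without Ray. The sketch below is illustrative only (the class and method names are mine, not the PR's API, and refs are plain strings rather than `ray.ObjectRef`): it keys allocations by ref, moves them to a freed map on deallocation, and tracks current/peak totals in the same format as the output above.

```python
# Illustrative sketch of an allocation tracer (NOT the PR's actual API).
# Refs are plain strings here; the real utility keys on ray.ObjectRef.


class MemTracer:
    def __init__(self):
        self.allocated = {}  # ref -> (alloc_loc, size_bytes)
        self.freed = {}      # ref -> (alloc_loc, free_loc, size_bytes)
        self.cur_bytes = 0
        self.peak_bytes = 0

    def trace_alloc(self, ref, loc, size_bytes):
        # Only the first allocation of a ref is recorded.
        if ref not in self.allocated:
            self.allocated[ref] = (loc, size_bytes)
            self.cur_bytes += size_bytes
            self.peak_bytes = max(self.peak_bytes, self.cur_bytes)

    def trace_dealloc(self, ref, loc):
        # Move the ref from the allocated map to the freed map.
        if ref in self.allocated:
            alloc_loc, size = self.allocated.pop(ref)
            self.freed[ref] = (alloc_loc, loc, size)
            self.cur_bytes -= size

    def report(self):
        lines = ["===== Leaked objects ====="]
        for ref, (loc, size) in self.allocated.items():
            lines.append(f"Leaked object, created at {loc}, size {size}: {ref}")
        lines.append("===== Freed objects =====")
        for ref, (src, dst, size) in self.freed.items():
            lines.append(f"Freed object from {src} at {dst}, size {size}: {ref}")
        lines.append(f"Peak size bytes {self.peak_bytes}")
        lines.append(f"Current size bytes {self.cur_bytes}")
        return "\n".join(lines)


tracer = MemTracer()
tracer.trace_alloc("ref1", "test1", 8388742)
tracer.trace_alloc("ref2", "test2", 8388742)
tracer.trace_dealloc("ref2", "test4")
report = tracer.report()
```

Anything still in the allocated map at report time is flagged as a leak, which is exactly what makes skipped deallocations easy to spot in tests.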

Signed-off-by: Eric Liang <ekhliang@gmail.com>
```python
if ref not in self.allocated:
    meta = ray.experimental.get_object_locations([ref])
    size_bytes = meta.get("object_size", 0)
    if not size_bytes:
```
Contributor

When will this happen, i.e., when is object_size not available?

It seems concerning that we have to fetch objects locally here, which could cause potential memory issues.

Contributor Author

It's not a memory issue since you'd only enable this for debugging unit tests, etc. Performance doesn't matter in this case.

```python
from ray import cloudpickle as pickle

try:
    obj = ray.get(ref, timeout=5.0)
```
Contributor

Shall we add a warning log like:

> The object_size metadata is not available for {ref}. Fetching the object reference locally to get the size of the object.

Contributor Author

This is actually the common case... the warning is too verbose.



```python
@ray.remote
class _MemActor:
```
Contributor

Is this memory actor something we plan to enable by default in the long term, or will it be disabled by default and enabled only when debugging is needed?

Wondering if the logic can be done in the Dataset driver's DatasetStats itself, or whether we really need another actor for it. All the logic looks pretty lightweight except fetching the object when the size is missing.

As we are preparing to kill the stats actor, I want to make sure we think twice before introducing another actor for Data.

Contributor Author

I don't think we'd ever enable this by default, since the performance overhead is fairly substantial. Or if we did, it would require more optimizations.

If you take a look at the full PR, I have added lighter weight memory stats per-operator to the graph, which can also be used for debugging. Though, it is not as comprehensive as this one, which covers both the new code and the legacy code.

Contributor

SG. Shall we add a comment in this file to capture the caveat:

> NOTE: the performance overhead of tracing object allocation is fairly substantial. This is meant to be used in unit tests for debugging. Please do not enable it in production without performance optimization.

Contributor Author

Added to main PR.


```python
Args:
    ref: The object created.
    loc: A human-readable string identifying the call site.
```
Contributor

In reality, how do we want to pass the call site? Is it `<class-name>:<method-name>:<variable-name>`?

Contributor Author

I've just passed stuff like "output iterator" (see the other PR).

Contributor

c21 commented Dec 22, 2022

btw this looks pretty useful to be included in DatasetStats. It would be great if we could report this object memory usage information in `ds.stats()`.


```python
try:
    obj = ray.get(ref, timeout=5.0)
    size_bytes = len(pickle.dumps(obj))
```
Contributor

This won't include the MessagePack wrapper that exists when the object's data goes through Ray's full serialization machinery, so this allocation size and derived stats (e.g. peak memory) can end up being a slight underestimate.

Contributor Author

Any size is fine, it's just to distinguish between trivial objects and those that are significant in size.
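To illustrate why an approximate size is enough here, a minimal Ray-free sketch of the serialize-and-measure fallback (the helper name `approx_size_bytes` is mine, and plain `pickle` stands in for `ray.cloudpickle`):

```python
import pickle


def approx_size_bytes(obj) -> int:
    # Serialize and measure, mirroring the fallback discussed above.
    # Ray's full serialization adds its own wrapper, so this slightly
    # underestimates what the object store holds, but it is enough to
    # tell trivial objects apart from significant ones.
    return len(pickle.dumps(obj))


small = approx_size_bytes({"a": 1})
large = approx_size_bytes(bytes(1_000_000))
```

The absolute numbers differ from the object store's accounting, but the orders of magnitude, which is all the tracer needs, are preserved.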

```python
from ray.data.context import DatasetContext


def trace_allocation(ref: ray.ObjectRef, loc: str) -> None:
```
Contributor

@clarkzinzow clarkzinzow Dec 22, 2022

Could we create `loc` automatically for the caller with something akin to `inspect.getframeinfo(inspect.stack()[1])` or `inspect.getouterframes(inspect.currentframe())[2]`? Or we could reuse something like our `get_py_stack` utility:

ray/python/ray/_raylet.pyx

Lines 1375 to 1415 in 2bedf24

```cython
# This function introduces ~2-7us of overhead per call (i.e., it can be called
# up to hundreds of thousands of times per second).
cdef void get_py_stack(c_string* stack_out) nogil:
    """Get the Python call site.

    This can be called from within C++ code to retrieve the file name and line
    number of the Python code that is calling into the core worker.
    """
    with gil:
        try:
            frame = inspect.currentframe()
        except ValueError:  # overhead of exception handling is about 20us
            stack_out[0] = "".encode("ascii")
            return
        msg_frames = []
        while frame and len(msg_frames) < 4:
            filename = frame.f_code.co_filename
            # Decode Ray internal frames to add annotations.
            if filename.endswith("_private/worker.py"):
                if frame.f_code.co_name == "put":
                    msg_frames = ["(put object) "]
            elif filename.endswith("_private/workers/default_worker.py"):
                pass
            elif filename.endswith("ray/remote_function.py"):
                # TODO(ekl) distinguish between task return objects and
                # arguments. This can only be done in the core worker.
                msg_frames = ["(task call) "]
            elif filename.endswith("ray/actor.py"):
                # TODO(ekl) distinguish between actor return objects and
                # arguments. This can only be done in the core worker.
                msg_frames = ["(actor call) "]
            elif filename.endswith("_private/serialization.py"):
                if frame.f_code.co_name == "id_deserializer":
                    msg_frames = ["(deserialize task arg) "]
            else:
                msg_frames.append("{}:{}:{}".format(
                    frame.f_code.co_filename, frame.f_code.co_name,
                    frame.f_lineno))
            frame = frame.f_back
        stack_out[0] = (ray_constants.CALL_STACK_LINE_DELIMITER
                        .join(msg_frames).encode("ascii"))
```

Contributor Author

Probably. I'll add this as a TODO, as it isn't really a pain point.
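For reference, the `inspect`-based approach suggested above could look like the sketch below (the names `auto_loc` and `some_operator` are hypothetical, not from the PR):

```python
import inspect


def auto_loc() -> str:
    # inspect.stack()[1] is the frame that called auto_loc(), so this
    # derives a "file:function:line" call-site string automatically.
    frame = inspect.stack()[1]
    return f"{frame.filename}:{frame.function}:{frame.lineno}"


def some_operator() -> str:
    # In a tracer, this string could be passed as the `loc` argument
    # instead of a hand-written label like "output iterator".
    return auto_loc()


loc = some_operator()
```

The trade-off, as the `get_py_stack` comment notes, is per-call overhead, which matters less for a debugging-only tracer.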

Signed-off-by: Eric Liang <ekhliang@gmail.com>
Contributor Author

@ericl ericl left a comment

Updated.

Contributor

@c21 c21 left a comment
LGTM

@ericl ericl merged commit 90b9d44 into ray-project:master Dec 23, 2022
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023