[data] New executor backend [2/n] --- Add allocation tracing util #31283

Merged: ericl merged 3 commits into ray-project:master from ericl:mem-trace on Dec 23, 2022

Conversation

@ericl ericl commented Dec 22, 2022

Why are these changes needed?

Add a utility class for tracing object allocation / freeing. This makes it a lot easier to debug memory allocation / freeing issues.

This is split out from #30903

Sample output:

```text
[mem_tracing] ===== Leaked objects =====
[mem_tracing] Leaked object, created at test1, size 8388742, skipped dealloc at test3: ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000004000000)
[mem_tracing] Leaked object, created at test5, size 8388742: ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000006000000)
[mem_tracing] ===== End leaked objects =====
[mem_tracing] ===== Freed objects =====
[mem_tracing] Freed object from test2 at test4, size 8388742: ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000005000000)
[mem_tracing] ===== End freed objects =====
[mem_tracing] Peak size bytes 25166226
[mem_tracing] Current size bytes 16777484
```
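The bookkeeping behind this report can be sketched without Ray. The sketch below is illustrative only (the class and method names are mine, not the PR's API, and refs are plain strings rather than `ray.ObjectRef`): it keys allocations by ref, moves them to a freed map on deallocation, and tracks current/peak totals in the same format as the output above.

```python
# Illustrative sketch of an allocation tracer (NOT the PR's actual API).
# Refs are plain strings here; the real utility keys on ray.ObjectRef.


class MemTracer:
    def __init__(self):
        self.allocated = {}  # ref -> (alloc_loc, size_bytes)
        self.freed = {}      # ref -> (alloc_loc, free_loc, size_bytes)
        self.cur_bytes = 0
        self.peak_bytes = 0

    def trace_alloc(self, ref, loc, size_bytes):
        # Only the first allocation of a ref is recorded.
        if ref not in self.allocated:
            self.allocated[ref] = (loc, size_bytes)
            self.cur_bytes += size_bytes
            self.peak_bytes = max(self.peak_bytes, self.cur_bytes)

    def trace_dealloc(self, ref, loc):
        # Move the ref from the allocated map to the freed map.
        if ref in self.allocated:
            alloc_loc, size = self.allocated.pop(ref)
            self.freed[ref] = (alloc_loc, loc, size)
            self.cur_bytes -= size

    def report(self):
        lines = ["===== Leaked objects ====="]
        for ref, (loc, size) in self.allocated.items():
            lines.append(f"Leaked object, created at {loc}, size {size}: {ref}")
        lines.append("===== Freed objects =====")
        for ref, (src, dst, size) in self.freed.items():
            lines.append(f"Freed object from {src} at {dst}, size {size}: {ref}")
        lines.append(f"Peak size bytes {self.peak_bytes}")
        lines.append(f"Current size bytes {self.cur_bytes}")
        return "\n".join(lines)


tracer = MemTracer()
tracer.trace_alloc("ref1", "test1", 8388742)
tracer.trace_alloc("ref2", "test2", 8388742)
tracer.trace_dealloc("ref2", "test4")
report = tracer.report()
```

Anything still in the allocated map at report time is flagged as a leak, which is exactly what makes skipped deallocations easy to spot in tests.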

Signed-off-by: Eric Liang <ekhliang@gmail.com>
```python
if ref not in self.allocated:
    meta = ray.experimental.get_object_locations([ref])
    size_bytes = meta.get("object_size", 0)
    if not size_bytes:
```
Contributor

When will this happen, i.e., when is object_size not available?

It seems concerning that we have to fetch objects locally here, which could cause potential memory issues.

Contributor Author

It's not a memory issue since you'd only enable this for debugging unit tests, etc. Performance doesn't matter in this case.

```python
from ray import cloudpickle as pickle

try:
    obj = ray.get(ref, timeout=5.0)
```
Contributor

Shall we add a warning log like:

> The object_size metadata is not available for {ref}. Fetching the object reference locally to get the size of the object.

Contributor Author

This is actually the common case... the warning is too verbose.



```python
@ray.remote
class _MemActor:
```
Contributor

Is this memory actor something we plan to enable by default in the long term, or will it be disabled by default and enabled only when debugging is needed?

Wondering if the logic can be done in the Dataset driver's DatasetStats itself, or whether we really need another actor for it. All the logic looks pretty lightweight except fetching the object when the size is missing.

As we are preparing to kill the stats actor, I want to make sure we think twice before introducing another actor for Data.

Contributor Author

I don't think we'd ever enable this by default, since the performance overhead is fairly substantial. Or if we did, it would require more optimizations.

If you take a look at the full PR, I have added lighter weight memory stats per-operator to the graph, which can also be used for debugging. Though, it is not as comprehensive as this one, which covers both the new code and the legacy code.

Contributor

SG. Shall we add a comment in this file to capture the caveat:

> NOTE: the performance overhead of tracing object allocation is fairly substantial. This is meant to be used in unit tests for debugging. Please do not enable it in production without performance optimization.

Contributor Author

Added to main PR.


```python
Args:
    ref: The object created.
    loc: A human-readable string identifying the call site.
```
Contributor

In reality, how do we want to pass the call site? Is it `<class-name>:<method-name>:<variable-name>`?

Contributor Author

I've just passed stuff like "output iterator" (see the other PR).

Contributor

c21 commented Dec 22, 2022

btw this looks pretty useful to be included in DatasetStats. It would be great if we could report this object memory usage information in `ds.stats()`.


```python
try:
    obj = ray.get(ref, timeout=5.0)
    size_bytes = len(pickle.dumps(obj))
```
Contributor

This won't include the MessagePack wrapper that exists when the object's data goes through Ray's full serialization machinery, so this allocation size and derived stats (e.g. peak memory) can end up being a slight underestimate.

Contributor Author

Any size is fine, it's just to distinguish between trivial objects and those that are significant in size.
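To illustrate why an approximate size is enough here, a minimal Ray-free sketch of the serialize-and-measure fallback (the helper name `approx_size_bytes` is mine, and plain `pickle` stands in for `ray.cloudpickle`):

```python
import pickle


def approx_size_bytes(obj) -> int:
    # Serialize and measure, mirroring the fallback discussed above.
    # Ray's full serialization adds its own wrapper, so this slightly
    # underestimates what the object store holds, but it is enough to
    # tell trivial objects apart from significant ones.
    return len(pickle.dumps(obj))


small = approx_size_bytes({"a": 1})
large = approx_size_bytes(bytes(1_000_000))
```

The absolute numbers differ from the object store's accounting, but the orders of magnitude, which is all the tracer needs, are preserved.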

```python
from ray.data.context import DatasetContext


def trace_allocation(ref: ray.ObjectRef, loc: str) -> None:
```
Contributor

@clarkzinzow clarkzinzow Dec 22, 2022

Could we create `loc` automatically for the caller with something akin to `inspect.getframeinfo(inspect.stack()[1])` or `inspect.getouterframes(inspect.currentframe())[2]`? Or we could reuse something like our `get_py_stack` utility:

ray/python/ray/_raylet.pyx

Lines 1375 to 1415 in 2bedf24

```cython
# This function introduces ~2-7us of overhead per call (i.e., it can be called
# up to hundreds of thousands of times per second).
cdef void get_py_stack(c_string* stack_out) nogil:
    """Get the Python call site.

    This can be called from within C++ code to retrieve the file name and line
    number of the Python code that is calling into the core worker.
    """
    with gil:
        try:
            frame = inspect.currentframe()
        except ValueError:  # overhead of exception handling is about 20us
            stack_out[0] = "".encode("ascii")
            return
        msg_frames = []
        while frame and len(msg_frames) < 4:
            filename = frame.f_code.co_filename
            # Decode Ray internal frames to add annotations.
            if filename.endswith("_private/worker.py"):
                if frame.f_code.co_name == "put":
                    msg_frames = ["(put object) "]
            elif filename.endswith("_private/workers/default_worker.py"):
                pass
            elif filename.endswith("ray/remote_function.py"):
                # TODO(ekl) distinguish between task return objects and
                # arguments. This can only be done in the core worker.
                msg_frames = ["(task call) "]
            elif filename.endswith("ray/actor.py"):
                # TODO(ekl) distinguish between actor return objects and
                # arguments. This can only be done in the core worker.
                msg_frames = ["(actor call) "]
            elif filename.endswith("_private/serialization.py"):
                if frame.f_code.co_name == "id_deserializer":
                    msg_frames = ["(deserialize task arg) "]
            else:
                msg_frames.append("{}:{}:{}".format(
                    frame.f_code.co_filename, frame.f_code.co_name,
                    frame.f_lineno))
            frame = frame.f_back
        stack_out[0] = (ray_constants.CALL_STACK_LINE_DELIMITER
                        .join(msg_frames).encode("ascii"))
```

Contributor Author

Probably. I'll add this as a TODO, as it isn't really a pain point.
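For reference, the `inspect`-based approach suggested above could look like the sketch below (the names `auto_loc` and `some_operator` are hypothetical, not from the PR):

```python
import inspect


def auto_loc() -> str:
    # inspect.stack()[1] is the frame that called auto_loc(), so this
    # derives a "file:function:line" call-site string automatically.
    frame = inspect.stack()[1]
    return f"{frame.filename}:{frame.function}:{frame.lineno}"


def some_operator() -> str:
    # In a tracer, this string could be passed as the `loc` argument
    # instead of a hand-written label like "output iterator".
    return auto_loc()


loc = some_operator()
```

The trade-off, as the `get_py_stack` comment notes, is per-call overhead, which matters less for a debugging-only tracer.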

Signed-off-by: Eric Liang <ekhliang@gmail.com>
Contributor Author

@ericl ericl left a comment

Updated.

Contributor

@c21 c21 left a comment
LGTM

@ericl ericl merged commit 90b9d44 into ray-project:master Dec 23, 2022
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023