Skip to content

Unhandled exception handler based on local ref counting#14049

Merged
ericl merged 19 commits intoray-project:masterfrom
ericl:error-handling
Feb 13, 2021
Merged

Unhandled exception handler based on local ref counting#14049
ericl merged 19 commits intoray-project:masterfrom
ericl:error-handling

Conversation

@ericl
Copy link
Copy Markdown
Contributor

@ericl ericl commented Feb 10, 2021

Why are these changes needed?

Instead of suppressing remote task errors based on a leaky heuristic, use ref counting to report only the errors that are truly unhandled.

This also adds an env var RAY_IGNORE_UNHANDLED_ERRORS that can be used to suppress the messages.

Related issue number

Closes #12018

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

metadata = Buffer.make(error.GetMetadata()).to_pybytes()
results = worker.deserialize_objects(
[(data, metadata)], [ray.ObjectRef.nil()])
print("DESERIALIZATION DONE")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edoakes any idea why deserialization would succeed (deserialize_objects prints "DESER RESULT..." ), but then we never get to this second print?

The output I get is

DESERIALIZATION RESULT [RayTaskError('__main__.f', 'Traceback (most recent call last):\n  File "python/ray/_raylet.pyx", line 487, in ray._raylet.execute_task\n  File "test.py", line 7, in f\n    raise ValueError("hi")\nValueError: hi\n', <class 'ValueError'>)]
terminate called without an active exception
zsh: abort      python test.py

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea why you're not hitting the 2nd print, but the only time I've seen this error message is when a c++ thread goes out of scope without joining it or detaching it. If this is being called on a callback from the core worker there might be something weird happening with the lifetime management in c++

@ericl ericl changed the title [WIP] Unhandled exception handler based on local ref counting Unhandled exception handler based on local ref counting Feb 11, 2021
@ericl ericl assigned rkooo567 and edoakes and unassigned rkooo567 Feb 11, 2021
auto iter = objects_.find(object_id);
if (iter != objects_.end()) {
auto obj = iter->second;
obj->SetAccessed();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the accessed flag? I thought it would be enough to just throw the error below, if we Put the value but there's no ref to the object.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, for cases where the ref is still in scope but never used?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand your comment, this is just recording the access.

Copy link
Copy Markdown
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a unit test for the memory store? This is long overdue but for now we could just test that the callback gets triggered correctly.

check_signals_(check_signals),
unhandled_exception_handler_(unhandled_exception_handler) {}

void CoreWorkerMemoryStore::GetAsync(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also SetAccessed here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Copy Markdown
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, really good change both from UX and design standpoint

Comment on lines +1463 to +1464
# We need to release the gil since object destruction may call the
# unhandled exception handler.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the only time that the unhandled exception handler is called, should we just hold the gil through the whole call to RemoveLocalReference and not acquire it in the handler?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly we need to hold it during the call to .native().

num_exceptions = 0

def interceptor(e):
nonlocal num_exceptions
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've legitimately never seen nonlocal before.... interesting

Comment on lines +73 to +74
logger.error("Unhandled error (suppress with "
"RAY_IGNORE_UNHANDLED_ERRORS=1): {}".format(e))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of unhandled error it might be clearer if we explicitly say something like:
Exception from ObjectRef that was never retrieved with ray.get()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's kind of long though... I don't suppose "ungetted error" is a real word?

Comment on lines +1075 to +1076
# TODO(ekl) remove task push errors entirely now that we have
# the separate unhandled exception handler.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice :)

@ericl
Copy link
Copy Markdown
Contributor Author

ericl commented Feb 13, 2021

Comments addressed.

@ericl ericl merged commit 9dc671a into ray-project:master Feb 13, 2021
rkooo567 added a commit to rkooo567/ray that referenced this pull request Feb 15, 2021
ericl pushed a commit that referenced this pull request Feb 15, 2021
ericl added a commit to ericl/ray that referenced this pull request Feb 15, 2021
ericl added a commit that referenced this pull request Feb 15, 2021
#14113)

* Revert "Revert "Unhandled exception handler based on local ref counting (#14049)" (#14099)"

This reverts commit b45ae76.

* reomve test

* fix

* fix
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
ray-project#14113)

* Revert "Revert "Unhandled exception handler based on local ref counting (ray-project#14049)" (ray-project#14099)"

This reverts commit b45ae76.

* reomve test

* fix

* fix
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
@architkulkarni architkulkarni mentioned this pull request Feb 16, 2021
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] [Error Message] Possible unhandled error from worker

4 participants