Skip to content

iOS: fix termination crash in ProvisionalDispatcher (again)#2276

Merged
jpsim merged 8 commits intomainfrom
ios-fix-termination-crash-in-provisionaldispatcher-again
May 17, 2022
Merged

iOS: fix termination crash in ProvisionalDispatcher (again)#2276
jpsim merged 8 commits intomainfrom
ios-fix-termination-crash-in-provisionaldispatcher-again

Conversation

@jpsim
Copy link
Copy Markdown
Contributor

@jpsim jpsim commented May 13, 2022

I believe #2059 helped reduce these occurrences, but we're still seeing some termination crashes where a different thread is waiting on std::thread::join().

So moving this to before the main_thread_.join() may help.

Risk Level: Moderate, there's a possibility this will introduce a data race, but even if that's the case, it'll be when the engine is being terminated.
Release Notes: Added

Co-authored-by: Mike Schore mike.schore@gmail.com
Signed-off-by: JP Simard jp@jpsim.com

jpsim added 3 commits May 13, 2022 16:17
I believe #2059 helped
reduce these occurrences, but we're still seeing some termination
crashes where a different thread is waiting on `std::thread::join()`.

So moving this to before the `main_thread_.join()` may help.

Signed-off-by: JP Simard <jp@jpsim.com>
Signed-off-by: JP Simard <jp@jpsim.com>
Signed-off-by: JP Simard <jp@jpsim.com>
@jpsim jpsim marked this pull request as ready for review May 13, 2022 20:48
@jpsim jpsim requested review from goaway and snowp May 13, 2022 20:48
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What data race are you concerned about? Seems like this is all done under a lock

I wonder if it would be helpful to set up a integration test which keeps issuing calls against the dispatcher while shutting down, then using bazels "run this test 1000 times" feature to see if we're still hitting crashes

Comment on lines 171 to +172
event_dispatcher_->exit();
dispatcher_->terminate();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the provisional dispatcher valid after event_dispatcher_->exit() is called? Should the order here be flipped?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd actually suggest moving the exit() call inside of ProvisionalDispatcher::terminate() to make this less delicate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reverted it to what it was before. Moving the exit into terminate() caused the java tests to crash:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000014156e5d9, pid=28881, tid=29959
#
# JRE version: OpenJDK Runtime Environment Zulu11.50+19-CA (11.0.12+7) (build 11.0.12+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM Zulu11.50+19-CA (11.0.12+7-LTS, mixed mode, tiered, compressed oops, g1 gc, bsd-amd64)
# Problematic frame:
# C  [libndk_envoy_jni.jnilib+0x1fee5d9]  Envoy::Event::ProvisionalDispatcher::terminate()+0x29

@goaway
Copy link
Copy Markdown
Contributor

goaway commented May 16, 2022

What data race are you concerned about? Seems like this is all done under a lock

I wonder if it would be helpful to set up a integration test which keeps issuing calls against the dispatcher while shutting down, then using bazels "run this test 1000 times" feature to see if we're still hitting crashes

If you look at the logic in envoy_status_t Engine::terminate(), the lock is released (necessarily) before attempting to join the main thread. This sequence of events in this PR is the one I alluded to earlier that I thought would be more robust.

Thanks @jpsim!

Copy link
Copy Markdown
Contributor

@goaway goaway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See minor suggestion above.

jpsim added 3 commits May 17, 2022 08:16
…atcher-again

* main:
  envoy: update to d88f31b (#2279)
  api: disallow setting 'host' header directly (#2275)
  android: add support for registering a platform KV store (#2134)
  Bump Lyft Support Rotation (#2278)
  tools: Enable the VSCode completion db to use bazelisk if available (#2277)

Signed-off-by: JP Simard <jp@jpsim.com>
To the provisional dispatcher's terminate

Signed-off-by: JP Simard <jp@jpsim.com>
This reverts commit e9d1254.

Signed-off-by: JP Simard <jp@jpsim.com>
@jpsim jpsim requested review from goaway and snowp May 17, 2022 12:51
jpsim added 2 commits May 17, 2022 17:18
…rash-in-provisionaldispatcher-again

* origin/main:
  tools: use `bazelw` bazel wrapper in `vscode_compdb.sh` (#2284)
  ci: add scheduled job to update Envoy weekly (#2282)
  docs: update python packages to work with Python 3.10 (#2286)
  test: adding a cancel test, cleaning up copy-paste code (#2283)

Signed-off-by: JP Simard <jp@jpsim.com>
Copy link
Copy Markdown
Contributor

@goaway goaway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jpsim!

@jpsim jpsim merged commit 1a96dac into main May 17, 2022
@jpsim jpsim deleted the ios-fix-termination-crash-in-provisionaldispatcher-again branch May 17, 2022 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants