Skip to content

common: introduce new "safe" init manager#6296

Merged
htuch merged 9 commits intoenvoyproxy:masterfrom
mergeconflict:safe_init
Mar 22, 2019
Merged

common: introduce new "safe" init manager#6296
htuch merged 9 commits intoenvoyproxy:masterfrom
mergeconflict:safe_init

Conversation

@mergeconflict
Copy link
Copy Markdown

Description: Introduce a new "safe" init manager, to replace the existing one that's prone to use-after-free issues (see e.g. #6116). Users of the existing init manager will be upgraded one-by-one in subsequent PRs if this design is approved. See also previous false starts in PRs #6136 and #6245.
Risk Level: Low, no existing users of the existing init manager are changed in this PR.
Testing: New unit tests added.
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Dan Rosen mergeconflict@google.com

Signed-off-by: Dan Rosen <mergeconflict@google.com>
@mergeconflict
Copy link
Copy Markdown
Author

/review @htuch @mattklein123 @lizan

@mattklein123
Copy link
Copy Markdown
Member

@mergeconflict my initial reaction is this is a lot of code churn to fix a relatively small issue. Before we start the review process, can you describe why it would not be better to just fix the other code like we already discussed? (Cancel callbacks when init manager is destroyed.) Thank you.

@mergeconflict
Copy link
Copy Markdown
Author

@mergeconflict my initial reaction is this is a lot of code churn to fix a relatively small issue. Before we start the review process, can you describe why it would not be better to just fix the other code like we already discussed? (Cancel callbacks when init manager is destroyed.) Thank you.

Yeah, I understand, it does feel like a lot.

The issue I found with the explicit cancelation approach is that we just end up with a use-after-free in the opposite direction, where the init manager attempts to cancel a target that had been destroyed before it. This was demonstrated in one or two unit and integration tests IIRC.

I still think the lightest-touch fix is the one with a shared_ptr<boolean> indicating whether the InitManager has been destroyed:
https://github.com/mergeconflict/envoy/blob/9b27afc25b363aac63d76622c6cce49378dbcabd/source/server/init_manager_impl.cc
It doesn't solve all possible UAFs, so @htuch encouraged me to push forward with this more comprehensive approach, but it does have the virtue that no other code has to change. I'll let you guys discuss; I'm happy to go either way.

@mattklein123
Copy link
Copy Markdown
Member

@mergeconflict if you discussed with @htuch and you both think the churn is worth it, I'm happy to review, I just wanted to make sure we were not doing change for the sake of change.

@htuch htuch self-assigned this Mar 15, 2019
@htuch
Copy link
Copy Markdown
Member

htuch commented Mar 15, 2019

@mattklein123 yep, I will review tomorrow. I realize this is a lot of churn, but there have been things that have been bugging me around InitManager for ages and then more recently:

  1. The naming of this class and methods, as well as continuation passing style, makes it really hard to page back in how it works every 6 months or so I return back to this code.
  2. We have had a number of memory safety and lifetime issues exposed by fuzzers and crash reports around the existing setup.
    So, I'm excited that @mergeconflict has put together a structural approach to solving both these problems.

@mattklein123
Copy link
Copy Markdown
Member

@htuch @mergeconflict sg will review also.

Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm think this is a really nice design, some low level nits and probably needs some non-Googler to take a pass (@lizan or @mattklein123).

/wait

Signed-off-by: Dan Rosen <mergeconflict@google.com>
Signed-off-by: Dan Rosen <mergeconflict@google.com>
@mergeconflict
Copy link
Copy Markdown
Author

@lizan @mattklein123 by the way, I guess I should say, I think the best place to start reviewing this might be test/safe_init/manager_impl_test.cc. I think the UnavailableTarget, UnavailableManager and UnavailableWatcher test cases show what I'm fundamentally trying to accomplish.

Also, let me know if it'd be helpful to chat on Slack or Zoom. And thanks again for your time reviewing and helping me with this!

@mattklein123 mattklein123 self-assigned this Mar 17, 2019
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing out a bunch of comments. Thanks this is very nice and well done. I think we should move forward with this, though I do want to make one general comment. That comment is that although this design will certainly fix the issue that you initially found, and is certainly much better from a use after free scenario, it may actually hide subtle initialization bugs that would have crashed previously but now might just not work correctly, and in some sense will be harder to debug than outright crashing. (I actually want to do a blog post on this exact topic: basically that sometimes increased memory safety can actually lead to different bugs that are harder to identify). Anyhow, that's really a non-actionable comment, but I thought I would throw it out there in case you can think of additional ways to try to prevent us getting into a situation in which "we don't crash but things juts don't work correctly." That's probably as many ASSERTS as possible and not being wishy-washy about how people should use the interface (I had one comment on that).

Thank you!

/wait

/**
* Signal to the init manager that this target has finished initializing. This should ideally
* only be called once, after `initialize` was called. Calling it before initialization begins
* or after it has already been called before will have no effect.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If callers aren't supposed to call this way, can we at least ASSERT that is the case?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should actually probably change the comment, since I think it'll be actually called multiple times on purpose in some places. For example, in https://github.com/envoyproxy/envoy/blob/master/source/common/router/rds_impl.cc#L136:

void RdsRouteConfigSubscription::runInitializeCallbackIfAny() {
  if (initialize_callback_) {
    initialize_callback_();
    initialize_callback_ = nullptr;
  }
}

That's called from a whole bunch of places in the file. The same pattern occurs in other implementations of Init::Target, so I figured I'd encapsulate it here.

@htuch
Copy link
Copy Markdown
Member

htuch commented Mar 18, 2019

@mattklein123 yeah, @mergeconflict and I discussed the hiding vs. memory safety aspect a bit in the design phase of this. We reached a conclusion that there are definitely some legitimate scenarios where ownership needs to be yanked by both parties (e.g. multiple listeners warming on the same route), even though in some cases this would potentially hide problematic ownership or lifetime organization.

OTOH, the status quo is that if you mess up lifetime management, you get subtle errors anyway, since a lot of the issues we discovered around InitManager only occurred as a result of flakes in TSAN or random production crash reports. So, ¯\_(ツ)_/¯

Dan Rosen added 2 commits March 18, 2019 10:11
Signed-off-by: Dan Rosen <mergeconflict@google.com>
Signed-off-by: Dan Rosen <mergeconflict@google.com>
@mergeconflict
Copy link
Copy Markdown
Author

I just realized I have a minor design issue in TargetImpl: it is currently designed to be used as a mix-in class, but it also inherits from Logger::Loggable. This creates a problem for any class that inherits from it and also inherits from Logger::Loggable. So I'll change TargetImpl to not be a mix-in, just a regular data member like WatcherImpl.

/wait

Signed-off-by: Dan Rosen <mergeconflict@google.com>
@htuch htuch removed the waiting label Mar 19, 2019
@mergeconflict
Copy link
Copy Markdown
Author

@mattklein123 I think I've addressed everything so far. Have another look if you get a chance, and let me know if there's anybody else you'd recommend to have a look.

I have a local branch queued up to actually change all the use sites (https://github.com/mergeconflict/envoy/tree/safe_init_conversion), which is mostly mechanical and boring changes. So whenever this lands, I can rebase that and put up a PR.

mattklein123
mattklein123 previously approved these changes Mar 21, 2019
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, at a high level LGTM. Very nice! I will defer to @htuch for the remainder of the review.

Dan Rosen added 2 commits March 21, 2019 13:09
Signed-off-by: Dan Rosen <mergeconflict@google.com>
Signed-off-by: Dan Rosen <mergeconflict@google.com>
Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'd like to discuss the mocking approach before merging.

namespace SafeInit {

/**
* MockWatcher is a real WatcherImpl, subclassed to add a mock `ready` method that you can set
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not such a fan of this style TBH. Ideally mocks should be pure, so that when used in some unrelated code, which only needs to use part of the mocked class, e.g. just the add(), it only needs the single EXPECT_CALL and nothing else. This approach pushes a lot of policy into the mock. Sometimes we do this for convenience, when we don't want a zillion tests reinventing how to mock ClusterManager, for example, but in this case I think we could have a purer mock here and override it based on WatcherImpl in the specific tests.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated: these really aren't mocks, they are real WatcherImpl and TargetImpl subclassed to make tests less cumbersome to write. Renamed.

@htuch htuch added the waiting label Mar 21, 2019
Signed-off-by: Dan Rosen <mergeconflict@google.com>
Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is a great improvement and reduction in cognitive load around the init flow.

@htuch htuch merged commit 1301f11 into envoyproxy:master Mar 22, 2019
@mergeconflict mergeconflict deleted the safe_init branch March 27, 2019 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants