
run the test longer time for multi_widget_construction_perf #64185

Merged
CareF merged 1 commit into flutter:master from CareF:multi_widget_longer_time
Aug 20, 2020

Conversation

@CareF
Contributor

@CareF CareF commented Aug 19, 2020

Description

This may seem to result in a regression but it's intended!

The current version, especially the e2e version, has the 90th-percentile frame build time jumping to 20% larger in some test runs. We believe the reason is that the 90th percentile accidentally lies at a transition point:
[figure_2: the sorted frame build times in microseconds; the vertical line marks the 90th percentile]

Changing the test sample duration to a larger value may reduce this issue. We mark the driver version flaky here because this change requires more memory and may hit #59263. #59263 is typically seen more often on my local test device than on the device lab, so I'm not sure if it's as severe a problem on the device lab.

Checklist

Before you create this PR, confirm that it meets all requirements listed below by checking the relevant checkboxes ([x]). This will ensure a smooth and quick review process.

  • I read the Contributor Guide and followed the process outlined there for submitting PRs.
  • I signed the CLA.
  • I read and followed the Flutter Style Guide, including Features we expect every widget to implement.
  • I read the Tree Hygiene wiki page, which explains my responsibilities.
  • I updated/added relevant documentation (doc comments with ///).
  • All existing and new tests are passing.
  • The analyzer (flutter analyze --flutter-repo) does not report any problems on my PR.
  • I am willing to follow-up on review comments in a timely manner.

Breaking Change

Did any tests fail when you ran them? Please read Handling breaking changes.

  • No, no existing tests failed, so this is not a breaking change.

@CareF CareF requested a review from liyuqian August 19, 2020 20:39
@flutter-dashboard flutter-dashboard bot added the c: contributor-productivity Team-specific productivity, code health, technical debt. label Aug 19, 2020
@CareF
Contributor Author

CareF commented Aug 19, 2020

The same figure after this commit will be:
[figure: the sorted frame build times after this commit (20 s sampling)]

Contributor

@liyuqian liyuqian left a comment


LGTM

@liyuqian
Contributor

CC @Hixie @flar @zanderso @kf6gpe : this is an interesting analysis and it may be applicable to many of our bi-modal benchmarks. @flar 's graphing tool could be handy in these cases.

@fluttergithubbot
Contributor

This pull request is not suitable for automatic merging in its current state.

  • The status or check suite hostonly_devicelab_tests-0-macos has failed. Please fix the issues identified (or deflake) before re-applying this label.

@zanderso
Member

Why are the slow frames slow? Are the slow frames slow because of the thing the benchmark is trying to measure, or for an unrelated reason? What should we do if/when the 90% point falls on a boundary like this again?

@Hixie
Contributor

Hixie commented Aug 19, 2020

The point of tracking the 90th percentile is to catch the slowest frames other than the occasional system-induced craziness (if it wasn't for the latter, we'd just track the slowest frame).

I would expect the graph to look more like:

|             |     /
|             |    /
|             |   /
|~~~~~~~~~~~~~|~~'
|             |
|             |
+-------------|----------

If there are so many frames affected by craziness that 10% of them are so affected, we should look at what is going on on the system that it is so common for the system to mess up our benchmarks.

If, on the other hand, 10% of our frames are much slower than others because of something we're doing, we should change to tracking the 99th percentile instead and figure out why so many of our frames are slow.

If our data is trimodal as suggested by the graphs above, we should see if we can understand why.

@flar
Contributor

flar commented Aug 19, 2020

CC @Hixie @flar @zanderso @kf6gpe : this is an interesting analysis and it may be applicable to many of our bi-modal benchmarks. @flar 's graphing tool could be handy in these cases.

The original version of my graphing tool (without the --web option) can make those frame time distribution graphs. I need to teach the web version to do the same. If you pub activate those tools and use graphTimeline [timeline_summary.json] (it can also read the timeline.json file), it will show you the graph you want.

@CareF
Contributor Author

CareF commented Aug 19, 2020

Why are the slow frames slow? Are the slow frames slow because of the thing the benchmark is trying to measure, or for an unrelated reason? What should we do if/when the 90% point falls on a boundary like this again?

If I plot the frame build time against the frame index instead of sorting it (ordered in time, though the x-axis is the frame index rather than time):
[figure_1: frame build time vs. frame index]

I believe these peaks are GC related, and the fast-to-slow transition is due to a thermal or memory-usage issue: the slower part (after the 40th frame) becomes longer if I sample for a longer time.

For this specific case, the frame building is intentionally slow due to heavy widget tree re-construction, so for the 10 s sampling time the total number of frames is ~80. With the 90th percentile this close to the step, the result is sensitive to ±1 frame. With more frames there will still be fluctuation, but not as much sensitivity.

I'm not sure about other test cases, but in general sampling for a longer time should be a very general solution. The only limiting factor is the out-of-memory crash due to a very long timeline, which is an issue my new host-independent test doesn't have.
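The ±1-frame sensitivity can be illustrated with a short sketch (editor's illustration; the numbers and the nearest-rank convention are made up, not taken from the benchmark):

```python
# Editor's illustration with made-up numbers (not the benchmark's real
# data): with ~80 frames and a bimodal distribution, the nearest-rank
# 90th percentile can flip between modes when a single frame is added.
def percentile_90(samples):
    ordered = sorted(samples)
    rank = max(0, round(0.9 * len(ordered)) - 1)  # nearest-rank index
    return ordered[rank]

fast = [9000] * 72   # "fast mode" frame build times, microseconds
slow = [18000] * 8   # GC-affected frames
print(percentile_90(fast + slow))            # 9000: p90 lands in the fast mode
print(percentile_90(fast + slow + [18000]))  # 18000: one extra slow frame flips it
```

With more samples the percentile index moves well away from the step, so a ±1-frame shift no longer jumps between modes.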

@zanderso

@CareF
Contributor Author

CareF commented Aug 20, 2020

The point of tracking the 90th percentile is to catch the slowest frames other than the occasional system-induced craziness (if it wasn't for the latter, we'd just track the slowest frame).

I would expect the graph to look more like:

|             |     /
|             |    /
|             |   /
|~~~~~~~~~~~~~|~~'
|             |
|             |
+-------------|----------

If there are so many frames affected by craziness that 10% of them are so affected, we should look at what is going on on the system that it is so common for the system to mess up our benchmarks.

If, on the other hand, 10% of our frames are much slower than others because of something we're doing, we should change to tracking the 99th percentile instead and figure out why so many of our frames are slow.

If our data is trimodal as suggested by the graphs above, we should see if we can understand why.

This test should not be considered a conventional test in terms of build time distribution: the test case intentionally forces the system to rebuild the widget tree and its corresponding element tree each frame, to measure tree-building performance. This is not what a typical well-designed Flutter app should do.

If I run the test even longer (40s), this is what I got:

[figure_3: the sorted frame build times with 40 s sampling]

It's qualitatively the same as the figure of the result of this PR (20 s). Comparing to the time-ordered figure in my last reply, I believe the stages are:

  • At the beginning, the baseline of the fast phase
  • After that, the baseline of the slower phase
  • The peaks, which I interpret as GC: this test rebuilds the widget tree every frame, so memory is used up very soon
  • (the final tail) "occasional system-induced craziness"

So for this test case, the average frame build time is a good indication of the efficiency of building widgets (and their corresponding elements and objects), while the 90th percentile captures the worst frames, where GC is involved. @Hixie

@fluttergithubbot
Contributor

This pull request is not suitable for automatic merging in its current state.

  • The status or check suite hostonly_devicelab_tests-0-macos has failed. Please fix the issues identified (or deflake) before re-applying this label.

@Hixie
Contributor

Hixie commented Aug 20, 2020

That chart clarifies what's going on a lot, thanks.

What are the units on the charts? (The vertical axis in particular.)

Figuring out exactly why we have a "fast" mode and a "slow" mode would be good, as would figuring out whether we can do better on those GC-affected frames. Once we are confident that we're OK with those four different modalities, we should probably figure out how to measure and track them separately. There's no point looking at a median or other percentile across all four modalities; it's not really a meaningful measure and, as you point out, it could easily slide through a transition point. The way to make that non-flaky is to separate them.

@CareF
Contributor Author

CareF commented Aug 20, 2020

@Hixie For all these figures, the y-axis is the frame build time in microseconds; the x-axis is the frame index (no unit).

@CareF CareF merged commit 1fc3a5e into flutter:master Aug 20, 2020
@CareF
Contributor Author

CareF commented Aug 20, 2020

The discussion may go on; I merged the PR to run the test on the device lab first to confirm my theory. If we find a better way to separate those modalities, we can make updates accordingly.

@liyuqian
Contributor

In parallel with figuring out why this happened, I wonder if replacing the 90th percentile with the average of all build times between the 90th and 100th percentiles would alleviate this bi-modal issue. It seems to at least alleviate this transition case.

In general, I think it's very difficult to control what distributions a performance metric could have. There could be all kinds of apps/tests and performance metrics, with many unknown performance issues in addition to thermal throttling, GC, etc. It would be nice if there were a general mechanism (e.g., averaging all numbers in the 90th- to 100th-percentile range) that is sensitive to worst-case numbers, but not very sensitive to small distribution shifts (e.g., a few more frames getting built).
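A minimal sketch of the suggested metric (the function name and the nearest-rank handling are illustrative, not Flutter's implementation):

```python
# Sketch of the proposed metric: report the mean of the slowest ~10%
# of frames instead of the single 90th-percentile value.
def p90_to_p100_average(build_times):
    ordered = sorted(build_times)
    cut = int(0.9 * len(ordered))  # first index of the top decile
    top = ordered[cut:]
    return sum(top) / len(top)
```

With this metric, one frame crossing the 90th-percentile boundary changes the reported number only by its contribution to the average, rather than by the full gap between the two modes.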

@Hixie
Contributor

Hixie commented Aug 20, 2020

I think it's very difficult to control what distributions a performance metric could have

I agree, but we should know when those distributions change, at least.

@Hixie
Contributor

Hixie commented Aug 20, 2020

BTW what I am imagining in my description above is a form of clustering, where we take all the points, cluster them into groups, and track their medians (and assert that there's the right number of groups).

@zanderso
Member

zanderso commented Aug 20, 2020

Possibly helpful here: https://pub.dev/packages/k_means_cluster. Last update was more than a year ago, but the core algorithm is <200 lines, so might be straightforward to adapt to this use-case.

The number of groups for these sorts of algorithms tends to be an input, so instead of asserting that you have the right number of groups, you'd probably assert that the per-group sum of squared distances doesn't get too big.
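As a sketch of that suggestion (a plain 1-D Lloyd's algorithm written for illustration, not the k_means_cluster package's actual API), clustering the frame times and checking the per-cluster sum of squared distances might look like:

```python
import random

# Illustrative sketch: 1-D k-means on frame build times, then a
# per-cluster sum-of-squared-distances check instead of asserting an
# exact number of modes.
def kmeans_1d(values, k, iters=50):
    centers = sorted(random.sample(values, k))  # random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # recompute each center as its cluster mean (keep it if empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

def cluster_sse(clusters, centers):
    """Per-cluster sum of squared distances to the cluster center."""
    return [sum((v - m) ** 2 for v in c) for c, m in zip(clusters, centers)]

# Example: 50 "fast" frames near 9 ms and 10 GC-affected frames near 18 ms.
centers, clusters = kmeans_1d([9.0] * 50 + [18.0] * 10, k=2)
# A harness could then assert max(cluster_sse(clusters, centers)) < budget.
```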

@CareF CareF deleted the multi_widget_longer_time branch August 20, 2020 17:09
@flar
Contributor

flar commented Aug 20, 2020

The web benchmarks use standard deviation to identify outliers, so the metric could be the number of frames in the outlier group (outliers being only the ones that took longer, as there could also be shorter outliers).

(Just looked back at the bimodal nature of the benchmark - I'm not sure std-dev would help with that)
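The approach could look like this sketch (the web benchmarks' real threshold and bookkeeping may differ; `n_sigma=2.0` is an assumption for illustration):

```python
import statistics

# Sketch of the std-deviation approach described above: count frames
# more than n_sigma standard deviations above the mean as outliers,
# ignoring unusually fast frames.
def slow_outlier_count(build_times, n_sigma=2.0):
    mean = statistics.mean(build_times)
    sd = statistics.stdev(build_times)
    # count only the slow side: frames well above the mean
    return sum(1 for t in build_times if t > mean + n_sigma * sd)
```

As noted above, on a strongly bimodal distribution the standard deviation itself is inflated by the second mode, so the threshold can land between the modes and the count becomes boundary-sensitive in the same way a percentile is.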

@liyuqian
Contributor

liyuqian commented Aug 20, 2020

BTW what I am imagining in my description above is a form of clustering, where we take all the points, cluster them into groups, and track their medians (and assert that there's the right number of groups).

Yes, besides averaging, clustering might be another good mechanism to get a stable reading. In our case, we only need to cluster 1-dimensional numbers, which should be easier than clustering higher-dimensional data. As noted by Zach, k-means may not be directly applicable because we don't know k (the number of clusters or modalities), but there are known methods to choose a good k automatically: https://en.wikipedia.org/wiki/Elbow_method_(clustering)
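A hedged sketch of an elbow-style choice of k for 1-D data (the gap-cutting clustering and the 50% drop threshold are editor's simplifications for illustration; neither comes from the linked article or any Flutter code):

```python
# Editor's sketch: cluster 1-D values by cutting at the largest gaps
# between sorted values, then pick k with an elbow rule -- stop
# increasing k once the SSE improvement flattens out.
def gap_clusters(values, k):
    s = sorted(values)
    # indices of the k-1 largest gaps between consecutive sorted values
    cuts = sorted(sorted(range(1, len(s)),
                         key=lambda i: s[i] - s[i - 1],
                         reverse=True)[:k - 1])
    bounds = [0] + cuts + [len(s)]
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]

def sse(clusters):
    """Total sum of squared distances to each cluster's mean."""
    total = 0.0
    for c in clusters:
        m = sum(c) / len(c)
        total += sum((v - m) ** 2 for v in c)
    return total

def elbow_k(values, k_max=6):
    errs = [sse(gap_clusters(values, k)) for k in range(1, k_max + 1)]
    # stop at the first k where adding another cluster no longer
    # cuts the error roughly in half (arbitrary illustrative threshold)
    for k in range(1, k_max):
        if errs[k] >= 0.5 * errs[k - 1]:
            return k
    return k_max
```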

smadey pushed a commit to smadey/flutter that referenced this pull request Aug 27, 2020
mingwandroid pushed a commit to mingwandroid/flutter that referenced this pull request Sep 6, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 20, 2021