
run the test longer time for multi_widget_construction_perf #64185

Merged
CareF merged 1 commit into flutter:master from CareF:multi_widget_longer_time
Aug 20, 2020

Conversation

@CareF
Contributor

@CareF CareF commented Aug 19, 2020

Description

This may seem to result in a regression but it's intended!

The current version, especially the e2e version, has the 90th-percentile frame build time jumping to 20% larger in some test runs. We believe the reason is that the 90th percentile accidentally lies at a transition point:
[figure_2: the sorted frame build times in microseconds; the vertical line marks the 90th percentile]

Changing the test sample duration to a larger value may reduce this issue. We mark the driver version flaky here because this change requires more memory and may hit #59263. #59263 is typically seen more often on my local test device than on the device lab, so I'm not sure if it's as severe a problem on the device lab.

Checklist

Before you create this PR, confirm that it meets all requirements listed below by checking the relevant checkboxes ([x]). This will ensure a smooth and quick review process.

  • I read the Contributor Guide and followed the process outlined there for submitting PRs.
  • I signed the CLA.
  • I read and followed the Flutter Style Guide, including Features we expect every widget to implement.
  • I read the Tree Hygiene wiki page, which explains my responsibilities.
  • I updated/added relevant documentation (doc comments with ///).
  • All existing and new tests are passing.
  • The analyzer (flutter analyze --flutter-repo) does not report any problems on my PR.
  • I am willing to follow-up on review comments in a timely manner.

Breaking Change

Did any tests fail when you ran them? Please read Handling breaking changes.

  • No, no existing tests failed, so this is not a breaking change.

@CareF CareF requested a review from liyuqian August 19, 2020 20:39
@flutter-dashboard flutter-dashboard bot added the c: contributor-productivity Team-specific productivity, code health, technical debt. label Aug 19, 2020
@CareF
Contributor Author

CareF commented Aug 19, 2020

The same figure after this commit will be:
[figure: the sorted frame build times after this commit (20 s sampling)]

Contributor

@liyuqian liyuqian left a comment


LGTM

@liyuqian
Contributor

CC @Hixie @flar @zanderso @kf6gpe : this is an interesting analysis and it may be applicable to many of our bi-modal benchmarks. @flar 's graphing tool could be handy in these cases.

@fluttergithubbot
Contributor

This pull request is not suitable for automatic merging in its current state.

  • The status or check suite hostonly_devicelab_tests-0-macos has failed. Please fix the issues identified (or deflake) before re-applying this label.

@zanderso
Member

Why are the slow frames slow? Are the slow frames slow because of the thing the benchmark is trying to measure, or for an unrelated reason? What should we do if/when the 90% point falls on a boundary like this again?

@Hixie
Contributor

Hixie commented Aug 19, 2020

The point of tracking the 90th percentile is to catch the slowest frames other than the occasional system-induced craziness (if it wasn't for the latter, we'd just track the slowest frame).

I would expect the graph to look more like:

|             |     /
|             |    /
|             |   /
|~~~~~~~~~~~~~|~~'
|             |
|             |
+-------------|----------

If there are so many frames affected by craziness that 10% of them are so affected, we should look at what is going on on the system that it is so common for the system to mess up our benchmarks.

If, on the other hand, 10% of our frames are much slower than others because of something we're doing, we should change to tracking the 99th percentile instead and figure out why so many of our frames are slow.

If our data is trimodal as suggested by the graphs above, we should see if we can understand why.

@flar
Contributor

flar commented Aug 19, 2020

CC @Hixie @flar @zanderso @kf6gpe : this is an interesting analysis and it may be applicable to many of our bi-modal benchmarks. @flar 's graphing tool could be handy in these cases.

The original version of my graphing tool (without the --web option) can make those frame time distribution graphs. I need to teach the web version to do the same. If you pub activate those tools and use graphTimeline [timeline_summary.json] (it can also read the timeline.json file), it will show you the graph you want.

@CareF
Contributor Author

CareF commented Aug 19, 2020

Why are the slow frames slow? Are the slow frames slow because of the thing the benchmark is trying to measure, or for an unrelated reason? What should we do if/when the 90% point falls on a boundary like this again?

If I plot the frame build time against the frame index instead of sorting it (ordered in time, though the x-axis is the frame index rather than time):
[figure_1: frame build time vs. frame index]

I believe these peaks are GC related, and the fast-to-slow transition is due to a thermal or memory-usage issue: the slower part (after the 40th frame) becomes longer if I sample for a longer time.

For this specific case, the frame building is intentionally slow due to heavy widget tree re-construction, so for the 10 s sampling time the total number of frames is ~80. With the 90th percentile this close to the step, the result is sensitive to ±1 frame. With more frames there will still be fluctuation, but not as much sensitivity.

I'm not sure about other test cases, but in general sampling for a longer time should be a very general solution. The only limiting factor is the out-of-memory crash due to a very long timeline, which is an issue my new host-independent test doesn't have.
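The ±1-frame sensitivity can be illustrated with a short sketch (editor's illustration; the numbers and the nearest-rank convention are made up, not taken from the benchmark):

```python
# Editor's illustration with made-up numbers (not the benchmark's real
# data): with ~80 frames and a bimodal distribution, the nearest-rank
# 90th percentile can flip between modes when a single frame is added.
def percentile_90(samples):
    ordered = sorted(samples)
    rank = max(0, round(0.9 * len(ordered)) - 1)  # nearest-rank index
    return ordered[rank]

fast = [9000] * 72   # "fast mode" frame build times, microseconds
slow = [18000] * 8   # GC-affected frames
print(percentile_90(fast + slow))            # 9000: p90 lands in the fast mode
print(percentile_90(fast + slow + [18000]))  # 18000: one extra slow frame flips it
```

With more samples the percentile index moves well away from the step, so a ±1-frame shift no longer jumps between modes.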

@zanderso

@CareF
Contributor Author

CareF commented Aug 20, 2020

The point of tracking the 90th percentile is to catch the slowest frames other than the occasional system-induced craziness (if it wasn't for the latter, we'd just track the slowest frame).

I would expect the graph to look more like:

|             |     /
|             |    /
|             |   /
|~~~~~~~~~~~~~|~~'
|             |
|             |
+-------------|----------

If there are so many frames affected by craziness that 10% of them are so affected, we should look at what is going on on the system that it is so common for the system to mess up our benchmarks.

If, on the other hand, 10% of our frames are much slower than others because of something we're doing, we should change to tracking the 99th percentile instead and figure out why so many of our frames are slow.

If our data is trimodal as suggested by the graphs above, we should see if we can understand why.

This test should not be considered a conventional test in terms of build time distribution: the test case intentionally forces the system to rebuild the widget tree and its corresponding element tree each frame, to measure tree-building performance. This is not what a typical well-designed Flutter app should do.

If I run the test even longer (40s), this is what I got:

[figure_3: the sorted frame build times with 40 s sampling]

It's qualitatively the same as the figure of the result of this PR (20 s). Comparing to the time-ordered figure in my last reply, I believe the stages are:

  • At the beginning, the baseline of the fast phase
  • After that, the baseline of the slower phase
  • The peaks, which I interpret as GC: this test rebuilds the widget tree every frame, so memory is used up very soon
  • (the final tail) "occasional system-induced craziness"

So for this test case, the average frame build time is a good indication of the efficiency of building widgets (and their corresponding elements and objects), while the 90th percentile captures the worst frames, where GC is involved. @Hixie

@fluttergithubbot
Contributor

This pull request is not suitable for automatic merging in its current state.

  • The status or check suite hostonly_devicelab_tests-0-macos has failed. Please fix the issues identified (or deflake) before re-applying this label.

@Hixie
Contributor

Hixie commented Aug 20, 2020

That chart clarifies what's going on a lot, thanks.

What are the units on the charts? (The vertical axis in particular.)

Figuring out exactly why we have a "fast" mode and a "slow" mode would be good, as would figuring out whether we can do better on those GC-affected frames. Once we are confident that we're OK with those four different modalities, we should probably figure out how to measure and track them separately. There's no point looking at a median or other percentile across all four modalities; it's not really a meaningful measure and, as you point out, it could easily slide through a transition point. The way to make that non-flaky is to separate them.

@CareF
Contributor Author

CareF commented Aug 20, 2020

@Hixie For all these figures, the y-axis is the frame build time in microseconds; the x-axis is the frame index (no unit).

@CareF CareF merged commit 1fc3a5e into flutter:master Aug 20, 2020
@CareF
Contributor Author

CareF commented Aug 20, 2020

The discussion may go on; I merged the PR to run the test on the device lab first to confirm my theory. If we find a better way to separate those modalities, we can make updates accordingly.

@liyuqian
Contributor

In parallel with figuring out why this happened, I wonder if replacing the 90th percentile with the average of all build times between the 90th and 100th percentiles would alleviate this bi-modal issue. It seems to at least alleviate this transition case.

In general, I think it's very difficult to control what distributions a performance metric could have. There could be all kinds of apps/tests and performance metrics, with many unknown performance issues in addition to thermal throttling, GC, etc. It would be nice if there were a general mechanism (e.g., averaging all numbers in the 90th- to 100th-percentile range) that is sensitive to worst-case numbers, but not very sensitive to small distribution shifts (e.g., a few more frames getting built).
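A minimal sketch of the suggested metric (the function name and the nearest-rank handling are illustrative, not Flutter's implementation):

```python
# Sketch of the proposed metric: report the mean of the slowest ~10%
# of frames instead of the single 90th-percentile value.
def p90_to_p100_average(build_times):
    ordered = sorted(build_times)
    cut = int(0.9 * len(ordered))  # first index of the top decile
    top = ordered[cut:]
    return sum(top) / len(top)
```

With this metric, one frame crossing the 90th-percentile boundary changes the reported number only by its contribution to the average, rather than by the full gap between the two modes.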

@Hixie
Contributor

Hixie commented Aug 20, 2020

I think it's very difficult to control what distributions a performance metric could have

I agree, but we should know when those distributions change, at least.

@Hixie
Contributor

Hixie commented Aug 20, 2020

BTW what I am imagining in my description above is a form of clustering, where we take all the points, cluster them into groups, and track their medians (and assert that there's the right number of groups).

@zanderso
Member

zanderso commented Aug 20, 2020

Possibly helpful here: https://pub.dev/packages/k_means_cluster. Last update was more than a year ago, but the core algorithm is <200 lines, so might be straightforward to adapt to this use-case.

The number of groups for these sorts of algorithms tends to be an input, so instead of asserting that you have the right number of groups, you'd probably assert that the per-group sum of squared distances doesn't get too big.
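As a sketch of that suggestion (a plain 1-D Lloyd's algorithm written for illustration, not the k_means_cluster package's actual API), clustering the frame times and checking the per-cluster sum of squared distances might look like:

```python
import random

# Illustrative sketch: 1-D k-means on frame build times, then a
# per-cluster sum-of-squared-distances check instead of asserting an
# exact number of modes.
def kmeans_1d(values, k, iters=50):
    centers = sorted(random.sample(values, k))  # random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # recompute each center as its cluster mean (keep it if empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

def cluster_sse(clusters, centers):
    """Per-cluster sum of squared distances to the cluster center."""
    return [sum((v - m) ** 2 for v in c) for c, m in zip(clusters, centers)]

# Example: 50 "fast" frames near 9 ms and 10 GC-affected frames near 18 ms.
centers, clusters = kmeans_1d([9.0] * 50 + [18.0] * 10, k=2)
# A harness could then assert max(cluster_sse(clusters, centers)) < budget.
```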

@CareF CareF deleted the multi_widget_longer_time branch August 20, 2020 17:09
@flar
Contributor

flar commented Aug 20, 2020

The web benchmarks use standard deviation to identify outliers, so the metric could be the number of frames in the outlier group (outliers being only the ones that took longer, as there could also be shorter outliers).

(Just looked back at the bimodal nature of the benchmark - I'm not sure std-dev would help with that)
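The approach could look like this sketch (the web benchmarks' real threshold and bookkeeping may differ; `n_sigma=2.0` is an assumption for illustration):

```python
import statistics

# Sketch of the std-deviation approach described above: count frames
# more than n_sigma standard deviations above the mean as outliers,
# ignoring unusually fast frames.
def slow_outlier_count(build_times, n_sigma=2.0):
    mean = statistics.mean(build_times)
    sd = statistics.stdev(build_times)
    # count only the slow side: frames well above the mean
    return sum(1 for t in build_times if t > mean + n_sigma * sd)
```

As noted above, on a strongly bimodal distribution the standard deviation itself is inflated by the second mode, so the threshold can land between the modes and the count becomes boundary-sensitive in the same way a percentile is.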

@liyuqian
Contributor

liyuqian commented Aug 20, 2020

BTW what I am imagining in my description above is a form of clustering, where we take all the points, cluster them into groups, and track their medians (and assert that there's the right number of groups).

Yes, besides averaging, clustering might be another good mechanism to get a stable reading. In our case, we only need to cluster 1-dimensional numbers, which should be easier than clustering higher-dimensional data. As noted by Zach, k-means may not be directly applicable because we don't know k (the number of clusters or modalities), but there are known methods to choose a good k automatically: https://en.wikipedia.org/wiki/Elbow_method_(clustering)
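A hedged sketch of an elbow-style choice of k for 1-D data (the gap-cutting clustering and the 50% drop threshold are editor's simplifications for illustration; neither comes from the linked article or any Flutter code):

```python
# Editor's sketch: cluster 1-D values by cutting at the largest gaps
# between sorted values, then pick k with an elbow rule -- stop
# increasing k once the SSE improvement flattens out.
def gap_clusters(values, k):
    s = sorted(values)
    # indices of the k-1 largest gaps between consecutive sorted values
    cuts = sorted(sorted(range(1, len(s)),
                         key=lambda i: s[i] - s[i - 1],
                         reverse=True)[:k - 1])
    bounds = [0] + cuts + [len(s)]
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]

def sse(clusters):
    """Total sum of squared distances to each cluster's mean."""
    total = 0.0
    for c in clusters:
        m = sum(c) / len(c)
        total += sum((v - m) ** 2 for v in c)
    return total

def elbow_k(values, k_max=6):
    errs = [sse(gap_clusters(values, k)) for k in range(1, k_max + 1)]
    # stop at the first k where adding another cluster no longer
    # cuts the error roughly in half (arbitrary illustrative threshold)
    for k in range(1, k_max):
        if errs[k] >= 0.5 * errs[k - 1]:
            return k
    return k_max
```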

smadey pushed a commit to smadey/flutter that referenced this pull request Aug 27, 2020
mingwandroid pushed a commit to mingwandroid/flutter that referenced this pull request Sep 6, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 20, 2021