run the test longer time for multi_widget_construction_perf #64185
CareF merged 1 commit into flutter:master from CareF:multi_widget_longer_time
Conversation
This pull request is not suitable for automatic merging in its current state.
Why are the slow frames slow? Are the slow frames slow because of the thing the benchmark is trying to measure, or for an unrelated reason? What should we do if/when the 90% point falls on a boundary like this again?
The point of tracking the 90th percentile is to catch the slowest frames other than the occasional system-induced craziness (if it wasn't for the latter, we'd just track the slowest frame). I would expect the graph to look more like: If there are so many frames affected by craziness that 10% of them are so affected, we should look at what is going on on the system that it is so common for the system to mess up our benchmarks. If, on the other hand, 10% of our frames are much slower than others because of something we're doing, we should change to tracking the 99th percentile instead and figure out why so many of our frames are slow. If our data is trimodal as suggested by the graphs above, we should see if we can understand why.
The original version of my graphing tool (without using the
This test should not be considered a conventional test in terms of build-time distribution, because I wrote the test case to intentionally force the system to rebuild the widget tree and its corresponding element tree every frame, to measure the performance of building the tree: this is not what a typical well-designed Flutter app should do. If I run the test even longer (40s), this is what I got: It's qualitatively the same as the figure for the result of this PR (20s). Compared to the time-ordered figure in my last reply, I believe the three stages are
So for this test case, the average frame build time is a good indicator of the efficiency of building widgets (and their corresponding elements and objects), while the 90th percentile captures the worst frames, where GC is involved. @Hixie
That chart clarifies what's going on a lot, thanks. What are the units on the charts? (The vertical axis in particular.) Figuring out exactly why we have a "fast" mode and a "slow" mode would be good, as would figuring out whether we can do better on those GC-affected frames. Once we are confident that we're ok with those four different modalities, we should probably figure out how to measure and track them separately. There's no point looking at a median or other percentile across all four modalities; it's not really a meaningful measure, and as you point out, it could easily slide through a transition point. The way to make that non-flaky is to separate them.
@Hixie For all these figures, the y-axis is the frame build time in microseconds. The x-axis is the frame index (no unit).
The discussion may go on; I'm merging the PR to run the test on the device lab to first confirm my theory. If we find a better way to separate those modalities, we can make updates accordingly.
Parallel to figuring out why this happened, I wonder if replacing the 90th percentile with the average of all build times between the 90th and 100th percentiles would alleviate this bimodal issue? It seems to at least alleviate this transition case. In general, I think it's very difficult to control what distributions a performance metric could have. There could be all kinds of apps/tests and performance metrics with many unknown performance issues in addition to thermal throttling, GC, etc. It would be nice if there were a general mechanism (e.g., averaging all numbers in the 90th-to-100th-percentile range) that is sensitive to worst-case numbers, but not very sensitive to small distribution shifts (e.g., a few more frames getting built).
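A quick sketch of that idea (hypothetical Python, not part of the benchmark harness): compare the plain 90th-percentile order statistic with the mean of the 90th-to-100th-percentile tail on synthetic bimodal data. Moving a single frame across the modality boundary makes the percentile jump while the tail mean only shifts slightly.

```python
def p90(times):
    """Plain 90th-percentile order statistic: one sample, so it is
    very sensitive when it lands on a modality boundary."""
    s = sorted(times)
    return s[int(len(s) * 0.9)]

def tail_mean(times, q=0.9):
    """Mean of all samples at or above the q-quantile (the metric
    proposed above): still tail-sensitive, but changes smoothly when
    a few frames cross the boundary."""
    s = sorted(times)
    tail = s[int(len(s) * q):]
    return sum(tail) / len(tail)

# Synthetic bimodal build times (microseconds): 90 fast + 10 slow frames.
a = [5000] * 90 + [20000] * 10
# Shift a single frame from the slow mode to the fast mode...
b = [5000] * 91 + [20000] * 9
# ...and p90 jumps from 20000 to 5000, while tail_mean only moves
# from 20000.0 to 18500.0.
```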
I agree, but we should know when those distributions change, at least. |
BTW what I am imagining in my description above is a form of clustering, where we take all the points, cluster them into groups, and track their medians (and assert that there's the right number of groups). |
Possibly helpful here: https://pub.dev/packages/k_means_cluster. Last update was more than a year ago, but the core algorithm is <200 lines, so might be straightforward to adapt to this use-case. The number of groups for these sorts of algorithms tends to be an input, so instead of asserting that you have the right number of groups, you'd probably assert that the per-group sum of squared distances doesn't get too big. |
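That adaptation might look something like this one-dimensional sketch (hypothetical code, not the pub.dev package itself): run k-means with a fixed k, then assert on the per-group sum of squared distances rather than on the number of groups.

```python
import random

def kmeans_1d(values, k, iters=50, seed=0):
    """Minimal one-dimensional k-means; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # pick k distinct samples as seeds
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster went empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical frame build times (us): a fast mode and a slow (GC) mode.
frames = [4900, 5000, 5100, 19900, 20000, 20100]
centers, clusters = kmeans_1d(frames, k=2)
# Per-group sum of squared distances, as suggested above.
ss = [sum((v - c) ** 2 for v in cl) for cl, c in zip(clusters, centers)]
assert all(s < 1e5 for s in ss)  # scatter within each mode stays small
```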
The web benchmarks use standard deviation to identify outliers, so the metric could be the number of frames in the outlier group (counting only the frames that took longer, as there could also be shorter outliers). (Just looked back at the bimodal nature of the benchmark: I'm not sure std-dev would help with that.)
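A minimal sketch of that outlier-count metric, assuming a simple mean plus n·σ cutoff (function and variable names are hypothetical, not the web-benchmark implementation):

```python
from statistics import mean, stdev

def slow_outlier_count(times, n_sigma=2.0):
    """Count frames more than n_sigma standard deviations ABOVE the
    mean; fast (short) outliers are deliberately ignored."""
    m, s = mean(times), stdev(times)
    return sum(1 for t in times if t > m + n_sigma * s)

# 95 fast frames plus 5 slow (e.g. GC-affected) frames: only the
# slow tail is counted.
frames = [5000] * 95 + [20000] * 5
assert slow_outlier_count(frames) == 5
```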
Yes, besides averaging, clustering might be another good mechanism for getting a stable reading. In our case, we only need to cluster one-dimensional numbers, which should be easier than clustering higher-dimensional data. As noted by Zach, k-means may not be directly applicable because we don't know k (the number of clusters or modalities), but there are known methods for choosing a good k automatically: https://en.wikipedia.org/wiki/Elbow_method_(clustering)
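A sketch of how such an elbow heuristic could pick k on one-dimensional data (hypothetical code; it uses a deterministic quantile-based init rather than random seeding): increase k until the total within-cluster sum of squares stops dropping sharply.

```python
def kmeans_1d(values, k, iters=30):
    """1-D k-means with deterministic init at evenly spaced quantiles."""
    s = sorted(values)
    centers = [s[(len(s) * (2 * i + 1)) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[min(range(k), key=lambda j: abs(v - centers[j]))].append(v)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

def total_ss(values, centers):
    """Total within-cluster sum of squared distances."""
    return sum(min((v - c) ** 2 for c in centers) for v in values)

def pick_k_elbow(values, k_max=5, drop=0.5):
    """Crude elbow rule: stop when adding a cluster no longer cuts
    the total sum of squares by at least `drop` (50% by default)."""
    prev = total_ss(values, kmeans_1d(values, 1))
    for k in range(2, k_max + 1):
        cur = total_ss(values, kmeans_1d(values, k))
        if cur > (1 - drop) * prev:
            return k - 1  # the elbow: k-1 clusters already fit the data
        prev = cur
    return k_max

# On two well-separated modes, the heuristic settles on k = 2.
frames = [4900, 5000, 5100, 19900, 20000, 20100]
assert pick_k_elbow(frames) == 2
```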



Description
This may seem to result in a regression but it's intended!
The current version, especially the e2e version, has the 90th-percentile frame build time jumping about 20% higher in some test runs. We believe the reason is that the 90th percentile accidentally lies at a transition point:

(The plot shows the sorted frame build times in microseconds; the vertical line marks the 90th percentile.)
Changing the test sample duration to a larger value may reduce this issue. We mark the driver version flaky here because this change requires more memory and may trigger #59263. #59263 is typically seen more often on my local test device than on the device lab, so I'm not sure whether it's as severe a problem on the device lab.
Checklist
Before you create this PR, confirm that it meets all requirements listed below by checking the relevant checkboxes ([x]). This will ensure a smooth and quick review process.

- [x] I updated/added relevant documentation (doc comments with `///`).
- [x] The analyzer (`flutter analyze --flutter-repo`) does not report any problems on my PR.

Breaking Change

Did any tests fail when you ran them? Please read Handling breaking changes.