Skip to content
This repository was archived by the owner on Feb 25, 2025. It is now read-only.

Enable Skia reduceOpsTaskSplitting option#26067

Closed
Adlai-Holler wants to merge 2 commits into
flutter-team-archive:masterfrom
Adlai-Holler:reduce_ops_task_splitting
Closed

Enable Skia reduceOpsTaskSplitting option#26067
Adlai-Holler wants to merge 2 commits into
flutter-team-archive:masterfrom
Adlai-Holler:reduce_ops_task_splitting

Conversation

@Adlai-Holler

Copy link
Copy Markdown
Contributor

This option is enabled on our bots and is rolling out to our clients. It provides a performance improvement by reordering tasks to combine render tasks targeting the same surfaces.

For instance, if you call

draw
saveLayer
  draw
restore
draw

By default, Skia will treat this as

draw to A
draw to B
draw B to A
draw more to A

With this option, we will reorder to

draw to B
draw to A
draw B to A
draw more to A

This reduction in target swapping is faster but may consume more VRAM. If the reordered version would go over our memory budget, then we'll stick to the original version.

This option is enabled on our bots and is rolling out to
our clients. It provides a performance improvement by
reordering tasks to combine render tasks targeting the
same surfaces.

For instance, if you call

draw
saveLayer
  draw
restoreLayer
draw

By default, Skia will treat this as
draw to A
draw to B
draw B to A
draw more to A

With this option, we will reorder to
draw to B
draw to A
draw B to A
draw more to A

This reduction in target swapping is faster but may consume more
VRAM. If the reordered version would go over our memory budget,
then we'll stick to the original version.
@flutter-dashboard

Copy link
Copy Markdown

It looks like this pull request may not have tests. Please make sure to add tests before merging. If you need an exemption to this rule, contact Hixie on the #hackers channel in Chat.

If you are not sure if you need tests, consider this rule of thumb: the purpose of a test is to make sure someone doesn't accidentally revert the fix. Ask yourself, is there anything in your PR that you feel it is important we not accidentally revert back to how it was before your fix?

Reviewers: Read the Tree Hygiene page and make sure this patch meets those guidelines before LGTMing.

@google-cla google-cla Bot added the cla: yes label May 11, 2021
@Adlai-Holler

Copy link
Copy Markdown
Contributor Author

@chinmaygarde wdyt?

@chinmaygarde chinmaygarde left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fail to see how the example you linked would consume more VRAM if it were reordered. I guess the reordering the following would lead to more VRAM usage however since the second draw render target can be reused today? Just so I understand, is this what you meant?

draw
saveLayer
  draw
restore
draw 
saveLayer
  draw
restore
draw

// A similar work-around is also used in shell/common/io_manager.cc.
options.fDisableGpuYUVConversion = true;

options.fReduceOpsTaskSplitting = GrContextOptions::Enable::kYes;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only in effect in the GL backend. The identical change must be applied to gpu_surface_vulkan and gpu_surface_metal.

@Adlai-Holler

Copy link
Copy Markdown
Contributor Author

Oh, indeed this particular example wouldn't use more VRAM. Instead imagine we have 2 offscreen surfaces. Without reordering, the surface may be reusable for both offscreen draws. With reordering, we need to actually allocate 2 separate surfaces.

@chinmaygarde

Copy link
Copy Markdown
Contributor

We are currently fairly sensitive to increases in VRAM usage. Is there way to enable the optimization only if it doesn't allocate more surfaces for purposes of reordering? In the first case, reducing render target switches makes sense. In the second case, I am not so sure.

@Adlai-Holler

Copy link
Copy Markdown
Contributor Author

Hmm, there is no capability to do that. However we will fall back to the un-reordered version if the reordered version would not fit within the VRAM budget specified by the user. Would it be possible to tune that value? Is there a good way that we could try this option out and get data on the trade-off?

@chinmaygarde

Copy link
Copy Markdown
Contributor

Would it be possible to tune that value?

Right now, it is some multiple of the screen size specified.

Is there a good way that we could try this option out and get data on the trade-off?

There are benchmarks that run on Skia perf. But, it may be that most of them measure time and the few that measure memory may actually be measuring host memory. @dnfield may have more precise guidance on the specific benchmarks.

@dnfield

dnfield commented May 11, 2021

Copy link
Copy Markdown
Contributor

Is the VRAM budget the same as the resource cache budget? I don't know that we're telling Skia about anything but resource cache budgets.

We do not have any VRAM specific benchmarks. We do have some memory benchmarks that are basically polling adb shell dumpsys meminfo and mainly looking at RSS.

@Adlai-Holler

Copy link
Copy Markdown
Contributor Author

Yes the resource cache budget is the VRAM budget for Skia GPU. We'll avoid going over that whenever possible, including falling-back from reordering tasks.

@dnfield

dnfield commented May 11, 2021

Copy link
Copy Markdown
Contributor

Would you be up for following https://github.com/flutter/flutter/tree/master/dev/devicelab#running-an-ab-test-for-engine-changes and testing out flutter_gallery__memory_nav and complex_layout_scroll_perf__devtools_memory?

@Adlai-Holler

Copy link
Copy Markdown
Contributor Author

Hey @dnfield, I'm attempting to do that but running into some trouble on my Mac. I'm trying to run on Android.

Specifically, I've used gn to generate the android_release_arm64 variant but attempting to build it with ninja -C out/android_release_arm64 generates a ton of errors in C++ and Java and Dart code, only reaching 303/3723 files built. https://pastebin.com/Qge8sVuR

If you have a suitable environment, would you be willing to patch in this change locally and see how it fares?

@dnfield

dnfield commented May 12, 2021

Copy link
Copy Markdown
Contributor

That paste link seems dead. I can try to take a look at this today.

@dnfield

dnfield commented May 13, 2021

Copy link
Copy Markdown
Contributor

I did not get to it yesterday but looking now :)

@dnfield

dnfield commented May 13, 2021

Copy link
Copy Markdown
Contributor

Results for flutter_gallery__memory_nav:

════════════════════════════╡ ••• Raw results ••• ╞═════════════════════════════

start-min:
  A:	105820.00	105990.00	105961.00	
  B:	118927.00	110670.00	101765.00	
start-max:
  A:	113444.00	112400.00	113004.00	
  B:	124858.00	123504.00	120261.00	
start-median:
  A:	105960.00	106358.00	106332.00	
  B:	119227.00	119563.00	119146.00	
end-min:
  A:	230458.00	229345.00	230025.00	
  B:	241728.00	241957.00	242080.00	
end-max:
  A:	234842.00	236941.00	237314.00	
  B:	251830.00	247217.00	248769.00	
end-median:
  A:	231343.00	231931.00	232670.00	
  B:	243571.00	243522.00	244325.00	
diff-min:
  A:	121398.00	123355.00	123343.00	
  B:	122206.00	122394.00	123999.00	
diff-max:
  A:	127169.00	127427.00	128176.00	
  B:	131483.00	136547.00	146390.00	
diff-median:
  A:	125089.00	124541.00	125014.00	
  B:	123913.00	123530.00	124849.00	



═════════════════════════╡ ••• Final A/B results ••• ╞══════════════════════════

Score	Average A (noise)	Average B (noise)	Speed-up
start-min	105923.67 (0.07%)	110454.00 (6.34%)	0.96x	
start-max	112949.33 (0.38%)	122874.33 (1.57%)	0.92x	
start-median	106216.67 (0.17%)	119312.00 (0.15%)	0.89x	
end-min	229942.67 (0.20%)	241921.67 (0.06%)	0.95x	
end-max	236365.67 (0.46%)	249272.00 (0.77%)	0.95x	
end-median	231981.33 (0.23%)	243806.00 (0.15%)	0.95x	
diff-min	122698.67 (0.75%)	122866.33 (0.65%)	1.00x	
diff-max	127590.67 (0.33%)	138140.00 (4.48%)	0.92x	
diff-median	124881.33 (0.19%)	124097.33 (0.45%)	1.01x	

The other memory benchmark I was suggesting was having trouble with the AB. However, for flutter_gallery__transition_perf

════════════════════════════╡ ••• Raw results ••• ╞═════════════════════════════

missed_transition_count:
  A:	0.00	0.00	0.00	
  B:	0.00	0.00	0.00	
average_frame_build_time_millis:
  A:	1.80	1.76	1.85	
  B:	1.78	1.78	1.78	
worst_frame_build_time_millis:
  A:	36.86	34.21	35.35	
  B:	31.67	34.83	31.00	
90th_percentile_frame_build_time_millis:
  A:	2.12	2.10	2.32	
  B:	2.19	2.14	2.03	
99th_percentile_frame_build_time_millis:
  A:	14.94	13.34	14.02	
  B:	13.34	12.63	12.71	
average_frame_rasterizer_time_millis:
  A:	4.93	5.44	5.32	
  B:	4.77	5.06	5.01	
worst_frame_rasterizer_time_millis:
  A:	60.88	60.09	63.93	
  B:	64.38	80.42	59.08	
90th_percentile_frame_rasterizer_time_millis:
  A:	7.82	10.80	10.19	
  B:	9.70	10.39	10.00	
99th_percentile_frame_rasterizer_time_millis:
  A:	22.52	22.61	23.51	
  B:	22.45	21.42	24.53	



═════════════════════════╡ ••• Final A/B results ••• ╞══════════════════════════

Score	Average A (noise)	Average B (noise)	Speed-up
missed_transition_count	0.00 (0.00%)	0.00 (0.00%)	NaNx	
average_frame_build_time_millis	1.80 (2.09%)	1.78 (0.14%)	1.01x	
worst_frame_build_time_millis	35.47 (3.06%)	32.50 (5.14%)	1.09x	
90th_percentile_frame_build_time_millis	2.18 (4.51%)	2.12 (3.05%)	1.03x	
99th_percentile_frame_build_time_millis	14.10 (4.64%)	12.89 (2.44%)	1.09x	
average_frame_rasterizer_time_millis	5.23 (4.11%)	4.94 (2.56%)	1.06x	
worst_frame_rasterizer_time_millis	61.63 (2.68%)	67.96 (13.35%)	0.91x	
90th_percentile_frame_rasterizer_time_millis	9.60 (13.37%)	10.03 (2.83%)	0.96x	
99th_percentile_frame_rasterizer_time_millis	22.88 (1.95%)	22.80 (5.67%)	1.00x	

@dnfield

dnfield commented May 13, 2021

Copy link
Copy Markdown
Contributor

This seems to more significantly regress memory performance. I'm a bit suspicious of the frame time improvements, since this really should only impact raster time - it's possible that there's som enoise due to my framework commit not using the same hash as what this is based on. However, the rasterizer actually shows a decrease in worst time and 90th percentile, though the average increase is nice.

@dnfield

dnfield commented May 13, 2021

Copy link
Copy Markdown
Contributor

Given these results I'm not really inclined to say we should land this change. If there's a better benchmark we could run, or a real life application this benefits, it would help.

@chinmaygarde chinmaygarde left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we are super sensitive to device memory usage right now. I'd say we shouldn't land this right now unless the option to reorder draw calls only when there is a guarantee that additional surfaces wont be allocated is available.

@Adlai-Holler

Copy link
Copy Markdown
Contributor Author

I hear you. FWIW, in order to get more information on the trade-off, I temporarily turned this option off on the Skia bots to gather perf data from our mobile device lab. Here's the result for the Pixel4XL on Vulkan:

https://perf.skia.org/e/?begin=1621542595&end=1621618244&keys=X6a1d49475c5eb82730dfe1872fe7701c&num_commits=50&request_type=1&xbaroffset=55307

As you can see, the result depends on the specific captured workload (Chrome snapshot) we're looking at but the performance is always improved, and in some cases like desk_weather.skp (weather.com) and mobi_slashdot, the numbers moved quite dramatically – from 4.8ms median to 3.16ms in the latter case.

Does this move the needle on your thinking?

@dnfield

dnfield commented May 21, 2021

Copy link
Copy Markdown
Contributor

@apwilson - would you be able to check out how applying this affects benchmarks on Fuchsia?

I don't think we have many customers using Vulkan on Android, but Fuchsia is using Vulkan AFAIK.

@dnfield

dnfield commented May 21, 2021

Copy link
Copy Markdown
Contributor

(The benchmarks I ran were OpenGL based)

@Adlai-Holler

Adlai-Holler commented May 25, 2021

Copy link
Copy Markdown
Contributor Author

Sure, the flag also affects gl. For instance here's the timing data from the iPhone6 in gles on our skp suite. The "hump" is when we temporarily disabled the flag to measure the impact. https://perf.skia.org/e/?begin=1621537044&end=1621648544&keys=X70d8e3c479150e3a2b75931f96229ccf&num_commits=50&request_type=1&xbaroffset=55308

Adlai-Holler added a commit to Adlai-Holler/engine that referenced this pull request Jun 3, 2021
This option will be the default in Skia soon. Per discussion in flutter-team-archive#26067, let’s be explicit about disabling it for the time being, and we can revisit the flag in the future if desirable.
@Adlai-Holler Adlai-Holler deleted the reduce_ops_task_splitting branch June 3, 2021 18:06
Adlai-Holler added a commit to Adlai-Holler/engine that referenced this pull request Jun 11, 2021
This option will be the default in Skia soon. Per discussion in flutter-team-archive#26067, let’s be explicit about disabling it for the time being, and we can revisit the flag in the future if desirable.
chinmaygarde pushed a commit that referenced this pull request Jun 17, 2021
This option will be the default in Skia soon. Per discussion in #26067, let’s be explicit about disabling it for the time being, and we can revisit the flag in the future if desirable.
wqyfavor pushed a commit to wqyfavor/engine that referenced this pull request Jun 21, 2021
This option will be the default in Skia soon. Per discussion in flutter-team-archive#26067, let’s be explicit about disabling it for the time being, and we can revisit the flag in the future if desirable.
naudzghebre pushed a commit to naudzghebre/engine that referenced this pull request Sep 2, 2021
This option will be the default in Skia soon. Per discussion in flutter-team-archive#26067, let’s be explicit about disabling it for the time being, and we can revisit the flag in the future if desirable.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Development

Successfully merging this pull request may close these issues.

3 participants