[Impeller] Add shader library; reduce branching in advanced blends #34349

bdero · 2022-06-28T10:07:28Z

Introduce a library path for shaders and place some common utilities with header guards within. Remove a bunch of branches from advanced blends.

flar · 2022-06-28T17:23:47Z

impeller/compiler/shader_lib/impeller/texture.glsl

+vec4 SampleWithBorder(sampler2D tex, vec2 uv) {
+  float within_bounds = GreaterThan(uv.x, 0) * GreaterThan(uv.y, 0) *
+                        LessThan(uv.x, 1) * LessThan(uv.y, 1);
+  return texture(tex, uv) * within_bounds;


This looks like it samples transparent black outside of the texture? Which is kDecal, not kClamp.

Also, this clamps to 0 right at the edge of the texture, but kDecal will fade to 0 from the center of the edge pixel to the center of the pixel next to it that is outside of the texture.

(assuming kLinear interpolation)

This looks like it samples transparent black outside of the texture? Which is kDecal, not kClamp.

This is using Metal/GL/Vulkan terminology, where Skia's kDecal == ClampToBorder/ClampToBorderColor(0,0,0,0), and kClamp == ClampToEdge. The main reason for this is that Impeller's render layer uses this terminology, and anyone using this function is likely to be interfacing directly with Impeller's render layer.

That being said, if we were to ever expose this shader library to Flutter users, I think we should strongly consider using Skia terminology.

Also, this clamps to 0 right at the edge of the texture, but kDecal will fade to 0 from the center of the edge pixel to the center of the pixel next to it that is outside of the texture.

Hmm yes, that makes sense. This is tricky to handle right now given it's dependent on the sampler config... I suspect we'll end up being able to flip on whatever extensions we need for ClampToBorder sampling nearly 100% of the time in practice, so hopefully this problem will just magically get solved as we continue optimizing shaders.

chinmaygarde

Extremely supportive of a shader support library. Did you measure the performance differences in the advanced blends? Some nits, otherwise LGTM. For the GLSL, can you also paste the before and after to see check inlining?

chinmaygarde · 2022-06-28T17:49:37Z

impeller/compiler/shader_lib/impeller/branching.glsl

+
+#include <impeller/constants.glsl>
+
+vec3 EqualTo3(vec3 x, float y) {


How about using C style "namespacing" for this library code? For instance, "IPVec3IsEqual"? "IP" the Impeller namespace, "Vec3" the class and then the method name.

Sure, done. I also ended up renaming the "Mix" functions to better reflect their utility: IPVec3ChooseCutoff and IPVec3Choose.

chinmaygarde · 2022-06-28T17:52:37Z

impeller/compiler/shader_lib/impeller/branching.glsl

+#ifndef BRANCHING_GLSL_
+#define BRANCHING_GLSL_
+
+#include <impeller/constants.glsl>


Can we settle on a naming for SL code in a header? How about ".sl.h"? Should make it easier for editors and such to highlight by default. And also denote that the file is meant to be included. .glsl is also fine I suppose.

The reason I went for .glsl here was that vscode w/ the popular shader language extension treats it the same as .frag and .vert:

.sl.h activates clangd and use C syntax highlighting, so GLSL builtins aren't highlighted/don't complete and the file gets flooded with nonsensical errors.

.glsl, .vert, and .frag all resolve to GLSL highlighting. The #include lines are gracefully ignored.

I'm not married to this and am totally cool with revisiting. At some point it might be worth making editor plugins that support shaderc features and configurable include paths for code navigation. I haven't really experienced such goodness outside of ShaderLab. :)

chinmaygarde · 2022-06-28T17:55:06Z

impeller/compiler/shader_lib/impeller/branching.glsl

+  return 1 - abs(sign(x - y));
+}
+
+float GreaterThan(float x, float y) {


Here and elsewhere, some headerdoc comments? Simple ones to start off with are fine.

chinmaygarde · 2022-06-28T17:56:27Z

impeller/compiler/shader_lib/impeller/branching.glsl

+  return max(sign(x - y), 0);
+}
+
+float LessThan(float x, float y) {


IMO, floats as booleans could be made more readable using a typedef (or define more likely) in constants.h. How about BooleanF or BoolF?

Sure, added BoolF and BoolV[2-4]. We could add a bunch of convenience methods for logic gates too and use this convention for inputs, but gonna wait to see if a realistic need arises.

bdero · 2022-06-29T18:35:52Z

@chinmaygarde

Did you measure the performance differences in the advanced blends?

I took some captures while removing the branches, but I need to come up with a less noisy way of capturing. There's way too much variance from run to run.

I've been testing colorburns that appear in one of our apps, and my hypothesis is that these branch elimination tricks have a slightly negative effect on any branch that only involves uniforms, so I aught to make both branched and unbranched versions of some things.
For colorburn/colordodge in particular, I actually wrote the original shader using branch removal tricks, so I haven't yet captured the difference they make here (the only difference is easier reading :)).

TL;DR: I'm going to spend some more time here to figure out what improves execution speed and what doesn't. I also need to de-branch and test the non-seperable blends like Luminosity.

For the GLSL, can you also paste the before and after to see check inlining?

advanced_blend_colorburn.frag.gles before:

#version 120
#ifdef GL_ARB_shading_language_420pack
#extension GL_ARB_shading_language_420pack : require
#endif

struct BlendInfo
{
    float color_factor;
    vec4 color;
};

uniform BlendInfo blend_info;

uniform sampler2D texture_sampler_dst;
uniform sampler2D texture_sampler_src;

varying vec2 v_dst_texture_coords;
varying vec2 v_src_texture_coords;

void main()
{
    vec4 _415 = vec4(0.0);
    do
    {
        bool _220 = v_dst_texture_coords.x > 0.0;
        bool _226 = false;
        if (_220)
        {
            _226 = v_dst_texture_coords.y > 0.0;
        }
        else
        {
            _226 = _220;
        }
        bool _232 = false;
        if (_226)
        {
            _232 = v_dst_texture_coords.x < 1.0;
        }
        else
        {
            _232 = _226;
        }
        bool _238 = false;
        if (_232)
        {
            _238 = v_dst_texture_coords.y < 1.0;
        }
        else
        {
            _238 = _232;
        }
        if (_238)
        {
            _415 = texture2D(texture_sampler_dst, v_dst_texture_coords);
            break;
        }
        _415 = vec4(0.0);
        break;
    } while(false);
    vec4 _416 = vec4(0.0);
    do
    {
        if (_415.w == 0.0)
        {
            _416 = vec4(0.0);
            break;
        }
        _416 = vec4(_415.xyz / vec3(_415.w), _415.w);
        break;
    } while(false);
    vec4 _419 = vec4(0.0);
    if (blend_info.color_factor > 0.0)
    {
        _419 = blend_info.color;
    }
    else
    {
        vec4 _417 = vec4(0.0);
        do
        {
            bool _277 = v_src_texture_coords.x > 0.0;
            bool _283 = false;
            if (_277)
            {
                _283 = v_src_texture_coords.y > 0.0;
            }
            else
            {
                _283 = _277;
            }
            bool _289 = false;
            if (_283)
            {
                _289 = v_src_texture_coords.x < 1.0;
            }
            else
            {
                _289 = _283;
            }
            bool _295 = false;
            if (_289)
            {
                _295 = v_src_texture_coords.y < 1.0;
            }
            else
            {
                _295 = _289;
            }
            if (_295)
            {
                _417 = texture2D(texture_sampler_src, v_src_texture_coords);
                break;
            }
            _417 = vec4(0.0);
            break;
        } while(false);
        vec4 _418 = vec4(0.0);
        do
        {
            if (_417.w == 0.0)
            {
                _418 = vec4(0.0);
                break;
            }
            _418 = vec4(_417.xyz / vec3(_417.w), _417.w);
            break;
        } while(false);
        _419 = _418;
    }
    vec3 _358 = abs(_416.xyz - vec3(1.0));
    vec3 _379 = abs(_419.xyz);
    gl_FragData[0] = vec4(mix(mix(vec3(1.0) - min(vec3(1.0), (vec3(1.0) - _416.xyz) / _419.xyz), vec3(1.0), vec3(float(_358.x < 9.9999999747524270787835121154785e-07), float(_358.y < 9.9999999747524270787835121154785e-07), float(_358.z < 9.9999999747524270787835121154785e-07))), vec3(0.0), vec3(float(_379.x < 9.9999999747524270787835121154785e-07), float(_379.y < 9.9999999747524270787835121154785e-07), float(_379.z < 9.9999999747524270787835121154785e-07))) * _419.w, _419.w);
}

advanced_blend_colorburn.frag.gles after:

#version 120
#ifdef GL_ARB_shading_language_420pack
#extension GL_ARB_shading_language_420pack : require
#endif

struct BlendInfo
{
    float color_factor;
    vec4 color;
};

uniform BlendInfo blend_info;

uniform sampler2D texture_sampler_dst;
uniform sampler2D texture_sampler_src;

varying vec2 v_dst_texture_coords;
varying vec2 v_src_texture_coords;

void main()
{
    vec4 _296 = texture2D(texture_sampler_dst, v_dst_texture_coords) * (((max(sign(v_dst_texture_coords.x), 0.0) * max(sign(v_dst_texture_coords.y), 0.0)) * max(sign(1.0 - v_dst_texture_coords.x), 0.0)) * max(sign(1.0 - v_dst_texture_coords.y), 0.0));
    float _332 = _296.w;
    vec4 _382 = texture2D(texture_sampler_src, v_src_texture_coords) * (((max(sign(v_src_texture_coords.x), 0.0) * max(sign(v_src_texture_coords.y), 0.0)) * max(sign(1.0 - v_src_texture_coords.x), 0.0)) * max(sign(1.0 - v_src_texture_coords.y), 0.0));
    float _418 = _382.w;
    vec4 _244 = mix(vec4((_382.xyz / vec3(_418)) * max(sign(_418), 0.0), _418), blend_info.color, vec4(max(sign(blend_info.color_factor), 0.0)));
    vec3 _248 = vec4((_296.xyz / vec3(_332)) * max(sign(_332), 0.0), _332).xyz;
    vec3 _251 = _244.xyz;
    float _257 = _244.w;
    gl_FragData[0] = vec4(mix(mix(vec3(1.0) - min(vec3(1.0), (vec3(1.0) - _248) / _251), vec3(1.0), max(sign(vec3(0.5) - abs(sign(_248 - vec3(1.0)))), vec3(0.0))), vec3(0.0), max(sign(vec3(0.5) - abs(sign(_251))), vec3(0.0))) * _257, _257);
}

bdero · 2022-06-30T03:36:01Z

I did a whole bunch of profiling against Metal, and I believe none of these attempted optimizations work (at least on the iPhone 12 mini). Here's a doc with execution times for colorburn pulled from captures.

It starts off with a before/after with all of the "optimizations" in place, and then proceeds to add back each branch that was eliminated, and then I get rid of the branch avoidance I did in the initial implementation. At the end I switched back to the full branch elimination version of ColorBurn as a sanity check.

Consider the following three implementations of ColorBurn:

Version A (unapologetic branching):

vec3 Blend(vec3 dst, vec3 src) {
  // https://www.w3.org/TR/compositing-1/#blendingcolorburn
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  if (1 - dst.r < kEhCloseEnough) {
    color.r = 1;
  }
  if (1 - dst.g < kEhCloseEnough) {
    color.g = 1;
  }
  if (1 - dst.b < kEhCloseEnough) {
    color.b = 1;
  }
  if (src.r < kEhCloseEnough) {
    color.r = 0;
  }
  if (src.g < kEhCloseEnough) {
    color.g = 0;
  }
  if (src.b < kEhCloseEnough) {
    color.b = 0;
  }
  return color;
}

Version B (equivalent to the version before this PR):

vec3 IPVec3IsEqual(vec3 x, float y) {
  vec3 diff = abs(x - y);
  return vec3(diff.r < kEhCloseEnough,  //
              diff.g < kEhCloseEnough,  //
              diff.b < kEhCloseEnough);
}

vec3 Blend(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  color = mix(color, vec3(1), IPVec3IsEqual(dst, 1));
  color = mix(color, vec3(0), IPVec3IsEqual(src, 0));
  return color;
}

Version C (aggressive branch elimination):

vec3 Blend(vec3 dst, vec3 src) {
  // https://www.w3.org/TR/compositing-1/#blendingcolorburn
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  color = mix(color, vec3(1), 1 - abs(sign(dst - 1)));
  color = mix(color, vec3(0), 1 - abs(sign(src - 0)));
  return color;
}

Amusingly, A outperforms B, and B outperforms C.

So overall, findings for the iPhone 12 Mini are basically this: Branch as aggressively as possible, it's always faster, regardless of per-fragment variation. With this knowledge, I was able to knock almost 20% off the fragment time of ColorBurn.

Since this PR improves performance for some blends on Metal, I'll merge it, but I still need to profile against Android devices; I'm pretty sure we're going to come across some Android drivers that don't handle branching the same way. If that ends up being the case, we should be able to come up with a mix of library functions and macros to help get the best of both worlds.

dnfield · 2022-06-30T05:38:04Z

What about on GLES?

I'm a bit curious about whether the compiler is optimizing the branches or not. Also curious about whether metal does optimizations the GL driver will or won't.

bdero · 2022-06-30T06:49:50Z

What about on GLES?

We're probably going to find plenty of GL drivers that handle branching differently (this is what I meant when mentioning Android at the end of my previous comment). I'll be trying to profile an Android device I have on hand next.

Another thing that may be worthy of test is how this fares on Intel macs -- my 2019 MacBook Pro 16-inch has an AMD Radeon Pro 5300M, for example, so the driver is pumping out GCN instructions instead of Apple... sauce? Problem is xcode frame capturing works, but profiling doesn't.

I'm a bit curious about whether the compiler is optimizing the branches or not.

shaderc is outputting very similar shader code between GLSL and Metal -- it's keeping all the branches, inlining everything, and doing the weird do-while scope trick and everything. Probably none of that is necessary in practice. As for the Metal compiler and bytecode: That's all proprietary, so your guess is as good as mine. :) I'm sure many people have reverse engineered the bytecode but I'm steering clear of that.

…lends (flutter/engine#34349)

Start shader library and optimize advanced blends

3f3a746

bdero added affects: engine e: impeller labels Jun 28, 2022

bdero requested review from chinmaygarde, dnfield, flar and zanderso June 28, 2022 10:07

Licenses

123f36e

flar reviewed Jun 28, 2022

View reviewed changes

chinmaygarde approved these changes Jun 28, 2022

View reviewed changes

bdero added 2 commits June 28, 2022 14:03

Licenses 2

7080b6f

Address comments

8f7951b

Follow the profiler

1bcb881

bdero merged commit 1486cc5 into flutter:main Jun 30, 2022

engine-flutter-autoroll mentioned this pull request Jun 30, 2022

Roll Flutter Engine from a8797796d444 to 1486cc56eff6 (3 revisions) flutter/flutter#106898

Merged

engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request Jun 30, 2022

1486cc5 [Impeller] Add shader library; reduce branching in advanced b…

b8cad01

…lends (flutter/engine#34349)


		#include <impeller/constants.glsl>

		vec3 EqualTo3(vec3 x, float y) {

[Impeller] Add shader library; reduce branching in advanced blends #34349

[Impeller] Add shader library; reduce branching in advanced blends #34349

Uh oh!

Conversation

bdero commented Jun 28, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chinmaygarde left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bdero commented Jun 29, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bdero commented Jun 30, 2022

Uh oh!

dnfield commented Jun 30, 2022

Uh oh!

bdero commented Jun 30, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

bdero commented Jun 28, 2022 •

edited

Loading

bdero commented Jun 29, 2022 •

edited

Loading