Skip to content
This repository was archived by the owner on Feb 25, 2025. It is now read-only.

Conversation

@bdero
Copy link
Member

@bdero bdero commented Jun 28, 2022

Introduce a library path for shaders and place some common utilities with header guards within. Remove a bunch of branches from advanced blends.

vec4 SampleWithBorder(sampler2D tex, vec2 uv) {
float within_bounds = GreaterThan(uv.x, 0) * GreaterThan(uv.y, 0) *
LessThan(uv.x, 1) * LessThan(uv.y, 1);
return texture(tex, uv) * within_bounds;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it samples transparent black outside of the texture? Which is kDecal, not kClamp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this clamps to 0 right at the edge of the texture, but kDecal will fade to 0 from the center of the edge pixel to the center of the pixel next to it that is outside of the texture.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(assuming kLinear interpolation)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it samples transparent black outside of the texture? Which is kDecal, not kClamp.

This is using Metal/GL/Vulkan terminology, where Skia's kDecal == ClampToBorder/ClampToBorderColor(0,0,0,0), and kClamp == ClampToEdge. The main reason for this is that Impeller's render layer uses this terminology, and anyone using this function is likely to be interfacing directly with Impeller's render layer.

That being said, if we were to ever expose this shader library to Flutter users, I think we should strongly consider using Skia terminology.

Also, this clamps to 0 right at the edge of the texture, but kDecal will fade to 0 from the center of the edge pixel to the center of the pixel next to it that is outside of the texture.

Hmm yes, that makes sense. This is tricky to handle right now given it's dependent on the sampler config... I suspect we'll end up being able to flip on whatever extensions we need for ClampToBorder sampling nearly 100% of the time in practice, so hopefully this problem will just magically get solved as we continue optimizing shaders.

Copy link
Contributor

@chinmaygarde chinmaygarde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extremely supportive of a shader support library. Did you measure the performance differences in the advanced blends? Some nits, otherwise LGTM. For the GLSL, can you also paste the before and after to see check inlining?


#include <impeller/constants.glsl>

vec3 EqualTo3(vec3 x, float y) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using C style "namespacing" for this library code? For instance, "IPVec3IsEqual"? "IP" the Impeller namespace, "Vec3" the class and then the method name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done. I also ended up renaming the "Mix" functions to better reflect their utility: IPVec3ChooseCutoff and IPVec3Choose.

#ifndef BRANCHING_GLSL_
#define BRANCHING_GLSL_

#include <impeller/constants.glsl>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we settle on a naming for SL code in a header? How about ".sl.h"? Should make it easier for editors and such to highlight by default. And also denote that the file is meant to be included. .glsl is also fine I suppose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I went for .glsl here was that vscode w/ the popular shader language extension treats it the same as .frag and .vert:

  • .sl.h activates clangd and use C syntax highlighting, so GLSL builtins aren't highlighted/don't complete and the file gets flooded with nonsensical errors.
  • .glsl, .vert, and .frag all resolve to GLSL highlighting. The #include lines are gracefully ignored.

I'm not married to this and am totally cool with revisiting. At some point it might be worth making editor plugins that support shaderc features and configurable include paths for code navigation. I haven't really experienced such goodness outside of ShaderLab. :)

return 1 - abs(sign(x - y));
}

float GreaterThan(float x, float y) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere, some headerdoc comments? Simple ones to start off with are fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return max(sign(x - y), 0);
}

float LessThan(float x, float y) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, floats as booleans could be made more readable using a typedef (or define more likely) in constants.h. How about BooleanF or BoolF?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, added BoolF and BoolV[2-4]. We could add a bunch of convenience methods for logic gates too and use this convention for inputs, but gonna wait to see if a realistic need arises.

@bdero
Copy link
Member Author

bdero commented Jun 29, 2022

@chinmaygarde

Did you measure the performance differences in the advanced blends?

I took some captures while removing the branches, but I need to come up with a less noisy way of capturing. There's way too much variance from run to run.

I've been testing colorburns that appear in one of our apps, and my hypothesis is that these branch elimination tricks have a slightly negative effect on any branch that only involves uniforms, so I aught to make both branched and unbranched versions of some things.
For colorburn/colordodge in particular, I actually wrote the original shader using branch removal tricks, so I haven't yet captured the difference they make here (the only difference is easier reading :)).

TL;DR: I'm going to spend some more time here to figure out what improves execution speed and what doesn't. I also need to de-branch and test the non-seperable blends like Luminosity.

For the GLSL, can you also paste the before and after to see check inlining?

advanced_blend_colorburn.frag.gles before:

#version 120
#ifdef GL_ARB_shading_language_420pack
#extension GL_ARB_shading_language_420pack : require
#endif

struct BlendInfo
{
    float color_factor;
    vec4 color;
};

uniform BlendInfo blend_info;

uniform sampler2D texture_sampler_dst;
uniform sampler2D texture_sampler_src;

varying vec2 v_dst_texture_coords;
varying vec2 v_src_texture_coords;

void main()
{
    vec4 _415 = vec4(0.0);
    do
    {
        bool _220 = v_dst_texture_coords.x > 0.0;
        bool _226 = false;
        if (_220)
        {
            _226 = v_dst_texture_coords.y > 0.0;
        }
        else
        {
            _226 = _220;
        }
        bool _232 = false;
        if (_226)
        {
            _232 = v_dst_texture_coords.x < 1.0;
        }
        else
        {
            _232 = _226;
        }
        bool _238 = false;
        if (_232)
        {
            _238 = v_dst_texture_coords.y < 1.0;
        }
        else
        {
            _238 = _232;
        }
        if (_238)
        {
            _415 = texture2D(texture_sampler_dst, v_dst_texture_coords);
            break;
        }
        _415 = vec4(0.0);
        break;
    } while(false);
    vec4 _416 = vec4(0.0);
    do
    {
        if (_415.w == 0.0)
        {
            _416 = vec4(0.0);
            break;
        }
        _416 = vec4(_415.xyz / vec3(_415.w), _415.w);
        break;
    } while(false);
    vec4 _419 = vec4(0.0);
    if (blend_info.color_factor > 0.0)
    {
        _419 = blend_info.color;
    }
    else
    {
        vec4 _417 = vec4(0.0);
        do
        {
            bool _277 = v_src_texture_coords.x > 0.0;
            bool _283 = false;
            if (_277)
            {
                _283 = v_src_texture_coords.y > 0.0;
            }
            else
            {
                _283 = _277;
            }
            bool _289 = false;
            if (_283)
            {
                _289 = v_src_texture_coords.x < 1.0;
            }
            else
            {
                _289 = _283;
            }
            bool _295 = false;
            if (_289)
            {
                _295 = v_src_texture_coords.y < 1.0;
            }
            else
            {
                _295 = _289;
            }
            if (_295)
            {
                _417 = texture2D(texture_sampler_src, v_src_texture_coords);
                break;
            }
            _417 = vec4(0.0);
            break;
        } while(false);
        vec4 _418 = vec4(0.0);
        do
        {
            if (_417.w == 0.0)
            {
                _418 = vec4(0.0);
                break;
            }
            _418 = vec4(_417.xyz / vec3(_417.w), _417.w);
            break;
        } while(false);
        _419 = _418;
    }
    vec3 _358 = abs(_416.xyz - vec3(1.0));
    vec3 _379 = abs(_419.xyz);
    gl_FragData[0] = vec4(mix(mix(vec3(1.0) - min(vec3(1.0), (vec3(1.0) - _416.xyz) / _419.xyz), vec3(1.0), vec3(float(_358.x < 9.9999999747524270787835121154785e-07), float(_358.y < 9.9999999747524270787835121154785e-07), float(_358.z < 9.9999999747524270787835121154785e-07))), vec3(0.0), vec3(float(_379.x < 9.9999999747524270787835121154785e-07), float(_379.y < 9.9999999747524270787835121154785e-07), float(_379.z < 9.9999999747524270787835121154785e-07))) * _419.w, _419.w);
}

advanced_blend_colorburn.frag.gles after:

#version 120
#ifdef GL_ARB_shading_language_420pack
#extension GL_ARB_shading_language_420pack : require
#endif

struct BlendInfo
{
    float color_factor;
    vec4 color;
};

uniform BlendInfo blend_info;

uniform sampler2D texture_sampler_dst;
uniform sampler2D texture_sampler_src;

varying vec2 v_dst_texture_coords;
varying vec2 v_src_texture_coords;

void main()
{
    vec4 _296 = texture2D(texture_sampler_dst, v_dst_texture_coords) * (((max(sign(v_dst_texture_coords.x), 0.0) * max(sign(v_dst_texture_coords.y), 0.0)) * max(sign(1.0 - v_dst_texture_coords.x), 0.0)) * max(sign(1.0 - v_dst_texture_coords.y), 0.0));
    float _332 = _296.w;
    vec4 _382 = texture2D(texture_sampler_src, v_src_texture_coords) * (((max(sign(v_src_texture_coords.x), 0.0) * max(sign(v_src_texture_coords.y), 0.0)) * max(sign(1.0 - v_src_texture_coords.x), 0.0)) * max(sign(1.0 - v_src_texture_coords.y), 0.0));
    float _418 = _382.w;
    vec4 _244 = mix(vec4((_382.xyz / vec3(_418)) * max(sign(_418), 0.0), _418), blend_info.color, vec4(max(sign(blend_info.color_factor), 0.0)));
    vec3 _248 = vec4((_296.xyz / vec3(_332)) * max(sign(_332), 0.0), _332).xyz;
    vec3 _251 = _244.xyz;
    float _257 = _244.w;
    gl_FragData[0] = vec4(mix(mix(vec3(1.0) - min(vec3(1.0), (vec3(1.0) - _248) / _251), vec3(1.0), max(sign(vec3(0.5) - abs(sign(_248 - vec3(1.0)))), vec3(0.0))), vec3(0.0), max(sign(vec3(0.5) - abs(sign(_251))), vec3(0.0))) * _257, _257);
}

@bdero
Copy link
Member Author

bdero commented Jun 30, 2022

I did a whole bunch of profiling against Metal, and I believe none of these attempted optimizations work (at least on the iPhone 12 mini). Here's a doc with execution times for colorburn pulled from captures.

It starts off with a before/after with all of the "optimizations" in place, and then proceeds to add back each branch that was eliminated, and then I get rid of the branch avoidance I did in the initial implementation. At the end I switched back to the full branch elimination version of ColorBurn as a sanity check.

Consider the following three implementations of ColorBurn:

  • Version A (unapologetic branching):
vec3 Blend(vec3 dst, vec3 src) {
  // https://www.w3.org/TR/compositing-1/#blendingcolorburn
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  if (1 - dst.r < kEhCloseEnough) {
    color.r = 1;
  }
  if (1 - dst.g < kEhCloseEnough) {
    color.g = 1;
  }
  if (1 - dst.b < kEhCloseEnough) {
    color.b = 1;
  }
  if (src.r < kEhCloseEnough) {
    color.r = 0;
  }
  if (src.g < kEhCloseEnough) {
    color.g = 0;
  }
  if (src.b < kEhCloseEnough) {
    color.b = 0;
  }
  return color;
}
  • Version B (equivalent to the version before this PR):
vec3 IPVec3IsEqual(vec3 x, float y) {
  vec3 diff = abs(x - y);
  return vec3(diff.r < kEhCloseEnough,  //
              diff.g < kEhCloseEnough,  //
              diff.b < kEhCloseEnough);
}

vec3 Blend(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  color = mix(color, vec3(1), IPVec3IsEqual(dst, 1));
  color = mix(color, vec3(0), IPVec3IsEqual(src, 0));
  return color;
}
  • Version C (aggressive branch elimination):
vec3 Blend(vec3 dst, vec3 src) {
  // https://www.w3.org/TR/compositing-1/#blendingcolorburn
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  color = mix(color, vec3(1), 1 - abs(sign(dst - 1)));
  color = mix(color, vec3(0), 1 - abs(sign(src - 0)));
  return color;
}

Amusingly, A outperforms B, and B outperforms C.

So overall, findings for the iPhone 12 Mini are basically this: Branch as aggressively as possible, it's always faster, regardless of per-fragment variation. With this knowledge, I was able to knock almost 20% off the fragment time of ColorBurn.

Since this PR improves performance for some blends on Metal, I'll merge it, but I still need to profile against Android devices; I'm pretty sure we're going to come across some Android drivers that don't handle branching the same way. If that ends up being the case, we should be able to come up with a mix of library functions and macros to help get the best of both worlds.

@dnfield
Copy link
Contributor

dnfield commented Jun 30, 2022

What about on GLES?

I'm a bit curious about whether the compiler is optimizing the branches or not. Also curious about whether metal does optimizations the GL driver will or won't.

@bdero
Copy link
Member Author

bdero commented Jun 30, 2022

What about on GLES?

We're probably going to find plenty of GL drivers that handle branching differently (this is what I meant when mentioning Android at the end of my previous comment). I'll be trying to profile an Android device I have on hand next.

Another thing that may be worthy of test is how this fares on Intel macs -- my 2019 MacBook Pro 16-inch has an AMD Radeon Pro 5300M, for example, so the driver is pumping out GCN instructions instead of Apple... sauce? Problem is xcode frame capturing works, but profiling doesn't.

I'm a bit curious about whether the compiler is optimizing the branches or not.

shaderc is outputting very similar shader code between GLSL and Metal -- it's keeping all the branches, inlining everything, and doing the weird do-while scope trick and everything. Probably none of that is necessary in practice. As for the Metal compiler and bytecode: That's all proprietary, so your guess is as good as mine. :) I'm sure many people have reverse engineered the bytecode but I'm steering clear of that.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

No open projects
Archived in project

Development

Successfully merging this pull request may close these issues.

4 participants