(force-pushed 63f3bb2 to 9057c67)
So, at least on newer GCCs, the compiler recognizes the constant shift and uses the immediate shift instruction, so this makes basically zero difference in the compiled binary (though it could on an older or worse compiler). I do have some micro-optimizations that save a trip to the stack and back for s1 and s2 as well. There's an obvious horizontal-add optimization that could be done if not for the goofy base Mark Adler picked, which we have to do an integer modulus against. Because NMAX is the maximum number of rounds we can do before possibly overflowing relative to the base, we have to do the modulo operation on every vector element. That requires integer division (a no-go for SIMD, obviously), or full 64-bit products (absent from AVX2), shifts, and subtractions, which is what the compiler actually emits for the ALU. With AVX512 I should be able to do exactly this, and it should be faster for a larger stream of bytes.
@KungFuJesus AVX512 indeed has better support for single-vector horizontal add, which has been missing since AMD dropped the XOP instructions... I haven't added support for it yet as most PCs currently in use still don't support it... Also, I'm not sure using 512-bit vectors will give much gain, as adler32 hardly ever needs to checksum byte sequences that long.
I just purchased a Cascade Lake X CPU, so I should be able to have something that does this shortly. Adler32 checksums tend to cover things like the entire zlib-compressed row of bytes in a PNG. Realistically, for this to be a danger you'd need something larger than could possibly be rendered; still, as a generic checksum, there are cases where it could happen. A thought did occur to me, though: instead of rebasing each vector element to the prime base of the adler checksum, you really only need to rebase in such a way that the partial sums, when horizontally summed together, don't overflow 32-bit precision. To do that, you could do a modulo with 2^32/8, which is a power of two. Simply AND'ing with 2^29-1, and only doing the prime modulus once the value is a scalar in a GPR, is probably feasible, do you agree?
Ahh, scratch that idea, it was dumb. For that to work, the power-of-two modulus would need to be a multiple of the prime, or we lose the byproduct of the modular arithmetic. Still, full 64-bit multiplication with AVX512 should let me do the same operations the compiler generates on the ALU in GPRs for each vector element before the horizontal sum. Or even just summing into a higher precision and doing the modulus at the end could get us there.
@mtl1979 I just pushed another commit to this PR that has a small but real impact on efficiency.
(force-pushed 4b2d136 to dbccab4)
@KungFuJesus Masked loads require an AVX512-capable processor, otherwise the set1 intrinsics can't be mapped to a single instruction. Bitwise operations on vector registers can be really slow on older processors.
I'm not using masked loads (though AVX does support them, mostly in the float domain). The set1 intrinsic is typically shorthand for the vpbroadcastb/w/d/q family of instructions. I'm applying a mask after creating full-sized ymm registers so that, following the relatively quick broadcast, I can zero out all but the carry-over from the last round of sums. It's shorthand for the s1[7] = adler behavior the code was already doing. In the integer domain, on every AVX2-capable CPU I've seen, bitwise masks are very low latency and can often execute on multiple ports, at least two per cycle.

Basically, the reason I broadcast to all 8 elements and then zero out the first 7 is that there's no idiom for setting the last element of a vector to a value (though there is a "scalar" move to set the first). If the vectors were only 128 bits wide we could bit-shift, but with AVX2 you can't shift across the 128-bit lanes. We could do a permute instead, but that is also a slower sequence. The vpinsr* instructions could conceivably let us insert into a vector at a fixed index position, but for 256-bit vectors the "sequence" they compile to also involves building two 128-bit vectors and combining them. This code compiles cleanly for Haswell but should work on any AVX2-capable CPU.
@mtl1979 All the intrinsics used start with _mm256_, how is that AVX512? Edit: oh, maybe you were replying to a different comment. Never mind.
I assure you it isn't, I don't even have an AVX512-capable CPU in hand yet, hah. I tested this code thoroughly; -mavx2 should be all you need for GCC flags.
@nmoinvaz Some move/load instructions can map to a combination of SSE and AVX instructions on processors that don't support AVX512. This will cause a slowdown on processors where there is a high penalty for moving data between the SSE and AVX units. AVX512-capable processors can move data from general-purpose registers directly to the AVX unit.

@KungFuJesus I have a Haswell-based processor and it's really slow executing bitwise operations on vector registers... It's basically faster to move the data to general-purpose registers and do the same there. To sum things up, it may be that gcc can fix the slowdowns by mapping the slow intrinsics to faster equivalents, but that's what the original code already tried to do. As such, we need to benchmark the pull request with and without the set1+and combination. I didn't notice anything else wrong in the changes, but it was the middle of the night and I was reading the patch on my phone...
Every compiler worth anything will translate these intrinsics directly. Here's what GCC compiles the set1 + AND to: it takes the function call arguments, passed via registers, uses the VEX-encoded move to put them in XMM registers, then broadcasts them. The ymm6 mask register is loaded from memory once, at the beginning of the function.
@KungFuJesus That assembler output is still wrong...
I'm not sure I follow... the AND operation that happens here is with the mask; the same happens for xmm8 (I don't recall which is s1 and which is s2). This is effectively the same thing as your memset() + s[7] = adler sequence, as the compiler has to generate code that populates the ymm register at that vector position; this just does it in fewer cycles. I assure you that broadcasting an xmm register to a ymm register is idiomatic for setting all lanes of the ymm register, and it is not dead code (the compiler, at this optimization level, would certainly have eliminated it in an early stage).

That is, the old code uses a zeroing idiom to fill the xmm register, then moves the zeros to a location on the stack, along with the values in the registers to an offset at an aligned location. Finally, it moves the zeros generated in xmm0 to a contiguous stack location just before where the actual values are stored, giving a run of leading zeros followed by the value. The code I'm proposing skips all of that stack churn; while that amounts to a micro-optimization, it's fewer instructions and faster.

The real optimization, though, lies in not having to do a modulus on every vector element that gets extracted. That saves 8 separate 64-bit integer multiplications, shifts, and subtractions.
@KungFuJesus Your explanation just asserts that it's basically doing vector element reversal... As far as I remember vector units are big-endian even on little-endian CPUs... XMM and YMM registers with equal numbers do overlay on x86_64.
Yes... the generated code is exploiting that fact. I'm laying out positions in memory order here, which of course fights with the little-endian convention sometimes used to describe the vector registers (e.g. the setr vs. set intrinsics). Ignoring all that, you were doing the memset() + s[7] = adler sequence. Edit: Are you maybe confused by the vpbroadcastd ymm7, xmm7? Yes, the registers overlay, but this is valid and is idiomatic for setting all values in a vector register to the first value (just as vpbroadcastd xmm7, xmm7 would be). It does not blow away what's already in the register any more than an instruction that accumulates back into the same register does.
@KungFuJesus Like I said earlier, I'm not rejecting the code, I just want to make sure it's the fastest possible way to do things and that future developers know why it is done this way... This is where proper inline comments are useful...
In general, if you want all values in a vector register to be the same, a broadcast is the idiomatic way to get there. Is there anything else you think merits explanatory comments?
@KungFuJesus Part of what I do here requires me to think as a dumber person than I actually am... If that makes me unsure about something, then most likely some other contributors are unsure also... A lot of existing code could warrant better explanation, but my job here is not to comment on existing code, only on code that is modified or added...

I have been around since the early days of zlib-ng, but I haven't reviewed every commit or pull request, because I also have other projects to work on, and sometimes I've had to reinstall my operating system to fix issues, which cut into the time I have for third-party projects. I will do a final review after one of the other reviewers has done benchmarking, so we can see how much faster this code actually is compared to baseline. I usually don't do benchmarking myself, as I run Linux virtualized, so it doesn't have high-precision timers.
Would that be @Dead2 we're waiting on, @nmoinvaz? I could run the benchmarks on a variety of hardware as well, just let me know.
@KungFuJesus we usually run benchmarks using https://github.com/zlib-ng/deflatebench. Compile two binaries, one with and one without the changes, and then run it for each. Here is an example: #934 (comment)
Hmm, weird, GitHub says it's fine. Where are the conflicts?
Here are my benchmark tests. Here is my code: https://gist.github.com/nmoinvaz/b31b11c35724a549476b37da4ef8cd17. All you have to do is clone https://github.com/google/benchmark into the same directory and then run CMake. Using median values, there is possibly a 12% improvement?
It would be better if the style-nits commit were squashed (and, for that particular change, if the same number of array elements were on each line), but it is not a condition of approval for me, so I have approved this PR.
Awesome, that proper benchmark should come in handy.
@Dead2 I can rebase onto develop if needed, but it doesn't look like GitHub presently has any issues merging this.
Here is also an old adler32 benchmark I made once that is meant to be compiled with zlib-ng: https://gist.github.com/nmoinvaz/8bdc503804e17164b2d33312764feecc
It does not tell me what the conflict is for some reason. If you could rebase, lose the merge commit, and squash the last 3 fix-commits, leaving a clean and concise 3 commits, that would be great. Hopefully that will resolve the conflict problem at the same time. 👍
Ugh, I've already branched from this branch to work on an AVX512 variant. I should be able to, but it's going to require an annoying amount of git history revision to get there. The one thing I don't love about rebase is how it royally screws anything that managed to fork from an earlier state. I'll see about squashing this somehow.
It is a pain, but it does make for a cleaner git history in the repository. There are times when I have made PRs in zlib-ng and had to do lots of rebasing and reworking, and waited many months for acceptance. I think you're getting close. I'm looking forward to the AVX512 variant even though I don't have hardware to test it on.
Since this is constant, anyway, we may as well use the variant that doesn't add vector register pressure, has better ILP opportunities, and has shorter instruction latency.
This now leverages the broadcasting intrinsics with an AND mask to load up the registers. Additionally, there's a minor efficiency boost here from casting up to 64-bit precision (by means of register aliasing) so that the modulo can be safely deferred until the write back to the full sums. The "write" back to the stack here is actually optimized out by GCC and turned into a write directly to a 32-bit GPR for each of the 8 elements. That much is not new, but now, since we don't have to do a modulus with the BASE value, we can bypass 8 64-bit multiplications, shifts, and subtractions while in those registers.

I tried to do a horizontal reduction sum on the 8 64-bit elements, since the vpextract* instructions aren't exactly low latency. However, doing this safely (no overflow) requires 2 128-bit register extractions, 8 vpmovsxdq to bring things up to 64-bit precision, some shuffles, more 128-bit extractions to get around the 128-bit lane restriction of the shuffles, and finally a trip to a GPR and back to do the modulus on the scalar value. This method could have been more efficient if there were an inexpensive 64-bit horizontal add instruction for AVX, but there isn't.

To test this, I wrote a pretty basic benchmark using Python's zlib bindings on a huge set of random data, carefully timing only the checksum bits. Invoking perf stat from within the Python process after the RNG shows a lower average number of cycles to complete and a shorter runtime.
(force-pushed 02f4630 to aa21923)
Like every time I attempt a rebase, git and I both got royally confused and I ended up having to reconstruct half the state of the head of the current branch on the remote via a squashing commit :-/. Anyway, it should be broken down into the desirable commits now; let me know if I screwed something up (I did make a backup of the pre-rebase repo locally).
LGTM |
(force-pushed aa21923 to a5ac452)
@KungFuJesus You might want to try out git fixup, have a look at https://github.com/zlib-ng/zlib-ng/wiki/Git-workflow-tips if you are not familiar with it. 😄
I mean, that's essentially what I did for the squash. That wasn't the issue I had; the issue was the rebase causing headaches with a merge commit that had been established on this branch from another branch early in development (getting working horizontal sums was done on a different branch). The conflict resolution caused headaches because it was trying to resolve deltas against older commits that had since been reworked by early amended commits and force pushes from all of those fixup changes. It's working now at least, with maybe some less satisfactory commit states in between, but I'm pretty sure it'll at least compile and bisect fine. I just really dislike force-push workflows because they break anything that established history from any point before the push.
I should mention, if it isn't already obvious, that the failure in the CI checks seems to be a Debian packaging issue with the version of wine being installed. An apt issue rather than a real one.
@KungFuJesus I understand the problems 😄 And yes, the failing CI is an everlasting problem it seems. The GitHub Actions build platforms are constantly changing, causing new problems every 1-2 months or so 🙄
For some reason the movq instruction from a 128 bit register to a 64 bit GPR is not supported in 32 bit code. A simple workaround seems to be to invoke movl if compiling with -m32. Also addressing some style nits.
(force-pushed a5ac452 to eddfcf7)
Hah, I came to the (now obvious) realization today that NMAX is derived to be the maximum number of scalar sums that can be performed before overflowing. This means the trip to 64-bit-wide words wasn't even necessary; we can do a plain old 32-bit horizontal sum safely and probably save a handful of cycles. If my math is right, it should be 12 cycles saved for the conversion, plus a cycle or two for the additional sums. Of course, this also means the modulus operation probably could have been done on the FPU, but I'd wager it'd be more costly than doing it with GPRs, thanks to register aliasing being so inexpensive and many of the instructions simply moving between GPRs and AVX registers instead of needing to go through the stack.
