There are 2X AES throughput improvements available here: #42726.
However, the assembly implementations are too big:
https://go-review.googlesource.com/c/go/+/286852/comments/1b1a6e65_4e27a3f8
From the outside, it seems the people able to review crypto CLs are stretched thin when problems like #53142 are still open years later.
On the flip side, I haven't had any problem getting things merged in the compiler; even complex-ish, tricky code like bac4e2f was reviewed and merged in time.
We even had someone show up one day with a whole new pass, and it was merged within weeks in the same release cycle (which is hard to say about crypto things). (It was reverted, but it was fixed and added back, and it's on track for release in go1.22.) https://github.com/golang/go/commits/master/src/cmd/compile/internal/ssa/sccp.go
So the crypto people seem stretched; the compiler people don't look like they are.
The crypto people don't like writing assembly; the compiler people write code that writes assembly all day.
What if we took the assembly part of crypto and gave it to the compiler people?
Proposal
We could create a package in internal/intrinsics/$GOARCH/; the point of it being in internal is that we don't need to get the ergonomics perfect and can evolve them.
It is surprisingly easy and adds little complexity to wire an intrinsic into the compiler (example in https://go-review.googlesource.com/c/go/+/548318; don't look at the whole CL, it's quite big because it does other things, just the memclrPointers definition and rules).
Keeping compiler complexity minimal, we could implement AESENC this way.
First, create body-less functions:

```go
// internal/intrinsics/amd64/aes.go
package amd64

// Requires AES CPUID
func Aesenc(data, key ISimd128) ISimd128
func Aesdec(data, key ISimd128) ISimd128
```
The compiler would rely on the consumer properly checking CPUID bits: if I call amd64.Aesenc, there is no attempt to fall back if it's unsupported; the compiler always emits the AESENC instruction.
ISimd128 would be a magic type the type checker needs to know about:
```go
// internal/intrinsics/amd64/aes.go
package amd64

type ISimd128 ISimd128

// We could also only have the bytes ←→ SIMD conversions, and rely on
// optimizations to generate bigger versions.
func I8x16To128(x [16]uint8) ISimd128
func I16x8To128(x [8]uint16) ISimd128
func I32x4To128(x [4]uint32) ISimd128
func I64x2To128(x [2]uint64) ISimd128
func I128To8x16(x ISimd128) [16]uint8
func I128To16x8(x ISimd128) [8]uint16
func I128To32x4(x ISimd128) [4]uint32
func I128To64x2(x ISimd128) [2]uint64

// equivalent register type for floats
type FSimd128 FSimd128
```
We could also use types like [16]byte or [2]uint64 directly as the simd register; however, this makes the compiler more complex because it becomes responsible for promoting arrays to vector registers.
Then in a .rules file we can lower it:

```
(CALLstatic {sym} data key mem)
  && isSameCall(sym, "internal/intrinsics/amd64.Aesenc")
  => (MakeResult (AESENC <v.Type.Field(0)> data key) mem)
```
Because this is an internal package, I think it is acceptable to attribute the intrinsic's line numbers to the caller.
We already have a solid framework to merge these register-operand instructions with memory operands by combining a few more .rules and addressingmodes.go.
regalloc is already able to handle tricky register constraints like AESENC, which uses one register as both source and destination.
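For illustration, a load-merging rule in that framework might look like the sketch below. `AESENCload` is a hypothetical op; `MOVOload`, `canMergeLoad`, and `clobber` mirror the shape of existing amd64 rules, but the exact pattern here is my assumption, not a tested rule:

```
(AESENC x l:(MOVOload [off] {sym} ptr mem)) && canMergeLoad(v, l) && clobber(l)
  => (AESENCload x [off] {sym} ptr mem)
```

This is the same mechanism already used to fold loads into arithmetic ops, so AESENC would get memory operands essentially for free.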
Memory-wise, the ISimd128 type would have 16-byte alignment and would be usable in struct fields, so you can store it on the heap or wherever. The compiler would be able to emit a simple MOVDQA.
The complex part for the compiler is when indexing inside the simd register type is allowed. Something like:
```go
type ISimd128 [16]byte

// ... reg1 and reg2 are ISimd128
var x [32]byte
copy(x[:], reg1[:])
copy(x[16:], reg2[:])
copy(reg1[:], x[8:])
```
In this particular case the compiler would need to know whether it's better to go through memory or be smart (here, staying in register land and using PINSRQ would be best).
This is solved by not allowing it at all: type ISimd128 ISimd128 does not allow indexing of inner elements.
We would instead expose a Shuffle64 intrinsic.
For this limited use case I don't think we need to apply generic optimizations; the goal is a nicer way to write assembly, not compiler-optimizer code.
On amd64 in particular, all CPUs (except Zen 4) pay a cost when mixing the older MMX and SSE encodings with the newer VEX and EVEX encodings.
VEX and EVEX allow using the 256-bit and 512-bit registers and the newer AVX, AVX2, and AVX-512 instruction families.
To handle this we would add a new directive (probably also limited to the std module):
vex-marked functions would only be allowed to call other vex-marked functions, because SSE is used in almost all functions for zeroing or copying fixed-size elements.
This means a vex function couldn't call the runtime; for crypto use cases this is fine, as it is customary for the actual crypto routines to use out parameters instead of allocating and returning a result.
When calling a vex function from a non-vex function, the compiler would inject vzeroupper after the call. This way vex functions can call other vex functions and pass ISimd* arguments and return values through the ymm and zmm registers.
This is different from previous proposals because it does not aim to provide a generic solution.
We would still have an implementation for each architecture we want assembly for.
This aims at replacing assembly: we wouldn't spend much time adding optimizations in the compiler; instead we would add access to more instructions and let people write these optimizations themselves.
The researched gains over assembly are: