There are 2X AES throughput improvements available here: #42726.
However, the assembly implementations are too big:
https://go-review.googlesource.com/c/go/+/286852/comments/1b1a6e65_4e27a3f8
From the outside, it seems the people able to review crypto CLs are stretched thin when problems like #53142 are still open years later.
On the flip side, I haven't had any problem getting things merged in the compiler; even complex-ish, tricky code like bac4e2f was reviewed and merged in time.
We even had someone show up one day with a whole new pass, and it was merged within weeks in the same release cycle (which is hard to say about crypto things). (It was reverted, but it was fixed and added back, and it's on track for release in go1.22.) https://github.com/golang/go/commits/master/src/cmd/compile/internal/ssa/sccp.go
So the crypto people seem stretched; the compiler people don't look like they are.
The crypto people don't like writing assembly; the compiler people write code that writes assembly all day.
What if we took the assembly part of crypto and gave it to the compiler people?
Proposal
We could create a package in internal/intrinsics/$GOARCH/; the point of it being in internal is that we don't need to get the ergonomics perfect and can evolve them.
It is surprisingly easy and adds little complexity to wire an intrinsic into the compiler (example in https://go-review.googlesource.com/c/go/+/548318; don't look at the whole CL, it's quite big because it does other things, just the memclrPointers definition and rules).
Keeping compiler complexity minimal, we could implement AESENC this way.
First, create body-less functions:

```go
// internal/intrinsics/amd64/aes.go
package amd64

// Requires AES CPUID
func Aesenc(data, key ISimd128) ISimd128
func Aesdec(data, key ISimd128) ISimd128
```
The compiler would rely on the consumer properly checking CPUID bits: if I call amd64.Aesenc, there is no attempt to fall back if it's unsupported; the compiler always emits the AESENC instruction.
ISimd128 would be a magic type the type checker needs to know about:
```go
// internal/intrinsics/amd64/aes.go
package amd64

type ISimd128 ISimd128

// We could also only have the bytes ←→ SIMD conversions, and rely on
// optimizations to generate bigger versions.
func I8x16To128(x [16]uint8) ISimd128
func I16x8To128(x [8]uint16) ISimd128
func I32x4To128(x [4]uint32) ISimd128
func I64x2To128(x [2]uint64) ISimd128
func I128To8x16(x ISimd128) [16]uint8
func I128To16x8(x ISimd128) [8]uint16
func I128To32x4(x ISimd128) [4]uint32
func I128To64x2(x ISimd128) [2]uint64

// equivalent register type for floats
type FSimd128 FSimd128
```
We could also use types like [16]byte or [2]uint64 directly as the simd register; however, this makes the compiler more complex because it becomes responsible for promoting arrays to vector registers.
Then in a .rules file we can lower it:

```
(CALLstatic {sym} data key mem)
  && isSameCall(sym, "internal/intrinsics/amd64.Aesenc")
  => (MakeResult (AESENC <v.Type.Field(0)> data key) mem)
```
Because this is an internal package, I think it is acceptable to attribute the intrinsic's line numbers to the caller.
We already have a solid framework to merge these register-operand instructions with memory operands by combining a few more .rules and addressingmodes.go.
regalloc is already able to handle tricky register constraints like AESENC, which uses one register as both source and destination.
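For illustration, a load-merging rule in that framework might look like the sketch below. `AESENCload` is a hypothetical op; `MOVOload`, `canMergeLoad`, and `clobber` mirror the shape of existing amd64 rules, but the exact pattern here is my assumption, not a tested rule:

```
(AESENC x l:(MOVOload [off] {sym} ptr mem)) && canMergeLoad(v, l) && clobber(l)
  => (AESENCload x [off] {sym} ptr mem)
```

This is the same mechanism already used to fold loads into arithmetic ops, so AESENC would get memory operands essentially for free.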
Memory-wise, the ISimd128 type would have 16-byte alignment and would be usable in struct fields, so you can store it on the heap or wherever. The compiler would be able to emit a simple MOVDQA.
The complex part for the compiler is when indexing inside the simd register type is allowed. Something like:
```go
type ISimd128 [16]byte

// ... reg1 and reg2 are ISimd128
var x [32]byte
copy(x[:], reg1[:])
copy(x[16:], reg2[:])
copy(reg1[:], x[8:])
```
In this particular case the compiler would need to know whether it's better to go through memory or be smart (here, staying in register land and using PINSRQ would be best).
This is solved by not allowing it at all: type ISimd128 ISimd128 does not allow indexing of inner elements.
We would instead expose a Shuffle64 intrinsic.
For this limited use case I don't think we need to apply generic optimizations; the goal is a nicer way to write assembly, not compiler-optimizer code.
On amd64 in particular, all CPUs (except Zen 4) pay a cost when mixing the older MMX and SSE encodings with the newer VEX and EVEX encodings.
VEX and EVEX allow using the 256-bit and 512-bit registers and the newer AVX, AVX2, and AVX-512 instruction families.
To handle this we would add a new directive (probably also limited to the std module):
vex-marked functions would only be allowed to call other vex-marked functions, because SSE is used in almost all functions for zeroing or copying fixed-size elements.
This means a vex function couldn't call the runtime; for crypto use cases this is fine, as it is customary for the actual crypto routines to use out parameters instead of allocating and returning a result.
When calling a vex function from a non-vex function, the compiler would inject vzeroupper after the call. This way vex functions can call other vex functions and pass ISimd* arguments and return values through the ymm and zmm registers.
This is different from previous proposals because it does not aim to provide a generic solution.
We would still have an implementation for each architecture we want assembly for.
This aims at replacing assembly: we wouldn't spend much time adding optimizations in the compiler; instead we would add access to more instructions and let people write these optimizations themselves.
The researched gains over assembly are: