This proposal adds intrinsics that allow programmers to use managed code (C#) to leverage Intel® SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, LZCNT, POPCNT, BMI1/2, PCLMULQDQ, and AES instructions.
## Rationale and Proposed API

### Vector Types
Currently, .NET provides `System.Numerics.Vector<T>` and related intrinsic functions as a cross-platform SIMD interface that automatically matches proper hardware support at JIT-compile time (e.g. `Vector<T>` is 128-bit wide on SSE2 machines and 256-bit wide on AVX2 machines). However, there is no way to use vectors of different sizes simultaneously, which limits the flexibility of SIMD intrinsics. For example, on AVX2 machines, XMM registers are not accessible from `Vector<T>`, yet certain instructions have to work on XMM registers (e.g. SSE4.2). Consequently, this proposal introduces `Vector128<T>` and `Vector256<T>` in a new namespace `System.Runtime.Intrinsics`:

```csharp
namespace System.Runtime.Intrinsics
{
// 128 bit types
[StructLayout(LayoutKind.Sequential, Size = 16)]
public struct Vector128<T> where T : struct {}
// 256 bit types
[StructLayout(LayoutKind.Sequential, Size = 32)]
public struct Vector256<T> where T : struct {}
}
```

This namespace is platform-agnostic, and other hardware could provide intrinsics that operate over these vector types. For instance, `Vector128<T>` could be implemented as an abstraction of XMM registers on SSE-capable processors or of Q registers on NEON-capable processors. Meanwhile, other types may be added in the future to support newer SIMD architectures (e.g. 512-bit vector and mask vector types for AVX-512).
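As an illustration, here is a minimal sketch of the scenario described above: 128-bit and 256-bit vectors living side by side in one type, which a single fixed-size `Vector<T>` cannot express. The type and field names here are hypothetical.

```csharp
using System.Runtime.Intrinsics;

public struct MixedWidthPacket
{
    // A 256-bit (YMM-sized) vector and a 128-bit (XMM-sized) vector in one type.
    // Vector<T> cannot express this, because its size is fixed per machine at JIT time.
    public Vector256<float> Wide;
    public Vector128<float> Narrow;
}
```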
### Intrinsic Functions
The current design of `System.Numerics.Vector` abstracts away the specifics of the processor. While this approach works well in many cases, it can prevent developers from taking full advantage of the underlying hardware. Intrinsic functions allow developers to access the full capability of the processors on which their programs run.

One of the design goals of the intrinsics APIs is to provide one-to-one correspondence to Intel C/C++ intrinsics, so that programmers already familiar with C/C++ intrinsics can easily leverage their existing skills. Another advantage of this approach is that we can leverage the existing body of documentation and sample code written for C/C++ intrinsics.

Intrinsic functions that manipulate `Vector128<T>` and `Vector256<T>` will be placed in a platform-specific namespace `System.Runtime.Intrinsics.X86`. Intrinsic APIs will be separated into several static classes based on the instruction sets they belong to.

```csharp
// Avx.cs
namespace System.Runtime.Intrinsics.X86
{
public static class Avx
{
public static bool IsSupported {get;}
// __m256 _mm256_add_ps (__m256 a, __m256 b)
[Intrinsic]
public static Vector256<float> Add(Vector256<float> left, Vector256<float> right) { throw new NotImplementedException(); }
// __m256d _mm256_add_pd (__m256d a, __m256d b)
[Intrinsic]
public static Vector256<double> Add(Vector256<double> left, Vector256<double> right) { throw new NotImplementedException(); }
// __m256 _mm256_addsub_ps (__m256 a, __m256 b)
[Intrinsic]
public static Vector256<float> AddSubtract(Vector256<float> left, Vector256<float> right) { throw new NotImplementedException(); }
// __m256d _mm256_addsub_pd (__m256d a, __m256d b)
[Intrinsic]
public static Vector256<double> AddSubtract(Vector256<double> left, Vector256<double> right) { throw new NotImplementedException(); }
......
}
}
```

Some of the intrinsics benefit from C# generics and get simpler APIs:

```csharp
// Sse2.cs
namespace System.Runtime.Intrinsics.X86
{
public static class Sse2
{
public static bool IsSupported {get;}
// __m128 _mm_castpd_ps (__m128d a)
// __m128i _mm_castpd_si128 (__m128d a)
// __m128d _mm_castps_pd (__m128 a)
// __m128i _mm_castps_si128 (__m128 a)
// __m128d _mm_castsi128_pd (__m128i a)
// __m128 _mm_castsi128_ps (__m128i a)
[Intrinsic]
public static Vector128<U> StaticCast<T, U>(Vector128<T> value) where T : struct where U : struct { throw new NotImplementedException(); }
......
}
}
```

Each instruction set class contains an `IsSupported` property that indicates whether the underlying hardware supports the instruction set. Programmers use these properties to ensure that their code can run on any hardware via platform-specific code paths. For JIT compilation, the results of the capability checks are JIT-time constants, so the dead code path for the current platform will be eliminated by the JIT compiler (conditional constant propagation). For AOT compilation, the compiler/runtime executes the CPUID check to identify the corresponding instruction sets. Additionally, the intrinsics do not provide software fallbacks; calling an intrinsic on a machine that lacks the corresponding instruction set causes a `PlatformNotSupportedException` at runtime. Consequently, we always recommend that developers provide a software fallback to keep the program portable. The common pattern of platform-specific code paths plus software fallback looks like this:

```csharp
if (Avx2.IsSupported)
{
// The AVX/AVX2 optimized implementation for Haswell or newer CPUs
}
else if (Sse41.IsSupported)
{
// The SSE optimized implementation for older CPUs
}
......
else
{
// Scalar or software-fallback implementation
}
```
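As a concrete sketch of this pattern, element-wise addition of two float arrays might dispatch as follows. This assumes the `Load`/`Store`/`Add` overloads on the proposed `Sse` and `Avx` classes and, for brevity, a length that is a multiple of 8; only the branch taken on the current hardware survives JIT compilation.

```csharp
using System.Runtime.Intrinsics.X86;

public static unsafe class AddExample
{
    // Computes c[i] = a[i] + b[i]; length is assumed to be a multiple of 8.
    public static void Add(float* a, float* b, float* c, int length)
    {
        if (Avx.IsSupported)
        {
            for (int i = 0; i < length; i += 8)  // 8 floats per 256-bit operation
                Avx.Store(c + i, Avx.Add(Avx.Load(a + i), Avx.Load(b + i)));
        }
        else if (Sse.IsSupported)
        {
            for (int i = 0; i < length; i += 4)  // 4 floats per 128-bit operation
                Sse.Store(c + i, Sse.Add(Sse.Load(a + i), Sse.Load(b + i)));
        }
        else
        {
            for (int i = 0; i < length; i++)     // scalar software fallback
                c[i] = a[i] + b[i];
        }
    }
}
```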
The scope of this API proposal is not limited to SIMD (vector) intrinsics; it also includes scalar intrinsics that operate over scalar types (e.g. `int`, `short`, `long`, or `float`) from the instruction sets mentioned above. As an example, the following code segment shows the `Crc32` intrinsic functions from the `Sse42` class:

```csharp
// Sse42.cs
namespace System.Runtime.Intrinsics.X86
{
public static class Sse42
{
public static bool IsSupported {get;}
// unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v)
[Intrinsic]
public static uint Crc32(uint crc, byte data) { throw new NotImplementedException(); }
// unsigned int _mm_crc32_u16 (unsigned int crc, unsigned short v)
[Intrinsic]
public static uint Crc32(uint crc, ushort data) { throw new NotImplementedException(); }
// unsigned int _mm_crc32_u32 (unsigned int crc, unsigned int v)
[Intrinsic]
public static uint Crc32(uint crc, uint data) { throw new NotImplementedException(); }
// unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v)
[Intrinsic]
public static ulong Crc32(ulong crc, ulong data) { throw new NotImplementedException(); }
......
}
}
```
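For instance, these overloads could be chained to checksum a whole buffer. The following sketch assumes the proposed `Sse42` class; a caller is expected to have checked `Sse42.IsSupported` and to fall back to a software implementation otherwise.

```csharp
using System.Runtime.Intrinsics.X86;

public static unsafe class ChecksumExample
{
    // CRC-32C over a byte buffer; requires Sse42.IsSupported == true.
    public static uint Crc32C(byte* data, int length)
    {
        uint crc = 0xFFFFFFFF;                          // conventional initial value
        int i = 0;
        for (; i + 4 <= length; i += 4)                 // 4 bytes per instruction
            crc = Sse42.Crc32(crc, *(uint*)(data + i));
        for (; i < length; i++)                         // remaining tail bytes
            crc = Sse42.Crc32(crc, data[i]);
        return ~crc;                                    // conventional final inversion
    }
}
```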
## Intended Audience

The intrinsics APIs bring the power and flexibility of accessing hardware instructions directly from C# programs. However, this power and flexibility means that developers have to be cognizant of how these APIs are used. In addition to ensuring that their program logic is correct, developers must also ensure that the use of the underlying intrinsic APIs is valid in the context of their operations.
For example, developers who use certain hardware intrinsics should be aware of their data-alignment requirements. Both aligned and unaligned memory load and store intrinsics are provided; if aligned loads and stores are desired, developers must ensure that the data is aligned appropriately. The following code snippet shows the different flavors of the load and store intrinsics proposed:

```csharp
// Avx.cs
namespace System.Runtime.Intrinsics.X86
{
public static class Avx
{
......
// __m256i _mm256_loadu_si256 (__m256i const * mem_addr)
[Intrinsic]
public static unsafe Vector256<sbyte> Load(sbyte* address) { throw new NotImplementedException(); }
// __m256i _mm256_loadu_si256 (__m256i const * mem_addr)
[Intrinsic]
public static unsafe Vector256<byte> Load(byte* address) { throw new NotImplementedException(); }
......
[Intrinsic]
public static Vector256<T> Load<T>(ref T vector) where T : struct { throw new NotImplementedException(); }
// __m256i _mm256_load_si256 (__m256i const * mem_addr)
[Intrinsic]
public static unsafe Vector256<sbyte> LoadAligned(sbyte* address) { throw new NotImplementedException(); }
// __m256i _mm256_load_si256 (__m256i const * mem_addr)
[Intrinsic]
public static unsafe Vector256<byte> LoadAligned(byte* address) { throw new NotImplementedException(); }
......
// __m256i _mm256_lddqu_si256 (__m256i const * mem_addr)
[Intrinsic]
public static unsafe Vector256<sbyte> LoadDqu(sbyte* address) { throw new NotImplementedException(); }
// __m256i _mm256_lddqu_si256 (__m256i const * mem_addr)
[Intrinsic]
public static unsafe Vector256<byte> LoadDqu(byte* address) { throw new NotImplementedException(); }
......
// void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
[Intrinsic]
public static unsafe void Store(sbyte* address, Vector256<sbyte> source) { throw new NotImplementedException(); }
// void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
[Intrinsic]
public static unsafe void Store(byte* address, Vector256<byte> source) { throw new NotImplementedException(); }
......
[Intrinsic]
public static void Store<T>(ref T vector, Vector256<T> source) where T : struct { throw new NotImplementedException(); }
// void _mm256_store_si256 (__m256i * mem_addr, __m256i a)
[Intrinsic]
public static unsafe void StoreAligned(sbyte* address, Vector256<sbyte> source) { throw new NotImplementedException(); }
// void _mm256_store_si256 (__m256i * mem_addr, __m256i a)
[Intrinsic]
public static unsafe void StoreAligned(byte* address, Vector256<byte> source) { throw new NotImplementedException(); }
......
// void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)
[Intrinsic]
public static unsafe void StoreAlignedNonTemporal(sbyte* address, Vector256<sbyte> source) { throw new NotImplementedException(); }
// void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)
[Intrinsic]
public static unsafe void StoreAlignedNonTemporal(byte* address, Vector256<byte> source) { throw new NotImplementedException(); }
......
}
}
```
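As a sketch of how a caller might choose between these flavors, the helper below takes the aligned path only when the address is suitably aligned (32 bytes for 256-bit vectors). The helper itself is illustrative, not part of the proposal.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static unsafe class AlignedLoadExample
{
    public static Vector256<byte> LoadVector(byte* address)
    {
        if (((ulong)address % 32) == 0)
        {
            // Aligned load (VMOVDQA); faults at runtime on a misaligned address.
            return Avx.LoadAligned(address);
        }
        // Unaligned load (VMOVDQU); works for any address, possibly at a small cost.
        return Avx.Load(address);
    }
}
```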
## IMM Operands

Most of the intrinsics can be ported directly from C/C++ to C#, but certain instructions that require immediate parameters (i.e. `imm8`) as operands deserve additional consideration, such as `pshufd`, `vcmpps`, etc. C/C++ compilers treat these intrinsics specially and throw compile-time errors when non-constant values are passed as immediate parameters. Therefore, CoreCLR also requires an immediate-argument guard from the C# compiler. We suggest adding a new "compiler feature" to Roslyn that places a `const` constraint on function parameters. Roslyn could then ensure that these functions are invoked with literal values for the `const` formal parameters.

```csharp
// Avx.cs
namespace System.Runtime.Intrinsics.X86
{
public static class Avx
{
......
// __m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)
[Intrinsic]
public static Vector256<float> Blend(Vector256<float> left, Vector256<float> right, const byte control) { throw new NotImplementedException(); }
// __m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)
[Intrinsic]
public static Vector256<double> Blend(Vector256<double> left, Vector256<double> right, const byte control) { throw new NotImplementedException(); }
// __m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)
[Intrinsic]
public static Vector128<float> Compare(Vector128<float> left, Vector128<float> right, const FloatComparisonMode mode) { throw new NotImplementedException(); }
// __m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)
[Intrinsic]
public static Vector128<double> Compare(Vector128<double> left, Vector128<double> right, const FloatComparisonMode mode) { throw new NotImplementedException(); }
......
}
}
// Enums.cs
namespace System.Runtime.Intrinsics.X86
{
public enum FloatComparisonMode : byte
{
EqualOrderedNonSignaling,
LessThanOrderedSignaling,
LessThanOrEqualOrderedSignaling,
UnorderedNonSignaling,
NotEqualUnorderedNonSignaling,
NotLessThanUnorderedSignaling,
NotLessThanOrEqualUnorderedSignaling,
OrderedNonSignaling,
......
}
......
}
```
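Under such a `const` constraint, a call site has to supply a literal (or otherwise compile-time-constant) immediate. For example, a sketch using the `Blend` method proposed above:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class BlendExample
{
    public static Vector256<float> AlternateSelect(Vector256<float> left, Vector256<float> right)
    {
        // 0b10101010 selects elements alternately from left and right; since the
        // argument is a literal, it can be encoded directly as the imm8 of VBLENDPS.
        return Avx.Blend(left, right, 0b10101010);
    }
}
```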
## Semantics and Usage

The semantics are straightforward if users are already familiar with Intel C/C++ intrinsics. Existing SIMD programs and algorithms implemented in C/C++ can be ported directly to C#. Moreover, compared to `System.Numerics.Vector<T>`, these intrinsics expose the full power of Intel SIMD instructions and do not depend on other modules (e.g. `Unsafe`) in high-performance environments.

For example, SoA (structure of arrays) is a more efficient pattern than AoS (array of structures) in SIMD programming. However, converting the data source (usually stored in AoS format) requires dense shuffle sequences, which `Vector<T>` does not provide. Using `Vector256<T>` with the AVX shuffle instructions (including shuffle, insert, extract, etc.) can lead to higher throughput:

```csharp
public struct Vector256Packet
{
public Vector256<float> xs {get; private set;}
public Vector256<float> ys {get; private set;}
public Vector256<float> zs {get; private set;}
// Convert AoS vectors to SoA packet
public unsafe Vector256Packet(float* vectors)
{
var m03 = Avx.ExtendToVector256<float>(Sse2.Load(&vectors[0])); // load lower halves
var m14 = Avx.ExtendToVector256<float>(Sse2.Load(&vectors[4]));
var m25 = Avx.ExtendToVector256<float>(Sse2.Load(&vectors[8]));
m03 = Avx.Insert(m03, &vectors[12], 1); // load higher halves
m14 = Avx.Insert(m14, &vectors[16], 1);
m25 = Avx.Insert(m25, &vectors[20], 1);
var xy = Avx.Shuffle(m14, m25, 2 << 6 | 1 << 4 | 3 << 2 | 2);
var yz = Avx.Shuffle(m03, m14, 1 << 6 | 0 << 4 | 2 << 2 | 1);
var _xs = Avx.Shuffle(m03, xy, 2 << 6 | 0 << 4 | 3 << 2 | 0);
var _ys = Avx.Shuffle(yz, xy, 3 << 6 | 1 << 4 | 2 << 2 | 0);
var _zs = Avx.Shuffle(yz, m25, 3 << 6 | 0 << 4 | 3 << 2 | 1);
xs = _xs;
ys = _ys;
zs = _zs;
}
......
}
public static class Program
{
const int Length = 1024; // size of the sample data
static unsafe void Main(string[] args)
{
var data = new float[Length];
fixed (float* dataPtr = data)
{
if (Avx2.IsSupported)
{
var vector = new Vector256Packet(dataPtr);
......
// Using AVX/AVX2 intrinsics to compute eight 3D vectors.
}
else if (Sse41.IsSupported)
{
var vector = new Vector128Packet(dataPtr);
......
// Using SSE intrinsics to compute four 3D vectors.
}
else
{
// scalar algorithm
}
}
}
}
```
Furthermore, conditional code is enabled in vectorized programs. Conditional paths are ubiquitous in scalar programs (if-else), but in vectorized programs they require specific SIMD instructions, such as compare, blend, andnot, etc.:

```csharp
public static class ColorPacketHelper
{
public static IntRGBPacket ConvertToIntRGB(this Vector256Packet colors)
{
var one = Avx.Set1<float>(1.0f);
var max = Avx.Set1<float>(255.0f);
var rsMask = Avx.Compare(colors.xs, one, FloatComparisonMode.GreaterThanOrderedNonSignaling);
var gsMask = Avx.Compare(colors.ys, one, FloatComparisonMode.GreaterThanOrderedNonSignaling);
var bsMask = Avx.Compare(colors.zs, one, FloatComparisonMode.GreaterThanOrderedNonSignaling);
var rs = Avx.BlendVariable(colors.xs, one, rsMask);
var gs = Avx.BlendVariable(colors.ys, one, gsMask);
var bs = Avx.BlendVariable(colors.zs, one, bsMask);
var rsInt = Avx.ConvertToVector256Int(Avx.Multiply(rs, max));
var gsInt = Avx.ConvertToVector256Int(Avx.Multiply(gs, max));
var bsInt = Avx.ConvertToVector256Int(Avx.Multiply(bs, max));
return new IntRGBPacket(rsInt, gsInt, bsInt);
}
}
public struct IntRGBPacket
{
public Vector256<int> Rs {get; private set;}
public Vector256<int> Gs {get; private set;}
public Vector256<int> Bs {get; private set;}
public IntRGBPacket(Vector256<int> _rs, Vector256<int> _gs, Vector256<int>_bs)
{
Rs = _rs;
Gs = _gs;
Bs = _bs;
}
}
```
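For reference, the branching scalar logic that the compare/blend sequence above replaces looks like this for a single color channel (an illustrative sketch, not part of the proposal):

```csharp
public static class ScalarEquivalent
{
    // Per-element counterpart of Compare + BlendVariable + Multiply + Convert:
    // clamp the channel to 1.0f, then scale into the 0..255 integer range.
    public static int ConvertChannel(float channel)
    {
        float clamped = channel > 1.0f ? 1.0f : channel;
        return (int)(clamped * 255.0f);
    }
}
```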
As previously stated, traditional scalar algorithms can be accelerated as well. For example, CRC32 computation is natively supported on SSE4.2 CPUs:

```csharp
public static class Verification
{
public static bool VerifyCrc32(ulong acc, ulong data, ulong res)
{
if (Sse42.IsSupported)
{
return Sse42.Crc32(acc, data) == res;
}
else
{
// SoftwareCrc32 is a software implementation of CRC32 provided by developers or other libraries
return SoftwareCrc32(acc, data) == res;
}
}
}
```
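The `SoftwareCrc32` above is left to the developer. A minimal bit-serial sketch of the CRC-32C (Castagnoli) update that `_mm_crc32_u64` computes, absorbing the eight data bytes least-significant first with the reflected polynomial 0x82F63B78, might look like this:

```csharp
public static class SoftwareFallback
{
    // Bit-serial CRC-32C update over one 64-bit value, matching _mm_crc32_u64.
    public static ulong SoftwareCrc32(ulong crc, ulong data)
    {
        uint c = (uint)crc;
        for (int i = 0; i < 8; i++)              // bytes absorbed low-order first
        {
            c ^= (byte)(data >> (8 * i));
            for (int k = 0; k < 8; k++)          // one shift step per message bit
                c = (c & 1) != 0 ? (c >> 1) ^ 0x82F63B78u : c >> 1;
        }
        return c;                                // result zero-extended to 64 bits
    }
}
```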
## Implementation Roadmap

Implementing all the intrinsics in the JIT is a large-scale and long-term project, so the current plan is to initially implement a subset of them, along with unit tests, code-quality tests, and benchmarks.

The first step in the implementation would involve infrastructure-related items: wiring up the basic components, including but not limited to the internal data representations of `Vector128<T>` and `Vector256<T>`, intrinsics recognition, hardware support checking, and external support from Roslyn/CoreFX. Subsequent steps would implement subsets of the intrinsics in the classes representing the different instruction sets.
## Complete API Design

- Add Intel hardware intrinsic APIs to CoreFX: dotnet/corefx#23489
- Add Intel hardware intrinsic API implementation to mscorlib: dotnet/corefx#13576
## Update

**08/17/2017**

- Change namespace `System.Runtime.CompilerServices.Intrinsics` to `System.Runtime.Intrinsics`, and `System.Runtime.CompilerServices.Intrinsics.X86` to `System.Runtime.Intrinsics.X86`.
- Change ISA class names to match the CoreFX naming convention, e.g., using `Avx` instead of `AVX`.
- Change certain pointer parameter names, e.g., using `address` instead of `mem`.
- Define `IsSupported` as properties.
- Add `Span<T>` overloads to the most common memory-access intrinsics (`Load`, `Store`, `Broadcast`), but leave other alignment-aware or performance-sensitive intrinsics with the original pointer versions.
- Clarify that these intrinsics will not provide software fallbacks.
- Clarify the `Sse2` class design and separate small classes (e.g., `Aes`, `Lzcnt`, etc.) into individual source files (e.g., `Aes.cs`, `Lzcnt.cs`, etc.).
- Change the method name `CompareVector*` to `Compare` and drop the `Compare` prefix from `FloatComparisonMode`.

**08/22/2017**

- Replace the `Span<T>` overloads with `ref T` overloads.

**09/01/2017**

- Minor changes from API code review.

**12/21/2018**

- All the proposed APIs are enabled in the .NET Core runtime.