
[API Proposal]: Add support for AVX-512 VNNI hardware instructions #86849

@MadProbe

Background and motivation

The AVX-VNNI instruction set is already supported for 128- and 256-bit vectors, and it would be good to have the same support for 512-bit vectors. (The instructions have 512-bit forms under AVX-512 VNNI, see https://en.wikipedia.org/wiki/AVX-512?useskin=vector#VNNI)
This API is also proposed as a preview feature, to be consistent with the existing AvxVnni API.

API Proposal

namespace System.Runtime.Intrinsics.X86
{
    [Intrinsic]
    public abstract class Avx512Vnni : Avx512F
    {
        public static new bool IsSupported { get => IsSupported; }

        [Intrinsic]
        public new abstract class X64 : Avx512F.X64
        {
            public static new bool IsSupported { get => IsSupported; }
        }

        [Intrinsic]
        public new abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get => IsSupported; }

            /// <summary>
            /// __m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPBUSD xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m128i _mm_dpwssd_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPWSSD xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<short> left, Vector128<short> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpbusd_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPBUSD ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpwssd_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPWSSD ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<short> left, Vector256<short> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m128i _mm_dpbusds_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPBUSDS xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

            /// <summary>
            /// __m128i _mm_dpwssds_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPWSSDS xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<short> left, Vector128<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpbusds_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPBUSDS ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpwssds_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPWSSDS ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<short> left, Vector256<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);

        }

        /// <summary>
        /// __m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPBUSD zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpwssd_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPWSSD zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAdd(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpbusds_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPBUSDS zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpwssds_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPWSSDS zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);
    }
}
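To make the intended semantics concrete, here is a minimal scalar sketch of what one call is expected to compute per 32-bit lane, based on the documented behavior of VPDPBUSD/VPDPBUSDS and VPDPWSSD/VPDPWSSDS. The class and method names (`VnniScalarModel`, `DotProductBytes`, `DotProductWords`) are hypothetical and are not part of the proposal; arrays stand in for vectors.

```csharp
using System;

// Hypothetical scalar reference model; names are illustrative, not proposed API.
public static class VnniScalarModel
{
    // Models VPDPBUSD (saturate: false) and VPDPBUSDS (saturate: true).
    // Each 32-bit lane accumulates four unsigned-byte * signed-byte products.
    public static int[] DotProductBytes(int[] addend, byte[] left, sbyte[] right, bool saturate)
    {
        var result = new int[addend.Length];
        for (int lane = 0; lane < addend.Length; lane++)
        {
            long sum = addend[lane];
            for (int j = 0; j < 4; j++)
            {
                // byte and sbyte are both promoted to int before multiplying.
                sum += left[4 * lane + j] * right[4 * lane + j];
            }
            result[lane] = saturate
                ? (int)Math.Clamp(sum, (long)int.MinValue, (long)int.MaxValue)
                : unchecked((int)sum); // wraps modulo 2^32
        }
        return result;
    }

    // Models VPDPWSSD (saturate: false) and VPDPWSSDS (saturate: true).
    // Each 32-bit lane accumulates two signed 16-bit * signed 16-bit products.
    public static int[] DotProductWords(int[] addend, short[] left, short[] right, bool saturate)
    {
        var result = new int[addend.Length];
        for (int lane = 0; lane < addend.Length; lane++)
        {
            long sum = addend[lane];
            for (int j = 0; j < 2; j++)
            {
                // Widen to long so the sum of two products cannot overflow.
                sum += (long)left[2 * lane + j] * right[2 * lane + j];
            }
            result[lane] = saturate
                ? (int)Math.Clamp(sum, (long)int.MinValue, (long)int.MaxValue)
                : unchecked((int)sum);
        }
        return result;
    }
}
```

In this model, `MultiplyWideningAndAdd` corresponds to `saturate: false` and `MultiplyWideningAndAddSaturate` to `saturate: true`; the two differ only when the accumulated lane value leaves the `int` range.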

API Usage

The motivation for this proposal is largely the same as that of AvxVnni. These instructions are used the same way as their AvxVnni counterparts and can be applied in any algorithm that needs a widening multiply-accumulate. They offer good performance improvements over the multi-instruction sequences that produce the same output.

// Example taken from my Adler32 rolling hash implementation,
// which currently uses 256-bit vectors but can easily be widened to 512-bit vectors as shown below.
// Irrelevant parts have been cut for brevity.
while (IsAddressLessThan(ref dataRef, ref end)) {
    Vector512<byte> bytes = Vector512.LoadUnsafe(ref dataRef);
    vadlerBA += vadlerA;

    if (Avx512Vnni.IsSupported) {
        vadlerBmult = Avx512Vnni.MultiplyWideningAndAdd(vadlerBmult.AsInt32(), bytes, mults_vector).AsUInt32();
    } else {
        vadlerBmult += Avx512BW.MultiplyAddAdjacent(Avx512BW.MultiplyAddAdjacent(bytes, mults_vector), Vector512<short>.One).AsUInt32();
    }

    vadlerA += Avx512BW.SumAbsoluteDifferences(bytes, zero).AsUInt64();
    dataRef = ref Add(ref dataRef, Vector512<byte>.Count);
}
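The two branches in the loop above are meant to produce the same accumulator value. A scalar sketch of why, and of the one condition under which they would differ, is given below. It assumes the documented behavior of VPMADDUBSW (adjacent byte products summed with signed 16-bit saturation), VPMADDWD and VPDPBUSD; all names (`AdlerPathModel`, `FallbackPath`, `VnniPath`) are hypothetical, and arrays stand in for vectors.

```csharp
using System;

// Hypothetical scalar model of the two code paths; names are illustrative only.
public static class AdlerPathModel
{
    // Fallback path: VPMADDUBSW, then VPMADDWD with an all-ones vector, then add.
    public static int[] FallbackPath(int[] addend, byte[] bytes, sbyte[] mults)
    {
        var result = new int[addend.Length];
        for (int lane = 0; lane < addend.Length; lane++)
        {
            int total = 0;
            for (int pair = 0; pair < 2; pair++)
            {
                int k = 4 * lane + 2 * pair;
                int sum = bytes[k] * mults[k] + bytes[k + 1] * mults[k + 1];
                // VPMADDUBSW saturates each pair sum to a signed 16-bit value.
                total += Math.Clamp(sum, (int)short.MinValue, (int)short.MaxValue);
            }
            result[lane] = unchecked(addend[lane] + total);
        }
        return result;
    }

    // VNNI path: VPDPBUSD accumulates all four products with no 16-bit step.
    public static int[] VnniPath(int[] addend, byte[] bytes, sbyte[] mults)
    {
        var result = new int[addend.Length];
        for (int lane = 0; lane < addend.Length; lane++)
        {
            int total = 0;
            for (int j = 0; j < 4; j++)
            {
                total += bytes[4 * lane + j] * mults[4 * lane + j];
            }
            result[lane] = unchecked(addend[lane] + total);
        }
        return result;
    }
}
```

With multipliers no larger than 64, the largest pair sum is 255 * 64 + 255 * 63 = 32385, which fits in a signed 16-bit value, so the two paths agree. With larger multipliers (for example 127 and 127) the fallback saturates at 32767 while the VNNI path does not, so the results would diverge.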

Alternative Designs

It might be better to add the Vector512 overloads to the existing AvxVnni static class, but I am not sure that would be a good idea: these instructions use EVEX encoding and may not be available on some Intel processors with a hybrid core architecture (Alder Lake and its successors).

Risks

N/A
