We would like to contribute a HIP backend to Faiss to support AMD GPUs. We have a working prototype that passes all unit tests on Navi hardware (6800XT, 7900XTX). The prototype is a statically hipified version of the existing CUDA backend with manual AMD-specific changes (build system, PTX to amdgcn builtins, ...).
Assuming you are interested in this work, how would we go about upstreaming it best?
Would a static HIP backend (in the faiss::hip namespace) be preferred? If not, what architecture would be preferable (e.g., overriding faiss::gpu)?
Unlike the CUDA backend, we ultimately need to support multiple warp sizes (wavefronts) at runtime: 32 for Navi and 64 for the MI series. There are some questions around uses of kWarpSize that will not work out of the box (sizing shared memory, some of the replacements in CMake, static uses at dispatch sites, ...). We will need guidance on how such support would best be architected and integrated.
Lastly, we have done only minor performance analysis and tuning with the current prototype. Are there public benchmarks or particular protocols you have used for the other backends that we should use as a reference? So far we have used the GPU benchmark scripts with the SIFT data sets to assess performance.
As part of this, we are currently using a GPU_MAX_SELECTION_K of 1024. What would a recommended protocol look like for deciding between 1024 and 2048? Ideally, we would like to use a single value independent of hardware/software generation.
To browse the current prototype as a reference, please see https://github.com/iotamudelta/faiss/tree/wf32 .