F16 variants - Update loads and stores to AVX2 - Group 5 by r-abishek · Pull Request #637 · ROCm/rpp

r-abishek · 2025-11-12T05:38:45Z

Replaces scalar loads/stores and scalar conversion to FP32 with AVX2 intrinsics. No additions to or removals from the external user API.
Performance improves by 28.7% to 48.5% for the updated kernels at the FP16 bit depth.
F16 Load/Store updates for blend, color_cast, flip, crop_mirror_normalize.
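
As a rough illustration of the kind of change described above, the sketch below widens eight FP16 values to FP32 with F16C/AVX intrinsics, computes in FP32, and narrows back on store. It is not taken from the PR: `scaleOffsetF16`, `halfToFloat`, `floatToHalf`, the `SKETCH_HAS_AVX2_F16C` macro, and the use of raw `uint16_t` bit patterns in place of the library's half type are all assumptions made for this example. The scalar helpers handle the tail and act as the fallback when the compiler is not targeting AVX2 and F16C.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical feature macro for this sketch: the vector path is compiled only
// when the compiler targets both AVX2 and F16C (e.g. -mavx2 -mf16c).
#if defined(__AVX2__) && defined(__F16C__)
#include <immintrin.h>
#define SKETCH_HAS_AVX2_F16C 1
#else
#define SKETCH_HAS_AVX2_F16C 0
#endif

// Scalar FP16 (raw bits) -> FP32.
inline float halfToFloat(std::uint16_t h) {
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    const std::uint32_t exp = (h >> 10) & 0x1Fu;
    const std::uint32_t man = h & 0x3FFu;
    if (exp == 0) {  // zero or subnormal: man * 2^-24
        const float f = std::ldexp(static_cast<float>(man), -24);
        return sign ? -f : f;
    }
    // Inf/NaN keep an all-ones exponent; normals are rebiased from 15 to 127.
    const std::uint32_t bits = sign | ((exp == 31 ? 255u : exp + 112u) << 23) | (man << 13);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Scalar FP32 -> FP16 (raw bits), round to nearest even.
inline std::uint16_t floatToHalf(float f) {
    std::uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    const std::uint32_t sign = (x >> 16) & 0x8000u;
    const std::uint32_t absx = x & 0x7FFFFFFFu;
    if (absx > 0x7F800000u) return static_cast<std::uint16_t>(sign | 0x7E00u);   // NaN
    if (absx >= 0x47800000u) return static_cast<std::uint16_t>(sign | 0x7C00u);  // >= 65536 or Inf
    if (absx < 0x38800000u) {  // below 2^-14: subnormal half or zero
        float a;
        std::memcpy(&a, &absx, sizeof a);
        // nearbyint rounds to nearest even in the default rounding mode.
        return static_cast<std::uint16_t>(sign | static_cast<std::uint32_t>(std::nearbyint(std::ldexp(a, 24))));
    }
    std::uint32_t half = (absx >> 13) - (112u << 10);  // rebias exponent, truncate mantissa
    const std::uint32_t rem = absx & 0x1FFFu;          // discarded mantissa bits
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u))) ++half;  // a carry may reach Inf
    return static_cast<std::uint16_t>(sign | half);
}

// dst[i] = src[i] * alpha + beta, computed in FP32 on FP16 storage.
void scaleOffsetF16(const std::uint16_t* src, std::uint16_t* dst, std::size_t n, float alpha, float beta) {
    std::size_t i = 0;
#if SKETCH_HAS_AVX2_F16C
    const __m256 pAlpha = _mm256_set1_ps(alpha);
    const __m256 pBeta = _mm256_set1_ps(beta);
    for (; i + 8 <= n; i += 8) {
        // One 128-bit load brings in 8 halves; cvtph widens them to 8 floats.
        __m256 p = _mm256_cvtph_ps(_mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i)));
        p = _mm256_add_ps(_mm256_mul_ps(p, pAlpha), pBeta);
        // cvtps narrows 8 floats back to 8 halves for a single 128-bit store.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i),
                         _mm256_cvtps_ph(p, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
    }
#endif
    for (; i < n; ++i)  // scalar tail (or the whole row without AVX2/F16C)
        dst[i] = floatToHalf(halfToFloat(src[i]) * alpha + beta);
}
```

With `alpha = 2` and `beta = 1`, an input of 1.0 (`0x3C00`) should come back as 3.0 (`0x4200`) on either path.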

FP16 Load/Store Updates

Copilot

Pull Request Overview

This PR updates F16 (half-precision floating point) kernels to use AVX2 intrinsics for loading and storing data, replacing scalar conversions to FP32. The changes target the blend, color_cast, flip, and crop_mirror_normalize kernels, delivering performance improvements of 28.7% to 48.5% for FP16 operations.

Replaces scalar F16 to F32 conversions with AVX2 SIMD intrinsics
Introduces new AVX2 load/store functions for F16 data with mirroring support
Updates conditional branching for better code clarity

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

File	Description
src/modules/tensor/cpu/kernel/flip.cpp	Updates flip kernel to use AVX2 F16 load/store functions, removes temporary F32 buffers, adjusts flip factor calculation for RGB channels
src/modules/tensor/cpu/kernel/crop_mirror_normalize.cpp	Replaces scalar F16 conversions with AVX2 intrinsics, improves conditional structure with `else if`
src/modules/tensor/cpu/kernel/color_cast.cpp	Adds AVX2 code paths with preprocessing directives, updates to use F16 load/store functions
src/modules/tensor/cpu/kernel/blend.cpp	Converts to AVX2 F16 operations, adds compile-time AVX2 feature detection
src/include/common/cpu/rpp_cpu_simd_load_store.hpp	Adds new F16 mirror load functions for AVX2 (pkd3 and pln3 variants)
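
To make the "mirror load" idea in the table concrete, here is a minimal sketch of the lane-reversal step on its own, shown on F32 data so that the F16 widening can be left out. The function name `mirrorRowF32` and the `SKETCH_HAS_AVX2` macro are hypothetical and do not come from the PR; the actual pkd3 and pln3 helpers in `rpp_cpu_simd_load_store.hpp` may be organized quite differently.

```cpp
#include <cstddef>

// Hypothetical feature macro for this sketch.
#if defined(__AVX2__)
#include <immintrin.h>
#define SKETCH_HAS_AVX2 1
#else
#define SKETCH_HAS_AVX2 0
#endif

// Horizontal flip of one planar row: dst[i] = src[n - 1 - i].
// src and dst are assumed not to overlap.
void mirrorRowF32(const float* src, float* dst, std::size_t n) {
    std::size_t i = 0;
#if SKETCH_HAS_AVX2
    // Index vector that reverses the order of the 8 lanes.
    const __m256i pxReverse = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    for (; i + 8 <= n; i += 8) {
        // Read the 8 floats that end at the mirrored position, then reverse them.
        const __m256 p = _mm256_loadu_ps(src + n - 8 - i);
        _mm256_storeu_ps(dst + i, _mm256_permutevar8x32_ps(p, pxReverse));
    }
#endif
    for (; i < n; ++i)  // scalar tail (or the whole row without AVX2)
        dst[i] = src[n - 1 - i];
}
```

Folding the reversal into the load means a flip kernel can read mirrored data directly into registers without staging it in a temporary buffer first.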

Copilot · 2025-11-12T05:42:13Z

src/include/common/cpu/rpp_cpu_simd_load_store.hpp

+    _MM_TRANSPOSE4_PS(p128[4], p128[5], p128[6], p128[7]); /* Transpose the 4x4 matrix and forms [[R05 R06 R07 R08][B05 B06 B07 B08][G05 G06 G07 G08][R06 R07 R08 R09]] */
+    p[0] = _mm256_setr_m128(p128[0], p128[4]); /* packs as R01-R08 */
+    p[1] = _mm256_setr_m128(p128[1], p128[5]); /* packs as G01-G08 */
+    p[2] = _mm256_setr_m128(p128[2], p128[6]); /* packs as B01-R08 */


The comment incorrectly states "B01-R08" when it should be "B01-B08" to match the pattern of the other channels and correctly describe what is being packed.

Suggested change

-    p[2] = _mm256_setr_m128(p128[2], p128[6]); /* packs as B01-R08 */
+    p[2] = _mm256_setr_m128(p128[2], p128[6]); /* packs as B01-B08 */
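
For readers unfamiliar with the transpose-and-pack pattern in the snippet under review, the following is a minimal sketch of how eight packed RGB pixels can be split into planar R, G and B vectors. It works on F32 input and is an illustration only: `pkd3ToPln3x8` and `SKETCH_HAS_AVX` are hypothetical names, and the real helper also performs the F16 conversion and mirroring that this sketch omits. Note the assumption that the source has 25 readable floats, because the last 4-wide load reads one element past the eighth pixel.

```cpp
// Hypothetical feature macro for this sketch.
#if defined(__AVX__)
#include <immintrin.h>
#define SKETCH_HAS_AVX 1
#else
#define SKETCH_HAS_AVX 0
#endif

// Splits 8 packed RGB pixels (R0 G0 B0 R1 G1 B1 ...) into planar R, G, B.
// Assumes srcPkd has at least 25 readable floats on the vector path.
void pkd3ToPln3x8(const float* srcPkd, float* dstR, float* dstG, float* dstB) {
#if SKETCH_HAS_AVX
    __m128 p128[8];
    for (int k = 0; k < 8; ++k)
        p128[k] = _mm_loadu_ps(srcPkd + 3 * k);  // [Rk Gk Bk R(k+1)]
    // After each 4x4 transpose the rows hold R, G, B and one unused row.
    _MM_TRANSPOSE4_PS(p128[0], p128[1], p128[2], p128[3]);
    _MM_TRANSPOSE4_PS(p128[4], p128[5], p128[6], p128[7]);
    _mm256_storeu_ps(dstR, _mm256_setr_m128(p128[0], p128[4]));  // R of pixels 0-7
    _mm256_storeu_ps(dstG, _mm256_setr_m128(p128[1], p128[5]));  // G of pixels 0-7
    _mm256_storeu_ps(dstB, _mm256_setr_m128(p128[2], p128[6]));  // B of pixels 0-7
#else
    for (int k = 0; k < 8; ++k) {  // scalar fallback
        dstR[k] = srcPkd[3 * k];
        dstG[k] = srcPkd[3 * k + 1];
        dstB[k] = srcPkd[3 * k + 2];
    }
#endif
}
```

The fourth row of each transpose is discarded, which is why only `p128[0..2]` and `p128[4..6]` feed the 256-bit packs.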

codecov · 2025-11-12T08:06:35Z

Codecov Report

❌ Patch coverage is 97.64706% with 4 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...odules/tensor/cpu/kernel/crop_mirror_normalize.cpp	55.56%	4 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #637      +/-   ##
===========================================
+ Coverage    88.24%   88.28%   +0.04%     
===========================================
  Files          195      195              
  Lines        82712    82619      -93     
===========================================
- Hits         72985    72934      -51     
+ Misses        9727     9685      -42

Files with missing lines	Coverage Δ
src/include/common/cpu/rpp_cpu_simd_load_store.hpp	`93.71% <100.00%> (+0.07%)`	⬆️
src/modules/tensor/cpu/kernel/blend.cpp	`100.00% <100.00%> (ø)`
src/modules/tensor/cpu/kernel/color_cast.cpp	`100.00% <100.00%> (ø)`
src/modules/tensor/cpu/kernel/flip.cpp	`90.71% <100.00%> (-0.31%)`	⬇️
...odules/tensor/cpu/kernel/crop_mirror_normalize.cpp	`58.16% <55.56%> (+0.12%)`	⬆️

... and 1 file with indirect coverage changes
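
The percentages in the report can be checked against the hit and line counts in the coverage diff. The short sketch below does that arithmetic; the helper names `coveragePercent` and `roundTo` are made up for this example, and the 166-of-170 figure for patch coverage is an inference from "97.64706% with 4 lines missing", not a number the report states.

```cpp
#include <cmath>

// Coverage as a percentage of lines hit.
double coveragePercent(long hits, long lines) {
    return 100.0 * static_cast<double>(hits) / static_cast<double>(lines);
}

// Rounds v to the given number of decimal places.
double roundTo(double v, int decimals) {
    const double scale = std::pow(10.0, decimals);
    return std::round(v * scale) / scale;
}

// develop: 72985 hits of 82712 lines -> 88.24 at two decimals
// #637:    72934 hits of 82619 lines -> 88.28 at two decimals
// patch (inferred): 166 of 170 lines -> 97.64706 at five decimals
```

The misses follow the same way: 82712 - 72985 = 9727 before the change and 82619 - 72934 = 9685 after it.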

Fix comment in common function

r-abishek · 2025-11-13T03:26:54Z

@rrawther @LakshmiKumar23 CI failure is only due to (-0.09%) reduction in coverage.

* Updates for crop mirror normalize
* Updated flip F16 rawC and load store modifications
* Updated blend with AVX support for F16 bitdepth
* Updated color cast with AVX support for F16 bitdepth
* Remove empty lines
* Update comments
* Fix comment in common function

---------

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>

* F16 variants - Update loads and stores to AVX2 - Group 4 (#627) * Make changes for exposure, log and spatter * Updates for crop mirror normalize * Fix memory issues with log 1D * Remove changes for crop mirror normalize and restore rpp_cpu_simd_load_store.hpp * Update the alignedLength for log --------- Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com> * Package - Enable Lintian Support rpp (#633) * fix lintian errors * fix lintian overrides static error * lintian errors fixed * move lintian overrides into if deb check * use existing changelog. fix formatting * not installing lintian overrides. keeping original changelog name * remove overrides --------- Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> * Docs - Bump rocm-docs-core[api_reference] from 1.27.0 to 1.29.0 in /docs/sphinx (#638) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.27.0 to 1.29.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v1.27.0...v1.29.0) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.29.0 dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> * Test suite - Add QA pass/fail tests for F32 bit depth (#631) * Added golden outputs and resolved HOST backend * Updated bin files for median filter and resize crop mirror * Fix for median filter F32 QA * Updated bin files * Updated rcm review comments * Updated comments for rmn * Modified bitdepths and resolved review comments * Fix typo * resolve review comments --------- Co-authored-by: sampath117 <snehaa@multicorewareinc.com> Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com> * Test Suite - Error Code Capture for all tests (#635) * Updates to capture error code * Intialize RPP_SUCCESS as default value * Update the code to display error status as part of the C++ code execution * Update rpp_test_suite_common.h * Update utilities/test_suite/HIP/Tensor_audio_hip.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HIP/Tensor_image_hip.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HIP/Tensor_misc_hip.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HIP/Tensor_voxel_hip.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HOST/Tensor_audio_host.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HOST/Tensor_image_host.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HOST/Tensor_misc_host.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HOST/Tensor_voxel_host.cpp Co-authored-by: Copilot 
<175728472+Copilot@users.noreply.github.com> * Fixes for CI issues * Restore naming convention in voxel test suite * Fix compilation issues * Update the code to use func for funcName * Modify error message * Modify the print statements --------- Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> * F16 variants - Update loads and stores to AVX2 - Group 5 (#637) * Updates for crop mirror normalize * Updated flip F16 rawC and load store modifications * Updated blend with AVX support for F16 bitdepth * Updated color cast with AVX support for F16 bitdepth * Remove empty lines * Update comments * Fix comment in common function --------- Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> * Docs - Bump rocm-docs-core[api_reference] from 1.29.0 to 1.30.0 in /docs/sphinx (#640) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.29.0 to 1.30.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v1.29.0...v1.30.0) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.30.0 dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * HOST and HIP - pinned buffers for respective API (#628) * Removed memcpy and used hipHostMalloc for allocation : blend * Removed memcpy and used hipHostMalloc for allocation : brightness * Removed memcpy and used hipHostMalloc for allocation : color cast * Removed memcpy and used hipHostMalloc for allocation : color twist * Removed memcpy and used hipHostMalloc for allocation : contrast * Removed memcpy and used hipHostMalloc for allocation : crop mirror normalize * Removed memcpy and used hipHostMalloc for allocation : Exposure * Removed memcpy and used hipHostMalloc for allocation : Gamma correction * Removed memcpy and used hipHostMalloc for allocation : gaussian filter * Removed memcpy and used hipHostMalloc for allocation : Noise * Removed memcpy and used hipHostMalloc for allocation : Non linear blend * Removed memcpy and used hipHostMalloc for allocation : Resize mirror normalize * Removed memcpy and used hipHostMalloc for allocation : Water * Added hipHostFree for all kernels in test suite * Added hipHostFree for all kernels in test suite * Removed memcpy and used hipHostMalloc for allocation : Flip, spatter, rcm, color temperature * Resolved copilot review comments * Updated version * Removed unused parameter * Updated version in cmakeList * removed the host to device mem copies for warp affine and rotate * Updated version * Removed comment * Updated Chnagelog file * Update patch version from 2.2.0 to 2.2.1 * Update CHANGELOG * Address copilot comments for HIP HOST consistent allocation * Documentation changes for updated memcpy changes * Update ricap outer API to use pinned memory and remove mem copy * Fix memory allocation and deallocation for permutationTensor * Update api/rppt_tensor_effects_augmentations.h Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix spelling of noiseProbability and 
saltProbability * Fix deallocation --------- Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> Co-authored-by: hmaddise <HazarathKumar.Maddisetty@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Docs - Bump rocm-docs-core[api_reference] from 1.30.0 to 1.30.1 in /docs/sphinx (#643) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.30.0 to 1.30.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v1.30.0...v1.30.1) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.30.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * CMakelists - Add optional GPU targets (#641) * add optional gpu targets * add addiitonal gpu targets * Rename function - hip_exec_roi_converison_ltrb_to_xywh to hip_exec_roi_conversion_ltrb_to_xywh (#645) Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> * Docs - Update CHANGELOG.md (#646) Updates --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Abishek <52214183+r-abishek@users.noreply.github.com> Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com> Co-authored-by: jonatluu <jonatluu@amd.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: sampath117 <snehaa@multicorewareinc.com> Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: hmaddise <HazarathKumar.Maddisetty@amd.com>

Srihari-mcw and others added 7 commits October 29, 2025 13:13

Updates for crop mirror normalize

78b2dc0

Updated flip F16 rawC and load store modifications

d15483f

Updated blend with AVX support for F16 bitdepth

12181c3

Updated color cast with AVX support for F16 bitdepth

8aca114

Remove empty lines

0cea144

Update comments

eb6a063

Merge pull request #513 from Srihari-mcw/apr/f16_load_store

1eee8f9

FP16 Load/Store Updates

r-abishek requested a review from Copilot November 12, 2025 05:38

r-abishek added the enhancement and ci:precheckin labels Nov 12, 2025

Copilot started reviewing on behalf of r-abishek November 12, 2025 05:40

Copilot finished reviewing on behalf of r-abishek November 12, 2025 05:41

Copilot AI reviewed Nov 12, 2025

Srihari-mcw and others added 2 commits November 12, 2025 15:54

Fix comment in common function

3169b4a

Merge pull request #526 from Srihari-mcw/fix_comments_f16

2daeee8

Fix comment in common function

LakshmiKumar23 requested review from LakshmiKumar23 and rrawther November 13, 2025 03:22

r-abishek requested review from LakshmiKumar23 and rrawther and removed request for LakshmiKumar23 and rrawther November 13, 2025 03:22

Merge branch 'develop' into ar/opt_f16_loads_stores_5

e26b9a5

kiritigowda self-assigned this Nov 13, 2025

r-abishek and others added 2 commits November 13, 2025 11:47

Merge branch 'develop' into ar/opt_f16_loads_stores_5

debe121

Merge branch 'develop' into ar/opt_f16_loads_stores_5

84beee8

rrawther approved these changes Nov 20, 2025

kiritigowda merged commit c51e0e1 into ROCm:develop Nov 21, 2025
9 checks passed

kiritigowda merged 12 commits into ROCm:develop from r-abishek:ar/opt_f16_loads_stores_5

5 participants
