F16 variants - Update loads and stores to AVX2 - Group 4 by r-abishek · Pull Request #627 · ROCm/rpp

r-abishek · 2025-10-08T06:20:56Z

Replacement of scalar load/store and conversion to FP32, with AVX2 intrinsics - no additions or removals to external user API.
40.6% - 48.8% improvements in performance for the updated kernels for the FP16 bit depth.
F16 Load/Store updates for exposure, spatter, log.

…d_store.hpp

F16 load store updates grp 4

Copilot

Pull Request Overview

This PR optimizes F16 variants by replacing scalar load/store operations with AVX2 intrinsics for spatter, log, and exposure kernels. The changes eliminate manual conversion loops between F16 and F32 formats, leveraging specialized SIMD load/store functions for direct F16 handling.

Replaced scalar F16-to-F32 conversion loops with direct AVX2 F16 load functions
Updated SIMD store operations to write directly to F16 format
Moved non-AVX2 fallback code inside proper conditional blocks

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
src/modules/tensor/cpu/kernel/spatter.cpp	Updated spatter kernel to use AVX2 F16 load/store intrinsics, reorganized conditional compilation blocks
src/modules/tensor/cpu/kernel/log.cpp	Replaced manual F16-to-F32 conversion with direct AVX2 F16 load operations
src/modules/tensor/cpu/kernel/exposure.cpp	Eliminated manual conversion loops, using direct F16 SIMD load/store functions

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-10-08T06:21:49Z

src/modules/tensor/cpu/kernel/spatter.cpp

-                    rpp_simd_store(rpp_store24_f32pln3_to_f32pln3_avx, dstPtrTempR_ps, dstPtrTempG_ps, dstPtrTempB_ps, p);    // simd stores
+                    rpp_simd_store(rpp_store24_f32pln3_to_f16pln3_avx, dstPtrTempR, dstPtrTempG, dstPtrTempB, p);    // simd stores
 #else
+                    Rpp32f srcPtrTemp_ps[13];


Array size is 13 but vectorIncrement is typically 12 for non-AVX2 path. Consider using vectorIncrement constant instead of magic number 13.

Suggested change

Rpp32f srcPtrTemp_ps[13];

Rpp32f srcPtrTemp_ps[vectorIncrement];

@r-abishek please consider this also

We will check and consolidate such changes which is required across multiple kernels as a separate PR. Thanks

src/modules/tensor/cpu/kernel/spatter.cpp

Copilot · 2025-10-08T06:21:50Z

src/modules/tensor/cpu/kernel/log.cpp

        Rpp32u vectorIncrement = 16;
        if (nDim == 1)
        {
+            alignedLength = length[0] & ~15;


Magic number 15 should be replaced with (vectorIncrement - 1) for better maintainability and clarity.

this will be truncating the length to the nearest multiple of 16. Is that what is needed?

Yes. We are processing the values in batches of 16 with the AVX Code. The leftover values are processed with raw C Code. Thanks

Updated based on the comment. Resolved

codecov · 2025-10-08T21:59:30Z

Codecov Report

❌ Patch coverage is 58.62069% with 12 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/modules/tensor/cpu/kernel/log.cpp	20.00%	12 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #627      +/-   ##
===========================================
- Coverage    88.50%   88.09%   -0.41%     
===========================================
  Files          195      195              
  Lines        82768    82704      -64     
===========================================
- Hits         73251    72852     -399     
- Misses        9517     9852     +335

Files with missing lines	Coverage Δ
src/modules/tensor/cpu/kernel/exposure.cpp	`100.00% <100.00%> (ø)`
src/modules/tensor/cpu/kernel/spatter.cpp	`97.50% <100.00%> (-0.07%)`	⬇️
src/modules/tensor/cpu/kernel/log.cpp	`22.52% <20.00%> (-0.26%)`	⬇️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

rrawther

please address all review comments

rrawther · 2025-10-11T01:27:08Z

src/modules/tensor/cpu/kernel/log.cpp

        Rpp32u vectorIncrement = 16;
        if (nDim == 1)
        {
+            alignedLength = length[0] & ~15;


this will be truncating the length to the nearest multiple of 16. Is that what is needed?

F16 copilot comments 1

rrawther

The size of dstPtrTemp is set to 13 instead of 12 to avoid overflow and segmentation fault issues when we utilize the SIMD load/store functions

we can't increase the array size to more than required to avoid segfault. If it is caused by a bug in the code, it needs to be addressed instead

Srihari-mcw · 2025-10-16T04:48:15Z

The size of dstPtrTemp is set to 13 instead of 12 to avoid overflow and segmentation fault issues when we utilize the SIMD load/store functions

we can't increase the array size to more than required to avoid segfault. If it is caused by a bug in the code, it needs to be addressed instead

With this specific function for example, we are trying to load the pkd3 inputs to pln3 layout. The PKD3 values are loaded in 4 different vectors and transposed to load the values in PLN3 layout. Here for p[3] we are loading from srcPtr + 9. _mm_loadu_ps loads 4 float values from the location. Hence srcPtr is kept as a buffer a size 13.

Thanks

rrawther

please address the review comments

r-abishek · 2025-10-29T21:54:51Z

@rrawther I think all are addressed on this one?

* Make changes for exposure, log and spatter * Updates for crop mirror normalize * Fix memory issues with log 1D * Remove changes for crop mirror normalize and restore rpp_cpu_simd_load_store.hpp * Update the alignedLength for log --------- Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>

* F16 variants - Update loads and stores to AVX2 - Group 4 (#627) * Make changes for exposure, log and spatter * Updates for crop mirror normalize * Fix memory issues with log 1D * Remove changes for crop mirror normalize and restore rpp_cpu_simd_load_store.hpp * Update the alignedLength for log --------- Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com> * Package - Enable Lintian Support rpp (#633) * fix lintian errors * fix lintian overrides static error * lintian errors fixed * move lintian overrides into if deb check * use existing changelog. fix formatting * not installing lintian overrides. keeping original changelog name * remove overrides --------- Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> * Docs - Bump rocm-docs-core[api_reference] from 1.27.0 to 1.29.0 in /docs/sphinx (#638) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.27.0 to 1.29.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v1.27.0...v1.29.0) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.29.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> * Test suite - Add QA pass/fail tests for F32 bit depth (#631) * Added golden outputs and resolved HOST backend * Updated bin files for median filter and resize crop mirror * Fix for median filter F32 QA * Updated bin files * Updated rcm review comments * Updated comments for rmn * Modified bitdepths and resolved review comments * Fix typo * resolve review comments --------- Co-authored-by: sampath117 <snehaa@multicorewareinc.com> Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com> * Test Suite - Error Code Capture for all tests (#635) * Updates to capture error code * Intialize RPP_SUCCESS as default value * Update the code to display error status as part of the C++ code execution * Update rpp_test_suite_common.h * Update utilities/test_suite/HIP/Tensor_audio_hip.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HIP/Tensor_image_hip.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HIP/Tensor_misc_hip.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HIP/Tensor_voxel_hip.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HOST/Tensor_audio_host.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HOST/Tensor_image_host.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HOST/Tensor_misc_host.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update utilities/test_suite/HOST/Tensor_voxel_host.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fixes for CI issues * Restore naming convention in voxel test suite * Fix compilation issues * Update the code to use func for funcName * Modify error message * Modify the print statements --------- Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> * F16 variants - Update loads and stores to AVX2 - Group 5 (#637) * Updates for crop mirror normalize * Updated flip F16 rawC and load store modifications * Updated blend with AVX support for F16 bitdepth * Updated color cast with AVX support for F16 bitdepth * Remove empty lines * Update comments * Fix comment in common function --------- Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> * Docs - Bump rocm-docs-core[api_reference] from 1.29.0 to 1.30.0 in /docs/sphinx (#640) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.29.0 to 1.30.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v1.29.0...v1.30.0) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.30.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * HOST and HIP - pinned buffers for respective API (#628) * Removed memcpy and used hipHostMalloc for allocation : blend * Removed memcpy and used hipHostMalloc for allocation : brightness * Removed memcpy and used hipHostMalloc for allocation : color cast * Removed memcpy and used hipHostMalloc for allocation : color twist * Removed memcpy and used hipHostMalloc for allocation : contrast * Removed memcpy and used hipHostMalloc for allocation : crop mirror normalize * Removed memcpy and used hipHostMalloc for allocation : Exposure * Removed memcpy and used hipHostMalloc for allocation : Gamma correction * Removed memcpy and used hipHostMalloc for allocation : gaussian filter * Removed memcpy and used hipHostMalloc for allocation : Noise * Removed memcpy and used hipHostMalloc for allocation : Non linear blend * Removed memcpy and used hipHostMalloc for allocation : Resize mirror normalize * Removed memcpy and used hipHostMalloc for allocation : Water * Added hipHostFree for all kernels in test suite * Added hipHostFree for all kernels in test suite * Removed memcpy and used hipHostMalloc for allocation : Flip, spatter, rcm, color temperature * Resolved copilot review comments * Updated version * Removed unused parameter * Updated version in cmakeList * removed the host to device mem copies for warp affine and rotate * Updated version * Removed comment * Updated Chnagelog file * Update patch version from 2.2.0 to 2.2.1 * Update CHANGELOG * Address copilot comments for HIP HOST consistent allocation * Documentation changes for updated memcpy changes * Update ricap outer API to use pinned memory and remove mem copy * Fix memory allocation and deallocation for permutationTensor * Update api/rppt_tensor_effects_augmentations.h Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix spelling of noiseProbability and saltProbability * Fix deallocation --------- Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> Co-authored-by: hmaddise <HazarathKumar.Maddisetty@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Docs - Bump rocm-docs-core[api_reference] from 1.30.0 to 1.30.1 in /docs/sphinx (#643) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.30.0 to 1.30.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v1.30.0...v1.30.1) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.30.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * CMakelists - Add optional GPU targets (#641) * add optional gpu targets * add addiitonal gpu targets * Rename function - hip_exec_roi_converison_ltrb_to_xywh to hip_exec_roi_conversion_ltrb_to_xywh (#645) Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> * Docs - Update CHANGELOG.md (#646) Updates --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Abishek <52214183+r-abishek@users.noreply.github.com> Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com> Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com> Co-authored-by: jonatluu <jonatluu@amd.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: sampath117 <snehaa@multicorewareinc.com> Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: hmaddise <HazarathKumar.Maddisetty@amd.com>

Srihari-mcw and others added 5 commits September 5, 2025 11:02

Make changes for exposure, log and spatter

7c550d0

Updates for crop mirror normalize

3063c0d

Fix memory issues with log 1D

0bc3001

Remove changes for crop mirror normalize and restore rpp_cpu_simd_loa…

a49e3bc

…d_store.hpp

Merge pull request #493 from Srihari-mcw/f16_load_store_updates_grp_4

e58f65d

F16 load store updates grp 4

r-abishek requested a review from Copilot October 8, 2025 06:20

r-abishek added the ci:precheckin label Oct 8, 2025

Copilot AI reviewed Oct 8, 2025

View reviewed changes

Update the alignedLength for log

3206f76

kiritigowda self-assigned this Oct 9, 2025

kiritigowda requested a review from rrawther October 9, 2025 17:57

rrawther requested a review from LakshmiKumar23 October 10, 2025 17:51

Merge branch 'develop' into ar/opt_f16_loads_stores_4

8d15415

rrawther requested changes Oct 11, 2025

View reviewed changes

Merge pull request #503 from Srihari-mcw/f16_copilot_comments_1

fbff8c1

F16 copilot comments 1

rrawther reviewed Oct 15, 2025

View reviewed changes

r-abishek and others added 3 commits October 16, 2025 00:05

Merge branch 'develop' into ar/opt_f16_loads_stores_4

59c6451

Merge branch 'develop' into ar/opt_f16_loads_stores_4

1353dc6

Merge branch 'develop' into ar/opt_f16_loads_stores_4

e2c738e

rrawther approved these changes Oct 29, 2025

View reviewed changes

rrawther requested changes Oct 29, 2025

View reviewed changes

Merge branch 'develop' into ar/opt_f16_loads_stores_4

c5bef7e

rrawther approved these changes Oct 29, 2025

View reviewed changes

kiritigowda merged commit 50c3df4 into ROCm:develop Nov 3, 2025
8 of 10 checks passed

	Rpp32f srcPtrTemp_ps[13];
	Rpp32f srcPtrTemp_ps[vectorIncrement];

Conversation

r-abishek commented Oct 8, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

LakshmiKumar23 Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

Srihari-mcw Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

rrawther Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

Srihari-mcw Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

Srihari-mcw Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

rrawther left a comment

Choose a reason for hiding this comment

Uh oh!

rrawther Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

rrawther left a comment

Choose a reason for hiding this comment

Uh oh!

Srihari-mcw commented Oct 16, 2025

Uh oh!

rrawther left a comment

Choose a reason for hiding this comment

Uh oh!

r-abishek commented Oct 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

codecov bot commented Oct 8, 2025 •

edited

Loading