Skip to content

F16 variants - Update loads and stores to AVX2 - Group 4#627

Merged
kiritigowda merged 12 commits intoROCm:developfrom
r-abishek:ar/opt_f16_loads_stores_4
Nov 3, 2025
Merged

F16 variants - Update loads and stores to AVX2 - Group 4#627
kiritigowda merged 12 commits intoROCm:developfrom
r-abishek:ar/opt_f16_loads_stores_4

Conversation

@r-abishek
Copy link
Copy Markdown
Member

  • Replacement of scalar load/store and conversion to FP32, with AVX2 intrinsics - no additions or removals to external user API.
  • 40.6% - 48.8% improvements in performance for the updated kernels for the FP16 bit depth.
  • F16 Load/Store updates for exposure, spatter, log.
image

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR optimizes F16 variants by replacing scalar load/store operations with AVX2 intrinsics for spatter, log, and exposure kernels. The changes eliminate manual conversion loops between F16 and F32 formats, leveraging specialized SIMD load/store functions for direct F16 handling.

  • Replaced scalar F16-to-F32 conversion loops with direct AVX2 F16 load functions
  • Updated SIMD store operations to write directly to F16 format
  • Moved non-AVX2 fallback code inside proper conditional blocks

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
src/modules/tensor/cpu/kernel/spatter.cpp Updated spatter kernel to use AVX2 F16 load/store intrinsics, reorganized conditional compilation blocks
src/modules/tensor/cpu/kernel/log.cpp Replaced manual F16-to-F32 conversion with direct AVX2 F16 load operations
src/modules/tensor/cpu/kernel/exposure.cpp Eliminated manual conversion loops, using direct F16 SIMD load/store functions

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

rpp_simd_store(rpp_store24_f32pln3_to_f32pln3_avx, dstPtrTempR_ps, dstPtrTempG_ps, dstPtrTempB_ps, p); // simd stores
rpp_simd_store(rpp_store24_f32pln3_to_f16pln3_avx, dstPtrTempR, dstPtrTempG, dstPtrTempB, p); // simd stores
#else
Rpp32f srcPtrTemp_ps[13];
Copy link

Copilot AI Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array size is 13 but vectorIncrement is typically 12 for non-AVX2 path. Consider using vectorIncrement constant instead of magic number 13.

Suggested change
Rpp32f srcPtrTemp_ps[13];
Rpp32f srcPtrTemp_ps[vectorIncrement];

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@r-abishek please consider this also

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will check and consolidate such changes which is required across multiple kernels as a separate PR. Thanks

Rpp32u vectorIncrement = 16;
if (nDim == 1)
{
alignedLength = length[0] & ~15;
Copy link

Copilot AI Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Magic number 15 should be replaced with (vectorIncrement - 1) for better maintainability and clarity.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will be truncating the length to the nearest multiple of 16. Is that what is needed?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We are processing the values in batches of 16 with the AVX Code. The leftover values are processed with raw C Code. Thanks

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated based on the comment. Resolved

@codecov
Copy link
Copy Markdown

codecov bot commented Oct 8, 2025

Codecov Report

❌ Patch coverage is 58.62069% with 12 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/modules/tensor/cpu/kernel/log.cpp 20.00% 12 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #627      +/-   ##
===========================================
- Coverage    88.50%   88.09%   -0.41%     
===========================================
  Files          195      195              
  Lines        82768    82704      -64     
===========================================
- Hits         73251    72852     -399     
- Misses        9517     9852     +335     
Files with missing lines Coverage Δ
src/modules/tensor/cpu/kernel/exposure.cpp 100.00% <100.00%> (ø)
src/modules/tensor/cpu/kernel/spatter.cpp 97.50% <100.00%> (-0.07%) ⬇️
src/modules/tensor/cpu/kernel/log.cpp 22.52% <20.00%> (-0.26%) ⬇️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kiritigowda kiritigowda self-assigned this Oct 9, 2025
@kiritigowda kiritigowda requested a review from rrawther October 9, 2025 17:57
Copy link
Copy Markdown
Contributor

@rrawther rrawther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please address all review comments

Rpp32u vectorIncrement = 16;
if (nDim == 1)
{
alignedLength = length[0] & ~15;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will be truncating the length to the nearest multiple of 16. Is that what is needed?

Copy link
Copy Markdown
Contributor

@rrawther rrawther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size of dstPtrTemp is set to 13 instead of 12 to avoid overflow and segmentation fault issues when we utilize the SIMD load/store functions

we can't increase the array size to more than required to avoid segfault. If it is caused by a bug in the code, it needs to be addressed instead

@Srihari-mcw
Copy link
Copy Markdown
Contributor

The size of dstPtrTemp is set to 13 instead of 12 to avoid overflow and segmentation fault issues when we utilize the SIMD load/store functions

we can't increase the array size to more than required to avoid segfault. If it is caused by a bug in the code, it needs to be addressed instead

image

With this specific function for example, we are trying to load the pkd3 inputs to pln3 layout. The PKD3 values are loaded in 4 different vectors and transposed to load the values in PLN3 layout. Here for p[3] we are loading from srcPtr + 9. _mm_loadu_ps loads 4 float values from the location. Hence srcPtr is kept as a buffer a size 13.

Thanks

Copy link
Copy Markdown
Contributor

@rrawther rrawther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please address the review comments

@r-abishek
Copy link
Copy Markdown
Member Author

@rrawther I think all are addressed on this one?

@kiritigowda kiritigowda merged commit 50c3df4 into ROCm:develop Nov 3, 2025
8 of 10 checks passed
ManasaDattaT pushed a commit to RooseweltMcW/rpp that referenced this pull request Dec 19, 2025
* Make changes for exposure, log and spatter

* Updates for crop mirror normalize

* Fix memory issues with log 1D

* Remove changes for crop mirror normalize and restore rpp_cpu_simd_load_store.hpp

* Update the alignedLength for log

---------

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>
HazarathKumarM pushed a commit to HazarathKumarM/rpp that referenced this pull request Jan 6, 2026
* Make changes for exposure, log and spatter

* Updates for crop mirror normalize

* Fix memory issues with log 1D

* Remove changes for crop mirror normalize and restore rpp_cpu_simd_load_store.hpp

* Update the alignedLength for log

---------

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>
JeniferC99 pushed a commit that referenced this pull request Jan 22, 2026
* F16 variants - Update loads and stores to AVX2 - Group 4 (#627)

* Make changes for exposure, log and spatter

* Updates for crop mirror normalize

* Fix memory issues with log 1D

* Remove changes for crop mirror normalize and restore rpp_cpu_simd_load_store.hpp

* Update the alignedLength for log

---------

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>

* Package - Enable Lintian Support rpp (#633)

* fix lintian errors

* fix lintian overrides static error

* lintian errors fixed

* move lintian overrides into if deb check

* use existing changelog. fix formatting

* not installing lintian overrides. keeping original changelog name

* remove overrides

---------

Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>

* Docs - Bump rocm-docs-core[api_reference] from 1.27.0 to 1.29.0 in /docs/sphinx (#638)

Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.27.0 to 1.29.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v1.27.0...v1.29.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.29.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>

* Test suite - Add QA pass/fail tests for F32 bit depth (#631)

* Added golden outputs and resolved HOST backend

* Updated bin files for median filter and resize crop mirror

* Fix for median filter F32 QA

* Updated bin files

* Updated rcm review comments

* Updated comments for rmn

* Modified bitdepths and resolved review comments

* Fix typo

* resolve review comments

---------

Co-authored-by: sampath117 <snehaa@multicorewareinc.com>
Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>

* Test Suite - Error Code Capture for all tests (#635)

* Updates to capture error code

* Intialize RPP_SUCCESS as default value

* Update the code to display error status as part of the C++ code execution

* Update rpp_test_suite_common.h

* Update utilities/test_suite/HIP/Tensor_audio_hip.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HIP/Tensor_image_hip.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HIP/Tensor_misc_hip.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HIP/Tensor_voxel_hip.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HOST/Tensor_audio_host.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HOST/Tensor_image_host.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HOST/Tensor_misc_host.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HOST/Tensor_voxel_host.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fixes for CI issues

* Restore naming convention in voxel test suite

* Fix compilation issues

* Update the code to use func for funcName

* Modify error message

* Modify the print statements

---------

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>

* F16 variants - Update loads and stores to AVX2 - Group 5 (#637)

* Updates for crop mirror normalize

* Updated flip F16 rawC and load store modifications

* Updated blend with AVX support for F16 bitdepth

* Updated color cast with AVX support for F16 bitdepth

* Remove empty lines

* Update comments

* Fix comment in common function

---------

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>

* Docs - Bump rocm-docs-core[api_reference] from 1.29.0 to 1.30.0 in /docs/sphinx (#640)

Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.29.0 to 1.30.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v1.29.0...v1.30.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.30.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* HOST and HIP - pinned buffers for respective API (#628)

* Removed memcpy and used hipHostMalloc for allocation : blend

* Removed memcpy and used hipHostMalloc for allocation : brightness

* Removed memcpy and used hipHostMalloc for allocation : color cast

* Removed memcpy and used hipHostMalloc for allocation : color twist

* Removed memcpy and used hipHostMalloc for allocation : contrast

* Removed memcpy and used hipHostMalloc for allocation : crop mirror normalize

* Removed memcpy and used hipHostMalloc for allocation : Exposure

* Removed memcpy and used hipHostMalloc for allocation : Gamma correction

* Removed memcpy and used hipHostMalloc for allocation : gaussian filter

* Removed memcpy and used hipHostMalloc for allocation : Noise

* Removed memcpy and used hipHostMalloc for allocation : Non linear blend

* Removed memcpy and used hipHostMalloc for allocation : Resize mirror normalize

* Removed memcpy and used hipHostMalloc for allocation : Water

* Added hipHostFree for all kernels in test suite

* Added hipHostFree for all kernels in test suite

* Removed memcpy and used hipHostMalloc for allocation : Flip, spatter, rcm, color temperature

* Resolved copilot review comments

* Updated version

* Removed unused parameter

* Updated version in cmakeList

* removed the host to device mem copies for warp affine and rotate

* Updated version

* Removed comment

* Updated Chnagelog file

* Update patch version from 2.2.0 to 2.2.1

* Update CHANGELOG

* Address copilot comments for HIP HOST consistent allocation

* Documentation changes for updated memcpy changes

* Update ricap outer API to use pinned memory and remove mem copy

* Fix memory allocation and deallocation for permutationTensor

* Update api/rppt_tensor_effects_augmentations.h

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix spelling of noiseProbability and saltProbability

* Fix deallocation

---------

Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: hmaddise <HazarathKumar.Maddisetty@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Docs - Bump rocm-docs-core[api_reference] from 1.30.0 to 1.30.1 in /docs/sphinx (#643)

Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.30.0 to 1.30.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v1.30.0...v1.30.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.30.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* CMakelists - Add optional GPU targets (#641)

* add optional gpu targets

* add addiitonal gpu targets

* Rename function - hip_exec_roi_converison_ltrb_to_xywh to hip_exec_roi_conversion_ltrb_to_xywh (#645)

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>

* Docs - Update CHANGELOG.md (#646)

Updates

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Abishek <52214183+r-abishek@users.noreply.github.com>
Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>
Co-authored-by: jonatluu <jonatluu@amd.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sampath117 <snehaa@multicorewareinc.com>
Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: hmaddise <HazarathKumar.Maddisetty@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants