F16 variants - Update loads and stores to AVX2 - Group 5#637

Merged
kiritigowda merged 12 commits into ROCm:develop from r-abishek:ar/opt_f16_loads_stores_5 on Nov 21, 2025

Conversation

@r-abishek
Member

  • Replaces scalar load/store and conversion to FP32 with AVX2 intrinsics; no additions to or removals from the external user API.
  • 28.7% to 48.5% performance improvement for the updated kernels at FP16 bit depth.
    F16 load/store updates cover blend, color_cast, flip, and crop_mirror_normalize.
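
As a rough illustration of the kind of change described above (not the PR's actual code), the sketch below converts FP16 values to FP32 eight at a time with the F16C conversion intrinsic when AVX2 and F16C are enabled at compile time, and falls back to a scalar bit-level decode otherwise. The names `halfBitsToFloat` and `convertHalfToFloat` are hypothetical and do not come from RPP, and half values are assumed to be passed as raw `uint16_t` bit patterns.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__AVX2__) && defined(__F16C__)
#include <immintrin.h>
#endif

// Hypothetical helper (not from RPP): decode one IEEE 754 binary16 value,
// given as its raw 16-bit pattern, into a 32-bit float.
inline float halfBitsToFloat(std::uint16_t h)
{
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    const std::uint32_t exponent = (h >> 10) & 0x1Fu;
    std::uint32_t mantissa = h & 0x3FFu;
    std::uint32_t bits;
    if (exponent == 0x1Fu)            // infinity or NaN
        bits = sign | 0x7F800000u | (mantissa << 13);
    else if (exponent != 0)           // normal number: rebias exponent from 15 to 127
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
    else if (mantissa == 0)           // signed zero
        bits = sign;
    else                              // subnormal half: renormalize for float
    {
        std::uint32_t shift = 0;
        while ((mantissa & 0x400u) == 0) { mantissa <<= 1; ++shift; }
        bits = sign | ((113u - shift) << 23) | ((mantissa & 0x3FFu) << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical helper (not from RPP): convert `count` half values to float.
// The SIMD loop is compiled only when AVX2 and F16C are enabled; the scalar
// loop handles the tail, or everything when they are not.
inline void convertHalfToFloat(const std::uint16_t* src, float* dst, std::size_t count)
{
    std::size_t i = 0;
#if defined(__AVX2__) && defined(__F16C__)
    for (; i + 8 <= count; i += 8)
    {
        // Load 8 packed halves (128 bits) and widen them to 8 floats (256 bits).
        const __m128i halves = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm256_storeu_ps(dst + i, _mm256_cvtph_ps(halves));
    }
#endif
    for (; i < count; ++i)
        dst[i] = halfBitsToFloat(src[i]);
}
```

Building with something like `-mavx2 -mf16c` on GCC or Clang selects the SIMD loop; without those flags the same source still compiles and runs through the scalar path.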

Contributor

Copilot AI left a comment

Pull Request Overview

This PR updates F16 (half-precision floating point) kernels to use AVX2 intrinsics for loading and storing data, replacing scalar conversions to FP32. The changes target the blend, color_cast, flip, and crop_mirror_normalize kernels, delivering performance improvements of 28.7% to 48.5% for FP16 operations.

  • Replaces scalar F16 to F32 conversions with AVX2 SIMD intrinsics
  • Introduces new AVX2 load/store functions for F16 data with mirroring support
  • Updates conditional branching for better code clarity

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/modules/tensor/cpu/kernel/flip.cpp | Updates flip kernel to use AVX2 F16 load/store functions, removes temporary F32 buffers, adjusts flip factor calculation for RGB channels |
| src/modules/tensor/cpu/kernel/crop_mirror_normalize.cpp | Replaces scalar F16 conversions with AVX2 intrinsics, improves conditional structure with else if |
| src/modules/tensor/cpu/kernel/color_cast.cpp | Adds AVX2 code paths with preprocessing directives, updates to use F16 load/store functions |
| src/modules/tensor/cpu/kernel/blend.cpp | Converts to AVX2 F16 operations, adds compile-time AVX2 feature detection |
| src/include/common/cpu/rpp_cpu_simd_load_store.hpp | Adds new F16 mirror load functions for AVX2 (pkd3 and pln3 variants) |
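
To picture what a mirrored load has to do for planar and packed layouts, here is a minimal sketch that works on FP32 data only. It is not the code added to rpp_cpu_simd_load_store.hpp; `mirrorRowPln` and `mirrorRowPkd3` are hypothetical names. For a planar row, every element is reversed, which AVX2 can do eight lanes at a time with a lane permute. For a packed RGB row, pixel order is reversed but the R, G, B order inside each pixel is kept.

```cpp
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper (not an RPP function): write one planar row of `count`
// floats to `dst` in reversed (mirrored) order.
inline void mirrorRowPln(const float* src, float* dst, std::size_t count)
{
    std::size_t i = 0;
#if defined(__AVX2__)
    const __m256i reverseIdx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    for (; i + 8 <= count; i += 8)
    {
        // Load the 8 floats that end at the mirrored position, then reverse the lanes.
        const __m256 v = _mm256_loadu_ps(src + count - i - 8);
        _mm256_storeu_ps(dst + i, _mm256_permutevar8x32_ps(v, reverseIdx));
    }
#endif
    for (; i < count; ++i)           // scalar tail, or full row without AVX2
        dst[i] = src[count - 1 - i];
}

// Hypothetical helper (not an RPP function): mirror a packed RGB row of
// `pixels` pixels. Pixels swap places; channels within a pixel do not.
inline void mirrorRowPkd3(const float* src, float* dst, std::size_t pixels)
{
    for (std::size_t p = 0; p < pixels; ++p)
        for (std::size_t c = 0; c < 3; ++c)
            dst[p * 3 + c] = src[(pixels - 1 - p) * 3 + c];
}
```

The packed variant is left scalar here to keep the sketch short; a SIMD version would additionally need to deinterleave the channels.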

_MM_TRANSPOSE4_PS(p128[4], p128[5], p128[6], p128[7]); /* Transpose the 4x4 matrix and forms [[R05 R06 R07 R08][B05 B06 B07 B08][G05 G06 G07 G08][R06 R07 R08 R09]] */
p[0] = _mm256_setr_m128(p128[0], p128[4]); /* packs as R01-R08 */
p[1] = _mm256_setr_m128(p128[1], p128[5]); /* packs as G01-G08 */
p[2] = _mm256_setr_m128(p128[2], p128[6]); /* packs as B01-R08 */

Copilot AI Nov 12, 2025

The comment incorrectly states "B01-R08" when it should be "B01-B08" to match the pattern of the other channels and correctly describe what is being packed.

Suggested change
- p[2] = _mm256_setr_m128(p128[2], p128[6]); /* packs as B01-R08 */
+ p[2] = _mm256_setr_m128(p128[2], p128[6]); /* packs as B01-B08 */
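
For readers unfamiliar with the transpose trick in the snippet under review, the following is a small sketch of the general technique on four packed RGB pixels. It is not the RPP function itself, and `deinterleaveRgb4` is a hypothetical name. Four overlapping 128-bit loads, each starting at a new pixel, form a 4x4 matrix whose transpose has one channel per row; the fourth row holds leftover red values and is discarded. A scalar fallback is used when SSE is unavailable.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical helper (not an RPP function): split 4 packed RGB pixels into
// separate R, G and B arrays of 4 floats each.
// Assumption: src[0..12] is readable (13 floats), because the SSE path's fourth
// load reads one element past the fourth pixel.
inline void deinterleaveRgb4(const float* src, float* r, float* g, float* b)
{
#if defined(__SSE__)
    __m128 row0 = _mm_loadu_ps(src);      // R1 G1 B1 R2
    __m128 row1 = _mm_loadu_ps(src + 3);  // R2 G2 B2 R3
    __m128 row2 = _mm_loadu_ps(src + 6);  // R3 G3 B3 R4
    __m128 row3 = _mm_loadu_ps(src + 9);  // R4 G4 B4 R5
    // After the transpose: row0 = R1..R4, row1 = G1..G4, row2 = B1..B4,
    // row3 = R2..R5 (not needed).
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);
    _mm_storeu_ps(r, row0);
    _mm_storeu_ps(g, row1);
    _mm_storeu_ps(b, row2);
#else
    for (std::size_t p = 0; p < 4; ++p)   // scalar fallback
    {
        r[p] = src[p * 3];
        g[p] = src[p * 3 + 1];
        b[p] = src[p * 3 + 2];
    }
#endif
}
```

Running the same step on the next four pixels and joining the two 128-bit halves per channel gives eight values per channel in one 256-bit register, which is the "R01-R08", "G01-G08", "B01-B08" packing the code comments describe.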

@codecov

codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 97.64706% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...odules/tensor/cpu/kernel/crop_mirror_normalize.cpp | 55.56% | 4 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #637      +/-   ##
===========================================
+ Coverage    88.24%   88.28%   +0.04%     
===========================================
  Files          195      195              
  Lines        82712    82619      -93     
===========================================
- Hits         72985    72934      -51     
+ Misses        9727     9685      -42     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/include/common/cpu/rpp_cpu_simd_load_store.hpp | 93.71% <100.00%> (+0.07%) ⬆️ |
| src/modules/tensor/cpu/kernel/blend.cpp | 100.00% <100.00%> (ø) |
| src/modules/tensor/cpu/kernel/color_cast.cpp | 100.00% <100.00%> (ø) |
| src/modules/tensor/cpu/kernel/flip.cpp | 90.71% <100.00%> (-0.31%) ⬇️ |
| ...odules/tensor/cpu/kernel/crop_mirror_normalize.cpp | 58.16% <55.56%> (+0.12%) ⬆️ |

... and 1 file with indirect coverage changes

r-abishek requested review from LakshmiKumar23 and rrawther, and removed request for LakshmiKumar23 and rrawther, on November 13, 2025 03:22
@r-abishek
Member Author

@rrawther @LakshmiKumar23 The CI failure is due only to a (-0.09%) reduction in coverage.

kiritigowda self-assigned this on Nov 13, 2025
kiritigowda merged commit c51e0e1 into ROCm:develop on Nov 21, 2025
9 checks passed
ManasaDattaT pushed a commit to RooseweltMcW/rpp that referenced this pull request Dec 19, 2025
* Updates for crop mirror normalize

* Updated flip F16 rawC and load store modifications

* Updated blend with AVX support for F16 bitdepth

* Updated color cast with AVX support for F16 bitdepth

* Remove empty lines

* Update comments

* Fix comment in common function

---------

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
HazarathKumarM pushed a commit to HazarathKumarM/rpp that referenced this pull request Jan 6, 2026
* Updates for crop mirror normalize

* Updated flip F16 rawC and load store modifications

* Updated blend with AVX support for F16 bitdepth

* Updated color cast with AVX support for F16 bitdepth

* Remove empty lines

* Update comments

* Fix comment in common function

---------

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
JeniferC99 pushed a commit that referenced this pull request Jan 22, 2026
* F16 variants - Update loads and stores to AVX2 - Group 4 (#627)

* Make changes for exposure, log and spatter

* Updates for crop mirror normalize

* Fix memory issues with log 1D

* Remove changes for crop mirror normalize and restore rpp_cpu_simd_load_store.hpp

* Update the alignedLength for log

---------

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>

* Package - Enable Lintian Support rpp (#633)

* fix lintian errors

* fix lintian overrides static error

* lintian errors fixed

* move lintian overrides into if deb check

* use existing changelog. fix formatting

* not installing lintian overrides. keeping original changelog name

* remove overrides

---------

Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>

* Docs - Bump rocm-docs-core[api_reference] from 1.27.0 to 1.29.0 in /docs/sphinx (#638)

Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.27.0 to 1.29.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v1.27.0...v1.29.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.29.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>

* Test suite - Add QA pass/fail tests for F32 bit depth (#631)

* Added golden outputs and resolved HOST backend

* Updated bin files for median filter and resize crop mirror

* Fix for median filter F32 QA

* Updated bin files

* Updated rcm review comments

* Updated comments for rmn

* Modified bitdepths and resolved review comments

* Fix typo

* resolve review comments

---------

Co-authored-by: sampath117 <snehaa@multicorewareinc.com>
Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>

* Test Suite - Error Code Capture for all tests (#635)

* Updates to capture error code

* Intialize RPP_SUCCESS as default value

* Update the code to display error status as part of the C++ code execution

* Update rpp_test_suite_common.h

* Update utilities/test_suite/HIP/Tensor_audio_hip.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HIP/Tensor_image_hip.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HIP/Tensor_misc_hip.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HIP/Tensor_voxel_hip.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HOST/Tensor_audio_host.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HOST/Tensor_image_host.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HOST/Tensor_misc_host.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update utilities/test_suite/HOST/Tensor_voxel_host.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fixes for CI issues

* Restore naming convention in voxel test suite

* Fix compilation issues

* Update the code to use func for funcName

* Modify error message

* Modify the print statements

---------

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>

* F16 variants - Update loads and stores to AVX2 - Group 5 (#637)

* Updates for crop mirror normalize

* Updated flip F16 rawC and load store modifications

* Updated blend with AVX support for F16 bitdepth

* Updated color cast with AVX support for F16 bitdepth

* Remove empty lines

* Update comments

* Fix comment in common function

---------

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>

* Docs - Bump rocm-docs-core[api_reference] from 1.29.0 to 1.30.0 in /docs/sphinx (#640)

Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.29.0 to 1.30.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v1.29.0...v1.30.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.30.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* HOST and HIP - pinned buffers for respective API (#628)

* Removed memcpy and used hipHostMalloc for allocation : blend

* Removed memcpy and used hipHostMalloc for allocation : brightness

* Removed memcpy and used hipHostMalloc for allocation : color cast

* Removed memcpy and used hipHostMalloc for allocation : color twist

* Removed memcpy and used hipHostMalloc for allocation : contrast

* Removed memcpy and used hipHostMalloc for allocation : crop mirror normalize

* Removed memcpy and used hipHostMalloc for allocation : Exposure

* Removed memcpy and used hipHostMalloc for allocation : Gamma correction

* Removed memcpy and used hipHostMalloc for allocation : gaussian filter

* Removed memcpy and used hipHostMalloc for allocation : Noise

* Removed memcpy and used hipHostMalloc for allocation : Non linear blend

* Removed memcpy and used hipHostMalloc for allocation : Resize mirror normalize

* Removed memcpy and used hipHostMalloc for allocation : Water

* Added hipHostFree for all kernels in test suite

* Added hipHostFree for all kernels in test suite

* Removed memcpy and used hipHostMalloc for allocation : Flip, spatter, rcm, color temperature

* Resolved copilot review comments

* Updated version

* Removed unused parameter

* Updated version in cmakeList

* removed the host to device mem copies for warp affine and rotate

* Updated version

* Removed comment

* Updated Chnagelog file

* Update patch version from 2.2.0 to 2.2.1

* Update CHANGELOG

* Address copilot comments for HIP HOST consistent allocation

* Documentation changes for updated memcpy changes

* Update ricap outer API to use pinned memory and remove mem copy

* Fix memory allocation and deallocation for permutationTensor

* Update api/rppt_tensor_effects_augmentations.h

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix spelling of noiseProbability and saltProbability

* Fix deallocation

---------

Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: hmaddise <HazarathKumar.Maddisetty@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Docs - Bump rocm-docs-core[api_reference] from 1.30.0 to 1.30.1 in /docs/sphinx (#643)

Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.30.0 to 1.30.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v1.30.0...v1.30.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.30.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* CMakelists - Add optional GPU targets (#641)

* add optional gpu targets

* add addiitonal gpu targets

* Rename function - hip_exec_roi_converison_ltrb_to_xywh to hip_exec_roi_conversion_ltrb_to_xywh (#645)

Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>

* Docs - Update CHANGELOG.md (#646)

Updates

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Abishek <52214183+r-abishek@users.noreply.github.com>
Co-authored-by: Srihari-mcw <srihari@multicorewareinc.com>
Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>
Co-authored-by: jonatluu <jonatluu@amd.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sampath117 <snehaa@multicorewareinc.com>
Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: hmaddise <HazarathKumar.Maddisetty@amd.com>
Labels

ci:precheckin enhancement New feature or request