Skip to content

Box Filter Nearest Neighbors Padding#455

Closed
Srihari-mcw wants to merge 70 commits intor-abishek:developfrom
Srihari-mcw:box_filter_padding_updates
Closed

Box Filter Nearest Neighbors Padding#455
Srihari-mcw wants to merge 70 commits intor-abishek:developfrom
Srihari-mcw:box_filter_padding_updates

Conversation

@Srihari-mcw
Copy link
Copy Markdown
Collaborator

@Srihari-mcw Srihari-mcw commented Jul 1, 2025

The PR contains changes to add nearest neighbors padding for box filter on both HOST and HIP Backends

@Srihari-mcw
Copy link
Copy Markdown
Collaborator Author

@r-abishek, the PR is ready for your review

@Srihari-mcw Srihari-mcw force-pushed the box_filter_padding_updates branch from e01e45d to f407f34 Compare August 5, 2025 03:57
@HazarathKumarM
Copy link
Copy Markdown
Collaborator

@r-abishek
This PR is ready for review. Major function restructuring has been done in the HIP code

Copy link
Copy Markdown
Owner

@r-abishek r-abishek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Srihari-mcw @HazarathKumarM Added some key comments. Please discuss out the feasibility and let me know

// load function for 3x3 kernel size
inline void rpp_load_box_filter_char_3x3_host(__m256i *pxRow, Rpp8u **srcPtrTemp, Rpp32s rowKernelLoopLimit)
template<typename T>
inline void rpp_load_box_filter_char_3x3_host(__m256i *pxRow, T **srcPtrTemp, Rpp32s rowKernelLoopLimit, Rpp32s padIndex)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these 4 functions are templated, then ideally separator comment on L1017 above should indicate that its for both U8/I8?

// unpack and sign extend higher half of 9 256 bit registers and add (used for 9x9 kernel size I8 variants)
inline void unpackhi_signext_and_add_9x9_host(__m256i *pxRow, __m256i *pxDst)
{
pxDst[0] = _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpackhi_epi8(pxRow[0], avx_px0), 8), 8);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have 2 templated functions, just one for unpackhi and one for unpacklo?
Try looping for different kernel sizes like below. Check if we can get away on performance.
Try a # pragma unroll if compiler doesn't automatically loop unroll

template <typename K>
inline void unpackhi_signext_and_add_host(__m256i *pxRow, __m256i *pxDst)
{
    pxDst[0] = _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpackhi_epi8(pxRow[0], avx_px0), 8), 8);
    for (int i = 1; i <= K; i++)
        pxDst[0] = _mm256_add_epi16(pxDst[0], _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpackhi_epi8(pxRow[i], avx_px0), 8), 8));
}

template <typename K>
inline void unpacklo_signext_and_add_host(__m256i *pxRow, __m256i *pxDst)
{
    pxDst[0] = _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpacklo_epi8(pxRow[0], avx_px0), 8), 8);
    for (int i = 1; i <= K; i++)
        pxDst[0] = _mm256_add_epi16(pxDst[0], _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpacklo_epi8(pxRow[i], avx_px0), 8), 8));
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if constexpr (std::is_same<T, Rpp8s>::value)
{
unpacklo_signext_and_add_9x9_host(pxRow, &pxRowHalf[0]);
unpackhi_signext_and_add_9x9_host(pxRow, &pxRowHalf[1]);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like you always have unpacklo immediately followed by unpackhi.

Why not change to:

if constexpr (std::is_same<T, Rpp8s>::value)
    unpack_signext_and_add_host<kernelSize>(pxRow, pxRowHalf); // excutes unpacklo and unpackhi (kernelSize is templated)
else
    unpack_and_add_host<kernelSize>(pxRow, pxRowHalf); // excutes unpacklo and unpackhi (kernelSize is templated)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand unpack_signext_and_add_host is new and should work. But the older unpack_and_add_9x9_host() might have deps. But we can check on ability to shrink code further here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

int clampedX = max(roiTensorPtrSrc[id_z].xywhROI.xy.x,
min(id_x_i + i, roiTensorPtrSrc[id_z].xywhROI.xy.x + roiTensorPtrSrc[id_z].xywhROI.roiWidth - 1));
int clampedY = max(roiTensorPtrSrc[id_z].xywhROI.xy.y,
min(id_y_i, roiTensorPtrSrc[id_z].xywhROI.xy.y + roiTensorPtrSrc[id_z].xywhROI.roiHeight - 1));
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clampedY is completely loop independent.
For clampedX, the whole section "roiTensorPtrSrc[id_z].xywhROI.xy.x + roiTensorPtrSrc[id_z].xywhROI.roiWidth - 1" is loop independent.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


tempBuffer[rgbOffset] = srcPtr[clampedIdx]; // R
tempBuffer[rgbOffset + 1] = srcPtr[clampedIdx + 1]; // G
tempBuffer[rgbOffset + 2] = srcPtr[clampedIdx + 2]; // B
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

srcPtr is global memory.
We should probably not be accessing it 3 * 8 in loop times.

Can't we read a whole row of n elements in one shot, then assign into tempBuffer?
Basically:

  • there will be a minimum possible value of clampedIdx in all runs of that loop.
  • there will be a maximum possible value of clampedIdx in all runs of that loop.
  • Find those and read the whole row.

For the last point, we may need a way to map our scalar and vector types at compile time. (Basically float should mean d_float24, uchar should mean d_uchar24 and so on)
If there is a way to define a type vec24 - which would always be the 24-element version of whatever template type T comes in, then this would perhaps work?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copying all the elements at once won't be possible as it will lead out of bounds memory access in some test cases . However we eliminated the usage of temp buffer and directly copied to shared mem. It gave us some improvements as well

int clampedIdx = (id_z * srcStridesNH.x) + (clampedY * srcStridesNH.y) + (clampedX * 3);

tempBuffer[rgbOffset] = srcPtr[clampedIdx]; // R
tempBuffer[rgbOffset + 1] = srcPtr[clampedIdx + 1]; // G
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Srihari-mcw @HazarathKumarM
For previous comment, something like this possible? Pls discuss

template<typename T> struct vec24_of;

template<> struct vec24_of<uchar> { using type = d_uchar24; }; // 24 uchar will always be d_uchar24
template<> struct vec24_of<schar> { using type = d_schar24; }; // 24 schar will always be d_schar24
template<> struct vec24_of<half>  { using type = d_half24;  }; // 24 half will always be d_half24
template<> struct vec24_of<float> { using type = d_float24; }; // 24 float will always be d_float24

template<typename T>
using vec24_t = typename vec24_of<T>::type; // Just to say that vec24_t is the 24 element type of T and is dependent on T

// Then while calling or using it inside a kernel
template<typename T>
__global__ void box_filter_5x5_pkd_hip_tensor(T *srcPtr,
                                              uint2 srcStridesNH,
                                              T *dstPtr,
                                              uint2 dstStridesNH,
                                              uint padLength,
                                              uint2 tileSize,
                                              RpptROIPtr roiTensorPtrSrc)
{
    vec24_t<T> tempBuffer_24 = *(vec24_t<T>*)&srcPtr[clampedIndex]; // use minimum value of clampedIndex for the whole loop and take all elements needed after that into tempBuffer_24 - assuming a max of 24 elements needed?
}

@r-abishek r-abishek requested a review from Copilot August 13, 2025 01:42
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Srihari-mcw
Copy link
Copy Markdown
Collaborator Author

External PR issued and merged

ManasaDattaT pushed a commit to ManasaDattaT/rpp that referenced this pull request Dec 19, 2025
…s/sphinx (r-abishek#455)

Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.8.1 to 1.8.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/v1.8.2/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v1.8.1...v1.8.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants