Box Filter Nearest Neighbors Padding #455
Srihari-mcw wants to merge 70 commits into r-abishek:develop
Conversation
@r-abishek, the PR is ready for your review
e01e45d to f407f34: …ptimized the I8 variants
@r-abishek
r-abishek left a comment
@Srihari-mcw @HazarathKumarM Added some key comments. Please discuss the feasibility and let me know.
// load function for 3x3 kernel size
inline void rpp_load_box_filter_char_3x3_host(__m256i *pxRow, Rpp8u **srcPtrTemp, Rpp32s rowKernelLoopLimit)
template<typename T>
inline void rpp_load_box_filter_char_3x3_host(__m256i *pxRow, T **srcPtrTemp, Rpp32s rowKernelLoopLimit, Rpp32s padIndex)
If these 4 functions are templated, then ideally the separator comment on L1017 above should indicate that it's for both U8/I8?
// unpack and sign extend higher half of 9 256-bit registers and add (used for 9x9 kernel size I8 variants)
inline void unpackhi_signext_and_add_9x9_host(__m256i *pxRow, __m256i *pxDst)
{
    pxDst[0] = _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpackhi_epi8(pxRow[0], avx_px0), 8), 8);
Can we have 2 templated functions, just one for unpackhi and one for unpacklo?
Try looping over the kernel size like below, and check whether performance holds up.
Try a #pragma unroll if the compiler doesn't unroll the loop automatically.
template <int K> // K = kernel size (number of rows)
inline void unpackhi_signext_and_add_host(__m256i *pxRow, __m256i *pxDst)
{
    pxDst[0] = _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpackhi_epi8(pxRow[0], avx_px0), 8), 8);
    for (int i = 1; i < K; i++)
        pxDst[0] = _mm256_add_epi16(pxDst[0], _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpackhi_epi8(pxRow[i], avx_px0), 8), 8));
}
template <int K>
inline void unpacklo_signext_and_add_host(__m256i *pxRow, __m256i *pxDst)
{
    pxDst[0] = _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpacklo_epi8(pxRow[0], avx_px0), 8), 8);
    for (int i = 1; i < K; i++)
        pxDst[0] = _mm256_add_epi16(pxDst[0], _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpacklo_epi8(pxRow[i], avx_px0), 8), 8));
}
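The slli/srai pair above is a sign-extension trick: after _mm256_unpackhi_epi8 interleaves bytes with zeros, each 16-bit lane holds a zero-extended int8 value, and shifting left then arithmetically right by 8 recovers the signed value. A scalar sketch of the same trick (function name hypothetical, not from the PR):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the SIMD trick: a 16-bit lane holding a zero-extended
// int8 byte is sign-extended by shifting left 8 then arithmetic-right 8.
inline int16_t signExtendLane(uint16_t lane)
{
    return static_cast<int16_t>(static_cast<uint16_t>(lane << 8)) >> 8;
}
```

The same shift pair applied per 16-bit lane is what _mm256_slli_epi16 followed by _mm256_srai_epi16 performs across the whole register.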
if constexpr (std::is_same<T, Rpp8s>::value)
{
    unpacklo_signext_and_add_9x9_host(pxRow, &pxRowHalf[0]);
    unpackhi_signext_and_add_9x9_host(pxRow, &pxRowHalf[1]);
Seems like you always have unpacklo immediately followed by unpackhi.
Why not change to:
if constexpr (std::is_same<T, Rpp8s>::value)
    unpack_signext_and_add_host<kernelSize>(pxRow, pxRowHalf); // executes unpacklo and unpackhi (kernelSize is templated)
else
    unpack_and_add_host<kernelSize>(pxRow, pxRowHalf); // executes unpacklo and unpackhi (kernelSize is templated)
I understand unpack_signext_and_add_host is new and should work, but the older unpack_and_add_9x9_host() might have dependencies. Still, we can check whether the code can be shrunk further here.
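The `if constexpr` dispatch suggested above discards the untaken branch at compile time, so each instantiation compiles only one path. A minimal self-contained sketch of the pattern (stand-in accumulate functions, not the RPP code):

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Stand-ins for the signed (sign-extending) and unsigned accumulation paths.
inline int accumulateSigned(int8_t v)    { return static_cast<int>(v); }
inline int accumulateUnsigned(uint8_t v) { return static_cast<int>(v); }

// if constexpr selects the path at compile time; the untaken branch is
// discarded, so passing a uint8_t never instantiates the signed call.
template <typename T>
int accumulate(T v)
{
    if constexpr (std::is_same<T, int8_t>::value)
        return accumulateSigned(v);
    else
        return accumulateUnsigned(v);
}
```

This is why the Rpp8s/Rpp8u split can live in one templated function without any runtime branch cost.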
int clampedX = max(roiTensorPtrSrc[id_z].xywhROI.xy.x,
                   min(id_x_i + i, roiTensorPtrSrc[id_z].xywhROI.xy.x + roiTensorPtrSrc[id_z].xywhROI.roiWidth - 1));
int clampedY = max(roiTensorPtrSrc[id_z].xywhROI.xy.y,
                   min(id_y_i, roiTensorPtrSrc[id_z].xywhROI.xy.y + roiTensorPtrSrc[id_z].xywhROI.roiHeight - 1));
clampedY is completely loop-independent.
For clampedX, the whole subexpression "roiTensorPtrSrc[id_z].xywhROI.xy.x + roiTensorPtrSrc[id_z].xywhROI.roiWidth - 1" is loop-independent.
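The hoisting suggested here might look like the following plain C++ sketch (the `Roi` struct and function name are hypothetical stand-ins for `roiTensorPtrSrc[id_z].xywhROI`):

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical ROI mirroring roiTensorPtrSrc[id_z].xywhROI from the diff.
struct Roi { int x, y, width, height; };

// Hoisted version: clampedY and the X upper bound do not depend on the
// kernel-column index i, so compute them once before the loop.
inline void clampKernelColumns(const Roi &roi, int idX, int idY,
                               int kernelSize, int *clampedX, int *clampedY)
{
    int xMax = roi.x + roi.width - 1;                               // loop-invariant
    int cy = std::max(roi.y, std::min(idY, roi.y + roi.height - 1)); // loop-invariant
    for (int i = 0; i < kernelSize; i++)
    {
        clampedX[i] = std::max(roi.x, std::min(idX + i, xMax));
        clampedY[i] = cy; // same value for every column
    }
}
```

On GPU the compiler may already hoist these, but doing it explicitly keeps the inner loop to one min/max pair per tap.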
tempBuffer[rgbOffset] = srcPtr[clampedIdx];         // R
tempBuffer[rgbOffset + 1] = srcPtr[clampedIdx + 1]; // G
tempBuffer[rgbOffset + 2] = srcPtr[clampedIdx + 2]; // B
srcPtr is global memory. We should probably not be accessing it 3 × 8 times inside the loop.
Can't we read a whole row of n elements in one shot, then assign into tempBuffer?
Basically:
- there will be a minimum possible value of clampedIdx across all iterations of that loop;
- there will be a maximum possible value of clampedIdx across all iterations of that loop;
- find those and read the whole row.
For the last point, we may need a way to map our scalar and vector types at compile time (basically, float should mean d_float24, uchar should mean d_uchar24, and so on). If there is a way to define a type vec24, which would always be the 24-element version of whatever template type T comes in, then this would perhaps work?
Copying all the elements at once won't be possible, as it would lead to out-of-bounds memory accesses in some test cases. However, we eliminated the usage of the temp buffer and copied directly to shared memory, which gave us some improvements as well.
int clampedIdx = (id_z * srcStridesNH.x) + (clampedY * srcStridesNH.y) + (clampedX * 3);
tempBuffer[rgbOffset] = srcPtr[clampedIdx];         // R
tempBuffer[rgbOffset + 1] = srcPtr[clampedIdx + 1]; // G
@Srihari-mcw @HazarathKumarM
For the previous comment, is something like this possible? Please discuss:
template<typename T> struct vec24_of;
template<> struct vec24_of<uchar> { using type = d_uchar24; }; // 24 uchar will always be d_uchar24
template<> struct vec24_of<schar> { using type = d_schar24; }; // 24 schar will always be d_schar24
template<> struct vec24_of<half> { using type = d_half24; }; // 24 half will always be d_half24
template<> struct vec24_of<float> { using type = d_float24; }; // 24 float will always be d_float24
template<typename T>
using vec24_t = typename vec24_of<T>::type; // Just to say that vec24_t is the 24 element type of T and is dependent on T
// Then while calling or using it inside a kernel
template<typename T>
__global__ void box_filter_5x5_pkd_hip_tensor(T *srcPtr,
uint2 srcStridesNH,
T *dstPtr,
uint2 dstStridesNH,
uint padLength,
uint2 tileSize,
RpptROIPtr roiTensorPtrSrc)
{
vec24_t<T> tempBuffer_24 = *(vec24_t<T>*)&srcPtr[clampedIndex]; // use minimum value of clampedIndex for the whole loop and take all elements needed after that into tempBuffer_24 - assuming a max of 24 elements needed?
}
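In isolation, the suggested trait pattern compiles as sketched below (with stand-in 24-element structs instead of RPP's d_uchar24/d_float24, which are assumed here, not defined in this PR excerpt):

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Stand-ins for RPP's 24-element vector structs (d_uchar24, d_float24).
struct Uchar24 { uint8_t data[24]; };
struct Float24 { float data[24]; };

// Primary template left undefined: instantiating with an unmapped T fails
// to compile, which catches unsupported scalar types at build time.
template <typename T> struct vec24_of;
template <> struct vec24_of<uint8_t> { using type = Uchar24; };
template <> struct vec24_of<float>   { using type = Float24; };

// vec24_t<T> is "the 24-element version of T".
template <typename T>
using vec24_t = typename vec24_of<T>::type;

static_assert(std::is_same<vec24_t<uint8_t>, Uchar24>::value,
              "uint8_t maps to its 24-element struct");
static_assert(std::is_same<vec24_t<float>, Float24>::value,
              "float maps to its 24-element struct");
```

The static_asserts show the mapping is resolved entirely at compile time, so a templated kernel can declare `vec24_t<T>` locals with no runtime dispatch.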
External PR issued and merged
…s/sphinx (r-abishek#455)
Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.8.1 to 1.8.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/v1.8.2/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v1.8.1...v1.8.2)
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-type: direct:production
  update-type: version-update:semver-patch
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This PR adds nearest-neighbor padding for the box filter on both the HOST and HIP backends.
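As a reference for what "nearest neighbors padding" means here: kernel taps that fall outside the image are clamped to the nearest valid pixel (clamp-to-edge/replicate), which is exactly what the min/max clamping in the diff implements. A scalar single-channel sketch, not the PR's SIMD/HIP code:

```cpp
#include <algorithm>
#include <cassert>

// Scalar box filter with nearest-neighbor (clamp-to-edge) padding:
// out-of-bounds taps reuse the nearest border pixel instead of zero.
inline float boxFilterAt(const float *src, int width, int height,
                         int x, int y, int kernelSize)
{
    int pad = kernelSize / 2;
    float sum = 0.0f;
    for (int ky = -pad; ky <= pad; ky++)
    {
        int sy = std::max(0, std::min(y + ky, height - 1)); // clamp row
        for (int kx = -pad; kx <= pad; kx++)
        {
            int sx = std::max(0, std::min(x + kx, width - 1)); // clamp col
            sum += src[sy * width + sx];
        }
    }
    return sum / (kernelSize * kernelSize);
}
```

At image corners this padding replicates the corner pixel into the missing taps, so a constant image stays constant after filtering, unlike with zero padding.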