Skip to content

Histogram Equalize kernel Implementation - HOST / HIP #533

Merged
r-abishek merged 23 commits intor-abishek:ar/histogram_equalizefrom
RooseweltMcW:apr/histEqualize
Feb 25, 2026
Merged

Histogram Equalize kernel Implementation - HOST / HIP #533
r-abishek merged 23 commits intor-abishek:ar/histogram_equalizefrom
RooseweltMcW:apr/histEqualize

Conversation

@RooseweltMcW
Copy link
Copy Markdown

Histogram Equalize kernel implementation for both HOST and HIP backend

Srihari-mcw pushed a commit to Srihari-mcw/rpp that referenced this pull request Dec 8, 2025
ManasaDattaT pushed a commit to ManasaDattaT/rpp that referenced this pull request Dec 19, 2025
Copy link
Copy Markdown
Owner

@r-abishek r-abishek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HazarathKumarM @RooseweltMcW

Please address minor comment.
This funciton only has U8.
Please confirm U8 QA passing for this?


if ((srcDescPtr->dataType == RpptDataType::U8) && (dstDescPtr->dataType == RpptDataType::U8))
{
histogram_equalize_u8_u8_host_tensor(static_cast<Rpp8u*>(srcPtr) + srcDescPtr->offsetInBytes,
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this function is only for U8?

Pls add reejction condition for an F32/I8/F16 input then.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check for src and dst data types.
Perhaps add an appropriate RPP_ERROR for invalid source data type and invalid destination data type

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, Done.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request implements histogram equalization functionality for both HOST (CPU) and HIP (GPU) backends in the RPP library. Histogram equalization is a color augmentation technique that enhances image contrast by redistributing pixel intensity values.

Changes:

  • Adds histogram_equalize augmentation with ID 97 for both HOST and HIP backends
  • Implements color space conversion (RGB ↔ YCbCr) and histogram equalization on the Y channel for RGB images
  • Adds test suite integration for the new augmentation

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
utilities/test_suite/rpp_test_suite_image.h Adds HISTOGRAM_EQUALIZE enum and augmentation map entry
utilities/test_suite/common.py Adds histogram_equalize to Python test configuration and color_augmentations group
utilities/test_suite/HOST/Tensor_image_host.cpp Integrates histogram_equalize test case for HOST backend
utilities/test_suite/HIP/Tensor_image_hip.cpp Integrates histogram_equalize test case for HIP backend
src/modules/tensor/rppt_tensor_color_augmentations.cpp Implements rppt_histogram_equalize API entry point with validation
src/modules/tensor/hip/kernel/histogram_equalize.cpp Implements HIP kernels for color conversion, histogram collection, LUT building, and application
src/modules/tensor/cpu/kernel/histogram_equalize.cpp Implements HOST CPU functions with AVX2 SIMD optimizations for histogram equalization
src/include/tensor/host_tensor_executors.hpp Declares HOST executor function
src/include/tensor/hip_tensor_executors.hpp Declares HIP executor function
api/rppt_tensor_color_augmentations.h Adds public API documentation for rppt_histogram_equalize
docs/data/doxygenOutputs/color_augmentations_histogram_equalize_img150x150.png Adds sample output image for documentation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +42 to +43
__device__ constexpr float coeffGCb = -0.344136f;
__device__ constexpr float coeffGCr = -0.714136f;
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The YCbCr to RGB conversion coefficients have a sign mismatch between HOST and HIP implementations. The HIP version uses negative values (coeffGCb = -0.344136f, coeffGCr = -0.714136f) on lines 42-43, while the HOST version uses positive values (coeffGCb = 0.344136f, coeffGCr = 0.714136f) in cpu/kernel/histogram_equalize.cpp lines 43-44. The standard YCbCr to RGB conversion formula is: G = Y - 0.344136 * Cb - 0.714136 * Cr. Since the HIP implementation already negates coeffGCb and coeffGCr when defining them, but then uses them in addition operations (lines 100-101), this results in incorrect calculations. Either the sign should be removed from the coefficient definitions, or the operations should use subtraction instead of addition.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified, Done.

Comment on lines +298 to +302
Rpp8u *scratchBase = reinterpret_cast<Rpp8u *>(handle.GetInitHandle()->mem.mcpu.scratchBufferHost);
Rpp8u *yBuf = scratchBase + batchCount * (pixels * 3);
Rpp8u *cbBuf = yBuf + pixels;
Rpp8u *crBuf = cbBuf + pixels;
Rpp8u *dstYBuf = yBuf;
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential buffer overflow in HOST implementation. The code allocates scratch buffers for each batch element without validating total size against available scratch buffer. Line 299 computes: yBuf = scratchBase + batchCount * (pixels * 3), where pixels = roiWidth * roiHeight. For parallel execution with OpenMP (line 278), multiple threads may simultaneously access the scratch buffer with different batchCount values. If the total requirement (batchSize * max_pixels * 3) exceeds the available scratchBufferHost size, this will cause memory corruption. There's no bounds checking to ensure the scratch buffer is large enough.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified


// -------------------- histogram_equalize --------------------

RppStatus hip_exec_histogram_equalize_tensor(Rpp8u *srcPtr1,
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent parameter naming. The first parameter is named srcPtr1 in the declaration, but it should be srcPtr for consistency with all other function declarations in this file and with the actual implementation.

Suggested change
RppStatus hip_exec_histogram_equalize_tensor(Rpp8u *srcPtr1,
RppStatus hip_exec_histogram_equalize_tensor(Rpp8u *srcPtr,

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified, Done.

cbPtr = cbBuf;
crPtr = crBuf;
#if __AVX2__
alignedLength = ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement;
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential edge case issue with alignedLength calculation. The formula ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement will produce negative or zero values when roiWidth is less than vectorIncrement (48). For small ROIs (e.g., roiWidth = 16), this results in alignedLength = ((16 / 48) - 1) * 48 = (0 - 1) * 48 = -48. The loop condition vectorLoopCount < alignedLength would then be false initially, which is correct behavior, but using a potentially negative value is poor practice. Consider using: alignedLength = (roi.xywhROI.roiWidth / vectorIncrement) * vectorIncrement to avoid negative values, or add a check to ensure alignedLength is non-negative.

Suggested change
alignedLength = ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement;
if (roi.xywhROI.roiWidth >= vectorIncrement)
alignedLength = ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement;
else
alignedLength = 0;

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

cbPtr = cbBuf;
crPtr = crBuf;
#if __AVX2__
alignedLength = ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement;
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential edge case issue with alignedLength calculation. The formula ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement will produce negative or zero values when roiWidth is less than vectorIncrement (48). For small ROIs (e.g., roiWidth = 16), this results in alignedLength = ((16 / 48) - 1) * 48 = (0 - 1) * 48 = -48. The loop condition vectorLoopCount < alignedLength would then be false initially, which is correct behavior, but using a potentially negative value is poor practice. Consider using: alignedLength = (roi.xywhROI.roiWidth / vectorIncrement) * vectorIncrement to avoid negative values, or add a check to ensure alignedLength is non-negative.

Suggested change
alignedLength = ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement;
alignedLength = ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement;
if (alignedLength < 0)
alignedLength = 0;

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +356 to +366
unsigned int *d_hist = reinterpret_cast<unsigned int*>(handle.GetInitHandle()->mem.mgpu.scratchBufferHip.floatmem);
unsigned char *d_lut = reinterpret_cast<unsigned char*>(d_hist + batchSize * HISTOGRAM_BINS);

if(srcDescPtr->c == 3)
{
const size_t planeSize = static_cast<size_t>(srcDescPtr->w) * srcDescPtr->h * srcDescPtr->n;
Rpp8u *yuvBuf = reinterpret_cast<Rpp8u *>(d_lut + batchSize * HISTOGRAM_BINS);

Rpp8u *yBuf = yuvBuf;
Rpp8u *cbBuf = yuvBuf + planeSize;
Rpp8u *crBuf = yuvBuf + (planeSize * 2);
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential buffer overflow issue. The scratch buffer allocation assumes sequential memory layout without validation. The code allocates:

  1. d_hist: batchSize * 256 * sizeof(unsigned int) = batchSize * 1024 bytes
  2. d_lut: batchSize * 256 * sizeof(unsigned char) = batchSize * 256 bytes
  3. yuvBuf: 3 * planeSize = 3 * w * h * n bytes (for 3-channel images)

The total requirement is batchSize * 1280 + 3 * w * h * n bytes. For large images (e.g., 3840x2160), this equals batchSize * 1280 + 3 * 3840 * 2160 * n = 24,883,200 * n + 1280 * n bytes per batch element. The scratch buffer is 8,294,400 * sizeof(float) = 33,177,600 bytes total. This may be insufficient for large batches or images, yet there's no validation to ensure the required memory fits within the available scratch buffer, risking memory corruption.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified


// -------------------- histogram_equalize --------------------

RppStatus histogram_equalize_u8_u8_host_tensor(Rpp8u *srcPtr1,
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent parameter naming. The first parameter is named srcPtr1 in the declaration, but it should be srcPtr for consistency with all other function declarations in this file and with the actual implementation in histogram_equalize.cpp line 265 which uses srcPtr.

Suggested change
RppStatus histogram_equalize_u8_u8_host_tensor(Rpp8u *srcPtr1,
RppStatus histogram_equalize_u8_u8_host_tensor(Rpp8u *srcPtr,

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified, Done.

collect_hist_pln_tensor_host(srcPtr, hist, roiWidth, roiHeight, srcDescPtr->strides.hStride);
build_lut_from_hist_host(hist, lutBatch, pixels);
apply_lut_tensor(srcPtr, dstPtr, roiWidth, roiHeight, lutBatch, srcDescPtr->strides.hStride, dstDescPtr->strides.hStride);
}
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing error handling for unsupported layout combinations. The implementation only handles specific layout combinations (NCHW->NCHW, NCHW->NHWC, NHWC->NCHW, NHWC->NHWC for 3-channel, and NCHW->NCHW for 1-channel). If the code reaches a combination not explicitly handled (though unlikely given earlier validation), it will silently skip processing and return RPP_SUCCESS without producing any output or error. Consider adding an else clause to return an appropriate error status for consistency with error handling in other parts of the codebase.

Suggested change
}
}
else
{
// Unsupported channel/layout combination
return RPP_ERROR;
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

@HazarathKumarM HazarathKumarM Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not relevant , we already checking the condition and also it is not possible to return inside openMP loop

testCaseName = "histogram_equalize";

startWallTime = omp_get_wtime();
startCpuTime = clock();
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The startCpuTime = clock() statement is missing in this case block. All other test cases in this file follow the pattern of setting both startWallTime and startCpuTime before calling the function. This line should be added after line 1855 (startWallTime = omp_get_wtime()) for consistency with the rest of the codebase.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already present, can be ignored.

@HazarathKumarM
Copy link
Copy Markdown
Collaborator

@r-abishek all comments are reoslved now

@r-abishek r-abishek changed the base branch from develop to ar/histogram_equalize February 25, 2026 00:24
@r-abishek r-abishek merged commit 0a6a42f into r-abishek:ar/histogram_equalize Feb 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants