Eliminating bounds checks in the YUV->RGBA conversion#6
Eliminating bounds checks in the YUV->RGBA conversion#6torokati44 wants to merge 3 commits intoruffle-rs:masterfrom
Conversation
kmeisthax
left a comment
There was a problem hiding this comment.
I found a number of pathological cases that could trigger unsafety - they all have to do with mismatches between the chroma and luma input sizes. We don't have type-level assurances that the three arrays are compatibly sized, so we'll need some additional bounds checks.
I also would like to see some of the internal functions documented for their safety requirements, I've flagged those as well.
| @@ -131,62 +180,60 @@ pub fn yuv420_to_rgba( | |||
| let y_height = y.len() / y_width; | |||
| let br_height = chroma_b.len() / br_width; | |||
There was a problem hiding this comment.
It is possible to hand this function a chroma_r that is too short for it's given geometry. Since this is the entry point for the whole crate, it's unsound as-is. (For the record, this function's use of y and chroma_b appear to be sound; since we derive different bounds for it based on it's length.)
We should either have separate b_height and r_height calculations and variables, or return an error if chroma_r is too short. I'd prefer the former over the latter but I'm not sure what the performance impact is in this case.
| @@ -40,31 +38,31 @@ fn sample_chroma_for_luma( | |||
| (luma_y as i32 - 1) / 2 | |||
There was a problem hiding this comment.
sample_chroma_for_luma needs to be marked as unsafe, as it has several invariants that must be fulfilled in order for code that uses it to remain sound:
- The
chroma_widthandchroma_heightparameters must be derived from the length ofchromain such a way that their product cannot exceed that length, otherwise clamping will fail to prohibit out-of-bounds reads - If
clampis off, thenluma_xandluma_ymust already be bounds-checked to twice thechroma_widthandchroma_height(respectively) in order to prohibit out-of-bounds reads
There was a problem hiding this comment.
These invariants should also be documented in the function's doccomment
| rgba[base + 2] = clamp(b); | ||
| rgba[base + 3] = 255; | ||
| #[inline] | ||
| fn convert_and_write_pixel(yuv: (u8, u8, u8), rgba: &mut Vec<u8>, base: usize, luts: &LUTs) { |
There was a problem hiding this comment.
convert_and_write_pixel should also be marked as unsafe, as it has an invariant must be fulfilled for callers to remain sound:
basemust be less than the length ofrgbaminus four.
This invariant should also be doccommented.
Furthermore, rgba itself is not being reallocated, so it should be passed as &mut [u8] instead of &mut Vec<u8>.
| let y_sample = y.get(x_pos + y_pos * y_width).copied().unwrap_or(0); | ||
| let b_sample = | ||
| sample_chroma_for_luma(chroma_b, br_width, br_height, x_pos, y_pos, false) as f32; | ||
| sample_chroma_for_luma(chroma_b, br_width, br_height, x_pos, y_pos, false); |
There was a problem hiding this comment.
Pathologically small chroma_b or chroma_r inputs to this function are unsound, as the unclamped sampling will read outside their bounds. x_pos and y_pos are derived from the width and height of luma, but there is nothing to ensure that chroma_b or chroma_r's geometry is compatible with luma's. As a result, this code can read out of bounds.
|
|
||
| // doing the sides with clamping | ||
| for y_pos in 0..y_height { | ||
| for x_pos in [0, y_width - 1].iter() { |
There was a problem hiding this comment.
Pathologically small inputs (specifically, 1px wide videos) will cause a panic here. This should instead be a saturating_sub, so that in this case, the conversion will merely harmlessly run twice.
As far as I can tell, the use of the unsafe functions below is still sound, since clamping is on. We may still want to test this somehow.
This may impact performance; depending on how LLVM's optimizer is feeling about loop-invariant code motion today.
| for x_pos in 0..y_width { | ||
| for y_pos in [0, y_height - 1].iter() { | ||
| let y_sample = y.get(x_pos + y_pos * y_width).copied().unwrap_or(0) as f32; | ||
| for y_pos in [0, y_height - 1].iter() { |
There was a problem hiding this comment.
This can also panic on pathologically small video heights.
|
Before addressing the comments, let me just add a note that the H.263 ITU Recommendation has this paragraph in it:
Adhering to the "decoded in the same manner as if the width or height had the next larger size that would be divisible by 16" part could simplify some loops in the IDCT and gather phases - but is not relevant for this PR. The last part of that sentence is relevant however: "and then the picture is cropped at the right and the |
|
My guess would be that the cropping is supposed to happen after the YUV 420 conversion, because that removes ambiguity about how to handle odd-sized luma pictures. That being said, while the H.263 specification prohibits the pathological inputs I mentioned, the PR doesn't. It's entirely possible this code winds up getting reused for other video formats that are less restrictive with custom picture sizes. We don't need to correctly decode all invalid picture sizes, but we do need to reject them before any |
|
Superseded by #9. |
These changes together yield an overall ∼21% reduction of the time(outdated)run_frame()takes, on average.Measured on the self-hosted web build, running the first 100 frames of z0r loops 3664, 4449, and 7081; but discarding the first 20 frame times of each loop before averaging - to let the WASM JIT warm up, and the numbers stabilize a little.
The results:

The first and last rows are about the current state, the middle rows are some "milestone" commits of this PR. The X axis is milliseconds. Every measurement was done four times.
The benchmarks were done in Chromium 94, on a Ryzen 2700X (using
playwrightand thetemcitool).The first two commits I'm fairly confident in being "good", the middle one is just "okay", and the last two "should be fine, I think".(outdated)More details about the changes in the added comments.