Summary
`rawBufferVectorLoad` and `rawBufferVectorStore` (SM 6.9 vector buffer ops) emit 16-bit element types for min precision types (`min16int`, `min16uint`, `min16float`). This causes drivers to load/store 2 bytes per element instead of 4, mismatching the expected buffer layout.
Repro
```hlsl
RWByteAddressBuffer g : register(u0);

[numthreads(1,1,1)]
void main() {
  // Load and store a min precision vector through the raw buffer.
  min16int3 v = g.Load<min16int3>(0);
  g.Store<min16int3>(12, v);
}
```
Compile with `dxc -T cs_6_9`:
Actual (buggy):
- `rawBufferVectorLoad.v3i16`: loads 3 x 2 bytes = 6 bytes
- `rawBufferVectorStore.v3i16`: stores 3 x 2 bytes = 6 bytes

Expected:
- `rawBufferVectorLoad.v3i32` + `trunc <3 x i32> to <3 x i16>`: loads 3 x 4 bytes = 12 bytes
- `sext <3 x i16> to <3 x i32>` + `rawBufferVectorStore.v3i32`: stores 3 x 4 bytes = 12 bytes
Root Cause
`TranslateBufLoad` in `HLOperationLower.cpp` (line ~4353) creates the vector type directly from the min precision element type without widening to 32-bit first. Pre-SM6.9 `RawBufferLoad` correctly handles this by loading as i32 and truncating; the SM6.9 vector variant should do the same.
Analysis
WARP treats i16 `rawBufferVectorLoad` as 2-byte-per-element loads (confirmed in source). Pre-SM6.9, DXC emits `rawBufferLoad.i32` + `trunc i32 to i16` for min precision, which correctly loads 4 bytes. The SM6.9 vector path skips this widening, producing a buffer layout mismatch when the CPU writes 32-bit values.
The same issue affects `min16uint` and `min16float` (half).
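To make the mismatch concrete, here is a small host-side C++ sketch that models it on a plain byte array. It is not DXC or WARP code: the helper names (`packAsInt32`, `loadWithStride2`, `loadWithStride4AndTrunc`) are hypothetical, and the little-endian byte order is an assumption of the model.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: lay out values the way a CPU-side upload of 32-bit
// elements would, 4 bytes per element, little-endian (assumed).
std::vector<std::uint8_t> packAsInt32(const std::array<std::int32_t, 3> &values) {
  std::vector<std::uint8_t> bytes;
  for (std::int32_t v : values) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    for (int shift = 0; shift < 32; shift += 8)
      bytes.push_back(static_cast<std::uint8_t>((u >> shift) & 0xFFu));
  }
  return bytes;
}

// Models the buggy v3i16 load: 3 elements, 2 bytes each (6 bytes total).
std::array<std::int16_t, 3> loadWithStride2(const std::vector<std::uint8_t> &bytes,
                                            std::size_t offset) {
  assert(offset + 3 * 2 <= bytes.size());
  std::array<std::int16_t, 3> out{};
  for (std::size_t i = 0; i < 3; ++i) {
    const std::size_t p = offset + i * 2;
    const std::uint16_t u = static_cast<std::uint16_t>(
        static_cast<std::uint32_t>(bytes[p]) |
        (static_cast<std::uint32_t>(bytes[p + 1]) << 8));
    out[i] = static_cast<std::int16_t>(u);
  }
  return out;
}

// Models the expected v3i32 load followed by trunc to i16 (12 bytes total).
std::array<std::int16_t, 3> loadWithStride4AndTrunc(const std::vector<std::uint8_t> &bytes,
                                                    std::size_t offset) {
  assert(offset + 3 * 4 <= bytes.size());
  std::array<std::int16_t, 3> out{};
  for (std::size_t i = 0; i < 3; ++i) {
    const std::size_t p = offset + i * 4;
    std::uint32_t u = 0;
    for (std::size_t k = 0; k < 4; ++k)
      u |= static_cast<std::uint32_t>(bytes[p + k]) << (8 * k);
    // trunc: keep only the low 16 bits of the 32-bit element.
    out[i] = static_cast<std::int16_t>(static_cast<std::uint16_t>(u & 0xFFFFu));
  }
  return out;
}
```

With the CPU-side values `{1, 2, 3}`, the 2-byte-stride model returns `{1, 0, 2}` because it reads the upper half of the first element as the second lane, while the 4-byte-stride model returns `{1, 2, 3}`.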
Fix
Widen min precision types to i32/f32 in both `TranslateBufLoad` and `TranslateBufStore` for `RawBufferVectorLoad/Store`, matching the existing bool widening pattern. Truncate back to the min precision type after the load, and extend to 32-bit before the store.
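The widen-on-store / truncate-on-load round trip can be sketched on the host as follows. This is a minimal model for the signed `min16int` case only, not the actual lowering code in `HLOperationLower.cpp`; `storeWidened` and `loadWidened` are hypothetical names, and little-endian byte order is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the fixed store: sign-extend each 16-bit lane to
// 32 bits, then write 4 bytes per lane (little-endian, assumed).
void storeWidened(std::vector<std::uint8_t> &buffer, std::size_t offset,
                  const std::array<std::int16_t, 3> &lanes) {
  assert(offset + 3 * 4 <= buffer.size());
  for (std::size_t i = 0; i < 3; ++i) {
    const std::int32_t wide = lanes[i];  // sext i16 -> i32
    const std::uint32_t u = static_cast<std::uint32_t>(wide);
    for (std::size_t k = 0; k < 4; ++k)
      buffer[offset + i * 4 + k] =
          static_cast<std::uint8_t>((u >> (8 * k)) & 0xFFu);
  }
}

// Hypothetical model of the fixed load: read 4 bytes per lane, then
// truncate each 32-bit value back to 16 bits.
std::array<std::int16_t, 3> loadWidened(const std::vector<std::uint8_t> &buffer,
                                        std::size_t offset) {
  assert(offset + 3 * 4 <= buffer.size());
  std::array<std::int16_t, 3> lanes{};
  for (std::size_t i = 0; i < 3; ++i) {
    std::uint32_t u = 0;
    for (std::size_t k = 0; k < 4; ++k)
      u |= static_cast<std::uint32_t>(buffer[offset + i * 4 + k]) << (8 * k);
    lanes[i] = static_cast<std::int16_t>(
        static_cast<std::uint16_t>(u & 0xFFFFu));  // trunc i32 -> i16
  }
  return lanes;
}
```

In this model each lane occupies 4 bytes in the buffer, so a vector stored at offset 12 does not overlap the 12 bytes loaded from offset 0, and negative values survive the round trip because the store sign-extends them.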
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>