fix: stack overflow in turbo_init_rotation + TURBO_D static_assert#23
Conversation
…m#23 Codex review found: 1. Stale duplicate code in dequantize_turbo3_0_t4 (compile would fail) 2. thread static is risky/non-portable in MSL Fixed: removed thread static caching, using plain thread locals. Speed unchanged (2.4 tok/s) — the static caching wasn't actually working on Metal. True optimization needs architectural change in flash attention kernel to dequantize once per block, not per chunk. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heTom#23 Root cause analysis: 8-32× redundant full-block dequantize per block from flash attention template. Four approaches documented with expected speedups and risk levels. Plan: D (reduce overhead) → A/B (eliminate redundant calls) Target: 2.4 tok/s → 20-40 tok/s Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…om#23 Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heTom#23 No-op dequant test: even returning all zeros from dequantize, turbo3 runs at 2.4 tok/s (same as with full WHT rotation). The bottleneck is NOT in the attention dequantize path. New hypothesis: the SET_ROWS (quantize) path is the bottleneck. The Metal quantize_turbo3_0 function does 3 WHT rotations per KV write, totaling ~3200 ops per block × 224 blocks per token. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CRITICAL BUG: The #include "turbo-wht.h" caused Metal JIT compilation to fail at runtime. The model silently fell back to CPU for ALL ops. ALL previous benchmarks (2.4 tok/s) were measuring CPU, not Metal GPU. After inlining the header: - MoE gen: 2.4 → 10.7 tok/s (4.5× improvement, now actually on Metal) - MoE prompt: 4.2 → 60.9 tok/s (14.5× improvement) Remaining gap vs q8_0: 85 → 10.7 tok/s (8× slower, down from 35×) This is the SAME bug we hit with turbo-matrices.h earlier. Rule: NEVER use #include in ggml-metal.metal — always inline. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m#23 Previous 2.4 tok/s was CPU fallback. Real Metal numbers: MoE: 10.7 tok/s gen (8× slower than q8_0, was thought to be 35×) Qwopus: 5.3 tok/s gen (3.3× slower than q8_0) Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…in) TheTom#23 Removing WHT rotation from dequant (quality broken, speed test only): gen: 10.7 → 49.1 tok/s (4.6× improvement, 57% of q8_0) prompt: 67.3 → 162.6 tok/s Confirms pre-rotate-queries would deliver ~49 tok/s. Remaining gap (49 vs 85) is block size + QJL overhead. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m#23 Instead of inverse-rotating every K during dequant, rotate Q once before attention. Math: <q, R^T*c[idx]> = <R*q, c[idx]>. Changes: - Store rotation matrix (R^T) in KV cache, filled after buffer clear - Apply ggml_mul_mat(R_T, q) in build_attn_mha after permute - Strip turbo_rotate_inverse from Metal dequant - Dynamic cast to access rotation from mctx Results: - MoE gen: 10.7 → 51.4 tok/s (4.8× speedup) - MoE prompt: 67.3 → 160.3 tok/s (2.4× speedup) - Now at 60% of q8_0 speed with 4.9× compression - Model produces coherent output Codex review: fixed buffer clear ordering (was zeroing rotation after init). Verified: rotation point is correct (after 4d reshape + permute, ne[0]=128). Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heTom#23 Full investigation log documenting every test, every dead end, and every breakthrough. 21× total improvement from CPU fallback to pre-rotate-queries. Key lessons: no #include in Metal, no-op testing, pre-rotate-queries, buffer clear ordering, codex+roast catch real bugs. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validated on real Qwen3 KV tensors: cosine sim 0.9508 → 0.9831 (+3.2%) MSE-only better on 99.3% of vectors including p1 tails. 3-bit index split: lower 2 bits in qs[], upper 1 bit in signs[]. No QJL stage in quantize or dequant. Results: - MoE gen: 51.4 → 62.2 tok/s (73% of q8_0, was 60%) - MoE prompt: 160 → 200 tok/s (90% of q8_0) - Qwopus gen: 14.6 → 15.5 tok/s (88% of q8_0, was 83%) - Qwopus prompt: 67 → 83 tok/s (100% of q8_0!) Codex verified: bit packing correct, quantize/dequant consistent. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. turbo_init_rotation() allocated float G[128*128] (64KB) on the stack then memcpy'd into the static turbo_rotation array. This segfaults on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux). Fix: generate the Gaussian matrix directly into turbo_rotation, eliminating both the stack allocation and the memcpy. 2. TURBO_D and QK_TURBO3_GROUP are defined separately but must always match (both represent the rotation group size). Add static_assert to catch silent divergence between CPU reference and GPU kernels. Fixes: TheTom#29 (remaining items from PR TheTom#18 review) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ed76b24 to
f59394d
Compare
…m#23 Codex review found: 1. Stale duplicate code in dequantize_turbo3_0_t4 (compile would fail) 2. thread static is risky/non-portable in MSL Fixed: removed thread static caching, using plain thread locals. Speed unchanged (2.4 tok/s) — the static caching wasn't actually working on Metal. True optimization needs architectural change in flash attention kernel to dequantize once per block, not per chunk. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heTom#23 Root cause analysis: 8-32× redundant full-block dequantize per block from flash attention template. Four approaches documented with expected speedups and risk levels. Plan: D (reduce overhead) → A/B (eliminate redundant calls) Target: 2.4 tok/s → 20-40 tok/s Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…om#23 Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heTom#23 No-op dequant test: even returning all zeros from dequantize, turbo3 runs at 2.4 tok/s (same as with full WHT rotation). The bottleneck is NOT in the attention dequantize path. New hypothesis: the SET_ROWS (quantize) path is the bottleneck. The Metal quantize_turbo3_0 function does 3 WHT rotations per KV write, totaling ~3200 ops per block × 224 blocks per token. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CRITICAL BUG: The #include "turbo-wht.h" caused Metal JIT compilation to fail at runtime. The model silently fell back to CPU for ALL ops. ALL previous benchmarks (2.4 tok/s) were measuring CPU, not Metal GPU. After inlining the header: - MoE gen: 2.4 → 10.7 tok/s (4.5× improvement, now actually on Metal) - MoE prompt: 4.2 → 60.9 tok/s (14.5× improvement) Remaining gap vs q8_0: 85 → 10.7 tok/s (8× slower, down from 35×) This is the SAME bug we hit with turbo-matrices.h earlier. Rule: NEVER use #include in ggml-metal.metal — always inline. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m#23 Previous 2.4 tok/s was CPU fallback. Real Metal numbers: MoE: 10.7 tok/s gen (8× slower than q8_0, was thought to be 35×) Qwopus: 5.3 tok/s gen (3.3× slower than q8_0) Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…in) TheTom#23 Removing WHT rotation from dequant (quality broken, speed test only): gen: 10.7 → 49.1 tok/s (4.6× improvement, 57% of q8_0) prompt: 67.3 → 162.6 tok/s Confirms pre-rotate-queries would deliver ~49 tok/s. Remaining gap (49 vs 85) is block size + QJL overhead. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m#23 Instead of inverse-rotating every K during dequant, rotate Q once before attention. Math: <q, R^T*c[idx]> = <R*q, c[idx]>. Changes: - Store rotation matrix (R^T) in KV cache, filled after buffer clear - Apply ggml_mul_mat(R_T, q) in build_attn_mha after permute - Strip turbo_rotate_inverse from Metal dequant - Dynamic cast to access rotation from mctx Results: - MoE gen: 10.7 → 51.4 tok/s (4.8× speedup) - MoE prompt: 67.3 → 160.3 tok/s (2.4× speedup) - Now at 60% of q8_0 speed with 4.9× compression - Model produces coherent output Codex review: fixed buffer clear ordering (was zeroing rotation after init). Verified: rotation point is correct (after 4d reshape + permute, ne[0]=128). Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heTom#23 Full investigation log documenting every test, every dead end, and every breakthrough. 21× total improvement from CPU fallback to pre-rotate-queries. Key lessons: no #include in Metal, no-op testing, pre-rotate-queries, buffer clear ordering, codex+roast catch real bugs. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validated on real Qwen3 KV tensors: cosine sim 0.9508 → 0.9831 (+3.2%) MSE-only better on 99.3% of vectors including p1 tails. 3-bit index split: lower 2 bits in qs[], upper 1 bit in signs[]. No QJL stage in quantize or dequant. Results: - MoE gen: 51.4 → 62.2 tok/s (73% of q8_0, was 60%) - MoE prompt: 160 → 200 tok/s (90% of q8_0) - Qwopus gen: 14.6 → 15.5 tok/s (88% of q8_0, was 83%) - Qwopus prompt: 67 → 83 tok/s (100% of q8_0!) Codex verified: bit packing correct, quantize/dequant consistent. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Hey @TheTom — gentle ping on this one. Just the stack overflow fix (64KB on worker thread stack) + static_assert guard. Rebased clean on current HEAD (post block-128, post signalnine merge). Happy to adjust anything if needed. |
|
Thank you for the ping. Will test this morning (cst) before noon. |
|
Metal sanity check passed. Built and tested on M2 Pro (macOS, Metal backend). Build: clean, no warnings. PPL test (Qwen2.5-1.5B Q4_K_M, q8_0-K/turbo4-V, 512 context, 4 chunks):
Identical. Both changes are CPU-only (ggml-turbo-quant.c), no CUDA/Metal kernel impact. Also just pushed a fix for the CI failure you were seeing (unused CENTROIDS_1BIT constant from a previous experiment, unrelated to your PR). Thanks for the stack overflow fix, that's a real bug that would bite on threads with small stacks. Merging. |
8ad0f00
into
TheTom:feature/turboquant-kv-cache
Codex review found: 1. Stale duplicate code in dequantize_turbo3_0_t4 (compile would fail) 2. thread static is risky/non-portable in MSL Fixed: removed thread static caching, using plain thread locals. Speed unchanged (2.4 tok/s) — the static caching wasn't actually working on Metal. True optimization needs architectural change in flash attention kernel to dequantize once per block, not per chunk. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause analysis: 8-32× redundant full-block dequantize per block from flash attention template. Four approaches documented with expected speedups and risk levels. Plan: D (reduce overhead) → A/B (eliminate redundant calls) Target: 2.4 tok/s → 20-40 tok/s Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…23 No-op dequant test: even returning all zeros from dequantize, turbo3 runs at 2.4 tok/s (same as with full WHT rotation). The bottleneck is NOT in the attention dequantize path. New hypothesis: the SET_ROWS (quantize) path is the bottleneck. The Metal quantize_turbo3_0 function does 3 WHT rotations per KV write, totaling ~3200 ops per block × 224 blocks per token. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CRITICAL BUG: The #include "turbo-wht.h" caused Metal JIT compilation to fail at runtime. The model silently fell back to CPU for ALL ops. ALL previous benchmarks (2.4 tok/s) were measuring CPU, not Metal GPU. After inlining the header: - MoE gen: 2.4 → 10.7 tok/s (4.5× improvement, now actually on Metal) - MoE prompt: 4.2 → 60.9 tok/s (14.5× improvement) Remaining gap vs q8_0: 85 → 10.7 tok/s (8× slower, down from 35×) This is the SAME bug we hit with turbo-matrices.h earlier. Rule: NEVER use #include in ggml-metal.metal — always inline. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous 2.4 tok/s was CPU fallback. Real Metal numbers: MoE: 10.7 tok/s gen (8× slower than q8_0, was thought to be 35×) Qwopus: 5.3 tok/s gen (3.3× slower than q8_0) Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full investigation log with all tests, results, and the root cause. Upstream TurboQuant activity tracked in #27. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…in) #23 Removing WHT rotation from dequant (quality broken, speed test only): gen: 10.7 → 49.1 tok/s (4.6× improvement, 57% of q8_0) prompt: 67.3 → 162.6 tok/s Confirms pre-rotate-queries would deliver ~49 tok/s. Remaining gap (49 vs 85) is block size + QJL overhead. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of inverse-rotating every K during dequant, rotate Q once before attention. Math: <q, R^T*c[idx]> = <R*q, c[idx]>. Changes: - Store rotation matrix (R^T) in KV cache, filled after buffer clear - Apply ggml_mul_mat(R_T, q) in build_attn_mha after permute - Strip turbo_rotate_inverse from Metal dequant - Dynamic cast to access rotation from mctx Results: - MoE gen: 10.7 → 51.4 tok/s (4.8× speedup) - MoE prompt: 67.3 → 160.3 tok/s (2.4× speedup) - Now at 60% of q8_0 speed with 4.9× compression - Model produces coherent output Codex review: fixed buffer clear ordering (was zeroing rotation after init). Verified: rotation point is correct (after 4d reshape + permute, ne[0]=128). Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…23 Full investigation log documenting every test, every dead end, and every breakthrough. 21× total improvement from CPU fallback to pre-rotate-queries. Key lessons: no #include in Metal, no-op testing, pre-rotate-queries, buffer clear ordering, codex+roast catch real bugs. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validated on real Qwen3 KV tensors: cosine sim 0.9508 → 0.9831 (+3.2%) MSE-only better on 99.3% of vectors including p1 tails. 3-bit index split: lower 2 bits in qs[], upper 1 bit in signs[]. No QJL stage in quantize or dequant. Results: - MoE gen: 51.4 → 62.2 tok/s (73% of q8_0, was 60%) - MoE prompt: 160 → 200 tok/s (90% of q8_0) - Qwopus gen: 14.6 → 15.5 tok/s (88% of q8_0, was 83%) - Qwopus prompt: 67 → 83 tok/s (100% of q8_0!) Codex verified: bit packing correct, quantize/dequant consistent. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codex review found: 1. Stale duplicate code in dequantize_turbo3_0_t4 (compile would fail) 2. thread static is risky/non-portable in MSL Fixed: removed thread static caching, using plain thread locals. Speed unchanged (2.4 tok/s) — the static caching wasn't actually working on Metal. True optimization needs architectural change in flash attention kernel to dequantize once per block, not per chunk. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause analysis: 8-32× redundant full-block dequantize per block from flash attention template. Four approaches documented with expected speedups and risk levels. Plan: D (reduce overhead) → A/B (eliminate redundant calls) Target: 2.4 tok/s → 20-40 tok/s Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…23 No-op dequant test: even returning all zeros from dequantize, turbo3 runs at 2.4 tok/s (same as with full WHT rotation). The bottleneck is NOT in the attention dequantize path. New hypothesis: the SET_ROWS (quantize) path is the bottleneck. The Metal quantize_turbo3_0 function does 3 WHT rotations per KV write, totaling ~3200 ops per block × 224 blocks per token. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CRITICAL BUG: The #include "turbo-wht.h" caused Metal JIT compilation to fail at runtime. The model silently fell back to CPU for ALL ops. ALL previous benchmarks (2.4 tok/s) were measuring CPU, not Metal GPU. After inlining the header: - MoE gen: 2.4 → 10.7 tok/s (4.5× improvement, now actually on Metal) - MoE prompt: 4.2 → 60.9 tok/s (14.5× improvement) Remaining gap vs q8_0: 85 → 10.7 tok/s (8× slower, down from 35×) This is the SAME bug we hit with turbo-matrices.h earlier. Rule: NEVER use #include in ggml-metal.metal — always inline. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous 2.4 tok/s was CPU fallback. Real Metal numbers: MoE: 10.7 tok/s gen (8× slower than q8_0, was thought to be 35×) Qwopus: 5.3 tok/s gen (3.3× slower than q8_0) Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full investigation log with all tests, results, and the root cause. Upstream TurboQuant activity tracked in #27. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…in) #23 Removing WHT rotation from dequant (quality broken, speed test only): gen: 10.7 → 49.1 tok/s (4.6× improvement, 57% of q8_0) prompt: 67.3 → 162.6 tok/s Confirms pre-rotate-queries would deliver ~49 tok/s. Remaining gap (49 vs 85) is block size + QJL overhead. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of inverse-rotating every K during dequant, rotate Q once before attention. Math: <q, R^T*c[idx]> = <R*q, c[idx]>. Changes: - Store rotation matrix (R^T) in KV cache, filled after buffer clear - Apply ggml_mul_mat(R_T, q) in build_attn_mha after permute - Strip turbo_rotate_inverse from Metal dequant - Dynamic cast to access rotation from mctx Results: - MoE gen: 10.7 → 51.4 tok/s (4.8× speedup) - MoE prompt: 67.3 → 160.3 tok/s (2.4× speedup) - Now at 60% of q8_0 speed with 4.9× compression - Model produces coherent output Codex review: fixed buffer clear ordering (was zeroing rotation after init). Verified: rotation point is correct (after 4d reshape + permute, ne[0]=128). Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…23 Full investigation log documenting every test, every dead end, and every breakthrough. 21× total improvement from CPU fallback to pre-rotate-queries. Key lessons: no #include in Metal, no-op testing, pre-rotate-queries, buffer clear ordering, codex+roast catch real bugs. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validated on real Qwen3 KV tensors: cosine sim 0.9508 → 0.9831 (+3.2%) MSE-only better on 99.3% of vectors including p1 tails. 3-bit index split: lower 2 bits in qs[], upper 1 bit in signs[]. No QJL stage in quantize or dequant. Results: - MoE gen: 51.4 → 62.2 tok/s (73% of q8_0, was 60%) - MoE prompt: 160 → 200 tok/s (90% of q8_0) - Qwopus gen: 14.6 → 15.5 tok/s (88% of q8_0, was 83%) - Qwopus prompt: 67 → 83 tok/s (100% of q8_0!) Codex verified: bit packing correct, quantize/dequant consistent. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Minimal resubmission per PR #18 review — only the two fixes still needed against current HEAD (
172fc85):Stack overflow fix:
turbo_init_rotation()allocatesfloat G[128*128](64KB) on the stack atggml-turbo-quant.c:71, thenmemcpys into the staticturbo_rotation. This segfaults on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux). Fix generates the Gaussian matrix directly intoturbo_rotation, eliminating both the stack allocation and thememcpy.static_assertguard:TURBO_DandQK_TURBO3_GROUPare defined separately but must always match (both = rotation group size 128). Adds compile-time check to catch silent divergence between CPU reference path and GPU kernels.What changed from PR #18
Dropped fixes #2 (non-128-aligned head dims) and #4 (WHT dispatch assert) — both resolved by signalnine's PR #3 merge. This is the minimal diff TheTom requested.
Test plan
gcc -std=c11 -D_GNU_SOURCE