
Update wrapper to remove device arg #9

Merged
MasterJH5574 merged 1 commit into main from wrapper-device on Oct 30, 2023

Conversation

@MasterJH5574 (Collaborator)

No description provided.

MasterJH5574 merged commit 588d4a8 into main on Oct 30, 2023
MasterJH5574 deleted the wrapper-device branch on October 31, 2023 at 20:46
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…w follow-up flashinfer-ai#9

Bench now supports --ep N for single-GPU EP simulation (commit
da2f0e3). Full sweep at EP=1 / 8 / 16 across all 15 token counts
produced 45 (size, EP) data points — all pass parity with max_abs
≤ 0.0625.

EP=1 results reproduce the audit's headline post-fix verification
(+4.1% at N=16384, vs +3.5% in the 2026-04-25 table — within run-
to-run noise). EP=8 and EP=16 add new coverage of the deployment-
realistic configs that DeepSeek-V3 actually runs under.

Audit changes:

- New "EP=8 / EP=16 single-GPU sweep (2026-04-28)" section
  immediately after the ground-truth nsys verification. Documents
  the bench plumbing changes (Mapping construction with
  tp_size=ep, moe_tp_size=1, moe_ep_size=ep + the dual-binding
  monkey-patch on can_access_peer needed because plugin.py's
  imported binding is unaffected by patching _ipc_utils alone).
  Includes the full 45-cell (size, EP) Δ% table.

- Three observations from the data:
  (1) Small-batch Δ% explodes at EP>1 due to the fixed-overhead-
      fraction effect — not actionable.
  (2) Large-batch (N=16384) Δ% stays modest across EP values
      (+4.1, +7.3, +7.9% at EP=1, 8, 16).
  (3) Per-kernel gap widens in flashinfer's disfavor at smaller
      per-rank expert count: gemm1+gemm2 sum at N=16384 goes
      -2.4% (fi faster, EP=1) → +8.6% (EP=8) → +13.7% (EP=16).
      Both kernels' compiled PTX is byte-identical between sides
      (proven 2026-04-24), so this is tactic-selection or
      wrapper-overhead, not kernel-binary divergence.

- Follow-up flashinfer-ai#7 marked RESOLVED.

- New follow-up flashinfer-ai#9 added: "Investigate why flashinfer's per-kernel
  times grow faster than TRT-LLM's at smaller per-rank expert
  count (EP>1)." Three plausible mechanisms with concrete probe
  steps. Not blocking — EP=1 port-parity remains the load-bearing
  finding.
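
A minimal sketch of the bench plumbing described in the first
bullet of the list above (module paths, world_size, and the
patched return value are assumptions, not the bench's actual
code):

```python
# Hedged sketch of the single-GPU EP-simulation plumbing; the
# Mapping kwargs come from the commit message, the rest is assumed.
from tensorrt_llm.mapping import Mapping

ep = 8
mapping = Mapping(world_size=ep, tp_size=ep,  # world_size=ep assumed
                  moe_tp_size=1, moe_ep_size=ep)

# Dual-binding monkey-patch: plugin.py keeps its own imported
# reference to can_access_peer, so patching _ipc_utils alone leaves
# that binding live.
import tensorrt_llm._ipc_utils as _ipc_utils
import tensorrt_llm.plugin.plugin as _plugin

def _patched_can_access_peer(*args, **kwargs):
    return False  # assumed single-GPU (no peer access) behavior

_ipc_utils.can_access_peer = _patched_can_access_peer
_plugin.can_access_peer = _patched_can_access_peer
```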

Top-of-file correction title and open-follow-ups summary updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…indings

Captured both autotuners' selected tactics at N=16384 across EP=1,
8, 16 via TLLM_LOG_LEVEL=DEBUG (TRT-LLM side) plus the bench's own
_audit_selected_tactics helper (flashinfer side). Confirmed the
9a-class divergence at EP=16, falsified the simple-fix version.

Key findings:

- EP=1: fi and trt both pick tile_size=256 / 2-CTA at the largest
  shape bucket. gemm1 tactics agree exactly; gemm2 tactics differ
  slightly but both are 2-CTA. Consistent with the observed
  -2.4% (fi faster) kernel-level Δ%.

- EP=16: clear tactic divergence. Flashinfer picks tile_size=128
  for ALL 14 cache entries (no tile_size=256 at any shape).
  TRT-LLM picks tile_size=256 ((256, 256), (2, 1), False) at the
  largest shape bucket. Different MMA tile / cluster / throughput
  — exactly explains the +13.7% kernel-level gap.

- EP=8: not pure tactic divergence. fi and trt pick essentially
  identical tactics (tile_size=256, gemm1=(256,256) 2-CTA,
  gemm2=(256,256) 2-CTA) at the largest shape. Yet kernel-level
  Δ% is +8.6% — must be from another source (shape-bucket
  rounding, per-call overhead, or measurement noise).

The EP=16 finding does NOT have a one-line fix like flashinfer-ai#3067 did.
Both sides have equivalent tactic enumerations (post-fix); both
autotuners profile them; they just rank the profile-time outcomes
differently. Closing this needs deeper engineering — verify
profile-methodology equivalence (9a-i), then shape-bucket rounding
diff (9a-ii), then deeper investigation if those hold.

flashinfer-ai#9 entry expanded with three sub-mechanisms (9a-i, 9a-ii, 9a-iii)
that decompose the original 9a hypothesis, plus 9b (per-call
overhead at smaller per-rank work) and 9c (load-balancing). The
recommendation has been refined: not blocking; the gap is real,
but the production-relevance-vs-effort tradeoff makes it a
"future work" item.

Closes the EP investigation for this audit cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…red-and-skipped

After 2026-04-24 / 25 / 28 we have five independent fi-vs-trt
agreement proofs:
  1. source byte-identical with rc5.post2 (deep audit)
  2. compiled PTX byte-identical at tile_size=256 (md5 401ebca6...)
  3. per-call timing match within 0.1% at apples-to-apples tactic
  4. 45 (size, EP) parity cells all pass within 0.5 FP4 step
  5. nsys ground-truth verification with kernel mangled-name
     structure agreement

The probability of fi+trt being wrong-but-agreeing across all five
is essentially zero, so a PyTorch FP4 third-reference check no
longer provides meaningful incremental confidence.

Additionally, `compute_reference_moe_fp4` has known limitations:
its PyTorch-eager FP4 simulation is stricter than the kernel's
actual FP4 representation, which made it ambiguous to interpret
during the original flashinfer-ai#3067 framing. Disagreement between bench
output and the reference would not unambiguously indicate a kernel
bug.

Cost is also non-trivial: the Python-eager per-token / per-expert
loops would have to be restricted to a small subset of problem
sizes to keep wall-clock time bearable.

Cost-to-incremental-confidence ratio is bad enough that this
follow-up is consciously skipped, not deferred. Future evidence
of a fi-vs-trt agreement that's actually wrong would re-elevate
it; otherwise no value.

Effective remaining open follow-ups: flashinfer-ai#3 (production-convention
scaling), flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity rerun), flashinfer-ai#9 (EP=16 tactic-
divergence root cause).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…d-and-skipped

flashinfer-ai#3 is a real coverage gap (at gs=1.0 the scale-conversion code paths
run with degenerate values, so a divergent FP8_MAX or scale-convention
mismatch between the two sides would silently produce identical
output here and divergent output at non-trivial scales). Skipped
because:

- Both sides have their own internal scale-plumbing tests with
  non-trivial scales. TRT-LLM's
  tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py and
  flashinfer's tests/moe/test_cute_dsl_fused_moe.py both exercise
  non-trivial-gs configurations against PyTorch references. The
  bench wouldn't add coverage their CI doesn't already have.

- A failure here wouldn't tell us what we want to know. The most
  likely cause of fi-vs-trt parity divergence at production scales
  would be a scale-convention disagreement BETWEEN the two
  implementations — not fixable in flashinfer (TRT-LLM is upstream),
  and dramatically out of scope.

- Risk of getting nerd-sniped on bench-side mistakes. Scale plumbing
  has many surfaces (alpha, weight_scale_2, input_scale, is_sf_*
  flags, fp4_quantize signatures); any tiny bench-side mismatch
  produces a parity failure that looks like a port bug. That's the
  same failure pattern as flashinfer-ai#4 / flashinfer-ai#6 / flashinfer-ai#8 from earlier in the audit.

Scope-difference note: the audit's load-bearing question is "is the
kernel port faithful?" — closed via byte-identical PTX + matching
timing + 45 parity cells. flashinfer-ai#3 is about "do flashinfer and TRT-LLM
agree on NVFP4 scaling conventions?" — a separate question, real
but tangential to port-faithfulness.

One scenario that would re-elevate this: a planned production ship
of CuteDslMoEWrapper to a caller using non-trivial scales, where
the team wants one independent cross-check (against CuteDslFusedMoE
under the same scales) before merging. In that scenario flashinfer-ai#3 is
exactly the right pre-ship sanity. For closing out the investigation
audit, it's tangential.

Effective remaining open follow-ups: flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity
rerun) and flashinfer-ai#9 (EP=16 tactic-divergence root cause).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
flashinfer-ai#9

Earlier drafts of the EP=8/16 sweep section and follow-up flashinfer-ai#9 entry
characterized the "main-pass total minus mapped-kernel sum" residual
as "CPU/orchestration overhead" or "Python wrapper overhead per
gemm call." That framing is structurally wrong under CUDA graphs.

The bench captures the full forward pass into a CUDA graph during
warmup, then replays via cudaGraphLaunch. During replay no Python
runs between kernels — Python overhead is paid once at graph
capture time and amortized over thousands of replay iterations. The
CUDA-event main-pass measurements are GPU stream-time, not CPU time.

The actual residual is GPU-side: small unmapped kernels (e.g.,
moe_sort_py_fills, output_bf16_zero, FillFunctor variants),
inter-kernel GPU-idle gaps, and stream-coordination effects.
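
As a reference point, the capture-then-replay pattern described
above looks like this in PyTorch (a minimal sketch; forward() and
the shapes are placeholders, not the bench's code):

```python
import torch

weight = torch.randn(7168, 7168, device="cuda")
static_input = torch.randn(16384, 7168, device="cuda")

def forward(x):
    # Stand-in for the MoE forward pass.
    return x @ weight

# Warm up on a side stream, then capture the forward pass once.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        out = forward(static_input)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out = forward(static_input)  # Python runs here once, at capture

# Replay: no Python between kernels. CUDA events bracket GPU
# stream-time, not CPU time.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    g.replay()
end.record()
torch.cuda.synchronize()
print(end.elapsed_time(start) / 100, "ms per replay")
```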

Corrections:

- "Three observations" finding 1 (small-batch Δ% explosion):
  reframed from "fixed CPU overhead" to "fixed-cost work in the
  captured graph" with an explicit note that this is GPU stream-
  time under CUDA graphs.

- "Three observations" finding 4 (between-kernel time): retitled
  from "CPU/orchestration overhead" to "GPU-side between-kernel
  time"; explicit note that this comprises small unmapped kernels
  and GPU-side gaps, not CPU/Python overhead.

- Follow-up flashinfer-ai#9 entry: 9b ("Per-call wrapper overhead in flashinfer
  scales with call rate") marked DROPPED 2026-04-28 with reasoning.
  Plausible remaining mechanisms for the EP=8 +8.6% gap at agreed
  tactics narrowed to GPU-side inter-kernel gap differences or
  measurement noise.

Surfaced by user inspection of CUDA-events semantics (CUDA events
measure stream-time, not CPU time, especially under graph replay).
The earlier loose framing was a real-but-non-load-bearing error —
the load-bearing audit findings (port faithfulness, parity, post-
fix verification) are unaffected because they don't depend on
attributing the residual to CPU vs GPU.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…flashinfer-ai#9 remains

Three close-out edits to wrap the CuteDSL MoE FP4 port audit:

- Mark flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity rerun) considered-and-skipped,
  matching the closure pattern used for flashinfer-ai#2 and flashinfer-ai#3. Three reasons:
  (a) port-parity claims are unaffected by DSL-compiler version since
  both sides use the same in-container compiler, (b) flashinfer and
  TRT-LLM each have CI testing 4.4.2 already, (c) install hassle plus
  unsupported-config risk produces the same ambiguous-failure pattern
  that cost time on flashinfer-ai#4/flashinfer-ai#6/flashinfer-ai#8. Auto-resolves whenever NGC bumps the
  image. Install recipe preserved for future absolute-latency probes.

- Add a "Version-skew caveat (2026-04-28)" subsection to the
  top-of-file correction. The bench compares flashinfer-with-post-rc5-
  forward-ports (bb2f88329, 6b8ae6fa8, fae498579) vs TRT-LLM-rc5.post2-
  without-them, so the +3.5% / +4.1% headlines at EP=1 N=16384 may
  partly reflect the version asymmetry. Load-bearing claims (port
  faithfulness via byte-identical source + PTX, 45/45 parity, flashinfer-ai#3067
  fix) are unaffected because they do not depend on absolute deltas.
  Naturally re-baselines when NGC publishes a 1.3.x.x image absorbing
  the post-rc5 commits.

- Update the "Open follow-ups remaining" summary: flashinfer-ai#5 added to the
  considered-and-skipped list alongside flashinfer-ai#2/flashinfer-ai#3, leaving flashinfer-ai#9 (EP=16
  tactic-divergence root cause) as the only effective open follow-up.
  Audit declared closed 2026-04-28.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…realloc bias

Post-close finding (2026-04-28). The +8.6% kernel-level gap at
EP=8 N=16384 — previously speculated to be "shape-bucket rounding,
per-call overhead, or measurement noise" — is dominantly a real
autotune-correctness bug in CuteDslMoEWrapper.

Root cause: wrapper pre-allocates _gemm1_output, _gemm1_output_scale,
and _moe_sort_buffers["out_permuted_idx_to_expanded_idx"] sized for
self.tile_size at construction. The use_prealloc gate at fused_moe.py:495
honors prealloc only when tactic.tile_size == self.tile_size. During
autotune profiling, the matching tactic gets pre-allocated buffers
"for free" while the other tactic falls through to gemm1_out=None and
pays per-call dynamic allocation. Autotuner sees the latter as
artificially slower → biased tactic selection.
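
A condensed sketch of the gate as described (alloc_for and the
class skeleton are illustrative stand-ins, not the actual
fused_moe.py code):

```python
import torch

VALID_TILE_SIZES = (128, 256)

def alloc_for(tile_size, max_tokens=16384, hidden=7168):
    # Hypothetical stand-in for the wrapper's buffer allocation.
    return torch.empty(max_tokens, hidden, device="cuda")

class WrapperSketch:
    def __init__(self, tile_size=128):
        self.tile_size = tile_size
        # Pre-fix: buffers sized for this one tile_size only.
        self._gemm1_output = alloc_for(tile_size)

    def run_gemm1(self, tactic):
        # Pre-fix gate: only the matching tactic gets the "free"
        # preallocated buffer; the other falls through to
        # gemm1_out=None and pays per-call allocation during
        # autotune profiling, so the autotuner ranks it slower.
        use_prealloc = tactic.tile_size == self.tile_size
        gemm1_out = self._gemm1_output if use_prealloc else None
        ...

# Post-fix idea: size the prealloc for the full canonical
# enumeration, e.g. alloc_for(max(VALID_TILE_SIZES)), so every
# profiled tactic gets identical buffer treatment.
```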

Empirical evidence (N=16384 EP=8, single-iter):
- Pre-fix (default tile_size=128 wrapper): fi all 14 cache entries
  pick tile=128; +15.3% Δ% with bias-flipped wrapper sized for tile=256.
- Post-fix (VALID_TILE_SIZES + buffers sized for max/min of range):
  fi at largest bucket picks tile_size=256 / 2-CTA matching trt;
  Δ% drops to +4.1%.

Architectural deviation: TRT-LLM's Sm100BlockScaledContiguousGather*Runner
allocates output buffers per call inside forward() — no prealloc, no
caching, no autotune bias. flashinfer added the prealloc as a CUDA-graph
optimization but keyed it to a single tile_size. The fix preserves the
optimization while generalizing it across the canonical tile_size
enumeration (VALID_TILE_SIZES = (128, 256), monotonic-safe for any
future additions).

Status: wrapper-side fix prepared but NOT yet upstream — held pending
merge of flashinfer-ai#3067 (PR branch fix-cutedsl-moe-gemm2-tactic-enum-tile256).
Lands as a separate follow-up PR.

Additional findings captured:
- +1 tile bucketing-formula divergence between fi and trt's
  get_max_num_tiles (small ~0.18% effect, separate from the prealloc
  bias, lives in _moe_core_impl, shared by both wrapper and functional
  APIs).
- Bench-side --dump-kernel-shapes diagnostic added for runtime kernel-
  call shape comparison; one subtlety: must NOT call .item() inside
  CUDA-graph capture (cudaErrorStreamCaptureInvalidated).
- Architectural observation: wrapper-vs-functional API. CuteDslMoEWrapper
  has accumulated multiple wrapper-specific bugs across the audit
  (autotune-context, prealloc bias, now-inert tile_size arg).
  cute_dsl_fused_moe_nvfp4 (functional) has not. Recommended medium-term
  cleanup: refactor wrapper as thin stateful adapter delegating to
  functional API. Out of scope for this audit cycle.

EP=16 +13.7% sub-question remains separately open. The prealloc bias
affects EP=16 too but plausibly does not fully account for the gap;
pending validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…ape comparison

Adds a `--dump-kernel-shapes` flag to bench_cute_dsl_port_parity.py
that monkey-patches the gemm1 and gemm2 kernel-call sites on both
the flashinfer side (blockscaled_contiguous_*_nvfp4 functions in
flashinfer.fused_moe.cute_dsl.fused_moe) and the TRT-LLM side
(Sm100BlockScaledContiguous*Runner.forward methods). Prints unique
shape signatures with a monotonic call counter so autotune-profiling
shapes AND inference shapes are both captured in a single run.

Used for audit follow-up flashinfer-ai#9 EP=8 investigation (commit dde2723
addendum) — surfaced the CuteDslMoEWrapper prealloc-bias bug by
revealing that fi was systematically picking tile_size=128 even
after the flashinfer-ai#3067 fix. Implementation is _install_kernel_shape_dumpers
gated on the flag.

Subtlety captured in the implementation comment: must NOT call
.item() on tensor arguments inside the dumper. .item() is a CUDA
host sync, illegal inside torch.cuda.graph() capture, and produces
cudaErrorStreamCaptureInvalidated on the next kernel launch.
Earlier draft of the dumper had .item() in _fmt_scalar via
"hasattr(t, 'item') and t.numel() == 1: return int(t.item())";
removed in favor of always emitting tensor shape only.
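
A minimal sketch of the dumper pattern (the wrapper structure is
assumed; the real _install_kernel_shape_dumpers patches the call
sites named above and prints unique signatures):

```python
import functools
import itertools

_calls = itertools.count()

def _fmt_arg(t):
    # Emit shapes only. Never call t.item() here: it is a CUDA host
    # sync, illegal inside torch.cuda.graph() capture, and produces
    # cudaErrorStreamCaptureInvalidated on the next kernel launch.
    return tuple(t.shape) if hasattr(t, "shape") else type(t).__name__

def _dump_shapes(fn, tag):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print(next(_calls), tag, [_fmt_arg(a) for a in args])
        return fn(*args, **kwargs)
    return wrapper

# Usage shape (function name illustrative, module path from above):
# import flashinfer.fused_moe.cute_dsl.fused_moe as m
# m.blockscaled_contiguous_gemm1_nvfp4 = _dump_shapes(
#     m.blockscaled_contiguous_gemm1_nvfp4, "fi.gemm1")
```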

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…shinfer-ai#3067 PR)"

This reverts commit a4137e67ca52ed4b6d8b59f2c2d5a3da1b53a497.

Strategic call (2026-04-28): pull the wrapper prealloc-bias fix off
the audit branch. Two reasons:

1. **Don't stack ahead of main.** The audit branch already carries
   the flashinfer-ai#3067 preview at 074c185. Stacking another substantial
   wrapper change (this commit) on top, before flashinfer-ai#3067 has merged,
   creates a real PR-dependency chain. Reviewers of either PR would
   need context on both, and rebase cascades are likely. Keep the
   audit branch lean and the prealloc fix in its own slot until
   flashinfer-ai#3067 actually lands.

2. **The thin-adapter refactor is the better medium-term fix.**
   The audit's architectural recommendation (see addendum section
   *flashinfer-ai#9 EP=8 +8.6% gap — root cause identified 2026-04-28*, subsection
   *Architectural observation: wrapper vs functional API*) is to
   refactor CuteDslMoEWrapper as a thin stateful adapter holding
   only stream/event references and delegating ALL buffer allocation
   to the functional API. That refactor would delete _allocate_buffers
   entirely, along with self.tile_size, self.max_num_tokens, and
   the use_prealloc check — the bug class disappears structurally
   because there is no prealloc. The careful work in this commit
   patches a symptom in code we may delete; if the refactor pans
   out, this fix evaporates.
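
For orientation, the rough shape that refactor implies (a sketch
under the audit's recommendation; the constructor signature and
delegation details are assumptions):

```python
# Sketch only. cute_dsl_fused_moe_nvfp4 is the functional API named
# in the audit; the import path follows the module named earlier in
# this thread.
from flashinfer.fused_moe.cute_dsl.fused_moe import (
    cute_dsl_fused_moe_nvfp4,
)

class ThinCuteDslMoEAdapter:
    """Holds only stream/event references; owns no buffers."""

    def __init__(self, stream=None):
        self.stream = stream  # no tile_size, no _allocate_buffers

    def forward(self, *args, **kwargs):
        # All buffer allocation happens inside the functional API,
        # so the prealloc-bias bug class cannot recur.
        return cute_dsl_fused_moe_nvfp4(*args, **kwargs)
```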

The fix is preserved on branch `cute-dsl-moe-wrapper-prealloc-bias-fix`
for resurrection if the refactor turns out impractical. Patch file
also at /tmp/claude/cute_dsl_prealloc_fix.patch (Mac).

Audit doc and memories updated to reflect this strategic decision.
The 100-iter validation finding (+8.2% post-fix gap, see audit
addendum) still stands — it confirms the prealloc bias is real but
not sufficient on its own to close flashinfer-ai#9 EP=8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
… bucket-cap fixes

Updates the post-close addendum and the flashinfer-ai#9 sub-section to reflect the
2026-04-28 root-cause work. The audit's headline +7-9% EP perf gaps at
N=16384 (EP=8 +8.6% kernel, EP=16 +13.7% kernel) are driven by two
paired CuteDSL MoE autotune bugs:

1. CuteDslMoEWrapper prealloc-bias (correctness): wrapper prealloc'd
   buffers sized for self.tile_size only, biasing autotune profiling
   against the other tile_size. Without the fix, fi is locked to
   tile_size=128 in 14/14 cache entries. Side branch
   cute-dsl-moe-wrapper-prealloc-bias-fix at commit a4137e6.
2. Autotuner bucket-cap mismatch (headline-perf): tuner.py:281 caps
   profile buckets at 8192 while runtime can be N=16384, so fi
   profiles at half the per-expert workload, and autotune noise on
   the tight margins there (0.58% at EP=8, bucket=8192) destabilizes
   the cached choice. 1-line fix: bump the cap to 16384.
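
A toy illustration of the cap's effect (the rounding scheme is an
assumption; the actual logic lives around tuner.py:281):

```python
def profile_bucket(n_tokens: int, cap: int) -> int:
    # Assumed scheme: round up to a power of two, then clamp.
    bucket = 1 << (n_tokens - 1).bit_length()
    return min(bucket, cap)

print(profile_bucket(16384, cap=8192))   # 8192: half the workload
print(profile_bucket(16384, cap=16384))  # 16384: matches runtime
```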

Validation at --num-iters 100 --warmup 10 (3 runs each; cells are
mean Δ% / spread):

  no fixes:           EP=8 +5.5%/9.5pp  EP=16 +1.8%/9.7pp
  prealloc-fix only:  EP=8 +4.4%/11pp   EP=16 +2.4%/10.5pp
  both fixes:         EP=8 -0.8%/1.1pp  EP=16 -5.6%/10pp

Prealloc-fix alone does not close the headline gap. The bucket-cap-fix
is what closes it. Both fixes are required.

The EP=16 10pp spread under both-fixes is trt's autotune coin-flip
on its 0.08% tile=128 vs tile=256 profile-time margin -- fi's chosen
tactic is stable across all 3 runs; trt's is the noise source.

Several earlier 2026-04-28 revisions to this addendum had wrong
framings (EP=16 fully closed by prealloc-fix alone at --num-iters 1
was a lucky 2-run sample; necessary-but-not-sufficient at EP=8 was
partly true at --num-iters 1 but the residual was the bucket-cap
mismatch, not an unfixable deeper issue). The final state captured
here supersedes all earlier revisions in this same audit doc.

Neither fix is upstream as of 2026-04-28 -- held pending flashinfer-ai#3067
(fix-cutedsl-moe-gemm2-tactic-enum-tile256) and flashinfer-ai#3198
(get_max_num_tiles off-by-one) merging first. Thin-adapter refactor
remains the recommended medium-term direction; both short-term fixes
patch symptoms in code the refactor would delete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
Adds a "Final cross-N validation" sub-section under flashinfer-ai#9 capturing the
3-point baseline-vs-both-fixes sweep at N ∈ {4096, 8192, 16384} ×
EP ∈ {8, 16} × 3 runs at --num-iters 100. All 6 (EP, N) cells with
both fixes applied are at fi-parity or fi-faster; no regression vs
baseline. The N=16384 closure is reproduced (EP=8: +8.3% → -0.5%;
EP=16: +5.2% → -2.8%).

Notable observation: EP=16 N=4096 fi-advantage shrinks from -3.5%
to -0.6% with the fixes. fi still wins, just by less. The fixes
trade a small accidental advantage at small N (where tile=128 was
locked-in correct due to the prealloc bias) for a substantial
closure at N=16384.

This is the closing empirical validation of the paired-fix story.
The audit's perf-investigation thread is now empirically settled;
remaining audit-cycle work is the cleanup follow-ups (PR opening,
test additions, refactor scoping) tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…uns at ±3% noise floor)

Promotes the +13.7% kernel-level closure from "closed by inference"
to "closed by direct measurement". Adds a sub-section under flashinfer-ai#9
capturing the 3-run per-kernel measurement at EP=16 N=16384
--num-iters 100 with both fixes applied:

  Run 1 (trt picked tile=128, slower kernel):
    gemm1 Δ% -17.0%, gemm2 Δ% -18.0%, kernel-sum Δ% -17.4%, full-pass -7.8%
  Run 2 (trt picked tile=256, faster kernel):
    gemm1 Δ% +3.4%, gemm2 Δ% +2.5%, kernel-sum Δ% +3.0%, full-pass +0.4%
  Run 3 (trt picked tile=256, faster kernel):
    gemm1 Δ% -1.4%, gemm2 Δ% -2.4%, kernel-sum Δ% -1.8%, full-pass +0.4%

In matched-tactic runs (Runs 2-3, both at tile=256), kernel-level
Δ% lands within the ±3% autotune-noise floor -- fully closing the
audit's +13.7% reading. In the mismatched run (Run 1), fi is 17%
faster at kernels because fi-with-fixes consistently uses tile=256
while trt's coin-flip can land on tile=128. The prior +13.7%
audit measurement (fi-128 vs trt-256) is no longer reproducible
because fi never picks tile=128 at runtime under both-fixes.

This run also directly verifies the trt-coin-flip noise model
that was previously inferred from the 0.08% profile margin: trt's
per-run gemm1 time splits cleanly between tile=128 (~0.1234 ms)
and tile=256 (~0.1004-0.1029 ms), tracking trt's tactic choice.
fi is stable across all 3 runs; trt is the noise source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 29, 2026
…dual as version-skew

Adds the closing fresh-container reproduction sub-section under flashinfer-ai#9
covering EP ∈ {1, 8, 16} × N ∈ {4096, 8192, 16384} × 3 runs at
--num-iters 100 with both fixes applied. This is the comprehensive
empirical validation of the audit's perf-investigation closure:

| (EP, N)        | Mean Δ%   | Spread | Verdict
| EP=1 N=16384   | +3.9%     | 0.3pp  | fi slower (version-skew)
| EP=8 N=4096    | -7.0%     | 3.7pp  | fi faster
| EP=8 N=8192    | +1.4%     | 2.3pp  | parity
| EP=8 N=16384   | -4.4%     | 8.7pp  | fi faster (trt coin-flip variance)
| EP=16 N=4096   | -5.4%     | 1.0pp  | fi faster
| EP=16 N=8192   | +3.0%     | 2.4pp  | parity
| EP=16 N=16384  | -9.7%     | 1.8pp  | fi much faster (all 3 trt-tile=128)

Key findings:

(1) EP>1 perf gap is closed at all measured shapes -- mean Δ% is
    fi-at-parity-or-faster across the EP × N grid.

(2) EP=1 +3.9% residual is real (tight 0.3pp spread) but explained
    by the audit's documented version-skew caveat (NGC trt
    1.3.0rc5.post2 lacks 3 forward-ported commits). Auto-resolves on
    next NGC bump. NOT closed by the paired fixes -- mechanism is
    separate (kernel-level fi is -2.4% faster at EP=1; the +3.9%
    full-pass gap lives in non-kernel GPU-side work, not autotune).

(3) Per-kernel detail at EP=16 N=16384 (this session): all 3 runs
    caught trt-tile=128 by coin-flip; gemm1+gemm2 sum Δ% = -13 to
    -15% (fi much faster). The audit's original +13.7% kernel-level
    is fully closed -- the *opposite* mismatch is now possible
    because fi consistently picks tile=256.

(4) Cross-session variance documented: EP=8 N=16384 came in -4.4%
    here vs -0.8% in an earlier same-day session, both with both
    fixes. The variance is from trt's coin-flip ratio across
    sessions (different SLURM allocation / GPU instance). Sign is
    consistent (fi-at-parity-or-faster); magnitude varies. For
    audit-closure purposes the sign + tight bound is sufficient.

This commit closes the empirical perf-investigation thread of
the audit. Remaining audit-cycle work is upstream-shipping
(PRs not yet opened, tests not yet added) tracked separately --
not part of this empirical thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 29, 2026
Adds the EP=8 per-kernel breakdown (3 runs at EP=8 N=16384 --num-iters 100
with both fixes) showing the same trt-coin-flip pattern as EP=16:

  Run 1 (trt tile=256): kernel-sum Δ% = -0.4%, full-pass +1.2%
  Run 2 (trt tile=256): kernel-sum Δ% = -0.4%, full-pass -1.0%
  Run 3 (trt tile=128): kernel-sum Δ% = -12.1%, full-pass -1.0%

trt's gemm1 splits cleanly between ~0.206 ms (tile=256) and 0.238 ms
(tile=128), a ~15% jump tracking trt's per-tactic perf -- mirroring the
EP=16 ~20% jump previously documented. Confirms trt's coin-flip is the
noise source at EP=8 too, not fi.

Matched-tactic case (Runs 1+2): kernel-sum Δ% within ±0.4% noise floor.
Mismatched case (Run 3): fi 12% faster at kernels.

This upgrades the EP=8 closure from "inferred from EP=16 evidence" to
"direct measurement", symmetric to the EP=16 closure committed earlier
today.

Also documents an interesting wrapper-overhead observation: in the
mismatched run fi is 12% faster at kernels but only -1.0% at full-pass.
fi has ~44 µs more non-kernel time than trt at EP=8 (CuteDslMoEWrapper
Python layer overhead vs trt's C++ thop). Same mechanism as the EP=1
+3.9% residual; not a port-correctness issue, but it is the
architectural difference the recommended thin-adapter refactor
addresses.

This commit is the final piece of empirical evidence for the audit's
perf-investigation closure. Both flashinfer-ai#9 sub-questions (EP=8 +8.6% and
EP=16 +13.7%) are now closed by direct per-kernel measurement, not
inference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 29, 2026
…, mark superseded sub-sections

Lee asked for a thorough audit-of-the-audit pass: look for inconsistencies,
discrepancies, redundancies, and opportunities for compaction/simplification.
After reading the full 3952-line doc, the following systematic edits were
applied:

Top-of-doc:

- Added a "Final state (2026-04-29) — read this first" section at the very
  top (immediately after the title, before the chronological correction
  log). Captures: static audit summary, both upstream PRs (flashinfer-ai#3067, flashinfer-ai#3198),
  the two paired wrapper bugs and their side branches, the comprehensive
  30-cell validation matrix, the three-regime classification (large-N
  closure / medium-N wrapper-overhead-residual / small-N noise), the EP=1
  + version-skew explanation, and the closure of the empirical thread.
  This section is now authoritative; the rest of the doc is the
  chronological investigation record leading to it.

- Updated the title's "last updated" tag to 2026-04-29.

Stale references fixed:

- The flashinfer-ai#3067 fix branch was referred to via the stale commit hashes
  d291d17e and e2330e73/f0cf8cd0 in three places; updated to the current
  57de5f8 + 38293a0 (with notes that the original commit-hash references
  are preserved for traceability).
- The "Tunable runner" row in the Summary table marked tile_size=256 as
  "gated off"; updated to "fixed via PR flashinfer-ai#3067 (pending review)".
- Item 3 of the corrections list: "Recommended fix (drafted, not yet
  applied)" → "Fix landed and shipped as upstream PR flashinfer-ai#3067".

Duplicate content removed:

- Two "What still stands (unaffected by the autotune-context bug)"
  blocks were nearly identical; the first was removed and replaced
  with a one-line pointer to the consolidated version below.

Executive verdict updated:

- Rewritten to reflect current state: now references both PRs, the
  paired-fix story, the comprehensive 30-cell sweep matrix, and the
  +13.7% kernel-level closure. Previously it described "Leading
  hypothesis (NOT YET EMPIRICALLY VERIFIED)" framings that have since
  been confirmed and shipped.

Superseded sub-sections marked:

- "EP=8 vs EP=16: the fix has different impact at each" — was an
  early "necessary but not sufficient" framing under prealloc-only
  state; replaced with a SUPERSEDED note pointing to the paired-fix
  story below (the bucket-cap-fix discovery is what closed both EPs).
- "EP=16 confirmation (2026-04-28)" header clarified to indicate
  it's a single-iter prealloc-only snapshot; the H_A/H_B
  hypothesis-discrimination logic and trt 0.08% margin observation
  are still load-bearing and preserved.
- "Final cross-N validation (2026-04-28)" renamed to "Initial 6-cell
  cross-N spot-check" with a pointer to the comprehensive 30-cell
  sweep that supersedes it.
- Status sub-section: removed outdated EP=8 partial-closure / EP=16
  full-closure bullets (those were prealloc-only state); replaced
  with concise paired-fix branch summary and pointer to comprehensive
  sweep for per-cell results.
- Implications for the audit's earlier claims: updated the
  "dominant factor is the prealloc bias" bullet to the paired-fix
  framing (prealloc + bucket-cap, both required).

Follow-ups updated:

- Open follow-ups remaining: appended a "Update 2026-04-29 (final):
  flashinfer-ai#9 also closed" note documenting that the EP=8/EP=16 perf gap was
  root-caused to two paired wrapper bugs, with the empirical
  investigation thread now closed.
- Item flashinfer-ai#9 in the chronological "Follow-ups queued for later sessions"
  list: added a RESOLVED 2026-04-29 header with pointer to the
  paired-fix story, marking the 9a / 9b / 9c sub-mechanism candidates
  as pre-resolution candidate hypotheses (preserved as record of the
  search space, not blocking work).

What was preserved as historical record (not removed):

- Full "2026-04-24/25/28 correction" chronological log
- Per-component static audit detail
- Issue flashinfer-ai#3067 narrative including the original "Leading hypothesis"
  framing and the post-fix verification table
- The 22-day candidate-root-cause investigation that ruled out 5
  mechanisms before landing on the paired wrapper bugs
- Part 2 baseline numbers (with retraction annotations preserved
  as-is)
- Probes, MoE kernel coverage, reproduction runbook

Net effect: doc is more navigable, internally consistent, and the
authoritative summary is at the top. Body content preserves the
investigative reasoning. Diff: +241 / -217 lines, net +24 lines
(intentional — the new Final state section adds explicit
authoritative content).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>