
feat: estimate inference throughput for models on cluster GPUs #311

Merged
surajssd merged 24 commits into kaito-project:main from surajssd:suraj/estimate-inf-token-throughput
Jun 11, 2026

Conversation

@surajssd

@surajssd surajssd commented Jun 3, 2026

Contributor

Description

Adds an offline inference-throughput estimator (issue #139) that surfaces two rough numbers per model without running any inference:

  • per-chat tok/s — single-stream decode speed, memory-bandwidth bound ("how snappy chat feels")
  • concurrent capacity / aggregate tok/s — KV-cache-budget gated, per replica ("how many requests at once")

The estimate is shown on catalog cards (deferred until the card scrolls into view) and on the Deploy page, where it sits in a new Performance & Precision section alongside weight- and KV-cache-precision controls. All values are presented as estimates with a methodology disclaimer. The branch also hardens the supporting backend (input validation, caching, GPU selection), unifies parameter-count parsing across the stack, and restores bun run lint by migrating both workspaces to ESLint flat config.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • 📚 Documentation update
  • 🎨 UI/UX improvement
  • ♻️ Refactoring (no functional changes)
  • 🧪 Test update
  • 🔧 Build/CI configuration

Related Issues

Fixes #139

Changes Made

Throughput estimator (backend)

  • Add gpuPerformance.ts with estimatePerChatTokensPerSec and estimateConcurrentCapacity heuristics, plus deriveTpSizeToFitWeights, bytesPerWeightFor, and bytesPerKvFor helpers
  • Add GET /installation/gpu-throughput route: selects the GPU pool to estimate on, derives tpSize when the caller sends no minGpus hint, and degrades to a low-confidence per-chat-only result when architecture data is missing
  • Add getModelArchitecture to huggingface.ts to read transformer dims from config.json (including nested text_config/llm_config/language_config for multimodal models), with a token-scoped, TTL'd, LRU-bounded cache (gated configs keyed by sha256(token) so they never leak across callers)
  • Add per-GPU memBandwidthGBs specs and an H200-141GB entry to costEstimation.ts; gate FP8 via gpuSupportsFp8 (initially Hopper only; a later commit in this PR adds Ada Lovelace L40S/L4)
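
To make the "KV-cache-budget gated" idea concrete, here is a minimal sketch of how a per-replica concurrency estimate could be computed. The formula, the 10% headroom default, and every name below (`ArchDims`, `kvBytesPerToken`, `concurrentCapacitySketch`) are illustrative assumptions, not the code in `gpuPerformance.ts`.

```typescript
// Hypothetical sketch of a KV-cache-gated concurrency estimate (not the PR's code).
interface ArchDims {
  numLayers: number;  // transformer layers
  numKvHeads: number; // key/value heads (smaller than attention heads under GQA)
  headDim: number;    // dimension per head
}

// Bytes of KV cache one token occupies: a K and a V vector for every layer.
function kvBytesPerToken(arch: ArchDims, bytesPerKv = 2): number {
  return 2 * arch.numLayers * arch.numKvHeads * arch.headDim * bytesPerKv;
}

function concurrentCapacitySketch(opts: {
  vramGB: number;
  tpSize: number;
  paramCount: number;
  bytesPerWeight: number;
  arch: ArchDims;
  contextLen: number;
  bytesPerKv?: number;
  headroomFraction?: number; // assumed default: 10% reserved for activations etc.
}): number {
  const totalVram = opts.vramGB * 1e9 * opts.tpSize;
  const usable = totalVram * (1 - (opts.headroomFraction ?? 0.1));
  const weights = opts.paramCount * opts.bytesPerWeight;
  const kvBudget = usable - weights;
  if (kvBudget <= 0) return 0; // weights alone exhaust memory: "does not fit"
  const perRequest = kvBytesPerToken(opts.arch, opts.bytesPerKv) * opts.contextLen;
  return Math.floor(kvBudget / perRequest);
}
```

Under these assumptions, an 8B-parameter FP16 model with 32 layers, 8 KV heads, and a head dimension of 128 on one 80 GB GPU at a 4096-token context leaves room for roughly a hundred concurrent requests, while a 70B FP16 model on the same GPU returns 0.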

Throughput estimator (frontend)

  • Add ThroughputEstimate component, useGpuThroughput hook, gpuOperatorApi.getThroughput, and gpu-throughput-params builders
  • Add useInView so catalog cards defer the estimate fetch until visible
  • Wire estimates into ModelCard, HfModelCard, ModelGrid, HfModelSearch, ModelsPage, and the Deploy summary card

Deploy page precision controls

  • Add Model Weights Precision and KV Cache Precision dropdowns; feed FP8 into the deployment as --quantization fp8 / --kv-cache-dtype fp8 engine args for vllm/sglang only
  • Decouple KV-cache precision from weight quantization (bytesPerKvFor defaults to 2 bytes)
  • Gate FP8 to FP8-capable GPUs (initially H100/H200; a later commit in this PR adds L40S/L4): downgrade an FP8 KV cache to FP16 in the estimate, and disable Deploy with a reason on GPUs without FP8 support
  • Surface a non-blocking "model does not fit" warning when the high-confidence estimate leaves no room for the KV cache (Deploy stays enabled, hidden when fp8Blocked already applies)

Correctness & hardening

  • Factor tensor parallelism into per-chat speed: scale bandwidth by tpSize × tpDecodeEfficiency(tpSize), stepped by group size (1.0 for TP1, 0.85 for TP2-4, 0.75 for TP>4) so large groups crossing NVLink domains aren't over-estimated
  • Bound tpSize to a pool's per-node GPU count via perNodeGpuCount; validate an explicit gpuModel against capacity.nodePools and fall back to the highest-VRAM pool
  • Validate and encodeURIComponent modelId (isValidHfRepoId / encodeHfRepoPath) before any token-forwarding Hugging Face fetch, rejecting ../ traversal and stray segments
  • Cap paramCount at 9T and the explicit contextLen at MAX_CONTEXT_LEN (32768); vary the react-query cache key by HF auth state (auth/anon) and stop retrying deterministic 4xx responses
  • Correct estimates for unknown GPUs (skip / 404 instead of silently treating as A10) and accept non-HF custom modelIds (degrade to bandwidth-only instead of a hard 400)
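
The tensor-parallel scaling described above can be sketched as follows. The tier values (1.0, 0.85, 0.75) come from this description; the bandwidth-bound formula (each decoded token reads every weight once) and the function signatures are assumptions for illustration only.

```typescript
// Efficiency tiers as stated in the PR description.
function tpDecodeEfficiency(tpSize: number): number {
  if (tpSize <= 1) return 1.0;
  if (tpSize <= 4) return 0.85;
  return 0.75;
}

// Hypothetical per-chat estimate: decode is treated as memory-bandwidth bound,
// so tokens/sec is effective bandwidth divided by the bytes of weights read per token.
function perChatTokensPerSecSketch(
  memBandwidthGBs: number,
  paramCount: number,
  bytesPerWeight: number,
  tpSize = 1,
): number {
  const effectiveBandwidth =
    memBandwidthGBs * 1e9 * tpSize * tpDecodeEfficiency(tpSize);
  return effectiveBandwidth / (paramCount * bytesPerWeight);
}
```

With `tpSize = 1` the efficiency factor is 1.0, so the result reduces to the plain single-GPU number, which matches the regression guard mentioned in the commit log.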

Refactors

  • Unify model parameter-count parsing in @airunway/shared (shared/types/modelParams.ts: parseParameterCountFromName, resolveModelParamCount), removing divergent backend/frontend copies
  • Type quant params with the exported Quantization / KvCacheDtype unions instead of string
  • Drop dead gpuModel client-side selection plumbing (pickGpuModel), making the backend the single source of truth for GPU selection

Build/CI

  • Migrate backend and frontend to ESLint flat config (eslint.config.mjs); add @typescript-eslint/parser + plugin to backend; drop the removed --ext flag so bun run lint works under ESLint v9+

Testing

  • Unit tests pass (bun run test)
  • Manual testing performed
  • Tested with a Kubernetes cluster

New/updated test suites: gpuPerformance.test.ts, huggingface.test.ts, installation.test.ts, costEstimation.test.ts, modelCompatibility.test.ts (backend); DeploymentForm.test.tsx, ThroughputEstimate.test.tsx, useGpuOperator.test.tsx (frontend). They cover the per-chat/concurrency heuristics, TP-size derivation and decode-efficiency tiers, config.json parsing (incl. nested multimodal configs), TTL + LRU cache eviction, modelId validation/encoding, query-param bounds, unknown-GPU 404, FP8 gating, and the non-blocking "does not fit" warning.

Manual Testing

Create a cluster with a GPU node pool, check out this branch, and run the following:

bun install
bun run dev

Then open http://localhost:5173/deploy/microsoft%2FPhi-4-mini-instruct to see the new Performance & Precision section.

Checklist

  • My code follows the project's style guidelines
  • I have run bun run lint
  • I have added tests that prove my fix/feature works
  • New and existing unit tests pass locally
  • I have updated documentation if needed
  • My changes generate no new warnings

Screenshots

[Screenshot]

Look at the new Performance & Precision section.

Additional Notes

  • All numbers are deliberately simple heuristics and are labelled as estimates in the UI — real throughput depends on the serving engine, batch scheduler, prompt lengths, and quantization.
  • The ESLint migration intentionally demotes the pre-existing no-explicit-any / no-unused-vars backlog (and the newly-enabled react-hooks compiler rules) to warnings so lint is green and CI-usable, leaving the historical backlog visible for incremental burndown rather than fixing it in this PR.
  • One pre-existing tsc error in frontend/src/hooks/useInView.ts (RefObject<T | null>) surfaced during validation; it is unrelated to the estimator logic and can be addressed separately.

@surajssd surajssd requested a review from a team as a code owner June 3, 2026 18:34
Copilot AI review requested due to automatic review settings June 3, 2026 18:34

Copilot AI left a comment

Contributor

Pull request overview

Adds an offline, heuristic GPU inference-throughput estimator and surfaces it in the model catalog and deploy UI, using cluster GPU specs + optional Hugging Face config.json architecture metadata (cached) to estimate per-chat decode speed and concurrency.

Changes:

  • Backend: introduce a bandwidth/KV-cache–based estimator and a new GET /installation/gpu-throughput endpoint that selects a GPU pool and (when possible) uses HF config.json architecture for concurrency estimates.
  • Frontend: add query-param builders + hook + ThroughputEstimate UI component, lazily fetching estimates only when cards scroll into view.
  • Shared types + GPU specs: add ModelArchitecture/GpuThroughputEstimate, extend GPU model table with per-GPU memory bandwidth and H200 support.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| shared/types/model.ts | Adds ModelArchitecture type used for KV sizing. |
| shared/types/installation.ts | Adds GpuThroughputEstimate API payload type. |
| frontend/src/pages/ModelsPage.tsx | Plumbs selected GPU model into model grids/search for throughput estimation. |
| frontend/src/pages/DeployPage.tsx | Fetches and displays throughput estimate on the deploy summary card. |
| frontend/src/lib/gpu-throughput-params.ts | Adds helpers to pick GPU model and build throughput query params. |
| frontend/src/lib/api.ts | Adds gpuOperatorApi.getThroughput client method and exports GpuThroughputEstimate. |
| frontend/src/hooks/useInView.ts | Adds in-view hook to defer per-card throughput fetches. |
| frontend/src/hooks/useGpuOperator.ts | Adds useGpuThroughput react-query hook and params typing. |
| frontend/src/components/models/ThroughputEstimate.tsx | New UI component to render per-chat + concurrency estimates with tooltip disclaimer. |
| frontend/src/components/models/ThroughputEstimate.test.tsx | Component tests for confident/low-confidence/loading/empty states. |
| frontend/src/components/models/ModelGrid.tsx | Threads gpuModel prop down to cards. |
| frontend/src/components/models/ModelCard.tsx | Adds lazy throughput fetching/display on curated model cards. |
| frontend/src/components/models/HfModelSearch.tsx | Threads gpuModel prop down to HF cards. |
| frontend/src/components/models/HfModelCard.tsx | Adds lazy throughput fetching/display on HF search result cards. |
| backend/src/services/huggingface.ts | Adds cached config.json architecture lookup keyed by model + token hash. |
| backend/src/services/gpuPerformance.ts | New estimator implementation for per-chat tok/s and concurrent capacity. |
| backend/src/services/gpuPerformance.test.ts | Unit tests covering estimator behavior and edge cases. |
| backend/src/services/costEstimation.ts | Extends GPU specs with memBandwidthGBs and adds H200. |
| backend/src/services/costEstimation.test.ts | Tests H200 normalization and bandwidth lookup. |
| backend/src/routes/installation.ts | Adds GET /installation/gpu-throughput endpoint and GPU selection logic. |
| .gitignore | Ignores .playwright-mcp/ and normalizes trailing newline. |


Comment thread backend/src/routes/installation.ts Outdated
Comment thread backend/src/routes/installation.ts
Comment thread backend/src/services/huggingface.ts
Comment thread frontend/src/hooks/useGpuOperator.ts
Comment thread frontend/src/hooks/useInView.ts
@surajssd surajssd force-pushed the suraj/estimate-inf-token-throughput branch from a1e39f8 to 4325784 Compare June 5, 2026 17:07
Copilot AI review requested due to automatic review settings June 5, 2026 17:49

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 41 out of 43 changed files in this pull request and generated 1 comment.

Comment thread frontend/src/hooks/useInView.ts

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 41 out of 43 changed files in this pull request and generated 2 comments.

Comment thread frontend/src/hooks/useInView.ts Outdated
Comment thread frontend/src/hooks/useGpuOperator.ts

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 41 out of 43 changed files in this pull request and generated 1 comment.

Comment thread frontend/src/hooks/useInView.ts
Comment thread backend/src/lib/kubeconfig.ts
Comment thread backend/src/services/costEstimation.ts Outdated
surajssd added 15 commits June 10, 2026 11:07
Add a rough, offline inference-speed estimator (issue kaito-project#139) that surfaces two
numbers per model: single-stream per-chat `tok/s` (memory-bandwidth bound) and
KV-cache-gated concurrent capacity. No inference is run; all values are shown
as estimates with a methodology disclaimer.

Backend:
- Add `gpuPerformance.ts` with `estimatePerChatTokensPerSec` and
  `estimateConcurrentCapacity` heuristics, plus `resolveParamCount` and
  `bytesPerWeightFor` helpers
- Add `GET /installation/gpu-throughput` route, selecting the GPU pool to
  estimate on and degrading to a low-confidence per-chat-only result when
  architecture data is missing
- Add `getModelArchitecture` to `huggingface.ts` to read transformer dims from
  `config.json`, with a token-scoped, TTL'd cache (gated configs keyed by
  `sha256(token)` so they never leak across callers)
- Add per-GPU `memBandwidthGBs` specs to `costEstimation.ts` and a new
  `H200-141GB` entry; export `GpuModelInfo`

Frontend:
- Add `ThroughputEstimate` component (per-chat + concurrency label with tooltip)
- Add `useGpuThroughput` hook, `gpuOperatorApi.getThroughput`, and
  `gpu-throughput-params` builders
- Add `useInView` so catalog cards defer the estimate fetch until visible
- Wire estimates into `ModelCard`, `HfModelCard`, `ModelGrid`, `HfModelSearch`,
  `ModelsPage`, and the Deploy summary card

Shared:
- Add `GpuThroughputEstimate` and `ModelArchitecture` types

Chore:
- Ignore `.playwright-mcp/` and add a trailing newline to `.gitignore`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
`selectGpuForEstimate` derived `maxContiguous` (the per-replica tensor-parallel
ceiling) from cluster-wide or pool-total values, producing estimates for
hardware the cluster lacks and `tpSize` values that exceed a single node's GPU
count.

- Add `perNodeGpuCount` helper deriving a pool's per-node GPUs as
  `floor(gpuCount / nodeCount)`, since `gpuCount` is summed across nodes
- Validate an explicit `gpuModel` against `capacity.nodePools` and fall
  back to the highest-VRAM pool when the requested model is absent, instead
  of resolving it directly from the static GPU spec table
- Compute `maxContiguous` and `capacityLabel` from the selected pool's
  per-node count rather than `maxContiguousAvailable` / `pool.gpuCount`
- Import `NodePoolInfo` from `@airunway/shared`

Tests:

- Add `GET /api/installation/gpu-throughput` suite covering per-node
  `tpSize` clamping, fallback when the requested model is absent, the
  no-explicit-model path, and the empty-cluster `404`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
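
A small sketch of the per-node clamp this commit describes. The `floor(gpuCount / nodeCount)` rule is taken from the commit message; the `PoolSketch` shape, the `clampTpSize` helper, and the choice to never return less than 1 are assumptions made for illustration.

```typescript
// Hypothetical pool shape: gpuCount is summed across all nodes in the pool.
interface PoolSketch {
  gpuCount: number;
  nodeCount: number;
}

// Per-node GPUs, as described in the commit: floor(gpuCount / nodeCount).
function perNodeGpuCountSketch(pool: PoolSketch): number {
  if (pool.nodeCount <= 0) return 0;
  return Math.floor(pool.gpuCount / pool.nodeCount);
}

// Assumed helper: a tensor-parallel group must fit on one node,
// so cap the requested size at the per-node count (never below 1).
function clampTpSize(requested: number, pool: PoolSketch): number {
  const perNode = perNodeGpuCountSketch(pool);
  return Math.max(1, Math.min(requested, perNode));
}
```

For a pool of 2 nodes with 16 GPUs in total, a requested `tpSize` of 16 is clamped to 8 under this sketch, since only 8 GPUs share a node.
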
Expired entries in `architectureCache` were left in the map indefinitely, so
over time (many distinct `modelId`/token keys) it could grow unbounded despite
the TTL.

- Delete a cache entry in `getModelArchitecture` when it is found expired,
  keeping the cache bounded by "used within TTL"

Tests:

- Add a `getModelArchitecture` suite covering config.json parsing, fresh
  cache reuse, eviction + re-fetch after the TTL, and the non-ok `undefined`
  fallback

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
`useGpuThroughput`'s react-query `queryKey` did not account for auth state, so a
high-confidence estimate fetched for a gated model while logged in could still
be served from cache after logout — even though the backend can no longer read
`config.json` without the token.

- Add an `'auth' | 'anon'` discriminator (derived from token presence, never
  the token itself) to the `queryKey`, forcing a recompute when switching
  between authenticated and anonymous states

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
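
The discriminator idea can be sketched as a pure key builder. Only the `'auth' | 'anon'` discriminator is taken from the commit message; the key layout and the names `authStateFor` and `throughputQueryKey` are hypothetical and do not mirror the actual `useGpuThroughput` hook.

```typescript
type AuthState = 'auth' | 'anon';

// Derive the discriminator from token presence only; the token never enters the key.
function authStateFor(token: string | null | undefined): AuthState {
  return token ? 'auth' : 'anon';
}

// Hypothetical key layout for a react-query style cache.
function throughputQueryKey(
  params: { modelId: string; paramCount?: number },
  token?: string | null,
) {
  return [
    'gpu-throughput',
    params.modelId,
    params.paramCount ?? null,
    authStateFor(token),
  ] as const;
}
```

Because the last element differs between the logged-in and logged-out states, a cache keyed this way treats them as separate entries, so a logout cannot reuse an estimate computed with a token.
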
- Add `doesNotFit` to `GpuThroughputEstimate` and set it when model weights plus
  headroom leave no room for KV cache; render an explicit "Does not fit — no
  room for KV cache" warning instead of a misleading per-chat speed
- Resolve KV context length after fetching model architecture, falling back to
  `maxPositionEmbeddings` (capped at `MAX_INFERRED_CONTEXT_LEN`) so long-context
  HF models are no longer sized against the 4K default
- Add backend route and frontend component tests for both cases

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- Add **Model Weights Precision** and **KV Cache Precision** dropdowns in a new
  "Performance & Precision" section, and move the throughput estimate there from
  the model summary card
- Decouple KV-cache precision from weight quantization: add `bytesPerKvFor`
  (defaults to 2 bytes) and thread `kvCacheDtype` through `api.ts`,
  `useGpuOperator`, and `gpu-throughput-params`
- Feed FP8 into the deployment as `--quantization fp8` / `--kv-cache-dtype fp8`
  engine args for `vllm`/`sglang` only
- Gate FP8 to Hopper (H100/H200) via `gpuSupportsFp8`: downgrade an FP8 KV cache
  to FP16 in the estimate and disable Deploy with a reason on non-Hopper GPUs
- Surface aggregate tokens/sec total in the `ThroughputEstimate` label
- Refactor GPU selection so the backend is the single source of truth: replace
  `pickGpuModel` with `hasEstimableGpu` and stop forwarding `gpuModel` from the
  client
- Add tests for `bytesPerKvFor`, `gpuSupportsFp8`, KV/weight decoupling, and FP8
  gating

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
The backend already picks the estimate GPU (highest per-GPU VRAM); the frontend
no longer forwards `gpuModel`, so the threaded value was dead.

- Delete the deprecated `pickGpuModel()` helper from `gpu-throughput-params.ts`
  and narrow `buildThroughputParamsForGpu`'s gate param to `boolean`
- Replace the `gpuModel?: string` prop with a `gpuPresent?: boolean` presence
  flag across `ModelsPage`, `ModelGrid`, `HfModelSearch`, `ModelCard`, and
  `HfModelCard`
- Source `gpuPresent` from `hasEstimableGpu(detailedCapacity)` in `ModelsPage`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
A user-supplied `modelId` was interpolated raw into Hugging Face URLs while
forwarding the caller's `X-HF-Token`, so malformed ids (`../../…`, extra
`/segments`, query fragments, or whitespace) could steer authenticated outbound
requests to unintended paths.

- Add `isValidHfRepoId()` + `encodeHfRepoPath()` helpers to `huggingface.ts`
  (1-2 safe segments, no `.`/`..` traversal, ≤96 chars/segment, per-segment
  `encodeURIComponent`)
- Guard and encode both token-forwarding fetches: `getModelArchitecture`
  (returns `undefined` on an invalid id) and `getGgufFiles` (throws)
- Reject at the route edges too: `.refine(isValidHfRepoId)` on the
  `gpu-throughput` query schema and a 400 guard on the greedy
  `/:modelId/gguf-files` route
- Add service + route tests covering traversal/unsafe ids and encoded URLs

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
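
A minimal sketch of the validation rules listed in this commit (1-2 segments, no `.`/`..`, at most 96 characters per segment, per-segment `encodeURIComponent`). The allowed character set in `SAFE_SEGMENT` is an assumption, and the `*Sketch` functions are illustrative stand-ins for the real helpers in `huggingface.ts`.

```typescript
// Assumed character set for a "safe" segment; the real helper may differ.
const SAFE_SEGMENT = /^[A-Za-z0-9][A-Za-z0-9._-]*$/;
const MAX_SEGMENT_LEN = 96;

function isValidHfRepoIdSketch(id: string): boolean {
  const segments = id.split('/');
  if (segments.length < 1 || segments.length > 2) return false;
  return segments.every(
    (s) =>
      s.length > 0 &&
      s.length <= MAX_SEGMENT_LEN &&
      s !== '.' &&
      s !== '..' &&
      SAFE_SEGMENT.test(s),
  );
}

// Encode each segment separately so the "/" between owner and repo survives.
function encodeHfRepoPathSketch(id: string): string {
  if (!isValidHfRepoIdSketch(id)) {
    throw new Error(`invalid repo id: ${id}`);
  }
  return id
    .split('/')
    .map((s) => encodeURIComponent(s))
    .join('/');
}
```

Validating before encoding matters here: encoding alone would turn `../` into a harmless-looking string, but rejecting it outright keeps malformed ids from reaching a token-bearing request at all.
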
- bound `paramCount` with `.max(9_000_000_000_000)` so out-of-range values are
  rejected with 400 instead of yielding garbage throughput estimates, matching
  the existing `contextLen` and `tpSize` caps
- add a test asserting a `paramCount` above the cap returns 400 and never
  reaches the estimator

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
The backend and frontend each carried their own parameter-count parser that had
drifted (whitespace boundaries, the `illion` suffix, and the backend-only `<
10000` sanity guard), so a fix to one never reached the other. Consolidate on a
single source of truth.

- add canonical `parseParameterCountFromName` and `resolveModelParamCount` to
  `@airunway/shared` (`shared/types/modelParams.ts`), keeping the stricter
  backend parsing behaviour
- rewire `modelCompatibility.ts` to import `parseParameterCountFromName` from
  shared, and drop the now-dead `resolveParamCount` from `gpuPerformance.ts` (no
  production callers remained)
- replace the divergent inline regex in `gpu-throughput-params.ts` with the
  shared `resolveModelParamCount`
- repoint the backend tests to the shared implementation

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
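
For readers unfamiliar with name-based parameter parsing, here is a rough sketch of the kind of parser being consolidated. The regex, the supported units, and the function name are assumptions; only the `illion` suffix and the `< 10000` sanity guard are taken from the commit message, and the shared `parseParameterCountFromName` may behave differently.

```typescript
// Hypothetical parser: finds the first "<number><m|b>" token in a model name,
// e.g. "8B" or "0.5b", optionally spelled out as "7 billion".
function parseParamCountSketch(name: string): number | undefined {
  const match =
    /(?:^|[^a-z0-9.])(\d+(?:\.\d+)?)\s*([mb])(?:illion)?(?![a-z0-9])/i.exec(name);
  if (!match) return undefined;

  const value = Number(match[1]);
  // Sanity guard mentioned in the commit: reject implausible magnitudes.
  if (!(value > 0 && value < 10000)) return undefined;

  const unit = (match[2] ?? '').toLowerCase() === 'b' ? 1e9 : 1e6;
  return Math.round(value * unit);
}
```

The leading boundary check is what keeps version numbers such as the `3.1` in `Llama-3.1-8B-Instruct` from being mistaken for a size, and names without a size token yield `undefined`.
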
The per-chat tokens/sec estimate used one GPU worth of bandwidth and
ignored `tpSize`, underestimating speed by ~`tpSize×` for multi-GPU
replicas — while the UI tooltip implied the GPU count was already
factored in.

- Scale effective bandwidth by `tpSize × TP_DECODE_EFFICIENCY` (0.85) in
  `estimatePerChatTokensPerSec`; `tpSize=1` reproduces the exact
  single-GPU number
- Thread the resolved `effectiveTpSize` into the per-chat call in the
  `gpu-throughput` route
- Add tests for the TP speedup ratio and the `tpSize=1` regression guard

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
`useGpuThroughput` set no `retry`, so it inherited the global `retry: 3`. A 404
(no cluster GPU pool maps to a known spec) is deterministic, so every model card
scrolled into view fired 1 + 3 = 4 doomed requests.

- Add a status-aware `retry` predicate to the `useGpuThroughput` query: skip
  retries on any 4xx, keep a small budget (`failureCount < 2`) for transient
  5xx/network failures
- Add `useGpuOperator.test.tsx` covering the no-retry path (404/400 fire one
  request) and the 5xx retry path, using a `QueryClient` with retries enabled so
  the per-query override is actually exercised

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
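
The status-aware predicate can be sketched as below. The `(failureCount, error) => boolean` shape matches react-query's `retry` option and the `failureCount < 2` budget is from the commit message; the assumption that the thrown error carries a numeric `status` field is mine and may not match the project's API client.

```typescript
// Assumed error shape: the API client attaches the HTTP status to the error.
interface HttpLikeError {
  status?: number;
}

// Skip retries for deterministic 4xx responses; allow a small budget otherwise.
function throughputRetrySketch(failureCount: number, error: unknown): boolean {
  const status = (error as HttpLikeError | null)?.status;
  if (typeof status === 'number' && status >= 400 && status < 500) {
    return false;
  }
  return failureCount < 2;
}
```

Under this sketch a 404 fires exactly one request, while a 503 or a network error is retried until two failures have been recorded.
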
Tighten the loose `string` parameters to the unions already exported alongside
each function, so the validated enum flows through without being widened back to
`string`:

- `bytesPerWeightFor(quantization?: string)` -> `quantization?: Quantization`
- `bytesPerKvFor(dtype?: string)` -> `dtype?: KvCacheDtype`

The runtime switch and `undefined` defaults are unchanged, so behavior is
identical; an invalid value is now a compile-time error at the boundary instead
of silently mapping to the 2-byte default.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
HuggingFace search cards carry no `minGpus`, so they sent no `tpSize`
and the backend defaulted to tp=1 — making large models spuriously
report "does not fit" while the curated/Deploy tabs showed full
capacity for the same model and cluster.

- Add `deriveTpSizeToFitWeights` in `gpuPerformance.ts`: returns the
  smallest power-of-two TP size whose per-GPU weight shard leaves room
  for a KV cache, bounded by `maxContiguous`; the fit test mirrors
  `estimateConcurrentCapacity`.
- Wire it into the `/gpu-throughput` route so an omitted `tpSize` is
  derived instead of defaulting to 1; an explicit `tpSize` still wins.
- Add tests covering the TP bump, `maxContiguous` cap, single-GPU cap,
  small-model/fp8 stay-at-1, unknown paramCount, headroom-exceeds-VRAM,
  and cross-tab consistency.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
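
A sketch of the "smallest power-of-two TP size that fits" search this commit describes. The fit test (weight shard smaller than usable per-GPU memory), the 10% headroom default, the fallback when nothing fits, and the name `deriveTpSizeSketch` are all assumptions; the real `deriveTpSizeToFitWeights` mirrors `estimateConcurrentCapacity`, which is not shown in this PR page.

```typescript
function deriveTpSizeSketch(opts: {
  paramCount?: number;
  bytesPerWeight: number;
  vramGB: number;
  maxContiguous: number;      // per-node GPU ceiling for one replica
  headroomFraction?: number;  // assumed default: 10%
}): number {
  if (!opts.paramCount) return 1; // unknown size: stay at TP1

  const usablePerGpu = opts.vramGB * 1e9 * (1 - (opts.headroomFraction ?? 0.1));
  const weights = opts.paramCount * opts.bytesPerWeight;

  let largestTried = 1;
  for (let tp = 1; tp <= Math.max(1, opts.maxContiguous); tp *= 2) {
    largestTried = tp;
    // Fits when the per-GPU weight shard leaves some memory for the KV cache.
    if (weights / tp < usablePerGpu) return tp;
  }
  // Assumed fallback: nothing fits, so report the largest size tried.
  return largestTried;
}
```

With these assumptions a 70B FP16 model on 80 GB GPUs resolves to TP2, the same model at FP8 stays at TP1, and a single-GPU node caps the result at 1 regardless of model size.
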
- `setFp8PrecisionEngineArgs` now strips `quantization`/`kv-cache-dtype` only
  when the value is `fp8` (the value the precision dropdown owns), so a user-set
  `awq`/`gptq` from the advanced engine-args editor is no longer clobbered when
  weight precision is not FP8
- add a non-blocking warning on the Deploy page when FP8 is selected but the
  throughput estimate is absent (errored/404), so an unsupported `fp8` flag is
  not submitted silently on hardware of unknown capability
- add unit tests for `setFp8PrecisionEngineArgs`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
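
The "strip only what the dropdown owns" rule can be illustrated as follows. The `EngineArgs` shape, the boolean parameters, and the decision to overwrite an existing value when FP8 is selected are assumptions; this is not the signature of the real `setFp8PrecisionEngineArgs`.

```typescript
type EngineArgs = Record<string, string>;

// Hypothetical helper: returns a new args object and never mutates the input.
function setFp8ArgsSketch(
  args: EngineArgs,
  weightsFp8: boolean,
  kvCacheFp8: boolean,
): EngineArgs {
  const next: EngineArgs = { ...args };

  // Remove only values the precision dropdown owns (fp8);
  // a user-set awq/gptq from the advanced editor is left alone.
  if (next['quantization'] === 'fp8') delete next['quantization'];
  if (next['kv-cache-dtype'] === 'fp8') delete next['kv-cache-dtype'];

  if (weightsFp8) next['quantization'] = 'fp8';
  if (kvCacheFp8) next['kv-cache-dtype'] = 'fp8';
  return next;
}
```

Calling it with `{ quantization: 'awq' }` and both flags off returns the `awq` value untouched, which is the regression the commit fixes.
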
surajssd added 7 commits June 10, 2026 11:07
- cap the explicit `contextLen` query param at `MAX_CONTEXT_LEN` (32768), not
  just the arch-inferred window, so a caller forwarding a model's huge
  advertised window (128K–1M) can't collapse the concurrency estimate toward
  zero
- bound the architecture cache with an LRU size cap (`ARCH_CACHE_MAX_ENTRIES`),
  evicting the least-recently-used entry once exceeded, so a wide scan of many
  distinct `modelId`/token keys no longer keeps every entry resident for the
  full TTL
- add tests for the explicit-`contextLen` cap and the LRU eviction

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
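
A compact sketch of a cache that is both TTL'd and LRU-bounded, in the spirit of the architecture cache described in these commits. The class name, constructor parameters, and injectable clock are assumptions for illustration; the real cache also scopes keys by `sha256(token)`, which this sketch leaves to the caller.

```typescript
class TtlLruCacheSketch<V> {
  // Map preserves insertion order, so the first key is the least recently used.
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // drop expired entries when they are read
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Deleting on expired read alone bounds the cache only by what is re-requested; the size cap is what prevents a wide scan of distinct keys from staying resident for the full TTL.
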
- correct `formatCount` doc example: `18234` formats to `"18k"` (not `"18.2k"`),
  since values `>= 10000` drop the decimal
- document that `TP_DECODE_EFFICIENCY` (0.85) is a flat factor and is optimistic
  for large TP groups, where the per-GPU haircut grows with `tpSize` and slower
  interconnect
- replace the dead `apiGroup` ternary (both branches returned `''`) with a
  direct `''` and a note that the real CRD group isn't stored on the
  `InferenceProviderConfig`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Addresses review findings in the GPU throughput estimator:

- `selectGpuForEstimate()` and `gpuSupportsFp8()` now use strict GPU lookups
  (`normalizeKnownGpuModel`/`getKnownGpuInfo`); an unknown GPU label is skipped
  or returns a `404` instead of being silently estimated as an `A10` with wrong
  speed and FP8 numbers
- the `/gpu-throughput` query schema no longer rejects a non-HF `modelId`; the
  handler gates the token-bearing HF fetch on `isValidHfRepoId()`, so a
  curated/custom id degrades to a bandwidth-only estimate from `paramCount`
  instead of a hard `400`
- `getModelArchitecture()` reads transformer dimensions from nested
  `text_config`/`llm_config`/`language_config`, so multimodal and composite
  models yield high-confidence concurrency estimates instead of per-chat-only
- add tests covering unknown-GPU skip/404, non-HF and malformed `modelId`
  degradation, and nested-config parsing

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
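
The nested-config lookup can be sketched like this. The three nested keys are from the commit message, and `num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `hidden_size`, and `head_dim` are common Hugging Face `config.json` fields; the `ArchSketch` shape, the fallbacks, and the function names are assumptions rather than the real `getModelArchitecture` logic.

```typescript
interface ArchSketch {
  numLayers: number;
  numHeads: number;
  numKvHeads: number;
  headDim: number;
}

type Json = Record<string, unknown>;

const NESTED_KEYS = ['text_config', 'llm_config', 'language_config'];

function num(v: unknown): number | undefined {
  return typeof v === 'number' && Number.isFinite(v) ? v : undefined;
}

// Read transformer dimensions from one config object, if present.
function readDims(cfg: Json): ArchSketch | undefined {
  const numLayers = num(cfg['num_hidden_layers']);
  const numHeads = num(cfg['num_attention_heads']);
  const hidden = num(cfg['hidden_size']);
  if (!numLayers || !numHeads || !hidden) return undefined;
  return {
    numLayers,
    numHeads,
    // Assumed fallbacks: no GQA field means KV heads equal attention heads.
    numKvHeads: num(cfg['num_key_value_heads']) ?? numHeads,
    headDim: num(cfg['head_dim']) ?? hidden / numHeads,
  };
}

// Try the top level first, then the nested language-model configs.
function architectureFromConfigSketch(config: Json): ArchSketch | undefined {
  const top = readDims(config);
  if (top) return top;
  for (const key of NESTED_KEYS) {
    const nested = config[key];
    if (nested && typeof nested === 'object') {
      const dims = readDims(nested as Json);
      if (dims) return dims;
    }
  }
  return undefined;
}
```

Returning `undefined` when no dimensions are found corresponds to the degraded, per-chat-only path described elsewhere in this PR.
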
The `eslint ^8 → ^10` bump broke `bun run lint` in both workspaces: ESLint v9+
requires a flat `eslint.config.*` (the repo never had one) and removed the
`--ext` flag both scripts used.

- add flat `eslint.config.mjs` to `backend` and `frontend`, linting TypeScript
  via the typescript-eslint `flat/recommended` preset
- add `@typescript-eslint/parser` and `@typescript-eslint/eslint-plugin` to
  backend (it previously declared only `eslint`, so it could not parse its own
  `.ts`); lockfile updated to match
- drop the removed `--ext` flag from both lint scripts and `--max-warnings 0`
  from frontend
- demote pre-existing `no-explicit-any`, `no-unused-vars`, and the newly-enabled
  react-hooks compiler rules to warnings so lint is green and CI-usable, leaving
  the historical backlog visible for incremental burndown

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- surface a non-blocking "model does not fit" warning on the deploy
  page when the high-confidence estimate leaves no room for the KV
  cache; `Deploy` stays enabled since the user may pick more GPUs per
  replica than the estimate assumed, and it is hidden when `fp8Blocked`
  already explains a blocking reason
- step `tpDecodeEfficiency` down by TP group size (1.0 for TP1, 0.85
  for TP2-4, 0.75 for TP>4) instead of a flat 0.85, so large
  tensor-parallel groups crossing NVLink domains are not over-estimated
- add tests for the no-fit warning and the new efficiency tiers

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Declare the return ref as `React.RefObject<T | null>` so it matches what
`useRef<T>(null)` produces under `@types/react@19`, fixing the TS2322 build
error that broke the `build` and `e2e-frontend` CI checks.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- Replace `React.RefObject` with a type-only `RefObject` import in
  `useInView` so the return type no longer relies on the UMD `React`
  global, which `tsc` may fail to resolve
- Update `useGpuThroughput` JSDoc to note the query gates on
  `paramCount` only, since the GPU model is now chosen server-side

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
@surajssd surajssd force-pushed the suraj/estimate-inf-token-throughput branch from 6808e31 to e5137e8 Compare June 10, 2026 18:49
`gpuSupportsFp8` only recognised Hopper, so the throughput estimator
silently downgraded an FP8 KV cache to fp16 and the Deploy page blocked
FP8 deployments on L40S/L4 — both of which have a native FP8 datapath.
vLLM gates full FP8 (W8A8 / FP8 KV cache) on compute capability >= 8.9,
covering Ada Lovelace and Hopper.

- Add `FP8_CAPABLE_GENERATIONS` and treat `Ada Lovelace` (L40S, L4) as
  FP8-capable alongside Hopper; Ampere and older stay excluded (A100 is
  weight-only W8A16, not the W8A8/FP8-KV path modelled here)
- Update `gpuSupportsFp8` tests to assert L40S/L4 are FP8-capable and
  pre-Ada GPUs are not
- Correct stale "Hopper-only" / "non-Hopper" wording in
  `installation.ts`, `DeployPage.tsx`, `DeploymentForm.tsx`, and the
  `GpuThroughputEstimate` shared type

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
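
A minimal sketch of generation-based FP8 gating as this commit describes it. The generation names and the Ada Lovelace/Hopper allow-list follow the commit message; the lookup table, the treatment of unknown GPUs as unsupported, and the `effectiveKvDtypeSketch` helper are assumptions, not the contents of `costEstimation.ts`.

```typescript
// Allow-list per the commit: compute capability >= 8.9 covers Ada Lovelace and Hopper.
const FP8_CAPABLE_GENERATIONS_SKETCH = new Set(['Ada Lovelace', 'Hopper']);

// Hypothetical spec table keyed by a normalized GPU model name.
const GPU_GENERATION_SKETCH: Record<string, string> = {
  H100: 'Hopper',
  H200: 'Hopper',
  L40S: 'Ada Lovelace',
  L4: 'Ada Lovelace',
  A100: 'Ampere',
  A10: 'Ampere',
  T4: 'Turing',
};

function gpuSupportsFp8Sketch(model: string): boolean {
  const generation = GPU_GENERATION_SKETCH[model];
  // Unknown GPUs are treated as unsupported rather than guessed at.
  return generation !== undefined && FP8_CAPABLE_GENERATIONS_SKETCH.has(generation);
}

// Downgrade an FP8 KV cache to FP16 for the estimate when the GPU lacks FP8.
function effectiveKvDtypeSketch(
  requested: 'fp8' | 'fp16',
  model: string,
): 'fp8' | 'fp16' {
  return requested === 'fp8' && !gpuSupportsFp8Sketch(model) ? 'fp16' : requested;
}
```

Gating on generation rather than on individual model names means a new Hopper or Ada Lovelace entry in the spec table picks up FP8 support without touching the check itself.
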
Copilot AI review requested due to automatic review settings June 10, 2026 19:16

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 34 out of 36 changed files in this pull request and generated 4 comments.

Comment thread frontend/src/pages/DeployPage.tsx Outdated
Comment thread frontend/src/pages/DeployPage.tsx Outdated
Comment thread frontend/src/pages/DeployPage.tsx Outdated
Comment thread frontend/src/components/deployments/DeploymentForm.tsx Outdated
The backend's FP8 capability check recognises Ada Lovelace (L40S, L4) and Hopper (H100,
H200), but the UI still told users only H100/H200 support FP8 — wrong
for L40S/L4 clusters.

- `DeployPage.tsx`: update the `fp8BlockReason` text and both weight /
  KV-cache `InfoHint` tooltips to mention L40S/L4
- `DeploymentForm.tsx`: update the fallback FP8 block message to match

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
@surajssd surajssd requested a review from robert-cronin June 10, 2026 22:13
@surajssd surajssd merged commit c4ee1c5 into kaito-project:main Jun 11, 2026
8 of 12 checks passed
@surajssd surajssd deleted the suraj/estimate-inf-token-throughput branch June 11, 2026 00:18
surajssd added a commit to surajssd/kubeairunway that referenced this pull request Jun 11, 2026
PR kaito-project#311 added frontend/eslint.config.mjs to main; this branch
independently added frontend/eslint.config.js. With both present ESLint
silently loads only the .js and ignores the .mjs.

Merge into the single .mjs (keeping kaito-project#311's parser/JSX wiring and
react-hooks recommended set) and delete the .js. The rules this PR
cleaned up — no-explicit-any, no-empty-object-type, no-unused-vars (with
argsIgnorePattern '^_') — are promoted to errors so they cannot regress;
the experimental react-hooks React-Compiler rules stay warnings per
kaito-project#311's backlog decision.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Surface GPU hardware details and estimated inference throughput

3 participants