feat: estimate inference throughput for models on cluster GPUs #311
Merged
surajssd merged 24 commits on Jun 11, 2026
Pull request overview
Adds an offline, heuristic GPU inference-throughput estimator and surfaces it in the model catalog and deploy UI, using cluster GPU specs + optional Hugging Face config.json architecture metadata (cached) to estimate per-chat decode speed and concurrency.
Changes:
- Backend: introduce a bandwidth/KV-cache–based estimator and a new `GET /installation/gpu-throughput` endpoint that selects a GPU pool and (when possible) uses HF `config.json` architecture for concurrency estimates.
- Frontend: add query-param builders + hook + `ThroughputEstimate` UI component, lazily fetching estimates only when cards scroll into view.
- Shared types + GPU specs: add `ModelArchitecture`/`GpuThroughputEstimate`, extend GPU model table with per-GPU memory bandwidth and H200 support.
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| shared/types/model.ts | Adds ModelArchitecture type used for KV sizing. |
| shared/types/installation.ts | Adds GpuThroughputEstimate API payload type. |
| frontend/src/pages/ModelsPage.tsx | Plumbs selected GPU model into model grids/search for throughput estimation. |
| frontend/src/pages/DeployPage.tsx | Fetches and displays throughput estimate on the deploy summary card. |
| frontend/src/lib/gpu-throughput-params.ts | Adds helpers to pick GPU model and build throughput query params. |
| frontend/src/lib/api.ts | Adds gpuOperatorApi.getThroughput client method and exports GpuThroughputEstimate. |
| frontend/src/hooks/useInView.ts | Adds in-view hook to defer per-card throughput fetches. |
| frontend/src/hooks/useGpuOperator.ts | Adds useGpuThroughput react-query hook and params typing. |
| frontend/src/components/models/ThroughputEstimate.tsx | New UI component to render per-chat + concurrency estimates with tooltip disclaimer. |
| frontend/src/components/models/ThroughputEstimate.test.tsx | Component tests for confident/low-confidence/loading/empty states. |
| frontend/src/components/models/ModelGrid.tsx | Threads gpuModel prop down to cards. |
| frontend/src/components/models/ModelCard.tsx | Adds lazy throughput fetching/display on curated model cards. |
| frontend/src/components/models/HfModelSearch.tsx | Threads gpuModel prop down to HF cards. |
| frontend/src/components/models/HfModelCard.tsx | Adds lazy throughput fetching/display on HF search result cards. |
| backend/src/services/huggingface.ts | Adds cached config.json architecture lookup keyed by model + token hash. |
| backend/src/services/gpuPerformance.ts | New estimator implementation for per-chat tok/s and concurrent capacity. |
| backend/src/services/gpuPerformance.test.ts | Unit tests covering estimator behavior and edge cases. |
| backend/src/services/costEstimation.ts | Extends GPU specs with memBandwidthGBs and adds H200. |
| backend/src/services/costEstimation.test.ts | Tests H200 normalization and bandwidth lookup. |
| backend/src/routes/installation.ts | Adds GET /installation/gpu-throughput endpoint and GPU selection logic. |
| .gitignore | Ignores .playwright-mcp/ and normalizes trailing newline. |
a1e39f8 to 4325784
Add a rough, offline inference-speed estimator (issue kaito-project#139) that surfaces two numbers per model: single-stream per-chat `tok/s` (memory-bandwidth bound) and KV-cache-gated concurrent capacity. No inference is run; all values are shown as estimates with a methodology disclaimer.

Backend:
- Add `gpuPerformance.ts` with `estimatePerChatTokensPerSec` and `estimateConcurrentCapacity` heuristics, plus `resolveParamCount` and `bytesPerWeightFor` helpers
- Add `GET /installation/gpu-throughput` route, selecting the GPU pool to estimate on and degrading to a low-confidence per-chat-only result when architecture data is missing
- Add `getModelArchitecture` to `huggingface.ts` to read transformer dims from `config.json`, with a token-scoped, TTL'd cache (gated configs keyed by `sha256(token)` so they never leak across callers)
- Add per-GPU `memBandwidthGBs` specs to `costEstimation.ts` and a new `H200-141GB` entry; export `GpuModelInfo`

Frontend:
- Add `ThroughputEstimate` component (per-chat + concurrency label with tooltip)
- Add `useGpuThroughput` hook, `gpuOperatorApi.getThroughput`, and `gpu-throughput-params` builders
- Add `useInView` so catalog cards defer the estimate fetch until visible
- Wire estimates into `ModelCard`, `HfModelCard`, `ModelGrid`, `HfModelSearch`, `ModelsPage`, and the Deploy summary card

Shared:
- Add `GpuThroughputEstimate` and `ModelArchitecture` types

Chore:
- Ignore `.playwright-mcp/` and add a trailing newline to `.gitignore`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
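The memory-bandwidth-bound per-chat number can be illustrated with a minimal sketch. It assumes decode streams every weight from GPU memory once per generated token, so speed is roughly bandwidth divided by weight bytes. The function names, the quantization labels, and the 2000 GB/s figure are hypothetical and are not taken from `gpuPerformance.ts`.

```typescript
// Hypothetical sketch, not the PR's implementation.
type QuantizationSketch = "fp16" | "fp8" | "int4";

// Assumed byte widths per weight; anything unknown falls back to 2 bytes (fp16).
function bytesPerWeightSketch(q?: QuantizationSketch): number {
  switch (q) {
    case "fp8":
      return 1;
    case "int4":
      return 0.5;
    default:
      return 2;
  }
}

// Decode is modelled as memory-bandwidth bound: each generated token
// reads all model weights from GPU memory once.
function perChatTokensPerSecSketch(
  paramCount: number,
  memBandwidthGBs: number,
  q?: QuantizationSketch,
): number {
  const weightBytes = paramCount * bytesPerWeightSketch(q);
  const bandwidthBytesPerSec = memBandwidthGBs * 1e9;
  return bandwidthBytesPerSec / weightBytes;
}

// Illustrative inputs: 8e9 parameters at fp16 on a GPU with an assumed 2000 GB/s.
console.log(perChatTokensPerSecSketch(8e9, 2000)); // 125
```

Halving the bytes per weight (for example fp16 to fp8) doubles the estimate in this model, which is why weight precision appears as an input to the estimator.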
`selectGpuForEstimate` derived `maxContiguous` (the per-replica tensor-parallel ceiling) from cluster-wide or pool-total values, producing estimates for hardware the cluster lacks and `tpSize` values that exceed a single node's GPU count.

- Add `perNodeGpuCount` helper deriving a pool's per-node GPUs as `floor(gpuCount / nodeCount)`, since `gpuCount` is summed across nodes
- Validate an explicit `gpuModel` against `capacity.nodePools` and fall back to the highest-VRAM pool when the requested model is absent, instead of resolving it directly from the static GPU spec table
- Compute `maxContiguous` and `capacityLabel` from the selected pool's per-node count rather than `maxContiguousAvailable` / `pool.gpuCount`
- Import `NodePoolInfo` from `@airunway/shared`

Tests:
- Add `GET /api/installation/gpu-throughput` suite covering per-node `tpSize` clamping, fallback when the requested model is absent, the no-explicit-model path, and the empty-cluster `404`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
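A small sketch of the per-node derivation described above. The `floor(gpuCount / nodeCount)` formula comes from the commit message; the interface shape, the zero-node guard, and the clamp helper are assumptions made for illustration.

```typescript
// Hypothetical pool shape; the real NodePoolInfo type may carry more fields.
interface NodePoolSketch {
  gpuCount: number; // summed across all nodes in the pool
  nodeCount: number;
}

// Per-node GPUs, since gpuCount is a pool-wide total.
function perNodeGpuCountSketch(pool: NodePoolSketch): number {
  if (pool.nodeCount <= 0) return 0; // assumed guard against empty pools
  return Math.floor(pool.gpuCount / pool.nodeCount);
}

// Assumed clamp: a replica's tensor-parallel size cannot exceed one node's GPUs.
function clampTpSizeSketch(requested: number, pool: NodePoolSketch): number {
  return Math.max(1, Math.min(requested, perNodeGpuCountSketch(pool)));
}

const pool: NodePoolSketch = { gpuCount: 16, nodeCount: 2 };
console.log(perNodeGpuCountSketch(pool)); // 8
console.log(clampTpSizeSketch(16, pool)); // 8
```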
Expired entries in `architectureCache` were left in the map indefinitely, so over time (many distinct `modelId`/token keys) it could grow unbounded despite the TTL.

- Delete a cache entry in `getModelArchitecture` when it is found expired, keeping the cache bounded by "used within TTL"

Tests:
- Add a `getModelArchitecture` suite covering config.json parsing, fresh cache reuse, eviction + re-fetch after the TTL, and the non-ok `undefined` fallback

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
`useGpuThroughput`'s react-query `queryKey` did not account for auth state, so a high-confidence estimate fetched for a gated model while logged in could still be served from cache after logout, even though the backend can no longer read `config.json` without the token.

- Add an `'auth' | 'anon'` discriminator (derived from token presence, never the token itself) to the `queryKey`, forcing a recompute when switching between authenticated and anonymous states

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
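The discriminator idea can be sketched without react-query itself. The key layout and helper names below are hypothetical; the only behaviour taken from the commit is that the key varies with token presence and never contains the token.

```typescript
type AuthStateSketch = "auth" | "anon";

// Derive the discriminator from token presence only.
function authDiscriminatorSketch(token?: string | null): AuthStateSketch {
  return token ? "auth" : "anon";
}

// Hypothetical key builder; the real hook's key segments may differ.
function throughputQueryKeySketch(
  params: Record<string, string | number | undefined>,
  token?: string | null,
): readonly unknown[] {
  return ["gpu-throughput", params, authDiscriminatorSketch(token)];
}

const params = { modelId: "org/model", paramCount: 8e9 };
const loggedIn = JSON.stringify(throughputQueryKeySketch(params, "hf_secret"));
const loggedOut = JSON.stringify(throughputQueryKeySketch(params, null));

console.log(loggedIn !== loggedOut); // true: logging out forces a new cache entry
console.log(loggedIn.includes("hf_secret")); // false: the token never enters the key
```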
- Add `doesNotFit` to `GpuThroughputEstimate` and set it when model weights plus headroom leave no room for KV cache; render an explicit "Does not fit — no room for KV cache" warning instead of a misleading per-chat speed
- Resolve KV context length after fetching model architecture, falling back to `maxPositionEmbeddings` (capped at `MAX_INFERRED_CONTEXT_LEN`) so long-context HF models are no longer sized against the 4K default
- Add backend route and frontend component tests for both cases

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
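A rough sketch of how a KV-cache budget can gate concurrency and produce a `doesNotFit` flag. It assumes the standard per-token KV size (keys plus values, per layer, per KV head) and a single GPU; the field names, the headroom value, and the return shape are illustrative and not taken from `estimateConcurrentCapacity`.

```typescript
// Hypothetical architecture shape; the PR's ModelArchitecture type may differ.
interface ArchSketch {
  numLayers: number;
  numKvHeads: number;
  headDim: number;
}

// Keys and values (factor 2) stored per layer, per KV head, per head dimension.
function kvBytesPerTokenSketch(arch: ArchSketch, bytesPerKv = 2): number {
  return 2 * arch.numLayers * arch.numKvHeads * arch.headDim * bytesPerKv;
}

function concurrentCapacitySketch(opts: {
  vramBytes: number;
  weightBytes: number;
  headroomBytes: number;
  arch: ArchSketch;
  contextLen: number;
  bytesPerKv?: number;
}): { doesNotFit: boolean; maxConcurrent: number } {
  const free = opts.vramBytes - opts.weightBytes - opts.headroomBytes;
  if (free <= 0) return { doesNotFit: true, maxConcurrent: 0 };
  const perSequence = kvBytesPerTokenSketch(opts.arch, opts.bytesPerKv) * opts.contextLen;
  return { doesNotFit: false, maxConcurrent: Math.floor(free / perSequence) };
}

const GiB = 2 ** 30;
// Illustrative dimensions: 32 layers, 8 KV heads, head dimension 128.
const arch: ArchSketch = { numLayers: 32, numKvHeads: 8, headDim: 128 };
console.log(
  concurrentCapacitySketch({
    vramBytes: 80 * GiB,
    weightBytes: 16 * GiB,
    headroomBytes: 8 * GiB,
    arch,
    contextLen: 4096,
  }),
); // { doesNotFit: false, maxConcurrent: 112 }
```

Because the per-sequence cost scales linearly with `contextLen`, sizing against a very large advertised window shrinks the concurrency figure sharply, which is the motivation for capping the inferred context length.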
- Add **Model Weights Precision** and **KV Cache Precision** dropdowns in a new "Performance & Precision" section, and move the throughput estimate there from the model summary card
- Decouple KV-cache precision from weight quantization: add `bytesPerKvFor` (defaults to 2 bytes) and thread `kvCacheDtype` through `api.ts`, `useGpuOperator`, and `gpu-throughput-params`
- Feed FP8 into the deployment as `--quantization fp8` / `--kv-cache-dtype fp8` engine args for `vllm`/`sglang` only
- Gate FP8 to Hopper (H100/H200) via `gpuSupportsFp8`: downgrade an FP8 KV cache to FP16 in the estimate and disable Deploy with a reason on non-Hopper GPUs
- Surface aggregate tokens/sec total in the `ThroughputEstimate` label
- Refactor GPU selection so the backend is the single source of truth: replace `pickGpuModel` with `hasEstimableGpu` and stop forwarding `gpuModel` from the client
- Add tests for `bytesPerKvFor`, `gpuSupportsFp8`, KV/weight decoupling, and FP8 gating

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
The backend already picks the estimate GPU (highest per-GPU VRAM); the frontend no longer forwards `gpuModel`, so the threaded value was dead.

- Delete the deprecated `pickGpuModel()` helper from `gpu-throughput-params.ts` and narrow `buildThroughputParamsForGpu`'s gate param to `boolean`
- Replace the `gpuModel?: string` prop with a `gpuPresent?: boolean` presence flag across `ModelsPage`, `ModelGrid`, `HfModelSearch`, `ModelCard`, and `HfModelCard`
- Source `gpuPresent` from `hasEstimableGpu(detailedCapacity)` in `ModelsPage`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
A user-supplied `modelId` was interpolated raw into Hugging Face URLs while forwarding the caller's `X-HF-Token`, so malformed ids (`../../…`, extra `/segments`, query fragments, or whitespace) could steer authenticated outbound requests to unintended paths.

- Add `isValidHfRepoId()` + `encodeHfRepoPath()` helpers to `huggingface.ts` (1-2 safe segments, no `.`/`..` traversal, ≤96 chars/segment, per-segment `encodeURIComponent`)
- Guard and encode both token-forwarding fetches: `getModelArchitecture` (returns `undefined` on an invalid id) and `getGgufFiles` (throws)
- Reject at the route edges too: `.refine(isValidHfRepoId)` on the `gpu-throughput` query schema and a 400 guard on the greedy `/:modelId/gguf-files` route
- Add service + route tests covering traversal/unsafe ids and encoded URLs

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
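The validation rules listed above (one or two segments, no `.`/`..`, at most 96 characters per segment, per-segment `encodeURIComponent`) can be sketched as follows. The allowed character class is an assumption, since the commit only says "safe segments", and the function names carry a `Sketch` suffix to mark them as illustrative.

```typescript
// Assumed definition of a "safe" segment: letters, digits, dot, underscore, hyphen.
const SAFE_SEGMENT = /^[A-Za-z0-9._-]+$/;

function isValidHfRepoIdSketch(id: string): boolean {
  const segments = id.split("/");
  if (segments.length < 1 || segments.length > 2) return false;
  return segments.every(
    (s) =>
      s.length > 0 &&
      s.length <= 96 &&
      s !== "." &&
      s !== ".." &&
      SAFE_SEGMENT.test(s),
  );
}

// Encode each segment separately so the "/" separator survives.
function encodeHfRepoPathSketch(id: string): string {
  if (!isValidHfRepoIdSketch(id)) {
    throw new Error("invalid Hugging Face repo id");
  }
  return id
    .split("/")
    .map((s) => encodeURIComponent(s))
    .join("/");
}

console.log(isValidHfRepoIdSketch("microsoft/Phi-4-mini-instruct")); // true
console.log(isValidHfRepoIdSketch("../../etc/passwd")); // false
console.log(isValidHfRepoIdSketch("org/model?download=1")); // false
console.log(encodeHfRepoPathSketch("microsoft/Phi-4-mini-instruct")); // microsoft/Phi-4-mini-instruct
```

Validating before encoding matters here: encoding alone would turn `..` into a harmless-looking but still meaningful path segment, so traversal has to be rejected outright.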
- bound `paramCount` with `.max(9_000_000_000_000)` so out-of-range values are rejected with 400 instead of yielding garbage throughput estimates, matching the existing `contextLen` and `tpSize` caps
- add a test asserting a `paramCount` above the cap returns 400 and never reaches the estimator

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
The backend and frontend each carried their own parameter-count parser that had drifted (whitespace boundaries, the `illion` suffix, and the backend-only `< 10000` sanity guard), so a fix to one never reached the other. Consolidate on a single source of truth.

- add canonical `parseParameterCountFromName` and `resolveModelParamCount` to `@airunway/shared` (`shared/types/modelParams.ts`), keeping the stricter backend parsing behaviour
- rewire `modelCompatibility.ts` to import `parseParameterCountFromName` from shared, and drop the now-dead `resolveParamCount` from `gpuPerformance.ts` (no production callers remained)
- replace the divergent inline regex in `gpu-throughput-params.ts` with the shared `resolveModelParamCount`
- repoint the backend tests to the shared implementation

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
The per-chat tokens/sec estimate used one GPU worth of bandwidth and ignored `tpSize`, underestimating speed by ~`tpSize×` for multi-GPU replicas, while the UI tooltip implied the GPU count was already factored in.

- Scale effective bandwidth by `tpSize × TP_DECODE_EFFICIENCY` (0.85) in `estimatePerChatTokensPerSec`; `tpSize=1` reproduces the exact single-GPU number
- Thread the resolved `effectiveTpSize` into the per-chat call in the `gpu-throughput` route
- Add tests for the TP speedup ratio and the `tpSize=1` regression guard

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
`useGpuThroughput` set no `retry`, so it inherited the global `retry: 3`. A 404 (no cluster GPU pool maps to a known spec) is deterministic, so every model card scrolled into view fired 1 + 3 = 4 doomed requests.

- Add a status-aware `retry` predicate to the `useGpuThroughput` query: skip retries on any 4xx, keep a small budget (`failureCount < 2`) for transient 5xx/network failures
- Add `useGpuOperator.test.tsx` covering the no-retry path (404/400 fire one request) and the 5xx retry path, using a `QueryClient` with retries enabled so the per-query override is actually exercised

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
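A status-aware predicate of the kind described above can be written as a plain function with the `(failureCount, error)` shape that react-query's `retry` option accepts. How the real API client attaches the HTTP status to its errors is not shown in this PR page, so the `status` property below is an assumption.

```typescript
// Assumed error shape: the API client exposes the HTTP status on the error.
interface HttpErrorSketch {
  status?: number;
}

function shouldRetryThroughputSketch(failureCount: number, error: unknown): boolean {
  const status =
    typeof error === "object" && error !== null
      ? (error as HttpErrorSketch).status
      : undefined;

  // 4xx responses are deterministic (e.g. 404: no known GPU pool), so never retry.
  if (typeof status === "number" && status >= 400 && status < 500) return false;

  // Small budget for transient 5xx or network failures.
  return failureCount < 2;
}

console.log(shouldRetryThroughputSketch(0, { status: 404 })); // false
console.log(shouldRetryThroughputSketch(0, { status: 503 })); // true
console.log(shouldRetryThroughputSketch(2, { status: 503 })); // false
console.log(shouldRetryThroughputSketch(1, new Error("network down"))); // true
```

Such a function would be passed as the query's `retry` option so it overrides the global default for this query only.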
Tighten the loose `string` parameters to the unions already exported alongside each function, so the validated enum flows through without being widened back to `string`:

- `bytesPerWeightFor(quantization?: string)` -> `quantization?: Quantization`
- `bytesPerKvFor(dtype?: string)` -> `dtype?: KvCacheDtype`

The runtime switch and `undefined` defaults are unchanged, so behavior is identical; an invalid value is now a compile-time error at the boundary instead of silently mapping to the 2-byte default.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
HuggingFace search cards carry no `minGpus`, so they sent no `tpSize` and the backend defaulted to tp=1, making large models spuriously report "does not fit" while the curated/Deploy tabs showed full capacity for the same model and cluster.

- Add `deriveTpSizeToFitWeights` in `gpuPerformance.ts`: returns the smallest power-of-two TP size whose per-GPU weight shard leaves room for a KV cache, bounded by `maxContiguous`; the fit test mirrors `estimateConcurrentCapacity`.
- Wire it into the `/gpu-throughput` route so an omitted `tpSize` is derived instead of defaulting to 1; an explicit `tpSize` still wins.
- Add tests covering the TP bump, `maxContiguous` cap, single-GPU cap, small-model/fp8 stay-at-1, unknown paramCount, headroom-exceeds-VRAM, and cross-tab consistency.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
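The "smallest power-of-two TP size that fits" rule can be sketched as a doubling loop. The parameter names, the strict `<` fit test, and the handling of an unknown weight size are assumptions; the real `deriveTpSizeToFitWeights` mirrors the fit test in `estimateConcurrentCapacity`, which is not reproduced on this page.

```typescript
// Hypothetical sketch of TP-size derivation; not the PR's implementation.
function deriveTpSizeSketch(
  weightBytes: number | undefined,
  vramBytesPerGpu: number,
  headroomBytesPerGpu: number,
  maxContiguous: number,
): number {
  // Assumed: with no known weight size there is nothing to fit, so stay at 1.
  if (weightBytes === undefined) return 1;

  const usablePerGpu = vramBytesPerGpu - headroomBytesPerGpu;
  let tp = 1;
  // Double while the per-GPU shard leaves no room and the next size stays within the cap.
  while (weightBytes / tp >= usablePerGpu && tp * 2 <= maxContiguous) {
    tp *= 2;
  }
  return tp;
}

// Illustrative: 140 GB of weights on 80 GB GPUs with 8 GB headroom each.
console.log(deriveTpSizeSketch(140e9, 80e9, 8e9, 8)); // 2
// Capped by a single-GPU node even though the shard does not fit.
console.log(deriveTpSizeSketch(140e9, 80e9, 8e9, 1)); // 1
// A small model stays at tp=1.
console.log(deriveTpSizeSketch(16e9, 80e9, 8e9, 8)); // 1
```

When the cap is reached without fitting, this sketch returns the largest power of two within the cap and leaves the "does not fit" decision to the capacity estimate.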
- `setFp8PrecisionEngineArgs` now strips `quantization`/`kv-cache-dtype` only when the value is `fp8` (the value the precision dropdown owns), so a user-set `awq`/`gptq` from the advanced engine-args editor is no longer clobbered when weight precision is not FP8
- add a non-blocking warning on the Deploy page when FP8 is selected but the throughput estimate is absent (errored/404), so an unsupported `fp8` flag is not submitted silently on hardware of unknown capability
- add unit tests for `setFp8PrecisionEngineArgs`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- cap the explicit `contextLen` query param at `MAX_CONTEXT_LEN` (32768), not just the arch-inferred window, so a caller forwarding a model's huge advertised window (128K–1M) can't collapse the concurrency estimate toward zero
- bound the architecture cache with an LRU size cap (`ARCH_CACHE_MAX_ENTRIES`), evicting the least-recently-used entry once exceeded, so a wide scan of many distinct `modelId`/token keys no longer keeps every entry resident for the full TTL
- add tests for the explicit-`contextLen` cap and the LRU eviction

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
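A TTL plus LRU bound of the kind described can be sketched on top of a `Map`, which iterates in insertion order. The class name, the injectable clock, and the key format are assumptions for illustration; the real `architectureCache` keys gated entries by `sha256(token)`, which is omitted here.

```typescript
interface CacheEntrySketch<V> {
  value: V;
  expiresAt: number;
}

// Hypothetical TTL + LRU cache; not the PR's architectureCache.
class TtlLruCacheSketch<V> {
  private readonly entries = new Map<string, CacheEntrySketch<V>>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // drop expired entries when they are found
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

const cache = new TtlLruCacheSketch<string>(2, 1000);
cache.set("model-a", "arch-a");
cache.set("model-b", "arch-b");
cache.get("model-a"); // touch model-a so model-b becomes least recently used
cache.set("model-c", "arch-c"); // evicts model-b
console.log(cache.get("model-b")); // undefined
console.log(cache.size); // 2
```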
- correct `formatCount` doc example: `18234` formats to `"18k"` (not `"18.2k"`), since values `>= 10000` drop the decimal
- document that `TP_DECODE_EFFICIENCY` (0.85) is a flat factor and is optimistic for large TP groups, where the per-GPU haircut grows with `tpSize` and slower interconnect
- replace the dead `apiGroup` ternary (both branches returned `''`) with a direct `''` and a note that the real CRD group isn't stored on the `InferenceProviderConfig`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Addresses review findings in the GPU throughput estimator:

- `selectGpuForEstimate()` and `gpuSupportsFp8()` now use strict GPU lookups (`normalizeKnownGpuModel`/`getKnownGpuInfo`); an unknown GPU label is skipped or returns a `404` instead of being silently estimated as an `A10` with wrong speed and FP8 numbers
- the `/gpu-throughput` query schema no longer rejects a non-HF `modelId`; the handler gates the token-bearing HF fetch on `isValidHfRepoId()`, so a curated/custom id degrades to a bandwidth-only estimate from `paramCount` instead of a hard `400`
- `getModelArchitecture()` reads transformer dimensions from nested `text_config`/`llm_config`/`language_config`, so multimodal and composite models yield high-confidence concurrency estimates instead of per-chat-only
- add tests covering unknown-GPU skip/404, non-HF and malformed `modelId` degradation, and nested-config parsing

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
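The nested-config lookup can be illustrated with a short sketch. The three nested keys come from the commit message and `num_hidden_layers` is a standard Hugging Face `config.json` field; the helper names and the precedence (top level first, then the nested keys in the listed order) are assumptions.

```typescript
type JsonObjectSketch = Record<string, unknown>;

function isJsonObjectSketch(v: unknown): v is JsonObjectSketch {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

const NESTED_TEXT_CONFIG_KEYS = ["text_config", "llm_config", "language_config"];

// Return the object that actually carries the transformer dimensions.
function pickTextConfigSketch(config: JsonObjectSketch): JsonObjectSketch | undefined {
  if (typeof config["num_hidden_layers"] === "number") return config;
  for (const key of NESTED_TEXT_CONFIG_KEYS) {
    const nested = config[key];
    if (isJsonObjectSketch(nested) && typeof nested["num_hidden_layers"] === "number") {
      return nested;
    }
  }
  return undefined; // caller would degrade to a per-chat-only estimate
}

// Illustrative multimodal-style config with dimensions under text_config.
const multimodal: JsonObjectSketch = {
  model_type: "some-vlm",
  text_config: { num_hidden_layers: 32, num_key_value_heads: 8 },
};
console.log(pickTextConfigSketch(multimodal)?.["num_hidden_layers"]); // 32
console.log(pickTextConfigSketch({ model_type: "unknown" })); // undefined
```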
The `eslint ^8 → ^10` bump broke `bun run lint` in both workspaces: ESLint v9+ requires a flat `eslint.config.*` (the repo never had one) and removed the `--ext` flag both scripts used.

- add flat `eslint.config.mjs` to `backend` and `frontend`, linting TypeScript via the typescript-eslint `flat/recommended` preset
- add `@typescript-eslint/parser` and `@typescript-eslint/eslint-plugin` to backend (it previously declared only `eslint`, so it could not parse its own `.ts`); lockfile updated to match
- drop the removed `--ext` flag from both lint scripts and `--max-warnings 0` from frontend
- demote pre-existing `no-explicit-any`, `no-unused-vars`, and the newly-enabled react-hooks compiler rules to warnings so lint is green and CI-usable, leaving the historical backlog visible for incremental burndown

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- surface a non-blocking "model does not fit" warning on the deploy page when the high-confidence estimate leaves no room for the KV cache; `Deploy` stays enabled since the user may pick more GPUs per replica than the estimate assumed, and it is hidden when `fp8Blocked` already explains a blocking reason
- step `tpDecodeEfficiency` down by TP group size (1.0 for TP1, 0.85 for TP2-4, 0.75 for TP>4) instead of a flat 0.85, so large tensor-parallel groups crossing NVLink domains are not over-estimated
- add tests for the no-fit warning and the new efficiency tiers

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
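The stepped efficiency tiers translate directly into a small function. The tier values are taken from the commit message; the function names and the 1000 GB/s per-GPU figure are illustrative.

```typescript
// Tiers from the commit: 1.0 for TP1, 0.85 for TP2-4, 0.75 for TP>4.
function tpDecodeEfficiencySketch(tpSize: number): number {
  if (tpSize <= 1) return 1.0;
  if (tpSize <= 4) return 0.85;
  return 0.75;
}

// Effective bandwidth seen by a single decode stream across the TP group.
function effectiveBandwidthGBsSketch(perGpuGBs: number, tpSize: number): number {
  return perGpuGBs * tpSize * tpDecodeEfficiencySketch(tpSize);
}

for (const tp of [1, 2, 4, 8]) {
  const gbs = effectiveBandwidthGBsSketch(1000, tp);
  console.log(`tp=${tp} efficiency=${tpDecodeEfficiencySketch(tp)} effective=${gbs.toFixed(0)} GB/s`);
}
```

With these tiers, `tpSize=1` reproduces the single-GPU number unchanged, and an 8-way group is credited with 6x rather than 8x the per-GPU bandwidth.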
Declare the return ref as `React.RefObject<T | null>` so it matches what `useRef<T>(null)` produces under `@types/react@19`, fixing the TS2322 build error that broke the `build` and `e2e-frontend` CI checks. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- Replace `React.RefObject` with a type-only `RefObject` import in `useInView` so the return type no longer relies on the UMD `React` global, which `tsc` may fail to resolve
- Update `useGpuThroughput` JSDoc to note the query gates on `paramCount` only, since the GPU model is now chosen server-side

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
6808e31 to e5137e8
`gpuSupportsFp8` only recognised Hopper, so the throughput estimator silently downgraded an FP8 KV cache to fp16 and the Deploy page blocked FP8 deployments on L40S/L4, both of which have a native FP8 datapath. vLLM gates full FP8 (W8A8 / FP8 KV cache) on compute capability >= 8.9, covering Ada Lovelace and Hopper.

- Add `FP8_CAPABLE_GENERATIONS` and treat `Ada Lovelace` (L40S, L4) as FP8-capable alongside Hopper; Ampere and older stay excluded (A100 is weight-only W8A16, not the W8A8/FP8-KV path modelled here)
- Update `gpuSupportsFp8` tests to assert L40S/L4 are FP8-capable and pre-Ada GPUs are not
- Correct stale "Hopper-only" / "non-Hopper" wording in `installation.ts`, `DeployPage.tsx`, `DeploymentForm.tsx`, and the `GpuThroughputEstimate` shared type

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
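A generation-based gate like the one described can be sketched with a set lookup. The generation names come from the commit message; the lookup table keys and its shape are hypothetical (the real spec table uses entries such as `H200-141GB`), and unknown labels are treated as not FP8-capable here.

```typescript
// Generations treated as FP8-capable, per the commit message.
const FP8_CAPABLE_GENERATIONS_SKETCH: ReadonlySet<string> = new Set([
  "Ada Lovelace",
  "Hopper",
]);

// Hypothetical model-to-generation table for illustration only.
const GPU_GENERATION_SKETCH: Record<string, string | undefined> = {
  H100: "Hopper",
  H200: "Hopper",
  L40S: "Ada Lovelace",
  L4: "Ada Lovelace",
  A100: "Ampere",
  A10: "Ampere",
};

function gpuSupportsFp8Sketch(gpuModel: string): boolean {
  const generation = GPU_GENERATION_SKETCH[gpuModel];
  // Strict lookup: an unknown label is never assumed to support FP8.
  return generation !== undefined && FP8_CAPABLE_GENERATIONS_SKETCH.has(generation);
}

console.log(gpuSupportsFp8Sketch("L40S")); // true
console.log(gpuSupportsFp8Sketch("H200")); // true
console.log(gpuSupportsFp8Sketch("A100")); // false
console.log(gpuSupportsFp8Sketch("mystery-gpu")); // false
```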
The backend now recognises Ada Lovelace (L40S, L4) and Hopper (H100, H200) as FP8-capable, but the UI still told users only H100/H200 support FP8, which is wrong for L40S/L4 clusters.

- `DeployPage.tsx`: update the `fp8BlockReason` text and both weight / KV-cache `InfoHint` tooltips to mention L40S/L4
- `DeploymentForm.tsx`: update the fallback FP8 block message to match

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
robert-cronin approved these changes Jun 10, 2026
surajssd added a commit to surajssd/kubeairunway that referenced this pull request Jun 11, 2026
PR kaito-project#311 added frontend/eslint.config.mjs to main; this branch independently added frontend/eslint.config.js. With both present ESLint silently loads only the .js and ignores the .mjs. Merge into the single .mjs (keeping kaito-project#311's parser/JSX wiring and react-hooks recommended set) and delete the .js. The rules this PR cleaned up — no-explicit-any, no-empty-object-type, no-unused-vars (with argsIgnorePattern '^_') — are promoted to errors so they cannot regress; the experimental react-hooks React-Compiler rules stay warnings per kaito-project#311's backlog decision. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Description
Adds an offline inference-throughput estimator (issue #139) that surfaces two rough numbers per model without running any inference:
- Per-chat `tok/s`: single-stream decode speed, memory-bandwidth bound ("how snappy chat feels")
- Concurrent capacity (aggregate `tok/s`): KV-cache-budget gated, per replica ("how many requests at once")

The estimate is shown on catalog cards (deferred until the card scrolls into view) and on the Deploy page, where it sits in a new Performance & Precision section alongside weight- and KV-cache-precision controls. All values are presented as estimates with a methodology disclaimer. The branch also hardens the supporting backend (input validation, caching, GPU selection), unifies parameter-count parsing across the stack, and restores `bun run lint` by migrating both workspaces to ESLint flat config.

Type of Change
Related Issues
Fixes #139
Changes Made
Throughput estimator (backend)
- Add `gpuPerformance.ts` with `estimatePerChatTokensPerSec` and `estimateConcurrentCapacity` heuristics, plus `deriveTpSizeToFitWeights`, `bytesPerWeightFor`, and `bytesPerKvFor` helpers
- Add `GET /installation/gpu-throughput` route: selects the GPU pool to estimate on, derives `tpSize` when the caller sends no `minGpus` hint, and degrades to a low-confidence per-chat-only result when architecture data is missing
- Add `getModelArchitecture` to `huggingface.ts` to read transformer dims from `config.json` (including nested `text_config`/`llm_config`/`language_config` for multimodal models), with a token-scoped, TTL'd, LRU-bounded cache (gated configs keyed by `sha256(token)` so they never leak across callers)
- Add per-GPU `memBandwidthGBs` specs and an `H200-141GB` entry to `costEstimation.ts`; gate FP8 to FP8-capable GPUs (Hopper and Ada Lovelace) via `gpuSupportsFp8`

Throughput estimator (frontend)
- Add `ThroughputEstimate` component, `useGpuThroughput` hook, `gpuOperatorApi.getThroughput`, and `gpu-throughput-params` builders
- Add `useInView` so catalog cards defer the estimate fetch until visible
- Wire estimates into `ModelCard`, `HfModelCard`, `ModelGrid`, `HfModelSearch`, `ModelsPage`, and the Deploy summary card

Deploy page precision controls
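- Feed FP8 into the deployment as `--quantization fp8` / `--kv-cache-dtype fp8` engine args for `vllm`/`sglang` only
- Decouple KV-cache precision from weight quantization (`bytesPerKvFor` defaults to 2 bytes)
- Show a non-blocking "does not fit" warning (`Deploy` stays enabled, hidden when `fp8Blocked` already applies)

Correctness & hardening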
- Scale per-chat bandwidth by `tpSize × tpDecodeEfficiency(tpSize)`, stepped by group size (`1.0` for TP1, `0.85` for TP2-4, `0.75` for TP>4) so large groups crossing NVLink domains aren't over-estimated
- Clamp `tpSize` to a pool's per-node GPU count via `perNodeGpuCount`; validate an explicit `gpuModel` against `capacity.nodePools` and fall back to the highest-VRAM pool
- Validate and `encodeURIComponent`-encode `modelId` (`isValidHfRepoId`/`encodeHfRepoPath`) before any token-forwarding Hugging Face fetch, rejecting `../` traversal and stray segments
- Cap `paramCount` at 9T and the explicit `contextLen` at `MAX_CONTEXT_LEN` (32768); vary the react-query cache key by HF auth state (`auth`/`anon`) and stop retrying deterministic 4xx responses
- Use strict GPU lookups (an unknown label returns `404` instead of being silently treated as `A10`) and accept non-HF custom `modelId`s (degrade to bandwidth-only instead of a hard `400`)

Refactors
- Move parameter-count parsing into `@airunway/shared` (`shared/types/modelParams.ts`: `parseParameterCountFromName`, `resolveModelParamCount`), removing divergent backend/frontend copies
- Tighten estimator helper parameters to the `Quantization`/`KvCacheDtype` unions instead of `string`
- Remove the dead `gpuModel` client-side selection plumbing (`pickGpuModel`), making the backend the single source of truth for GPU selection

Build/CI
- Migrate `backend` and `frontend` to ESLint flat config (`eslint.config.mjs`); add `@typescript-eslint/parser` + plugin to backend; drop the removed `--ext` flag so `bun run lint` works under ESLint v9+

Testing
- Unit tests pass (`bun run test`)

New/updated test suites: `gpuPerformance.test.ts`, `huggingface.test.ts`, `installation.test.ts`, `costEstimation.test.ts`, `modelCompatibility.test.ts` (backend); `DeploymentForm.test.tsx`, `ThroughputEstimate.test.tsx`, `useGpuOperator.test.tsx` (frontend). They cover the per-chat/concurrency heuristics, TP-size derivation and decode-efficiency tiers, `config.json` parsing (incl. nested multimodal configs), TTL + LRU cache eviction, `modelId` validation/encoding, query-param bounds, unknown-GPU `404`, FP8 gating, and the non-blocking "does not fit" warning.

Manual Testing
Create a cluster with a GPU node pool, then check out this code and run the following:
Now go to http://localhost:5173/deploy/microsoft%2FPhi-4-mini-instruct and you can see the new section.
Checklist
- `bun run lint`

Screenshots
Look at the new Performance & Precision section.
Additional Notes
- The lint migration demotes the pre-existing `no-explicit-any`/`no-unused-vars` backlog (and the newly-enabled react-hooks compiler rules) to warnings so lint is green and CI-usable, leaving the historical backlog visible for incremental burndown rather than fixing it in this PR.
- A `tsc` error in `frontend/src/hooks/useInView.ts` (`RefObject<T | null>`) surfaced during validation; it is unrelated to the estimator logic and can be addressed separately.