feat: estimate inference throughput for models on cluster GPUs by surajssd · Pull Request #311 · kaito-project/airunway

surajssd · 2026-06-03T18:34:21Z

Description

Adds an offline inference-throughput estimator (issue #139) that surfaces two rough numbers per model without running any inference:

per-chat tok/s — single-stream decode speed, memory-bandwidth bound ("how snappy chat feels")
concurrent capacity / aggregate tok/s — KV-cache-budget gated, per replica ("how many requests at once")
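
As a rough illustration of the two numbers above, the sketch below computes a bandwidth-bound decode rate and a KV-cache-gated concurrency figure. It is not the code in `gpuPerformance.ts`: the function names, the absence of a headroom term, and the per-token KV-size formula (K and V tensors × layers × KV heads × head dim × bytes per element) are assumptions made for illustration.

```typescript
// Hypothetical sketch; names and formulas are illustrative assumptions.
interface ArchSketch {
  numLayers: number;
  numKvHeads: number;
  headDim: number;
}

// Decode is treated as memory-bandwidth bound: each generated token
// streams all weights once, so tok/s is roughly bandwidth / weight bytes.
function perChatTokensPerSecSketch(
  memBandwidthGBs: number,
  paramCount: number,
  bytesPerWeight: number,
): number {
  const weightBytes = paramCount * bytesPerWeight;
  return (memBandwidthGBs * 1e9) / weightBytes;
}

// Concurrency is gated by the VRAM left for KV cache after the weights.
function concurrentCapacitySketch(
  vramGB: number,
  paramCount: number,
  bytesPerWeight: number,
  arch: ArchSketch,
  contextLen: number,
  bytesPerKv: number,
): number {
  const freeBytes = vramGB * 1e9 - paramCount * bytesPerWeight;
  if (freeBytes <= 0) return 0; // weights alone do not fit
  // One K and one V tensor per layer, per token.
  const kvBytesPerToken =
    2 * arch.numLayers * arch.numKvHeads * arch.headDim * bytesPerKv;
  return Math.floor(freeBytes / (kvBytesPerToken * contextLen));
}
```

Under these assumptions, aggregate tok/s would be bounded by per-chat speed times concurrent capacity, which is why both figures are reported per replica.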

The estimate is shown on catalog cards (deferred until the card scrolls into view) and on the Deploy page, where it sits in a new Performance & Precision section alongside weight- and KV-cache-precision controls. All values are presented as estimates with a methodology disclaimer. The branch also hardens the supporting backend (input validation, caching, GPU selection), unifies parameter-count parsing across the stack, and restores bun run lint by migrating both workspaces to ESLint flat config.

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to change)
📚 Documentation update
🎨 UI/UX improvement
♻️ Refactoring (no functional changes)
🧪 Test update
🔧 Build/CI configuration

Related Issues

Fixes #139

Changes Made

Throughput estimator (backend)

Add gpuPerformance.ts with estimatePerChatTokensPerSec and estimateConcurrentCapacity heuristics, plus deriveTpSizeToFitWeights, bytesPerWeightFor, and bytesPerKvFor helpers
Add GET /installation/gpu-throughput route: selects the GPU pool to estimate on, derives tpSize when the caller sends no minGpus hint, and degrades to a low-confidence per-chat-only result when architecture data is missing
Add getModelArchitecture to huggingface.ts to read transformer dims from config.json (including nested text_config/llm_config/language_config for multimodal models), with a token-scoped, TTL'd, LRU-bounded cache (gated configs keyed by sha256(token) so they never leak across callers)
Add per-GPU memBandwidthGBs specs and an H200-141GB entry to costEstimation.ts; gate FP8 to Hopper via gpuSupportsFp8

Throughput estimator (frontend)

Add ThroughputEstimate component, useGpuThroughput hook, gpuOperatorApi.getThroughput, and gpu-throughput-params builders
Add useInView so catalog cards defer the estimate fetch until visible
Wire estimates into ModelCard, HfModelCard, ModelGrid, HfModelSearch, ModelsPage, and the Deploy summary card

Deploy page precision controls

Add Model Weights Precision and KV Cache Precision dropdowns; feed FP8 into the deployment as --quantization fp8 / --kv-cache-dtype fp8 engine args for vllm/sglang only
Decouple KV-cache precision from weight quantization (bytesPerKvFor defaults to 2 bytes)
Gate FP8 to H100/H200: downgrade an FP8 KV cache to FP16 in the estimate, and disable Deploy with a reason on non-Hopper GPUs
Surface a non-blocking "model does not fit" warning when the high-confidence estimate leaves no room for the KV cache (Deploy stays enabled, hidden when fp8Blocked already applies)

Correctness & hardening

Factor tensor parallelism into per-chat speed: scale bandwidth by tpSize × tpDecodeEfficiency(tpSize), stepped by group size (1.0 for TP1, 0.85 for TP2-4, 0.75 for TP>4) so large groups crossing NVLink domains aren't over-estimated
Bound tpSize to a pool's per-node GPU count via perNodeGpuCount; validate an explicit gpuModel against capacity.nodePools and fall back to the highest-VRAM pool
Validate and encodeURIComponent modelId (isValidHfRepoId / encodeHfRepoPath) before any token-forwarding Hugging Face fetch, rejecting ../ traversal and stray segments
Cap paramCount at 9T and the explicit contextLen at MAX_CONTEXT_LEN (32768); vary the react-query cache key by HF auth state (auth/anon) and stop retrying deterministic 4xx responses
Correct estimates for unknown GPUs (skip / 404 instead of silently treating as A10) and accept non-HF custom modelIds (degrade to bandwidth-only instead of a hard 400)

Refactors

Unify model parameter-count parsing in @airunway/shared (shared/types/modelParams.ts: parseParameterCountFromName, resolveModelParamCount), removing divergent backend/frontend copies
Type quant params with the exported Quantization / KvCacheDtype unions instead of string
Drop dead gpuModel client-side selection plumbing (pickGpuModel), making the backend the single source of truth for GPU selection

Build/CI

Migrate backend and frontend to ESLint flat config (eslint.config.mjs); add @typescript-eslint/parser + plugin to backend; drop the removed --ext flag so bun run lint works under ESLint v9+

Testing

Unit tests pass (bun run test)
Manual testing performed
Tested with a Kubernetes cluster

New/updated test suites: gpuPerformance.test.ts, huggingface.test.ts, installation.test.ts, costEstimation.test.ts, modelCompatibility.test.ts (backend); DeploymentForm.test.tsx, ThroughputEstimate.test.tsx, useGpuOperator.test.tsx (frontend). They cover the per-chat/concurrency heuristics, TP-size derivation and decode-efficiency tiers, config.json parsing (incl. nested multimodal configs), TTL + LRU cache eviction, modelId validation/encoding, query-param bounds, unknown-GPU 404, FP8 gating, and the non-blocking "does not fit" warning.

Manual Testing

Create a cluster with a GPU node pool, check out this branch, and run the following:

bun install
bun run dev

Now go to http://localhost:5173/deploy/microsoft%2FPhi-4-mini-instruct and you can see the new section.

Checklist

My code follows the project's style guidelines
I have run bun run lint
I have added tests that prove my fix/feature works
New and existing unit tests pass locally
I have updated documentation if needed
My changes generate no new warnings

Screenshots

Look at the new Performance & Precision section.

Additional Notes

All numbers are deliberately simple heuristics and are labelled as estimates in the UI — real throughput depends on the serving engine, batch scheduler, prompt lengths, and quantization.
The ESLint migration intentionally demotes the pre-existing no-explicit-any / no-unused-vars backlog (and the newly-enabled react-hooks compiler rules) to warnings so lint is green and CI-usable, leaving the historical backlog visible for incremental burndown rather than fixing it in this PR.
One pre-existing tsc error in frontend/src/hooks/useInView.ts (RefObject<T | null>) surfaced during validation; it is unrelated to the estimator logic and can be addressed separately.

Copilot

Pull request overview

Adds an offline, heuristic GPU inference-throughput estimator and surfaces it in the model catalog and deploy UI, using cluster GPU specs + optional Hugging Face config.json architecture metadata (cached) to estimate per-chat decode speed and concurrency.

Changes:

Backend: introduce a bandwidth/KV-cache–based estimator and a new GET /installation/gpu-throughput endpoint that selects a GPU pool and (when possible) uses HF config.json architecture for concurrency estimates.
Frontend: add query-param builders + hook + ThroughputEstimate UI component, lazily fetching estimates only when cards scroll into view.
Shared types + GPU specs: add ModelArchitecture/GpuThroughputEstimate, extend GPU model table with per-GPU memory bandwidth and H200 support.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
shared/types/model.ts	Adds `ModelArchitecture` type used for KV sizing.
shared/types/installation.ts	Adds `GpuThroughputEstimate` API payload type.
frontend/src/pages/ModelsPage.tsx	Plumbs selected GPU model into model grids/search for throughput estimation.
frontend/src/pages/DeployPage.tsx	Fetches and displays throughput estimate on the deploy summary card.
frontend/src/lib/gpu-throughput-params.ts	Adds helpers to pick GPU model and build throughput query params.
frontend/src/lib/api.ts	Adds `gpuOperatorApi.getThroughput` client method and exports `GpuThroughputEstimate`.
frontend/src/hooks/useInView.ts	Adds in-view hook to defer per-card throughput fetches.
frontend/src/hooks/useGpuOperator.ts	Adds `useGpuThroughput` react-query hook and params typing.
frontend/src/components/models/ThroughputEstimate.tsx	New UI component to render per-chat + concurrency estimates with tooltip disclaimer.
frontend/src/components/models/ThroughputEstimate.test.tsx	Component tests for confident/low-confidence/loading/empty states.
frontend/src/components/models/ModelGrid.tsx	Threads `gpuModel` prop down to cards.
frontend/src/components/models/ModelCard.tsx	Adds lazy throughput fetching/display on curated model cards.
frontend/src/components/models/HfModelSearch.tsx	Threads `gpuModel` prop down to HF cards.
frontend/src/components/models/HfModelCard.tsx	Adds lazy throughput fetching/display on HF search result cards.
backend/src/services/huggingface.ts	Adds cached `config.json` architecture lookup keyed by model + token hash.
backend/src/services/gpuPerformance.ts	New estimator implementation for per-chat tok/s and concurrent capacity.
backend/src/services/gpuPerformance.test.ts	Unit tests covering estimator behavior and edge cases.
backend/src/services/costEstimation.ts	Extends GPU specs with `memBandwidthGBs` and adds H200.
backend/src/services/costEstimation.test.ts	Tests H200 normalization and bandwidth lookup.
backend/src/routes/installation.ts	Adds `GET /installation/gpu-throughput` endpoint and GPU selection logic.
.gitignore	Ignores `.playwright-mcp/` and normalizes trailing newline.

Copilot

Pull request overview

Copilot reviewed 41 out of 43 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 41 out of 43 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 41 out of 43 changed files in this pull request and generated 1 comment.

Add a rough, offline inference-speed estimator (issue kaito-project#139) that surfaces two numbers per model: single-stream per-chat `tok/s` (memory-bandwidth bound) and KV-cache-gated concurrent capacity. No inference is run; all values are shown as estimates with a methodology disclaimer.

Backend:
- Add `gpuPerformance.ts` with `estimatePerChatTokensPerSec` and `estimateConcurrentCapacity` heuristics, plus `resolveParamCount` and `bytesPerWeightFor` helpers
- Add `GET /installation/gpu-throughput` route, selecting the GPU pool to estimate on and degrading to a low-confidence per-chat-only result when architecture data is missing
- Add `getModelArchitecture` to `huggingface.ts` to read transformer dims from `config.json`, with a token-scoped, TTL'd cache (gated configs keyed by `sha256(token)` so they never leak across callers)
- Add per-GPU `memBandwidthGBs` specs to `costEstimation.ts` and a new `H200-141GB` entry; export `GpuModelInfo`

Frontend:
- Add `ThroughputEstimate` component (per-chat + concurrency label with tooltip)
- Add `useGpuThroughput` hook, `gpuOperatorApi.getThroughput`, and `gpu-throughput-params` builders
- Add `useInView` so catalog cards defer the estimate fetch until visible
- Wire estimates into `ModelCard`, `HfModelCard`, `ModelGrid`, `HfModelSearch`, `ModelsPage`, and the Deploy summary card

Shared:
- Add `GpuThroughputEstimate` and `ModelArchitecture` types

Chore:
- Ignore `.playwright-mcp/` and add a trailing newline to `.gitignore`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

`selectGpuForEstimate` derived `maxContiguous` (the per-replica tensor-parallel ceiling) from cluster-wide or pool-total values, producing estimates for hardware the cluster lacks and `tpSize` values that exceed a single node's GPU count.

- Add `perNodeGpuCount` helper deriving a pool's per-node GPUs as `floor(gpuCount / nodeCount)`, since `gpuCount` is summed across nodes
- Validate an explicit `gpuModel` against `capacity.nodePools` and fall back to the highest-VRAM pool when the requested model is absent, instead of resolving it directly from the static GPU spec table
- Compute `maxContiguous` and `capacityLabel` from the selected pool's per-node count rather than `maxContiguousAvailable` / `pool.gpuCount`
- Import `NodePoolInfo` from `@airunway/shared`

Tests:
- Add `GET /api/installation/gpu-throughput` suite covering per-node `tpSize` clamping, fallback when the requested model is absent, the no-explicit-model path, and the empty-cluster `404`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
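
The per-node derivation in the commit above can be sketched as follows. Only the `floor(gpuCount / nodeCount)` formula comes from the commit message; the pool shape, the function names, and the choice to return at least 1 from the clamp are assumptions.

```typescript
// Hypothetical pool shape; only the two fields used here are modelled.
interface PoolSketch {
  gpuCount: number; // summed across all nodes in the pool
  nodeCount: number;
}

// Per-node GPU count, as floor(gpuCount / nodeCount).
function perNodeGpuCountSketch(pool: PoolSketch): number {
  if (pool.nodeCount <= 0) return 0; // assumed handling of an empty pool
  return Math.floor(pool.gpuCount / pool.nodeCount);
}

// A replica's tensor-parallel group has to fit on a single node, so the
// requested size is clamped to the per-node count (never below 1).
function clampTpSizeSketch(requestedTp: number, pool: PoolSketch): number {
  const perNode = perNodeGpuCountSketch(pool);
  return Math.max(1, Math.min(requestedTp, perNode));
}
```

For a pool of 2 nodes with 16 GPUs in total, the ceiling is 8 GPUs per replica rather than 16.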

Expired entries in `architectureCache` were left in the map indefinitely, so over time (many distinct `modelId`/token keys) it could grow unbounded despite the TTL.

- Delete a cache entry in `getModelArchitecture` when it is found expired, keeping the cache bounded by "used within TTL"

Tests:
- Add a `getModelArchitecture` suite covering config.json parsing, fresh cache reuse, eviction + re-fetch after the TTL, and the non-ok `undefined` fallback

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
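
A minimal sketch of the evict-on-expired-read behaviour described above, assuming a plain `Map` keyed by a string. The class name and the injected `now` parameter are hypothetical (the latter is there only to make the example deterministic); the real `architectureCache` may be structured differently.

```typescript
interface TtlEntrySketch<V> {
  value: V;
  expiresAt: number;
}

class TtlCacheSketch<V> {
  private readonly entries = new Map<string, TtlEntrySketch<V>>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (now >= entry.expiresAt) {
      // Drop the stale entry instead of leaving it resident forever.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Note that this only removes entries that are read again after expiry; keys that are never requested a second time stay in the map.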

`useGpuThroughput`'s react-query `queryKey` did not account for auth state, so a high-confidence estimate fetched for a gated model while logged in could still be served from cache after logout, even though the backend can no longer read `config.json` without the token.

- Add an `'auth' | 'anon'` discriminator (derived from token presence, never the token itself) to the `queryKey`, forcing a recompute when switching between authenticated and anonymous states

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
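
One way to build such a key is sketched below. The `'auth' | 'anon'` discriminator is from the commit message; the key prefix, the params shape, and the function names are assumptions.

```typescript
type AuthStateSketch = 'auth' | 'anon';

function authStateSketch(hfToken: string | null | undefined): AuthStateSketch {
  return hfToken ? 'auth' : 'anon';
}

// Hypothetical key builder for a react-query `queryKey`.
function throughputQueryKeySketch(
  params: Record<string, string | number | undefined>,
  hfToken: string | null | undefined,
): readonly unknown[] {
  // Only the presence of a token enters the key, never the token itself,
  // so the secret is not retained in the query cache.
  return ['gpu-throughput', params, authStateSketch(hfToken)] as const;
}
```

Because the key differs between the two states, a result cached while authenticated is not reused after logout.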

- Add `doesNotFit` to `GpuThroughputEstimate` and set it when model weights plus headroom leave no room for KV cache; render an explicit "Does not fit — no room for KV cache" warning instead of a misleading per-chat speed
- Resolve KV context length after fetching model architecture, falling back to `maxPositionEmbeddings` (capped at `MAX_INFERRED_CONTEXT_LEN`) so long-context HF models are no longer sized against the 4K default
- Add backend route and frontend component tests for both cases

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
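
The context-length fallback can be sketched as below. The order of precedence is an interpretation of the commit message, and both constant values are assumptions: the message does not state the value of `MAX_INFERRED_CONTEXT_LEN`, and 4096 is used here for the "4K default".

```typescript
// Assumed values, for illustration only.
const DEFAULT_CONTEXT_LEN_SKETCH = 4096;
const MAX_INFERRED_CONTEXT_LEN_SKETCH = 32768;

function resolveContextLenSketch(
  explicitContextLen?: number,
  maxPositionEmbeddings?: number,
): number {
  // A caller-supplied value wins.
  if (explicitContextLen !== undefined) return explicitContextLen;
  // Otherwise use the model's advertised window, capped so that very
  // long-context models do not collapse the concurrency estimate.
  if (maxPositionEmbeddings !== undefined) {
    return Math.min(maxPositionEmbeddings, MAX_INFERRED_CONTEXT_LEN_SKETCH);
  }
  return DEFAULT_CONTEXT_LEN_SKETCH;
}
```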

- Add **Model Weights Precision** and **KV Cache Precision** dropdowns in a new "Performance & Precision" section, and move the throughput estimate there from the model summary card
- Decouple KV-cache precision from weight quantization: add `bytesPerKvFor` (defaults to 2 bytes) and thread `kvCacheDtype` through `api.ts`, `useGpuOperator`, and `gpu-throughput-params`
- Feed FP8 into the deployment as `--quantization fp8` / `--kv-cache-dtype fp8` engine args for `vllm`/`sglang` only
- Gate FP8 to Hopper (H100/H200) via `gpuSupportsFp8`: downgrade an FP8 KV cache to FP16 in the estimate and disable Deploy with a reason on non-Hopper GPUs
- Surface aggregate tokens/sec total in the `ThroughputEstimate` label
- Refactor GPU selection so the backend is the single source of truth: replace `pickGpuModel` with `hasEstimableGpu` and stop forwarding `gpuModel` from the client
- Add tests for `bytesPerKvFor`, `gpuSupportsFp8`, KV/weight decoupling, and FP8 gating

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

The backend already picks the estimate GPU (highest per-GPU VRAM); the frontend no longer forwards `gpuModel`, so the threaded value was dead.

- Delete the deprecated `pickGpuModel()` helper from `gpu-throughput-params.ts` and narrow `buildThroughputParamsForGpu`'s gate param to `boolean`
- Replace the `gpuModel?: string` prop with a `gpuPresent?: boolean` presence flag across `ModelsPage`, `ModelGrid`, `HfModelSearch`, `ModelCard`, and `HfModelCard`
- Source `gpuPresent` from `hasEstimableGpu(detailedCapacity)` in `ModelsPage`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

A user-supplied `modelId` was interpolated raw into Hugging Face URLs while forwarding the caller's `X-HF-Token`, so malformed ids (`../../…`, extra `/segments`, query fragments, or whitespace) could steer authenticated outbound requests to unintended paths.

- Add `isValidHfRepoId()` + `encodeHfRepoPath()` helpers to `huggingface.ts` (1-2 safe segments, no `.`/`..` traversal, ≤96 chars/segment, per-segment `encodeURIComponent`)
- Guard and encode both token-forwarding fetches: `getModelArchitecture` (returns `undefined` on an invalid id) and `getGgufFiles` (throws)
- Reject at the route edges too: `.refine(isValidHfRepoId)` on the `gpu-throughput` query schema and a 400 guard on the greedy `/:modelId/gguf-files` route
- Add service + route tests covering traversal/unsafe ids and encoded URLs

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
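
A sketch of what such validation and encoding might look like, following the rules listed in the commit message (1-2 segments, no `.`/`..`, at most 96 characters per segment, per-segment `encodeURIComponent`). The allowed character set in the regular expression and the function names are assumptions; the real helpers in `huggingface.ts` may accept a different alphabet.

```typescript
// Assumed "safe" alphabet: letters, digits, dot, underscore, hyphen.
const SAFE_SEGMENT_SKETCH = /^[A-Za-z0-9._-]{1,96}$/;

function isValidHfRepoIdSketch(id: string): boolean {
  const segments = id.split('/');
  if (segments.length < 1 || segments.length > 2) return false;
  return segments.every(
    (s) => SAFE_SEGMENT_SKETCH.test(s) && s !== '.' && s !== '..',
  );
}

function encodeHfRepoPathSketch(id: string): string {
  if (!isValidHfRepoIdSketch(id)) {
    throw new Error(`invalid repo id: ${id}`);
  }
  // Encode each segment separately so the "/" separator survives.
  return id
    .split('/')
    .map((s) => encodeURIComponent(s))
    .join('/');
}
```

Validating before the fetch means a traversal attempt such as `../../x` is rejected by the segment count and never reaches a URL that carries the caller's token.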

- bound `paramCount` with `.max(9_000_000_000_000)` so out-of-range values are rejected with 400 instead of yielding garbage throughput estimates, matching the existing `contextLen` and `tpSize` caps
- add a test asserting a `paramCount` above the cap returns 400 and never reaches the estimator

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

The backend and frontend each carried their own parameter-count parser that had drifted (whitespace boundaries, the `illion` suffix, and the backend-only `< 10000` sanity guard), so a fix to one never reached the other. Consolidate on a single source of truth.

- add canonical `parseParameterCountFromName` and `resolveModelParamCount` to `@airunway/shared` (`shared/types/modelParams.ts`), keeping the stricter backend parsing behaviour
- rewire `modelCompatibility.ts` to import `parseParameterCountFromName` from shared, and drop the now-dead `resolveParamCount` from `gpuPerformance.ts` (no production callers remained)
- replace the divergent inline regex in `gpu-throughput-params.ts` with the shared `resolveModelParamCount`
- repoint the backend tests to the shared implementation

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

The per-chat tokens/sec estimate used one GPU worth of bandwidth and ignored `tpSize`, underestimating speed by ~`tpSize×` for multi-GPU replicas, while the UI tooltip implied the GPU count was already factored in.

- Scale effective bandwidth by `tpSize × TP_DECODE_EFFICIENCY` (0.85) in `estimatePerChatTokensPerSec`; `tpSize=1` reproduces the exact single-GPU number
- Thread the resolved `effectiveTpSize` into the per-chat call in the `gpu-throughput` route
- Add tests for the TP speedup ratio and the `tpSize=1` regression guard

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

`useGpuThroughput` set no `retry`, so it inherited the global `retry: 3`. A 404 (no cluster GPU pool maps to a known spec) is deterministic, so every model card scrolled into view fired 1 + 3 = 4 doomed requests.

- Add a status-aware `retry` predicate to the `useGpuThroughput` query: skip retries on any 4xx, keep a small budget (`failureCount < 2`) for transient 5xx/network failures
- Add `useGpuOperator.test.tsx` covering the no-retry path (404/400 fire one request) and the 5xx retry path, using a `QueryClient` with retries enabled so the per-query override is actually exercised

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
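
The predicate described above could look roughly like this. react-query accepts a `retry` function of the form `(failureCount, error) => boolean`; the error shape with an optional `status` field and the function name are assumptions about how the API client reports failures.

```typescript
// Hypothetical error shape: `status` is set when the server responded,
// and undefined for network-level failures.
interface HttpErrorSketch {
  status?: number;
}

function shouldRetryThroughputSketch(
  failureCount: number,
  error: HttpErrorSketch,
): boolean {
  const status = error.status;
  // 4xx responses are deterministic: retrying cannot change the outcome.
  if (status !== undefined && status >= 400 && status < 500) return false;
  // Small budget for transient 5xx or network failures.
  return failureCount < 2;
}
```

Passed as the query's `retry` option, this overrides the global `retry: 3` for this query only.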

Tighten the loose `string` parameters to the unions already exported alongside each function, so the validated enum flows through without being widened back to `string`:

- `bytesPerWeightFor(quantization?: string)` -> `quantization?: Quantization`
- `bytesPerKvFor(dtype?: string)` -> `dtype?: KvCacheDtype`

The runtime switch and `undefined` defaults are unchanged, so behavior is identical; an invalid value is now a compile-time error at the boundary instead of silently mapping to the 2-byte default.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

HuggingFace search cards carry no `minGpus`, so they sent no `tpSize` and the backend defaulted to tp=1, making large models spuriously report "does not fit" while the curated/Deploy tabs showed full capacity for the same model and cluster.

- Add `deriveTpSizeToFitWeights` in `gpuPerformance.ts`: returns the smallest power-of-two TP size whose per-GPU weight shard leaves room for a KV cache, bounded by `maxContiguous`; the fit test mirrors `estimateConcurrentCapacity`.
- Wire it into the `/gpu-throughput` route so an omitted `tpSize` is derived instead of defaulting to 1; an explicit `tpSize` still wins.
- Add tests covering the TP bump, `maxContiguous` cap, single-GPU cap, small-model/fp8 stay-at-1, unknown paramCount, headroom-exceeds-VRAM, and cross-tab consistency.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
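
The derivation can be sketched as a doubling loop. This is an illustration under assumptions, not the code in `gpuPerformance.ts`: the input shape, the headroom parameter, the "shard must be strictly smaller than usable VRAM" fit test, and the return value of 1 for the unknown-size and headroom-exceeds-VRAM cases are all guesses at reasonable behaviour.

```typescript
// Hypothetical input shape for the sketch.
interface TpFitInputSketch {
  paramCount?: number;
  bytesPerWeight: number;
  vramGBPerGpu: number;
  headroomGB: number;
  maxContiguous: number; // per-node GPU ceiling for one replica
}

function deriveTpSizeSketch(input: TpFitInputSketch): number {
  if (input.paramCount === undefined) return 1; // unknown size: stay at TP1
  const usableBytes = (input.vramGBPerGpu - input.headroomGB) * 1e9;
  if (usableBytes <= 0) return 1; // headroom exceeds VRAM: no TP size helps
  const weightBytes = input.paramCount * input.bytesPerWeight;
  let tp = 1;
  // Double the group while the per-GPU shard leaves no room for KV cache
  // and another doubling still fits within one node.
  while (weightBytes / tp >= usableBytes && tp * 2 <= input.maxContiguous) {
    tp *= 2;
  }
  return tp;
}
```

For example, a 70B-parameter model at 2 bytes per weight on 80 GB GPUs with 8 GB of headroom does not fit at TP1 (140 GB against 72 GB usable) but does at TP2 (70 GB per GPU), so the sketch returns 2.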

- `setFp8PrecisionEngineArgs` now strips `quantization`/`kv-cache-dtype` only when the value is `fp8` (the value the precision dropdown owns), so a user-set `awq`/`gptq` from the advanced engine-args editor is no longer clobbered when weight precision is not FP8
- add a non-blocking warning on the Deploy page when FP8 is selected but the throughput estimate is absent (errored/404), so an unsupported `fp8` flag is not submitted silently on hardware of unknown capability
- add unit tests for `setFp8PrecisionEngineArgs`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

- cap the explicit `contextLen` query param at `MAX_CONTEXT_LEN` (32768), not just the arch-inferred window, so a caller forwarding a model's huge advertised window (128K–1M) can't collapse the concurrency estimate toward zero
- bound the architecture cache with an LRU size cap (`ARCH_CACHE_MAX_ENTRIES`), evicting the least-recently-used entry once exceeded, so a wide scan of many distinct `modelId`/token keys no longer keeps every entry resident for the full TTL
- add tests for the explicit-`contextLen` cap and the LRU eviction

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
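
A common way to implement an LRU bound in TypeScript is to rely on the insertion order of `Map`, as sketched below. The class name is hypothetical and the TTL handling is left out; whether the actual cache uses this technique is an assumption.

```typescript
class LruCacheSketch<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the
      // least recently used one.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  has(key: string): boolean {
    return this.entries.has(key);
  }
}
```

With a cap of 2, inserting `a` and `b`, reading `a`, and then inserting `c` evicts `b`, because the read moved `a` to the most-recent position.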

- correct `formatCount` doc example: `18234` formats to `"18k"` (not `"18.2k"`), since values `>= 10000` drop the decimal
- document that `TP_DECODE_EFFICIENCY` (0.85) is a flat factor and is optimistic for large TP groups, where the per-GPU haircut grows with `tpSize` and slower interconnect
- replace the dead `apiGroup` ternary (both branches returned `''`) with a direct `''` and a note that the real CRD group isn't stored on the `InferenceProviderConfig`

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

Addresses review findings in the GPU throughput estimator:

- `selectGpuForEstimate()` and `gpuSupportsFp8()` now use strict GPU lookups (`normalizeKnownGpuModel`/`getKnownGpuInfo`); an unknown GPU label is skipped or returns a `404` instead of being silently estimated as an `A10` with wrong speed and FP8 numbers
- the `/gpu-throughput` query schema no longer rejects a non-HF `modelId`; the handler gates the token-bearing HF fetch on `isValidHfRepoId()`, so a curated/custom id degrades to a bandwidth-only estimate from `paramCount` instead of a hard `400`
- `getModelArchitecture()` reads transformer dimensions from nested `text_config`/`llm_config`/`language_config`, so multimodal and composite models yield high-confidence concurrency estimates instead of per-chat-only
- add tests covering unknown-GPU skip/404, non-HF and malformed `modelId` degradation, and nested-config parsing

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
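
The nested-config lookup could be sketched as follows. The three nested keys come from the commit message, and `num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, and `hidden_size` are standard Hugging Face `config.json` fields; which fields the real `getModelArchitecture()` reads, the lookup order, and the output shape are assumptions.

```typescript
type JsonObjectSketch = Record<string, unknown>;

interface ArchDimsSketch {
  numLayers: number;
  numAttentionHeads: number;
  numKvHeads: number;
  hiddenSize: number;
}

function asObjectSketch(v: unknown): JsonObjectSketch | undefined {
  return typeof v === 'object' && v !== null && !Array.isArray(v)
    ? (v as JsonObjectSketch)
    : undefined;
}

function positiveNumberSketch(o: JsonObjectSketch, key: string): number | undefined {
  const v = o[key];
  return typeof v === 'number' && v > 0 ? v : undefined;
}

function readArchitectureSketch(config: unknown): ArchDimsSketch | undefined {
  const root = asObjectSketch(config);
  if (!root) return undefined;
  // Try the top level first, then the nested language-model configs.
  const candidates = [
    root,
    asObjectSketch(root['text_config']),
    asObjectSketch(root['llm_config']),
    asObjectSketch(root['language_config']),
  ];
  for (const c of candidates) {
    if (!c) continue;
    const numLayers = positiveNumberSketch(c, 'num_hidden_layers');
    const heads = positiveNumberSketch(c, 'num_attention_heads');
    const hiddenSize = positiveNumberSketch(c, 'hidden_size');
    if (numLayers === undefined || heads === undefined || hiddenSize === undefined) continue;
    // Models without grouped-query attention omit num_key_value_heads.
    const numKvHeads = positiveNumberSketch(c, 'num_key_value_heads') ?? heads;
    return { numLayers, numAttentionHeads: heads, numKvHeads, hiddenSize };
  }
  return undefined;
}
```

Returning `undefined` when no candidate has the required dimensions matches the degraded, per-chat-only path described in this PR.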

The `eslint ^8 → ^10` bump broke `bun run lint` in both workspaces: ESLint v9+ requires a flat `eslint.config.*` (the repo never had one) and removed the `--ext` flag both scripts used.

- add flat `eslint.config.mjs` to `backend` and `frontend`, linting TypeScript via the typescript-eslint `flat/recommended` preset
- add `@typescript-eslint/parser` and `@typescript-eslint/eslint-plugin` to backend (it previously declared only `eslint`, so it could not parse its own `.ts`); lockfile updated to match
- drop the removed `--ext` flag from both lint scripts and `--max-warnings 0` from frontend
- demote pre-existing `no-explicit-any`, `no-unused-vars`, and the newly-enabled react-hooks compiler rules to warnings so lint is green and CI-usable, leaving the historical backlog visible for incremental burndown

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

- surface a non-blocking "model does not fit" warning on the deploy page when the high-confidence estimate leaves no room for the KV cache; `Deploy` stays enabled since the user may pick more GPUs per replica than the estimate assumed, and it is hidden when `fp8Blocked` already explains a blocking reason
- step `tpDecodeEfficiency` down by TP group size (1.0 for TP1, 0.85 for TP2-4, 0.75 for TP>4) instead of a flat 0.85, so large tensor-parallel groups crossing NVLink domains are not over-estimated
- add tests for the no-fit warning and the new efficiency tiers

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
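
The efficiency tiers translate directly into a small step function. The tier values (1.0, 0.85, 0.75) and the `tpSize × efficiency` scaling are from this PR; the function names and signatures below are illustrative.

```typescript
// Stepped decode efficiency by tensor-parallel group size.
function tpDecodeEfficiencySketch(tpSize: number): number {
  if (tpSize <= 1) return 1.0; // single GPU: no communication overhead
  if (tpSize <= 4) return 0.85; // TP2-4
  return 0.75; // TP>4: larger groups pay more for synchronization
}

// Effective bandwidth for per-chat speed: tpSize GPUs, each discounted
// by the tier's efficiency factor.
function effectiveBandwidthGBsSketch(
  perGpuBandwidthGBs: number,
  tpSize: number,
): number {
  return perGpuBandwidthGBs * tpSize * tpDecodeEfficiencySketch(tpSize);
}
```

With 1000 GB/s per GPU, a TP8 group is credited with 6000 GB/s instead of the 6800 GB/s the flat 0.85 factor would have given.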

Declare the return ref as `React.RefObject<T | null>` so it matches what `useRef<T>(null)` produces under `@types/react@19`, fixing the TS2322 build error that broke the `build` and `e2e-frontend` CI checks.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

- Replace `React.RefObject` with a type-only `RefObject` import in `useInView` so the return type no longer relies on the UMD `React` global, which `tsc` may fail to resolve
- Update `useGpuThroughput` JSDoc to note the query gates on `paramCount` only, since the GPU model is now chosen server-side

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

`gpuSupportsFp8` only recognised Hopper, so the throughput estimator silently downgraded an FP8 KV cache to fp16 and the Deploy page blocked FP8 deployments on L40S/L4, both of which have a native FP8 datapath. vLLM gates full FP8 (W8A8 / FP8 KV cache) on compute capability >= 8.9, covering Ada Lovelace and Hopper.

- Add `FP8_CAPABLE_GENERATIONS` and treat `Ada Lovelace` (L40S, L4) as FP8-capable alongside Hopper; Ampere and older stay excluded (A100 is weight-only W8A16, not the W8A8/FP8-KV path modelled here)
- Update `gpuSupportsFp8` tests to assert L40S/L4 are FP8-capable and pre-Ada GPUs are not
- Correct stale "Hopper-only" / "non-Hopper" wording in `installation.ts`, `DeployPage.tsx`, `DeploymentForm.tsx`, and the `GpuThroughputEstimate` shared type

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
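
A sketch of generation-based FP8 gating along the lines described above. The set of FP8-capable generations follows the commit message; the lookup table, the type names, and the choice to return `false` for an unrecognised GPU label are assumptions (per this PR, the route itself skips unknown GPUs or returns a 404).

```typescript
type GpuGenerationSketch = 'Ampere' | 'Ada Lovelace' | 'Hopper';

const FP8_CAPABLE_GENERATIONS_SKETCH: ReadonlySet<GpuGenerationSketch> =
  new Set<GpuGenerationSketch>(['Ada Lovelace', 'Hopper']);

// Assumed lookup table; the real spec table in costEstimation.ts has
// more fields and possibly different keys.
const GPU_GENERATION_SKETCH: Record<string, GpuGenerationSketch> = {
  H100: 'Hopper',
  H200: 'Hopper',
  L40S: 'Ada Lovelace',
  L4: 'Ada Lovelace',
  A100: 'Ampere',
  A10: 'Ampere',
};

function gpuSupportsFp8Sketch(gpuModel: string): boolean {
  const generation = GPU_GENERATION_SKETCH[gpuModel];
  return generation !== undefined && FP8_CAPABLE_GENERATIONS_SKETCH.has(generation);
}

type KvCacheDtypeSketch = 'fp16' | 'fp8';

// In the estimate, an FP8 KV cache is downgraded to FP16 on GPUs
// without a native FP8 datapath.
function effectiveKvCacheDtypeSketch(
  requested: KvCacheDtypeSketch,
  gpuModel: string,
): KvCacheDtypeSketch {
  return requested === 'fp8' && !gpuSupportsFp8Sketch(gpuModel) ? 'fp16' : requested;
}
```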

Copilot

Pull request overview

Copilot reviewed 34 out of 36 changed files in this pull request and generated 4 comments.

The backend recognises Ada Lovelace (L40S, L4) and Hopper (H100, H200) as FP8-capable, but the UI still told users only H100/H200 support FP8, which is wrong for L40S/L4 clusters.

- `DeployPage.tsx`: update the `fp8BlockReason` text and both weight / KV-cache `InfoHint` tooltips to mention L40S/L4
- `DeploymentForm.tsx`: update the fallback FP8 block message to match

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

PR kaito-project#311 added frontend/eslint.config.mjs to main; this branch independently added frontend/eslint.config.js. With both present ESLint silently loads only the .js and ignores the .mjs. Merge into the single .mjs (keeping kaito-project#311's parser/JSX wiring and react-hooks recommended set) and delete the .js.

The rules this PR cleaned up (no-explicit-any, no-empty-object-type, and no-unused-vars with argsIgnorePattern '^_') are promoted to errors so they cannot regress; the experimental react-hooks React-Compiler rules stay warnings per kaito-project#311's backlog decision.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

surajssd requested a review from a team as a code owner June 3, 2026 18:34

Copilot AI review requested due to automatic review settings June 3, 2026 18:34

Copilot started reviewing on behalf of surajssd June 3, 2026 18:34

Copilot AI reviewed Jun 3, 2026

Comment thread backend/src/routes/installation.ts Outdated

Comment thread backend/src/routes/installation.ts

Comment thread backend/src/services/huggingface.ts

Comment thread frontend/src/hooks/useGpuOperator.ts

Comment thread frontend/src/hooks/useInView.ts

surajssd force-pushed the suraj/estimate-inf-token-throughput branch from a1e39f8 to 4325784 Compare June 5, 2026 17:07

Copilot AI review requested due to automatic review settings June 5, 2026 17:49

Copilot started reviewing on behalf of surajssd June 5, 2026 17:50

Copilot AI reviewed Jun 5, 2026

Comment thread frontend/src/hooks/useInView.ts

surajssd requested a review from Copilot June 5, 2026 18:03

Copilot started reviewing on behalf of surajssd June 5, 2026 18:03

Copilot AI reviewed Jun 5, 2026

Comment thread frontend/src/hooks/useInView.ts Outdated

Comment thread frontend/src/hooks/useGpuOperator.ts

surajssd requested a review from Copilot June 5, 2026 20:32

Copilot started reviewing on behalf of surajssd June 5, 2026 20:32

Copilot AI reviewed Jun 5, 2026

Comment thread frontend/src/hooks/useInView.ts

robert-cronin reviewed Jun 10, 2026

Comment thread backend/src/lib/kubeconfig.ts

Comment thread backend/src/services/costEstimation.ts Outdated

surajssd added 15 commits June 10, 2026 11:07

surajssd added 7 commits June 10, 2026 11:07

surajssd force-pushed the suraj/estimate-inf-token-throughput branch from 6808e31 to e5137e8 Compare June 10, 2026 18:49

Copilot AI review requested due to automatic review settings June 10, 2026 19:16

Copilot started reviewing on behalf of surajssd June 10, 2026 19:16

Copilot AI reviewed Jun 10, 2026

Comment thread frontend/src/pages/DeployPage.tsx Outdated

Comment thread frontend/src/pages/DeployPage.tsx Outdated

Comment thread frontend/src/pages/DeployPage.tsx Outdated

Comment thread frontend/src/components/deployments/DeploymentForm.tsx Outdated

surajssd requested a review from robert-cronin June 10, 2026 22:13

robert-cronin approved these changes Jun 10, 2026

surajssd merged commit c4ee1c5 into kaito-project:main Jun 11, 2026
8 of 12 checks passed

surajssd deleted the suraj/estimate-inf-token-throughput branch June 11, 2026 00:18

robert-cronin mentioned this pull request Jun 11, 2026

build(lint): add ESLint flat configs for both workspaces #320

Merged

17 tasks

3 participants
