ggml Chromium WebGPU ShaderF16 error/assertion stops CPU/WASM fallback when `n_gpu_layers=0`

Hey, 

---

I hit this snag through `wllama`, but the native abort seems to originate in `ggml-webgpu.cpp`, so I am reporting it here after ngxson suggested taking it upstream.

CC @reeselevine 

---

I'm working (with the help of Codex) on a "simple", offline-first, CPU-only, educational PWA, and I've run into a problem with the way `wllama` loads the model in Chromium: loading fails during model setup, before preflight or generation.

Chromium logs show a `ShaderF16` WebGPU assertion during `wllamaStart()` even though the app is configured with `n_gpu_layers: 0`.
With GPU layers disabled, I would expect either CPU/WASM operation or a 'no WebGPU device' message.

Instead, Chromium throws this error:

```text
/source/llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:3699:
GGML_ASSERT(ctx->webgpu_global_ctx->adapter.HasFeature(wgpu::FeatureName::ShaderF16)) failed
```

Then it shows:

```text
Received abort signal from llama.cpp; Message: (empty)
Aborted()
Cannot find waiting task with callbackId = ...
```

By the way, the same app/model path works in Firefox in both the online mode and the offline cached `Model` mode, so I don't think the model, app shell, service worker, or cache path is the root cause.

Here's a rundown of a codex-assisted session that traced the timing of WebGPU/backend init relative to model load params:

- `wllamaStart()` is called before the normal load action.
- Native backend init happens through `wllama_start()` / `llama_backend_init()`.
- Load params are converted for the load action after startup.
- `n_gpu_layers` does not appear to gate early backend init.
- The JS source does not explicitly select WebGPU before load.
- No public, supported option was found to disable WebGPU/backend registration before `wllamaStart()`.
- The default V3.1 wasm appears to be built with WebGPU enabled.
- An internal-looking `build-wasm/wllama.wasm` artifact appeared to be CPU-only (`GGML_WEBGPU=OFF`) judging by metadata, strings, and size, but it failed with a JS-wrapper/import-object incompatibility and is not a supported runtime path.

## Questions

- is `shader-f16` currently a hard requirement for the ggml WebGPU backend?
- where should CPU fallback for this case be handled: llama.cpp/ggml, wllama, or both?
- should the WebGPU backend assert when the adapter lacks shader-f16, or should it return a recoverable 'backend-unavailable' error?
- when no GPU layers are requested through `n_gpu_layers: 0`, should WebGPU backend/device initialization still happen?
- is there a supported way for browser/WASM callers to disable WebGPU backend registration entirely and force CPU/WASM-only behavior in Chromium?
- or is this expected behavior, a docs ambiguity, some wllama bug, a ggml/llama.cpp WebGPU backend issue, or a browser thing only?
- should I just look into doing my own local, CPU-only wllama/llama.cpp build, or maybe some kind of runtime switch?
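
As a stopgap on the app side, a guard along these lines could classify the environment before `wllama` is started. This is only a sketch: `assessBackend`, `ProbeResult`, and `BackendAssessment` are hypothetical names, not part of wllama or ggml, and the classification rests solely on the two observations in this report (Chromium with WebGPU but no `shader-f16` aborts; Firefox without WebGPU succeeds).

```typescript
// Hypothetical app-side types; nothing here is a wllama or ggml API.
interface ProbeResult {
  webgpu: boolean;    // is navigator.gpu present?
  shaderF16: boolean; // does the adapter report "shader-f16"?
}

type BackendAssessment =
  | { kind: "cpu-only"; reason: string }
  | { kind: "shader-f16-present"; reason: string }
  | { kind: "abort-risk"; reason: string };

// Classify the environment from a WebGPU probe taken before startup.
function assessBackend(probe: ProbeResult): BackendAssessment {
  if (!probe.webgpu) {
    // Matches the Firefox observation: no WebGPU, load succeeds on CPU/WASM.
    return { kind: "cpu-only", reason: "WebGPU is not exposed" };
  }
  if (probe.shaderF16) {
    // Assumption: the ShaderF16 assertion would pass; other backend
    // requirements are not checked by this sketch.
    return { kind: "shader-f16-present", reason: "adapter reports shader-f16" };
  }
  // Matches the Chromium observation: WebGPU present, shader-f16 missing.
  return {
    kind: "abort-risk",
    reason: "WebGPU is exposed but the adapter lacks shader-f16",
  };
}
```

With the Chromium values from this report (`webgpu: true`, `shaderF16: false`) the guard returns `abort-risk`, which the app could use to show a clear message instead of starting a load that aborts. It does not make the CPU path work; that still needs a fix or a supported switch upstream.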

---

Thanks in advance for any pointers or help.

---

## Environment

- OS: EndeavourOS / Arch Linux, `BUILD_ID=2025.03.19`
- Kernel: Linux `6.18.33-1-lts` x86_64
- Desktop/session: KDE Plasma / Wayland
- Device class: Lenovo Legion Y530 laptop
- CPU: Intel i7-8750H
- RAM: 32 GB
- GPUs:
  - Intel UHD Graphics 630 / Mesa `26.1.1-arch1.2`
  - NVIDIA GeForce GTX 1060 / NVIDIA driver `580.126.09`
- Chromium: `148.0.7778.178` Arch Linux
- Firefox: `151.0.2`
- `@wllama/wllama`: dependency `^3.1.1`, locked/installed `3.1.1`
- App stack: Vite PWA / TypeScript
- Model: `Qwen2.5-0.5B-Instruct-Q4_K_M.gguf`

---

## Current wllama setup

Current imports/module map/constructor/load params:

```ts
import { LogLevel, Wllama, type Model } from "@wllama/wllama";
import wllamaWasmUrl from "@wllama/wllama/esm/wasm/wllama.wasm?url";

const wllama = new Wllama(
  { default: wllamaWasmUrl },
  { allowOffline: true, suppressNativeLog: false },
);

const loadModelParams = {
  n_ctx: 256,
  n_batch: 64,
  n_gpu_layers: 0,
  log_level: LogLevel.DEBUG,
  progressCallback: (...),
};
```

The online URL path calls:

```ts
await wllama.loadModelFromUrl(localAiAbsoluteModelUrl, {
  ...loadModelParams,
  useCache: true,
});
```

The offline cached `Model` path calls:

```ts
await wllama.loadModel(cachedModel, loadModelParams);
```

Both paths use `n_gpu_layers: 0`.

---

## Observed behavior

Chromium diagnostics before model load:

- WebGPU: present
- `wllama.isSupportWebGPU()`: yes
- GPU adapter: available
- `adapter.features.has("shader-f16")`: no
- SharedArrayBuffer: unavailable
- `crossOriginIsolated`: no
- `hardwareConcurrency`: 12
- `deviceMemory`: 32 GB
- GGUF `HEAD`: yes / 200
- GGUF `GET`: yes / 200
- WASM loads
- Final load stage: failed after Model URL load started
- Does not reach URL load succeeded
- Does not reach preflight started
- No model metadata/tensor loading appears before abort

- Firefox comparison: Firefox succeeds. Previous app diagnostics in Firefox observed: WebGPU missing, `wllama` WebGPU support: no, adapter not requested, `shader-f16` not requested.

---

## Chromium WebGPU adapter check

{
  "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36",
  "platform": "Linux x86_64",
  "hardwareConcurrency": 12,
  "deviceMemory": 32,
  "crossOriginIsolated": false,
  "sharedArrayBuffer": false,
  "webgpu": true,
  "adapterInfo": {},
  "features": [
    "bgra8unorm-storage",
    "clip-distances",
    "core-features-and-limits",
    "depth-clip-control",
    "depth32float-stencil8",
    "dual-source-blending",
    "float32-blendable",
    "float32-filterable",
    "indirect-first-instance",
    "primitive-index",
    "rg11b10ufloat-renderable",
    "subgroups",
    "texture-component-swizzle",
    "texture-compression-bc",
    "texture-compression-bc-sliced-3d",
    "texture-formats-tier1",
    "texture-formats-tier2",
    "timestamp-query"
  ],
  "shaderF16": false,
  "error": ""
}
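
For anyone who wants to reproduce the WebGPU part of this report, a probe along these lines could collect it. It is a minimal sketch: `probeWebGpu`, `summarizeFeatures`, and the `*Like` interfaces are illustrative names introduced here (the structural types only stand in for `@webgpu/types`), and the only browser calls assumed are `requestAdapter()` and the set-like `adapter.features`.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
interface AdapterLike {
  features: Iterable<string> & { has(name: string): boolean };
}
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}

interface AdapterReport {
  webgpu: boolean;
  features: string[];
  shaderF16: boolean;
  error: string;
}

// Pure helper: sorted feature list plus the shader-f16 flag.
function summarizeFeatures(
  features: AdapterLike["features"],
): Pick<AdapterReport, "features" | "shaderF16"> {
  return {
    features: Array.from(features).sort(),
    shaderF16: features.has("shader-f16"),
  };
}

// Probe a navigator.gpu-like object; pass undefined when WebGPU is absent.
async function probeWebGpu(gpu: GpuLike | undefined): Promise<AdapterReport> {
  const report: AdapterReport = {
    webgpu: gpu !== undefined,
    features: [],
    shaderF16: false,
    error: "",
  };
  if (!gpu) return report;
  try {
    const adapter = await gpu.requestAdapter();
    if (!adapter) {
      report.error = "requestAdapter() resolved to null";
      return report;
    }
    Object.assign(report, summarizeFeatures(adapter.features));
  } catch (e) {
    // Record the failure instead of throwing, so the report is always complete.
    report.error = e instanceof Error ? e.message : String(e);
  }
  return report;
}
```

In a browser this could be called as `probeWebGpu((navigator as unknown as { gpu?: GpuLike }).gpu)`. Taking the GPU object as a parameter keeps the function usable in environments where `navigator.gpu` does not exist.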


