Hey,
I hit this snag through `wllama`, but the native abort seems to originate from `ggml-webgpu.cpp`, so I am reporting it here after ngxson suggested taking it upstream.
CC @reeselevine
I'm working (with the help of codex) on a "simple", offline-first, CPU-only, educational PWA, and I've run into a bit of a doozy with the way `wllama` loads the model in Chromium: Chromium fails during model setup, before preflight/generation.
Chromium logs show a ShaderF16 WebGPU assertion during `wllamaStart()` even though the app is set up with `n_gpu_layers: 0`.
With GPU layers disabled, I'd expect either CPU/WASM operation or a "no WebGPU device" message, right?
Instead, Chromium throws this error:
/source/llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:3699:
GGML_ASSERT(ctx->webgpu_global_ctx->adapter.HasFeature(wgpu::FeatureName::ShaderF16)) failed
Then it shows:
Received abort signal from llama.cpp; Message: (empty)
Aborted()
Cannot find waiting task with callbackId = ...
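In case it helps with triage, here is a minimal sketch of how an app could recognize this specific abort from the captured log text. The helper names (`parseGgmlAssert`, `isShaderF16Assert`) are made up for this post and are not part of wllama or llama.cpp; the pattern only assumes the `<file>:<line>: GGML_ASSERT(<expr>) failed` shape seen in the log above.

```typescript
// Hypothetical helpers: not part of wllama or llama.cpp.
interface GgmlAssertInfo {
  file: string;
  line: number;
  expression: string;
}

// Matches "<file>:<line>: GGML_ASSERT(<expr>) failed", tolerating a line
// break between the location prefix and the assertion text.
const GGML_ASSERT_RE = /(\S+):(\d+):\s*GGML_ASSERT\(([\s\S]*?)\)\s+failed/;

function parseGgmlAssert(log: string): GgmlAssertInfo | null {
  const match = GGML_ASSERT_RE.exec(log);
  if (match === null) return null;
  return {
    file: match[1] ?? "",
    line: Number(match[2] ?? "0"),
    expression: match[3] ?? "",
  };
}

// True only for the ShaderF16 feature assertion raised from ggml-webgpu.cpp.
function isShaderF16Assert(log: string): boolean {
  const info = parseGgmlAssert(log);
  return (
    info !== null &&
    info.file.endsWith("ggml-webgpu.cpp") &&
    info.expression.includes("FeatureName::ShaderF16")
  );
}

// The two log lines quoted above, joined the way the console printed them.
const sample = [
  "/source/llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:3699:",
  "GGML_ASSERT(ctx->webgpu_global_ctx->adapter.HasFeature(wgpu::FeatureName::ShaderF16)) failed",
].join("\n");

console.log(parseGgmlAssert(sample));
console.log(isShaderF16Assert(sample));
```

This only classifies the failure after the fact; it does not prevent the abort.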
Btw, the same app/model path works in Firefox in both the online mode and the offline cached `Model` mode, so I think the model, app shell, service worker, and cache path are not the root issue.
Here's a rundown of a codex-assisted session that traced the timing of WebGPU/backend init relative to model load params:
- `wllamaStart()` is called before the normal load action.
- Native backend init happens through `wllama_start()` / `llama_backend_init()`.
- Load params are converted for the load action after startup.
- `n_gpu_layers` does not appear to gate early backend init.
- JS source does not explicitly select WebGPU before load.
- No public supported option was found to disable WebGPU/backend registration before `wllamaStart()`.
- The default v3.1 wasm appears to be built with WebGPU enabled.
- An internal-looking `build-wasm/wllama.wasm` artifact appeared to be CPU-only (`GGML_WEBGPU=OFF`) judging by its metadata, strings, and size, but it failed with a JS-wrapper/import-object incompatibility and is not a supported runtime path.
Questions
- Is `shader-f16` currently a hard requirement for the ggml WebGPU backend?
- Where should CPU fallback for this case be handled: llama.cpp/ggml, wllama, or both?
- Should the WebGPU backend assert when the adapter lacks `shader-f16`, or should it return a recoverable "backend-unavailable" error?
- When no GPU layers are requested through `n_gpu_layers: 0`, should WebGPU backend/device initialization still happen?
- Is there a supported way for browser/WASM callers to disable WebGPU backend registration entirely and force CPU/WASM-only behavior in Chromium?
- Or is this expected behavior, a docs ambiguity, a wllama bug, a ggml/llama.cpp WebGPU backend issue, or purely a browser thing?
- Should I just look into doing my own local, CPU-only wllama/llama.cpp build, or maybe some kind of runtime switch?
Thanks in advance for any pointers or help.
Environment
- OS: EndeavourOS / Arch Linux, `BUILD_ID=2025.03.19`
- Kernel: Linux `6.18.33-1-lts` `x86_64`
- Desktop/session: KDE Plasma / Wayland
- Device class: Lenovo Legion Y530 laptop
- CPU: Intel i7-8750H
- RAM: 32 GB
- GPUs:
  - Intel UHD Graphics 630 / Mesa `26.1.1-arch1.2`
  - NVIDIA GeForce GTX 1060 / NVIDIA driver `580.126.09`
- Chromium: `148.0.7778.178` Arch Linux
- Firefox: `151.0.2`
- `@wllama/wllama`: dependency `^3.1.1`, locked/installed `3.1.1`
- App stack: Vite PWA / TypeScript
- Model: `Qwen2.5-0.5B-Instruct-Q4_K_M.gguf`
Current wllama setup
Current imports/module map/constructor/load params:
import { LogLevel, Wllama, type Model } from "@wllama/wllama";
import wllamaWasmUrl from "@wllama/wllama/esm/wasm/wllama.wasm?url";

const wllama = new Wllama(
  { default: wllamaWasmUrl },
  { allowOffline: true, suppressNativeLog: false },
);

const loadModelParams = {
  n_ctx: 256,
  n_batch: 64,
  n_gpu_layers: 0,
  log_level: LogLevel.DEBUG,
  progressCallback: (...),
};

The online URL path calls:

await wllama.loadModelFromUrl(localAiAbsoluteModelUrl, {
  ...loadModelParams,
  useCache: true,
});

The offline cached `Model` path calls:

await wllama.loadModel(cachedModel, loadModelParams);

Both paths use `n_gpu_layers: 0`.
Observed behavior
Chromium diagnostics before model load:
- WebGPU: present
- `wllama.isSupportWebGPU()`: yes
- GPU adapter: available
- `adapter.features.has("shader-f16")`: no
- `SharedArrayBuffer`: unavailable
- `crossOriginIsolated`: no
- `hardwareConcurrency`: 12
- `deviceMemory`: 32 GB
- GGUF `HEAD`: yes / 200
- GGUF `GET`: yes / 200
- WASM loads
- Final load stage: failed after Model URL load started
- Does not reach URL load succeeded
- Does not reach preflight started
- No model metadata/tensor loading appears before abort
- Firefox comparison: Firefox succeeds; previous app diagnostics observed WebGPU missing / wllama WebGPU support no / adapter not requested / `shader-f16` not requested
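To make the pattern in these diagnostics explicit, here is a minimal sketch of how the app could classify the situation before calling into wllama. Everything in it (`PreloadDiagnostics`, `expectStartup`, the outcome labels) is a hypothetical app-side name, not a wllama or llama.cpp API, and the mapping is only inferred from the two browsers observed here, so treat it as a guess rather than documented behavior.

```typescript
// Hypothetical app-side types and names; nothing here is a wllama API.
interface PreloadDiagnostics {
  webgpuPresent: boolean;    // e.g. "gpu" in navigator
  adapterAvailable: boolean; // requestAdapter() returned an adapter
  shaderF16: boolean;        // adapter.features.has("shader-f16")
}

type StartupExpectation =
  | "no-webgpu-path"            // observed in Firefox: load succeeds
  | "untested-webgpu-path"      // adapter has shader-f16: not observed here
  | "shader-f16-assert-likely"; // observed in Chromium: native abort

// Assumption: backend init ignores n_gpu_layers, so only the adapter's
// capabilities decide which of the observed outcomes to expect.
function expectStartup(d: PreloadDiagnostics): StartupExpectation {
  if (!d.webgpuPresent || !d.adapterAvailable) return "no-webgpu-path";
  return d.shaderF16 ? "untested-webgpu-path" : "shader-f16-assert-likely";
}

const chromium: PreloadDiagnostics = {
  webgpuPresent: true,
  adapterAvailable: true,
  shaderF16: false,
};
const firefox: PreloadDiagnostics = {
  webgpuPresent: false,
  adapterAvailable: false,
  shaderF16: false,
};

console.log(expectStartup(chromium));
console.log(expectStartup(firefox));
```

A check like this could at least let the app show a clear message instead of hitting the abort, but it does not give a CPU-only path in Chromium, which is what I'm actually after.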
Chromium WebGPU adapter check
{
  "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36",
  "platform": "Linux x86_64",
  "hardwareConcurrency": 12,
  "deviceMemory": 32,
  "crossOriginIsolated": false,
  "sharedArrayBuffer": false,
  "webgpu": true,
  "adapterInfo": {},
  "features": [
    "bgra8unorm-storage",
    "clip-distances",
    "core-features-and-limits",
    "depth-clip-control",
    "depth32float-stencil8",
    "dual-source-blending",
    "float32-blendable",
    "float32-filterable",
    "indirect-first-instance",
    "primitive-index",
    "rg11b10ufloat-renderable",
    "subgroups",
    "texture-component-swizzle",
    "texture-compression-bc",
    "texture-compression-bc-sliced-3d",
    "texture-formats-tier1",
    "texture-formats-tier2",
    "timestamp-query"
  ],
  "shaderF16": false,
  "error": ""
}
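For anyone who wants to reproduce the check, here is a minimal sketch of roughly how a report like the one above can be gathered. It is not the exact code from my app: `GpuLike`, `AdapterLike`, `summarizeAdapter`, and `checkAdapter` are stand-in names, the structural types exist only so the sketch compiles without `@webgpu/types`, and the `"no adapter"` error string is made up. The only WebGPU calls it assumes are `requestAdapter()` and the set-like `adapter.features`.

```typescript
// Minimal structural stand-ins for the WebGPU types used by this sketch.
interface FeatureSetLike extends Iterable<string> {
  has(name: string): boolean;
}
interface AdapterLike {
  features: FeatureSetLike;
}
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}

interface AdapterReport {
  webgpu: boolean;
  features: string[];
  shaderF16: boolean;
  error: string;
}

// Turns an adapter (or the lack of one) into a plain, JSON-friendly report.
function summarizeAdapter(adapter: AdapterLike | null): AdapterReport {
  if (adapter === null) {
    return { webgpu: true, features: [], shaderF16: false, error: "no adapter" };
  }
  return {
    webgpu: true,
    features: Array.from(adapter.features).sort(),
    shaderF16: adapter.features.has("shader-f16"),
    error: "",
  };
}

// Pass navigator.gpu in a browser; undefined means WebGPU is not exposed.
async function checkAdapter(gpu: GpuLike | undefined): Promise<AdapterReport> {
  if (gpu === undefined) {
    return { webgpu: false, features: [], shaderF16: false, error: "" };
  }
  try {
    return summarizeAdapter(await gpu.requestAdapter());
  } catch (e) {
    return { webgpu: true, features: [], shaderF16: false, error: String(e) };
  }
}
```

In a browser this would be called with `navigator.gpu` (a cast may be needed depending on the TypeScript DOM typings in use) and the result passed through `JSON.stringify`.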