Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
LlamaWeb brings multi-precision LLM inference to the browser through a WebGPU backend for llama.cpp, reducing memory use and improving decode throughput portably across diverse devices.
