Eval bug: Embedding models crash either upon loading or during use #20514

@el95149

Description

Name and Version

Observed first on:
version: 8312 (e4cff09)
built with GNU 11.4.0 for Linux x86_64

and also on (my) latest:
version: 8322 (557fe2d)
built with GNU 15.2.0 for Linux x86_64

Operating systems

Linux

GGML backends

Vulkan, CUDA

Hardware

  • Ryzen 7900X
  • 64 GB DDR5@6000
  • X870e board
  • NVIDIA RTX 5080 16GB
  • AMD Radeon AI Pro R9700

Models

Instant crash:

  • nomic-embed-text-v1.5.Q4_K_M.gguf
  • embeddinggemma-300M-Q8_0.gguf

Random crash while embedding:

  • mxbai-embed-large-v1-f16.gguf

Problem description & steps to reproduce

Loading the models listed above with llama-server results in either an immediate crash on load, or a crash partway through embedding. Observed on both the Vulkan and CUDA backends.

llama commands:

./build/bin/llama-server \
	--port 1234 --host 0.0.0.0 \
	-m /media/aanagnostopoulos/models/models/ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf \
	--embedding
	
./build/bin/llama-server \
	--port 1234 --host 0.0.0.0 \
	-m /media/aanagnostopoulos/models/models/mixedbread-ai/mxbai-embed-large-v1/mxbai-embed-large-v1-f16.gguf \
	--embedding
	
./build/bin/llama-server \
	--port 1234 --host 0.0.0.0 \
	-m /media/aanagnostopoulos/models/models/nomic-embed-text-v1.5.Q4_K_M.gguf \
	--embedding
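
For the "random crash while embedding" case (mxbai), the server was exercised through its OpenAI-compatible embeddings endpoint. A minimal client sketch of what such a request looks like (hedged: the `/v1/embeddings` route and payload shape follow the standard llama-server OpenAI-compatible API, and the port matches the commands above; adjust for your setup):

```python
# Hedged repro helper: posts an embeddings request to the llama-server
# instance started with the commands above (port 1234).
import json
import urllib.request

def make_payload(texts):
    # OpenAI-style embeddings payload; "model" is largely informational
    # for llama-server (the model is fixed by -m) but kept for client
    # compatibility
    return {"input": texts, "model": "embedding-model"}

def embed(texts, url="http://localhost:1234/v1/embeddings"):
    # Build and send the POST request; returns the parsed JSON response
    req = urllib.request.Request(
        url,
        data=json.dumps(make_payload(texts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# usage (requires a running server):
#   embed(["hello world"])
```

Repeated requests of this shape were enough to eventually hit the mid-embedding crash with mxbai-embed-large-v1-f16.gguf.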

First Bad Commit

I first observed it around build 8312.

Relevant log output

CUDA + Vulkan crash log
**CUDA**

((HEAD detached at b8322))$ ./build/bin/llama-server  --port 1234 --host 0.0.0.0      -m /media/aanagnostopoulos/models/models/ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf  --embedding
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15839 MiB):
  Device 0: NVIDIA GeForce RTX 5080, compute capability 12.0, VMM: yes, VRAM: 15839 MiB (12871 MiB free)
main: embeddings enabled with n_batch (2048) > n_ubatch (512)
main: setting n_batch = n_ubatch = 512 to avoid assertion failure
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8322 (557fe2d91) with GNU 15.2.0 for Linux x86_64
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

init: using 23 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/media/aanagnostopoulos/models/models/ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
/home/aanagnostopoulos/code/llama.cpp/ggml/src/ggml.c:3214: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
[New LWP 275310]
[New LWP 275309]
[New LWP 275308]
[New LWP 275307]
[New LWP 275306]
[New LWP 275305]
[New LWP 275304]
[New LWP 275303]
[New LWP 275302]
[New LWP 275301]
[New LWP 275300]
[New LWP 275299]
[New LWP 275298]
[New LWP 275297]
[New LWP 275296]
[New LWP 275295]
[New LWP 275294]
[New LWP 275293]
[New LWP 275292]
[New LWP 275291]
[New LWP 275290]
[New LWP 275289]
[New LWP 275288]
[New LWP 275287]
[New LWP 275286]
[New LWP 275284]
[New LWP 275283]
[New LWP 275282]
[New LWP 275281]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x000073ff18ca013c in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: No such file or directory
#2  __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x000073ff18d1ca0f in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x000073ff19367e13 in ggml_print_backtrace () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libggml-base.so.0
#5  0x000073ff19367fc6 in ggml_abort () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libggml-base.so.0
#6  0x000073ff1936da71 in ggml_mul_mat () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libggml-base.so.0
#7  0x000073ff1950edf6 in llm_graph_context::build_pooling(ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*) const () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#8  0x000073ff1955a69b in llama_model::build_graph(llm_graph_params const&) const () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#9  0x000073ff194d0130 in llama_context::graph_reserve(unsigned int, unsigned int, unsigned int, llama_memory_context_i const*, bool, unsigned long*) () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#10 0x000073ff194d2348 in llama_context::sched_reserve() () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#11 0x000073ff194d4645 in llama_context::llama_context(llama_model const&, llama_context_params) () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#12 0x000073ff194d5340 in llama_init_from_model () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#13 0x000073ff194a9261 in llama_get_device_memory_data(char const*, llama_model_params const*, llama_context_params const*, std::vector<ggml_backend_device*, std::allocator<ggml_backend_device*> >&, unsigned int&, unsigned int&, unsigned int&, ggml_log_level) () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#14 0x000073ff194aa3d4 in llama_params_fit_impl(char const*, llama_model_params*, llama_context_params*, float*, llama_model_tensor_buft_override*, unsigned long*, unsigned int, ggml_log_level) () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#15 0x000073ff194ae202 in llama_params_fit () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#16 0x0000647713ff91d5 in common_init_result::common_init_result(common_params&) ()
#17 0x0000647713ffbb6a in common_init_from_params(common_params&) ()
#18 0x0000647713f2b457 in server_context_impl::load_model(common_params const&) ()
#19 0x0000647713e71f26 in main ()
[Inferior 1 (process 275280) detached]
Aborted (core dumped)
**Vulkan**

((HEAD detached at b8322))$ ./build/bin/llama-server \
        --port 1234 --host 0.0.0.0 \
        -m /media/aanagnostopoulos/models/models/ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf \
        --embedding
        
WARNING: radv is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5080 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = AMD Radeon AI PRO R9700 (RADV GFX1201) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
main: embeddings enabled with n_batch (2048) > n_ubatch (512)
main: setting n_batch = n_ubatch = 512 to avoid assertion failure
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8322 (557fe2d91) with GNU 15.2.0 for Linux x86_64
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

init: using 23 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/media/aanagnostopoulos/models/models/ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
/home/aanagnostopoulos/code/llama.cpp/ggml/src/ggml.c:3214: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
[New LWP 285137]
[New LWP 285134]
[New LWP 285133]
[New LWP 285132]
[New LWP 285129]
[New LWP 285128]
[New LWP 285127]
[New LWP 285126]
[New LWP 285125]
[New LWP 285124]
[New LWP 285123]
[New LWP 285122]
[New LWP 285121]
[New LWP 285120]
[New LWP 285119]
[New LWP 285118]
[New LWP 285117]
[New LWP 285116]
[New LWP 285115]
[New LWP 285114]
[New LWP 285113]
[New LWP 285112]
[New LWP 285111]
[New LWP 285110]
[New LWP 285109]
[New LWP 285108]
[New LWP 285107]
[New LWP 285106]
[New LWP 285105]
[New LWP 285103]
[New LWP 285102]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007775c64a013c in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: No such file or directory
#2  __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007775c651ca0f in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x00007775c7037e13 in ggml_print_backtrace () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libggml-base.so.0
#5  0x00007775c7037fc6 in ggml_abort () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libggml-base.so.0
#6  0x00007775c703da71 in ggml_mul_mat () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libggml-base.so.0
#7  0x00007775c6d0edf6 in llm_graph_context::build_pooling(ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*) const () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#8  0x00007775c6d5a69b in llama_model::build_graph(llm_graph_params const&) const () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#9  0x00007775c6cd0130 in llama_context::graph_reserve(unsigned int, unsigned int, unsigned int, llama_memory_context_i const*, bool, unsigned long*) () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#10 0x00007775c6cd2348 in llama_context::sched_reserve() () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#11 0x00007775c6cd4645 in llama_context::llama_context(llama_model const&, llama_context_params) () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#12 0x00007775c6cd5340 in llama_init_from_model () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#13 0x00007775c6ca9261 in llama_get_device_memory_data(char const*, llama_model_params const*, llama_context_params const*, std::vector<ggml_backend_device*, std::allocator<ggml_backend_device*> >&, unsigned int&, unsigned int&, unsigned int&, ggml_log_level) () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#14 0x00007775c6caa3d4 in llama_params_fit_impl(char const*, llama_model_params*, llama_context_params*, float*, llama_model_tensor_buft_override*, unsigned long*, unsigned int, ggml_log_level) () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#15 0x00007775c6cae202 in llama_params_fit () from /home/aanagnostopoulos/code/llama.cpp/build/bin/libllama.so.0
#16 0x00005f75480401d5 in common_init_result::common_init_result(common_params&) ()
#17 0x00005f7548042b6a in common_init_from_params(common_params&) ()
#18 0x00005f7547f72457 in server_context_impl::load_model(common_params const&) ()
#19 0x00005f7547eb8f26 in main ()
[Inferior 1 (process 285101) detached]
Aborted (core dumped)
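
For anyone triaging: both backtraces abort at the same point, the GGML_ASSERT(ggml_can_mul_mat(a, b)) inside llm_graph_context::build_pooling, so the pooling graph is being built with incompatible tensor shapes rather than anything backend-specific, consistent with it reproducing identically on CUDA and Vulkan. As I read ggml.c (hedged sketch, not authoritative), the failing check amounts to:

```python
# Hedged Python sketch of ggml_can_mul_mat from ggml.c: ne0/ne1 are the
# 4-element shape arrays of the two tensors. mul_mat needs matching inner
# dimensions and broadcastable batch dimensions.
def can_mul_mat(ne0, ne1):
    return (ne0[0] == ne1[0]            # inner (row) dimensions must match
            and ne1[2] % ne0[2] == 0    # batch dim 2 must broadcast
            and ne1[3] % ne0[3] == 0)   # batch dim 3 must broadcast
```

So the two tensors build_pooling multiplies here presumably disagree in one of these dimensions; rerunning with -fit off and --verbose, as the log itself suggests, should narrow down which.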
