Name and Version
llama.cpp version: 9189 (64b38b5)
Operating systems
Linux
GGML backends
CUDA
Hardware
Nvidia 5060Ti / Intel CPU
Models
Gemma 4 31B - BF16 GGUF
Problem description & steps to reproduce
I want to measure perplexity and KL divergence across a range of context lengths. The machine that performs the measurements has to run the BF16 GGUF of Gemma 4 31B split across GPU and system RAM. The run works with a context length of 1024 but crashes with a context length of 10000.
The command is:
llama-perplexity -m gemma-4-31B-it-BF16-00001-of-00002.gguf -f ./logits_temp/txt_ctx10000_ch0.txt -c 10000 -t 10 --no-mmap --no-warmup -ctk bf16 -ctv bf16 --kl-divergence-base ./logits_temp/gt_gemma-4-31B-it-BF16-00001-of-00002.gguf_ctx10000_ch0.bin --fit off -ngl 13 -dev CUDA0 --verbose
The main error is:
0.47.670.644 I perplexity: saving all logits to ./logits_temp/gt_gemma-4-31B-it-BF16-00001-of-00002.gguf_ctx10000_ch0.bin
0.47.670.648 I perplexity: tokenizing the input ..
0.47.743.853 I perplexity: tokenization took 73.199 ms
0.47.743.922 I perplexity: calculating perplexity over 2 chunks, n_ctx=10000, batch_size=2048, n_seq=1
llama.cpp/build/bin/libggml-base.so.0(+0x1a7b6) [0x7871cf7947b6]
llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x20d) [0x7871cf794c3d]
llama.cpp/build/bin/libggml-base.so.0(+0x2e69f) [0x7871cf7a869f]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xc364a) [0x7871ceac364a]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt10unexpectedv+0x0) [0x7871ceaabc6c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xc3901) [0x7871ceac3901]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x44) [0x7871ceab0073]
llama-perplexity(+0x4f87) [0x64175a061f87]
llama-perplexity(+0x7f48) [0x64175a064f48]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a601) [0x7871ce62a601]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x88) [0x7871ce62a718]
llama-perplexity(+0x82a5) [0x64175a0652a5]
terminate called after throwing an instance of 'std::length_error'
what(): vector::_M_default_append
Aborted (core dumped)
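For context on the exception itself: in libstdc++, `vector::_M_default_append` is the internal helper behind `std::vector::resize`, and it throws `std::length_error` when the requested size exceeds `max_size()`. One common way to get there is an element count computed in 32-bit signed arithmetic that wraps negative and is then converted to `size_t`. The sketch below only illustrates that mechanism. It is not taken from the llama-perplexity source, the names `wrapped_count` and `resize_throws_length_error` are hypothetical, and the vocabulary size of 262144 is an assumed value used for the arithmetic, not something confirmed for this model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: the element count as a 32-bit int would hold it.
// The product is formed in 64 bits and then narrowed, which avoids the
// undefined behaviour of overflowing a signed 32-bit multiplication directly.
// The narrowing wraps modulo 2^32 on mainstream compilers (guaranteed since C++20).
int32_t wrapped_count(int32_t n_ctx, int32_t n_vocab) {
    const int64_t wide = static_cast<int64_t>(n_ctx) * n_vocab;
    return static_cast<int32_t>(wide);
}

// Hypothetical helper: true if resizing a std::vector<float> to n elements
// throws std::length_error (requested size above max_size()).
bool resize_throws_length_error(std::size_t n) {
    std::vector<float> v;
    try {
        v.resize(n);
    } catch (const std::length_error &) {
        return true;
    }
    return false;
}
```

Under these assumptions, 1024 * 262144 = 268435456 still fits in a 32-bit int, while 10000 * 262144 = 2621440000 does not. The wrapped value is negative, and converting it to `size_t` yields a request far above `max_size()`, so `resize` throws before any allocation is attempted. That would be consistent with the run succeeding at a context length of 1024 and failing at 10000, but whether this is the actual cause has not been verified.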
First Bad Commit
No response
Relevant log output
Logs