Eval bug: Segmentation fault when tokenizing with long sequences of repeated characters #21113

@ketaiq

Name and Version

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 81037 MiB):
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes, VRAM: 81037 MiB
version: 8376 (67a2209)
built with GNU 9.4.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

NVIDIA A100 80GB PCIe

Models

unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF Qwen3-Coder-30B-A3B-Instruct-UD-Q8_K_XL.gguf

Problem description & steps to reproduce

The bug is very easy to reproduce with the given input test.txt, which contains 43,695 'A' characters.

I launched llama-server with

nohup llama-server -m "Qwen3-Coder-30B-A3B-Instruct-UD-Q8_K_XL.gguf" --port 8000 --jinja -ngl 99 --ctx-size $((64*1024)) --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 > llama-server.log 2>&1 &

I then sent a POST request to the /tokenize endpoint with the long input string as the content:

curl --request POST --url http://localhost:8000/tokenize --header "Content-Type: application/json" --data "{\"content\": \"$(cat test.txt)\"}"

The input file test.txt was generated by TestFusion, a test case generator developed in our STAR lab.

test.txt
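
For readers without the attachment, an equivalent input can probably be regenerated locally. The sketch below assumes, based only on the description above, that test.txt consists of exactly 43,695 'A' characters and nothing else (no trailing newline). It has not been checked against the attached file.

```shell
# Assumption: test.txt is 43,695 'A' characters with no trailing newline.
# head -c takes that many zero bytes from /dev/zero; tr maps each one to 'A'.
head -c 43695 /dev/zero | tr '\0' 'A' > test.txt

# Sanity check: print the byte count of the generated file.
wc -c < test.txt
```

The curl command above can then be run unchanged, since it reads test.txt from the current directory.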

First Bad Commit

The bug may be related to pull request #17786.

Relevant log output

curl: (52) Empty reply from server

[2]-  Segmentation fault      (core dumped) nohup llama-server -m "Qwen3-Coder-30B-A3B-Instruct-UD-Q8_K_XL.gguf" --port 8000 --jinja -ngl 99 --ctx-size $((64*1024)) --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 > llama-server.log 2>&1
