Fix build for Android #125

Closed
rgerganov wants to merge 1 commit into ggml-org:master from rgerganov:fix-android
Conversation

@rgerganov

Member

The project can be built for Android with the NDK and CMake like this:

cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI='arm64-v8a' -DANDROID_PLATFORM=android-23 ..

However, the vdotq_* intrinsics are not available on Android. Fix this by checking for __ANDROID__ and, in that case, using the code that commit 84d9015 replaced.
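
To make the proposed guard concrete, here is a minimal sketch of an int8 dot product that takes the vdotq_s32 path only when the compiler advertises dot-product support and __ANDROID__ is not defined, and otherwise falls back to widening multiplies. This is not the ggml code and not the contents of commit 84d9015; the helper name dot_i8 and the HAVE_NEON64 macro are hypothetical, and the non-ARM scalar branch exists only so the sketch compiles on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#define HAVE_NEON64 1 /* hypothetical macro, used only in this sketch */
#else
#define HAVE_NEON64 0
#endif

/* Hypothetical helper: dot product of two int8 arrays.
 * n must be a multiple of 16 (one 128-bit NEON register per step). */
int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n) {
    assert(n % 16 == 0);
    int32_t sum = 0;
#if HAVE_NEON64
    int32x4_t acc = vdupq_n_s32(0);
    for (size_t i = 0; i < n; i += 16) {
        const int8x16_t va = vld1q_s8(a + i);
        const int8x16_t vb = vld1q_s8(b + i);
#if defined(__ARM_FEATURE_DOTPROD) && !defined(__ANDROID__)
        /* Fast path: one SDOT instruction per 16 element pairs. */
        acc = vdotq_s32(acc, va, vb);
#else
        /* Fallback: widen int8*int8 to int16, then pairwise-add to int32. */
        const int16x8_t lo = vmull_s8(vget_low_s8(va), vget_low_s8(vb));
        const int16x8_t hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
        acc = vaddq_s32(acc, vaddq_s32(vpaddlq_s16(lo), vpaddlq_s16(hi)));
#endif
    }
    sum = vaddvq_s32(acc); /* horizontal add of the four int32 lanes */
#else
    /* Portable scalar path for non-ARM hosts. */
    for (size_t i = 0; i < n; i++) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
#endif
    return sum;
}
```

All three paths compute the same value, so a caller such as `dot_i8(x, y, 32)` does not need to know which one was compiled in.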
@rgerganov

Member Author

It turns out this is not needed as long as the build is configured with -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod.
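
One way to confirm that the -march flag took effect is to test the ACLE feature macro __ARM_FEATURE_DOTPROD, which compilers define when dot-product instructions are enabled for the target. The sketch below assumes nothing about the project itself; has_dotprod, dot_path and require_dotprod are hypothetical helper names.

```c
#include <assert.h>

/* Returns 1 if this translation unit was compiled with dot-product
 * support enabled (for example via -march=armv8.4a+dotprod), else 0. */
int has_dotprod(void) {
#if defined(__ARM_FEATURE_DOTPROD)
    return 1;
#else
    return 0;
#endif
}

/* Names the code path that the compiler flags would select. */
const char *dot_path(void) {
#if defined(__ARM_FEATURE_DOTPROD)
    return "dotprod";
#elif defined(__ARM_NEON)
    return "neon";
#else
    return "scalar";
#endif
}

/* Optional startup check: aborts (in builds without NDEBUG) when the
 * binary was built without dot-product support. */
void require_dotprod(void) {
    assert(has_dotprod() && "rebuild with -march=armv8.4a+dotprod");
}
```

Printing `dot_path()` from a test binary built with the Android toolchain shows whether the vdotq_* intrinsics are usable in that configuration.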

@rgerganov rgerganov closed this Mar 14, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023
…ortable

Make app server module importable
thedanhoffman pushed a commit to thedanhoffman/llama.cpp that referenced this pull request Apr 14, 2026
ggerganov pushed a commit that referenced this pull request Apr 21, 2026
#21944)

* Thread safety per request only

* Fix ROPE yarn case

* Fix sticky stateful config

* Use i4/i8 directly for symmetric quant

* Use weightless caching

* Add WeightlessCacheAttribute to reduce NPU memory usage

* Gelu tanh support (#125)

* Imrope support (#126)

* fix(openvino): explicit ov::Tensor frees in ggml_backend_openvino_free

* add GPU,NPU support in OV Dockerfile

* add build-openvino.yml ci

* Fix sticky stateful config

* add concurrency to ov-gpu ci runs. Move OV CI to build-openvino.yml

* fix thread-safety of shared runtime context

* rope type abstraction for frontend translations

* fix editorconfig

---------

Co-authored-by: Mustafa Cavus <mustafa.cavus@intel.com>
Co-authored-by: Dan Hoffman <dhoff749@gmail.com>
Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* q4_0_r4: 6% faster PP on NEON

* qx_0_r4_q8_0 template

Applied to q4_0_r4 and q5_0_r4. It makes q5_0_r4 PP
~7% faster.

* Apply qx_0_r4_q8_0 template also to q6_0_r4 and iq4_nl_x4

* Simplify

* Minor iq4_xs_r4 improvement on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
marksverdhei added a commit to heiervang-technologies/ht-llama.cpp that referenced this pull request Jun 12, 2026
… (#71)

Measured perplexity on Qwen3.5-0.8B-BF16 / wikitext-2 / ctx=512:

| cache-type | PPL     | vs f16    |
|------------|---------|-----------|
| f16        | 19.08   | baseline  |
| q8_0       | 19.08   | lossless  |
| tbq3_0     | 1252.30 | 65x worse |
| tbq4_0     | 1393.00 | 73x worse |
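
The "vs f16" column is each perplexity divided by the f16 baseline, with the result truncated to a whole number. A minimal sketch of that arithmetic, using only the values in the table (ppl_ratio is a hypothetical helper name):

```c
#include <assert.h>

/* Ratio of a measured perplexity to the baseline perplexity. */
double ppl_ratio(double ppl, double baseline) {
    assert(baseline > 0.0);
    return ppl / baseline;
}
```

With the table's numbers, `ppl_ratio(1252.30, 19.08)` is about 65.6 and `ppl_ratio(1393.00, 19.08)` is about 73.0.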

TBQ KV-cache produces near-random output. Likely root cause is statistical:
TBQ's rotated-domain codebook was calibrated for weight distributions, not
the K/V tensor distributions seen during inference. The encoding scheme
itself cannot faithfully represent KV values.

Snoop-kube's cluster audit confirms zero deployments use tbq* KV-cache
(every host uses q8_0 or q4_0). DFlash also defaults to q8_0 (PR #65).
No production consumer exists.

This PR adds a one-line experimental note to the --cache-type-k/v and
--cache-type-k-draft/v-draft help text, referencing issue #70 for the
full data + recommendation. Code path stays in place — Markus may have
roadmap intent I'm not aware of; this just stops anyone reading --help
from assuming tbq* is a usable choice without checking.

Follow-ups if Markus prefers full removal:
* drop tbq3_0/tbq4_0 from common/arg.cpp's kv_cache_types list
* keep the ftypes (TBQ weight quantization is separate from KV use)
* close issues ggml-org#124 + ggml-org#125 as wont-fix