Feat/web chat multimodal#183
Merged
Merged
Conversation
… run Windows CUDA's "Install CUDA toolkit" step swung 6-17 min because method: network only caches the tiny network bootstrapper — running it re-fetches ~5 GB of packages from NVIDIA's CDN on every run (logs show cuda_installer-...-x64_12.8.0.exe -s pulling packages live). The actual compile is already warm (~5.5 min via sccache/S3); this download was the real variance. Switch to method: local (full ~3 GB offline installer). With use-github-cache (default true) the full installer is cached in GitHub Actions cache, so subsequent runs restore it locally and install offline — no NVIDIA CDN round-trip. sccache stays on S3 and is untouched; only the ~3 GB installer shares the 10 GB GitHub cache with the rust-cache. First run after this change still pays the one-time full-installer download; runs after that should drop the step to ~2-4 min.
Compared our generate_multimodal against llama.cpp's reference mtmd-cli (tools/mtmd/mtmd-cli.cpp). The reference sets `text.add_special = add_bos` (true on the first turn), so it prepends the model's BOS token. We set `add_special = !request.raw`, and the multimodal path always uses raw=true (hand-built turn template) — so we were sending the prompt WITHOUT <bos>. Gemma requires BOS. Without it the prompt is malformed and the model degrades into "wall of text / line art / collage" confabulation — but only when the image signal is weak. Strong landscapes survived the missing BOS (which is why some images "worked"); portraits, people and dense/abstract images tipped into garbage. This reproduced identically on CLI and web, and on 0.6.0 — i.e. it was never a beta regression nor the web pipeline; it was this one missing token in the shared core, present since multimodal landed. Fix: add_special = true (always prepend BOS) for the mtmd tokenize, matching the reference. parse_special stays true so the hand-built turn markers are still recognised.
The headline of this beta: the missing-<bos> fix in the image prompt, which reproduced the "wall of text / line art" misreads on CLI, web and 0.6.0 alike. Also carries the n_ubatch sizing, image_max_tokens override, WebP→PNG, error surfacing, web audio, and the local CUDA installer CI change. First release on `method: local` for the Windows CUDA toolkit — that step is slow once (downloads + caches the full offline installer), fast afterwards.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.