
llamafile reloaded (v0.10.0)#867

Merged
aittalam merged 127 commits into main from new_build_wip on Mar 19, 2026

Conversation

@aittalam
Member

A modern version of llamafile, rebuilt to stay aligned with recent versions of llama.cpp.

aittalam and others added 6 commits February 25, 2026 16:55
* Update llama.cpp submodule to b908baf1825b1a89afef87b09e22c32af2ca6548

Updates patches and integration code for new llama.cpp version:
- Regenerated all patches for updated upstream code
- Added common_ngram-mod.cpp.patch (adds #include <algorithm>)
- Added vendor_cpp-httplib_httplib.cpp.patch (XNU futex workaround moved from .h)
- Added common/license.cpp stub for LICENSES symbol
- Removed obsolete vendor_minja_minja.hpp.patch (jinja now built-in)
- Removed obsolete vendor_cpp-httplib_httplib.h.patch (code moved to .cpp)
- Updated chatbot.h/cpp for common_chat_syntax -> common_chat_parser_params rename
- Removed minja test from tests/BUILD.mk

* Updated license.cpp with the one generated by cmake in upstream llama.cpp
* Updated info about license.cpp in patches' README
* Remove minja from tests
* Updated refs to minja in docs
* Fixed templating issue with Apertus
* Load the PEG parser in chatbot_main if one is provided
* Updated whisper.cpp submodule from v1.6.2-168 (6739eb83) to v1.8.3 (2eeeba56).
* Updated patches scripts + removed old patches
* Added whisperfile + extra tools (mic2raw, mic2txt, stream, whisper-server)
* Added slurp
* Updated docs and man pages

---------

Co-authored-by: angpt <anushrigupta@gmail.com>
* Add CLI, SERVER, CHAT, and combined modes
* Removed log 'path  does not exist'
* Added server routes to main.cpp
* Fixing GPU log callbacks
* Added --nothink feature for CLI
* Refactored args + cleaned FLAGS
@github-actions github-actions bot added the devops label Mar 2, 2026
aittalam and others added 20 commits March 3, 2026 19:39
llama.cpp update, Qwen3.5 Think Mode & CLI Improvements

llama.cpp Submodule Update

    Updated to 7f5ee549683d600ad41db6a295a232cdd2d8eb9f
    Updated associated patches (removed obsolete vendor_miniaudio_miniaudio.h.patch)

Qwen3.5 Think Mode Support

    Proper handling of think/nothink mode in both chat and CLI modes
    Uses common_chat_templates_apply() with enable_thinking parameter instead of manually constructing prompts
    Correctly parses reasoning content using PEG parser with COMMON_REASONING_FORMAT_DEEPSEEK
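The reasoning parsing above relies on llama.cpp's PEG parser with COMMON_REASONING_FORMAT_DEEPSEEK. As a rough illustration of what that format separation does (a simplified Python sketch, not the actual PEG grammar; the function name is illustrative):

```python
import re

def split_reasoning(text):
    """Separate DeepSeek-style <think>...</think> reasoning from the
    final answer. The PR uses llama.cpp's PEG parser with
    COMMON_REASONING_FORMAT_DEEPSEEK; this regex is a simplification."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    # No reasoning block: the whole text is the answer.
    return "", text.strip()

reasoning, answer = split_reasoning(
    "<think>The user greets me.</think>Hello! How can I help?"
)
```

In nothink mode the model emits no `<think>` block, so the same function degrades gracefully to returning the full text as the answer.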

System Prompt Handling

    Captures -p/--prompt value early in argument parsing (needed for combined mode where server parsing excludes -p)
    /clear command now properly resets g_pending_file_content to prevent uploaded files from persisting after clear
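The early capture of `-p`/`--prompt` works around the server parser in combined mode not accepting `-p`. A minimal sketch of the pre-scan idea in Python (the helper name is hypothetical; the real code does this in C++ during argument parsing):

```python
def capture_prompt(argv):
    """Pre-scan argv for -p/--prompt before full argument parsing,
    since the server-side parser in combined mode does not accept -p.
    Returns (prompt or None, argv with the flag and its value removed)."""
    remaining, prompt = [], None
    it = iter(argv)
    for arg in it:
        if arg in ("-p", "--prompt"):
            prompt = next(it, None)  # consume the flag's value
        else:
            remaining.append(arg)
    return prompt, remaining

prompt, rest = capture_prompt(["--port", "8080", "-p", "You are terse."])
```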

Code Quality

    Refactored cli_apply_chat_template() to return both prompt and parser params
    Added documentation comments for subtle pointer lifetime and argument parsing behaviors


---------

Co-authored-by: Stuart Henderson <sthen@users.noreply.github.com> (OpenBSD supported versions update)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> (llama.cpp submodule update)
Adds integration tests for llamafiles:

    run a pre-built llamafile as well as the plain executable with a model passed as a parameter
    test TUI (piping inputs to the process), server (sending HTTP requests), CLI (passing prompts as params), and hybrid modes
    test plain text, multimodal, and tool calling with ad-hoc prompts
    test thinking vs no-thinking mode
    test CPU vs GPU


* Added timeout multiplier

* Added combined marker, improved combined tests

* Added check for GPU presence

* Added meaningful temperature test

* Fixed platforms where sh is needed

* Adding retry logic to server requests
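The retry logic compensates for the server still warming up when the first requests arrive. A self-contained sketch of the idea with exponential backoff (the helper name and delays are illustrative):

```python
import time

def with_retries(fn, attempts=5, delay=0.2):
    """Retry a server request a few times before failing, as the
    integration tests do while the llamafile server is booting."""
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as err:  # e.g. ConnectionError before the server is up
            last_err = err
            time.sleep(delay * (2 ** i))  # exponential backoff
    raise last_err

# Simulate a server that refuses the first two connections.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server not up yet")
    return "ok"

result = with_retries(flaky, attempts=5, delay=0.01)
```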
- Add sm_110f (Jetson Thor & family) and sm_121a (DGX Spark GB10)
  support for aarch64 platforms with CUDA 13.x
- Add sm_120f (RTX 5000 series, RTX PRO Blackwell) support for x86_64
  platforms with CUDA 13.x
- Enable --compress-mode=size for optimized binary size on Blackwell GPUs
- Detect CUDA version and host architecture at build time

Co-authored-by: wingx <wingenlit@outlook.com>
* Implement chat in combined mode as an OpenAI client
* Implemented stop_tui in tests
* Fixed CLI tests using --verbose
* Accept non-utf8 chars in responses
* Simplify prompt for t=0
* Added patch for tools/server/server.cpp
* Server output > devnull to avoid buffer fill up with --verbose
* Added back support for --image in CLI tool
* Added tests for multimodal cli
* Added optional mmproj parameter to TUI tests too
* Addressed review comments
* Added test to check multiple markers/images on cli
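Chat in combined mode talks to the in-process server over its OpenAI-compatible API. A minimal sketch of such a client using only the standard library (the base URL and model name are placeholders; the endpoint path follows the OpenAI chat completions convention, and the lenient decode mirrors the "accept non-UTF-8 chars in responses" fix):

```python
import json
import urllib.request

def build_chat_request(messages, base_url="http://127.0.0.1:8080"):
    """Build an OpenAI-style chat completion request for a local
    llamafile server."""
    payload = {"model": "local", "messages": messages, "stream": False}
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def chat(messages, base_url="http://127.0.0.1:8080"):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(messages, base_url)) as r:
        raw = r.read().decode("utf-8", errors="replace")  # tolerate non-UTF-8 bytes
    return json.loads(raw)["choices"][0]["message"]["content"]

req = build_chat_request([{"role": "user", "content": "hi"}])
```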
* Updated index.md

* Moar updates to index.md

* Updated quickstart.md

* Updated support + example llamafiles

* Added example files and examples + minor fixes

* Updated structure

* Removed security

* Updated source installation

* Updated README_0.10.0, now frozen doc

* Removed ref to new_build_wip in whisperfile, make setup installs cosmocc

* Apply suggestion from @dpoulopoulos

Co-authored-by: Dimitris Poulopoulos <dimitris.a.poulopoulos@gmail.com>

* Addressed review comments

* Addressed review comments #2

---------

Co-authored-by: Dimitris Poulopoulos <dimitris.a.poulopoulos@gmail.com>
* Added per-mode help + nologo/ascii support
* If model is missing, bump to help for respective mode
* Updated skill not to use new_build_wip + improved it
* Removed stray new_build_wip reference
@aittalam aittalam marked this pull request as ready for review March 19, 2026 11:12
@aittalam aittalam merged commit 4cc1a5f into main Mar 19, 2026
3 checks passed
@aittalam aittalam deleted the new_build_wip branch March 19, 2026 11:13

4 participants