Threadstone is offline terminal chat for local MLX language models on Apple Silicon. It starts a local mlx-vlm server for the model you choose, streams the answer into your terminal, keeps the conversation usable across crashes, and shuts the server down when you leave.
The useful part is ownership. The network is only for setup and model downloads. After that, chat runs against local snapshots with Hugging Face offline flags, no telemetry, no hosted API, and no account.
Requirements: macOS, Apple Silicon, Python 3.13+, Homebrew, and about 12 GB for the default model set.
git clone https://github.com/gabrimatic/threadstone.git
cd threadstone
./setup.sh

`setup.sh` creates ~/mlx-env, installs the MLX runtime, downloads the configured Qwen model snapshots, installs the threadstone CLI, and writes shell helpers for oracle, forge, and quench.
Start a chat:
oracle
oracle 4B
oracle "be terse" 2B

Use the installed CLI directly:
threadstone --list-models
threadstone --doctor
threadstone "answer like a systems engineer" 9B

Threadstone uses the network during setup. Runtime chat stays on localhost and local model files.
| Path | Runtime scope |
|---|---|
| Model snapshots | Hugging Face cache on disk |
| Inference server | mlx_vlm.server on 127.0.0.1 |
| Chat client | Python stdlib HTTP and SSE client |
| Session history | In memory for the current terminal session |
| Session restore | ~/.cache/threadstone/ per model and terminal tab |
| Telemetry | Disabled through environment flags |
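
As an illustration of the stdlib SSE client row above, here is a minimal sketch of pulling text deltas out of a streamed response. It assumes OpenAI-style chunks (`data: {json}` lines with `choices[0].delta.content`, ending in `data: [DONE]`); that wire format and the name `iter_sse_text` are assumptions for this example, not Threadstone's confirmed implementation.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from raw SSE lines.

    Assumes OpenAI-style chunks: 'data: {json}' lines, ending with 'data: [DONE]'.
    """
    for raw in lines:
        line = raw.decode("utf-8", errors="replace").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel (assumed)
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore a malformed chunk instead of aborting the stream
        if not isinstance(chunk, dict):
            continue
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text
```

A response object returned by `urllib.request.urlopen` can be iterated line by line, so it could be passed to this function directly.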
Offline guard:
HF_HUB_OFFLINE=1
TRANSFORMERS_OFFLINE=1
HF_DATASETS_OFFLINE=1
HF_HUB_DISABLE_TELEMETRY=1
HF_HUB_DISABLE_IMPLICIT_TOKEN=1
HF_TOKEN=
DO_NOT_TRACK=1
DISABLE_TELEMETRY=1
ANONYMIZED_TELEMETRY=0
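
A minimal sketch of applying these flags from Python before a server process is launched. The flag names and values are the ones listed above; `OFFLINE_ENV` and `offline_env` are hypothetical names for this example and may not match what config.py actually defines.

```python
import os

# Flags copied from the offline guard list above.
OFFLINE_ENV = {
    "HF_HUB_OFFLINE": "1",
    "TRANSFORMERS_OFFLINE": "1",
    "HF_DATASETS_OFFLINE": "1",
    "HF_HUB_DISABLE_TELEMETRY": "1",
    "HF_HUB_DISABLE_IMPLICIT_TOKEN": "1",
    "HF_TOKEN": "",
    "DO_NOT_TRACK": "1",
    "DISABLE_TELEMETRY": "1",
    "ANONYMIZED_TELEMETRY": "0",
}


def offline_env(base=None):
    """Return a copy of the environment with the offline flags forced on."""
    env = dict(os.environ if base is None else base)
    env.update(OFFLINE_ENV)  # the guard always wins over inherited values
    return env
```

The returned mapping can be passed as `env=` to `subprocess.Popen`, so the child process cannot inherit a token or an online setting from the parent shell.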
Inside chat:
| Command | Effect |
|---|---|
| `/read <path>` | Attach a file or directory listing to the next message |
| `/drop` | Cancel the pending attachment |
| `/history` | Show recent visible conversation turns |
| `/restore` | Restore the saved session for this terminal tab |
| `/clear` | Reset conversation state and pending attachment |
| `/help` | Show in-session commands |
| `exit` / `quit` | Stop the chat and terminate the owned server |
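
To make the `/read` rules concrete (text files and directory listings only, no non-regular files, no binary-looking content, 50 KB cap), here is a rough sketch. The NUL-byte heuristic, the choice to truncate rather than reject oversized files, reading 50 KB as 50 * 1024 bytes, and the name `read_attachment` are all assumptions for this example.

```python
import os
import stat

MAX_ATTACHMENT_BYTES = 50 * 1024  # 50 KB cap; the exact byte count is assumed


def read_attachment(path: str) -> str:
    """Return text for a regular file, or a sorted listing for a directory."""
    mode = os.stat(path).st_mode
    if stat.S_ISDIR(mode):
        return "\n".join(sorted(os.listdir(path)))
    if not stat.S_ISREG(mode):
        # Sockets, FIFOs, and device files are rejected.
        raise ValueError(f"not a regular file: {path}")
    with open(path, "rb") as handle:
        data = handle.read(MAX_ATTACHMENT_BYTES)  # truncate at the cap (assumed)
    if b"\x00" in data:
        # NUL bytes are a common sign of binary content (heuristic, assumed).
        raise ValueError(f"binary-looking content: {path}")
    return data.decode("utf-8", errors="replace")
```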
Shell helpers from setup.sh:
| Command | Effect |
|---|---|
| `oracle [prompt] [size]` | Start chat with an owned server |
| `forge [size]` | Start a manual background server |
| `quench [size \| all]` | Stop a manual background server, or all of them |
| `threadstone --doctor` | Check platform, venv, model snapshots, offline flags, and port state |
| `threadstone --list-models` | Print configured models, ports, context limits, and paths |
Default models are 4-bit MLX snapshots from mlx-community.
| Key | Model | Thinking | Max tokens | Context trim |
|---|---|---|---|---|
| `9B` | Qwen3.5-9B-MLX-4bit | Yes | 4096 | 8000 |
| `4B` | Qwen3.5-4B-MLX-4bit | Yes | 4096 | 8000 |
| `2B` | Qwen3.5-2B-MLX-4bit | No | 2048 | 4000 |
| `0.8B` | Qwen3.5-0.8B-MLX-4bit | No | 1024 | 2000 |
Each model has its own default port, memory estimate, context threshold, and response budget in config.py.
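
A registry like this could be expressed as a small table of frozen dataclasses. The sketch below uses only the values from the model table above; ports and memory estimates are left out because this README does not list them, and `ModelSpec`, `MODELS`, and `resolve_model` are hypothetical names, not necessarily those in config.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    repo: str          # snapshot name under mlx-community
    thinking: bool     # whether reasoning streams before the final answer
    max_tokens: int    # response budget
    context_trim: int  # approximate context threshold before trimming


MODELS = {
    "9B": ModelSpec("Qwen3.5-9B-MLX-4bit", True, 4096, 8000),
    "4B": ModelSpec("Qwen3.5-4B-MLX-4bit", True, 4096, 8000),
    "2B": ModelSpec("Qwen3.5-2B-MLX-4bit", False, 2048, 4000),
    "0.8B": ModelSpec("Qwen3.5-0.8B-MLX-4bit", False, 1024, 2000),
}


def resolve_model(key: str) -> ModelSpec:
    """Look up a model by size key, ignoring case and surrounding spaces."""
    normalized = key.strip().upper()
    if normalized not in MODELS:
        raise KeyError(f"unknown model {key!r}; choose from {', '.join(MODELS)}")
    return MODELS[normalized]
```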
Threadstone owns the server lifecycle for normal chat.
- Port recovery: if the default port is busy, Threadstone scans forward and starts the model on the next available localhost port.
- RAM guard: startup checks free and reclaimable memory before launching a model, including other reachable model servers.
- Crash recovery: if the server disappears during a turn, Threadstone restarts it and resends the pending message.
- Thinking models: reasoning streams dimmed until `</think>`, then the final answer prints normally and only the final answer is sent back in later history.
- Attachments: `/read` accepts text files and directory listings, rejects non-regular files, rejects binary-looking content, and caps file payloads at 50 KB.
- Context trimming: old turns are trimmed when the approximate context crosses the configured threshold while preserving valid role alternation.
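
The context trimming rule can be sketched as follows. This is one possible approach, not necessarily the one Threadstone uses: it estimates size at roughly four characters per token (an assumption) and drops the oldest user/assistant pair at a time, so the remaining history still starts with a user turn. `approx_tokens` and `trim_history` are hypothetical names.

```python
def approx_tokens(messages):
    """Rough size estimate: about four characters per token (an assumption)."""
    return sum(len(m["content"]) for m in messages) // 4


def trim_history(messages, threshold):
    """Drop the oldest user/assistant pairs until the estimate fits.

    A leading system message is kept. Turns are removed two at a time, so
    the remaining history still starts with a user message and alternates.
    """
    system = messages[:1] if messages and messages[0]["role"] == "system" else []
    turns = messages[len(system):]
    while len(turns) > 2 and approx_tokens(system + turns) > threshold:
        turns = turns[2:]  # remove one user turn and the reply that followed it
    return system + turns
```

Removing turns in pairs is what keeps role alternation valid; removing a single message would leave two consecutive messages with the same role.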
oracle / threadstone
|
v
threadstone.py
|-- argument parsing, doctor, REPL, streaming, history
|-- ServerManager starts and monitors mlx_vlm.server
|-- /read attaches bounded local text context
|
v
config.py
|-- model registry, ports, limits, memory estimates
|-- snapshot resolution from the Hugging Face cache
|-- offline environment guard
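
The port recovery behavior handled by the server manager (scan forward from a busy default port) might look like the sketch below. The function names, the bind-based probe, and the scan limit of 20 ports are assumptions for illustration; the real ServerManager may check ports differently.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if the port can currently be bound on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def next_free_port(start, attempts=20, is_free=port_is_free):
    """Scan forward from `start` and return the first free port."""
    for port in range(start, start + attempts):
        if is_free(port):
            return port
    raise RuntimeError(f"no free port in {start}-{start + attempts - 1}")
```

The `is_free` parameter exists so the scan can be tested without opening real sockets. A probe like this is only a snapshot: another process can still take the port between the check and the server start, so the caller should be ready to retry.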
Files:
| File | Role |
|---|---|
| `threadstone.py` | CLI, server lifecycle, chat loop, streaming parser, attachment handling |
| `config.py` | Model registry, offline env, runtime limits, validation |
| `setup.sh` | One-time macOS installer and shell helper setup |
| `tests/` | Unit coverage for parsing, streaming, history, config, attachments, and snapshots |
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
python3 -m unittest discover -s tests -t . -v
python3 -m build --sdist --wheel
python3 -m twine check dist/*

Run the CLI without starting a model:
threadstone --version
threadstone --list-models

Run the local setup check after ./setup.sh:
threadstone --doctor --all-models

Threadstone is a Python package, so the package registry is PyPI, not pub.dev. pub.dev is for Dart and Flutter packages.
Release path:
- Update `CHANGELOG.md` and `pyproject.toml`.
- Run tests, build, and `twine check`.
- Create a GitHub release tag such as `v1.1.0`.
- The release workflow builds the sdist and wheel, then publishes to PyPI through Trusted Publishing.
PyPI Trusted Publishing must be configured for:
| Field | Value |
|---|---|
| Owner | gabrimatic |
| Repository | threadstone |
| Workflow | release.yml |
| Environment | pypi |
Runtime chat is local by design, but model files and dependencies still come from external package and model hosts during setup.
Report vulnerabilities through GitHub private vulnerability reporting. Do not open a public issue for security reports.