[Feature] Bundle a zero-config CPU-only local model (llama.cpp + Qwen2.5-1.5B) as the default model — true offline-first experience

## 🎯 Feature Goal

Currently, `openclaw-portable` runs OpenClaw entirely from a USB drive — Node.js, config, workspace — but **still requires an external API key or a separately installed Ollama** to actually talk to any AI model. This is the last missing piece for a true zero-dependency offline experience.

This issue proposes bundling a small, CPU-only local model directly inside the portable package, launching it as a sidecar process alongside the OpenClaw gateway, and auto-configuring it as the default model — so users plug in the USB and AI just works, with no internet, no API key, no pre-installed software.

---

## 🏗️ Architecture Analysis (based on current repo structure)

After reviewing the codebase, the integration fits cleanly into the **existing launch flow**:

```
start.sh / start.bat
  │
  ├─ [2/5] Set up environment (Node, OpenClaw binary check)
  ├─ [NEW] [3/5] Launch bundled llama-server sidecar  ← INSERT HERE
  ├─ [4/5] Initialize working directory
  └─ [5/5] openclaw gateway start
```

The `install-models.js` + `deepMerge` config mechanism already provides the exact hook needed to inject the bundled model provider config into `openclaw.json` — no architectural changes required.

---

## 📦 Recommended Model: `Qwen2.5-1.5B-Instruct Q4_K_M`

| Attribute | Value |
|---|---|
| Disk size | ~900 MB |
| RAM usage | ~1.2 GB at a small context size; the KV cache adds more as `--ctx-size` grows |
| Inference engine | llama.cpp `llama-server` (static binary, no install) |
| Tool calling | ✅ Native support |
| Context window | 32k tokens |
| CPU speed (4-core) | ~8–12 tok/s |
| License | Apache 2.0 ✅ redistributable |

This is the smallest model with **reliable tool-calling support**, which OpenClaw's agent runtime requires. Smaller models (0.5B) lack consistent JSON function-call formatting.
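
The RAM figure in the table depends on the configured context size, because llama.cpp allocates a KV cache sized by `--ctx-size`. The sketch below gives a rough estimate of that cache. It assumes an f16 cache (2 bytes per element) and architecture values of 28 layers, 2 KV heads, and a head dimension of 128; these are assumptions that should be checked against the model's `config.json` before relying on the numbers. `kv_cache_mib` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Rough KV-cache size estimate in MiB.
# Usage: kv_cache_mib LAYERS KV_HEADS HEAD_DIM CTX_TOKENS [BYTES_PER_ELEMENT]
kv_cache_mib() {
  local layers="$1" kv_heads="$2" head_dim="$3" ctx="$4" bytes="${5:-2}"
  # The leading 2 accounts for storing both keys and values per layer.
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 ))
}

# Assumed Qwen2.5-1.5B shape (verify against config.json):
kv_cache_mib 28 2 128 32768   # full 32k context
kv_cache_mib 28 2 128 4096    # a smaller 4k context
```

Under these assumptions the cache alone comes to roughly 896 MiB at 32k tokens and 112 MiB at 4k tokens, on top of the model weights. If memory is tight on the target machines, a smaller `--ctx-size` may be the more realistic default.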

---

## 📁 Proposed Directory Layout

Add the following structure inside the portable package (alongside existing `node/`, `config/`, `data/`):

```
openclaw-portable/
├── node/                    # existing
├── npm-global/              # existing  
├── config/                  # existing
├── data/                    # existing
├── llm/                     # NEW
│   ├── bin/
│   │   ├── llama-server-linux-x86_64      # ~8MB static binary
│   │   ├── llama-server-macos-arm64       # ~9MB
│   │   ├── llama-server-macos-x86_64      # ~9MB
│   │   └── llama-server-win32-avx2.exe    # ~10MB
│   ├── models/
│   │   └── qwen2.5-1.5b-instruct-q4_k_m.gguf   # ~900MB
│   └── server.log           # runtime, gitignored
├── start.sh                 # MODIFIED
├── start.bat                # MODIFIED
├── stop.sh                  # MODIFIED
└── stop.bat                 # MODIFIED
```

The `llm/models/` directory should be listed in `.gitignore` and distributed via GitHub Releases as a separate download or via the build script.

---

## 🔧 Implementation Details

### 1. `start.sh` — Add sidecar launch step (between step 2 and current step 3)

```bash
# ============================================
# [NEW] 3/6 Start bundled local model (llama-server)
# ============================================
echo -e "${BLUE}[3/6] Starting bundled local model...${NC}"

LLM_DIR="$USB_PATH/llm"
LLM_PORT=18080
LLM_PID_FILE="$LLM_DIR/server.pid"
LLM_LOG="$LLM_DIR/server.log"

# Detect platform binary
OS_TYPE="$(uname -s)"
ARCH_TYPE="$(uname -m)"

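case "$OS_TYPE" in
  Linux*)
    # Only an x86_64 Linux binary is bundled; other architectures skip the sidecar
    if [ "$ARCH_TYPE" = "x86_64" ]; then
      LLM_BIN="$LLM_DIR/bin/llama-server-linux-x86_64"
    else
      LLM_BIN=""
    fi ;;
  Darwin*)
    if [ "$ARCH_TYPE" = "arm64" ]; then
      LLM_BIN="$LLM_DIR/bin/llama-server-macos-arm64"
    else
      LLM_BIN="$LLM_DIR/bin/llama-server-macos-x86_64"
    fi ;;
  *) LLM_BIN="" ;;
esac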

LLM_MODEL="$LLM_DIR/models/qwen2.5-1.5b-instruct-q4_k_m.gguf"
LLM_BUNDLED_READY=0

if [ -f "$LLM_BIN" ] && [ -f "$LLM_MODEL" ]; then
  # Set the executable bit before launching (it may be missing after copying or unzipping)
  chmod +x "$LLM_BIN" 2>/dev/null
  # Check whether something is already listening on the port
  if ! lsof -i ":$LLM_PORT" -sTCP:LISTEN -t >/dev/null 2>&1; then
    # Leave one logical core free, but always use at least one thread
    THREADS=$(( $(nproc 2>/dev/null || sysctl -n hw.logicalcpu 2>/dev/null || echo 2) - 1 ))
    THREADS=$(( THREADS < 1 ? 1 : THREADS ))

    # Logging stays enabled so that server.log captures startup errors
    nohup "$LLM_BIN" \
      --model "$LLM_MODEL" \
      --port "$LLM_PORT" \
      --host 127.0.0.1 \
      --ctx-size 32768 \
      --threads "$THREADS" \
      --parallel 1 \
      -ngl 0 \
      >> "$LLM_LOG" 2>&1 &
    echo $! > "$LLM_PID_FILE"
    echo -e "${GREEN}✅ llama-server started (PID: $!, port: $LLM_PORT, threads: $THREADS)${NC}"
    echo -e "${YELLOW}   Loading model; the first response may take 5-15 seconds...${NC}"
    LLM_BUNDLED_READY=1
  else
    echo -e "${GREEN}✅ Bundled model already running (port: $LLM_PORT)${NC}"
    LLM_BUNDLED_READY=1
  fi
else
  echo -e "${YELLOW}⚠️  Bundled model not found, skipping (cloud APIs remain available)${NC}"
fi
```
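
The launch step marks the sidecar as ready as soon as the process is spawned, before the model has finished loading. A readiness wait could close that gap. The sketch below polls the `/health` endpoint that `llama-server` exposes; `llm_probe` and `wait_for_llm` are hypothetical helper names, and the retry count and delay are placeholder values.

```bash
#!/usr/bin/env bash
LLM_PORT="${LLM_PORT:-18080}"

# Succeeds once the server answers /health with a 2xx status.
# While the model is still loading, the request fails and we retry.
llm_probe() {
  curl -fsS -o /dev/null --max-time 2 "http://127.0.0.1:$LLM_PORT/health"
}

# Usage: wait_for_llm [MAX_TRIES] [DELAY_SECONDS]
wait_for_llm() {
  local max_tries="${1:-30}" delay="${2:-1}" try
  for (( try = 1; try <= max_tries; try++ )); do
    if llm_probe; then
      echo "ready after $try attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not ready after $max_tries attempt(s)" >&2
  return 1
}
```

In `start.sh` this might be called right after the PID file is written, for example `wait_for_llm 30 1 || LLM_BUNDLED_READY=0`, so that the config injection is skipped when the server never comes up.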

### 2. Inject model config into `openclaw.json` before gateway starts

Add this block **after** the config file is initialized but **before** `openclaw gateway start`:

```bash
# Inject the bundled model config when the sidecar is available
if [ "${LLM_BUNDLED_READY:-0}" -eq 1 ]; then
  BUNDLED_MODEL_CONFIG=$(cat <<'JSONEOF'
{
  "models": {
    "mode": "merge",
    "providers": {
      "bundled-local": {
        "baseUrl": "http://127.0.0.1:18080/v1",
        "apiKey": "bundled-no-key",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen2.5-1.5b",
            "name": "Qwen2.5 1.5B (Bundled CPU)",
            "contextWindow": 32768,
            "maxTokens": 4096,
            "cost": { "input": 0, "output": 0 }
          }
        ]
      }
    }
  }
}
JSONEOF
  )
  
  # Write the fragment to a temp file so the Node script below can read it
  INJECT_FILE="$TEMP_DIR/bundled-model-inject.json"
  echo "$BUNDLED_MODEL_CONFIG" > "$INJECT_FILE"

  # Always register the provider; set it as primary only if the user has none.
  # Paths go through the environment so quotes in them cannot break the script.
  MERGE_RESULT=$(CFG_PATH="$TEMP_DIR/openclaw.json" INJECT_PATH="$INJECT_FILE" node -e "
    const fs = require('fs');
    const { CFG_PATH, INJECT_PATH } = process.env;
    const cfg = fs.existsSync(CFG_PATH) ? JSON.parse(fs.readFileSync(CFG_PATH, 'utf8')) : {};
    const inject = JSON.parse(fs.readFileSync(INJECT_PATH, 'utf8'));

    // Shallow-merge providers: existing user providers are kept, bundled-local is added
    cfg.models = cfg.models || {};
    cfg.models.providers = Object.assign({}, cfg.models.providers, inject.models.providers);
    cfg.models.mode = 'merge';

    // Respect an existing primary model; otherwise default to the bundled one
    cfg.agents = cfg.agents || {};
    cfg.agents.defaults = cfg.agents.defaults || {};
    cfg.agents.defaults.model = cfg.agents.defaults.model || {};
    const hadPrimary = Boolean(cfg.agents.defaults.model.primary);
    if (!hadPrimary) cfg.agents.defaults.model.primary = 'bundled-local/qwen2.5-1.5b';

    fs.writeFileSync(CFG_PATH, JSON.stringify(cfg, null, 2));
    console.log(hadPrimary ? 'fallback' : 'default');
  ") || MERGE_RESULT="error"
  rm -f "$INJECT_FILE"

  case "$MERGE_RESULT" in
    default)  echo -e "${GREEN}   bundled-local/qwen2.5-1.5b registered as the default model${NC}" ;;
    fallback) echo -e "${CYAN}   Primary model already configured; bundled model registered as fallback${NC}" ;;
    *)        echo -e "${YELLOW}⚠️  Could not update openclaw.json; bundled model not registered${NC}" ;;
  esac
fi
```

### 3. `stop.sh` — Clean up llama-server on shutdown

```bash
# Kill bundled llama-server
LLM_PID_FILE="$USB_PATH/llm/server.pid"
if [ -f "$LLM_PID_FILE" ]; then
  LLM_PID=$(cat "$LLM_PID_FILE")
  if kill -0 "$LLM_PID" 2>/dev/null; then
    kill "$LLM_PID"
    echo -e "${GREEN}✅ Bundled model stopped (PID: $LLM_PID)${NC}"
  fi
  rm -f "$LLM_PID_FILE"
fi
```
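
The cleanup above sends a single `SIGTERM` and does not wait for the process to exit. If a stuck server turns out to be a problem in practice, a variant that waits and then escalates might look like the following sketch. `stop_pid_gracefully` is a hypothetical helper name and the 5-second default timeout is an arbitrary placeholder.

```bash
#!/usr/bin/env bash
# Usage: stop_pid_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM, waits up to TIMEOUT_SECONDS, then falls back to SIGKILL.
stop_pid_gracefully() {
  local pid="$1" timeout="${2:-5}" waited=0
  kill -0 "$pid" 2>/dev/null || return 0   # process is already gone
  kill "$pid" 2>/dev/null
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      kill -9 "$pid" 2>/dev/null           # last resort after the timeout
      break
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

In `stop.sh` it could replace the bare `kill "$LLM_PID"` call. Since a stale PID file may point at an unrelated process after a reboot, checking the process name before killing would be a sensible extra guard.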

### 4. `start.bat` — Windows equivalent (batch snippet)

```batch
REM === Start bundled llama-server ===
SET "LLM_BIN=%USB_PATH%\llm\bin\llama-server-win32-avx2.exe"
SET "LLM_MODEL=%USB_PATH%\llm\models\qwen2.5-1.5b-instruct-q4_k_m.gguf"
SET "LLM_PORT=18080"
SET "LLM_BUNDLED_READY=0"

REM Leave one logical core free. This is computed outside any IF block because
REM %VAR% inside a parenthesized block is expanded when the block is parsed.
SET /A THREADS=%NUMBER_OF_PROCESSORS%-1
IF %THREADS% LSS 1 SET "THREADS=1"

IF NOT EXIST "%LLM_BIN%" GOTO :llm_missing
IF NOT EXIST "%LLM_MODEL%" GOTO :llm_missing

REM Only treat the port as taken if something is LISTENING on it
netstat -ano | findstr /C:":%LLM_PORT% " | findstr /C:"LISTENING" >nul 2>&1
IF ERRORLEVEL 1 (
    START /B "" "%LLM_BIN%" --model "%LLM_MODEL%" --port %LLM_PORT% --host 127.0.0.1 --ctx-size 32768 --threads %THREADS% --parallel 1 -ngl 0 >> "%USB_PATH%\llm\server.log" 2>&1
    ECHO [OK] llama-server started on port %LLM_PORT%
) ELSE (
    ECHO [OK] Bundled model already running
)
SET "LLM_BUNDLED_READY=1"
GOTO :llm_done

:llm_missing
ECHO [WARN] Bundled model not found, skipping

:llm_done
```

---

## 📥 Model Distribution Strategy

The ~900MB GGUF file should NOT be committed to git. Recommended approach:

**Option A (recommended): download at build time, ship the package as a GitHub Releases attachment**
```bash
# In build-offline-package.sh, add:
download_bundled_model() {
  local model_url="https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_k_m.gguf"
  local model_dir="$SCRIPT_DIR/llm/models"
  local model_file="$model_dir/qwen2.5-1.5b-instruct-q4_k_m.gguf"
  mkdir -p "$model_dir"
  if [ ! -f "$model_file" ]; then
    echo "Downloading bundled model (~900MB)..."
    # Download to a .part file so an interrupted transfer is never mistaken
    # for a complete model; -f makes curl fail on HTTP errors
    curl -fL --progress-bar -o "$model_file.part" "$model_url" \
      && mv "$model_file.part" "$model_file"
  fi
}
```

**Option B: First-run auto-download**
Add a `setup-llm.sh` script that users run once to download the model into `llm/models/`. The `start.sh` gracefully degrades if the model is absent.

**`.gitignore` additions:**
```
llm/models/*.gguf
llm/bin/llama-server*
llm/server.log
llm/server.pid
```
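
Whichever option is chosen, a checksum check after download would guard against truncated or corrupted model files on the USB drive. A minimal sketch follows; `sha256_of` and `verify_model` are hypothetical helper names, and the expected hash is assumed to be recorded by the build script rather than hard-coded here.

```bash
#!/usr/bin/env bash
# Print the SHA-256 of a file, using whichever tool the platform provides
# (sha256sum on most Linux systems, shasum on macOS).
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# Usage: verify_model FILE EXPECTED_SHA256
# Returns 0 when the file exists and its hash matches, 1 otherwise.
verify_model() {
  local file="$1" expected="$2" actual
  if [ ! -f "$file" ]; then
    echo "missing: $file" >&2
    return 1
  fi
  actual="$(sha256_of "$file")"
  if [ "$actual" != "$expected" ]; then
    echo "checksum mismatch for $file" >&2
    return 1
  fi
  return 0
}
```

`start.sh` could call `verify_model "$LLM_MODEL" "$EXPECTED_SHA256"` once on first run and skip the sidecar on failure. Hashing a ~900 MB file from a slow USB drive takes a while, so running it on every start is probably not worth it.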

---

## ⚠️ Known Risks & Mitigations

| Risk | Mitigation |
|---|---|
| Cold-start latency (5-15s model load) | Show explicit "loading" message; start llama-server **before** gateway |
| Port 18080 conflict | Check with `lsof`/`netstat` before starting; fall back to 18081 |
| AVX2 not supported (old CPUs) | Detect with `grep avx2 /proc/cpuinfo`; warn user and skip |
| USB read speed bottleneck | Copy model to `$TEMP_DIR` on first run if USB speed < 50MB/s |
| Context overflow (32k) | Enable OpenClaw compaction in injected config: `"compaction": { "enabled": true }` |
| User already has API keys | Respect existing `agents.defaults.model.primary`; register bundled as fallback only |
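
Two of the mitigations in the table, the port fallback and the AVX2 check, can be sketched as small helpers. The names `port_in_use`, `pick_llm_port`, and `has_avx2` are hypothetical, and the AVX2 check only works on Linux because it reads `/proc/cpuinfo`.

```bash
#!/usr/bin/env bash
# True when something is already listening on the given TCP port.
port_in_use() {
  lsof -i ":$1" -sTCP:LISTEN -t >/dev/null 2>&1
}

# Usage: pick_llm_port PORT [PORT...]
# Prints the first free port from the candidates, or fails if none is free.
pick_llm_port() {
  local port
  for port in "$@"; do
    if ! port_in_use "$port"; then
      echo "$port"
      return 0
    fi
  done
  return 1
}

# Usage: has_avx2 [CPUINFO_FILE]
# Linux only: looks for the avx2 flag as a whole word in the CPU flags.
has_avx2() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  [ -r "$cpuinfo" ] && grep -qw avx2 "$cpuinfo"
}
```

A caller might use `LLM_PORT="$(pick_llm_port 18080 18081)" || LLM_PORT=""`. Note that the injected `baseUrl` hard-codes port 18080, so a fallback port would also need to be written into the provider config.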

---

## 🪜 Suggested Implementation Phases

- [ ] **Phase 1** — `start.sh`/`stop.sh`: Add llama-server sidecar lifecycle (no model bundled yet, just the plumbing)
- [ ] **Phase 2** — `build-offline-package.sh`: Add `download_bundled_model()` step; download llama binaries for all 4 platforms
- [ ] **Phase 3** — Config injection: Auto-register `bundled-local` provider and set as default model in `openclaw.json`
- [ ] **Phase 4** — `start.bat`/`stop.bat`: Windows parity
- [ ] **Phase 5** — `setup-llm.sh`: Standalone first-time model download helper
- [ ] **Phase 6** — Update `README.md` and `OFFLINE-GUIDE.md` with bundled model section

---

## 💡 Why This Matters

This feature would make `openclaw-portable` a plug-and-run AI assistant on a USB drive that:
1. Requires **zero internet** after initial setup
2. Requires **zero pre-installed software** (no Ollama, no Python, no Docker)
3. Has **zero API cost** for basic usage
4. Works on every platform with a bundled binary (x86-64 Linux, AVX2-capable x86-64 Windows, Intel and Apple Silicon macOS)

The existing infrastructure (`install-models.js` merge logic, `start.sh` modular step structure, `data/.openclaw/` config path) makes this integration very clean — this is essentially adding one new service to an already well-designed process manager.

---

*Interested in contributing a draft PR for Phase 1 if the maintainer approves the direction.*
