Local AI server with persistent memory. Zero cloud. Full control.
I've reached the minimum viable product for the real world — but feedback is still missing. 🚀
Documentation · Install · Architecture · Releases
v1.0.6 — Security hardening. Server Nexe now ships as a Tauri v2 desktop application with onboarding wizard, system tray, and automatic sidecar management. Available as macOS DMG (Apple Silicon) and Linux AppImage (ARM64). See Releases.
Linux note: tested on Ubuntu 24.04 ARM64 virtual machines (UTM). CPU inference (Ollama) verified. If you test on native hardware or with GPU acceleration, please open an issue with your results.
- The Story
- Screenshots
- Why Server Nexe?
- Quick Start
- Backends
- Available Models by RAM Tier
- Architecture
- Plugin System
- AI-Ready Documentation
- Security
- Platform Support
- Requirements
- Testing
- Roadmap
- Limitations
- Contributing
- Acknowledgments
- Disclaimer
Server Nexe started as a learning-by-doing experiment: "What would it take to have your own local AI with persistent memory?" Since I wasn't going to build an LLM, I started picking up pieces to assemble a useful lego for myself and my day-to-day work. One thing led to another — inference backends, RAG pipelines, vector search, plugin systems, security layers, a web UI, an installer with hardware detection.
This entire project — code, tests, audits, documentation — has been built by one person orchestrating different AI models, both local (MLX, Ollama) and cloud (Claude, GPT, Gemini, DeepSeek, Qwen, Grok...), as collaborators. The human decides what to build, designs the architecture, reviews lines and runs tests. The AIs write, audit, and stress-test under human direction.
What began as a prototype has turned into a genuinely useful product: 6776 tests, security audits, encryption at rest, a macOS installer with hardware detection, and a plugin system. It's not done — there's a roadmap full of ideas — but it already does what it set out to do: run an AI server on your machine, with memory that persists, and zero data leaving your device.
This is not trying to compete with ChatGPT or Claude. But it can be complementary for less demanding tasks. It's an open-source tool for people who want to own their AI infrastructure. Built by one person in Barcelona, with AI as co-pilot, music, and stubbornness.
More technically: what was a giant spaghetti monster ended up distilling, refactor after refactor, into a minimal, backend-agnostic (MLX / llama.cpp / Ollama), modular core — where security and memory are solved at the base so building on top is fast and comfortable, in human–AI collaboration. Whether that worked is for the community to say (the AI says yes, but what did you expect 🤪).
Web UI — light mode |
Web UI — dark mode |
Your conversations, documents, embeddings, and model weights stay on your machine. Always. Server Nexe combines LLM inference with a persistent RAG memory system — your AI remembers context across sessions, indexes your documents, and never phones home.
|
Every conversation, document, and embedding stays on your device at runtime. No telemetry, no cloud calls during operation, no server that phones home. Initial install downloads the chosen LLM and the |
Remembers context across sessions using Qdrant vector search with 768-dimensional embeddings across 3 specialized collections. Ingest documents, recall knowledge. |
|
The model extracts facts from conversations automatically — names, jobs, preferences, projects — and stores them to memory inside the same LLM call, with zero extra latency. Trilingual intent detection (ca/es/en), semantic deduplication, and deletion by voice ("forget that..."). |
Switch between MLX (Apple Silicon native), llama.cpp (GGUF, universal), or Ollama — one config change, same OpenAI-compatible API. |
|
Auto-discovered plugins with independent manifests. Security, web UI, RAG, backends — everything is a plugin. Add capabilities without touching the core. NexeModule protocol with duck typing, no inheritance. |
Tauri v2 desktop application for macOS (DMG) and Linux (AppImage). Onboarding wizard detects your hardware, picks the right backend, recommends models for your RAM, and gets you running in minutes. System tray, native menus, and automatic sidecar management. |
|
Upload |
6776 tests (~85% coverage), security audits, i18n in 3 languages, comprehensive API. What started as an experiment is being built with production practices. |
Download the latest package from Releases:
| Platform | Package | Size |
|---|---|---|
| macOS (Apple Silicon) | nexe-app_1.0.6_aarch64.dmg |
~1.3 GB |
| Linux (ARM64) | nexe-app_1.0.6_aarch64.AppImage |
~1.2 GB |
The onboarding wizard handles everything: hardware detection, backend selection, model download, and configuration. The app runs server-nexe as a sidecar process with system tray integration.
git clone https://github.com/jgoy-labs/server-nexe.git
cd server-nexe
./setup.sh # guided installation (detects hardware, picks backend & model)
nexe go # start server on port 9119Once running:
nexe chat # interactive chat (RAG memory on by default)
nexe memory store "Barcelona is the capital of Catalonia"
nexe memory recall "capital Catalonia"
nexe status # system statuspython -m installer.install_headless --backend ollama --model qwen3.5:latest
nexe goEndpoints at http://localhost:9119:
| Endpoint | Description |
|---|---|
/v1/chat/completions |
OpenAI-compatible chat API |
/ui |
Web UI (chat, file upload, sessions) |
/health |
Health check |
/docs |
Interactive API documentation (Swagger) |
Authentication via
X-API-Keyheader. Key is generated during installation and stored in.env.
| Backend | Platform | Best for |
|---|---|---|
| MLX | macOS (Apple Silicon) | Recommended for Mac — native Metal GPU acceleration, fastest on M-series |
| llama.cpp | macOS / Linux | Universal — GGUF format, Metal on Mac, CPU/CUDA on Linux |
| Ollama | macOS / Linux | Bridge to existing Ollama installations, easiest model management |
The installer auto-detects your hardware and recommends the best backend. You can switch anytime in personality/server.toml.
The installer organizes the 14 catalog models by the RAM available on your machine (4 tiers):
| Tier | Models | Origin |
|---|---|---|
| 8 GB | Qwen3.5 4B | Alibaba |
| 16 GB | Qwen3.5 9B, Gemma 4 E4B, Mistral Nemo 12B, Salamandra 7B | Alibaba, Google, Mistral AI, BSC/AINA |
| 24 GB | Qwen3.5 27B, Gemma 4 31B, Mistral Small 3.2 24B, GPT-OSS 20B | Alibaba, Google, Mistral AI, OpenAI |
| 32 GB | Qwen3.5 35B-A3B, Gemma 4 31B, Mixtral 8x7B, DeepSeek R1 32B, ALIA-40B | Alibaba, Google, Mistral AI, DeepSeek, BSC (Barcelona Supercomputing Center) |
In addition, you can use any Ollama model by name or any GGUF model from Hugging Face.
server-nexe/
├── core/ # FastAPI server, endpoints, CLI, config, metrics, resilience
│ ├── endpoints/ # REST API (v1 chat, health, status, system, installer)
│ ├── cli/ # CLI commands & i18n (ca/es/en)
│ └── resilience/ # Circuit breaker, rate limiting
├── personality/ # Module manager, plugin discovery, server.toml
│ ├── loading/ # Plugin loading pipeline (find, validate, import, lifecycle)
│ └── module_manager/ # Discovery, registry, config, sync
├── memory/ # Embeddings, RAG engine, vector memory, document ingestion
│ ├── embeddings/ # Chunking, embedding generation
│ ├── rag/ # Retrieval-augmented generation pipeline
│ └── memory/ # Persistent vector store (Qdrant)
├── plugins/ # Auto-discovered plugin modules
│ ├── mlx_module/ # MLX backend (Apple Silicon)
│ ├── llama_cpp_module/ # llama.cpp backend (GGUF)
│ ├── ollama_module/ # Ollama bridge
│ ├── security/ # Auth, injection detection, CSRF, rate limiting, input sanitization
│ └── web_ui_module/ # Browser-based chat UI with file upload
├── installer/ # Guided installer, headless mode, hardware detection, model catalog
├── knowledge/ # Indexed documentation for RAG (ca/es/en)
└── tests/ # Integration & e2e test suites
flowchart LR
A[Request] --> B[Auth<br/>X-API-Key]
B --> C[Rate Limit<br/>slowapi]
C --> D[validate_string_input<br/>context parameter]
D --> E[RAG Recall<br/>3 collections]
E --> F[_sanitize_rag_context<br/>injection filter]
F --> G[LLM Inference<br/>MLX/Ollama/llama.cpp]
G --> H[Stream Response<br/>SSE markers]
H --> I[MEM_SAVE Parsing<br/>fact extraction]
I --> J[Response<br/>to client]
Server Nexe uses a duck typing protocol (NexeModule Protocol) — no class inheritance, no BasePlugin. Each plugin is a directory under plugins/ with a manifest.toml and a module.py.
5 active plugins:
| Plugin | Type | Key features |
|---|---|---|
| mlx_module | LLM Backend | Apple Silicon native, prefix caching (trie), Metal GPU |
| llama_cpp_module | LLM Backend | Universal GGUF, LRU ModelPool, CPU/GPU |
| ollama_module | LLM Backend | HTTP bridge to Ollama, auto-start, VRAM cleanup |
| security | Core | Dual-key auth, 6 injection detectors + NFKC, 47 jailbreak patterns, rate limiting, RFC5424 audit logging |
| web_ui_module | Interface | Web chat, sessions, document upload, MEM_SAVE, RAG sanitization, i18n |
The knowledge/ folder contains 15 thematic documents × 3 languages = 45 files, structured with YAML frontmatter for RAG ingestion:
API, Architecture, Use Cases, Errors, Identity, Installation, Limitations, Plugins, RAG, README, Security, Testing, Threat Model, Usage.
Point any AI assistant at this repo and it can understand the complete architecture.
| Language | Link |
|---|---|
| English | knowledge/en/README.md |
| Catalan | knowledge/ca/README.md |
| Spanish | knowledge/es/README.md |
Server Nexe includes a security module enabled by default:
- API key authentication on all endpoints
- CSP headers (
script-src 'self'withoutunsafe-inline;style-src 'self' 'unsafe-inline'for Web UI) - CSRF protection with token validation
- Rate limiting per IP
- Input sanitization — 6 injection detectors + Unicode normalization
- Jailbreak detection — 47 pattern speed-bump detector
- Upload denylist — blocks accidental upload of API keys, PEM keys
- Memory injection protection — tag stripping on all input paths
- RAG injection sanitization —
[MEM_SAVE:],[MEM_DELETE:],[OLVIDA|OBLIT|FORGET:],[MEMORIA:]neutralized at ingest and retrieval (v0.9.9) - Pipeline enforcement — all chat through canonical endpoints only
- Encryption at rest — AES-256-GCM, SQLCipher. Default
auto: encrypted whensqlcipher3is available (the DMG bundles it), otherwise plaintext with a startupWARNING. SetNEXE_ENCRYPTION_ENABLED=truefor strict fail-closed mode (v0.9.2+) - Trusted host middleware
Note: This project has not been tested in production with real users. Security testing has been performed by AI, not by professional auditors. See SECURITY.md for full disclosure and vulnerability reporting.
| Platform | Status | Backends |
|---|---|---|
| macOS Apple Silicon (M1+) | Supported — all 3 backends | MLX, llama.cpp, Ollama |
| macOS Intel | Not supported since v0.9.9 | — |
| macOS 13 Ventura or earlier | Not supported since v0.9.9 (requires macOS 14 Sonoma+) | — |
| Linux ARM64 | Supported — AppImage + Ollama, tested on VM | Ollama |
| Linux x86_64 | Supported (Ollama, CPU) — unit tests pass | Ollama, llama.cpp |
| Windows | In development (no public ETA) | — |
Since v0.9.9, server-nexe requires macOS 14 Sonoma+ with Apple Silicon (M1 or later). The pre-built wheels in the DMG are
arm64exclusive. Linux is supported with the Ollama backend (CPU). Tested on Ubuntu 24.04 ARM64 VM. Native hardware validation on the roadmap.
| Minimum | Recommended | |
|---|---|---|
| OS | macOS 14 Sonoma (Apple Silicon only) | macOS 14+ (Apple Silicon) |
| CPU | Apple Silicon M1 | Apple Silicon M2 / M3 / M4 |
| Python | 3.11+ | 3.12+ |
| RAM | 8 GB | 16 GB+ (for larger models) |
| Disk | 10 GB free | 20 GB+ free |
Intel Macs and macOS 13 Ventura are no longer supported. Apple Silicon only (arm64). Linux: Supported with the Ollama backend (CPU). Tested on Ubuntu 24.04 ARM64 VM. Native hardware validation on the roadmap.
6776 tests collected (of 6991 total, 215 deselected by default markers) with ~85% code coverage. CI runs the full suite on every push.
# Unit tests
pytest core memory personality plugins -m "not integration and not e2e and not slow" \
--cov=core --cov=memory --cov=personality --cov=plugins \
--cov-report=term --tb=short -q
# Integration tests (requires Ollama running)
NEXE_AUTOSTART_OLLAMA=true pytest -m "integration" -qServer Nexe is actively developed. Here's what's coming:
- Persistent memory with RAG (v0.9.0)
- Encryption at rest — AES-256-GCM (v0.9.0)
- macOS code signing & notarization (v0.9.0)
- Security hardening — jailbreak detection, upload denylist, pipeline enforcement (v0.9.1)
- Encryption default
auto; strict fail-closed viaNEXE_ENCRYPTION_ENABLED=true(v0.9.2) - Embeddings on ONNX (
fastembed), PyTorch removed (v0.9.3) - Multimodal VLM — 4 backends (Ollama, MLX, llama.cpp, Web UI) (v0.9.7)
- Precomputed KB embeddings (~10.7x faster startup) (v0.9.8)
- RAG injection sanitization (MEM tags neutralized at ingest and retrieval) (v0.9.9)
- Offline install bundle — all wheels + embedding model in DMG (~1.2 GB, post-v0.9.9)
- Thinking toggle endpoint —
PATCH /session/{id}/thinking(post-v0.9.9) - Desktop app (Tauri v2) — macOS DMG + Linux AppImage, onboarding wizard, system tray (v1.0.6)
- Configurable inference parameters via UI
- Community forum
See CHANGELOG.md for version history.
Honest disclosure of what server Nexe does not do or does not do well:
- Local models < cloud — Local models are less capable than GPT-4 or Claude. That's the trade-off for privacy.
- RAG is not perfect — Homonymy, negations, cold start (empty memory), and contradictory information across time periods.
- Partially OpenAI-compatible API —
/v1/chat/completionsworks. Missing:/v1/embeddings,/v1/models, and function calling. - Single user — Mono-user by design. No multi-device sync, no accounts.
- No fine-tuning — You cannot train or fine-tune models.
- New encryption — Added in v0.9.0 (default
autosince v0.9.2; strict fail-closed only whenNEXE_ENCRYPTION_ENABLED=true). Not battle-tested. If you lose the master key, data cannot be recovered (see MEK fallback: file → keyring → env → generate). - Single developer, single real user — Personal open-source project, not an enterprise product.
See knowledge/en/LIMITATIONS.md for full detail.
See CONTRIBUTING.md for setup instructions and guidelines.
server-nexe is built on the shoulders of these amazing open-source projects:
AI & Inference
- MLX — Apple Silicon native ML framework
- llama.cpp — Efficient GGUF model inference
- Ollama — Local model management and serving
- fastembed — ONNX-based text embeddings (replaced
sentence-transformerssince v0.9.3, saves ~600 MB) - sentence-transformers — Historical: original embedding backend, replaced by
fastembedin v0.9.3 - Hugging Face — Model hub and transformers library
Desktop App
- Tauri v2 — Cross-platform desktop framework (Rust + WebView)
Infrastructure
- Qdrant — Vector search engine powering RAG memory
- FastAPI — High-performance async web framework
- Uvicorn — Lightning-fast ASGI server
- Pydantic — Data validation
Tools & Libraries
- Rich — Beautiful terminal formatting
- marked.js — Markdown rendering in web UI
- PyPDF — PDF text extraction for RAG
- rumps — macOS menu bar integration
Security & Monitoring
- Prometheus — Metrics and monitoring
- SlowAPI — Rate limiting
Also built with: Python, NumPy, httpx, tenacity, Click, Typer, Colorama, python-dotenv, PyYAML, toml, structlog, starlette-csrf, python-multipart, psutil, PyObjC, and Linux.
20% of Enterprise sponsorships go directly to supporting these projects.
Built with AI collaboration · Barcelona
This software is provided "as is", without warranty of any kind. Use it at your own risk. The author is not responsible for any damage, data loss, security incidents, or misuse arising from the use of this software.
See LICENSE for details.
Version 1.0.6 · Apache 2.0 · Made by Jordi Goy in Barcelona

