Problem
Zeph currently uses regex/rule-based heuristics in several critical subsystems. These have well-known failure modes: they lack context awareness, are brittle to paraphrasing, require manual pattern maintenance, and miss semantic variants.
Affected subsystems:
- FeedbackDetector (`detector_mode = "regex"`) — detects user corrections/disagreements for skill learning
- Content injection detection — `flag_injection_patterns` regex list in SecurityConfig
- PII filter — email/phone/SSN/credit card regex patterns
- `redact_sensitive()` — credential pattern detection (sk-, AKIA, ghp_, Bearer)
- ACON compression failure detection — UNCERTAINTY_PATTERNS + PRIOR_CONTEXT_PATTERNS
- Output filter SecurityPatterns — 17 LazyLock regexes in tools
Strategic Direction
Replace all heuristic regex classifiers with specialized lightweight models via Candle (HuggingFace). Large LLMs (GPT-4, Claude Opus) should be reserved for complex reasoning and planning. Classification/detection tasks go to dedicated small models.
Architecture principle: every subsystem that currently calls `regex.is_match()` on LLM input/output should instead call a `ClassifierProvider` backed by a Candle model (or zero-shot gpt-4o-mini as fallback). The provider pattern is already established via `[[llm.providers]]` and `*_provider` fields.
Research Findings (CI-178)
Four-tier hierarchy from 2024–2026 literature:
| Tier | Approach | Latency | Best for |
|------|----------|---------|----------|
| 1 | Regex pre-filter | <1ms | High-recall first pass (keep as fallback) |
| 2 | Linear probe on LLM activations | <10ms | Injection, PII when host model available |
| 3 | Fine-tuned small transformer via Candle | 50–200ms | All tasks with HuggingFace models |
| 4 | Zero-shot gpt-4o-mini prompt | 200ms+ | Cold-start, no labeled data available |
Recommended Models (HuggingFace / Candle-compatible)
| Task | Model | Size | Notes |
|------|-------|------|-------|
| Injection detection | mDeBERTa-v3-base-prompt-injection-v2 | 280MB | Multi-label, production-grade |
| Content safety | Llama-Guard-3-1B | 1B | Meta canonical agent safeguard |
| PII detection | piiranha-v1-detect-personal-information | 300MB | BERT-based, 300k+ downloads |
| Feedback/correction | zero-shot gpt-4o-mini | API | No labeled dataset; bootstrap with synthetic data |
| General safety (lightweight) | DeBERTa-v3 + LEC head (arXiv:2412.13435) | 0.5–3B | Qwen 0.5B backbone, fast after warmup |
Key Papers
- arXiv:2412.13435 — LEC (Layer Enhanced Classification): logistic regression on intermediate transformer layer activations, surpasses GPT-4o with <100 examples, Qwen 0.5B–3B backbone. Most applicable to Zeph.
- arXiv:2510.14005 — PIShield: linear probe on residual stream, no fine-tuning, covers injection detection
- arXiv:2510.07551 — RECAP Hybrid PII: regex fast path + LLM second pass, 82% better than NER-only
- arXiv:2312.06674 — Llama Guard 3: Meta production safety classifier for agent I/O
- arXiv:2509.23994 — AI Agent Code of Conduct: policy-as-prompt enforcement via LLM classifier
- arXiv:2510.09781 — Safiron: pre-execution guardian model for agentic plans
Implementation Plan
Phase 1 — ClassifierProvider abstraction
- Add `ClassifierProvider` trait to `zeph-core` with `classify(text) -> ClassificationResult`
- Candle backend: load ONNX/safetensors from HuggingFace cache
- Fallback: zero-shot LLM via existing `[[llm.providers]]`
- Config: `[classifiers]` section with `injection_provider`, `pii_provider`, `feedback_provider`, `safety_provider`
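A possible shape for the trait and a keyword-based fallback provider, sketched with std only (the real signatures — async, error type, label set — are still to be decided, and substring matching stands in for the regex fallback to stay dependency-free):

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct ClassificationResult {
    pub label: String, // e.g. "injection", "pii", "clean"
    pub score: f32,    // confidence in [0, 1]
}

pub trait ClassifierProvider {
    fn classify(&self, text: &str) -> ClassificationResult;
}

/// Fallback provider kept for offline/timeout paths; a Candle-backed
/// provider would implement the same trait.
pub struct KeywordFallback {
    pub patterns: Vec<&'static str>,
    pub label: String,
}

impl ClassifierProvider for KeywordFallback {
    fn classify(&self, text: &str) -> ClassificationResult {
        let hit = self.patterns.iter().any(|p| text.contains(p));
        ClassificationResult {
            label: if hit { self.label.clone() } else { "clean".into() },
            score: if hit { 1.0 } else { 0.0 },
        }
    }
}
```

Keeping the fallback behind the same trait means the calling subsystems never branch on backend: swapping regex for a model is a config change, not a code change.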
Phase 2 — FeedbackDetector migration
- Replace `detector_mode = "regex"` with `detector_mode = "model"`
- Zero-shot gpt-4o-mini or a fine-tuned small model; keep regex as offline fallback
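A config sketch for this migration, following the existing `detector_mode` / `*_provider` conventions — the section name, provider id, and `fallback_mode` key are hypothetical, not final:

```toml
# Hypothetical [skills.feedback] section — key names are illustrative.
[skills.feedback]
detector_mode = "model"                      # was "regex"
feedback_provider = "gpt-4o-mini-zeroshot"   # resolved via [[llm.providers]]
fallback_mode = "regex"                      # used offline or on classifier timeout
```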
Phase 3 — Injection detection
- Replace the `flag_injection_patterns` regex list with mDeBERTa-v3-base-prompt-injection-v2 via Candle
- A score threshold replaces the pattern list
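The decision point shrinks accordingly: instead of N patterns to maintain, there is one tunable number. A minimal sketch (the 0.8 default is an assumption, not a measured operating point):

```rust
/// Flag input as an injection attempt when the classifier score meets
/// the configured threshold — no pattern list to maintain.
fn flag_injection(model_score: f32, threshold: f32) -> bool {
    model_score >= threshold
}
```

The threshold should eventually be calibrated against a labeled validation set rather than hand-picked.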
Phase 4 — PII filter
- Hybrid: keep regex fast path, add piiranha/NER second pass for contextual PII
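The hybrid shape in code: a cheap structural fast path catches fixed-format PII, and a model hook covers contextual PII the fast path cannot express. The email check below is deliberately simplified and the second pass is a stub standing in for piiranha via Candle:

```rust
/// Fast path: cheap structural check for something@domain.tld,
/// standing in for the existing email regex.
fn fast_path_email(text: &str) -> bool {
    text.split_whitespace().any(|tok| {
        let mut parts = tok.splitn(2, '@');
        match (parts.next(), parts.next()) {
            (Some(user), Some(domain)) => !user.is_empty() && domain.contains('.'),
            _ => false,
        }
    })
}

/// Stand-in for the piiranha/NER second pass over contextual PII
/// (names, addresses, etc. that have no fixed surface form).
fn model_second_pass(_text: &str) -> bool {
    false // a real implementation would run the Candle model here
}

fn contains_pii(text: &str) -> bool {
    // Fast path short-circuits; the model only runs on text
    // the structural checks could not decide.
    fast_path_email(text) || model_second_pass(text)
}
```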
Notes
- Candle backend (`zeph-candle` crate) already exists but is underutilized — large selection of ready HuggingFace models
- All classifier calls must be async with configurable timeouts; fall back to regex on timeout
- Models are cached in `~/.cache/zeph/classifiers/` on first use
- Privacy: classifier models MUST run locally via Candle by default — no external API for PII/injection data
- Future direction: many specialized lightweight models per task > one large LLM for everything