## Goal
Give deslop a structured input so it spends LLM tokens only where slop is likely. Two outputs:
- `query slop-fixes` — pinpoint structured fix actions (Haiku-tier: confirm shape, apply). AST/graph signals only, no embedder needed.
- `query slop-targets` — ranked file/module targets with a `tier` field (`sonnet` = file-level, `opus` = cross-file/module). Uses embeddings if present, degrades gracefully to AST/graph signals.
## New crate: `analyzer-embed`
Separate Cargo crate in this workspace, with a separate binary `agent-analyzer-embed`. Heavy deps (`ort`, `tokenizers`) are isolated here; the main `agent-analyzer` binary stays small.
Two model variants, picked at install time (skill prompts user):
| Variant | Model | Size | Use case |
|---------|-------|------|----------|
| small | BAAI/bge-small-en-v1.5 Q8 | ~30 MB | English-only, weaker on code, cheap |
| big | google/embeddinggemma-300m Q4 (ONNX) | ~195 MB | SOTA <500M params, code-aware, multilingual, Matryoshka |
Released as 5 platforms × 2 sizes = 10 release assets. Model file downloaded + cached next to binary, content-hashed.
Granularity picked at install time:
- compact: per-file × 128 dim
- balanced: per-function × 256 dim
- maximum: per-function × 768 dim
Storage: sidecar `.claude/repo-intel.embeddings.bin` (packed int8/fp16). Main `repo-intel.json` stays diffable.
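A minimal sketch of what the int8 packing could look like; the per-vector scale-plus-bytes layout and the function names here are assumptions, not the shipped sidecar format:

```rust
// Per-vector symmetric int8 quantization: store one f32 scale plus one
// i8 per dimension (assumed layout, ~4x smaller than raw f32 vectors).
fn quantize_i8(v: &[f32]) -> (f32, Vec<i8>) {
    let max = v.iter().fold(0.0_f32, |m, x| m.max(x.abs()));
    let scale = if max == 0.0 { 1.0 } else { max / 127.0 };
    let q = v.iter().map(|x| (x / scale).round() as i8).collect();
    (scale, q)
}

fn dequantize_i8(scale: f32, q: &[i8]) -> Vec<f32> {
    q.iter().map(|&x| x as f32 * scale).collect()
}

fn main() {
    let v = [0.1_f32, -0.5, 0.9, 0.0];
    let (scale, q) = quantize_i8(&v);
    let back = dequantize_i8(scale, &q);
    // round-trip error is bounded by half a quantization step per dimension
    for (a, b) in v.iter().zip(&back) {
        assert!((a - b).abs() <= scale / 2.0 + 1e-6);
    }
}
```

fp16 halves storage with less precision loss; int8 with a per-vector scale is the more aggressive option of the two named above.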
## Subcommands on `agent-analyzer-embed`
- `scan` — full re-embed of all files, JSON to stdout
- `update` — delta-only: read existing sidecar, hash files, re-embed changed/added, drop removed
- `version`
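The `update` delta pass can be sketched as a hash diff against the state recorded in the sidecar; the `Delta` type and the path-to-hash input shape below are illustrative assumptions:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Delta { Added, Changed, Removed }

// Compare current content hashes against the hashes stored in the sidecar.
// Only Added/Changed paths need re-embedding; Removed paths are dropped.
fn diff(sidecar: &HashMap<String, u64>, current: &HashMap<String, u64>) -> Vec<(String, Delta)> {
    let mut out = Vec::new();
    for (path, hash) in current {
        match sidecar.get(path) {
            None => out.push((path.clone(), Delta::Added)),
            Some(old) if old != hash => out.push((path.clone(), Delta::Changed)),
            _ => {} // unchanged: keep the existing vector as-is
        }
    }
    for path in sidecar.keys() {
        if !current.contains_key(path) {
            out.push((path.clone(), Delta::Removed));
        }
    }
    out.sort_by(|a, b| a.0.cmp(&b.0));
    out
}

fn main() {
    let sidecar = HashMap::from([("a.rs".to_string(), 1u64), ("b.rs".to_string(), 2)]);
    let current = HashMap::from([("a.rs".to_string(), 1u64), ("c.rs".to_string(), 3)]);
    let d = diff(&sidecar, &current);
    assert_eq!(d, vec![("b.rs".to_string(), Delta::Removed), ("c.rs".to_string(), Delta::Added)]);
}
```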
## New subcommands on main `agent-analyzer` binary
- `set-embeddings --input -` — accept JSON via stdin, merge into sidecar (mirrors existing `set-descriptors` / `set-summary` pattern)
## New queries
### `query slop-fixes`
Returns structured fix actions for Haiku to apply:
```json
{
  "fixes": [
    {"action": "delete-file", "path": "debug.log", "reason": "tracked log artifact"},
    {"action": "delete-lines", "path": "src/auth.ts", "lines": [142, 158], "reason": "orphan export legacyHash — 0 importers"},
    {"action": "delete-lines", "path": "src/api.ts", "lines": [88, 90], "reason": "empty catch block"},
    {"action": "remove-dep", "manifest": "package.json", "name": "lodash.merge", "reason": "no import resolves"},
    {"action": "replace-lines", "path": "tests/auth.test.ts", "lines": [12, 14], "with": "", "reason": "tautological assertion"}
  ]
}
```
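One way the fix vocabulary above could be modeled on the Rust side. This is a hand-rolled sketch; a real implementation would likely derive serde with a tagged representation (`#[serde(tag = "action", rename_all = "kebab-case")]`), and `FixAction`/`summarize` are hypothetical names:

```rust
// Sketch of the fix-action vocabulary as an enum; each variant mirrors one
// "action" value in the JSON output above.
enum FixAction {
    DeleteFile { path: String },
    DeleteLines { path: String, lines: (u32, u32) },
    RemoveDep { manifest: String, name: String },
    ReplaceLines { path: String, lines: (u32, u32), with: String },
}

// One-line human summary, e.g. for a confirmation prompt before applying.
fn summarize(f: &FixAction) -> String {
    match f {
        FixAction::DeleteFile { path } => format!("delete {path}"),
        FixAction::DeleteLines { path, lines } => format!("delete {path}:{}-{}", lines.0, lines.1),
        FixAction::RemoveDep { manifest, name } => format!("remove {name} from {manifest}"),
        FixAction::ReplaceLines { path, lines, .. } => format!("replace {path}:{}-{}", lines.0, lines.1),
    }
}

fn main() {
    let f = FixAction::DeleteLines { path: "src/api.ts".into(), lines: (88, 90) };
    assert_eq!(summarize(&f), "delete src/api.ts:88-90");
}
```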
Categories (all AST/graph, no embedder needed):
- Orphan exports (symbol exported, 0 importers in graph)
- Orphan files (not imported, not entry-point, not test)
- Tracked artifacts (`.log`, `.bak`, `.orig`, `.DS_Store`, `coverage/`, `dist/` in git)
- Unused deps (declared in manifest, no import resolves)
- Empty catch blocks (AST shape)
- Tautological tests (AST: `expect(x).toBe(x)`, `assert!(true)`)
- Orphan snapshots/fixtures (file exists, no test references it)
- Duplicate constants (same literal/string defined N+ places)
- Long-skipped tests (`it.skip` / `#[ignore]` + git age >180d)
- Old TODOs (`TODO`/`FIXME`/`XXX`/`HACK` regex + git blame age >180d)
- Old `@ts-ignore` / `#[allow]` / `eslint-disable` (regex + age)
- Stale CI configs (`.travis.yml` when `.github/workflows/` exists)
- Two-of-same tooling (eslint + biome, prettier + biome, multiple lockfiles)
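As an example of how cheap these AST/graph checks are, here is a sketch of the duplicate-constants category: count identical literals across definition sites and flag anything defined N+ times. The input shape (file, literal) and the threshold are assumptions:

```rust
use std::collections::HashMap;

// Count how many definition sites share the same literal text and return
// the literals that appear at least `min` times, sorted for stable output.
fn duplicate_constants<'a>(defs: &[(&'a str, &'a str)], min: usize) -> Vec<&'a str> {
    let mut by_literal: HashMap<&str, usize> = HashMap::new();
    for (_file, lit) in defs {
        *by_literal.entry(lit).or_insert(0) += 1;
    }
    let mut hits: Vec<&str> = by_literal
        .into_iter()
        .filter(|&(_, n)| n >= min)
        .map(|(lit, _)| lit)
        .collect();
    hits.sort();
    hits
}

fn main() {
    let defs = [
        ("src/a.ts", "3600"),
        ("src/b.ts", "3600"),
        ("src/c.ts", "3600"),
        ("src/a.ts", "\"dev\""),
    ];
    assert_eq!(duplicate_constants(&defs, 3), vec!["3600"]);
}
```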
### `query slop-targets`
Returns ranked targets for Sonnet/Opus models:
```json
{
  "targets": [
    {"path": "src/worker.ts", "tier": "sonnet", "score": 8.7, "suspect": "defensive-cargo-cult", "why": "hotspot 2 + bugspot 5 + 1.6x comment density"},
    {"area": "src/auth/", "tier": "opus", "score": 9.1, "suspect": "over-abstraction", "why": "4-deep wrapper chain, single impl per layer"}
  ]
}
```
Sonnet tier uses file-level signals (combining hotspot + bugspot + size anomaly + comment density + bot authorship + recent big-drop).
Opus tier requires new graph traversals (see below).
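The sonnet-tier combination might look like a simple weighted sum over those signals; the weights and the `FileSignals` shape below are illustrative guesses, not the shipped values:

```rust
// File-level signals feeding the sonnet-tier score (assumed shape).
struct FileSignals {
    hotspot: f64,          // churn rank
    bugspot: f64,          // fix-commit rank
    size_anomaly: f64,     // size z-score vs repo median
    comment_density: f64,  // ratio vs repo baseline (1.0 = normal)
    bot_authorship: f64,   // share of bot-authored lines, 0..1
    recent_big_drop: bool, // large single-commit addition recently
}

// Weighted sum; only above-baseline comment density counts as suspicious.
fn sonnet_score(s: &FileSignals) -> f64 {
    1.0 * s.hotspot
        + 1.5 * s.bugspot
        + 0.5 * s.size_anomaly
        + 2.0 * (s.comment_density - 1.0).max(0.0)
        + 2.0 * s.bot_authorship
        + if s.recent_big_drop { 1.0 } else { 0.0 }
}

fn main() {
    // mirrors the "hotspot 2 + bugspot 5 + 1.6x comment density" example above
    let worker = FileSignals {
        hotspot: 2.0, bugspot: 5.0, size_anomaly: 0.0,
        comment_density: 1.6, bot_authorship: 0.0, recent_big_drop: false,
    };
    assert!((sonnet_score(&worker) - 10.7).abs() < 1e-9);
}
```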
## Opus-tier graph traversals (v1 scope)
New work in `analyzer-graph`:
- Single-impl chains: trait/interface with exactly one concrete implementor across the import graph
- Wrapper towers: chains where every node has fan-in 1 + fan-out 1 (A→B→C all pass-through)
- Duplicate logic: AST-shape isomorphism + (if embedder present) embedding similarity for semantic duplicates
- Cliché-name clusters: `helper` / `utility` / `manager` etc., flagged when clustered, not just present
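The wrapper-tower traversal can be sketched as a walk over an edge list, extending a chain while each node has fan-in 1 and fan-out 1. The edge representation and the `wrapper_towers` name are assumptions:

```rust
use std::collections::HashMap;

// Find maximal chains of pure pass-through nodes: each link in the chain
// has exactly one caller and one callee in the (caller, callee) edge list.
fn wrapper_towers<'a>(edges: &[(&'a str, &'a str)], min_len: usize) -> Vec<Vec<&'a str>> {
    let mut succ: HashMap<&str, Vec<&str>> = HashMap::new();
    let mut fan_in: HashMap<&str, usize> = HashMap::new();
    for &(a, b) in edges {
        succ.entry(a).or_default().push(b);
        *fan_in.entry(b).or_insert(0) += 1;
    }
    let pass = |n: &str| {
        succ.get(n).map_or(0, |v| v.len()) == 1
            && fan_in.get(n).copied().unwrap_or(0) == 1
    };
    let mut towers = Vec::new();
    for &(head, _) in edges {
        // start only at a pass-through whose caller is not itself one
        if !pass(head) || edges.iter().any(|&(p, c)| c == head && pass(p)) {
            continue;
        }
        let mut chain = vec![head];
        let mut cur = head;
        while pass(cur) {
            if chain.len() > edges.len() { break; } // cycle guard
            let next = succ[cur][0];
            chain.push(next);
            cur = next;
        }
        if chain.len() >= min_len {
            towers.push(chain);
        }
    }
    towers
}

fn main() {
    let edges = [("entry", "wrapA"), ("wrapA", "wrapB"), ("wrapB", "real")];
    assert_eq!(wrapper_towers(&edges, 3), vec![vec!["wrapA", "wrapB", "real"]]);
}
```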
## NLP-enabled patterns (require embedder)
Layered on when the embedder is installed; degrade silently when not:
- Comment-restates-code: embedding similarity between comment text and next N tokens of code > threshold
- History-prose detection in docs: classifier on past-tense + change-reference patterns
- Stylometry-based AI authorship: per-file embedding distance from repo's own human-authored baseline (replaces the removed metadata-based `aiAttribution`)
- Semantic duplicates: function-level embedding similarity (catches what AST-shape match misses)
- Doc-drift v2: README prose section ↔ function semantics similarity
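All of these patterns reduce to a similarity threshold over embedding vectors. A sketch of the primitive, with the 0.9 cutoff as an illustrative guess rather than a tuned value:

```rust
// Cosine similarity between two embedding vectors of equal dimension.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

// Comment-restates-code check: flag when the comment's embedding is nearly
// identical to the embedding of the code it annotates.
fn restates_code(comment_vec: &[f32], code_vec: &[f32]) -> bool {
    cosine(comment_vec, code_vec) > 0.9 // threshold is an illustrative guess
}

fn main() {
    assert!((cosine(&[3.0, 4.0], &[3.0, 4.0]) - 1.0).abs() < 1e-6);
    assert!(cosine(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
    assert!(restates_code(&[0.6, 0.8], &[0.6, 0.8]));
}
```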
## Schema additions (backward compat)
All new fields use `#[serde(skip_serializing_if = "Option::is_none", default)]`:
```rust
pub struct RepoIntelData {
    // ... existing fields
    #[serde(skip_serializing_if = "Option::is_none", default)]
    pub embeddings_meta: Option<EmbeddingsMeta>, // model id, dim, granularity
    // raw vectors live in sidecar .embeddings.bin
}
```
## Why not bundle the embedder
Default install stays ~10MB. Power users opt into ~30MB or ~195MB explicitly. Architectural symmetry with main binary download flow: skill prompts → JS wrapper downloads → cached locally.
## Acceptance criteria
- `analyzer-embed` crate compiles and produces a standalone binary
- `agent-analyzer-embed scan` and `update` produce well-formed JSON
- `agent-analyzer set-embeddings` merges JSON into the sidecar on `init`/`update` runs
- `query slop-fixes` returns valid fix actions across all categories above
- `query slop-targets` returns ranked targets with `tier` and `why`
## Companion work
Skill orchestration + UI prompts: agent-sh/repo-intel#TBD