feat: analyzer-embed crate + slop-fixes/slop-targets queries (semantic deslop targeting) #27

@avifenesh

Goal

Give deslop a structured input so it spends LLM tokens only where slop is likely. Two outputs:

  1. query slop-fixes — pinpoint structured fix actions (Haiku-tier: confirm shape, apply). AST/graph signals only, no embedder needed.
  2. query slop-targets — ranked file/module targets with tier field (sonnet=file-level, opus=cross-file/module). Uses embeddings if present, degrades gracefully to AST/graph signals.

New crate: analyzer-embed

Separate Cargo crate in this workspace, separate binary agent-analyzer-embed. Heavy deps (ort, tokenizers) isolated here; main agent-analyzer binary stays small.

Two model variants, picked at install time (skill prompts user):

| Variant | Model | Size | Use case |
|---|---|---|---|
| small | BAAI/bge-small-en-v1.5 Q8 | ~30 MB | English-only, weaker on code, cheap |
| big | google/embeddinggemma-300m Q4 (ONNX) | ~195 MB | SOTA <500M params, code-aware, multilingual, Matryoshka |

Released as 5 platforms × 2 sizes = 10 release assets. Model file downloaded + cached next to binary, content-hashed.

Granularity picked at install time:

  • compact: per-file × 128 dim
  • balanced: per-function × 256 dim
  • maximum: per-function × 768 dim

Storage: sidecar .claude/repo-intel.embeddings.bin (packed int8/fp16). Main repo-intel.json stays diffable.
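
To illustrate how the granularity dimensions and the packed sidecar could fit together, here is a minimal sketch. It assumes Matryoshka-style truncation (keep the first `dim` components, then re-normalize) and a symmetric per-vector int8 quantization. The function names and the quantization scheme are hypothetical; the issue does not specify the sidecar layout.

```rust
/// Keep the first `dim` components and L2-normalize the result.
/// Assumes a Matryoshka-trained model, where prefixes are usable embeddings.
fn truncate_and_normalize(v: &[f32], dim: usize) -> Vec<f32> {
    let mut out: Vec<f32> = v.iter().take(dim).copied().collect();
    let norm = out.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in out.iter_mut() {
            *x /= norm;
        }
    }
    out
}

/// Symmetric int8 quantization (an assumed scheme, not the issue's format).
/// Returns (scale, quantized values); dequantize with `q as f32 * scale`.
fn quantize_i8(v: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    if max_abs == 0.0 {
        return (1.0, vec![0; v.len()]);
    }
    let scale = max_abs / 127.0;
    let q = v.iter().map(|x| (x / scale).round() as i8).collect();
    (scale, q)
}
```

Under these assumptions, a `compact` install would call `truncate_and_normalize(&full, 128)` once per file and store the quantized bytes plus one `f32` scale per vector.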

Subcommands on agent-analyzer-embed:

  • scan — full re-embed of all files, JSON to stdout
  • update — delta-only: read existing sidecar, hash files, re-embed changed/added, drop removed
  • version
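
The delta logic behind `update` can be sketched as a comparison of two path-to-hash maps. This is only an illustration: `DeltaPlan` and `plan_delta` are hypothetical names, and the hash values are treated as opaque strings because the issue does not name a hash function.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// What an `update` run would need to do, given old and new content hashes.
#[derive(Debug, PartialEq)]
struct DeltaPlan {
    reembed: BTreeSet<String>, // changed or newly added paths
    removed: BTreeSet<String>, // paths present in the sidecar but gone from disk
}

/// `old` maps path -> hash as recorded in the sidecar;
/// `new` maps path -> hash as computed from the working tree.
fn plan_delta(old: &BTreeMap<String, String>, new: &BTreeMap<String, String>) -> DeltaPlan {
    let reembed = new
        .iter()
        .filter(|(path, hash)| old.get(*path) != Some(*hash))
        .map(|(path, _)| path.clone())
        .collect();
    let removed = old
        .keys()
        .filter(|path| !new.contains_key(*path))
        .cloned()
        .collect();
    DeltaPlan { reembed, removed }
}
```

Unchanged files appear in neither set, so their vectors can be carried over from the existing sidecar without calling the model.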

New subcommands on main agent-analyzer binary

  • set-embeddings --input - — accept JSON via stdin, merge into sidecar (mirrors existing set-descriptors / set-summary pattern)

New queries

query slop-fixes

Returns structured fix actions for Haiku to apply:

```json
{
  "fixes": [
    {"action": "delete-file", "path": "debug.log", "reason": "tracked log artifact"},
    {"action": "delete-lines", "path": "src/auth.ts", "lines": [142, 158], "reason": "orphan export legacyHash — 0 importers"},
    {"action": "delete-lines", "path": "src/api.ts", "lines": [88, 90], "reason": "empty catch block"},
    {"action": "remove-dep", "manifest": "package.json", "name": "lodash.merge", "reason": "no import resolves"},
    {"action": "replace-lines", "path": "tests/auth.test.ts", "lines": [12, 14], "with": "", "reason": "tautological assertion"}
  ]
}
```
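
As a rough illustration of what applying a `delete-lines` action could look like on the consumer side, here is a sketch. It assumes `"lines": [start, end]` is a 1-based inclusive range, which the issue does not state; `delete_lines` is a hypothetical helper.

```rust
/// Remove the 1-based inclusive line range [start, end] from `source`.
/// Returns None if the range does not fit the text, so a stale fix
/// action is rejected instead of deleting the wrong lines.
fn delete_lines(source: &str, start: usize, end: usize) -> Option<String> {
    let lines: Vec<&str> = source.lines().collect();
    if start == 0 || start > end || end > lines.len() {
        return None;
    }
    let kept: Vec<&str> = lines
        .iter()
        .enumerate()
        .filter(|(i, _)| *i + 1 < start || *i + 1 > end)
        .map(|(_, line)| *line)
        .collect();
    let mut out = kept.join("\n");
    // Preserve a trailing newline if the original had one.
    if source.ends_with('\n') && !out.is_empty() {
        out.push('\n');
    }
    Some(out)
}
```

Returning `None` on an out-of-range request is a deliberate choice in this sketch: line numbers go stale as soon as the file changes, so the applier should re-check before editing.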

Categories (all AST/graph, no embedder needed):

  • Orphan exports (symbol exported, 0 importers in graph)
  • Orphan files (not imported, not entry-point, not test)
  • Tracked artifacts (.log, .bak, .orig, .DS_Store, coverage/, dist/ in git)
  • Unused deps (declared in manifest, no import resolves)
  • Empty catch blocks (AST shape)
  • Tautological tests (AST: expect(x).toBe(x), assert!(true))
  • Orphan snapshots/fixtures (file exists, no test references it)
  • Duplicate constants (same literal/string defined N+ places)
  • Long-skipped tests (it.skip / #[ignore] + git age >180d)
  • Old TODOs (TODO/FIXME/XXX/HACK regex + git blame age >180d)
  • Old @ts-ignore / #[allow] / eslint-disable (regex + age)
  • Stale CI configs (.travis.yml when .github/workflows/ exists)
  • Two-of-same tooling (eslint + biome, prettier + biome, multiple lockfiles)
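
Most of these categories reduce to simple predicates over data the analyzer already has. As one example, the tracked-artifacts check could be sketched as a path filter over the list of git-tracked files (for instance the output of `git ls-files`). The function name is hypothetical, and the suffix and directory lists are taken directly from the bullet above.

```rust
const ARTIFACT_SUFFIXES: [&str; 3] = [".log", ".bak", ".orig"];
const ARTIFACT_DIRS: [&str; 2] = ["coverage", "dist"];

/// Returns true if a git-tracked, '/'-separated path looks like an
/// artifact that should not be committed.
fn is_tracked_artifact(path: &str) -> bool {
    let parts: Vec<&str> = path.split('/').collect();
    let (file_name, dirs) = match parts.split_last() {
        Some(pair) => pair,
        None => return false,
    };
    *file_name == ".DS_Store"
        || ARTIFACT_SUFFIXES.iter().any(|s| file_name.ends_with(*s))
        || dirs.iter().any(|d| ARTIFACT_DIRS.contains(d))
}
```

Matching on whole path components rather than substrings keeps files such as `src/distance.rs` or `src/logger.ts` from being flagged.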

query slop-targets

Returns ranked targets for Sonnet/Opus models:

```json
{
  "targets": [
    {"path": "src/worker.ts", "tier": "sonnet", "score": 8.7, "suspect": "defensive-cargo-cult", "why": "hotspot 2 + bugspot 5 + 1.6x comment density"},
    {"area": "src/auth/", "tier": "opus", "score": 9.1, "suspect": "over-abstraction", "why": "4-deep wrapper chain, single impl per layer"}
  ]
}
```

Sonnet tier combines file-level signals: hotspot, bugspot, size anomaly, comment density, bot authorship, and recent big-drop.

Opus tier requires new graph traversals (see below).

Opus-tier graph traversals (v1 scope)

New work in analyzer-graph:

  • Single-impl chains: trait/interface with exactly one concrete implementor across the import graph
  • Wrapper towers: chains where every node has fan-in 1 + fan-out 1 (A→B→C all pass-through)
  • Duplicate logic: AST-shape isomorphism + (if embedder present) embedding similarity for semantic duplicates
  • Cliché-name clusters: helper / utility / manager etc. flagged when clustered, not just present
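
The wrapper-tower traversal can be sketched over a plain edge list. This is a minimal illustration, not the planned analyzer-graph implementation: `wrapper_towers` and `min_len` are hypothetical, duplicate edges count toward fan-in and fan-out, and cycles made up entirely of pass-through nodes are ignored because they have no chain head.

```rust
use std::collections::BTreeMap;

/// Find maximal chains in which every node has fan-in 1 and fan-out 1.
/// `edges` are (caller, callee) pairs; chains shorter than `min_len` are dropped.
fn wrapper_towers(edges: &[(&str, &str)], min_len: usize) -> Vec<Vec<String>> {
    let mut succ: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    let mut pred: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &(from, to) in edges {
        succ.entry(from).or_default().push(to);
        pred.entry(to).or_default().push(from);
    }
    let is_pass_through = |n: &str| {
        succ.get(n).map_or(0, |v| v.len()) == 1 && pred.get(n).map_or(0, |v| v.len()) == 1
    };

    let mut towers = Vec::new();
    for &node in succ.keys() {
        // A chain head is a pass-through node whose only caller is not pass-through.
        if !is_pass_through(node) || is_pass_through(pred[node][0]) {
            continue;
        }
        let mut chain = vec![node.to_string()];
        let mut cur = succ[node][0];
        while is_pass_through(cur) {
            chain.push(cur.to_string());
            cur = succ[cur][0];
        }
        if chain.len() >= min_len {
            towers.push(chain);
        }
    }
    towers
}
```

Starting only from chain heads means each tower is reported once, at its full length, rather than once per interior node.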

NLP-enabled patterns (require embedder)

Layered on when embedder is installed; degrade silently when not:

  • Comment-restates-code: embedding similarity between comment text and next N tokens of code > threshold
  • History-prose detection in docs: classifier on past-tense + change-reference patterns
  • Stylometry-based AI authorship: per-file embedding distance from repo's own human-authored baseline (replaces removed metadata-based aiAttribution)
  • Semantic duplicates: function-level embedding similarity (catches what AST-shape match misses)
  • Doc-drift v2: README prose section ↔ function semantics similarity
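
Several of these patterns come down to a similarity check between two embedding vectors. A minimal sketch for comment-restates-code, assuming cosine similarity as the metric and a caller-supplied threshold (the issue names neither the metric nor a threshold value; both function names are hypothetical):

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns None for mismatched lengths, empty input, or a zero vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Flag a comment whose embedding is closer than `threshold` to the
/// embedding of the code that follows it. Missing or unusable vectors
/// yield `false`, matching the "degrade silently" behaviour.
fn comment_restates_code(comment: &[f32], code: &[f32], threshold: f32) -> bool {
    cosine_similarity(comment, code).map_or(false, |s| s > threshold)
}
```

The same `cosine_similarity` helper would serve the semantic-duplicates check, applied to pairs of function-level vectors instead of comment and code pairs.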

Schema additions (backward compat)

All new fields use #[serde(skip_serializing_if = "Option::is_none", default)]:

```rust
pub struct RepoIntelData {
    // ... existing fields
    #[serde(skip_serializing_if = "Option::is_none", default)]
    pub embeddings_meta: Option<EmbeddingsMeta>, // model id, dim, granularity (type name assumed)
    // raw vectors live in sidecar .embeddings.bin
}
```

Why not bundle the embedder

Default install stays ~10 MB. Power users opt into the ~30 MB or ~195 MB variant explicitly. This mirrors the main binary's download flow: skill prompts → JS wrapper downloads → cached locally.

Acceptance criteria

  • analyzer-embed crate compiles, produces standalone binary
  • Both model variants (small/big) load + embed successfully
  • agent-analyzer-embed scan and update produce well-formed JSON
  • agent-analyzer set-embeddings merges JSON into sidecar
  • Sidecar persists across init / update runs
  • query slop-fixes returns valid fix actions across all categories above
  • query slop-targets returns ranked targets with tier and why
  • Opus-tier graph traversals: single-impl chains, wrapper towers, duplicate logic
  • All new fields backward-compatible (load existing v0.5.0 artifacts without error)
  • Tests for: empty embeddings (no embedder), partial embeddings (delta update), full embeddings
  • CI builds 10 release assets (5 platforms × 2 sizes)

Companion work

Skill orchestration + UI prompts: agent-sh/repo-intel#TBD
