feat(workspace): FTS5 foundation + pluggable indexer + agent tools #11796
alt-glitch wants to merge 40 commits into
…ad code
- Rename IndexError → IndexingError (shadows Python builtin)
- Use cached CodeChunker instead of per-block instantiation
- Collapse identical strategy branches in _process_plain
- Remove unused suffix param from _process_code/_process_plain
- Remove dead iter_workspace_files function
- Expose overlap property on _ChunkerCache (was accessing private _config)
- Rewrite _build_line_offsets with regex (was O(n) char loop)
- Fix duplicate --human argparse registration
- Fix LIKE %/_ semantic bug in path prefix search (use substr)
- Pre-compile FTS token regex at module level
- Use dataclasses.replace for frozen record updates
- Remove decisions.md and workspace-findings.md from PR
- ruff format + lint clean on workspace/
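For reviewers who want to see the LIKE `%`/`_` issue in isolation, here is a minimal sketch using only the standard `sqlite3` module. The `files` table and the sample paths are hypothetical and are not taken from the PR's schema; the point is only that `_` in a path prefix acts as a single-character wildcard under LIKE, while a `substr` comparison treats the prefix literally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the real store schema is not shown in this PR text.
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO files (path) VALUES (?)",
    [("/ws/my_notes/a.md",), ("/ws/myXnotes/b.md",), ("/ws/other/c.md",)],
)

prefix = "/ws/my_notes/"

# Naive LIKE: the "_" inside the prefix matches any single character,
# so "/ws/myXnotes/b.md" is returned as well.
like_rows = conn.execute(
    "SELECT path FROM files WHERE path LIKE ? ORDER BY path",
    (prefix + "%",),
).fetchall()

# substr comparison: the prefix is compared as literal text.
substr_rows = conn.execute(
    "SELECT path FROM files WHERE substr(path, 1, length(?)) = ? ORDER BY path",
    (prefix, prefix),
).fetchall()

print(len(like_rows), len(substr_rows))
```

An `ESCAPE` clause on the LIKE pattern would be another way to get literal matching; the commit message says `substr` was the approach chosen.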
Hey! A few things that could simplify this significantly. 1. Use
Pins the target behavior for the Chonkie Pipeline migration:
- Markdown files produce one ChunkRecord per modality with clean metadata
- Small markdown files remain multimodal (no _single_chunk short-circuit)
- Overlap context propagates and is a suffix of prior chunk's content
- Deprecated strategy/threshold keys load without error
- Signature change forces re-indexing
Marked xfail(strict=True) until the migration lands in the next commit.
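The overlap invariant pinned above ("context is a suffix of the prior chunk's content") can be illustrated with a toy chunker. This is a character-based sketch written for illustration only; it is not Chonkie's algorithm, and `chunk_with_overlap` and the dict keys are hypothetical names.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[dict[str, str]]:
    """Split text into fixed-size pieces; each piece carries the tail of the previous one as context."""
    if chunk_size <= 0 or overlap < 0:
        raise ValueError("chunk_size must be positive and overlap non-negative")
    pieces = [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
    records = []
    for index, content in enumerate(pieces):
        # Guard overlap == 0 explicitly: piece[-0:] would return the whole piece.
        context = pieces[index - 1][-overlap:] if index > 0 and overlap > 0 else ""
        records.append({"content": content, "context": context})
    return records


records = chunk_with_overlap("alpha beta gamma delta epsilon", chunk_size=10, overlap=4)

# The invariant the tests pin: each context is a suffix of the previous content.
for previous, current in zip(records, records[1:]):
    assert previous["content"].endswith(current["context"])
```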
Replaces the manual MarkdownChef → chunker → OverlapRefinery wiring (and the _ChunkerCache / _apply_overlap / _group_overlap_runs / _neural_enforce_size layer) with three pre-built chonkie.Pipeline instances dispatched by file suffix.
- Drop 'semantic' and 'neural' chunking strategies and their model pins; RecursiveChunker is the only prose chunker. BM25/FTS5 doesn't benefit from topical coherence.
- Drop ChunkingConfig.strategy and ChunkingConfig.threshold. No legacy compatibility: old keys in user configs are silently ignored.
- Drop block_index / src / link / row_count / column_count metadata as not load-bearing for keyword search.
- Drop _single_chunk short-circuit so small files still flow through the Pipeline (preserves prose/code split for tiny multimodal markdown).
- Drop manual _apply_overlap; OverlapRefinery.refine_document handles prose chunks inside the Pipeline, populating Chunk.context directly.
- Bump CHUNKING_PLAN_VERSION from v1 to v2 so existing indexes get re-built cleanly on upgrade.
Code-review follow-up to the chonkie Pipeline migration:
- Delete _extract_first_heading and _kind_from_suffix (only callers were _single_chunk, which was removed in the migration).
- Drop unused `import pytest` in test_indexer_pipeline.py (all xfail markers removed post-migration).
- Update test_indexer_pipeline.py docstrings to describe post-migration invariants rather than "current impl" behavior that no longer exists.
Follow-up to the chonkie Pipeline migration. Parallel black-box verification
surfaced four bugs and three minor follow-ups:
- Add PRAGMA busy_timeout via sqlite3.connect(timeout=5.0) — fixes
concurrent index crashes exposed when _build_pipelines removed the
lazy-init skew that previously hid the race.
- Resolve path_prefix in search_workspace (Python API entry) so it
matches the indexer's resolved stored paths, mirroring what the CLI
already does in commands.py.
- Hardcode .hermesignore exclusion in discovery, and add it to
DEFAULT_IGNORE_PATTERNS belt-and-suspenders.
- Extend DiscoveryResult with filtered_count and roll it into
files_skipped so empty/oversized files stop vanishing from summaries.
- Tighten pipelines dict type from dict[str, Any] to dict[str, Pipeline].
- Drop dead {"language": lang} branch in _process_code — CodeChunker
language="auto" never populates the attribute.
- Tighten test_small_markdown_file_is_split_into_modalities to assert
prose doesn't swallow the code fence.
Adds four regression tests covering each fix.
- Narrow markdown pipeline result with isinstance assert (static type correctness; .code/.tables/.images access no longer leaks Document abstraction).
- Simplify _execute_with_lock_retry to 5 linear-backoff attempts; the helper was tuned for WAL schema bootstrap, not repeated retry.
- Add Literal['markdown','code','plain'] alias for pipeline keys so typos at the dispatch site become type errors.
- Fix misleading stage="discover" label on post-discovery errors; relabel as "read" where it actually applies.
- Extract _make_config / _write into tests/workspace/conftest.py fixtures so the two test files share one source.
- Factor str(Path(raw).resolve()) into workspace.constants.resolve_path_prefix and call from both search_workspace and the CLI command.
- Drop a stale WHAT-comment on the retry backoff line.
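A rough sketch of what a five-attempt linear-backoff helper around `sqlite3.connect(timeout=5.0)` could look like. The function names, the `base_delay` value, and the "locked" substring check are assumptions for illustration; the PR's actual `_execute_with_lock_retry` is not shown here.

```python
import sqlite3
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def execute_with_lock_retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.05,  # hypothetical value, not taken from the PR
) -> T:
    """Run operation, retrying on 'database is locked' with linearly growing sleeps."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            is_lock_error = "locked" in str(exc).lower()
            if not is_lock_error or attempt == attempts:
                raise
            time.sleep(base_delay * attempt)  # linear backoff: d, 2d, 3d, ...
    raise AssertionError("unreachable")


def open_store(db_path: str) -> sqlite3.Connection:
    # timeout=5.0 makes SQLite wait up to 5 seconds for locks on ordinary statements.
    conn = sqlite3.connect(db_path, timeout=5.0)
    # The one-shot WAL pragma is the step wrapped in the bounded retry.
    execute_with_lock_retry(lambda: conn.execute("PRAGMA journal_mode=WAL").fetchone())
    return conn
```

Non-lock `OperationalError`s are re-raised immediately so that genuine schema or SQL errors are not masked by the retry loop.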
The MarkdownDocument narrow in _process_markdown addressed one of three call sites. _process_code and _process_plain had the same Pyright gap — .chunks access on Document | list[Document]. Narrow with isinstance assert, consistent with the markdown path.
- Move PipelineKind = Literal[...] below all imports in indexer.py
- Add 'from __future__ import annotations' to store.py so the SQLiteFTS5Store self-reference in __enter__'s return annotation resolves without the explicit string quote.
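To show why the `PipelineKind` Literal alias helps at the dispatch site, here is a small sketch of suffix-based dispatch. The suffix sets and the `pipeline_kind_for` helper are hypothetical; only the three key names come from the commit message.

```python
from pathlib import Path
from typing import Literal

PipelineKind = Literal["markdown", "code", "plain"]

# Hypothetical suffix sets, for illustration only.
_MARKDOWN_SUFFIXES = {".md", ".markdown"}
_CODE_SUFFIXES = {".py", ".rs", ".ts", ".go"}


def pipeline_kind_for(path: Path) -> PipelineKind:
    """Pick a pipeline key from the file suffix; a typo such as "markdwn" would fail type checking."""
    suffix = path.suffix.lower()
    if suffix in _MARKDOWN_SUFFIXES:
        return "markdown"
    if suffix in _CODE_SUFFIXES:
        return "code"
    return "plain"


print(pipeline_kind_for(Path("docs/README.MD")))
```

With a `dict[PipelineKind, ...]` keyed by the same alias, a static checker can flag a misspelled key before the code runs.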
Drop 'from __future__ import annotations' from store.py and use quoted forward refs instead. Add matching quotes in config.py for KnowledgebaseConfig.from_dict and WorkspaceConfig.from_dict — these were raising NameError at class-definition time after an earlier lint pass stripped the future import that had been masking them.
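A minimal sketch of the quoted forward-reference pattern described above. A frozen dataclass stands in for the real config model (the PR describes its config classes as Pydantic models), and the `chunk_size` field and its default are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class KnowledgebaseConfig:
    chunk_size: int = 512  # hypothetical field and default

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "KnowledgebaseConfig":
        # The return annotation is quoted because the class name is not yet bound
        # while the class body is executing. On interpreters that evaluate
        # annotations eagerly, the unquoted form raises NameError there.
        return cls(chunk_size=int(data.get("chunk_size", 512)))


config = KnowledgebaseConfig.from_dict({"chunk_size": "256"})
print(config)
```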
Plugin contract for workspace backends. Implementations must define __init__(config), index(), and search(). status() is optional.
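The contract above could be sketched roughly as follows. The method signatures, return shapes, and the `InMemoryIndexer` toy backend are assumptions for illustration; only the method names and the required/optional split come from the commit message.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseIndexer(ABC):
    """Sketch of a backend contract: index() and search() required, status() optional."""

    def __init__(self, config: dict[str, Any]) -> None:
        self.config = config

    @abstractmethod
    def index(self) -> dict[str, Any]:
        """Build or refresh the index and return a summary."""

    @abstractmethod
    def search(self, query: str, limit: int = 10) -> list[dict[str, Any]]:
        """Return up to limit hits for query."""

    def status(self) -> dict[str, Any]:
        # Optional: backends may override; the default reports nothing.
        return {}


class InMemoryIndexer(BaseIndexer):
    """Hypothetical toy backend: substring search over documents passed in via config."""

    def __init__(self, config: dict[str, Any]) -> None:
        super().__init__(config)
        self._docs: dict[str, str] = dict(config.get("docs", {}))

    def index(self) -> dict[str, Any]:
        return {"files_indexed": len(self._docs)}

    def search(self, query: str, limit: int = 10) -> list[dict[str, Any]]:
        needle = query.lower()
        hits = [{"path": path} for path, text in sorted(self._docs.items()) if needle in text.lower()]
        return hits[:limit]


indexer = InMemoryIndexer({"docs": {"a.md": "FTS5 notes", "b.md": "unrelated"}})
print(indexer.index(), indexer.search("fts5"))
```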
…xer everywhere
Delete the old workspace/indexer.py and workspace/search.py modules. All test imports now use workspace.default.DefaultIndexer directly. Backwards-compat re-exports removed from workspace/__init__.py.
- Remove `from __future__ import annotations` from all new workspace files
- Convert TYPE_CHECKING imports to real imports in base.py (no circular deps)
- Quote self-referential forward ref in config.py model_validator
- Add null checks on spec.loader in plugin discovery to satisfy ty
Adds a workspace indexer plugin backed by @llamaindex/semtools, a Rust CLI that does semantic search via model2vec. The plugin auto-installs semtools on first use and delegates indexing to semtools' lazy approach (embed-on-search). Includes 23 tests covering subclass contract, plugin discovery, factory integration, CLI invocation, result parsing, filtering, error handling, and edge cases.
Semtools is an external CLI tool — integration tests would require npm + network access and add CI complexity for minimal value.
Complete workspace CLI coverage with status, list, retrieve, delete commands. Add 6 separate agent tools (workspace_search, workspace_index, workspace_status, workspace_list, workspace_retrieve, workspace_delete) with per-tool schemas and check_fn gating on workspace.enabled. Wire /workspace slash command in interactive REPL with Rich formatting.
- workspace_search and workspace_index are default-enabled in core tools
- Full workspace toolset available for opt-in via /tools enable
- BaseIndexer ABC extended with list_files(), retrieve(), delete()
- SQLiteFTS5Store gains list_files() and get_chunks_for_file()
# Conflicts:
# toolsets.py
# uv.lock
Implements list_files() via workspace file discovery, delete() via semtools workspace prune, and enriches status() with root_dir and total_documents from 'semtools workspace status --json'. retrieve() intentionally left as no-op default — semtools uses embeddings, not chunked content storage. All 6 agent tools now route through plugin dispatch for both default and semtools backends with zero tool-layer changes. Also fix pyproject.toml merge: include both tui_gateway and workspace packages.
Adds _handle_workspace_command() to GatewayRunner with centralized dispatch for status, search, list, retrieve, delete, index, and roots subcommands. Works across Telegram, Discord, Slack, Signal, WhatsApp, Matrix, and all other platforms (dispatch is centralized in gateway/run.py, not per-platform). Uses run_in_executor for blocking SQLite I/O. Reuses workspace.commands helpers for roots add/remove to stay DRY with the CLI surface.
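The `run_in_executor` pattern mentioned above might look roughly like this. The handler name, the `files` table, and the reply text are hypothetical; the sketch only shows blocking SQLite work being pushed off the event loop onto the default thread pool.

```python
import asyncio
import sqlite3
from functools import partial


def blocking_status(db_path: str) -> dict[str, int]:
    """Blocking SQLite work; opened, used, and closed inside a single worker thread."""
    conn = sqlite3.connect(db_path, timeout=5.0)
    try:
        # Hypothetical table, created here so the sketch is self-contained.
        conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY)")
        (count,) = conn.execute("SELECT COUNT(*) FROM files").fetchone()
        return {"files": count}
    finally:
        conn.close()


async def handle_workspace_status(db_path: str) -> str:
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor, so the event loop
    # keeps serving other messages while SQLite runs.
    status = await loop.run_in_executor(None, partial(blocking_status, db_path))
    return f"{status['files']} files indexed"


reply = asyncio.run(handle_workspace_status(":memory:"))
print(reply)
```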
New build_workspace_guidance(available_tools) returns a single guidance block that grows with which workspace tools are present. Core paragraph appears when workspace_search is available; retrieve/list/index add their own paragraphs when those tools are also available. workspace_delete is intentionally not prompted (destructive). Follows the existing MEMORY_GUIDANCE/SESSION_SEARCH_GUIDANCE/SKILLS_GUIDANCE pattern in agent/prompt_builder.py.
Wires build_workspace_guidance() into AIAgent._build_system_prompt() alongside the existing memory/session_search/skills hooks. Guidance is assembled dynamically from self.valid_tool_names and appended to the tool_guidance section. Closes the 0/6 workspace_search discoverability gap measured in the 2026-04-20 A/B dogfood.
M1: self.valid_tool_names is already a set[str], no need to wrap.
M3: Add 'assert out is not None' guards so pyright correctly narrows the str|None return type in tests that then use in/index()/lower().
Code review cleanup from Task 2 review.
Makes the tool schema description explicitly tell the LLM to prefer workspace_search over grep/find/cat. Pairs with the new system-prompt guidance from build_workspace_guidance() — belt-and-suspenders on tool discoverability.
Adds markitdown[pdf,docx,pptx] as a new [parsing] optional extra and wires it into the [workspace] extra via hermes-agent[parsing].
- Cache MarkItDown instance across _convert() calls instead of recreating per file (perf during batch indexing)
- Log warning when configured backend name is unknown in build_parser()
- Normalize suffix casing in CompositeParser.can_parse()
- Remove unused import in test file
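Two of the fixes above (a converter instance cached across `_convert()` calls, and suffix casing normalized in `can_parse()`) can be sketched without any third-party dependency. `UpperCaseParser` is a hypothetical stand-in for a real backend such as the markitdown wrapper, and it transforms the file name instead of reading the file so the example stays self-contained.

```python
import logging
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class FileParser(ABC):
    @abstractmethod
    def _convert(self, path: Path) -> str:
        """Backend-specific conversion; may raise."""

    def parse(self, path: Path) -> Optional[str]:
        # Uniform error handling: any backend failure becomes a logged None.
        try:
            return self._convert(path)
        except Exception:
            logger.warning("failed to parse %s", path, exc_info=True)
            return None


class UpperCaseParser(FileParser):
    """Hypothetical stand-in backend with a lazily built, cached converter."""

    def __init__(self) -> None:
        self._converter: Optional[Callable[[str], str]] = None
        self.converters_built = 0

    def _get_converter(self) -> Callable[[str], str]:
        if self._converter is None:  # built once, reused for every file
            self._converter = str.upper
            self.converters_built += 1
        return self._converter

    def _convert(self, path: Path) -> str:
        return self._get_converter()(path.name)


class CompositeParser(FileParser):
    def __init__(self, routes: dict[str, FileParser]) -> None:
        # Normalize configured suffixes once so ".PDF" and ".pdf" share a route.
        self._routes = {suffix.lower(): parser for suffix, parser in routes.items()}

    def can_parse(self, path: Path) -> bool:
        return path.suffix.lower() in self._routes

    def _convert(self, path: Path) -> str:
        return self._routes[path.suffix.lower()]._convert(path)


backend = UpperCaseParser()
composite = CompositeParser({".PDF": backend})
print(composite.can_parse(Path("Report.Pdf")), composite.parse(Path("a.pdf")))
```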
Summary
Stacked PR: builds a pluggable workspace indexing + search stack on top of the SQLite FTS5 foundation. Originally split across three PRs, now landed as one.
Layer 1 — FTS5 Foundation
Core storage, chunking, and CLI — the base layer that everything else sits on.
- `workspace/` package: SQLite FTS5 store, file discovery, config loading, search, CLI commands
- `Pipeline`, three pre-built pipelines dispatched by file suffix: `markdown` (MarkdownChef + RecursiveChunker), `code` (CodeChunker), `plain` (RecursiveChunker). All three end with OverlapRefinery.
- `kind` (markdown_text/markdown_code/markdown_table/markdown_image) with per-modality metadata (language on code, alias text on images)
- `context` column, indexed into FTS via the `retrieval_text` trigger so queries find matches that only appear in the overlap region
- `.hermesignore` support with full gitignore semantics via `pathspec` (precedence: `.hermesignore` > `.gitignore` > built-in defaults). The `.hermesignore` file itself is never indexed.
- `hermes workspace roots add/remove/list`, `hermes workspace index`, `hermes workspace search <query>` with `--path`, `--glob`, `--limit` filters. JSON output by default, `--human` for Rich
Design decisions:
- `RecursiveChunker` is the only prose chunker; semantic/neural were considered and dropped (BM25 doesn't benefit from topical coherence, and the pinned model deps + `_neural_enforce_size` wart weren't paying for themselves)
- `context` + `chunk_metadata` columns
- `sqlite3.connect(timeout=5.0)` plus bounded retry on the one-shot schema bootstrap (WAL-pragma init doesn't honor busy_timeout)
- `chunk_size`, `overlap`, overlap mode/method: any change triggers reindex
Salvaged from: PRs #1324 (original design spec by @teknium1) and #5840 (modularized pipeline, salvaged from #5619). Trimmed to FTS5-only with clean code in a flat package, no plugin system.
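The "any change triggers reindex" decision for `chunk_size`, `overlap`, and overlap mode/method can be pictured as a stable hash over those settings. This is a sketch under assumptions: the function name, the payload keys, and the sample mode/method values are illustrative, and the real signature format is not shown in this PR text.

```python
import hashlib
import json

CHUNKING_PLAN_VERSION = "v2"  # the commits above bump the plan version from v1 to v2


def config_signature(chunk_size: int, overlap: int, overlap_mode: str, overlap_method: str) -> str:
    """Hash the chunking settings; a stored signature that differs means the index is stale."""
    payload = {
        "plan_version": CHUNKING_PLAN_VERSION,
        "chunk_size": chunk_size,
        "overlap": overlap,
        "overlap_mode": overlap_mode,
        "overlap_method": overlap_method,
    }
    # sort_keys and fixed separators keep the serialization stable across runs.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


stored = config_signature(512, 64, "token", "suffix")   # illustrative values
current = config_signature(256, 64, "token", "suffix")  # chunk_size changed
needs_reindex = stored != current
print(needs_reindex)
```

Including the plan version in the payload means a version bump invalidates existing indexes even when the user-facing settings are unchanged.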
Layer 2 — Pluggable Indexer Architecture
The FTS5 implementation becomes the `DefaultIndexer` behind a swappable interface. Community can write plugins that replace the entire backend (semantic search, vector store, etc.) without touching core code.
- `BaseIndexer` ABC (`workspace/base.py`): plugin contract with `index()`, `search()`, `status()`, `list_files()`, `retrieve()`, `delete()`. Two required methods + four optional with no-op defaults.
- `DefaultIndexer` (`workspace/default.py`): restructured `indexer.py` + `search.py` into a class implementing the ABC. Internal methods (`_build_pipelines`, `_process_file`, etc.) are overridable for subclass-and-tweak patterns.
- `plugins/workspace/<name>/` mirroring the context_engine plugin pattern: `register(ctx)` function + filesystem discovery.
- `get_indexer(config)` factory picks plugin by `indexer:` config key; sentinel `"default"` skips plugin lookup.
- `plugins/workspace/semtools/`: real-world validation of the abstractions. Wraps the `semtools` Rust CLI (model2vec semantic search) with idempotent `npm i -g` bootstrap. Implements all 6 `BaseIndexer` methods, with `retrieve()` intentionally left as no-op default (semtools uses embeddings, not chunked content storage).
Layer 3 — CLI + Agent Tools + Slash Commands (all backends)
Full surface coverage: same backend exposed through four channels. All four route through `get_indexer(config)` so switching the `indexer:` config key flips the whole stack transparently.
CLI subcommands (`hermes workspace ...`):
- `status`: file count, chunk count, DB size, DB path
- `list`: all indexed files with size + chunk count
- `retrieve <path>`: all chunks for a specific file (JSON or human)
- `delete <path>`: remove a file + its chunks from the index
- `index`, `search`, `roots` commands
Agent tools (6 separate tools, one-job-each design):
- `workspace_search`
- `workspace_index`
- `workspace_status`
- `workspace_list`
- `workspace_retrieve`
- `workspace_delete`
Each tool has its own focused schema (no monolithic action-dispatch). All gated on `workspace.enabled` via `check_fn`. Full set available via the `workspace` toolset for opt-in.
Interactive slash commands (CLI REPL, `/workspace ...`):
- `COMMAND_REGISTRY` with tab-completion for all subcommands
- `hermes_cli/workspace_slash.py`
- `get_indexer()` factory as the CLI
Gateway slash commands (Telegram, Discord, Slack, Signal, WhatsApp, Matrix, etc.):
- `_handle_workspace_command()` in `gateway/run.py`: centralized dispatch (works across all platforms, no per-platform wiring)
- `run_in_executor` to avoid blocking the event loop
Layer 4 — File Parsing (PDF, DOCX, PPTX → Markdown)
In-memory parsing layer that converts non-text binary formats to markdown before chunking. Parsed content flows through the existing markdown chunking pipeline (headings, tables, code blocks extracted properly).
- `FileParser` ABC (`workspace/parsers.py`), template-method pattern: abstract `_convert(path) -> str` + concrete `parse(path) -> str|None` with uniform error handling and logging
- `MarkitdownParser`: wraps markitdown library (lazy import, cached instance across calls)
- `PandocParser`: wraps `pandoc` CLI via subprocess with 120s timeout
- `CompositeParser`: routes file extensions to the correct backend. `build_parser(config)` factory builds the routing table from config.
- `ParsingConfig`: Pydantic model under `knowledgebase.parsing` with `default` backend + per-extension `overrides` (e.g., use markitdown for PDF but pandoc for DOCX)
- `PARSEABLE_SUFFIXES`: `.pdf`, `.docx`, `.pptx` carved out of `BINARY_SUFFIXES` in file discovery so they pass through to the parser instead of being skipped
- `DefaultIndexer` integration: parsed files get `suffix = ".md"` override to route through the markdown chunking pipeline. Config signature includes parsing config so backend changes trigger reindex.
- `[parsing]` extra in pyproject.toml: `markitdown[pdf,docx,pptx]>=0.1.0`. Wired into the `[workspace]` extra via `hermes-agent[parsing]`.
Backend Parity
Same 4 surfaces x any backend = zero coupling. Verified end-to-end:
- `hermes workspace ...` CLI
- `workspace_*` agent tools
- `/workspace` CLI REPL slash
- `/workspace` gateway slash
Flip `indexer: "semtools"` in config.yaml and every surface transparently routes through the semtools plugin. Zero changes needed in the tool/slash/CLI layers.
What comes next
File layout
- `workspace/__init__.py`: `get_indexer()` factory + public API
- `workspace/base.py`: `BaseIndexer` ABC
- `workspace/default.py`: `DefaultIndexer`, Chonkie + SQLite FTS5
- `workspace/config.py`: `WorkspaceConfig` / `KnowledgebaseConfig` / `ParsingConfig`
- `workspace/constants.py`: `PARSEABLE_SUFFIXES`, path helpers
- `workspace/types.py`: `FileRecord`, `ChunkRecord`, `SearchResult`, `IndexingError`, `IndexSummary`
- `workspace/store.py`: `SQLiteFTS5Store` (schema, CRUD, BM25 search, FTS5 triggers, WAL-safe bootstrap)
- `workspace/files.py`
- `workspace/parsers.py`: `FileParser` ABC, `MarkitdownParser`, `PandocParser`, `CompositeParser`, `build_parser()`
- `workspace/commands.py`
- `plugins/workspace/__init__.py`
- `plugins/workspace/semtools/`
- `tools/workspace_tools.py`
- `hermes_cli/workspace_slash.py`: `/workspace` interactive CLI REPL slash
Modified files
- `hermes_cli/main.py`: registers `hermes workspace` subcommand tree
- `hermes_cli/commands.py`: adds `CommandDef("workspace", ...)` to `COMMAND_REGISTRY`
- `cli.py`: dispatch branch for CLI REPL slash command
- `gateway/run.py`: dispatch branch + `_handle_workspace_command()` for messaging platforms
- `hermes_cli/config.py`: adds `knowledgebase` section to `DEFAULT_CONFIG`
- `toolsets.py`: adds workspace tools to `_HERMES_CORE_TOOLS` + standalone `workspace` toolset
- `pyproject.toml`: adds `pathspec`, `charset-normalizer` to core deps; `chonkie[code]` to `[workspace]` extra; `markitdown[pdf,docx,pptx]` to `[parsing]` extra
Test plan
- `strategy`/`threshold` config keys silently ignored
- `--path` prefix (symlink-aware), `--glob` pattern, `--limit` clamping
- `retrieval_text` trigger (ZEBRAFOO-in-overlap-only regression test)
- `files_skipped` (no silent drops at discovery)
- `workspace_*` tools transparently route to `SemtoolsIndexer` when `indexer: "semtools"`
- `/workspace status`, `/workspace search`, `/workspace list`, `/workspace retrieve`, tab-completion: all pass
- `hermes workspace` CLI
- `/workspace` from Telegram/Discord/Slack after gateway restart