Problem Statement
We need a way to index existing issues (10,000+) to seed the vector database. The current CLI lacks a bulk indexing capability.
Requirements
- Usage:
gh simili index --repo owner/name --config .github/simili.yaml
- Content Scope: Must index Title + Description + Comments.
- Why: Semantic search needs the full context of the discussion (e.g., resolutions found in comments).
- Chunking Strategy:
- Embedding models have token limits (Gemini: ~2048/3072 tokens).
- Implement Chunking: Split long issues/comments into overlapping segments (e.g., recursive character text splitter) or index comments individually.
- Performance & Reliability:
- Pagination: Fetch issues in pages (e.g., 100/page).
- Concurrency: Use a worker pool pattern.
- Resiliency: Support --since or state tracking to resume if interrupted.
- Rate Limits: Respect GitHub API secondary limits.
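
The requirements above could be combined roughly as follows. This is a minimal sketch, not the actual simili implementation: `fetch_page` and `embed_and_store` are hypothetical callables standing in for the GitHub API client and the embedding/vector-store step. Chunk sizes are measured in characters rather than tokens. The `since` filter assumes `updated_at` values are ISO 8601 UTC strings in a uniform format, so they compare correctly as plain strings.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, Optional


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def iter_issues(fetch_page: Callable[[int, int], list[dict]],
                per_page: int = 100) -> Iterator[dict]:
    """Yield issues page by page until a short or empty page comes back."""
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        yield from batch
        if len(batch) < per_page:
            return
        page += 1


def issue_document(issue: dict) -> str:
    """Join title, description, and comments into one text to index."""
    parts = [issue["title"], issue.get("body") or ""] + issue.get("comments", [])
    return "\n\n".join(p for p in parts if p)


def index_issues(fetch_page: Callable[[int, int], list[dict]],
                 embed_and_store: Callable[[int, int, str], None],
                 since: Optional[str] = None,
                 workers: int = 4,
                 per_page: int = 100) -> int:
    """Index all issues updated after `since`; return the number of chunks stored."""

    def index_one(issue: dict) -> int:
        chunks = chunk_text(issue_document(issue))
        for i, chunk in enumerate(chunks):
            embed_and_store(issue["number"], i, chunk)
        return len(chunks)

    # Skip issues that were not updated after the resume point.
    todo = (
        issue for issue in iter_issues(fetch_page, per_page)
        if since is None or issue["updated_at"] > since
    )
    # Worker pool: each worker chunks and stores one issue at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(index_one, todo))
```

Rate-limit handling is deliberately left out of this sketch. A real indexer would also need to back off when GitHub signals a secondary rate limit, and would likely persist the last processed timestamp so that an interrupted run can pass it back in as `since`.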
Feature Scope