feat: Implement 'index' command for bulk issue backfilling#19
Conversation
…ore` to track it and `.simili-index-checkpoint.json`.
There was a problem hiding this comment.
Pull request overview
This PR implements the gh simili index command for bulk backfilling of GitHub issues into the Qdrant vector database. The implementation adds a CLI command with worker pool concurrency, text chunking for managing embedding context limits, and new GitHub client methods for fetching issues and comments.
Changes:
- Added
indexcommand with configurable worker pool, dry-run mode, and since-timestamp filtering - Implemented
RecursiveCharacterSplitterfor chunking long text into embedding-sized segments - Extended GitHub client with
ListIssuesandListCommentsmethods for bulk data retrieval
Reviewed changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| cmd/simili/commands/index.go | New CLI command implementing concurrent issue indexing with worker pool pattern and pagination |
| internal/utils/text/chunker.go | Text splitter for breaking long issues into embedding-compatible chunks with configurable size and overlap |
| internal/utils/text/chunker_test.go | Unit tests for the chunker functionality |
| internal/integrations/github/client.go | Added methods for listing issues and comments to support bulk operations |
| .gitignore | Added entries for binary, local config, and checkpoint files |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| // Length function to count size (using runes for safety) | ||
| func (s *RecursiveCharacterSplitter) len(text string) int { | ||
| return utf8.RuneCountInString(text) | ||
| } |
There was a problem hiding this comment.
The len function defined here is never used in the code. The chunking logic uses Go's built-in len() function directly (e.g., line 65). This function should either be used consistently throughout the code or removed to avoid confusion.
| // Length function to count size (using runes for safety) | |
| func (s *RecursiveCharacterSplitter) len(text string) int { | |
| return utf8.RuneCountInString(text) | |
| } |
| const checkpointFile = ".simili-index-checkpoint.json" | ||
|
|
||
| type Checkpoint struct { | ||
| LastProcessedIssue int `json:"last_processed_issue"` | ||
| Timestamp time.Time `json:"timestamp"` | ||
| } |
There was a problem hiding this comment.
The Checkpoint struct is defined but never used. While checkpointFile constant is defined on line 35, the actual checkpoint functionality mentioned in the long description (line 50) is not implemented. Line 119 explicitly states this is omitted. Either remove these unused definitions or implement the checkpoint functionality as described in the command's help text and issue requirements (#18 mentions support for resuming if interrupted).
| func (c *Client) ListIssues(ctx context.Context, org, repo string, opts *github.IssueListByRepoOptions) ([]*github.Issue, *github.Response, error) { | ||
| if opts == nil { | ||
| opts = &github.IssueListByRepoOptions{ | ||
| State: "all", | ||
| } | ||
| } | ||
| return c.client.Issues.ListByRepo(ctx, org, repo, opts) | ||
| } | ||
|
|
||
| // ListComments fetches comments for a specific issue. | ||
| func (c *Client) ListComments(ctx context.Context, org, repo string, number int, opts *github.IssueListCommentsOptions) ([]*github.IssueComment, *github.Response, error) { | ||
| return c.client.Issues.ListComments(ctx, org, repo, number, opts) | ||
| } |
There was a problem hiding this comment.
The ListIssues and ListComments methods don't follow the error handling pattern established by other methods in this file. Compare with GetIssue (line 22-28), CreateComment (line 32-44), and AddLabels (line 48-57), which all wrap errors with descriptive messages using fmt.Errorf. These new methods simply return the raw error from the underlying GitHub client, which makes debugging harder and is inconsistent with the established pattern in this codebase.
| finalChunks = append(finalChunks, currentChunk) | ||
| // Apply overlap logic here if needed, for simplicity we just start new chunk | ||
| // Ideally, we should keep the last N chars equal to overlap | ||
| // Improving simple implementation to include overlap | ||
| // Start new chunk with overlap from previous | ||
| // Note: complex overlap logic with separators is tricky, | ||
| // for this implementation we'll keep it simple: just reset. | ||
| // To do it properly requires re-adding words. | ||
| // Let's stick to a simpler "accumulate until full" approach for now. | ||
| currentChunk = "" |
There was a problem hiding this comment.
The overlap implementation is incomplete. The code comments acknowledge this issue on lines 68-75, stating that overlap logic is not properly implemented. According to the PR description and issue #18, chunking with overlap is a requirement to ensure context is preserved across chunks. The current implementation resets to an empty chunk without carrying over any overlap from the previous chunk, which defeats the purpose of the ChunkOverlap configuration parameter.
| finalChunks = append(finalChunks, currentChunk) | |
| // Apply overlap logic here if needed, for simplicity we just start new chunk | |
| // Ideally, we should keep the last N chars equal to overlap | |
| // Improving simple implementation to include overlap | |
| // Start new chunk with overlap from previous | |
| // Note: complex overlap logic with separators is tricky, | |
| // for this implementation we'll keep it simple: just reset. | |
| // To do it properly requires re-adding words. | |
| // Let's stick to a simpler "accumulate until full" approach for now. | |
| currentChunk = "" | |
| prevChunk := currentChunk | |
| finalChunks = append(finalChunks, prevChunk) | |
| // Start new chunk with overlap from the end of the previous chunk, if configured. | |
| overlap := s.config.ChunkOverlap | |
| if overlap > 0 { | |
| runes := []rune(prevChunk) | |
| if overlap >= len(runes) { | |
| // If overlap is larger than the chunk, reuse the whole chunk. | |
| currentChunk = prevChunk | |
| } else { | |
| currentChunk = string(runes[len(runes)-overlap:]) | |
| } | |
| } else { | |
| currentChunk = "" | |
| } |
| if len(comments) > 0 { | ||
| sb.WriteString("Comments:\n") | ||
| for _, c := range comments { | ||
| sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), c.GetBody())) |
There was a problem hiding this comment.
Potential nil pointer dereference. When c.User is nil (which can happen for comments from deleted users or in certain GitHub API scenarios), calling c.User.GetLogin() will panic. You should add a nil check or use a default value like "unknown" or "deleted-user" when the user is nil.
| sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), c.GetBody())) | |
| author := "deleted-user" | |
| if c.User != nil { | |
| author = c.User.GetLogin() | |
| } | |
| sb.WriteString(fmt.Sprintf("- %s: %s\n", author, c.GetBody())) |
| // 1. Fetch Comments | ||
| comments, _, err := gh.ListComments(ctx, org, repo, issue.GetNumber(), &github.IssueListCommentsOptions{ | ||
| ListOptions: github.ListOptions{PerPage: 100}, | ||
| }) | ||
| if err != nil { | ||
| log.Printf("[Worker %d] Error fetching comments for #%d: %v", workerID, issue.GetNumber(), err) | ||
| return | ||
| } | ||
|
|
||
| // 2. Aggregate Text | ||
| var sb strings.Builder | ||
| sb.WriteString(fmt.Sprintf("Title: %s\n\n", issue.GetTitle())) | ||
| sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody())) | ||
| if len(comments) > 0 { | ||
| sb.WriteString("Comments:\n") | ||
| for _, c := range comments { |
There was a problem hiding this comment.
The comment fetching doesn't handle pagination. The PerPage is set to 100, but issues with more than 100 comments will only have their first 100 comments indexed. Issue #18 requires proper pagination support. You should implement pagination for comments similar to how it's done for issues (lines 163-188), checking resp.NextPage and looping until all comments are fetched.
| // 1. Fetch Comments | |
| comments, _, err := gh.ListComments(ctx, org, repo, issue.GetNumber(), &github.IssueListCommentsOptions{ | |
| ListOptions: github.ListOptions{PerPage: 100}, | |
| }) | |
| if err != nil { | |
| log.Printf("[Worker %d] Error fetching comments for #%d: %v", workerID, issue.GetNumber(), err) | |
| return | |
| } | |
| // 2. Aggregate Text | |
| var sb strings.Builder | |
| sb.WriteString(fmt.Sprintf("Title: %s\n\n", issue.GetTitle())) | |
| sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody())) | |
| if len(comments) > 0 { | |
| sb.WriteString("Comments:\n") | |
| for _, c := range comments { | |
| // 1. Fetch Comments (with pagination) | |
| var allComments []*github.IssueComment | |
| page := 1 | |
| for { | |
| comments, resp, err := gh.ListComments(ctx, org, repo, issue.GetNumber(), &github.IssueListCommentsOptions{ | |
| ListOptions: github.ListOptions{PerPage: 100, Page: page}, | |
| }) | |
| if err != nil { | |
| log.Printf("[Worker %d] Error fetching comments for #%d: %v", workerID, issue.GetNumber(), err) | |
| return | |
| } | |
| allComments = append(allComments, comments...) | |
| if resp == nil || resp.NextPage == 0 { | |
| break | |
| } | |
| page = resp.NextPage | |
| } | |
| // 2. Aggregate Text | |
| var sb strings.Builder | |
| sb.WriteString(fmt.Sprintf("Title: %s\n\n", issue.GetTitle())) | |
| sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody())) | |
| if len(allComments) > 0 { | |
| sb.WriteString("Comments:\n") | |
| for _, c := range allComments { |
| @@ -0,0 +1,109 @@ | |||
| // Author: Kaviru Hapuarachchi | |||
| // GitHub: https://github.com/Kavirubc | |||
There was a problem hiding this comment.
The GitHub username in the header comment is inconsistent with other files. This file uses "Kavirubc" (capital K, lowercase b,c) while other files in the codebase use "kavirubc" (all lowercase). For example, see cmd/simili/commands/root.go:2 and cmd/simili/commands/process.go:2 which use "kavirubc". This inconsistency should be fixed for uniformity.
| // GitHub: https://github.com/Kavirubc | |
| // GitHub: https://github.com/kavirubc |
| @@ -0,0 +1,47 @@ | |||
| // Author: Kaviru Hapuarachchi | |||
| // GitHub: https://github.com/Kavirubc | |||
There was a problem hiding this comment.
The GitHub username in the header comment is inconsistent with other files. This file uses "Kavirubc" (capital K, lowercase b,c) while other files in the codebase use "kavirubc" (all lowercase). For example, see cmd/simili/commands/root.go:2 and cmd/simili/commands/process.go:2 which use "kavirubc". This inconsistency should be fixed for uniformity.
| // GitHub: https://github.com/Kavirubc | |
| // GitHub: https://github.com/kavirubc |
| sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody())) | ||
| if len(comments) > 0 { | ||
| sb.WriteString("Comments:\n") | ||
| for _, c := range comments { | ||
| sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), c.GetBody())) | ||
| } |
There was a problem hiding this comment.
The text aggregation could produce misleading results when issue body or comment body contains nil/empty values. When issue.GetBody() returns an empty string, line 208 will still add "Body: \n\n" to the aggregated text. Similarly, if a comment has an empty body, it will add "- username: \n" (line 212). This could inflate the chunk count unnecessarily and potentially confuse the embedding model with empty labeled sections. Consider checking for empty strings and omitting these sections when they're empty.
| sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody())) | |
| if len(comments) > 0 { | |
| sb.WriteString("Comments:\n") | |
| for _, c := range comments { | |
| sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), c.GetBody())) | |
| } | |
| issueBody := strings.TrimSpace(issue.GetBody()) | |
| if issueBody != "" { | |
| sb.WriteString(fmt.Sprintf("Body: %s\n\n", issueBody)) | |
| } | |
| hasCommentContent := false | |
| for _, c := range comments { | |
| commentBody := strings.TrimSpace(c.GetBody()) | |
| if commentBody == "" { | |
| continue | |
| } | |
| if !hasCommentContent { | |
| sb.WriteString("Comments:\n") | |
| hasCommentContent = true | |
| } | |
| sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), commentBody)) |
| } | ||
|
|
||
| // 5. Upsert | ||
| if indexDryRun { |
There was a problem hiding this comment.
The indexDryRun flag is referenced using a global variable that may not be accessible in this context. The processIssue function uses the global indexDryRun variable (line 229) without receiving it as a parameter. While this works in Go, it creates tight coupling and makes the function harder to test. Consider passing dryRun as a parameter to processIssue for better testability and clearer function contracts.
- Check error return from MarkFlagRequired - Remove unused checkpointFile constant - Remove unused len() method from RecursiveCharacterSplitter - Remove unused unicode/utf8 import Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
**Error Handling:** - Add descriptive error wrapping to ListIssues and ListComments methods **Robustness:** - Implement pagination for comment fetching (handles >100 comments) - Add nil check for deleted users in comment processing - Handle empty issue bodies and comments gracefully - Pass dryRun as parameter instead of using global variable **Text Chunking:** - Implement proper overlap logic in RecursiveCharacterSplitter - Preserve context across chunks using ChunkOverlap configuration Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…rent verification status.
Description
This PR implements the
gh simili indexcommand to enable bulk backfilling of GitHub issues into the Qdrant vector database (Issue #18).Changes
indexcommand incmd/simili/commands/index.go.RecursiveCharacterSplitterininternal/utils/text/chunker.gofor optimizing embedding context.internal/integrations/github/client.gowithListIssuesandListComments.github.com/google/uuidfor robust ID generation.Usage
Verified with unit tests and manual dry-run.