Skip to content

feat: Implement 'index' command for bulk issue backfilling#19

Merged
Kavirubc merged 8 commits intomainfrom
feature/bulk-index-0.1.0v
Feb 2, 2026
Merged

feat: Implement 'index' command for bulk issue backfilling#19
Kavirubc merged 8 commits intomainfrom
feature/bulk-index-0.1.0v

Conversation

@Kavirubc
Copy link
Copy Markdown
Contributor

@Kavirubc Kavirubc commented Feb 2, 2026

Description

This PR implements the gh simili index command to enable bulk backfilling of GitHub issues into the Qdrant vector database (Issue #18).

Changes

  • CLI: Added index command in cmd/simili/commands/index.go.
  • Core: Added RecursiveCharacterSplitter in internal/utils/text/chunker.go for optimizing embedding context.
  • GitHub Client: Enhanced internal/integrations/github/client.go with ListIssues and ListComments.
  • Deps: Added github.com/google/uuid for robust ID generation.

Usage

gh simili index --repo owner/repo

Verified with unit tests and manual dry-run.

…ore` to track it and `.simili-index-checkpoint.json`.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements the gh simili index command for bulk backfilling of GitHub issues into the Qdrant vector database. The implementation adds a CLI command with worker pool concurrency, text chunking for managing embedding context limits, and new GitHub client methods for fetching issues and comments.

Changes:

  • Added index command with configurable worker pool, dry-run mode, and since-timestamp filtering
  • Implemented RecursiveCharacterSplitter for chunking long text into embedding-sized segments
  • Extended GitHub client with ListIssues and ListComments methods for bulk data retrieval

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 17 comments.

Show a summary per file
File Description
cmd/simili/commands/index.go New CLI command implementing concurrent issue indexing with worker pool pattern and pagination
internal/utils/text/chunker.go Text splitter for breaking long issues into embedding-compatible chunks with configurable size and overlap
internal/utils/text/chunker_test.go Unit tests for the chunker functionality
internal/integrations/github/client.go Added methods for listing issues and comments to support bulk operations
.gitignore Added entries for binary, local config, and checkpoint files

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/utils/text/chunker.go Outdated
Comment on lines +105 to +109

// Length function to count size (using runes for safety)
func (s *RecursiveCharacterSplitter) len(text string) int {
return utf8.RuneCountInString(text)
}
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The len function defined here is never used in the code. The chunking logic uses Go's built-in len() function directly (e.g., line 65). This function should either be used consistently throughout the code or removed to avoid confusion.

Suggested change
// Length function to count size (using runes for safety)
func (s *RecursiveCharacterSplitter) len(text string) int {
return utf8.RuneCountInString(text)
}

Copilot uses AI. Check for mistakes.
Comment thread cmd/simili/commands/index.go Outdated
Comment on lines +35 to +40
const checkpointFile = ".simili-index-checkpoint.json"

type Checkpoint struct {
LastProcessedIssue int `json:"last_processed_issue"`
Timestamp time.Time `json:"timestamp"`
}
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Checkpoint struct is defined but never used. While checkpointFile constant is defined on line 35, the actual checkpoint functionality mentioned in the long description (line 50) is not implemented. Line 119 explicitly states this is omitted. Either remove these unused definitions or implement the checkpoint functionality as described in the command's help text and issue requirements (#18 mentions support for resuming if interrupted).

Copilot uses AI. Check for mistakes.
Comment on lines +84 to +96
func (c *Client) ListIssues(ctx context.Context, org, repo string, opts *github.IssueListByRepoOptions) ([]*github.Issue, *github.Response, error) {
if opts == nil {
opts = &github.IssueListByRepoOptions{
State: "all",
}
}
return c.client.Issues.ListByRepo(ctx, org, repo, opts)
}

// ListComments fetches comments for a specific issue.
func (c *Client) ListComments(ctx context.Context, org, repo string, number int, opts *github.IssueListCommentsOptions) ([]*github.IssueComment, *github.Response, error) {
return c.client.Issues.ListComments(ctx, org, repo, number, opts)
}
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ListIssues and ListComments methods don't follow the error handling pattern established by other methods in this file. Compare with GetIssue (line 22-28), CreateComment (line 32-44), and AddLabels (line 48-57), which all wrap errors with descriptive messages using fmt.Errorf. These new methods simply return the raw error from the underlying GitHub client, which makes debugging harder and is inconsistent with the established pattern in this codebase.

Copilot uses AI. Check for mistakes.
Comment thread internal/utils/text/chunker.go Outdated
Comment on lines +67 to +76
finalChunks = append(finalChunks, currentChunk)
// Apply overlap logic here if needed, for simplicity we just start new chunk
// Ideally, we should keep the last N chars equal to overlap
// Improving simple implementation to include overlap
// Start new chunk with overlap from previous
// Note: complex overlap logic with separators is tricky,
// for this implementation we'll keep it simple: just reset.
// To do it properly requires re-adding words.
// Let's stick to a simpler "accumulate until full" approach for now.
currentChunk = ""
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overlap implementation is incomplete. The code comments acknowledge this issue on lines 68-75, stating that overlap logic is not properly implemented. According to the PR description and issue #18, chunking with overlap is a requirement to ensure context is preserved across chunks. The current implementation resets to an empty chunk without carrying over any overlap from the previous chunk, which defeats the purpose of the ChunkOverlap configuration parameter.

Suggested change
finalChunks = append(finalChunks, currentChunk)
// Apply overlap logic here if needed, for simplicity we just start new chunk
// Ideally, we should keep the last N chars equal to overlap
// Improving simple implementation to include overlap
// Start new chunk with overlap from previous
// Note: complex overlap logic with separators is tricky,
// for this implementation we'll keep it simple: just reset.
// To do it properly requires re-adding words.
// Let's stick to a simpler "accumulate until full" approach for now.
currentChunk = ""
prevChunk := currentChunk
finalChunks = append(finalChunks, prevChunk)
// Start new chunk with overlap from the end of the previous chunk, if configured.
overlap := s.config.ChunkOverlap
if overlap > 0 {
runes := []rune(prevChunk)
if overlap >= len(runes) {
// If overlap is larger than the chunk, reuse the whole chunk.
currentChunk = prevChunk
} else {
currentChunk = string(runes[len(runes)-overlap:])
}
} else {
currentChunk = ""
}

Copilot uses AI. Check for mistakes.
Comment thread cmd/simili/commands/index.go Outdated
if len(comments) > 0 {
sb.WriteString("Comments:\n")
for _, c := range comments {
sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), c.GetBody()))
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential nil pointer dereference. When c.User is nil (which can happen for comments from deleted users or in certain GitHub API scenarios), calling c.User.GetLogin() will panic. You should add a nil check or use a default value like "unknown" or "deleted-user" when the user is nil.

Suggested change
sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), c.GetBody()))
author := "deleted-user"
if c.User != nil {
author = c.User.GetLogin()
}
sb.WriteString(fmt.Sprintf("- %s: %s\n", author, c.GetBody()))

Copilot uses AI. Check for mistakes.
Comment thread cmd/simili/commands/index.go Outdated
Comment on lines +196 to +211
// 1. Fetch Comments
comments, _, err := gh.ListComments(ctx, org, repo, issue.GetNumber(), &github.IssueListCommentsOptions{
ListOptions: github.ListOptions{PerPage: 100},
})
if err != nil {
log.Printf("[Worker %d] Error fetching comments for #%d: %v", workerID, issue.GetNumber(), err)
return
}

// 2. Aggregate Text
var sb strings.Builder
sb.WriteString(fmt.Sprintf("Title: %s\n\n", issue.GetTitle()))
sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody()))
if len(comments) > 0 {
sb.WriteString("Comments:\n")
for _, c := range comments {
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment fetching doesn't handle pagination. The PerPage is set to 100, but issues with more than 100 comments will only have their first 100 comments indexed. Issue #18 requires proper pagination support. You should implement pagination for comments similar to how it's done for issues (lines 163-188), checking resp.NextPage and looping until all comments are fetched.

Suggested change
// 1. Fetch Comments
comments, _, err := gh.ListComments(ctx, org, repo, issue.GetNumber(), &github.IssueListCommentsOptions{
ListOptions: github.ListOptions{PerPage: 100},
})
if err != nil {
log.Printf("[Worker %d] Error fetching comments for #%d: %v", workerID, issue.GetNumber(), err)
return
}
// 2. Aggregate Text
var sb strings.Builder
sb.WriteString(fmt.Sprintf("Title: %s\n\n", issue.GetTitle()))
sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody()))
if len(comments) > 0 {
sb.WriteString("Comments:\n")
for _, c := range comments {
// 1. Fetch Comments (with pagination)
var allComments []*github.IssueComment
page := 1
for {
comments, resp, err := gh.ListComments(ctx, org, repo, issue.GetNumber(), &github.IssueListCommentsOptions{
ListOptions: github.ListOptions{PerPage: 100, Page: page},
})
if err != nil {
log.Printf("[Worker %d] Error fetching comments for #%d: %v", workerID, issue.GetNumber(), err)
return
}
allComments = append(allComments, comments...)
if resp == nil || resp.NextPage == 0 {
break
}
page = resp.NextPage
}
// 2. Aggregate Text
var sb strings.Builder
sb.WriteString(fmt.Sprintf("Title: %s\n\n", issue.GetTitle()))
sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody()))
if len(allComments) > 0 {
sb.WriteString("Comments:\n")
for _, c := range allComments {

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,109 @@
// Author: Kaviru Hapuarachchi
// GitHub: https://github.com/Kavirubc
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GitHub username in the header comment is inconsistent with other files. This file uses "Kavirubc" (capital K, lowercase b,c) while other files in the codebase use "kavirubc" (all lowercase). For example, see cmd/simili/commands/root.go:2 and cmd/simili/commands/process.go:2 which use "kavirubc". This inconsistency should be fixed for uniformity.

Suggested change
// GitHub: https://github.com/Kavirubc
// GitHub: https://github.com/kavirubc

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,47 @@
// Author: Kaviru Hapuarachchi
// GitHub: https://github.com/Kavirubc
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GitHub username in the header comment is inconsistent with other files. This file uses "Kavirubc" (capital K, lowercase b,c) while other files in the codebase use "kavirubc" (all lowercase). For example, see cmd/simili/commands/root.go:2 and cmd/simili/commands/process.go:2 which use "kavirubc". This inconsistency should be fixed for uniformity.

Suggested change
// GitHub: https://github.com/Kavirubc
// GitHub: https://github.com/kavirubc

Copilot uses AI. Check for mistakes.
Comment thread cmd/simili/commands/index.go Outdated
Comment on lines +208 to +213
sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody()))
if len(comments) > 0 {
sb.WriteString("Comments:\n")
for _, c := range comments {
sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), c.GetBody()))
}
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text aggregation could produce misleading results when issue body or comment body contains nil/empty values. When issue.GetBody() returns an empty string, line 208 will still add "Body: \n\n" to the aggregated text. Similarly, if a comment has an empty body, it will add "- username: \n" (line 212). This could inflate the chunk count unnecessarily and potentially confuse the embedding model with empty labeled sections. Consider checking for empty strings and omitting these sections when they're empty.

Suggested change
sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody()))
if len(comments) > 0 {
sb.WriteString("Comments:\n")
for _, c := range comments {
sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), c.GetBody()))
}
issueBody := strings.TrimSpace(issue.GetBody())
if issueBody != "" {
sb.WriteString(fmt.Sprintf("Body: %s\n\n", issueBody))
}
hasCommentContent := false
for _, c := range comments {
commentBody := strings.TrimSpace(c.GetBody())
if commentBody == "" {
continue
}
if !hasCommentContent {
sb.WriteString("Comments:\n")
hasCommentContent = true
}
sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), commentBody))

Copilot uses AI. Check for mistakes.
Comment thread cmd/simili/commands/index.go Outdated
}

// 5. Upsert
if indexDryRun {
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indexDryRun flag is referenced using a global variable that may not be accessible in this context. The processIssue function uses the global indexDryRun variable (line 229) without receiving it as a parameter. While this works in Go, it creates tight coupling and makes the function harder to test. Consider passing dryRun as a parameter to processIssue for better testability and clearer function contracts.

Copilot uses AI. Check for mistakes.
Kavirubc and others added 3 commits February 2, 2026 20:45
- Check error return from MarkFlagRequired
- Remove unused checkpointFile constant
- Remove unused len() method from RecursiveCharacterSplitter
- Remove unused unicode/utf8 import

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
**Error Handling:**
- Add descriptive error wrapping to ListIssues and ListComments methods

**Robustness:**
- Implement pagination for comment fetching (handles >100 comments)
- Add nil check for deleted users in comment processing
- Handle empty issue bodies and comments gracefully
- Pass dryRun as parameter instead of using global variable

**Text Chunking:**
- Implement proper overlap logic in RecursiveCharacterSplitter
- Preserve context across chunks using ChunkOverlap configuration

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@Kavirubc Kavirubc merged commit 6cfe224 into main Feb 2, 2026
3 checks passed
@Kavirubc Kavirubc deleted the feature/bulk-index-0.1.0v branch February 3, 2026 08:41
@Kavirubc Kavirubc linked an issue Feb 3, 2026 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[v0.1.0] CLI: Implement 'index' command for bulk backfill

2 participants