feat: Implement 'index' command for bulk issue backfilling by Kavirubc · Pull Request #19 · similigh/simili-bot

Kavirubc · 2026-02-02T12:20:04Z

Description

This PR implements the gh simili index command to enable bulk backfilling of GitHub issues into the Qdrant vector database (Issue #18).

Changes

CLI: Added index command in cmd/simili/commands/index.go.
Core: Added RecursiveCharacterSplitter in internal/utils/text/chunker.go for optimizing embedding context.
GitHub Client: Enhanced internal/integrations/github/client.go with ListIssues and ListComments.
Deps: Added github.com/google/uuid for robust ID generation.

Usage

gh simili index --repo owner/repo

Verified with unit tests and manual dry-run.

Copilot

Pull request overview

This PR implements the gh simili index command for bulk backfilling of GitHub issues into the Qdrant vector database. The implementation adds a CLI command with worker pool concurrency, text chunking for managing embedding context limits, and new GitHub client methods for fetching issues and comments.

Changes:

Added index command with configurable worker pool, dry-run mode, and since-timestamp filtering
Implemented RecursiveCharacterSplitter for chunking long text into embedding-sized segments
Extended GitHub client with ListIssues and ListComments methods for bulk data retrieval
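The worker-pool pattern named in the overview can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code from index.go: `runPool`, the `process` callback, and the use of plain integers for issue numbers are all hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool fans issue numbers out to a fixed number of goroutines and
// returns how many items were processed. All names are illustrative.
func runPool(issues []int, workers int, process func(workerID, issue int)) int {
	jobs := make(chan int)
	var wg sync.WaitGroup
	var mu sync.Mutex
	processed := 0

	// Start the workers; each one drains the jobs channel until it is closed.
	for w := 1; w <= workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for n := range jobs {
				process(id, n)
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}(w)
	}

	// Feed the queue, then signal that no more work is coming.
	for _, n := range issues {
		jobs <- n
	}
	close(jobs)
	wg.Wait()
	return processed
}

func main() {
	count := runPool([]int{1, 2, 3, 4, 5}, 2, func(workerID, issue int) {
		fmt.Printf("[Worker %d] indexed #%d\n", workerID, issue)
	})
	fmt.Println("processed:", count)
}
```

Closing the channel after the last send is what lets the workers exit their range loops, so `wg.Wait()` returns only after every queued issue has been handled.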

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 17 comments.

File	Description
cmd/simili/commands/index.go	New CLI command implementing concurrent issue indexing with worker pool pattern and pagination
internal/utils/text/chunker.go	Text splitter for breaking long issues into embedding-compatible chunks with configurable size and overlap
internal/utils/text/chunker_test.go	Unit tests for the chunker functionality
internal/integrations/github/client.go	Added methods for listing issues and comments to support bulk operations
.gitignore	Added entries for binary, local config, and checkpoint files

Copilot · 2026-02-02T12:28:19Z

+
+// Length function to count size (using runes for safety)
+func (s *RecursiveCharacterSplitter) len(text string) int {
+	return utf8.RuneCountInString(text)
+}


The len function defined here is never used in the code. The chunking logic uses Go's built-in len() function directly (e.g., line 65). This function should either be used consistently throughout the code or removed to avoid confusion.

Suggested change

-// Length function to count size (using runes for safety)
-func (s *RecursiveCharacterSplitter) len(text string) int {
-	return utf8.RuneCountInString(text)
-}
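Whichever option is chosen, the two length measures differ for non-ASCII text, which is why mixing them is confusing. A small sketch of the difference; `runeLen` is a hypothetical free-function version of the helper under discussion:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeLen counts Unicode code points rather than bytes.
func runeLen(text string) int {
	return utf8.RuneCountInString(text)
}

func main() {
	s := "h\u00e9llo" // "héllo": the é takes two bytes in UTF-8

	fmt.Println(len(s))     // byte length
	fmt.Println(runeLen(s)) // rune (code point) count
}
```

If chunk sizes are meant to be character budgets, the rune count is the safer measure; if they are byte budgets, the built-in `len` is sufficient.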

Copilot · 2026-02-02T12:28:20Z

+const checkpointFile = ".simili-index-checkpoint.json"
+
+type Checkpoint struct {
+	LastProcessedIssue int       `json:"last_processed_issue"`
+	Timestamp          time.Time `json:"timestamp"`
+}


The Checkpoint struct is defined but never used. While checkpointFile constant is defined on line 35, the actual checkpoint functionality mentioned in the long description (line 50) is not implemented. Line 119 explicitly states this is omitted. Either remove these unused definitions or implement the checkpoint functionality as described in the command's help text and issue requirements (#18 mentions support for resuming if interrupted).

Copilot · 2026-02-02T12:28:20Z

+func (c *Client) ListIssues(ctx context.Context, org, repo string, opts *github.IssueListByRepoOptions) ([]*github.Issue, *github.Response, error) {
+	if opts == nil {
+		opts = &github.IssueListByRepoOptions{
+			State: "all",
+		}
+	}
+	return c.client.Issues.ListByRepo(ctx, org, repo, opts)
+}
+
+// ListComments fetches comments for a specific issue.
+func (c *Client) ListComments(ctx context.Context, org, repo string, number int, opts *github.IssueListCommentsOptions) ([]*github.IssueComment, *github.Response, error) {
+	return c.client.Issues.ListComments(ctx, org, repo, number, opts)
+}


The ListIssues and ListComments methods don't follow the error handling pattern established by other methods in this file. Compare with GetIssue (line 22-28), CreateComment (line 32-44), and AddLabels (line 48-57), which all wrap errors with descriptive messages using fmt.Errorf. These new methods simply return the raw error from the underlying GitHub client, which makes debugging harder and is inconsistent with the established pattern in this codebase.

Copilot · 2026-02-02T12:28:20Z

+				finalChunks = append(finalChunks, currentChunk)
+				// Apply overlap logic here if needed, for simplicity we just start new chunk
+				// Ideally, we should keep the last N chars equal to overlap
+				// Improving simple implementation to include overlap
+				// Start new chunk with overlap from previous
+				// Note: complex overlap logic with separators is tricky,
+				// for this implementation we'll keep it simple: just reset.
+				// To do it properly requires re-adding words.
+				// Let's stick to a simpler "accumulate until full" approach for now.
+				currentChunk = ""


The overlap implementation is incomplete. The code comments acknowledge this issue on lines 68-75, stating that overlap logic is not properly implemented. According to the PR description and issue #18, chunking with overlap is a requirement to ensure context is preserved across chunks. The current implementation resets to an empty chunk without carrying over any overlap from the previous chunk, which defeats the purpose of the ChunkOverlap configuration parameter.

Suggested change

-				finalChunks = append(finalChunks, currentChunk)
-				// Apply overlap logic here if needed, for simplicity we just start new chunk
-				// Ideally, we should keep the last N chars equal to overlap
-				// Improving simple implementation to include overlap
-				// Start new chunk with overlap from previous
-				// Note: complex overlap logic with separators is tricky,
-				// for this implementation we'll keep it simple: just reset.
-				// To do it properly requires re-adding words.
-				// Let's stick to a simpler "accumulate until full" approach for now.
-				currentChunk = ""
+				prevChunk := currentChunk
+				finalChunks = append(finalChunks, prevChunk)
+				// Start new chunk with overlap from the end of the previous chunk, if configured.
+				overlap := s.config.ChunkOverlap
+				if overlap > 0 {
+					runes := []rune(prevChunk)
+					if overlap >= len(runes) {
+						// If overlap is larger than the chunk, reuse the whole chunk.
+						currentChunk = prevChunk
+					} else {
+						currentChunk = string(runes[len(runes)-overlap:])
+					}
+				} else {
+					currentChunk = ""
+				}
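The overlap rule in this suggestion can be isolated into a small, testable helper. The sketch below is one way to do that; `tailOverlap` is a hypothetical name, and it only covers the "seed the next chunk" step, not the rest of the splitter.

```go
package main

import "fmt"

// tailOverlap returns the last n runes of prev, used to seed the next chunk.
// It returns "" when n <= 0 and all of prev when n covers the whole chunk.
func tailOverlap(prev string, n int) string {
	if n <= 0 {
		return ""
	}
	runes := []rune(prev)
	if n >= len(runes) {
		return prev
	}
	return string(runes[len(runes)-n:])
}

func main() {
	fmt.Printf("%q\n", tailOverlap("hello world", 5)) // "world"
	fmt.Printf("%q\n", tailOverlap("hi", 5))          // "hi"
	fmt.Printf("%q\n", tailOverlap("hello", 0))       // ""
}
```

Slicing by runes avoids cutting a multi-byte character in half. One thing worth checking in the full splitter is the case where the overlap is close to the chunk size, since the seeded chunk then leaves little room for new text.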

Copilot · 2026-02-02T12:28:20Z

+	if len(comments) > 0 {
+		sb.WriteString("Comments:\n")
+		for _, c := range comments {
+			sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), c.GetBody()))


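Unhandled nil user. c.User can be nil, for example for comments from deleted users or in certain GitHub API scenarios. go-github's generated getters are written to tolerate a nil receiver, so c.User.GetLogin() should return an empty string here rather than panic; the result is still a line with a blank author ("- : ..."). Add an explicit nil check or fall back to a default value such as "unknown" or "deleted-user" when the user is nil.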

Suggested change

-			sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), c.GetBody()))
+			author := "deleted-user"
+			if c.User != nil {
+				author = c.User.GetLogin()
+			}
+			sb.WriteString(fmt.Sprintf("- %s: %s\n", author, c.GetBody()))
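A variant of the fallback that also covers a non-nil user with no login is sketched below. It uses local stand-in types (`user`, `comment`) instead of the go-github ones so that it runs on its own, and it assumes accessors written in the nil-receiver-safe style; `authorOf` is a hypothetical helper name.

```go
package main

import "fmt"

// user and comment are local stand-ins for the API client's types.
type user struct{ Login *string }

// GetLogin tolerates a nil receiver and a nil Login field.
func (u *user) GetLogin() string {
	if u == nil || u.Login == nil {
		return ""
	}
	return *u.Login
}

type comment struct {
	User *user
	Body string
}

// authorOf falls back to a placeholder whenever no login is available.
func authorOf(c comment) string {
	if login := c.User.GetLogin(); login != "" {
		return login
	}
	return "deleted-user"
}

func main() {
	name := "octocat"
	fmt.Println(authorOf(comment{User: &user{Login: &name}}))
	fmt.Println(authorOf(comment{})) // nil user
}
```

Checking the returned login rather than the pointer handles both the nil-user case and a user object whose login field is unset.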

Copilot · 2026-02-02T12:28:23Z

+	// 1. Fetch Comments
+	comments, _, err := gh.ListComments(ctx, org, repo, issue.GetNumber(), &github.IssueListCommentsOptions{
+		ListOptions: github.ListOptions{PerPage: 100},
+	})
+	if err != nil {
+		log.Printf("[Worker %d] Error fetching comments for #%d: %v", workerID, issue.GetNumber(), err)
+		return
+	}
+
+	// 2. Aggregate Text
+	var sb strings.Builder
+	sb.WriteString(fmt.Sprintf("Title: %s\n\n", issue.GetTitle()))
+	sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody()))
+	if len(comments) > 0 {
+		sb.WriteString("Comments:\n")
+		for _, c := range comments {


The comment fetching doesn't handle pagination. The PerPage is set to 100, but issues with more than 100 comments will only have their first 100 comments indexed. Issue #18 requires proper pagination support. You should implement pagination for comments similar to how it's done for issues (lines 163-188), checking resp.NextPage and looping until all comments are fetched.

Suggested change

-	// 1. Fetch Comments
-	comments, _, err := gh.ListComments(ctx, org, repo, issue.GetNumber(), &github.IssueListCommentsOptions{
-		ListOptions: github.ListOptions{PerPage: 100},
-	})
-	if err != nil {
-		log.Printf("[Worker %d] Error fetching comments for #%d: %v", workerID, issue.GetNumber(), err)
-		return
-	}
-
-	// 2. Aggregate Text
-	var sb strings.Builder
-	sb.WriteString(fmt.Sprintf("Title: %s\n\n", issue.GetTitle()))
-	sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody()))
-	if len(comments) > 0 {
-		sb.WriteString("Comments:\n")
-		for _, c := range comments {
+	// 1. Fetch Comments (with pagination)
+	var allComments []*github.IssueComment
+	page := 1
+	for {
+		comments, resp, err := gh.ListComments(ctx, org, repo, issue.GetNumber(), &github.IssueListCommentsOptions{
+			ListOptions: github.ListOptions{PerPage: 100, Page: page},
+		})
+		if err != nil {
+			log.Printf("[Worker %d] Error fetching comments for #%d: %v", workerID, issue.GetNumber(), err)
+			return
+		}
+		allComments = append(allComments, comments...)
+		if resp == nil || resp.NextPage == 0 {
+			break
+		}
+		page = resp.NextPage
+	}
+
+	// 2. Aggregate Text
+	var sb strings.Builder
+	sb.WriteString(fmt.Sprintf("Title: %s\n\n", issue.GetTitle()))
+	sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody()))
+	if len(allComments) > 0 {
+		sb.WriteString("Comments:\n")
+		for _, c := range allComments {
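The same "follow NextPage until it is zero" loop can be factored out and exercised without a network call. The sketch below assumes a hypothetical `fetchPage` callback that returns one page of items plus the next page number (0 when there are no more pages); it is not part of the PR.

```go
package main

import "fmt"

// fetchPage is a hypothetical paginated source: it returns the items on the
// requested page and the next page number, or 0 when there are no more pages.
type fetchPage func(page int) (items []string, nextPage int, err error)

// fetchAll keeps requesting pages until the source reports no next page.
func fetchAll(fetch fetchPage) ([]string, error) {
	var all []string
	page := 1
	for {
		items, next, err := fetch(page)
		if err != nil {
			return nil, fmt.Errorf("failed to fetch page %d: %w", page, err)
		}
		all = append(all, items...)
		if next == 0 {
			break
		}
		page = next
	}
	return all, nil
}

func main() {
	// A fake two-page data set standing in for an issue's comments.
	pages := map[int][]string{1: {"a", "b"}, 2: {"c"}}
	all, err := fetchAll(func(page int) ([]string, int, error) {
		next := page + 1
		if _, ok := pages[next]; !ok {
			next = 0
		}
		return pages[page], next, nil
	})
	fmt.Println(all, err)
}
```

Separating the loop from the API call makes it straightforward to unit-test the "more than one page" case that the original code missed.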

Copilot · 2026-02-02T12:28:23Z

@@ -0,0 +1,109 @@
+// Author: Kaviru Hapuarachchi
+// GitHub: https://github.com/Kavirubc


The GitHub username in the header comment is inconsistent with other files. This file uses "Kavirubc" (capital K, lowercase b,c) while other files in the codebase use "kavirubc" (all lowercase). For example, see cmd/simili/commands/root.go:2 and cmd/simili/commands/process.go:2 which use "kavirubc". This inconsistency should be fixed for uniformity.

Suggested change

-// GitHub: https://github.com/Kavirubc
+// GitHub: https://github.com/kavirubc

Copilot · 2026-02-02T12:28:23Z

@@ -0,0 +1,47 @@
+// Author: Kaviru Hapuarachchi
+// GitHub: https://github.com/Kavirubc


The GitHub username in the header comment is inconsistent with other files. This file uses "Kavirubc" (capital K, lowercase b,c) while other files in the codebase use "kavirubc" (all lowercase). For example, see cmd/simili/commands/root.go:2 and cmd/simili/commands/process.go:2 which use "kavirubc". This inconsistency should be fixed for uniformity.

Suggested change

-// GitHub: https://github.com/Kavirubc
+// GitHub: https://github.com/kavirubc

Copilot · 2026-02-02T12:28:24Z

+	sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody()))
+	if len(comments) > 0 {
+		sb.WriteString("Comments:\n")
+		for _, c := range comments {
+			sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), c.GetBody()))
+		}


The text aggregation could produce misleading results when issue body or comment body contains nil/empty values. When issue.GetBody() returns an empty string, line 208 will still add "Body: \n\n" to the aggregated text. Similarly, if a comment has an empty body, it will add "- username: \n" (line 212). This could inflate the chunk count unnecessarily and potentially confuse the embedding model with empty labeled sections. Consider checking for empty strings and omitting these sections when they're empty.

Suggested change

-	sb.WriteString(fmt.Sprintf("Body: %s\n\n", issue.GetBody()))
-	if len(comments) > 0 {
-		sb.WriteString("Comments:\n")
-		for _, c := range comments {
-			sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), c.GetBody()))
-		}
+	issueBody := strings.TrimSpace(issue.GetBody())
+	if issueBody != "" {
+		sb.WriteString(fmt.Sprintf("Body: %s\n\n", issueBody))
+	}
+	hasCommentContent := false
+	for _, c := range comments {
+		commentBody := strings.TrimSpace(c.GetBody())
+		if commentBody == "" {
+			continue
+		}
+		if !hasCommentContent {
+			sb.WriteString("Comments:\n")
+			hasCommentContent = true
+		}
+		sb.WriteString(fmt.Sprintf("- %s: %s\n", c.User.GetLogin(), commentBody))

Copilot · 2026-02-02T12:28:24Z

+	}
+
+	// 5. Upsert
+	if indexDryRun {


The processIssue function reads the package-level indexDryRun variable (line 229) instead of receiving it as a parameter. This compiles and works in Go, but it couples the function to global state and makes it harder to test. Consider passing dryRun as a parameter to processIssue for better testability and a clearer function contract.
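A sketch of what the parameterised form buys in a test. Everything here is hypothetical: the `upserter` interface, the `recordingStore` fake, and this simplified `processIssue` signature are illustrations, not the PR's actual types.

```go
package main

import "fmt"

// upserter is a hypothetical abstraction over the vector store.
type upserter interface {
	Upsert(id string, chunks []string) error
}

// recordingStore is a test double that counts Upsert calls.
type recordingStore struct{ calls int }

func (r *recordingStore) Upsert(id string, chunks []string) error {
	r.calls++
	return nil
}

// processIssue takes dryRun explicitly, so behaviour depends only on its
// arguments and not on package-level state.
func processIssue(store upserter, id string, chunks []string, dryRun bool) error {
	if dryRun {
		fmt.Printf("[dry-run] would upsert %d chunks for %s\n", len(chunks), id)
		return nil
	}
	return store.Upsert(id, chunks)
}

func main() {
	store := &recordingStore{}
	_ = processIssue(store, "issue-1", []string{"chunk"}, true)
	_ = processIssue(store, "issue-1", []string{"chunk"}, false)
	fmt.Println("upserts:", store.calls)
}
```

Because the flag is an argument, a test can cover both branches in the same process without mutating and restoring a global.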

- Check error return from MarkFlagRequired
- Remove unused checkpointFile constant
- Remove unused len() method from RecursiveCharacterSplitter
- Remove unused unicode/utf8 import

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

**Error Handling:**
- Add descriptive error wrapping to ListIssues and ListComments methods

**Robustness:**
- Implement pagination for comment fetching (handles >100 comments)
- Add nil check for deleted users in comment processing
- Handle empty issue bodies and comments gracefully
- Pass dryRun as parameter instead of using global variable

**Text Chunking:**
- Implement proper overlap logic in RecursiveCharacterSplitter
- Preserve context across chunks using ChunkOverlap configuration

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Kavirubc added 4 commits February 2, 2026 17:47

feat(github): add ListIssues and ListComments methods

e3c12f6

feat(utils): add text chunker for embedding optimization

e4c2ca4

feat(cli): add 'index' command for bulk issue backfilling

ba366e3

chore: ignore local binaries and configs

5a0d995

Kavirubc requested a review from Copilot February 2, 2026 12:21

Copilot started reviewing on behalf of Kavirubc February 2, 2026 12:21

feat: Introduce .simili.yaml configuration file and adjust `.gitign…

6afd239

…ore` to track it and `.simili-index-checkpoint.json`.

Copilot AI reviewed Feb 2, 2026

Kavirubc and others added 3 commits February 2, 2026 20:45

docs: Update session log with details on Copilot review fixes and cur…

6bc0afc

…rent verification status.

Kavirubc merged commit 6cfe224 into main Feb 2, 2026
3 checks passed

Kavirubc deleted the feature/bulk-index-0.1.0v branch February 3, 2026 08:41

Kavirubc linked an issue Feb 3, 2026 that may be closed by this pull request

[v0.1.0] CLI: Implement 'index' command for bulk backfill #18

Closed

gh-simili-bot mentioned this pull request Feb 13, 2026

[Performance]: Parallelize Batch Embedding Generation #51

Closed

Kavirubc merged 8 commits into main from feature/bulk-index-0.1.0v
