Skip to content

feat: CLI PR indexing (#46) & standardized embedding content (#47)#48

Merged
Kavirubc merged 6 commits intomainfrom
feature/cli-pr-index
Feb 13, 2026
Merged

feat: CLI PR indexing (#46) & standardized embedding content (#47)#48
Kavirubc merged 6 commits intomainfrom
feature/cli-pr-index

Conversation

@Kavirubc
Copy link
Copy Markdown
Contributor

@Kavirubc Kavirubc commented Feb 13, 2026

Summary

Changes

File Action
internal/utils/text/content.go New — shared BuildEmbeddingContent + Comment type
internal/utils/text/content_test.go New — 6 unit tests, all passing
cmd/simili/commands/index.go Modified — PR support, --include-prs flag, shared builder, dynamic type
internal/steps/indexer.go Modified — GitHub client injection, comment fetching, shared builder, type payload

Test plan

  • go test ./internal/utils/text/... -run TestBuildEmbeddingContent -v — all 6 cases pass
  • go test ./... — full suite green, zero regressions
  • go build ./... — clean build

Notes

Previously indexed items will have stale embeddings (no comments). Re-run simili index --repo <owner/repo> to backfill with richer embeddings.

To exclude PRs (old behaviour): simili index --repo <owner/repo> --include-prs=false

Closes #46
Closes #47

Summary by CodeRabbit

  • New Features
    • Added configuration option to include or exclude pull requests from indexing
    • Enhanced indexing to incorporate GitHub issue comments for richer embedding content
    • Index items now record whether they are issues or pull requests, improving search relevance
    • More stable, deterministic chunking for consistent indexing and retrieval behavior

Add BuildEmbeddingContent to the text package to consolidate
embedding content construction (title + body + comments) into
a single reusable function with full unit test coverage.
- Remove hard-coded PR skip; replace with --include-prs flag (default true)
- Use shared text.BuildEmbeddingContent instead of manual string building
- Set payload "type" dynamically to "issue" or "pull_request"
- Drop unused fmt import
- Inject GitHub client into Indexer for comment fetching at index time
- Use shared text.BuildEmbeddingContent for richer embeddings
- Add "type" field (EventType) to Qdrant payload
- Use map[string]any in place of map[string]interface{}
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 13, 2026

📝 Walkthrough

Walkthrough

Adds optional PR indexing and unified embedding content: a CLI flag to include/exclude pull requests, indexer-side GitHub comment fetching, a shared BuildEmbeddingContent helper, deterministic chunk IDs, and payload type metadata; updates payload typing to map[string]any.

Changes

Cohort / File(s) Summary
CLI Index Command
cmd/simili/commands/index.go
Adds indexIncludePRs flag to control PR inclusion; replaces hard-coded PR skip with conditional; computes itemType and includes type in chunk payloads; uses BuildEmbeddingContent for text aggregation; preserves dry-run and error behavior.
Indexer Step
internal/steps/indexer.go
Adds github client field and wires GitHub comments fetch (paginated); classifies items as "issue" or "pull_request"; uses BuildEmbeddingContent; sets deterministic chunk IDs (MD5 over org/repo/issue/chunk); changes payloads to map[string]any and includes type.
Text Utility
internal/utils/text/content.go
New public Comment type and BuildEmbeddingContent(title, body, comments) to assemble embedding-ready strings with title, optional body, and formatted comments (skips empty comment bodies).
Text Utility Tests
internal/utils/text/content_test.go
New tests covering title/body, comments inclusion, whitespace handling, skipping empty comments, and nil comment slices.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Index Command
    participant Indexer as Indexer Step
    participant GitHub as GitHub Client
    participant TextUtil as Text Utility
    participant Embedder as Gemini Embedder
    participant Store as Vector Store

    CLI->>Indexer: Start processing item (respect indexIncludePRs)
    Indexer->>GitHub: Fetch issue/PR comments (paginated)
    GitHub-->>Indexer: Return comments or error (warn & continue)
    Indexer->>TextUtil: BuildEmbeddingContent(title, body, comments)
    TextUtil-->>Indexer: Formatted embedding content
    Indexer->>Embedder: Request embedding(vector) for content
    Embedder-->>Indexer: Return embedding vector
    Indexer->>Store: Upsert vectors with payload (org, repo, issue_number, text, url, type, chunk_id)
    Store-->>Indexer: Acknowledge
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I stitched the threads of issue and PR,
Comments gathered near and far,
A flag, a hash, content so bright—
Embeddings hop into the night! 🥕

🚥 Pre-merge checks | ✅ 5 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects both main objectives: CLI PR indexing support (#46) and standardized embedding content (#47).
Linked Issues check ✅ Passed All coding requirements from #46 and #47 are met: PR indexing with --include-prs flag, shared BuildEmbeddingContent utility, type field in payloads, and GitHub comment integration.
Out of Scope Changes check ✅ Passed All changes are directly aligned with #46 (PR indexing with flag) and #47 (shared embedding content with comments); no out-of-scope modifications detected.
Merge Conflict Detection ✅ Passed ✅ No merge conflicts detected when merging into main

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/cli-pr-index

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@cmd/simili/commands/index.go`:
- Around line 246-264: The code creates non-deterministic point IDs using
uuid.New() in the loop that builds points (points[i] = &qdrant.Point{ ID:
uuid.New().String(), ... }), which breaks idempotent re-indexing; change this to
a deterministic UUID (like the pattern used in internal/steps/indexer.go with
uuid.NewMD5) by generating the UUID from stable inputs (org, repo,
issue.GetNumber(), and the chunk index i — include chunk index so each chunk
gets a unique but repeatable ID) and assign that string to ID; add any needed
imports (e.g., fmt or bytes) to build the name/seed for uuid.NewMD5.

In `@internal/steps/indexer.go`:
- Line 112: The code in internal/steps/indexer.go currently uses
ctx.Issue.EventType for the "type" field which yields webhook event names (e.g.,
"issues", "issue_comment") instead of the required item types "issue" or
"pull_request"; change this by detecting whether the Issue is a PR and emitting
"pull_request" or otherwise "issue" — either (A) add an IsPullRequest bool to
pipeline.Issue and populate it during webhook parsing, then use
ctx.Issue.IsPullRequest to set the "type" value, or (B) add a helper like
determineItemType(issue *pipeline.Issue) string that inspects EventType or other
webhook fields to return "pull_request" or "issue", and replace uses of
ctx.Issue.EventType with that helper in the indexer (referencing ctx.Issue,
pipeline.Issue, and the indexer function in internal/steps/indexer.go).
- Around line 65-84: The comments fetching in the indexer (where
s.github.ListComments is used) only requests the first page; change it to
iterate pages using github.ListOptions (e.g., loop with opt :=
&github.ListOptions{Page: p, PerPage: 100} and increment p until no results) and
append each page’s comment bodies to textComments, preserving the existing
trimming/author logic; also add the import
"github.com/google/go-github/v60/github" for the options type.
🧹 Nitpick comments (3)
internal/utils/text/content.go (1)

26-34: Inconsistent whitespace handling between body and comment bodies.

The main body is trimmed before checking emptiness (line 26), but comment bodies are checked without trimming (line 32). This means whitespace-only comment bodies will be included, unlike whitespace-only main bodies.

While callers currently trim comment bodies before passing them, the function behavior is inconsistent with its documented behavior ("Comments with empty bodies are skipped").

♻️ Proposed fix for consistency
 	hasHeader := false
 	for _, c := range comments {
-		if c.Body == "" {
+		if strings.TrimSpace(c.Body) == "" {
 			continue
 		}
internal/utils/text/content_test.go (1)

54-63: Consider adding a test for whitespace-only comment bodies.

The test suite covers whitespace-only main body (lines 46-53), but there's no test for whitespace-only comment bodies. Adding such a test would clarify the expected behavior and catch any future regressions if the trim logic is added to the function.

🧪 Suggested additional test case
{
    name:  "whitespace-only comment body skipped",
    title: "My Issue",
    body:  "Body",
    comments: []Comment{
        {Author: "alice", Body: "   \n  "},
        {Author: "bob", Body: "Valid comment"},
    },
    want: "Title: My Issue\n\nBody: Body\n\nComments:\n- bob: Valid comment\n",
},
cmd/simili/commands/index.go (1)

229-244: Consider handling chunk count changes on re-indexing.

When re-indexing with richer content (e.g., more comments), the number of chunks may change. If an issue previously had more chunks than the current run produces, stale chunk points will remain in Qdrant.

Consider adding a deletion step before upserting, or using a filter-based delete for the issue number before inserting new chunks.

Comment thread cmd/simili/commands/index.go
Comment thread internal/steps/indexer.go Outdated
Comment on lines +65 to +84
// Fetch comments for richer embedding content.
var textComments []text.Comment
if s.github != nil {
ghComments, _, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, nil)
if err != nil {
log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err)
} else {
for _, c := range ghComments {
body := strings.TrimSpace(c.GetBody())
if body == "" {
continue
}
author := "deleted-user"
if c.User != nil {
author = c.User.GetLogin()
}
textComments = append(textComments, text.Comment{Author: author, Body: body})
}
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing pagination when fetching comments.

Unlike cmd/simili/commands/index.go (lines 199-213) which paginates through all comments, this implementation only fetches the first page by passing nil options. Issues with many comments will have truncated embedding content.

🐛 Proposed fix to add pagination
 	// Fetch comments for richer embedding content.
 	var textComments []text.Comment
 	if s.github != nil {
-		ghComments, _, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, nil)
-		if err != nil {
-			log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err)
-		} else {
-			for _, c := range ghComments {
-				body := strings.TrimSpace(c.GetBody())
-				if body == "" {
-					continue
+		page := 1
+		for {
+			opts := &github.IssueListCommentsOptions{
+				ListOptions: github.ListOptions{PerPage: 100, Page: page},
+			}
+			ghComments, resp, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, opts)
+			if err != nil {
+				log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err)
+				break
+			}
+			for _, c := range ghComments {
+				body := strings.TrimSpace(c.GetBody())
+				if body == "" {
+					continue
+				}
+				author := "deleted-user"
+				if c.User != nil {
+					author = c.User.GetLogin()
 				}
-				author := "deleted-user"
-				if c.User != nil {
-					author = c.User.GetLogin()
-				}
-				textComments = append(textComments, text.Comment{Author: author, Body: body})
+				textComments = append(textComments, text.Comment{Author: author, Body: body})
+			}
+			if resp == nil || resp.NextPage == 0 {
+				break
 			}
+			page = resp.NextPage
 		}
 	}

Note: This requires importing "github.com/google/go-github/v60/github" for the options type.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Fetch comments for richer embedding content.
var textComments []text.Comment
if s.github != nil {
ghComments, _, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, nil)
if err != nil {
log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err)
} else {
for _, c := range ghComments {
body := strings.TrimSpace(c.GetBody())
if body == "" {
continue
}
author := "deleted-user"
if c.User != nil {
author = c.User.GetLogin()
}
textComments = append(textComments, text.Comment{Author: author, Body: body})
}
}
}
// Fetch comments for richer embedding content.
var textComments []text.Comment
if s.github != nil {
page := 1
for {
opts := &github.IssueListCommentsOptions{
ListOptions: github.ListOptions{PerPage: 100, Page: page},
}
ghComments, resp, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, opts)
if err != nil {
log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err)
break
}
for _, c := range ghComments {
body := strings.TrimSpace(c.GetBody())
if body == "" {
continue
}
author := "deleted-user"
if c.User != nil {
author = c.User.GetLogin()
}
textComments = append(textComments, text.Comment{Author: author, Body: body})
}
if resp == nil || resp.NextPage == 0 {
break
}
page = resp.NextPage
}
}
🤖 Prompt for AI Agents
In `@internal/steps/indexer.go` around lines 65 - 84, The comments fetching in the
indexer (where s.github.ListComments is used) only requests the first page;
change it to iterate pages using github.ListOptions (e.g., loop with opt :=
&github.ListOptions{Page: p, PerPage: 100} and increment p until no results) and
append each page’s comment bodies to textComments, preserving the existing
trimming/author logic; also add the import
"github.com/google/go-github/v60/github" for the options type.

Comment thread internal/steps/indexer.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the CLI index command to index pull requests (optionally disabled via a flag) and standardizes embedding text construction across the CLI backfill path and the live bot indexer by introducing a shared BuildEmbeddingContent utility. It also adds a type field to Qdrant payloads to distinguish issues vs pull requests.

Changes:

  • Add internal/utils/text.BuildEmbeddingContent (and tests) to generate consistent embedding text as Title + Body + Comments.
  • Update simili index to include PRs by default, add --include-prs flag, and use the shared embedding builder.
  • Update the live bot Indexer to fetch comments at index time and store a type field in Qdrant payloads.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
internal/utils/text/content.go New shared embedding-content builder + Comment type.
internal/utils/text/content_test.go New unit tests for embedding-content formatting and skipping rules.
cmd/simili/commands/index.go CLI now optionally includes PRs and uses shared embedding builder; adds payload type.
internal/steps/indexer.go Bot indexer now fetches comments and uses shared builder; adds payload type.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/utils/text/content.go Outdated
Comment on lines +32 to +39
if c.Body == "" {
continue
}
if !hasHeader {
sb.WriteString("Comments:\n")
hasHeader = true
}
fmt.Fprintf(&sb, "- %s: %s\n", c.Author, c.Body)
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BuildEmbeddingContent trims the issue/PR body but does not trim comment bodies; as written, a whitespace-only comment body (e.g. " \n") will be included in the output even though it’s effectively empty. Consider applying strings.TrimSpace to c.Body inside the builder and using the trimmed value for both the empty check and the rendered output to make the utility robust regardless of caller behavior.

Suggested change
if c.Body == "" {
continue
}
if !hasHeader {
sb.WriteString("Comments:\n")
hasHeader = true
}
fmt.Fprintf(&sb, "- %s: %s\n", c.Author, c.Body)
cb := strings.TrimSpace(c.Body)
if cb == "" {
continue
}
if !hasHeader {
sb.WriteString("Comments:\n")
hasHeader = true
}
fmt.Fprintf(&sb, "- %s: %s\n", c.Author, cb)

Copilot uses AI. Check for mistakes.
Comment thread internal/steps/indexer.go Outdated
Comment on lines +65 to +83
// Fetch comments for richer embedding content.
var textComments []text.Comment
if s.github != nil {
ghComments, _, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, nil)
if err != nil {
log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err)
} else {
for _, c := range ghComments {
body := strings.TrimSpace(c.GetBody())
if body == "" {
continue
}
author := "deleted-user"
if c.User != nil {
author = c.User.GetLogin()
}
textComments = append(textComments, text.Comment{Author: author, Body: body})
}
}
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ListComments is called with opts=nil, so only the first page of comments will be fetched (GitHub API default page size) and additional pages are silently ignored. This makes the bot’s embeddings incomplete and inconsistent with the CLI backfill (which paginates). Consider paginating through all comment pages (or explicitly setting PerPage and looping on resp.NextPage).

Copilot uses AI. Check for mistakes.
Comment thread internal/steps/indexer.go
Comment on lines 103 to 113
@@ -83,6 +109,7 @@ func (s *Indexer) Run(ctx *pipeline.Context) error {
"state": ctx.Issue.State,
"author": ctx.Issue.Author,
"labels": ctx.Issue.Labels,
"type": ctx.Issue.EventType,
},
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says the Qdrant payload type should be normalized to "issue" / "pull_request", but here it’s set to ctx.Issue.EventType (values like "issues", "issue_comment", "pr_comment", etc.). Consider mapping event types to the canonical stored types (e.g., PR + PR comments => pull_request, everything else => issue) so downstream consumers can reliably filter on type.

Copilot uses AI. Check for mistakes.
Apply strings.TrimSpace inside the builder itself so callers don't need
to pre-trim, and whitespace-only bodies are always skipped consistently.
- Paginate ListComments (100/page) to match CLI behaviour; no more
  silent truncation on issues with >30 comments
- Map EventType to canonical "issue" / "pull_request" so downstream
  consumers can reliably filter; previous use of EventType raw values
  ("issues", "issue_comment", "pr_comment") was inconsistent with CLI
Replace uuid.New() (random) with uuid.NewMD5 keyed on
"org/repo#number-chunk-N" so re-running the index command upserts
existing points rather than creating duplicates.
@Kavirubc Kavirubc merged commit f810f3a into main Feb 13, 2026
3 of 4 checks passed
@Kavirubc Kavirubc deleted the feature/cli-pr-index branch February 13, 2026 09:32
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@cmd/simili/commands/index.go`:
- Around line 28-33: The comment for indexSince promises issue-number support
that runIndex doesn't implement; update the indexSince declaration comment and
the CLI flag help text (where indexSince is registered) to state it only accepts
RFC3339 timestamps, and remove any mention of numeric/issue-number semantics;
alternatively, if you prefer to implement the feature, modify runIndex to detect
numeric indexSince (using strconv.Atoi/ParseInt) and treat it as an issue number
filter (compare against issue.Number when iterating), but do not claim both
behaviors—pick one and make indexSince's comment/help and runIndex behavior
consistent (refer to the indexSince variable and the runIndex function).

Comment on lines +28 to +33
indexRepo string
indexSince string // Can be a timestamp (ISO8601) or issue number (int)
indexWorkers int
indexToken string
indexDryRun bool
indexIncludePRs bool
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

indexSince doc mentions issue-number support, but it isn’t implemented.

runIndex only parses RFC3339 timestamps; numeric values are ignored. Either implement numeric filtering or adjust the comment/flag help to avoid promising behavior that doesn’t exist.

🤖 Prompt for AI Agents
In `@cmd/simili/commands/index.go` around lines 28 - 33, The comment for
indexSince promises issue-number support that runIndex doesn't implement; update
the indexSince declaration comment and the CLI flag help text (where indexSince
is registered) to state it only accepts RFC3339 timestamps, and remove any
mention of numeric/issue-number semantics; alternatively, if you prefer to
implement the feature, modify runIndex to detect numeric indexSince (using
strconv.Atoi/ParseInt) and treat it as an issue number filter (compare against
issue.Number when iterating), but do not claim both behaviors—pick one and make
indexSince's comment/help and runIndex behavior consistent (refer to the
indexSince variable and the runIndex function).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Standardize indexing content to include comments for both Issues and PRs [Feature]: Support indexing Pull Requests in CLI index command

2 participants