feat: CLI PR indexing (#46) & standardized embedding content (#47) by Kavirubc · Pull Request #48 · similigh/simili-bot

Kavirubc · 2026-02-13T09:18:46Z

Summary

[Feature]: Support indexing Pull Requests in CLI index command #46 — CLI PR indexing: The index command now indexes pull requests by default. A new --include-prs flag (default true) lets you opt out of PR indexing to restore the previous behaviour.
[Feature]: Standardize indexing content to include comments for both Issues and PRs #47 — Standardized embedding content: Both the CLI backfill pipeline and the live bot indexer now embed Title + Body + Comments using a shared BuildEmbeddingContent utility. The bot fetches comments via the GitHub client at index time.
A "type" field ("issue" / "pull_request") is now stored in every Qdrant payload point.

Changes

File	Action
`internal/utils/text/content.go`	New — shared `BuildEmbeddingContent` + `Comment` type
`internal/utils/text/content_test.go`	New — 6 unit tests, all passing
`cmd/simili/commands/index.go`	Modified — PR support, `--include-prs` flag, shared builder, dynamic `type`
`internal/steps/indexer.go`	Modified — GitHub client injection, comment fetching, shared builder, `type` payload

Test plan

go test ./internal/utils/text/... -run TestBuildEmbeddingContent -v — all 6 cases pass
go test ./... — full suite green, zero regressions
go build ./... — clean build

Notes

Previously indexed items will have stale embeddings (no comments). Re-run simili index --repo <owner/repo> to backfill with richer embeddings.

To exclude PRs (old behaviour): simili index --repo <owner/repo> --include-prs=false

Closes #46
Closes #47

Summary by CodeRabbit

New Features
- Added configuration option to include or exclude pull requests from indexing
- Enhanced indexing to incorporate GitHub issue comments for richer embedding content
- Index items now record whether they are issues or pull requests, improving search relevance
- More stable, deterministic chunking for consistent indexing and retrieval behavior

Add BuildEmbeddingContent to the text package to consolidate embedding content construction (title + body + comments) into a single reusable function with full unit test coverage.

- Remove hard-coded PR skip; replace with --include-prs flag (default true) - Use shared text.BuildEmbeddingContent instead of manual string building - Set payload "type" dynamically to "issue" or "pull_request" - Drop unused fmt import

- Inject GitHub client into Indexer for comment fetching at index time - Use shared text.BuildEmbeddingContent for richer embeddings - Add "type" field (EventType) to Qdrant payload - Use map[string]any in place of map[string]interface{}

coderabbitai · 2026-02-13T09:19:08Z

📝 Walkthrough

Walkthrough

Adds optional PR indexing and unified embedding content: a CLI flag to include/exclude pull requests, indexer-side GitHub comment fetching, a shared BuildEmbeddingContent helper, deterministic chunk IDs, and payload type metadata; updates payload typing to map[string]any.

Changes

Cohort / File(s)	Summary
CLI Index Command `cmd/simili/commands/index.go`	Adds `indexIncludePRs` flag to control PR inclusion; replaces hard-coded PR skip with conditional; computes `itemType` and includes `type` in chunk payloads; uses `BuildEmbeddingContent` for text aggregation; preserves dry-run and error behavior.
Indexer Step `internal/steps/indexer.go`	Adds `github` client field and wires GitHub comments fetch (paginated); classifies items as `"issue"` or `"pull_request"`; uses `BuildEmbeddingContent`; sets deterministic chunk IDs (MD5 over org/repo/issue/chunk); changes payloads to `map[string]any` and includes `type`.
Text Utility `internal/utils/text/content.go`	New public `Comment` type and `BuildEmbeddingContent(title, body, comments)` to assemble embedding-ready strings with title, optional body, and formatted comments (skips empty comment bodies).
Text Utility Tests `internal/utils/text/content_test.go`	New tests covering title/body, comments inclusion, whitespace handling, skipping empty comments, and nil comment slices.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Index Command
    participant Indexer as Indexer Step
    participant GitHub as GitHub Client
    participant TextUtil as Text Utility
    participant Embedder as Gemini Embedder
    participant Store as Vector Store

    CLI->>Indexer: Start processing item (respect indexIncludePRs)
    Indexer->>GitHub: Fetch issue/PR comments (paginated)
    GitHub-->>Indexer: Return comments or error (warn & continue)
    Indexer->>TextUtil: BuildEmbeddingContent(title, body, comments)
    TextUtil-->>Indexer: Formatted embedding content
    Indexer->>Embedder: Request embedding(vector) for content
    Embedder-->>Indexer: Return embedding vector
    Indexer->>Store: Upsert vectors with payload (org, repo, issue_number, text, url, type, chunk_id)
    Store-->>Indexer: Acknowledge

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I stitched the threads of issue and PR,
Comments gathered near and far,
A flag, a hash, content so bright—
Embeddings hop into the night! 🥕

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects both main objectives: CLI PR indexing support (`#46`) and standardized embedding content (`#47`).
Linked Issues check	✅ Passed	All coding requirements from `#46` and `#47` are met: PR indexing with --include-prs flag, shared BuildEmbeddingContent utility, type field in payloads, and GitHub comment integration.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with `#46` (PR indexing with flag) and `#47` (shared embedding content with comments); no out-of-scope modifications detected.
Merge Conflict Detection	✅ Passed	✅ No merge conflicts detected when merging into `main`

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/cli-pr-index

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@cmd/simili/commands/index.go`:
- Around line 246-264: The code creates non-deterministic point IDs using
uuid.New() in the loop that builds points (points[i] = &qdrant.Point{ ID:
uuid.New().String(), ... }), which breaks idempotent re-indexing; change this to
a deterministic UUID (like the pattern used in internal/steps/indexer.go with
uuid.NewMD5) by generating the UUID from stable inputs (org, repo,
issue.GetNumber(), and the chunk index i — include chunk index so each chunk
gets a unique but repeatable ID) and assign that string to ID; add any needed
imports (e.g., fmt or bytes) to build the name/seed for uuid.NewMD5.

In `@internal/steps/indexer.go`:
- Line 112: The code in internal/steps/indexer.go currently uses
ctx.Issue.EventType for the "type" field which yields webhook event names (e.g.,
"issues", "issue_comment") instead of the required item types "issue" or
"pull_request"; change this by detecting whether the Issue is a PR and emitting
"pull_request" or otherwise "issue" — either (A) add an IsPullRequest bool to
pipeline.Issue and populate it during webhook parsing, then use
ctx.Issue.IsPullRequest to set the "type" value, or (B) add a helper like
determineItemType(issue *pipeline.Issue) string that inspects EventType or other
webhook fields to return "pull_request" or "issue", and replace uses of
ctx.Issue.EventType with that helper in the indexer (referencing ctx.Issue,
pipeline.Issue, and the indexer function in internal/steps/indexer.go).
- Around line 65-84: The comments fetching in the indexer (where
s.github.ListComments is used) only requests the first page; change it to
iterate pages using github.ListOptions (e.g., loop with opt :=
&github.ListOptions{Page: p, PerPage: 100} and increment p until no results) and
append each page’s comment bodies to textComments, preserving the existing
trimming/author logic; also add the import
"github.com/google/go-github/v60/github" for the options type.

🧹 Nitpick comments (3)

internal/utils/text/content.go (1)
26-34: Inconsistent whitespace handling between body and comment bodies.

The main body is trimmed before checking emptiness (line 26), but comment bodies are checked without trimming (line 32). This means whitespace-only comment bodies will be included, unlike whitespace-only main bodies.

While callers currently trim comment bodies before passing them, the function behavior is inconsistent with its documented behavior ("Comments with empty bodies are skipped").
♻️ Proposed fix for consistency
 	hasHeader := false
 	for _, c := range comments {
-		if c.Body == "" {
+		if strings.TrimSpace(c.Body) == "" {
 			continue
 		}
internal/utils/text/content_test.go (1)
54-63: Consider adding a test for whitespace-only comment bodies.

The test suite covers whitespace-only main body (lines 46-53), but there's no test for whitespace-only comment bodies. Adding such a test would clarify the expected behavior and catch any future regressions if the trim logic is added to the function.
🧪 Suggested additional test case
{
    name:  "whitespace-only comment body skipped",
    title: "My Issue",
    body:  "Body",
    comments: []Comment{
        {Author: "alice", Body: "   \n  "},
        {Author: "bob", Body: "Valid comment"},
    },
    want: "Title: My Issue\n\nBody: Body\n\nComments:\n- bob: Valid comment\n",
},
cmd/simili/commands/index.go (1)

229-244: Consider handling chunk count changes on re-indexing.

When re-indexing with richer content (e.g., more comments), the number of chunks may change. If an issue previously had more chunks than the current run produces, stale chunk points will remain in Qdrant.

Consider adding a deletion step before upserting, or using a filter-based delete for the issue number before inserting new chunks.

coderabbitai · 2026-02-13T09:22:34Z

+	// Fetch comments for richer embedding content.
+	var textComments []text.Comment
+	if s.github != nil {
+		ghComments, _, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, nil)
+		if err != nil {
+			log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err)
+		} else {
+			for _, c := range ghComments {
+				body := strings.TrimSpace(c.GetBody())
+				if body == "" {
+					continue
+				}
+				author := "deleted-user"
+				if c.User != nil {
+					author = c.User.GetLogin()
+				}
+				textComments = append(textComments, text.Comment{Author: author, Body: body})
+			}
+		}
+	}


⚠️ Potential issue | 🟠 Major

Missing pagination when fetching comments.

Unlike cmd/simili/commands/index.go (lines 199-213) which paginates through all comments, this implementation only fetches the first page by passing nil options. Issues with many comments will have truncated embedding content.

🐛 Proposed fix to add pagination

// Fetch comments for richer embedding content. var textComments []text.Comment if s.github != nil { - ghComments, _, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, nil) - if err != nil { - log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err) - } else { - for _, c := range ghComments { - body := strings.TrimSpace(c.GetBody()) - if body == "" { - continue + page := 1 + for { + opts := &github.IssueListCommentsOptions{ + ListOptions: github.ListOptions{PerPage: 100, Page: page}, + } + ghComments, resp, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, opts) + if err != nil { + log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err) + break + } + for _, c := range ghComments { + body := strings.TrimSpace(c.GetBody()) + if body == "" { + continue + } + author := "deleted-user" + if c.User != nil { + author = c.User.GetLogin() } - author := "deleted-user" - if c.User != nil { - author = c.User.GetLogin() - } - textComments = append(textComments, text.Comment{Author: author, Body: body}) + textComments = append(textComments, text.Comment{Author: author, Body: body}) + } + if resp == nil || resp.NextPage == 0 { + break } + page = resp.NextPage } }

Note: This requires importing "github.com/google/go-github/v60/github" for the options type.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Fetch comments for richer embedding content.

var textComments []text.Comment

if s.github != nil {

ghComments, _, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, nil)

if err != nil {

log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err)

} else {

for _, c := range ghComments {

body := strings.TrimSpace(c.GetBody())

if body == "" {

continue

}

author := "deleted-user"

if c.User != nil {

author = c.User.GetLogin()

}

textComments = append(textComments, text.Comment{Author: author, Body: body})

}

}

}

// Fetch comments for richer embedding content.

var textComments []text.Comment

if s.github != nil {

page := 1

for {

opts := &github.IssueListCommentsOptions{

ListOptions: github.ListOptions{PerPage: 100, Page: page},

}

ghComments, resp, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, opts)

if err != nil {

log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err)

break

}

for _, c := range ghComments {

body := strings.TrimSpace(c.GetBody())

if body == "" {

continue

}

author := "deleted-user"

if c.User != nil {

author = c.User.GetLogin()

}

textComments = append(textComments, text.Comment{Author: author, Body: body})

}

if resp == nil || resp.NextPage == 0 {

break

}

page = resp.NextPage

}

}

🤖 Prompt for AI Agents

In `@internal/steps/indexer.go` around lines 65 - 84, The comments fetching in the indexer (where s.github.ListComments is used) only requests the first page; change it to iterate pages using github.ListOptions (e.g., loop with opt := &github.ListOptions{Page: p, PerPage: 100} and increment p until no results) and append each page’s comment bodies to textComments, preserving the existing trimming/author logic; also add the import "github.com/google/go-github/v60/github" for the options type.

Copilot

Pull request overview

This PR extends the CLI index command to index pull requests (optionally disabled via a flag) and standardizes embedding text construction across the CLI backfill path and the live bot indexer by introducing a shared BuildEmbeddingContent utility. It also adds a type field to Qdrant payloads to distinguish issues vs pull requests.

Changes:

Add internal/utils/text.BuildEmbeddingContent (and tests) to generate consistent embedding text as Title + Body + Comments.
Update simili index to include PRs by default, add --include-prs flag, and use the shared embedding builder.
Update the live bot Indexer to fetch comments at index time and store a type field in Qdrant payloads.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
`internal/utils/text/content.go`	New shared embedding-content builder + `Comment` type.
`internal/utils/text/content_test.go`	New unit tests for embedding-content formatting and skipping rules.
`cmd/simili/commands/index.go`	CLI now optionally includes PRs and uses shared embedding builder; adds payload `type`.
`internal/steps/indexer.go`	Bot indexer now fetches comments and uses shared builder; adds payload `type`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-13T09:23:57Z

+		if c.Body == "" {
+			continue
+		}
+		if !hasHeader {
+			sb.WriteString("Comments:\n")
+			hasHeader = true
+		}
+		fmt.Fprintf(&sb, "- %s: %s\n", c.Author, c.Body)


BuildEmbeddingContent trims the issue/PR body but does not trim comment bodies; as written, a whitespace-only comment body (e.g. " \n") will be included in the output even though it’s effectively empty. Consider applying strings.TrimSpace to c.Body inside the builder and using the trimmed value for both the empty check and the rendered output to make the utility robust regardless of caller behavior.

Suggested change

if c.Body == "" {

continue

}

if !hasHeader {

sb.WriteString("Comments:\n")

hasHeader = true

}

fmt.Fprintf(&sb, "- %s: %s\n", c.Author, c.Body)

cb := strings.TrimSpace(c.Body)

if cb == "" {

continue

}

if !hasHeader {

sb.WriteString("Comments:\n")

hasHeader = true

}

fmt.Fprintf(&sb, "- %s: %s\n", c.Author, cb)

Copilot · 2026-02-13T09:23:57Z

+	// Fetch comments for richer embedding content.
+	var textComments []text.Comment
+	if s.github != nil {
+		ghComments, _, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, nil)
+		if err != nil {
+			log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err)
+		} else {
+			for _, c := range ghComments {
+				body := strings.TrimSpace(c.GetBody())
+				if body == "" {
+					continue
+				}
+				author := "deleted-user"
+				if c.User != nil {
+					author = c.User.GetLogin()
+				}
+				textComments = append(textComments, text.Comment{Author: author, Body: body})
+			}
+		}


ListComments is called with opts=nil, so only the first page of comments will be fetched (GitHub API default page size) and additional pages are silently ignored. This makes the bot’s embeddings incomplete and inconsistent with the CLI backfill (which paginates). Consider paginating through all comment pages (or explicitly setting PerPage and looping on resp.NextPage).

Copilot · 2026-02-13T09:23:58Z

@@ -83,6 +109,7 @@ func (s *Indexer) Run(ctx *pipeline.Context) error {
 			"state":  ctx.Issue.State,
 			"author": ctx.Issue.Author,
 			"labels": ctx.Issue.Labels,
+			"type":   ctx.Issue.EventType,
 		},


The PR description says the Qdrant payload type should be normalized to "issue" / "pull_request", but here it’s set to ctx.Issue.EventType (values like "issues", "issue_comment", "pr_comment", etc.). Consider mapping event types to the canonical stored types (e.g., PR + PR comments => pull_request, everything else => issue) so downstream consumers can reliably filter on type.

Apply strings.TrimSpace inside the builder itself so callers don't need to pre-trim, and whitespace-only bodies are always skipped consistently.

- Paginate ListComments (100/page) to match CLI behaviour; no more silent truncation on issues with >30 comments - Map EventType to canonical "issue" / "pull_request" so downstream consumers can reliably filter; previous use of EventType raw values ("issues", "issue_comment", "pr_comment") was inconsistent with CLI

Replace uuid.New() (random) with uuid.NewMD5 keyed on "org/repo#number-chunk-N" so re-running the index command upserts existing points rather than creating duplicates.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@cmd/simili/commands/index.go`:
- Around line 28-33: The comment for indexSince promises issue-number support
that runIndex doesn't implement; update the indexSince declaration comment and
the CLI flag help text (where indexSince is registered) to state it only accepts
RFC3339 timestamps, and remove any mention of numeric/issue-number semantics;
alternatively, if you prefer to implement the feature, modify runIndex to detect
numeric indexSince (using strconv.Atoi/ParseInt) and treat it as an issue number
filter (compare against issue.Number when iterating), but do not claim both
behaviors—pick one and make indexSince's comment/help and runIndex behavior
consistent (refer to the indexSince variable and the runIndex function).

coderabbitai · 2026-02-13T09:32:41Z

+	indexRepo       string
+	indexSince      string // Can be a timestamp (ISO8601) or issue number (int)
+	indexWorkers    int
+	indexToken      string
+	indexDryRun     bool
+	indexIncludePRs bool


⚠️ Potential issue | 🟠 Major

indexSince doc mentions issue-number support, but it isn’t implemented.

runIndex only parses RFC3339 timestamps; numeric values are ignored. Either implement numeric filtering or adjust the comment/flag help to avoid promising behavior that doesn’t exist.

🤖 Prompt for AI Agents

In `@cmd/simili/commands/index.go` around lines 28 - 33, The comment for indexSince promises issue-number support that runIndex doesn't implement; update the indexSince declaration comment and the CLI flag help text (where indexSince is registered) to state it only accepts RFC3339 timestamps, and remove any mention of numeric/issue-number semantics; alternatively, if you prefer to implement the feature, modify runIndex to detect numeric indexSince (using strconv.Atoi/ParseInt) and treat it as an issue number filter (compare against issue.Number when iterating), but do not claim both behaviors—pick one and make indexSince's comment/help and runIndex behavior consistent (refer to the indexSince variable and the runIndex function).

Kavirubc added 3 commits February 13, 2026 14:46

feat(text): add shared BuildEmbeddingContent utility

618a8e0

Add BuildEmbeddingContent to the text package to consolidate embedding content construction (title + body + comments) into a single reusable function with full unit test coverage.

Kavirubc requested a review from Copilot February 13, 2026 09:19

Copilot started reviewing on behalf of Kavirubc February 13, 2026 09:20 View session

coderabbitai Bot reviewed Feb 13, 2026

View reviewed changes

Copilot AI reviewed Feb 13, 2026

View reviewed changes

Kavirubc mentioned this pull request Feb 13, 2026

[Enhancement] Visually differentiate Issues vs PRs in Similarity Search results #49

Closed

Kavirubc added 3 commits February 13, 2026 14:57

fix(text): trim whitespace-only comment bodies in BuildEmbeddingContent

e2109e5

Apply strings.TrimSpace inside the builder itself so callers don't need to pre-trim, and whitespace-only bodies are always skipped consistently.

fix(cli): use deterministic UUIDs for idempotent re-indexing

310580c

Replace uuid.New() (random) with uuid.NewMD5 keyed on "org/repo#number-chunk-N" so re-running the index command upserts existing points rather than creating duplicates.

Kavirubc merged commit f810f3a into main Feb 13, 2026
3 of 4 checks passed

Kavirubc deleted the feature/cli-pr-index branch February 13, 2026 09:32

coderabbitai Bot reviewed Feb 13, 2026

View reviewed changes

This was referenced Feb 14, 2026

[Bug]: Closed Issues and PRs are shown as Open in Simili Comment #59

Closed

fix: update stale issue/PR state in similar threads #61

Merged

coderabbitai Bot mentioned this pull request Feb 18, 2026

index and query prs #82

Closed

17 tasks

coderabbitai Bot mentioned this pull request May 8, 2026

fix: Pipeline stability and logic audit #114

Merged

Conversation

Kavirubc commented Feb 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Feb 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Feb 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Kavirubc commented Feb 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Feb 13, 2026 •

edited

Loading