feat: CLI PR indexing (#46) & standardized embedding content (#47)#48
feat: CLI PR indexing (#46) & standardized embedding content (#47)#48
Conversation
Add BuildEmbeddingContent to the text package to consolidate embedding content construction (title + body + comments) into a single reusable function with full unit test coverage.
- Remove hard-coded PR skip; replace with --include-prs flag (default true) - Use shared text.BuildEmbeddingContent instead of manual string building - Set payload "type" dynamically to "issue" or "pull_request" - Drop unused fmt import
- Inject GitHub client into Indexer for comment fetching at index time
- Use shared text.BuildEmbeddingContent for richer embeddings
- Add "type" field (EventType) to Qdrant payload
- Use map[string]any in place of map[string]interface{}
📝 WalkthroughWalkthroughAdds optional PR indexing and unified embedding content: a CLI flag to include/exclude pull requests, indexer-side GitHub comment fetching, a shared Changes
Sequence DiagramsequenceDiagram
participant CLI as CLI Index Command
participant Indexer as Indexer Step
participant GitHub as GitHub Client
participant TextUtil as Text Utility
participant Embedder as Gemini Embedder
participant Store as Vector Store
CLI->>Indexer: Start processing item (respect indexIncludePRs)
Indexer->>GitHub: Fetch issue/PR comments (paginated)
GitHub-->>Indexer: Return comments or error (warn & continue)
Indexer->>TextUtil: BuildEmbeddingContent(title, body, comments)
TextUtil-->>Indexer: Formatted embedding content
Indexer->>Embedder: Request embedding(vector) for content
Embedder-->>Indexer: Return embedding vector
Indexer->>Store: Upsert vectors with payload (org, repo, issue_number, text, url, type, chunk_id)
Store-->>Indexer: Acknowledge
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@cmd/simili/commands/index.go`:
- Around line 246-264: The code creates non-deterministic point IDs using
uuid.New() in the loop that builds points (points[i] = &qdrant.Point{ ID:
uuid.New().String(), ... }), which breaks idempotent re-indexing; change this to
a deterministic UUID (like the pattern used in internal/steps/indexer.go with
uuid.NewMD5) by generating the UUID from stable inputs (org, repo,
issue.GetNumber(), and the chunk index i — include chunk index so each chunk
gets a unique but repeatable ID) and assign that string to ID; add any needed
imports (e.g., fmt or bytes) to build the name/seed for uuid.NewMD5.
In `@internal/steps/indexer.go`:
- Line 112: The code in internal/steps/indexer.go currently uses
ctx.Issue.EventType for the "type" field which yields webhook event names (e.g.,
"issues", "issue_comment") instead of the required item types "issue" or
"pull_request"; change this by detecting whether the Issue is a PR and emitting
"pull_request" or otherwise "issue" — either (A) add an IsPullRequest bool to
pipeline.Issue and populate it during webhook parsing, then use
ctx.Issue.IsPullRequest to set the "type" value, or (B) add a helper like
determineItemType(issue *pipeline.Issue) string that inspects EventType or other
webhook fields to return "pull_request" or "issue", and replace uses of
ctx.Issue.EventType with that helper in the indexer (referencing ctx.Issue,
pipeline.Issue, and the indexer function in internal/steps/indexer.go).
- Around line 65-84: The comments fetching in the indexer (where
s.github.ListComments is used) only requests the first page; change it to
iterate pages using github.ListOptions (e.g., loop with opt :=
&github.ListOptions{Page: p, PerPage: 100} and increment p until no results) and
append each page’s comment bodies to textComments, preserving the existing
trimming/author logic; also add the import
"github.com/google/go-github/v60/github" for the options type.
🧹 Nitpick comments (3)
internal/utils/text/content.go (1)
26-34: Inconsistent whitespace handling between body and comment bodies.The main
bodyis trimmed before checking emptiness (line 26), but comment bodies are checked without trimming (line 32). This means whitespace-only comment bodies will be included, unlike whitespace-only main bodies.While callers currently trim comment bodies before passing them, the function behavior is inconsistent with its documented behavior ("Comments with empty bodies are skipped").
♻️ Proposed fix for consistency
hasHeader := false for _, c := range comments { - if c.Body == "" { + if strings.TrimSpace(c.Body) == "" { continue }internal/utils/text/content_test.go (1)
54-63: Consider adding a test for whitespace-only comment bodies.The test suite covers whitespace-only main body (lines 46-53), but there's no test for whitespace-only comment bodies. Adding such a test would clarify the expected behavior and catch any future regressions if the trim logic is added to the function.
🧪 Suggested additional test case
{ name: "whitespace-only comment body skipped", title: "My Issue", body: "Body", comments: []Comment{ {Author: "alice", Body: " \n "}, {Author: "bob", Body: "Valid comment"}, }, want: "Title: My Issue\n\nBody: Body\n\nComments:\n- bob: Valid comment\n", },cmd/simili/commands/index.go (1)
229-244: Consider handling chunk count changes on re-indexing.When re-indexing with richer content (e.g., more comments), the number of chunks may change. If an issue previously had more chunks than the current run produces, stale chunk points will remain in Qdrant.
Consider adding a deletion step before upserting, or using a filter-based delete for the issue number before inserting new chunks.
| // Fetch comments for richer embedding content. | ||
| var textComments []text.Comment | ||
| if s.github != nil { | ||
| ghComments, _, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, nil) | ||
| if err != nil { | ||
| log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err) | ||
| } else { | ||
| for _, c := range ghComments { | ||
| body := strings.TrimSpace(c.GetBody()) | ||
| if body == "" { | ||
| continue | ||
| } | ||
| author := "deleted-user" | ||
| if c.User != nil { | ||
| author = c.User.GetLogin() | ||
| } | ||
| textComments = append(textComments, text.Comment{Author: author, Body: body}) | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Missing pagination when fetching comments.
Unlike cmd/simili/commands/index.go (lines 199-213) which paginates through all comments, this implementation only fetches the first page by passing nil options. Issues with many comments will have truncated embedding content.
🐛 Proposed fix to add pagination
// Fetch comments for richer embedding content.
var textComments []text.Comment
if s.github != nil {
- ghComments, _, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, nil)
- if err != nil {
- log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err)
- } else {
- for _, c := range ghComments {
- body := strings.TrimSpace(c.GetBody())
- if body == "" {
- continue
+ page := 1
+ for {
+ opts := &github.IssueListCommentsOptions{
+ ListOptions: github.ListOptions{PerPage: 100, Page: page},
+ }
+ ghComments, resp, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, opts)
+ if err != nil {
+ log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err)
+ break
+ }
+ for _, c := range ghComments {
+ body := strings.TrimSpace(c.GetBody())
+ if body == "" {
+ continue
+ }
+ author := "deleted-user"
+ if c.User != nil {
+ author = c.User.GetLogin()
}
- author := "deleted-user"
- if c.User != nil {
- author = c.User.GetLogin()
- }
- textComments = append(textComments, text.Comment{Author: author, Body: body})
+ textComments = append(textComments, text.Comment{Author: author, Body: body})
+ }
+ if resp == nil || resp.NextPage == 0 {
+ break
}
+ page = resp.NextPage
}
}Note: This requires importing "github.com/google/go-github/v60/github" for the options type.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Fetch comments for richer embedding content. | |
| var textComments []text.Comment | |
| if s.github != nil { | |
| ghComments, _, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, nil) | |
| if err != nil { | |
| log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err) | |
| } else { | |
| for _, c := range ghComments { | |
| body := strings.TrimSpace(c.GetBody()) | |
| if body == "" { | |
| continue | |
| } | |
| author := "deleted-user" | |
| if c.User != nil { | |
| author = c.User.GetLogin() | |
| } | |
| textComments = append(textComments, text.Comment{Author: author, Body: body}) | |
| } | |
| } | |
| } | |
| // Fetch comments for richer embedding content. | |
| var textComments []text.Comment | |
| if s.github != nil { | |
| page := 1 | |
| for { | |
| opts := &github.IssueListCommentsOptions{ | |
| ListOptions: github.ListOptions{PerPage: 100, Page: page}, | |
| } | |
| ghComments, resp, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, opts) | |
| if err != nil { | |
| log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err) | |
| break | |
| } | |
| for _, c := range ghComments { | |
| body := strings.TrimSpace(c.GetBody()) | |
| if body == "" { | |
| continue | |
| } | |
| author := "deleted-user" | |
| if c.User != nil { | |
| author = c.User.GetLogin() | |
| } | |
| textComments = append(textComments, text.Comment{Author: author, Body: body}) | |
| } | |
| if resp == nil || resp.NextPage == 0 { | |
| break | |
| } | |
| page = resp.NextPage | |
| } | |
| } |
🤖 Prompt for AI Agents
In `@internal/steps/indexer.go` around lines 65 - 84, The comments fetching in the
indexer (where s.github.ListComments is used) only requests the first page;
change it to iterate pages using github.ListOptions (e.g., loop with opt :=
&github.ListOptions{Page: p, PerPage: 100} and increment p until no results) and
append each page’s comment bodies to textComments, preserving the existing
trimming/author logic; also add the import
"github.com/google/go-github/v60/github" for the options type.
There was a problem hiding this comment.
Pull request overview
This PR extends the CLI index command to index pull requests (optionally disabled via a flag) and standardizes embedding text construction across the CLI backfill path and the live bot indexer by introducing a shared BuildEmbeddingContent utility. It also adds a type field to Qdrant payloads to distinguish issues vs pull requests.
Changes:
- Add
internal/utils/text.BuildEmbeddingContent(and tests) to generate consistent embedding text asTitle + Body + Comments. - Update
simili indexto include PRs by default, add--include-prsflag, and use the shared embedding builder. - Update the live bot
Indexerto fetch comments at index time and store atypefield in Qdrant payloads.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
internal/utils/text/content.go |
New shared embedding-content builder + Comment type. |
internal/utils/text/content_test.go |
New unit tests for embedding-content formatting and skipping rules. |
cmd/simili/commands/index.go |
CLI now optionally includes PRs and uses shared embedding builder; adds payload type. |
internal/steps/indexer.go |
Bot indexer now fetches comments and uses shared builder; adds payload type. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if c.Body == "" { | ||
| continue | ||
| } | ||
| if !hasHeader { | ||
| sb.WriteString("Comments:\n") | ||
| hasHeader = true | ||
| } | ||
| fmt.Fprintf(&sb, "- %s: %s\n", c.Author, c.Body) |
There was a problem hiding this comment.
BuildEmbeddingContent trims the issue/PR body but does not trim comment bodies; as written, a whitespace-only comment body (e.g. " \n") will be included in the output even though it’s effectively empty. Consider applying strings.TrimSpace to c.Body inside the builder and using the trimmed value for both the empty check and the rendered output to make the utility robust regardless of caller behavior.
| if c.Body == "" { | |
| continue | |
| } | |
| if !hasHeader { | |
| sb.WriteString("Comments:\n") | |
| hasHeader = true | |
| } | |
| fmt.Fprintf(&sb, "- %s: %s\n", c.Author, c.Body) | |
| cb := strings.TrimSpace(c.Body) | |
| if cb == "" { | |
| continue | |
| } | |
| if !hasHeader { | |
| sb.WriteString("Comments:\n") | |
| hasHeader = true | |
| } | |
| fmt.Fprintf(&sb, "- %s: %s\n", c.Author, cb) |
| // Fetch comments for richer embedding content. | ||
| var textComments []text.Comment | ||
| if s.github != nil { | ||
| ghComments, _, err := s.github.ListComments(ctx.Ctx, ctx.Issue.Org, ctx.Issue.Repo, ctx.Issue.Number, nil) | ||
| if err != nil { | ||
| log.Printf("[indexer] WARNING: failed to fetch comments for #%d: %v", ctx.Issue.Number, err) | ||
| } else { | ||
| for _, c := range ghComments { | ||
| body := strings.TrimSpace(c.GetBody()) | ||
| if body == "" { | ||
| continue | ||
| } | ||
| author := "deleted-user" | ||
| if c.User != nil { | ||
| author = c.User.GetLogin() | ||
| } | ||
| textComments = append(textComments, text.Comment{Author: author, Body: body}) | ||
| } | ||
| } |
There was a problem hiding this comment.
ListComments is called with opts=nil, so only the first page of comments will be fetched (GitHub API default page size) and additional pages are silently ignored. This makes the bot’s embeddings incomplete and inconsistent with the CLI backfill (which paginates). Consider paginating through all comment pages (or explicitly setting PerPage and looping on resp.NextPage).
| @@ -83,6 +109,7 @@ func (s *Indexer) Run(ctx *pipeline.Context) error { | |||
| "state": ctx.Issue.State, | |||
| "author": ctx.Issue.Author, | |||
| "labels": ctx.Issue.Labels, | |||
| "type": ctx.Issue.EventType, | |||
| }, | |||
There was a problem hiding this comment.
The PR description says the Qdrant payload type should be normalized to "issue" / "pull_request", but here it’s set to ctx.Issue.EventType (values like "issues", "issue_comment", "pr_comment", etc.). Consider mapping event types to the canonical stored types (e.g., PR + PR comments => pull_request, everything else => issue) so downstream consumers can reliably filter on type.
Apply strings.TrimSpace inside the builder itself so callers don't need to pre-trim, and whitespace-only bodies are always skipped consistently.
- Paginate ListComments (100/page) to match CLI behaviour; no more
silent truncation on issues with >30 comments
- Map EventType to canonical "issue" / "pull_request" so downstream
consumers can reliably filter; previous use of EventType raw values
("issues", "issue_comment", "pr_comment") was inconsistent with CLI
Replace uuid.New() (random) with uuid.NewMD5 keyed on "org/repo#number-chunk-N" so re-running the index command upserts existing points rather than creating duplicates.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@cmd/simili/commands/index.go`:
- Around line 28-33: The comment for indexSince promises issue-number support
that runIndex doesn't implement; update the indexSince declaration comment and
the CLI flag help text (where indexSince is registered) to state it only accepts
RFC3339 timestamps, and remove any mention of numeric/issue-number semantics;
alternatively, if you prefer to implement the feature, modify runIndex to detect
numeric indexSince (using strconv.Atoi/ParseInt) and treat it as an issue number
filter (compare against issue.Number when iterating), but do not claim both
behaviors—pick one and make indexSince's comment/help and runIndex behavior
consistent (refer to the indexSince variable and the runIndex function).
| indexRepo string | ||
| indexSince string // Can be a timestamp (ISO8601) or issue number (int) | ||
| indexWorkers int | ||
| indexToken string | ||
| indexDryRun bool | ||
| indexIncludePRs bool |
There was a problem hiding this comment.
indexSince doc mentions issue-number support, but it isn’t implemented.
runIndex only parses RFC3339 timestamps; numeric values are ignored. Either implement numeric filtering or adjust the comment/flag help to avoid promising behavior that doesn’t exist.
🤖 Prompt for AI Agents
In `@cmd/simili/commands/index.go` around lines 28 - 33, The comment for
indexSince promises issue-number support that runIndex doesn't implement; update
the indexSince declaration comment and the CLI flag help text (where indexSince
is registered) to state it only accepts RFC3339 timestamps, and remove any
mention of numeric/issue-number semantics; alternatively, if you prefer to
implement the feature, modify runIndex to detect numeric indexSince (using
strconv.Atoi/ParseInt) and treat it as an issue number filter (compare against
issue.Number when iterating), but do not claim both behaviors—pick one and make
indexSince's comment/help and runIndex behavior consistent (refer to the
indexSince variable and the runIndex function).
Summary
indexcommand now indexes pull requests by default. A new--include-prsflag (defaulttrue) lets you opt out of PR indexing to restore the previous behaviour.Title + Body + Commentsusing a sharedBuildEmbeddingContentutility. The bot fetches comments via the GitHub client at index time."type"field ("issue"/"pull_request") is now stored in every Qdrant payload point.Changes
internal/utils/text/content.goBuildEmbeddingContent+Commenttypeinternal/utils/text/content_test.gocmd/simili/commands/index.go--include-prsflag, shared builder, dynamictypeinternal/steps/indexer.gotypepayloadTest plan
go test ./internal/utils/text/... -run TestBuildEmbeddingContent -v— all 6 cases passgo test ./...— full suite green, zero regressionsgo build ./...— clean buildNotes
Closes #46
Closes #47
Summary by CodeRabbit