feat(index): PR indexing with dedicated collection + simili pr-duplicate CLI by Kavirubc · Pull Request #97 · similigh/simili-bot

Kavirubc · 2026-03-05T04:36:45Z

Closes #43

Summary

Dedicated PR collection — qdrant.pr_collection (optional) in config routes PRs into their own Qdrant collection instead of mixing them with issues
processPullRequest worker — fetches full PR metadata + changed files via the GitHub API, builds a structured embedding (Title / Body / Changed Files), and upserts to pr_collection
Backward compatibility — when pr_collection is not set, PRs continue to be indexed in the main issues collection unchanged
simili pr-duplicate CLI — searches both collections, deduplicates results by (type, number), optionally runs an LLM duplicate verdict, and outputs JSON to stdout
Config — QdrantConfig.PRCollection field added; excluded from Validate() since it is optional; env-var expanded automatically via ${QDRANT_PR_COLLECTION}
.github/simili.yaml updated with the new field
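
The backward-compatibility rule in the summary can be sketched as a small routing decision. This is a minimal illustration, not the code from the PR: `collectionFor` is a hypothetical helper name, and the `collection` YAML tag is an assumption (only `pr_collection` is named above).

```go
package main

import "fmt"

// QdrantConfig mirrors the two fields discussed in the summary. The real
// struct in the repository likely has more fields; the "collection" tag is
// an assumption, "pr_collection" is taken from the PR description.
type QdrantConfig struct {
	Collection   string `yaml:"collection"`
	PRCollection string `yaml:"pr_collection"` // optional; empty means "not set"
}

// collectionFor is a hypothetical helper. It returns the collection an item
// should be indexed into: PRs go to PRCollection only when it is configured,
// otherwise everything falls back to the main collection.
func collectionFor(cfg QdrantConfig, isPullRequest bool) string {
	if isPullRequest && cfg.PRCollection != "" {
		return cfg.PRCollection
	}
	return cfg.Collection
}

func main() {
	legacy := QdrantConfig{Collection: "simili_bot_v1"}
	split := QdrantConfig{Collection: "simili_bot_v1", PRCollection: "simili_prs_v1"}

	fmt.Println(collectionFor(legacy, true)) // simili_bot_v1
	fmt.Println(collectionFor(split, true))  // simili_prs_v1
	fmt.Println(collectionFor(split, false)) // simili_bot_v1
}
```

Treating the empty string as "not configured" is what lets the field stay out of Validate() without changing behaviour for existing configs.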

Commits

#	Commit	Scope
1	`feat(config)`	Add `pr_collection` to `QdrantConfig` + merge + tests
2	`feat(github)`	`GetPullRequest` + `ListPullRequestFiles` (paginated)
3	`feat(index)`	`processPullRequest`, `buildPREmbeddingContent`, PR routing
4	`feat(cmd)`	`simili pr-duplicate` command + unit tests
5	`chore(config,test)`	YAML update + E2E test

Test plan

go test ./internal/core/config/... — TestPRCollectionIsOptional, TestPRCollectionEnvExpansion
go test ./cmd/simili/commands/... — TestBuildPREmbeddingContent, TestParsePRDuplicateOutput, TestMergeSearchResults*
go test ./tests/integration/... — TestPRDuplicateDetection
Smoke: simili index --repo owner/repo --include-prs --dry-run (requires QDRANT_PR_COLLECTION set)
Smoke: simili pr-duplicate --repo owner/repo --number 1 --dry-run

coderabbitai · 2026-03-05T04:36:58Z

Warning

Rate limit exceeded

@Kavirubc has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 14 seconds before requesting another review.

Add PRCollection field to QdrantConfig with yaml tag pr_collection. Field is intentionally excluded from Validate(): an empty value means PRs are indexed in the main collection (backward compatibility). mergeConfigs updated to propagate the field from child → parent. New tests cover optional validation and env-var expansion via ${QDRANT_PR_COLLECTION}.

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
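
As a rough illustration of the env-var expansion this commit message mentions, the standard library's `os.Expand` resolves `${NAME}` placeholders through a lookup function. The helper name `expandWith` is hypothetical, and the repository may expand config values differently.

```go
package main

import (
	"fmt"
	"os"
)

// expandWith is a hypothetical helper: it resolves ${VAR} placeholders in a
// raw config value using the supplied lookup function. Unknown variables
// expand to whatever the lookup returns (an empty string for os.Getenv).
func expandWith(raw string, lookup func(string) string) string {
	return os.Expand(raw, lookup)
}

func main() {
	raw := "${QDRANT_PR_COLLECTION}"

	// With the variable unset or empty, the expanded value is empty, which
	// the config layer would treat as "no dedicated PR collection".
	fmt.Printf("%q\n", expandWith(raw, func(string) string { return "" }))

	// In a real process the lookup would typically be os.Getenv.
	fmt.Printf("%q\n", expandWith(raw, os.Getenv))
}
```

Passing the lookup in as a parameter keeps the helper testable without mutating the process environment.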

GetPullRequest wraps PullRequests.Get for full PR metadata. ListPullRequestFiles paginates PullRequests.ListFiles to collect all changed file paths regardless of PR size.

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
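
The pagination described in this commit can be sketched without the GitHub client. `pageFetcher` below is a hypothetical stand-in for one paginated API call; it follows the common convention (also used by go-github's response type) that a next-page value of 0 means there are no more pages. It is not the signature used in the PR.

```go
package main

import "fmt"

// pageFetcher is a hypothetical stand-in for a single paginated API call.
// It returns the file paths on the requested page and the next page number,
// or 0 when there are no further pages.
type pageFetcher func(page int) (paths []string, nextPage int, err error)

// listAllFiles requests pages until the fetcher reports no next page and
// returns every path collected along the way.
func listAllFiles(fetch pageFetcher) ([]string, error) {
	var all []string
	page := 1
	for {
		paths, next, err := fetch(page)
		if err != nil {
			return nil, fmt.Errorf("list files page %d: %w", page, err)
		}
		all = append(all, paths...)
		if next == 0 {
			return all, nil
		}
		page = next
	}
}

func main() {
	// Fake two-page response used only for this demonstration.
	pages := map[int][]string{1: {"a.go", "b.go"}, 2: {"c.go"}}
	fake := func(page int) ([]string, int, error) {
		next := 0
		if page < len(pages) {
			next = page + 1
		}
		return pages[page], next, nil
	}

	files, err := listAllFiles(fake)
	fmt.Println(files, err) // [a.go b.go c.go] <nil>
}
```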

…#43)

- processPullRequest() fetches full PR details + changed files, builds PR-specific embedding content, and upserts to pr_collection
- buildPREmbeddingContent() is a testable helper that produces "Title / Body / Changed Files" formatted text
- Producer routes IsPullRequest() items to a separate prJobs channel (with its own worker pool) when PRCollection is configured; falls through to the main issues channel otherwise for backward compat
- PR collection is auto-created alongside the issues collection on startup

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
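
For illustration, a helper that produces "Title / Body / Changed Files" text might look like the sketch below. The function name `buildPREmbeddingText`, the labels, and the separators are assumptions; the exact layout used by buildPREmbeddingContent() is not shown in this PR page.

```go
package main

import (
	"fmt"
	"strings"
)

// buildPREmbeddingText is an illustrative helper, not the repository's
// buildPREmbeddingContent. It joins title, body, and the changed file paths
// into one block of text suitable for embedding. The file section is
// omitted entirely when no files are supplied.
func buildPREmbeddingText(title, body string, files []string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Title: %s\n\nBody: %s", title, body)
	if len(files) > 0 {
		b.WriteString("\n\nChanged Files:\n")
		for _, f := range files {
			b.WriteString("- " + f + "\n")
		}
	}
	return b.String()
}

func main() {
	fmt.Println(buildPREmbeddingText(
		"feat(index): PR indexing",
		"Closes #43",
		[]string{"cmd/simili/commands/index.go", "internal/core/config/config.go"},
	))
}
```

Keeping the formatting in a pure function is what makes the with/without file list cases easy to unit test.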

New command fetches a PR, embeds its content, and searches both the issues collection and the optional pr_collection for duplicates.

- Flags: --repo, --number (required), --token, --dry-run, --top-k, --threshold
- mergeSearchResults() deduplicates hits from both collections by (type, number) key, excludes the queried PR from results, and sorts by score descending
- Optional LLM verdict runs DetectDuplicate() on the top-3 candidates
- Outputs structured JSON to stdout
- Unit tests cover JSON round-trip, merge dedup/sort, and self-exclusion
- index_test.go covers buildPREmbeddingContent with/without file lists

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
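
The merge step described in this commit (deduplicate by (type, number), drop the queried PR, sort by score descending) can be sketched as follows. The `candidate` type and `mergeCandidates` name are hypothetical, and keeping the higher score when the same item appears in both collections is an assumption about how ties between collections are resolved.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a simplified, hypothetical view of one search hit.
type candidate struct {
	Type   string // "issue" or "pull_request"
	Number int
	Score  float64
}

// dedupKey identifies a hit across collections.
type dedupKey struct {
	typ    string
	number int
}

// mergeCandidates combines hits from several collections. It skips the PR
// being checked, keeps the best-scoring hit per (type, number), and returns
// the result sorted by score descending.
func mergeCandidates(currentPR int, groups ...[]candidate) []candidate {
	best := make(map[dedupKey]candidate)
	for _, group := range groups {
		for _, c := range group {
			if c.Type == "pull_request" && c.Number == currentPR {
				continue // never report the queried PR as its own duplicate
			}
			k := dedupKey{c.Type, c.Number}
			if prev, seen := best[k]; !seen || c.Score > prev.Score {
				best[k] = c
			}
		}
	}
	out := make([]candidate, 0, len(best))
	for _, c := range best {
		out = append(out, c)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		// Tie-break so the order does not depend on map iteration.
		if out[i].Type != out[j].Type {
			return out[i].Type < out[j].Type
		}
		return out[i].Number < out[j].Number
	})
	return out
}

func main() {
	issues := []candidate{{"issue", 43, 0.92}, {"pull_request", 97, 0.99}}
	prs := []candidate{{"pull_request", 97, 1.0}, {"pull_request", 40, 0.90}, {"issue", 43, 0.88}}

	for _, c := range mergeCandidates(97, issues, prs) {
		fmt.Printf("%s #%d %.2f\n", c.Type, c.Number, c.Score)
	}
	// issue #43 0.92
	// pull_request #40 0.90
}
```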

- .github/simili.yaml: add pr_collection field pointing to ${QDRANT_PR_COLLECTION}; empty env var disables dedicated PR indexing
- e2e_test.go: TestPRDuplicateDetection verifies the config layer accepts pr_collection without requiring a live Qdrant instance

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>

gh-simili-bot · 2026-03-05T04:38:14Z

Simili Triage Report

Note

Quality Score: 9.5/10 (Excellent)
The issue could be improved. See suggestions below.

Classification

Category	Value
Labels

Quality Improvements

Commit message in the 'Commits' section is truncated
Expand the truncated commit message for full clarity, though context makes it understandable.

Similar Threads

Similarity	Type	Thread	Status
92%	📝	#43 [0.2.0v][Feature] PR indexing with dedicated co...	—
90%	🔀	#40 Similar Issue	—
89%	🔀	#40 Index and query prs	—

Warning

Possible Duplicate (Confidence: 99%)
This pull request might be a duplicate of #43.
Reason: The current issue is an implementation of the feature requested in Issue #43, explicitly stating 'Closes #43'. Both issues describe the exact same feature: indexing pull requests into a dedicated Qdrant collection and providing a simili pr-duplicate CLI. Issue #40 (as described in Similar Issue 3) also describes this exact feature request.

⏳ This pull request will be automatically closed in 72 hours if no objections are raised. If you believe this is not a duplicate, please leave a comment explaining why.

_{Generated by Simili Bot}

Copilot

Pull request overview

Adds optional dedicated Qdrant collection support for indexing GitHub pull requests, plus a simili pr-duplicate CLI to search for semantically similar issues/PRs across collections.

Changes:

Introduces qdrant.pr_collection config field (optional) and merges it through inheritance/env expansion.
Extends GitHub integration + simili index to fetch PR metadata + changed files and upsert PR embeddings into a dedicated collection when configured.
Adds simili pr-duplicate command (with tests) to embed a PR and query/deduplicate candidates across issues + PR collections.
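
The routing behind the second change above (PRs go to their own worker pool only when a PR collection is configured) can be sketched with two channels. Everything here is illustrative: `item`, `dispatch`, and the way results are collected are assumptions, and the sketch assumes at least one worker per pool.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// item is a hypothetical, minimal view of an issue or pull request.
type item struct {
	Number int
	IsPR   bool
}

// dispatch routes items to an issue pool or a PR pool and returns the
// numbers each pool handled. PRs use the PR pool only when prCollection is
// non-empty; otherwise they fall through to the issue pool.
// workers must be at least 1, or the unbuffered sends would block forever.
func dispatch(items []item, prCollection string, workers int) (issues, prs []int) {
	issueJobs := make(chan item)
	prJobs := make(chan item)
	var mu sync.Mutex
	var wg sync.WaitGroup

	// startPool launches workers that record every job they receive.
	startPool := func(jobs <-chan item, sink *[]int) {
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for it := range jobs {
					mu.Lock()
					*sink = append(*sink, it.Number)
					mu.Unlock()
				}
			}()
		}
	}
	startPool(issueJobs, &issues)
	startPool(prJobs, &prs)

	// Producer: decide per item which channel it belongs to.
	for _, it := range items {
		if it.IsPR && prCollection != "" {
			prJobs <- it
		} else {
			issueJobs <- it
		}
	}
	close(issueJobs)
	close(prJobs)
	wg.Wait()

	sort.Ints(issues)
	sort.Ints(prs)
	return issues, prs
}

func main() {
	items := []item{{1, false}, {2, true}, {3, true}}
	fmt.Println(dispatch(items, "simili_prs_v1", 2)) // [1] [2 3]
	fmt.Println(dispatch(items, "", 2))              // [1 2 3] []
}
```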

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.


File	Description
`internal/core/config/config.go`	Adds `PRCollection` to `QdrantConfig` and merges it during inheritance.
`internal/core/config/config_test.go`	Tests that `PRCollection` is optional and supports env expansion.
`internal/integrations/github/client.go`	Adds GitHub API helpers for fetching PR details and listing PR files (paginated).
`cmd/simili/commands/index.go`	Creates PR collection when configured; adds PR worker path (`processPullRequest`) and PR embedding content builder.
`cmd/simili/commands/index_test.go`	Unit tests for `buildPREmbeddingContent`.
`cmd/simili/commands/pr_duplicate.go`	New `pr-duplicate` command: fetch PR, embed, search both collections, merge/dedup results, optional LLM verdict, output JSON.
`cmd/simili/commands/pr_duplicate_test.go`	Unit tests for output JSON marshalling and search result merging logic.
`tests/integration/e2e_test.go`	Adds an integration test that validates config wiring for `pr_collection`.
`.github/simili.yaml`	Adds optional `pr_collection` config field placeholder.


Copilot · 2026-03-05T04:42:10Z

+	addHit := func(hit *qdrant.SearchResult) {
+		itemType, _ := hit.Payload["type"].(string)
+		if itemType == "" {
+			itemType = "issue"
+		}
+
+		number := payloadInt(hit.Payload, "issue_number", "pr_number")
+		title, _ := hit.Payload["title"].(string) //nolint:misspell
+		url, _ := hit.Payload["url"].(string)
+
+		// Exclude the PR being checked from its own results.
+		if itemType == "pull_request" && number == currentPRNumber {
+			return
+		}


mergeSearchResults only extracts the candidate number from payload keys issue_number / pr_number. However, other index paths in this repo store the identifier under number (e.g. internal/steps/indexer.go), which would make payloadInt return 0 and produce bogus candidates like issue #0 and incorrect deduping/exclusion. Consider also skipping hits where no valid number could be extracted (similar to internal/steps/similarity.go).
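
One way to address this comment is a lookup that tries several payload keys and reports whether a usable number was found, so callers can skip the hit instead of emitting issue #0. The sketch below is hypothetical (`payloadNumber` is not the repository's payloadInt), and the set of numeric types handled is an assumption about how payload values are decoded.

```go
package main

import "fmt"

// payloadNumber is a hypothetical variant of payloadInt. It tries each key
// in order and returns the first positive number it finds. The boolean is
// false when no key holds a usable value, which lets the caller skip the hit.
func payloadNumber(payload map[string]any, keys ...string) (int, bool) {
	for _, k := range keys {
		switch v := payload[k].(type) {
		case int:
			if v > 0 {
				return v, true
			}
		case int64:
			if v > 0 {
				return int(v), true
			}
		case float64: // JSON-decoded numbers commonly arrive as float64
			if v > 0 {
				return int(v), true
			}
		}
	}
	return 0, false
}

func main() {
	keys := []string{"number", "issue_number", "pr_number"}

	fmt.Println(payloadNumber(map[string]any{"number": float64(43)}, keys...))  // 43 true
	fmt.Println(payloadNumber(map[string]any{"pr_number": int64(97)}, keys...)) // 97 true
	fmt.Println(payloadNumber(map[string]any{"title": "no number"}, keys...))   // 0 false
}
```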

Copilot · 2026-03-05T04:42:10Z

+// TestPRDuplicateDetection verifies that the pr_collection field is wired correctly
+// through the config layer and passes validation without a real Qdrant instance.
+func TestPRDuplicateDetection(t *testing.T) {
+	cfg := &config.Config{
+		Qdrant: config.QdrantConfig{
+			URL:          "https://example.qdrant.io:6334",
+			APIKey:       "qdrant-key",
+			Collection:   "simili_bot_v1",
+			PRCollection: "simili_prs_v1",
+		},
+		Embedding: config.EmbeddingConfig{
+			APIKey: "embedding-key",
+		},
+	}
+
+	if err := cfg.Validate(); err != nil {
+		t.Fatalf("Config with pr_collection should be valid: %v", err)
+	}
+	if cfg.Qdrant.PRCollection != "simili_prs_v1" {
+		t.Errorf("Expected PRCollection 'simili_prs_v1', got %q", cfg.Qdrant.PRCollection)
+	}
+}


TestPRDuplicateDetection doesn’t exercise PR duplicate detection logic (it only checks config validation / field assignment) and largely duplicates the new unit tests in internal/core/config/config_test.go. This makes the integration suite misleading; consider renaming it to reflect what it asserts and/or moving/removing it in favor of the existing config-layer tests.

Copilot · 2026-03-05T04:42:10Z

+			top := out.Candidates
+			if len(top) > 3 {
+				top = top[:3]
+			}
+			similar := make([]ai.SimilarIssueInput, len(top))
+			for i, c := range top {
+				similar[i] = ai.SimilarIssueInput{
+					Number:     c.Number,
+					Title:      c.Title,
+					URL:        c.URL,
+					Similarity: c.Score,
+				}
+			}
+			dupResult, dupErr := llmClient.DetectDuplicate(ctx, &ai.DuplicateCheckInput{
+				CurrentIssue: &ai.IssueInput{
+					Title: pr.GetTitle(),
+					Body:  pr.GetBody(),
+				},
+				SimilarIssues: similar,
+			})
+			if dupErr == nil {
+				out.DuplicateDetected = dupResult.IsDuplicate
+				out.DuplicateOf = dupResult.DuplicateOf
+				out.Confidence = dupResult.Confidence
+				out.Reasoning = dupResult.Reasoning


The optional LLM verdict reuses ai.DetectDuplicate, whose prompt/contract is explicitly issue-focused (e.g. “duplicate detection system for GitHub issues” and DuplicateOf is documented as an issue number). Since pr-duplicate merges both PR and issue candidates, the LLM result can be ambiguous/misleading (e.g. duplicate_of: 5 without indicating whether that’s an issue or PR). Consider either restricting the LLM verdict to issue candidates only, or extending the duplicate-check input/output to include the candidate type and updating the prompt accordingly.

Suggested change

-			top := out.Candidates
-			if len(top) > 3 {
-				top = top[:3]
-			}
-			similar := make([]ai.SimilarIssueInput, len(top))
-			for i, c := range top {
-				similar[i] = ai.SimilarIssueInput{
-					Number:     c.Number,
-					Title:      c.Title,
-					URL:        c.URL,
-					Similarity: c.Score,
-				}
-			}
-			dupResult, dupErr := llmClient.DetectDuplicate(ctx, &ai.DuplicateCheckInput{
-				CurrentIssue: &ai.IssueInput{
-					Title: pr.GetTitle(),
-					Body:  pr.GetBody(),
-				},
-				SimilarIssues: similar,
-			})
-			if dupErr == nil {
-				out.DuplicateDetected = dupResult.IsDuplicate
-				out.DuplicateOf = dupResult.DuplicateOf
-				out.Confidence = dupResult.Confidence
-				out.Reasoning = dupResult.Reasoning
+			// Build up to 3 similar *issues* only, since the LLM duplicate
+			// detector is issue-focused and its contract assumes issues.
+			similar := make([]ai.SimilarIssueInput, 0, 3)
+			for _, c := range out.Candidates {
+				if c.Type != "issue" {
+					continue
+				}
+				similar = append(similar, ai.SimilarIssueInput{
+					Number:     c.Number,
+					Title:      c.Title,
+					URL:        c.URL,
+					Similarity: c.Score,
+				})
+				if len(similar) == 3 {
+					break
+				}
+			}
+			if len(similar) > 0 {
+				dupResult, dupErr := llmClient.DetectDuplicate(ctx, &ai.DuplicateCheckInput{
+					CurrentIssue: &ai.IssueInput{
+						Title: pr.GetTitle(),
+						Body:  pr.GetBody(),
+					},
+					SimilarIssues: similar,
+				})
+				if dupErr == nil {
+					out.DuplicateDetected = dupResult.IsDuplicate
+					out.DuplicateOf = dupResult.DuplicateOf
+					out.Confidence = dupResult.Confidence
+					out.Reasoning = dupResult.Reasoning
+				}

…ltsNilInputs

SA9003 (staticcheck): the previous if-block had no body, causing the linter to flag it as an empty branch. Replace with a concrete assertion on len(candidates) instead.

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>

gh-simili-bot · 2026-03-05T04:44:30Z

🧪 E2E Test

✅ Bot responded: yes

| Auto-closer (dry-run) | processed: 0 closed: 0 grace: 0 human: 0 |

Test repo → gh-simili-bot/simili-e2e-22702559434
Run → logs

_{Auto-generated by E2E pipeline}

1. payloadInt: add "number" as primary key lookup (used by internal/steps/indexer.go) alongside issue_number / pr_number, matching the pattern in internal/steps/similarity.go. Hits with no extractable number (number==0) are now silently skipped to prevent bogus issue #0 candidates and incorrect deduplication.
2. LLM verdict: restrict DetectDuplicate input to issue-type candidates only. The LLM prompt and DuplicateOf field are issue-focused; mixing in PR candidates would produce an ambiguous/misleading verdict.
3. e2e test: rename TestPRDuplicateDetection → TestPRCollectionConfigWiring to accurately reflect that the test validates config-layer wiring, not end-to-end duplicate detection logic.

New test TestMergeSearchResultsNumberKeyVariants covers the "number" key variant and the zero-number skip behaviour.

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>

gh-simili-bot · 2026-03-05T04:55:11Z

🧪 E2E Test

✅ Bot responded: yes

| Auto-closer (dry-run) | processed: 0 closed: 0 grace: 0 human: 0 |

Test repo → gh-simili-bot/simili-e2e-22702874005
Run → logs

_{Auto-generated by E2E pipeline}

Kavirubc requested a review from Copilot March 5, 2026 04:37

Copilot started reviewing on behalf of Kavirubc March 5, 2026 04:37 View session

Kavirubc added 5 commits March 5, 2026 10:07

Kavirubc force-pushed the feature/pr-indexing-and-duplicate-cli branch from 2096442 to e3fdec8 Compare March 5, 2026 04:37

gh-simili-bot added the e2e label Mar 5, 2026

gh-simili-bot added the enhancement New feature or request label Mar 5, 2026

Copilot AI reviewed Mar 5, 2026

gh-simili-bot removed the e2e label Mar 5, 2026

gh-simili-bot approved these changes Mar 5, 2026

gh-simili-bot added the e2e label Mar 5, 2026

Kavirubc merged commit 7118d9e into main Mar 5, 2026
6 checks passed

Kavirubc deleted the feature/pr-indexing-and-duplicate-cli branch March 5, 2026 05:07

Kavirubc mentioned this pull request Mar 5, 2026

chore: modernise E2E test suite and action workflows for 0.2.0 milestone #101

Closed

Kavirubc merged 7 commits into main from feature/pr-indexing-and-duplicate-cli