
feat(index): PR indexing with dedicated collection + simili pr-duplicate CLI #97

Merged
Kavirubc merged 7 commits into main from feature/pr-indexing-and-duplicate-cli on Mar 5, 2026


@Kavirubc Kavirubc commented Mar 5, 2026

Closes #43

Summary

  • Dedicated PR collection: qdrant.pr_collection (optional) in config routes PRs into their own Qdrant collection instead of mixing them with issues
  • processPullRequest worker: fetches full PR metadata + changed files via the GitHub API, builds a structured embedding (Title / Body / Changed Files), and upserts to pr_collection
  • Backward compatibility: when pr_collection is not set, PRs continue to be indexed in the main issues collection unchanged
  • simili pr-duplicate CLI: searches both collections, deduplicates results by (type, number), optionally runs an LLM duplicate verdict, and outputs JSON to stdout
  • Config: QdrantConfig.PRCollection field added; excluded from Validate() since it is optional; env-var expanded automatically via ${QDRANT_PR_COLLECTION}
  • .github/simili.yaml updated with the new field
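
The routing rule in the summary above (dedicated collection when configured, main collection otherwise) can be sketched as follows. This is a minimal illustration, not the code from this PR: the trimmed-down QdrantConfig struct and the targetCollection helper are assumptions made for the example; only the PRCollection field and its pr_collection YAML tag come from the description.

```go
package main

import "fmt"

// QdrantConfig mirrors only the two fields discussed in this PR.
// The real struct in internal/core/config has more fields.
type QdrantConfig struct {
	Collection   string
	PRCollection string `yaml:"pr_collection"` // optional; empty means "not configured"
}

// targetCollection is a hypothetical helper that picks the collection
// an item should be upserted into.
func targetCollection(cfg QdrantConfig, isPullRequest bool) string {
	if isPullRequest && cfg.PRCollection != "" {
		return cfg.PRCollection
	}
	// Issues, and PRs when pr_collection is unset, share the main collection.
	return cfg.Collection
}

func main() {
	dedicated := QdrantConfig{Collection: "simili_bot_v1", PRCollection: "simili_prs_v1"}
	legacy := QdrantConfig{Collection: "simili_bot_v1"}

	fmt.Println("PR, dedicated config:   ", targetCollection(dedicated, true))
	fmt.Println("issue, dedicated config:", targetCollection(dedicated, false))
	fmt.Println("PR, legacy config:      ", targetCollection(legacy, true))
}
```

Keeping the decision in one small function makes the backward-compatible path (no pr_collection set) easy to unit test in isolation.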

Commits

| # | Commit | Scope |
|---|--------|-------|
| 1 | feat(config) | Add pr_collection to QdrantConfig + merge + tests |
| 2 | feat(github) | GetPullRequest + ListPullRequestFiles (paginated) |
| 3 | feat(index) | processPullRequest, buildPREmbeddingContent, PR routing |
| 4 | feat(cmd) | simili pr-duplicate command + unit tests |
| 5 | chore(config,test) | YAML update + E2E test |

Test plan

  • go test ./internal/core/config/... covers TestPRCollectionIsOptional, TestPRCollectionEnvExpansion
  • go test ./cmd/simili/commands/... covers TestBuildPREmbeddingContent, TestParsePRDuplicateOutput, TestMergeSearchResults*
  • go test ./tests/integration/... covers TestPRDuplicateDetection
  • Smoke: simili index --repo owner/repo --include-prs --dry-run (requires QDRANT_PR_COLLECTION set)
  • Smoke: simili pr-duplicate --repo owner/repo --number 1 --dry-run


coderabbitai Bot commented Mar 5, 2026

Warning

Rate limit exceeded

@Kavirubc has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 14 seconds before requesting another review.


⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 86a38c8b-c147-44d7-b345-283772b33f07

📥 Commits

Reviewing files that changed from the base of the PR and between f63d0ef and c72fe8f.

📒 Files selected for processing (9)
  • .github/simili.yaml
  • cmd/simili/commands/index.go
  • cmd/simili/commands/index_test.go
  • cmd/simili/commands/pr_duplicate.go
  • cmd/simili/commands/pr_duplicate_test.go
  • internal/core/config/config.go
  • internal/core/config/config_test.go
  • internal/integrations/github/client.go
  • tests/integration/e2e_test.go

Kavirubc added 5 commits March 5, 2026 10:07
Add PRCollection field to QdrantConfig with yaml tag pr_collection.
Field is intentionally excluded from Validate() — empty value means
PRs are indexed in the main collection (backward compatibility).

mergeConfigs updated to propagate the field from child → parent.
New tests cover optional validation and env-var expansion via
${QDRANT_PR_COLLECTION}.

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
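
The optional field and its ${QDRANT_PR_COLLECTION} expansion described in the commit message above can be illustrated with a small sketch. How simili actually expands environment variables is not shown on this page; the example assumes something equivalent to the standard library's os.Expand, and resolvePRCollection is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolvePRCollection is a hypothetical helper. It expands ${VAR}
// references in the raw config value using lookup and reports whether
// a dedicated PR collection ends up configured. An unset or empty
// variable yields "", which means PRs stay in the main collection.
func resolvePRCollection(raw string, lookup func(string) string) (name string, dedicated bool) {
	name = strings.TrimSpace(os.Expand(raw, lookup))
	return name, name != ""
}

func main() {
	// A map stands in for the process environment so the example is
	// deterministic; real code could pass os.Getenv instead.
	env := map[string]string{"QDRANT_PR_COLLECTION": "simili_prs_v1"}
	fromMap := func(key string) string { return env[key] }
	unset := func(string) string { return "" }

	name, ok := resolvePRCollection("${QDRANT_PR_COLLECTION}", fromMap)
	fmt.Printf("set:   %q dedicated=%v\n", name, ok)

	name, ok = resolvePRCollection("${QDRANT_PR_COLLECTION}", unset)
	fmt.Printf("unset: %q dedicated=%v\n", name, ok)
}
```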
GetPullRequest wraps PullRequests.Get for full PR metadata.
ListPullRequestFiles paginates PullRequests.ListFiles to collect
all changed file paths regardless of PR size.

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
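
The pagination loop described for ListPullRequestFiles can be sketched without a network dependency. The pageFetcher type below is a hypothetical stand-in for one paginated API call (it is not the go-github signature), and it assumes the common convention that a next-page value of 0 means there are no further pages.

```go
package main

import (
	"errors"
	"fmt"
)

// pageFetcher is a hypothetical stand-in for one paginated API call.
// It returns the file paths on the requested page and the next page
// number, or 0 when there are no more pages.
type pageFetcher func(page int) (paths []string, nextPage int, err error)

// listAllFiles keeps requesting pages until the fetcher reports that
// there is no next page, accumulating every path along the way.
func listAllFiles(fetch pageFetcher) ([]string, error) {
	var all []string
	page := 1
	for page != 0 {
		paths, next, err := fetch(page)
		if err != nil {
			return nil, fmt.Errorf("listing files (page %d): %w", page, err)
		}
		all = append(all, paths...)
		page = next
	}
	return all, nil
}

// fakeFetcher serves fixed pages so the loop can run offline.
func fakeFetcher(pages [][]string) pageFetcher {
	return func(page int) ([]string, int, error) {
		if page < 1 || page > len(pages) {
			return nil, 0, errors.New("page out of range")
		}
		next := page + 1
		if next > len(pages) {
			next = 0 // last page reached
		}
		return pages[page-1], next, nil
	}
}

func main() {
	files, err := listAllFiles(fakeFetcher([][]string{
		{"cmd/simili/commands/index.go", "cmd/simili/commands/pr_duplicate.go"},
		{"internal/core/config/config.go"},
	}))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(files), "files:", files)
}
```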
…#43)

- processPullRequest() fetches full PR details + changed files,
  builds PR-specific embedding content, and upserts to pr_collection
- buildPREmbeddingContent() is a testable helper that produces
  "Title / Body / Changed Files" formatted text
- Producer routes IsPullRequest() items to a separate prJobs channel
  (with its own worker pool) when PRCollection is configured; falls
  through to the main issues channel otherwise for backward compat
- PR collection is auto-created alongside the issues collection on startup

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
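
A rough sketch of the "Title / Body / Changed Files" text described above follows. The exact layout, labels, and signature of buildPREmbeddingContent are not shown on this page, so sketchPREmbeddingContent is a hypothetical analogue with an assumed format.

```go
package main

import (
	"fmt"
	"strings"
)

// sketchPREmbeddingContent is a hypothetical analogue of the helper
// named in the commit message. It joins the PR title, body, and
// changed file paths into one text block for embedding. The labels
// and separators are assumptions made for this example.
func sketchPREmbeddingContent(title, body string, files []string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Title: %s\n\nBody: %s", title, body)
	if len(files) > 0 {
		b.WriteString("\n\nChanged Files:\n")
		for _, f := range files {
			fmt.Fprintf(&b, "- %s\n", f)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(sketchPREmbeddingContent(
		"feat(index): PR indexing",
		"Closes #43",
		[]string{"cmd/simili/commands/index.go", "internal/core/config/config.go"},
	))
}
```

Omitting the file section when the list is empty keeps the embedding input free of a dangling header, which matches the "with/without file lists" cases the unit tests are said to cover.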
New command fetches a PR, embeds its content, and searches both the
issues collection and the optional pr_collection for duplicates.

- Flags: --repo, --number (required), --token, --dry-run, --top-k, --threshold
- mergeSearchResults() deduplicates hits from both collections by
  (type, number) key, excludes the queried PR from results, and sorts
  by score descending
- Optional LLM verdict runs DetectDuplicate() on the top-3 candidates
- Outputs structured JSON to stdout
- Unit tests cover JSON round-trip, merge dedup/sort, and self-exclusion
- index_test.go covers buildPREmbeddingContent with/without file lists

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
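
The merge step described above (deduplicate by (type, number), drop the queried PR, sort by score descending) can be sketched as below. The hit struct and mergeHits function are hypothetical simplifications of the PR's types, and keeping the higher score when the same item appears in both collections is an assumption, since the page does not say which copy wins.

```go
package main

import (
	"fmt"
	"sort"
)

// hit is a simplified, hypothetical search result.
type hit struct {
	Type   string // "issue" or "pull_request"
	Number int
	Score  float64
}

// dedupKey identifies an item across collections.
type dedupKey struct {
	typ    string
	number int
}

// mergeHits combines result sets, keeps one entry per (type, number),
// excludes the PR being checked, and sorts by score descending.
func mergeHits(currentPR int, groups ...[]hit) []hit {
	best := make(map[dedupKey]hit)
	for _, group := range groups {
		for _, h := range group {
			if h.Type == "pull_request" && h.Number == currentPR {
				continue // a PR is never a duplicate of itself
			}
			k := dedupKey{h.Type, h.Number}
			if prev, seen := best[k]; !seen || h.Score > prev.Score {
				best[k] = h // assumption: keep the higher-scoring copy
			}
		}
	}
	merged := make([]hit, 0, len(best))
	for _, h := range best {
		merged = append(merged, h)
	}
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].Score != merged[j].Score {
			return merged[i].Score > merged[j].Score
		}
		// Tie-breakers keep the output deterministic.
		if merged[i].Type != merged[j].Type {
			return merged[i].Type < merged[j].Type
		}
		return merged[i].Number < merged[j].Number
	})
	return merged
}

func main() {
	issues := []hit{{"issue", 43, 0.92}, {"pull_request", 97, 0.99}}
	prs := []hit{{"pull_request", 97, 1.0}, {"pull_request", 40, 0.89}, {"issue", 43, 0.90}}
	for _, h := range mergeHits(97, issues, prs) {
		fmt.Printf("%s #%d score=%.2f\n", h.Type, h.Number, h.Score)
	}
}
```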
- .github/simili.yaml: add pr_collection field pointing to
  ${QDRANT_PR_COLLECTION}; empty env var disables dedicated PR indexing
- e2e_test.go: TestPRDuplicateDetection verifies the config layer
  accepts pr_collection without requiring a live Qdrant instance

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
@Kavirubc Kavirubc force-pushed the feature/pr-indexing-and-duplicate-cli branch from 2096442 to e3fdec8 Compare March 5, 2026 04:37
@gh-simili-bot

Simili Triage Report

Note

Quality Score: 9.5/10 (Excellent)
The issue could be improved. See suggestions below.

Classification

Category Value
Labels
Quality Improvements
  • Commit message in the 'Commits' section is truncated
  • Expand the truncated commit message for full clarity, though context makes it understandable.
Similar Threads
Similarity Type Thread Status
92% 📝 #43 [0.2.0v][Feature] PR indexing with dedicated co...
90% 🔀 #40 Similar Issue
89% 🔀 #40 Index and query prs

Warning

Possible Duplicate (Confidence: 99%)
This pull request might be a duplicate of #43.
Reason: The current issue is an implementation of the feature requested in Issue #43, explicitly stating 'Closes #43'. Both issues describe the exact same feature: indexing pull requests into a dedicated Qdrant collection and providing a simili pr-duplicate CLI. Issue #40 (as described in Similar Issue 3) also describes this exact feature request.

This pull request will be automatically closed in 72 hours if no objections are raised. If you believe this is not a duplicate, please leave a comment explaining why.


Generated by Simili Bot

@gh-simili-bot gh-simili-bot added the enhancement New feature or request label Mar 5, 2026

Copilot AI left a comment


Pull request overview

Adds optional dedicated Qdrant collection support for indexing GitHub pull requests, plus a simili pr-duplicate CLI to search for semantically similar issues/PRs across collections.

Changes:

  • Introduces qdrant.pr_collection config field (optional) and merges it through inheritance/env expansion.
  • Extends GitHub integration + simili index to fetch PR metadata + changed files and upsert PR embeddings into a dedicated collection when configured.
  • Adds simili pr-duplicate command (with tests) to embed a PR and query/deduplicate candidates across issues + PR collections.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| internal/core/config/config.go | Adds PRCollection to QdrantConfig and merges it during inheritance. |
| internal/core/config/config_test.go | Tests that PRCollection is optional and supports env expansion. |
| internal/integrations/github/client.go | Adds GitHub API helpers for fetching PR details and listing PR files (paginated). |
| cmd/simili/commands/index.go | Creates PR collection when configured; adds PR worker path (processPullRequest) and PR embedding content builder. |
| cmd/simili/commands/index_test.go | Unit tests for buildPREmbeddingContent. |
| cmd/simili/commands/pr_duplicate.go | New pr-duplicate command: fetch PR, embed, search both collections, merge/dedup results, optional LLM verdict, output JSON. |
| cmd/simili/commands/pr_duplicate_test.go | Unit tests for output JSON marshalling and search result merging logic. |
| tests/integration/e2e_test.go | Adds an integration test that validates config wiring for pr_collection. |
| .github/simili.yaml | Adds optional pr_collection config field placeholder. |


Comment on lines +233 to +246
addHit := func(hit *qdrant.SearchResult) {
	itemType, _ := hit.Payload["type"].(string)
	if itemType == "" {
		itemType = "issue"
	}

	number := payloadInt(hit.Payload, "issue_number", "pr_number")
	title, _ := hit.Payload["title"].(string) //nolint:misspell
	url, _ := hit.Payload["url"].(string)

	// Exclude the PR being checked from its own results.
	if itemType == "pull_request" && number == currentPRNumber {
		return
	}

Copilot AI Mar 5, 2026


mergeSearchResults only extracts the candidate number from payload keys issue_number / pr_number. However, other index paths in this repo store the identifier under number (e.g. internal/steps/indexer.go), which would make payloadInt return 0 and produce bogus candidates like issue #0 and incorrect deduping/exclusion. Consider also skipping hits where no valid number could be extracted (similar to internal/steps/similarity.go).

Comment thread tests/integration/e2e_test.go Outdated
Comment on lines +40 to +61
// TestPRDuplicateDetection verifies that the pr_collection field is wired correctly
// through the config layer and passes validation without a real Qdrant instance.
func TestPRDuplicateDetection(t *testing.T) {
	cfg := &config.Config{
		Qdrant: config.QdrantConfig{
			URL:          "https://example.qdrant.io:6334",
			APIKey:       "qdrant-key",
			Collection:   "simili_bot_v1",
			PRCollection: "simili_prs_v1",
		},
		Embedding: config.EmbeddingConfig{
			APIKey: "embedding-key",
		},
	}

	if err := cfg.Validate(); err != nil {
		t.Fatalf("Config with pr_collection should be valid: %v", err)
	}
	if cfg.Qdrant.PRCollection != "simili_prs_v1" {
		t.Errorf("Expected PRCollection 'simili_prs_v1', got %q", cfg.Qdrant.PRCollection)
	}
}

Copilot AI Mar 5, 2026


TestPRDuplicateDetection doesn’t exercise PR duplicate detection logic (it only checks config validation / field assignment) and largely duplicates the new unit tests in internal/core/config/config_test.go. This makes the integration suite misleading; consider renaming it to reflect what it asserts and/or moving/removing it in favor of the existing config-layer tests.

Comment thread cmd/simili/commands/pr_duplicate.go Outdated
Comment on lines +195 to +219
top := out.Candidates
if len(top) > 3 {
	top = top[:3]
}
similar := make([]ai.SimilarIssueInput, len(top))
for i, c := range top {
	similar[i] = ai.SimilarIssueInput{
		Number:     c.Number,
		Title:      c.Title,
		URL:        c.URL,
		Similarity: c.Score,
	}
}
dupResult, dupErr := llmClient.DetectDuplicate(ctx, &ai.DuplicateCheckInput{
	CurrentIssue: &ai.IssueInput{
		Title: pr.GetTitle(),
		Body:  pr.GetBody(),
	},
	SimilarIssues: similar,
})
if dupErr == nil {
	out.DuplicateDetected = dupResult.IsDuplicate
	out.DuplicateOf = dupResult.DuplicateOf
	out.Confidence = dupResult.Confidence
	out.Reasoning = dupResult.Reasoning

Copilot AI Mar 5, 2026


The optional LLM verdict reuses ai.DetectDuplicate, whose prompt/contract is explicitly issue-focused (e.g. “duplicate detection system for GitHub issues” and DuplicateOf is documented as an issue number). Since pr-duplicate merges both PR and issue candidates, the LLM result can be ambiguous/misleading (e.g. duplicate_of: 5 without indicating whether that’s an issue or PR). Consider either restricting the LLM verdict to issue candidates only, or extending the duplicate-check input/output to include the candidate type and updating the prompt accordingly.

Suggested change
// Build up to 3 similar *issues* only, since the LLM duplicate
// detector is issue-focused and its contract assumes issues.
similar := make([]ai.SimilarIssueInput, 0, 3)
for _, c := range out.Candidates {
	if c.Type != "issue" {
		continue
	}
	similar = append(similar, ai.SimilarIssueInput{
		Number:     c.Number,
		Title:      c.Title,
		URL:        c.URL,
		Similarity: c.Score,
	})
	if len(similar) == 3 {
		break
	}
}
if len(similar) > 0 {
	dupResult, dupErr := llmClient.DetectDuplicate(ctx, &ai.DuplicateCheckInput{
		CurrentIssue: &ai.IssueInput{
			Title: pr.GetTitle(),
			Body:  pr.GetBody(),
		},
		SimilarIssues: similar,
	})
	if dupErr == nil {
		out.DuplicateDetected = dupResult.IsDuplicate
		out.DuplicateOf = dupResult.DuplicateOf
		out.Confidence = dupResult.Confidence
		out.Reasoning = dupResult.Reasoning
	}

…ltsNilInputs

SA9003 (staticcheck): the previous if-block had no body, causing the
linter to flag it as an empty branch. Replace with a concrete assertion
on len(candidates) instead.

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
@gh-simili-bot

🧪 E2E Test

Bot responded: yes

| Auto-closer (dry-run) | processed: 0 closed: 0 grace: 0 human: 0 |

Test repo → gh-simili-bot/simili-e2e-22702559434
Run → logs

Auto-generated by E2E pipeline

1. payloadInt: add "number" as primary key lookup (used by
   internal/steps/indexer.go) alongside issue_number / pr_number,
   matching the pattern in internal/steps/similarity.go. Hits with
   no extractable number (number==0) are now silently skipped to
   prevent bogus issue #0 candidates and incorrect deduplication.

2. LLM verdict: restrict DetectDuplicate input to issue-type candidates
   only. The LLM prompt and DuplicateOf field are issue-focused; mixing
   in PR candidates would produce an ambiguous/misleading verdict.

3. e2e test: rename TestPRDuplicateDetection → TestPRCollectionConfigWiring
   to accurately reflect that the test validates config-layer wiring,
   not end-to-end duplicate detection logic.

New test TestMergeSearchResultsNumberKeyVariants covers the "number"
key variant and the zero-number skip behaviour.

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
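
The first fix above (look up "number" before issue_number / pr_number, and skip hits with no usable number) can be sketched as follows. payloadIntSketch is a hypothetical analogue of payloadInt; the payload is assumed to be a map with dynamically typed values, and the set of numeric types handled here is a guess about what a vector store client may return.

```go
package main

import "fmt"

// payloadIntSketch returns the first non-zero integer found under any
// of the given keys, tried in order. It returns 0 when no key holds a
// usable number, which callers treat as "skip this hit".
func payloadIntSketch(payload map[string]any, keys ...string) int {
	for _, k := range keys {
		switch v := payload[k].(type) {
		case int:
			if v != 0 {
				return v
			}
		case int64:
			if v != 0 {
				return int(v)
			}
		case float64: // JSON-decoded numbers commonly arrive as float64
			if v != 0 {
				return int(v)
			}
		}
	}
	return 0
}

func main() {
	hits := []map[string]any{
		{"type": "issue", "number": int64(43)},
		{"type": "pull_request", "pr_number": 40.0},
		{"type": "issue", "title": "no number stored"},
	}
	for _, payload := range hits {
		n := payloadIntSketch(payload, "number", "issue_number", "pr_number")
		if n == 0 {
			continue // avoids a bogus "#0" candidate
		}
		fmt.Printf("%v #%d\n", payload["type"], n)
	}
}
```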
@gh-simili-bot gh-simili-bot removed the e2e label Mar 5, 2026
@gh-simili-bot

🧪 E2E Test

Bot responded: yes

| Auto-closer (dry-run) | processed: 0 closed: 0 grace: 0 human: 0 |

Test repo → gh-simili-bot/simili-e2e-22702874005
Run → logs

Auto-generated by E2E pipeline

@Kavirubc Kavirubc merged commit 7118d9e into main Mar 5, 2026
6 checks passed
@Kavirubc Kavirubc deleted the feature/pr-indexing-and-duplicate-cli branch March 5, 2026 05:07

Labels

e2e enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

[0.2.0v][Feature] PR indexing with dedicated collection and pr-duplicate CLI

3 participants