
feat: implement hybrid routing with repository documentation learning#32

Merged
Kavirubc merged 6 commits into main from feature/hybrid-index-learn
Feb 5, 2026

Conversation


@Kavirubc Kavirubc commented Feb 5, 2026

Overview

Implements a hybrid routing system that combines repository documentation (~60% weight) with historical issue patterns (~40% weight) to solve the cold-start problem and enable effective routing from Day 1.

Changes

1. Configuration Foundation

  • Added RepoCollection field to TransferConfig (default: "simili_repos")
  • Proper config inheritance and merge support
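
The default-and-merge behavior described above can be sketched as follows. This is a minimal illustration, not the actual `config.go`: `TransferConfig` is pared down to the one new field, and `applyDefaults`/`merge` are assumed helper names.

```go
package main

import "fmt"

// TransferConfig is reduced to the single field this PR adds;
// the real struct in internal/core/config has many more fields.
type TransferConfig struct {
	RepoCollection string
}

// applyDefaults fills in the documented default collection name.
func applyDefaults(c *TransferConfig) {
	if c.RepoCollection == "" {
		c.RepoCollection = "simili_repos"
	}
}

// merge lets a child config override the parent only when the child
// actually sets a value, mirroring the "merged from child configs
// when provided" behavior described in the PR.
func merge(parent, child TransferConfig) TransferConfig {
	out := parent
	if child.RepoCollection != "" {
		out.RepoCollection = child.RepoCollection
	}
	return out
}

func main() {
	var c TransferConfig
	applyDefaults(&c)
	fmt.Println(c.RepoCollection)
	fmt.Println(merge(c, TransferConfig{RepoCollection: "custom"}).RepoCollection)
}
```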

2. Data Structure Enhancement

  • Added Definition field to RepositoryCandidate for full repo documentation

3. Learn Command Implementation

  • New simili learn command to index repository documentation
  • Fetches files from GitHub API (README.md, CONTRIBUTING.md, etc.)
  • Generates embeddings and stores in separate Qdrant collection
  • Supports dry-run mode and comprehensive error handling

4. Enhanced System Prompt

  • Updated routing prompt with 60/40 weighting guidance
  • Includes full repository documentation when available
  • Truncation to prevent token explosion (max 2000 chars per repo)
  • Graceful fallback to config descriptions
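
Because the cap is applied to raw strings, a byte-based slice could split a multi-byte UTF-8 character; a rune-aware version of the truncation is sketched below. The `truncate` name matches the helper discussed in review, though the shipped implementation may differ.

```go
package main

import "fmt"

// truncate caps s at maxLen runes rather than bytes, so a cut never
// lands inside a multi-byte UTF-8 character. Sketch only; the helper
// in prompts.go may use a different suffix or limit semantics.
func truncate(s string, maxLen int) string {
	runes := []rune(s)
	if len(runes) <= maxLen {
		return s
	}
	return string(runes[:maxLen]) + "..."
}

func main() {
	fmt.Println(truncate("héllo wörld", 5))
}
```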

5. LLM Router Enhancement

  • Semantic search to fetch relevant repo definitions
  • Multi-file support with automatic concatenation
  • Comprehensive error handling with graceful degradation
  • Non-blocking operations with detailed logging

Key Features

  • Cold Start Support: Routes effectively even with zero historical issues
  • Multi-File Support: Concatenates multiple docs per repo (README + CONTRIBUTING + docs/*)
  • Semantic Search: Uses the issue embedding to find relevant repos (top 10, threshold 0.5)
  • Backward Compatible: All changes are additive; existing setups work without modification
  • Graceful Degradation: Every error point has a fallback path

Usage

# Learn repository documentation
simili learn --org my-org --repo backend --file README.md
simili learn --org my-org --repo backend --file CONTRIBUTING.md

# Dry-run mode
simili learn --org my-org --repo backend --file README.md --dry-run

# Process issues with hybrid routing
simili process --org my-org --repo backend --issue 123

Configuration

transfer:
  llm_routing_enabled: true
  repo_collection: "simili_repos"  # New: Separate collection for repo docs
  medium_confidence: 0.6

Testing

  • ✅ All existing tests pass
  • ✅ Tested with event-integrator test repos
  • ✅ Semantic search correctly identifies repositories
  • ✅ Multi-file support verified
  • ✅ Error handling tested

Commits

  1. b336d29 - Configuration field for repo collection
  2. a97d30a - Definition field in RepositoryCandidate
  3. 30db714 - Complete learn command implementation
  4. be1c0a6 - Enhanced routing prompt with 60/40 weighting
  5. 207b0cf - LLM router with semantic search integration

Summary by CodeRabbit

  • New Features
    • Added a "learn" command to fetch and index repository documentation into the system.
    • Enhanced routing to consult indexed repository documentation, improving recommendation accuracy and adding confidence scores and brief reasoning in outputs.
  • Configuration
    • Introduced a configurable repository-collection setting to enable/disable and target where docs are stored.

Add RepoCollection field to TransferConfig to support separate
Qdrant collection for repository documentation. Defaults to
"simili_repos" and includes proper merge logic for config inheritance.

Add Definition field to store full repository documentation
for enhanced LLM routing decisions in hybrid routing system.

Add new 'learn' command that fetches repository documentation
(README.md, CONTRIBUTING.md, etc.) from GitHub and indexes it
into a separate Qdrant collection for hybrid routing.

Supports:
- Fetching any file from GitHub API
- Embedding with Gemini
- Indexing to configurable repo collection
- Dry-run mode for testing
- Rich payload with org, repo, file, text, timestamp

This enables routing from Day 1 without historical issues.

Update buildRouteIssuePrompt to:
- Include full repository documentation when available
- Add qualitative 60/40 weighting guidance (docs vs history)
- Truncate long documentation to prevent token explosion
- Gracefully fall back to descriptions when no docs available
- Add indentText helper for proper formatting

This enables LLM to prioritize repository documentation as the
authoritative source for routing decisions while still considering
historical precedents.

Update LLMRouter to:
- Add embedder and vector store dependencies
- Fetch repo definitions via semantic search before routing
- Support multiple documentation files per repository
- Populate Definition field in candidates for enhanced LLM prompt
- Include comprehensive error handling with graceful fallbacks

The router now:
1. Checks if repo collection exists
2. Generates embedding for the issue
3. Searches for relevant repo documentation (top 10, threshold 0.5)
4. Concatenates multiple docs per repo with separators
5. Passes definitions to LLM for hybrid routing decision

All operations are non-blocking with detailed logging.
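
Steps 3–4 — grouping search hits by repository and concatenating multiple files — can be sketched like this; the `hit` struct and `aggregateDefinitions` helper are illustrative stand-ins for the real types in `llm_router.go`:

```go
package main

import "fmt"

// hit is a simplified stand-in for a vector-search result payload.
type hit struct {
	org, repo, file, text string
}

// aggregateDefinitions groups documents by org/repo and joins
// multiple files with "--- <file> ---" separators, always including
// the header (the consistency variant suggested in review).
func aggregateDefinitions(hits []hit) map[string]string {
	defs := make(map[string]string)
	for _, h := range hits {
		key := fmt.Sprintf("%s/%s", h.org, h.repo)
		block := fmt.Sprintf("--- %s ---\n\n%s", h.file, h.text)
		if existing, ok := defs[key]; ok {
			defs[key] = existing + "\n\n" + block
		} else {
			defs[key] = block
		}
	}
	return defs
}

func main() {
	hits := []hit{
		{"my-org", "backend", "README.md", "API service docs"},
		{"my-org", "backend", "CONTRIBUTING.md", "How to contribute"},
	}
	fmt.Println(aggregateDefinitions(hits)["my-org/backend"])
}
```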

coderabbitai Bot commented Feb 5, 2026

📝 Walkthrough

Walkthrough

Adds a new simili learn Cobra command to fetch repo files from GitHub, embed them with Gemini, and upsert into Qdrant; adds RepoCollection config, enriches repository candidates with stored definitions, and wires an embedder/vector store into the LLM router for pre-routing enrichment.

Changes

Cohort / File(s) Summary
Learn command
cmd/simili/commands/learn.go
New Cobra command to fetch a file from GitHub, validate path, generate Gemini embedding, build deterministic SHA-256 ID, and upsert a point into a Qdrant collection; supports flags (org, repo, file, token, dry-run) and dry-run synopsis.
Config
internal/core/config/config.go
Adds RepoCollection string to TransferConfig, defaulting to "simili_repos" in applyDefaults and merged from child configs when provided.
Gemini model types
internal/integrations/gemini/llm.go
Adds Definition string field to RepositoryCandidate to carry full repository documentation.
Gemini prompts/utilities
internal/integrations/gemini/prompts.go
Adds truncate and indentText helpers; refactors route prompt builder to include per-repo Documentation blocks, conditional weighting guidance, new output format (confidence + reasoning), and consolidated truncate implementation.
LLM router enrichment
internal/steps/llm_router.go
Adds embedder and qdrant vector store fields; conditionally initializes embedder/store when RepoCollection configured; builds issue embedding, searches repo collection, aggregates per-repo definitions (with truncation), and enriches candidates before LLM routing.
Cross-cutting
...
New wiring and non-blocking logging for collection existence checks, embedding/search errors, and enriched candidate population; mostly additive logic and new private helpers.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant GitHub as GitHub API
    participant Embedder as Gemini Embedder
    participant Qdrant as Qdrant

    User->>Embedder: simili learn (org, repo, file)
    Embedder->>GitHub: fetch file content
    GitHub-->>Embedder: file content
    Embedder->>Embedder: generate embedding
    Embedder->>Qdrant: ensure collection & upsert point (id, embedding, payload)
    Qdrant-->>Embedder: upsert result
    Embedder-->>User: success / synopsis (dry-run)
sequenceDiagram
    participant Issue as User Issue
    participant Router as LLMRouter
    participant Embedder as Gemini Embedder
    participant Qdrant as Qdrant
    participant LLM as LLM Engine

    Issue->>Router: route request
    Router->>Embedder: generate issue embedding
    Embedder-->>Router: embedding vector
    Router->>Qdrant: search repo collection (embedding)
    Qdrant-->>Router: matching repo documents
    Router->>Router: aggregate/truncate per-repo definitions
    Router->>LLM: invoke with enriched candidates
    LLM-->>Router: routing decision
    Router-->>Issue: routed repo

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Poem

🐰 I hopped to GitHub, nibbed the README bright,
Embeddings I whispered into Qdrant at night,
Definitions tucked snug in each candidate's nest,
Routers now ponder with documentation zest,
Hop — learned and indexed — now routing feels right! 📚✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main objective of the PR: implementing a hybrid routing system that combines repository documentation with historical issue patterns through a new learning capability.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/steps/llm_router.go (1)

66-79: ⚠️ Potential issue | 🟠 Major

Allow repos without descriptions so repo-docs can drive routing.

The repo.Description == "" filter drops repos that only have documentation in the repo collection, which defeats the new doc-based routing path. Consider allowing enabled repos and relying on definitions (or descriptions) later.

🔧 Suggested fix
-		if !repo.Enabled || repoName == currentRepo || repo.Description == "" {
+		if !repo.Enabled || repoName == currentRepo {
 			continue
 		}


Copilot AI left a comment


Pull request overview

This PR implements a hybrid routing system that enhances issue routing by combining repository documentation (learned via a new simili learn command) with historical issue patterns. The system addresses the "cold start problem" by enabling effective routing even when no historical issues exist.

Changes:

  • Added a new simili learn command to index repository documentation (README.md, CONTRIBUTING.md, etc.) into a separate Qdrant collection
  • Enhanced the LLM router to perform semantic search on repository documentation and include full definitions in routing prompts
  • Updated routing prompts with 60/40 weighting guidance (60% documentation, 40% historical patterns)
  • Added configuration support for the repository documentation collection with proper defaults and merge logic
  • Extended the RepositoryCandidate data structure to include full documentation definitions

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
cmd/simili/commands/learn.go New command implementation for indexing repository documentation files from GitHub into Qdrant
internal/steps/llm_router.go Enhanced router with semantic search to fetch and include repository definitions in routing decisions
internal/integrations/gemini/prompts.go Updated routing prompt with documentation support, weighting guidance, and new helper functions (truncate moved, indentText added)
internal/integrations/gemini/llm.go Added Definition field to RepositoryCandidate struct for storing full repository documentation
internal/core/config/config.go Added RepoCollection field to TransferConfig with default value and merge support


Comment on lines +130 to +141
point := &qdrant.Point{
	ID:     uuid.New().String(),
	Vector: embedding,
	Payload: map[string]interface{}{
		"org":        learnOrg,
		"repo":       learnRepo,
		"file":       learnFile,
		"text":       contentStr,
		"indexed_at": time.Now().Format(time.RFC3339),
		"type":       "repo_doc",
	},
}
Copilot AI Feb 5, 2026

The UUID generation for point IDs could lead to duplicate documents in the collection if the same file is learned multiple times. Each time "simili learn" is run for the same org/repo/file combination, it creates a new point with a new UUID, resulting in duplicate entries. Consider using a deterministic ID based on org/repo/file (e.g., hash or concatenation) to enable idempotent operations where re-learning the same file updates the existing entry rather than creating duplicates.

Comment on lines +121 to +127
// 6. Generate Embedding
log.Printf("Generating embedding...")
embedding, err := embedder.Embed(ctx, contentStr)
if err != nil {
	log.Fatalf("Failed to generate embedding: %v", err)
}
log.Printf("Generated embedding with %d dimensions", len(embedding))
Copilot AI Feb 5, 2026

The learn command doesn't implement rate limiting for the Gemini embedding API. If users run multiple learn commands in quick succession or try to index many files, they could hit API rate limits and experience failures. Consider adding retry logic with exponential backoff or at least documenting the rate limit considerations in the command's help text. The index command likely has similar considerations since it's shown importing the same packages.

Comment on lines +138 to +140
if len(definition) > 2000 {
	definition = definition[:2000] + "\n... (truncated)"
}
Copilot AI Feb 5, 2026

The same UTF-8 truncation issue exists here. When truncating the definition to 2000 characters using definition[:2000], you may split a multi-byte UTF-8 character, resulting in an invalid UTF-8 string. This could cause issues when the truncated definition is used in the prompt. Consider using a rune-aware truncation approach to ensure UTF-8 safety.

Comment on lines +140 to +145
// First document for this repo
if file != "README.md" {
	repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
} else {
	repoDefinitions[repoKey] = text
}
Copilot AI Feb 5, 2026

The code uses hardcoded file name matching logic that could be fragile. The special case for "README.md" (line 141) means that if the first file encountered is README.md, it gets no header, but if a non-README is encountered first, then README.md would get a header when added later. This inconsistency could make the output unpredictable. Consider either always including file headers for consistency, or ordering the results to ensure README.md is always processed first if present.

Suggested change
// First document for this repo
if file != "README.md" {
	repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
} else {
	repoDefinitions[repoKey] = text
}
// First document for this repo; include file header for consistency
repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text

Comment on lines +99 to +104
qdrantClient, err = qdrant.NewClient(cfg.Qdrant.URL, cfg.Qdrant.APIKey)
if err != nil {
	log.Fatalf("Failed to initialize Qdrant client: %v", err)
}
defer qdrantClient.Close()
}
Copilot AI Feb 5, 2026

Missing error handling for Qdrant client initialization. If qdrant.NewClient returns an error but the client is still non-nil (though this is unlikely), the defer statement could attempt to close a partially initialized client. While the current Qdrant client likely handles this correctly, it would be more defensive to check the error before setting up the defer.

Comment thread internal/integrations/gemini/prompts.go Outdated
Comment on lines +13 to +18
// truncate limits a string to a maximum length.
func truncate(s string, maxLen int) string {
	if len(s) <= maxLen {
		return s
	}
	return s[:maxLen] + "..."
Copilot AI Feb 5, 2026

The truncate function may split multi-byte UTF-8 characters, resulting in invalid UTF-8 strings. When slicing a string in Go by byte index (s[:maxLen]), if maxLen falls in the middle of a multi-byte character, the resulting string will be corrupted. This could cause issues when the truncated text is used in prompts or stored in the database. Consider using a rune-aware truncation approach or the strings.Cut function to ensure you don't break UTF-8 characters.

Suggested change
// truncate limits a string to a maximum length.
func truncate(s string, maxLen int) string {
	if len(s) <= maxLen {
		return s
	}
	return s[:maxLen] + "..."
// truncate limits a string to a maximum length in runes, preserving valid UTF-8.
func truncate(s string, maxLen int) string {
	runes := []rune(s)
	if len(runes) <= maxLen {
		return s
	}
	return string(runes[:maxLen]) + "..."

Comment on lines +1 to +170
// Author: Kaviru Hapuarachchi
// GitHub: https://github.com/kavirubc
// Created: 2026-02-05
// Last Modified: 2026-02-05

package commands

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/google/uuid"
	similiConfig "github.com/similigh/simili-bot/internal/core/config"
	"github.com/similigh/simili-bot/internal/integrations/gemini"
	similiGithub "github.com/similigh/simili-bot/internal/integrations/github"
	"github.com/similigh/simili-bot/internal/integrations/qdrant"
	"github.com/spf13/cobra"
)

var (
	learnOrg    string
	learnRepo   string
	learnFile   string
	learnToken  string
	learnDryRun bool
)

// learnCmd represents the learn command
var learnCmd = &cobra.Command{
	Use:   "learn",
	Short: "Index repository documentation for hybrid routing",
	Long: `Learn repository documentation by fetching files from GitHub (README.md,
CONTRIBUTING.md, etc.) and indexing them into a separate Qdrant collection.

This enables the bot to understand what each repository is responsible for,
improving routing decisions even with zero historical issues (cold start).

Examples:
  simili learn --org my-org --repo backend --file README.md
  simili learn --org my-org --repo backend --file CONTRIBUTING.md
  simili learn --org my-org --repo backend --file docs/ARCHITECTURE.md --dry-run`,
	Run: runLearn,
}

func init() {
	rootCmd.AddCommand(learnCmd)

	learnCmd.Flags().StringVar(&learnOrg, "org", "", "Organization name (required)")
	learnCmd.Flags().StringVar(&learnRepo, "repo", "", "Repository name (required)")
	learnCmd.Flags().StringVar(&learnFile, "file", "README.md", "File path to learn (default: README.md)")
	learnCmd.Flags().StringVar(&learnToken, "token", "", "GitHub token (optional, defaults to GITHUB_TOKEN env var)")
	learnCmd.Flags().BoolVar(&learnDryRun, "dry-run", false, "Simulate without writing to database")

	if err := learnCmd.MarkFlagRequired("org"); err != nil {
		log.Fatalf("Failed to mark org flag as required: %v", err)
	}
	if err := learnCmd.MarkFlagRequired("repo"); err != nil {
		log.Fatalf("Failed to mark repo flag as required: %v", err)
	}
}

func runLearn(cmd *cobra.Command, args []string) {
	ctx := context.Background()

	// 1. Load Config
	cfgPath := similiConfig.FindConfigPath(cfgFile)
	if cfgPath == "" {
		log.Fatalf("Config file not found. Please verify your setup.")
	}
	cfg, err := similiConfig.Load(cfgPath)
	if err != nil {
		log.Fatalf("Failed to load config: %v", err)
	}

	// 2. Initialize GitHub Client
	token := learnToken
	if token == "" {
		token = os.Getenv("GITHUB_TOKEN")
	}
	if token == "" {
		log.Fatalf("GitHub token is required (use --token or GITHUB_TOKEN env var)")
	}

	ghClient := similiGithub.NewClient(ctx, token)

	// 3. Initialize Embedder
	embedder, err := gemini.NewEmbedder(cfg.Embedding.APIKey, cfg.Embedding.Model)
	if err != nil {
		log.Fatalf("Failed to initialize Gemini embedder: %v", err)
	}
	defer embedder.Close()

	// 4. Initialize Qdrant Client (unless dry-run)
	var qdrantClient *qdrant.Client
	if !learnDryRun {
		qdrantClient, err = qdrant.NewClient(cfg.Qdrant.URL, cfg.Qdrant.APIKey)
		if err != nil {
			log.Fatalf("Failed to initialize Qdrant client: %v", err)
		}
		defer qdrantClient.Close()
	}

	// 5. Fetch File from GitHub
	log.Printf("Fetching %s from %s/%s...", learnFile, learnOrg, learnRepo)
	content, err := ghClient.GetFileContent(ctx, learnOrg, learnRepo, learnFile, "")
	if err != nil {
		log.Fatalf("Failed to fetch %s from %s/%s: %v", learnFile, learnOrg, learnRepo, err)
	}

	if len(content) == 0 {
		log.Printf("Warning: File %s is empty, skipping", learnFile)
		return
	}

	contentStr := string(content)
	log.Printf("Fetched %d characters from %s", len(contentStr), learnFile)

	// 6. Generate Embedding
	log.Printf("Generating embedding...")
	embedding, err := embedder.Embed(ctx, contentStr)
	if err != nil {
		log.Fatalf("Failed to generate embedding: %v", err)
	}
	log.Printf("Generated embedding with %d dimensions", len(embedding))

	// 7. Create Point with Rich Payload
	point := &qdrant.Point{
		ID:     uuid.New().String(),
		Vector: embedding,
		Payload: map[string]interface{}{
			"org":        learnOrg,
			"repo":       learnRepo,
			"file":       learnFile,
			"text":       contentStr,
			"indexed_at": time.Now().Format(time.RFC3339),
			"type":       "repo_doc",
		},
	}

	// 8. Dry Run Check
	if learnDryRun {
		fmt.Printf("🔍 [DRY RUN] Would index %s/%s/%s\n", learnOrg, learnRepo, learnFile)
		fmt.Printf("📊 Collection: %s\n", cfg.Transfer.RepoCollection)
		fmt.Printf("📝 Content: %d characters\n", len(contentStr))
		fmt.Printf("🔢 Embedding: %d dimensions\n", len(embedding))
		return
	}

	// 9. Ensure Collection Exists
	repoCollection := cfg.Transfer.RepoCollection
	log.Printf("Ensuring collection '%s' exists...", repoCollection)
	if err := qdrantClient.CreateCollection(ctx, repoCollection, cfg.Embedding.Dimensions); err != nil {
		log.Fatalf("Failed to create/verify collection: %v", err)
	}

	// 10. Upsert to Qdrant
	log.Printf("Indexing to collection '%s'...", repoCollection)
	if err := qdrantClient.Upsert(ctx, repoCollection, []*qdrant.Point{point}); err != nil {
		log.Fatalf("Failed to index document: %v", err)
	}

	// Success Output
	fmt.Printf("✅ Successfully indexed %s/%s/%s\n", learnOrg, learnRepo, learnFile)
	fmt.Printf("📊 Collection: %s\n", repoCollection)
	fmt.Printf("📝 Content: %d characters\n", len(contentStr))
	fmt.Printf("🔢 Embedding: %d dimensions\n", len(embedding))
}
Copilot AI Feb 5, 2026

The code lacks test coverage for the new hybrid routing functionality. While there are existing tests for prompts (gemini_test.go), there are no tests for: (1) the new learn command, (2) the enhanced LLM router with semantic search and multi-file support, (3) the new helper functions (indentText), or (4) the updated routing prompt with weighting guidance. Adding tests would help ensure the hybrid routing feature works correctly and prevent regressions.

Comment thread internal/steps/llm_router.go Outdated
Comment on lines +135 to +146
// If multiple docs for same repo, concatenate with separators
if existing, ok := repoDefinitions[repoKey]; ok {
	// Add file header for clarity
	repoDefinitions[repoKey] = existing + fmt.Sprintf("\n\n--- %s ---\n\n", file) + text
} else {
	// First document for this repo
	if file != "README.md" {
		repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
	} else {
		repoDefinitions[repoKey] = text
	}
}
Copilot AI Feb 5, 2026

Potential unbounded memory growth when concatenating multiple documentation files for the same repository. If a repository has many documentation files that match the semantic search (e.g., 10 files with 2000 characters each), the concatenated definition could grow to 20,000+ characters. While individual files are truncated to 2000 characters, there's no limit on the total concatenated size. Consider adding a maximum total size limit for concatenated definitions to prevent token explosion and memory issues.

Comment on lines +87 to +161
// Fetch repository definitions from repo collection (if available)
repoDefinitions := make(map[string]string) // map[org/repo]full_definition

if s.embedder != nil && s.store != nil && ctx.Config.Transfer.RepoCollection != "" {
	// Check if repo collection exists
	exists, err := s.store.CollectionExists(ctx.Ctx, ctx.Config.Transfer.RepoCollection)
	if err != nil {
		log.Printf("[llm_router] Error checking repo collection: %v (non-blocking)", err)
	} else if !exists {
		log.Printf("[llm_router] Repo collection '%s' not found, using config descriptions",
			ctx.Config.Transfer.RepoCollection)
	} else {
		// Generate embedding for issue to find relevant repos
		issueContent := fmt.Sprintf("%s\n\n%s", ctx.Issue.Title, ctx.Issue.Body)
		issueEmbedding, err := s.embedder.Embed(ctx.Ctx, issueContent)

		if err != nil {
			log.Printf("[llm_router] Error generating issue embedding: %v (non-blocking)", err)
		} else {
			// Search for relevant repository definitions
			// Use broader threshold (0.5) and higher limit (10) to get context
			results, err := s.store.Search(
				ctx.Ctx,
				ctx.Config.Transfer.RepoCollection,
				issueEmbedding,
				10,  // Get top 10 relevant repo docs (more than we'll route to)
				0.5, // Lower threshold than issues (0.65-0.7) for broader context
			)

			if err != nil {
				log.Printf("[llm_router] Error searching repo collection: %v (non-blocking)", err)
			} else if len(results) == 0 {
				log.Printf("[llm_router] No repository definitions found")
			} else {
				// Collect definitions per repo (handle multiple files per repo)
				for _, res := range results {
					org, _ := res.Payload["org"].(string)
					repo, _ := res.Payload["repo"].(string)
					text, _ := res.Payload["text"].(string)
					file, _ := res.Payload["file"].(string)

					if org == "" || repo == "" || text == "" {
						log.Printf("[llm_router] Invalid repo definition payload, skipping")
						continue
					}

					repoKey := fmt.Sprintf("%s/%s", org, repo)

					// If multiple docs for same repo, concatenate with separators
					if existing, ok := repoDefinitions[repoKey]; ok {
						// Add file header for clarity
						repoDefinitions[repoKey] = existing + fmt.Sprintf("\n\n--- %s ---\n\n", file) + text
					} else {
						// First document for this repo
						if file != "README.md" {
							repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
						} else {
							repoDefinitions[repoKey] = text
						}
					}
				}
				log.Printf("[llm_router] Loaded %d repository definitions from %d documents",
					len(repoDefinitions), len(results))
			}
		}
	}
}

// Enhance candidates with full definitions
for i := range candidates {
	repoKey := fmt.Sprintf("%s/%s", candidates[i].Org, candidates[i].Repo)
	if def, ok := repoDefinitions[repoKey]; ok {
		candidates[i].Definition = def
	}
}
Copilot AI Feb 5, 2026

The semantic search for repository definitions only searches in the repo collection, which means repositories that haven't been indexed via the learn command won't be found. However, all enabled repositories from the config are included in the candidates list (lines 70-80), so unindexed repositories will still be considered for routing but will only have their short config descriptions available. This is by design for graceful degradation, but it might be worth documenting this behavior more clearly in the code comments to help future maintainers understand the fallback logic.

Comment thread cmd/simili/commands/learn.go Outdated
Comment on lines +106 to +111
// 5. Fetch File from GitHub
log.Printf("Fetching %s from %s/%s...", learnFile, learnOrg, learnRepo)
content, err := ghClient.GetFileContent(ctx, learnOrg, learnRepo, learnFile, "")
if err != nil {
	log.Fatalf("Failed to fetch %s from %s/%s: %v", learnFile, learnOrg, learnRepo, err)
}
Copilot AI Feb 5, 2026

The code doesn't validate whether the specified file path is within the repository bounds, potentially allowing path traversal attempts. While the GitHub API itself likely validates this, adding explicit validation that the file path doesn't contain patterns like "../" would make the security boundary clearer and prevent potential issues if the GitHub client implementation changes. Consider adding path validation before calling GetFileContent.

Fixes based on Copilot and CodeRabbit review:
- Use deterministic IDs (SHA-256 hash of org/repo/file) for idempotent learn operations
- Add path validation to prevent path traversal attempts
- Fix UTF-8 safe truncation using runes instead of bytes
- Always include file headers for consistency in multi-file concatenation
- Add max total size limit (5000 chars) per repo to prevent unbounded memory growth
- Improve logging for truncation events

These changes ensure:
- Re-learning the same file updates existing entry instead of creating duplicates
- No UTF-8 character corruption in truncated strings
- Consistent formatting for all documentation files
- Protection against token explosion from large concatenated definitions
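
The deterministic-ID part of this fix can be sketched as below. The key format and hash encoding are assumptions (and note that real Qdrant point IDs must be an unsigned integer or a UUID, so in practice the digest would be mapped into one of those forms):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// deterministicID derives a stable identifier from org/repo/file so
// re-learning the same file upserts over the previous entry instead
// of creating a duplicate. Sketch only; the commit may use a
// different key layout or encode the digest as a UUID.
func deterministicID(org, repo, file string) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s/%s/%s", org, repo, file)))
	return fmt.Sprintf("%x", sum)
}

func main() {
	fmt.Println(deterministicID("my-org", "backend", "README.md"))
}
```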

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@internal/integrations/gemini/prompts.go`:
- Around line 138-141: The truncation of r.Definition uses byte-length and
slicing which can break UTF-8; replace the manual byte-based check and slice
(len(definition) and definition[:2000]) with the existing rune-safe helper
truncate(), e.g. call truncate(definition, 2000) (or adjust truncate to accept a
custom suffix if you want "... (truncated)" instead of its default) so that
Definition is truncated safely without corrupting multibyte characters.

In `@internal/steps/llm_router.go`:
- Around line 141-158: The code currently truncates strings with byte slicing
(combined[:maxTotalSizePerRepo] and newDoc[:maxTotalSizePerRepo]) which can
split multi-byte UTF-8 runes; replace those slices with a rune-safe truncation
helper (e.g., truncateToByteLimitSafely) that iterates runes and accumulates
byte length until adding the next rune would exceed maxTotalSizePerRepo, then
returns the substring up to the last whole rune plus the "\n... (truncated)"
suffix; call this helper for both the combined and newDoc truncation branches
(referencing repoDefinitions, combined, newDoc, and maxTotalSizePerRepo to
locate the changes).
🧹 Nitpick comments (2)
cmd/simili/commands/learn.go (2)

108-113: Path validation may be insufficient on some edge cases.

The current validation uses filepath.Clean followed by string checks. However, filepath.Clean may normalize paths differently across platforms. Relative segments are handled correctly — foo/../bar is cleaned to bar and passes the check — but a legitimate filename that merely contains dots, such as ..foo, would be incorrectly rejected by the Contains("..") check.

Consider using filepath.IsAbs and checking if the cleaned path escapes the intended root:

♻️ More robust path validation
```diff
 	// 5. Validate and fetch file from GitHub
 	// Clean path and validate (prevent path traversal)
 	cleanPath := filepath.Clean(learnFile)
-	if strings.Contains(cleanPath, "..") || strings.HasPrefix(cleanPath, "/") {
+	if filepath.IsAbs(cleanPath) || strings.HasPrefix(cleanPath, "..") {
 		log.Fatalf("Invalid file path: %s (path traversal not allowed)", learnFile)
 	}
```

This allows legitimate filenames containing .. (e.g., docs/foo..bar.md) while still blocking traversal attempts.
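The suggested check can be exercised in isolation. This is a hedged sketch — the function name `validateLearnPath` is illustrative, not the actual code in learn.go — but the validation logic is exactly the one proposed above, and the examples show which paths it accepts and rejects:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// validateLearnPath rejects absolute paths and anything that still escapes
// the repo root after cleaning, while allowing filenames that merely
// contain "..".
func validateLearnPath(p string) error {
	clean := filepath.Clean(p)
	if filepath.IsAbs(clean) || strings.HasPrefix(clean, "..") {
		return fmt.Errorf("invalid file path: %s (path traversal not allowed)", p)
	}
	return nil
}

func main() {
	for _, p := range []string{"docs/foo..bar.md", "../etc/passwd", "/etc/passwd", "foo/../bar"} {
		fmt.Println(p, validateLearnPath(p) == nil)
	}
	// docs/foo..bar.md true   (dotted filename allowed)
	// ../etc/passwd false     (traversal rejected)
	// /etc/passwd false       (absolute path rejected)
	// foo/../bar true         (cleans to "bar", stays inside root)
}
```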


121-135: Consider adding a size limit for large files.

While the system handles empty files, there's no upper bound on file size. Very large files could cause memory issues during embedding generation or exceed API limits.

💡 Optional: Add maximum file size check
```diff
 	if len(content) == 0 {
 		log.Printf("Warning: File %s is empty, skipping", learnFile)
 		return
 	}
+
+	const maxFileSize = 100 * 1024 // 100KB limit
+	if len(content) > maxFileSize {
+		log.Printf("Warning: File %s exceeds size limit (%d > %d bytes), truncating",
+			learnFile, len(content), maxFileSize)
+		content = content[:maxFileSize]
+	}

 	contentStr := string(content)
```

Comment on lines +138 to +141

```go
definition := r.Definition
if len(definition) > 2000 {
	definition = definition[:2000] + "\n... (truncated)"
}
```

⚠️ Potential issue | 🟡 Minor

Use rune-based truncation for UTF-8 safety.

The truncation here uses byte-length (len(definition)) and byte slicing (definition[:2000]), which can corrupt multi-byte UTF-8 characters. The truncate() helper defined at lines 13-20 already handles this correctly.

🛠️ Proposed fix

```diff
-			definition := r.Definition
-			if len(definition) > 2000 {
-				definition = definition[:2000] + "\n... (truncated)"
-			}
+			definition := truncate(r.Definition, 2000)
+			if definition != r.Definition {
+				definition += "\n(truncated)"
+			}
```

Note: The truncate() function already appends "...", so you may want to adjust the suffix or modify truncate() to accept a custom suffix parameter.
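A suffix-aware variant of such a helper might look like the sketch below. The name `truncateWithSuffix` is illustrative — the actual truncate() in prompts.go may differ in name and default suffix — but it shows the rune-based cut that avoids splitting multi-byte characters:

```go
package main

import "fmt"

// truncateWithSuffix cuts s at a rune boundary (never mid-character) and
// appends the caller's suffix only when truncation actually happened.
func truncateWithSuffix(s string, max int, suffix string) string {
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	return string(r[:max]) + suffix
}

func main() {
	fmt.Println(truncateWithSuffix("héllo wörld", 5, "...")) // héllo...
}
```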


Comment on lines +141 to +158

```go
// If multiple docs for same repo, concatenate with size limit
if existing, ok := repoDefinitions[repoKey]; ok {
	combined := existing + "\n\n" + newDoc
	// Enforce max total size per repo
	if len(combined) > maxTotalSizePerRepo {
		log.Printf("[llm_router] Truncating %s definition (size: %d > %d)",
			repoKey, len(combined), maxTotalSizePerRepo)
		combined = combined[:maxTotalSizePerRepo] + "\n... (truncated)"
	}
	repoDefinitions[repoKey] = combined
} else {
	// First document for this repo
	if len(newDoc) > maxTotalSizePerRepo {
		newDoc = newDoc[:maxTotalSizePerRepo] + "\n... (truncated)"
	}
	repoDefinitions[repoKey] = newDoc
}
```

⚠️ Potential issue | 🟡 Minor

Byte-based truncation may corrupt UTF-8 characters.

The truncation at lines 145-148 and 153-154 uses byte slicing (combined[:maxTotalSizePerRepo]), which can split multi-byte UTF-8 characters. This is the same issue as in prompts.go.

🛠️ Proposed fix using rune-safe truncation

```diff
+// truncateRunes safely truncates a string to maxLen runes.
+func truncateRunes(s string, maxLen int) string {
+	runes := []rune(s)
+	if len(runes) <= maxLen {
+		return s
+	}
+	return string(runes[:maxLen])
+}

 // Inside the enrichment loop:
 						// Enforce max total size per repo
-						if len(combined) > maxTotalSizePerRepo {
+						if len([]rune(combined)) > maxTotalSizePerRepo {
 							log.Printf("[llm_router] Truncating %s definition (size: %d > %d)",
 								repoKey, len(combined), maxTotalSizePerRepo)
-							combined = combined[:maxTotalSizePerRepo] + "\n... (truncated)"
+							combined = truncateRunes(combined, maxTotalSizePerRepo) + "\n... (truncated)"
 						}
```

@Kavirubc Kavirubc merged commit 7b1ab83 into main Feb 5, 2026
4 checks passed
@Kavirubc Kavirubc deleted the feature/hybrid-index-learn branch February 5, 2026 05:51
@Kavirubc Kavirubc linked an issue Feb 5, 2026 that may be closed by this pull request
Kavirubc added a commit to similigh/docs that referenced this pull request Feb 5, 2026
Add reference documentation for the new 'simili learn' command used to index
repository documentation for hybrid routing.

Update TransferConfig schema to document new fields:
- high_confidence: threshold for auto-transferring issues
- medium_confidence: threshold for suggesting transfers
- repo_collection: Qdrant collection for repository documentation

Related PRs: similigh/simili-bot#32 (hybrid routing implementation)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
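The documented TransferConfig fields can be sketched as a Go struct. The field names and the "simili_repos" default come from this PR; the yaml tags and the threshold values in the example are assumptions about the on-disk schema, not confirmed code:

```go
package main

import "fmt"

// TransferConfig mirrors the fields documented above.
type TransferConfig struct {
	HighConfidence   float64 `yaml:"high_confidence"`   // auto-transfer threshold
	MediumConfidence float64 `yaml:"medium_confidence"` // suggest-transfer threshold
	RepoCollection   string  `yaml:"repo_collection"`   // Qdrant collection for repo docs
}

// applyDefaults fills the documented default collection name when unset.
func applyDefaults(c TransferConfig) TransferConfig {
	if c.RepoCollection == "" {
		c.RepoCollection = "simili_repos"
	}
	return c
}

func main() {
	// Threshold values here are illustrative.
	c := applyDefaults(TransferConfig{HighConfidence: 0.85, MediumConfidence: 0.6})
	fmt.Println(c.RepoCollection) // simili_repos
}
```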

Successfully merging this pull request may close these issues.

Feature: Hybrid Routing with Repository Knowledge Base
