
feat: implement hybrid routing with repository documentation learning#32

Merged
Kavirubc merged 6 commits into main from feature/hybrid-index-learn
Feb 5, 2026

Conversation


@Kavirubc Kavirubc commented Feb 5, 2026

Overview

Implements a hybrid routing system that combines repository documentation (~60% weight) with historical issue patterns (~40% weight) to solve the cold-start problem and enable effective routing from Day 1.

Changes

1. Configuration Foundation

  • Added RepoCollection field to TransferConfig (default: "simili_repos")
  • Proper config inheritance and merge support
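
The default-and-merge behavior described above can be sketched as follows. This is a minimal illustration, not the actual `config.go`: `TransferConfig` is pared down to the one new field, and `applyDefaults`/`merge` are assumed helper names.

```go
package main

import "fmt"

// TransferConfig is reduced to the single field this PR adds;
// the real struct in internal/core/config has many more fields.
type TransferConfig struct {
	RepoCollection string
}

// applyDefaults fills in the documented default collection name.
func applyDefaults(c *TransferConfig) {
	if c.RepoCollection == "" {
		c.RepoCollection = "simili_repos"
	}
}

// merge lets a child config override the parent only when the child
// actually sets a value, mirroring the "merged from child configs
// when provided" behavior described in the PR.
func merge(parent, child TransferConfig) TransferConfig {
	out := parent
	if child.RepoCollection != "" {
		out.RepoCollection = child.RepoCollection
	}
	return out
}

func main() {
	var c TransferConfig
	applyDefaults(&c)
	fmt.Println(c.RepoCollection)
	fmt.Println(merge(c, TransferConfig{RepoCollection: "custom"}).RepoCollection)
}
```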

2. Data Structure Enhancement

  • Added Definition field to RepositoryCandidate for full repo documentation

3. Learn Command Implementation

  • New simili learn command to index repository documentation
  • Fetches files from GitHub API (README.md, CONTRIBUTING.md, etc.)
  • Generates embeddings and stores in separate Qdrant collection
  • Supports dry-run mode and comprehensive error handling

4. Enhanced System Prompt

  • Updated routing prompt with 60/40 weighting guidance
  • Includes full repository documentation when available
  • Truncation to prevent token explosion (max 2000 chars per repo)
  • Graceful fallback to config descriptions
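
Because the cap is applied to raw strings, a byte-based slice could split a multi-byte UTF-8 character; a rune-aware version of the truncation is sketched below. The `truncate` name matches the helper discussed in review, though the shipped implementation may differ.

```go
package main

import "fmt"

// truncate caps s at maxLen runes rather than bytes, so a cut never
// lands inside a multi-byte UTF-8 character. Sketch only; the helper
// in prompts.go may use a different suffix or limit semantics.
func truncate(s string, maxLen int) string {
	runes := []rune(s)
	if len(runes) <= maxLen {
		return s
	}
	return string(runes[:maxLen]) + "..."
}

func main() {
	fmt.Println(truncate("héllo wörld", 5))
}
```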

5. LLM Router Enhancement

  • Semantic search to fetch relevant repo definitions
  • Multi-file support with automatic concatenation
  • Comprehensive error handling with graceful degradation
  • Non-blocking operations with detailed logging

Key Features

  • Cold Start Support: Routes effectively even with zero historical issues
  • Multi-File Support: Concatenates multiple docs per repo (README + CONTRIBUTING + docs/*)
  • Semantic Search: Uses the issue embedding to find relevant repos (top 10, threshold 0.5)
  • Backward Compatible: All changes are additive; existing setups work without modification
  • Graceful Degradation: Every error point has a fallback path

Usage

# Learn repository documentation
simili learn --org my-org --repo backend --file README.md
simili learn --org my-org --repo backend --file CONTRIBUTING.md

# Dry-run mode
simili learn --org my-org --repo backend --file README.md --dry-run

# Process issues with hybrid routing
simili process --org my-org --repo backend --issue 123

Configuration

transfer:
  llm_routing_enabled: true
  repo_collection: "simili_repos"  # New: Separate collection for repo docs
  medium_confidence: 0.6

Testing

  • ✅ All existing tests pass
  • ✅ Tested with event-integrator test repos
  • ✅ Semantic search correctly identifies repositories
  • ✅ Multi-file support verified
  • ✅ Error handling tested

Commits

  1. b336d29 - Configuration field for repo collection
  2. a97d30a - Definition field in RepositoryCandidate
  3. 30db714 - Complete learn command implementation
  4. be1c0a6 - Enhanced routing prompt with 60/40 weighting
  5. 207b0cf - LLM router with semantic search integration

Summary by CodeRabbit

  • New Features
    • Added a "learn" command to fetch and index repository documentation into the system.
    • Enhanced routing to consult indexed repository documentation, improving recommendation accuracy and adding confidence scores and brief reasoning in outputs.
  • Configuration
    • Introduced a configurable repository-collection setting to enable/disable and target where docs are stored.

Add RepoCollection field to TransferConfig to support separate
Qdrant collection for repository documentation. Defaults to
"simili_repos" and includes proper merge logic for config inheritance.

Add Definition field to store full repository documentation
for enhanced LLM routing decisions in hybrid routing system.

Add new 'learn' command that fetches repository documentation
(README.md, CONTRIBUTING.md, etc.) from GitHub and indexes it
into a separate Qdrant collection for hybrid routing.

Supports:
- Fetching any file from GitHub API
- Embedding with Gemini
- Indexing to configurable repo collection
- Dry-run mode for testing
- Rich payload with org, repo, file, text, timestamp

This enables routing from Day 1 without historical issues.

Update buildRouteIssuePrompt to:
- Include full repository documentation when available
- Add qualitative 60/40 weighting guidance (docs vs history)
- Truncate long documentation to prevent token explosion
- Gracefully fall back to descriptions when no docs available
- Add indentText helper for proper formatting

This enables LLM to prioritize repository documentation as the
authoritative source for routing decisions while still considering
historical precedents.

Update LLMRouter to:
- Add embedder and vector store dependencies
- Fetch repo definitions via semantic search before routing
- Support multiple documentation files per repository
- Populate Definition field in candidates for enhanced LLM prompt
- Include comprehensive error handling with graceful fallbacks

The router now:
1. Checks if repo collection exists
2. Generates embedding for the issue
3. Searches for relevant repo documentation (top 10, threshold 0.5)
4. Concatenates multiple docs per repo with separators
5. Passes definitions to LLM for hybrid routing decision

All operations are non-blocking with detailed logging.
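
Steps 3–4 — grouping search hits by repository and concatenating multiple files — can be sketched like this; the `hit` struct and `aggregateDefinitions` helper are illustrative stand-ins for the real types in `llm_router.go`:

```go
package main

import "fmt"

// hit is a simplified stand-in for a vector-search result payload.
type hit struct {
	org, repo, file, text string
}

// aggregateDefinitions groups documents by org/repo and joins
// multiple files with "--- <file> ---" separators, always including
// the header (the consistency variant suggested in review).
func aggregateDefinitions(hits []hit) map[string]string {
	defs := make(map[string]string)
	for _, h := range hits {
		key := fmt.Sprintf("%s/%s", h.org, h.repo)
		block := fmt.Sprintf("--- %s ---\n\n%s", h.file, h.text)
		if existing, ok := defs[key]; ok {
			defs[key] = existing + "\n\n" + block
		} else {
			defs[key] = block
		}
	}
	return defs
}

func main() {
	hits := []hit{
		{"my-org", "backend", "README.md", "API service docs"},
		{"my-org", "backend", "CONTRIBUTING.md", "How to contribute"},
	}
	fmt.Println(aggregateDefinitions(hits)["my-org/backend"])
}
```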

coderabbitai Bot commented Feb 5, 2026

📝 Walkthrough

Walkthrough

Adds a new simili learn Cobra command to fetch repo files from GitHub, embed them with Gemini, and upsert into Qdrant; adds RepoCollection config, enriches repository candidates with stored definitions, and wires an embedder/vector store into the LLM router for pre-routing enrichment.

Changes

Cohort / File(s) Summary
Learn command
cmd/simili/commands/learn.go
New Cobra command to fetch a file from GitHub, validate path, generate Gemini embedding, build deterministic SHA-256 ID, and upsert a point into a Qdrant collection; supports flags (org, repo, file, token, dry-run) and dry-run synopsis.
Config
internal/core/config/config.go
Adds RepoCollection string to TransferConfig, defaulting to "simili_repos" in applyDefaults and merged from child configs when provided.
Gemini model types
internal/integrations/gemini/llm.go
Adds Definition string field to RepositoryCandidate to carry full repository documentation.
Gemini prompts/utilities
internal/integrations/gemini/prompts.go
Adds truncate and indentText helpers; refactors route prompt builder to include per-repo Documentation blocks, conditional weighting guidance, new output format (confidence + reasoning), and consolidated truncate implementation.
LLM router enrichment
internal/steps/llm_router.go
Adds embedder and qdrant vector store fields; conditionally initializes embedder/store when RepoCollection configured; builds issue embedding, searches repo collection, aggregates per-repo definitions (with truncation), and enriches candidates before LLM routing.
Cross-cutting
...
New wiring and non-blocking logging for collection existence checks, embedding/search errors, and enriched candidate population; mostly additive logic and new private helpers.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant GitHub as GitHub API
    participant Embedder as Gemini Embedder
    participant Qdrant as Qdrant

    User->>Embedder: simili learn (org, repo, file)
    Embedder->>GitHub: fetch file content
    GitHub-->>Embedder: file content
    Embedder->>Embedder: generate embedding
    Embedder->>Qdrant: ensure collection & upsert point (id, embedding, payload)
    Qdrant-->>Embedder: upsert result
    Embedder-->>User: success / synopsis (dry-run)
sequenceDiagram
    participant Issue as User Issue
    participant Router as LLMRouter
    participant Embedder as Gemini Embedder
    participant Qdrant as Qdrant
    participant LLM as LLM Engine

    Issue->>Router: route request
    Router->>Embedder: generate issue embedding
    Embedder-->>Router: embedding vector
    Router->>Qdrant: search repo collection (embedding)
    Qdrant-->>Router: matching repo documents
    Router->>Router: aggregate/truncate per-repo definitions
    Router->>LLM: invoke with enriched candidates
    LLM-->>Router: routing decision
    Router-->>Issue: routed repo

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Poem

🐰 I hopped to GitHub, nibbed the README bright,
Embeddings I whispered into Qdrant at night,
Definitions tucked snug in each candidate's nest,
Routers now ponder with documentation zest,
Hop — learned and indexed — now routing feels right! 📚✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main objective of the PR: implementing a hybrid routing system that combines repository documentation with historical issue patterns through a new learning capability.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/steps/llm_router.go (1)

66-79: ⚠️ Potential issue | 🟠 Major

Allow repos without descriptions so repo-docs can drive routing.

The repo.Description == "" filter drops repos that only have documentation in the repo collection, which defeats the new doc-based routing path. Consider allowing enabled repos and relying on definitions (or descriptions) later.

🔧 Suggested fix
-		if !repo.Enabled || repoName == currentRepo || repo.Description == "" {
+		if !repo.Enabled || repoName == currentRepo {
 			continue
 		}


Copilot AI left a comment


Pull request overview

This PR implements a hybrid routing system that enhances issue routing by combining repository documentation (learned via a new simili learn command) with historical issue patterns. The system addresses the "cold start problem" by enabling effective routing even when no historical issues exist.

Changes:

  • Added a new simili learn command to index repository documentation (README.md, CONTRIBUTING.md, etc.) into a separate Qdrant collection
  • Enhanced the LLM router to perform semantic search on repository documentation and include full definitions in routing prompts
  • Updated routing prompts with 60/40 weighting guidance (60% documentation, 40% historical patterns)
  • Added configuration support for the repository documentation collection with proper defaults and merge logic
  • Extended the RepositoryCandidate data structure to include full documentation definitions

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
cmd/simili/commands/learn.go New command implementation for indexing repository documentation files from GitHub into Qdrant
internal/steps/llm_router.go Enhanced router with semantic search to fetch and include repository definitions in routing decisions
internal/integrations/gemini/prompts.go Updated routing prompt with documentation support, weighting guidance, and new helper functions (truncate moved, indentText added)
internal/integrations/gemini/llm.go Added Definition field to RepositoryCandidate struct for storing full repository documentation
internal/core/config/config.go Added RepoCollection field to TransferConfig with default value and merge support


Comment on lines +130 to +141
point := &qdrant.Point{
	ID:     uuid.New().String(),
	Vector: embedding,
	Payload: map[string]interface{}{
		"org":        learnOrg,
		"repo":       learnRepo,
		"file":       learnFile,
		"text":       contentStr,
		"indexed_at": time.Now().Format(time.RFC3339),
		"type":       "repo_doc",
	},
}
Copilot AI Feb 5, 2026

The UUID generation for point IDs could lead to duplicate documents in the collection if the same file is learned multiple times. Each time "simili learn" is run for the same org/repo/file combination, it creates a new point with a new UUID, resulting in duplicate entries. Consider using a deterministic ID based on org/repo/file (e.g., hash or concatenation) to enable idempotent operations where re-learning the same file updates the existing entry rather than creating duplicates.

Comment on lines +121 to +127
// 6. Generate Embedding
log.Printf("Generating embedding...")
embedding, err := embedder.Embed(ctx, contentStr)
if err != nil {
	log.Fatalf("Failed to generate embedding: %v", err)
}
log.Printf("Generated embedding with %d dimensions", len(embedding))
Copilot AI Feb 5, 2026

The learn command doesn't implement rate limiting for the Gemini embedding API. If users run multiple learn commands in quick succession or try to index many files, they could hit API rate limits and experience failures. Consider adding retry logic with exponential backoff or at least documenting the rate limit considerations in the command's help text. The index command likely has similar considerations since it's shown importing the same packages.

Comment on lines +138 to +140
if len(definition) > 2000 {
	definition = definition[:2000] + "\n... (truncated)"
}
Copilot AI Feb 5, 2026

The same UTF-8 truncation issue exists here. When truncating the definition to 2000 characters using definition[:2000], you may split a multi-byte UTF-8 character, resulting in an invalid UTF-8 string. This could cause issues when the truncated definition is used in the prompt. Consider using a rune-aware truncation approach to ensure UTF-8 safety.

Comment on lines +140 to +145
// First document for this repo
if file != "README.md" {
	repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
} else {
	repoDefinitions[repoKey] = text
}
Copilot AI Feb 5, 2026

The code uses hardcoded file name matching logic that could be fragile. The special case for "README.md" (line 141) means that if the first file encountered is README.md, it gets no header, but if a non-README is encountered first, then README.md would get a header when added later. This inconsistency could make the output unpredictable. Consider either always including file headers for consistency, or ordering the results to ensure README.md is always processed first if present.

Suggested change
// First document for this repo
if file != "README.md" {
	repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
} else {
	repoDefinitions[repoKey] = text
}
// First document for this repo; include file header for consistency
repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text

Comment on lines +99 to +104
qdrantClient, err = qdrant.NewClient(cfg.Qdrant.URL, cfg.Qdrant.APIKey)
if err != nil {
	log.Fatalf("Failed to initialize Qdrant client: %v", err)
}
defer qdrantClient.Close()
}
Copilot AI Feb 5, 2026

Missing error handling for Qdrant client initialization. If qdrant.NewClient returns an error but the client is still non-nil (though this is unlikely), the defer statement could attempt to close a partially initialized client. While the current Qdrant client likely handles this correctly, it would be more defensive to check the error before setting up the defer.

Comment thread internal/integrations/gemini/prompts.go Outdated
Comment on lines +13 to +18
// truncate limits a string to a maximum length.
func truncate(s string, maxLen int) string {
	if len(s) <= maxLen {
		return s
	}
	return s[:maxLen] + "..."
Copilot AI Feb 5, 2026

The truncate function may split multi-byte UTF-8 characters, resulting in invalid UTF-8 strings. When slicing a string in Go by byte index (s[:maxLen]), if maxLen falls in the middle of a multi-byte character, the resulting string will be corrupted. This could cause issues when the truncated text is used in prompts or stored in the database. Consider using a rune-aware truncation approach or the strings.Cut function to ensure you don't break UTF-8 characters.

Suggested change
// truncate limits a string to a maximum length.
func truncate(s string, maxLen int) string {
	if len(s) <= maxLen {
		return s
	}
	return s[:maxLen] + "..."
// truncate limits a string to a maximum length in runes, preserving valid UTF-8.
func truncate(s string, maxLen int) string {
	runes := []rune(s)
	if len(runes) <= maxLen {
		return s
	}
	return string(runes[:maxLen]) + "..."

Comment on lines +1 to +170
// Author: Kaviru Hapuarachchi
// GitHub: https://github.com/kavirubc
// Created: 2026-02-05
// Last Modified: 2026-02-05

package commands

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/google/uuid"
	similiConfig "github.com/similigh/simili-bot/internal/core/config"
	"github.com/similigh/simili-bot/internal/integrations/gemini"
	similiGithub "github.com/similigh/simili-bot/internal/integrations/github"
	"github.com/similigh/simili-bot/internal/integrations/qdrant"
	"github.com/spf13/cobra"
)

var (
	learnOrg    string
	learnRepo   string
	learnFile   string
	learnToken  string
	learnDryRun bool
)

// learnCmd represents the learn command
var learnCmd = &cobra.Command{
	Use:   "learn",
	Short: "Index repository documentation for hybrid routing",
	Long: `Learn repository documentation by fetching files from GitHub (README.md,
CONTRIBUTING.md, etc.) and indexing them into a separate Qdrant collection.

This enables the bot to understand what each repository is responsible for,
improving routing decisions even with zero historical issues (cold start).

Examples:
  simili learn --org my-org --repo backend --file README.md
  simili learn --org my-org --repo backend --file CONTRIBUTING.md
  simili learn --org my-org --repo backend --file docs/ARCHITECTURE.md --dry-run`,
	Run: runLearn,
}

func init() {
	rootCmd.AddCommand(learnCmd)

	learnCmd.Flags().StringVar(&learnOrg, "org", "", "Organization name (required)")
	learnCmd.Flags().StringVar(&learnRepo, "repo", "", "Repository name (required)")
	learnCmd.Flags().StringVar(&learnFile, "file", "README.md", "File path to learn (default: README.md)")
	learnCmd.Flags().StringVar(&learnToken, "token", "", "GitHub token (optional, defaults to GITHUB_TOKEN env var)")
	learnCmd.Flags().BoolVar(&learnDryRun, "dry-run", false, "Simulate without writing to database")

	if err := learnCmd.MarkFlagRequired("org"); err != nil {
		log.Fatalf("Failed to mark org flag as required: %v", err)
	}
	if err := learnCmd.MarkFlagRequired("repo"); err != nil {
		log.Fatalf("Failed to mark repo flag as required: %v", err)
	}
}

func runLearn(cmd *cobra.Command, args []string) {
	ctx := context.Background()

	// 1. Load Config
	cfgPath := similiConfig.FindConfigPath(cfgFile)
	if cfgPath == "" {
		log.Fatalf("Config file not found. Please verify your setup.")
	}
	cfg, err := similiConfig.Load(cfgPath)
	if err != nil {
		log.Fatalf("Failed to load config: %v", err)
	}

	// 2. Initialize GitHub Client
	token := learnToken
	if token == "" {
		token = os.Getenv("GITHUB_TOKEN")
	}
	if token == "" {
		log.Fatalf("GitHub token is required (use --token or GITHUB_TOKEN env var)")
	}

	ghClient := similiGithub.NewClient(ctx, token)

	// 3. Initialize Embedder
	embedder, err := gemini.NewEmbedder(cfg.Embedding.APIKey, cfg.Embedding.Model)
	if err != nil {
		log.Fatalf("Failed to initialize Gemini embedder: %v", err)
	}
	defer embedder.Close()

	// 4. Initialize Qdrant Client (unless dry-run)
	var qdrantClient *qdrant.Client
	if !learnDryRun {
		qdrantClient, err = qdrant.NewClient(cfg.Qdrant.URL, cfg.Qdrant.APIKey)
		if err != nil {
			log.Fatalf("Failed to initialize Qdrant client: %v", err)
		}
		defer qdrantClient.Close()
	}

	// 5. Fetch File from GitHub
	log.Printf("Fetching %s from %s/%s...", learnFile, learnOrg, learnRepo)
	content, err := ghClient.GetFileContent(ctx, learnOrg, learnRepo, learnFile, "")
	if err != nil {
		log.Fatalf("Failed to fetch %s from %s/%s: %v", learnFile, learnOrg, learnRepo, err)
	}

	if len(content) == 0 {
		log.Printf("Warning: File %s is empty, skipping", learnFile)
		return
	}

	contentStr := string(content)
	log.Printf("Fetched %d characters from %s", len(contentStr), learnFile)

	// 6. Generate Embedding
	log.Printf("Generating embedding...")
	embedding, err := embedder.Embed(ctx, contentStr)
	if err != nil {
		log.Fatalf("Failed to generate embedding: %v", err)
	}
	log.Printf("Generated embedding with %d dimensions", len(embedding))

	// 7. Create Point with Rich Payload
	point := &qdrant.Point{
		ID:     uuid.New().String(),
		Vector: embedding,
		Payload: map[string]interface{}{
			"org":        learnOrg,
			"repo":       learnRepo,
			"file":       learnFile,
			"text":       contentStr,
			"indexed_at": time.Now().Format(time.RFC3339),
			"type":       "repo_doc",
		},
	}

	// 8. Dry Run Check
	if learnDryRun {
		fmt.Printf("🔍 [DRY RUN] Would index %s/%s/%s\n", learnOrg, learnRepo, learnFile)
		fmt.Printf("📊 Collection: %s\n", cfg.Transfer.RepoCollection)
		fmt.Printf("📝 Content: %d characters\n", len(contentStr))
		fmt.Printf("🔢 Embedding: %d dimensions\n", len(embedding))
		return
	}

	// 9. Ensure Collection Exists
	repoCollection := cfg.Transfer.RepoCollection
	log.Printf("Ensuring collection '%s' exists...", repoCollection)
	if err := qdrantClient.CreateCollection(ctx, repoCollection, cfg.Embedding.Dimensions); err != nil {
		log.Fatalf("Failed to create/verify collection: %v", err)
	}

	// 10. Upsert to Qdrant
	log.Printf("Indexing to collection '%s'...", repoCollection)
	if err := qdrantClient.Upsert(ctx, repoCollection, []*qdrant.Point{point}); err != nil {
		log.Fatalf("Failed to index document: %v", err)
	}

	// Success Output
	fmt.Printf("✅ Successfully indexed %s/%s/%s\n", learnOrg, learnRepo, learnFile)
	fmt.Printf("📊 Collection: %s\n", repoCollection)
	fmt.Printf("📝 Content: %d characters\n", len(contentStr))
	fmt.Printf("🔢 Embedding: %d dimensions\n", len(embedding))
}
Copilot AI Feb 5, 2026

The code lacks test coverage for the new hybrid routing functionality. While there are existing tests for prompts (gemini_test.go), there are no tests for: (1) the new learn command, (2) the enhanced LLM router with semantic search and multi-file support, (3) the new helper functions (indentText), or (4) the updated routing prompt with weighting guidance. Adding tests would help ensure the hybrid routing feature works correctly and prevent regressions.

Comment thread internal/steps/llm_router.go Outdated
Comment on lines +135 to +146
// If multiple docs for same repo, concatenate with separators
if existing, ok := repoDefinitions[repoKey]; ok {
	// Add file header for clarity
	repoDefinitions[repoKey] = existing + fmt.Sprintf("\n\n--- %s ---\n\n", file) + text
} else {
	// First document for this repo
	if file != "README.md" {
		repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
	} else {
		repoDefinitions[repoKey] = text
	}
}
Copilot AI Feb 5, 2026

Potential unbounded memory growth when concatenating multiple documentation files for the same repository. If a repository has many documentation files that match the semantic search (e.g., 10 files with 2000 characters each), the concatenated definition could grow to 20,000+ characters. While individual files are truncated to 2000 characters, there's no limit on the total concatenated size. Consider adding a maximum total size limit for concatenated definitions to prevent token explosion and memory issues.

Comment on lines +87 to +161
// Fetch repository definitions from repo collection (if available)
repoDefinitions := make(map[string]string) // map[org/repo]full_definition

if s.embedder != nil && s.store != nil && ctx.Config.Transfer.RepoCollection != "" {
	// Check if repo collection exists
	exists, err := s.store.CollectionExists(ctx.Ctx, ctx.Config.Transfer.RepoCollection)
	if err != nil {
		log.Printf("[llm_router] Error checking repo collection: %v (non-blocking)", err)
	} else if !exists {
		log.Printf("[llm_router] Repo collection '%s' not found, using config descriptions",
			ctx.Config.Transfer.RepoCollection)
	} else {
		// Generate embedding for issue to find relevant repos
		issueContent := fmt.Sprintf("%s\n\n%s", ctx.Issue.Title, ctx.Issue.Body)
		issueEmbedding, err := s.embedder.Embed(ctx.Ctx, issueContent)

		if err != nil {
			log.Printf("[llm_router] Error generating issue embedding: %v (non-blocking)", err)
		} else {
			// Search for relevant repository definitions
			// Use broader threshold (0.5) and higher limit (10) to get context
			results, err := s.store.Search(
				ctx.Ctx,
				ctx.Config.Transfer.RepoCollection,
				issueEmbedding,
				10,  // Get top 10 relevant repo docs (more than we'll route to)
				0.5, // Lower threshold than issues (0.65-0.7) for broader context
			)

			if err != nil {
				log.Printf("[llm_router] Error searching repo collection: %v (non-blocking)", err)
			} else if len(results) == 0 {
				log.Printf("[llm_router] No repository definitions found")
			} else {
				// Collect definitions per repo (handle multiple files per repo)
				for _, res := range results {
					org, _ := res.Payload["org"].(string)
					repo, _ := res.Payload["repo"].(string)
					text, _ := res.Payload["text"].(string)
					file, _ := res.Payload["file"].(string)

					if org == "" || repo == "" || text == "" {
						log.Printf("[llm_router] Invalid repo definition payload, skipping")
						continue
					}

					repoKey := fmt.Sprintf("%s/%s", org, repo)

					// If multiple docs for same repo, concatenate with separators
					if existing, ok := repoDefinitions[repoKey]; ok {
						// Add file header for clarity
						repoDefinitions[repoKey] = existing + fmt.Sprintf("\n\n--- %s ---\n\n", file) + text
					} else {
						// First document for this repo
						if file != "README.md" {
							repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
						} else {
							repoDefinitions[repoKey] = text
						}
					}
				}
				log.Printf("[llm_router] Loaded %d repository definitions from %d documents",
					len(repoDefinitions), len(results))
			}
		}
	}
}

// Enhance candidates with full definitions
for i := range candidates {
	repoKey := fmt.Sprintf("%s/%s", candidates[i].Org, candidates[i].Repo)
	if def, ok := repoDefinitions[repoKey]; ok {
		candidates[i].Definition = def
	}
}
Copilot AI Feb 5, 2026

The semantic search for repository definitions only searches in the repo collection, which means repositories that haven't been indexed via the learn command won't be found. However, all enabled repositories from the config are included in the candidates list (lines 70-80), so unindexed repositories will still be considered for routing but will only have their short config descriptions available. This is by design for graceful degradation, but it might be worth documenting this behavior more clearly in the code comments to help future maintainers understand the fallback logic.

Comment thread cmd/simili/commands/learn.go Outdated
Comment on lines +106 to +111
// 5. Fetch File from GitHub
log.Printf("Fetching %s from %s/%s...", learnFile, learnOrg, learnRepo)
content, err := ghClient.GetFileContent(ctx, learnOrg, learnRepo, learnFile, "")
if err != nil {
	log.Fatalf("Failed to fetch %s from %s/%s: %v", learnFile, learnOrg, learnRepo, err)
}
Copilot AI Feb 5, 2026

The code doesn't validate whether the specified file path is within the repository bounds, potentially allowing path traversal attempts. While the GitHub API itself likely validates this, adding explicit validation that the file path doesn't contain patterns like "../" would make the security boundary clearer and prevent potential issues if the GitHub client implementation changes. Consider adding path validation before calling GetFileContent.

Fixes based on Copilot and CodeRabbit review:
- Use deterministic IDs (SHA-256 hash of org/repo/file) for idempotent learn operations
- Add path validation to prevent path traversal attempts
- Fix UTF-8 safe truncation using runes instead of bytes
- Always include file headers for consistency in multi-file concatenation
- Add max total size limit (5000 chars) per repo to prevent unbounded memory growth
- Improve logging for truncation events

These changes ensure:
- Re-learning the same file updates existing entry instead of creating duplicates
- No UTF-8 character corruption in truncated strings
- Consistent formatting for all documentation files
- Protection against token explosion from large concatenated definitions
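
The deterministic-ID part of this fix can be sketched as below. The key format and hash encoding are assumptions (and note that real Qdrant point IDs must be an unsigned integer or a UUID, so in practice the digest would be mapped into one of those forms):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// deterministicID derives a stable identifier from org/repo/file so
// re-learning the same file upserts over the previous entry instead
// of creating a duplicate. Sketch only; the commit may use a
// different key layout or encode the digest as a UUID.
func deterministicID(org, repo, file string) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s/%s/%s", org, repo, file)))
	return fmt.Sprintf("%x", sum)
}

func main() {
	fmt.Println(deterministicID("my-org", "backend", "README.md"))
}
```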

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@internal/integrations/gemini/prompts.go`:
- Around line 138-141: The truncation of r.Definition uses byte-length and
slicing which can break UTF-8; replace the manual byte-based check and slice
(len(definition) and definition[:2000]) with the existing rune-safe helper
truncate(), e.g. call truncate(definition, 2000) (or adjust truncate to accept a
custom suffix if you want "... (truncated)" instead of its default) so that
Definition is truncated safely without corrupting multibyte characters.

In `@internal/steps/llm_router.go`:
- Around line 141-158: The code currently truncates strings with byte slicing
(combined[:maxTotalSizePerRepo] and newDoc[:maxTotalSizePerRepo]) which can
split multi-byte UTF-8 runes; replace those slices with a rune-safe truncation
helper (e.g., truncateToByteLimitSafely) that iterates runes and accumulates
byte length until adding the next rune would exceed maxTotalSizePerRepo, then
returns the substring up to the last whole rune plus the "\n... (truncated)"
suffix; call this helper for both the combined and newDoc truncation branches
(referencing repoDefinitions, combined, newDoc, and maxTotalSizePerRepo to
locate the changes).
🧹 Nitpick comments (2)
cmd/simili/commands/learn.go (2)

108-113: Path validation may be insufficient on some edge cases.

The current validation uses filepath.Clean followed by string checks. However, filepath.Clean may normalize paths differently across platforms. Relative segments are handled correctly — foo/../bar is cleaned to bar and passes the check — but a legitimate filename that merely contains dots, such as ..foo, would be incorrectly rejected by the Contains("..") check.

Consider using filepath.IsAbs and checking if the cleaned path escapes the intended root:

♻️ More robust path validation
```diff
 	// 5. Validate and fetch file from GitHub
 	// Clean path and validate (prevent path traversal)
 	cleanPath := filepath.Clean(learnFile)
-	if strings.Contains(cleanPath, "..") || strings.HasPrefix(cleanPath, "/") {
+	if filepath.IsAbs(cleanPath) || strings.HasPrefix(cleanPath, "..") {
 		log.Fatalf("Invalid file path: %s (path traversal not allowed)", learnFile)
 	}
```

This allows legitimate filenames containing .. (e.g., docs/foo..bar.md) while still blocking traversal attempts.
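The suggested check can be exercised in isolation. This is a hedged sketch — the function name `validateLearnPath` is illustrative, not the actual code in learn.go — but the validation logic is exactly the one proposed above, and the examples show which paths it accepts and rejects:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// validateLearnPath rejects absolute paths and anything that still escapes
// the repo root after cleaning, while allowing filenames that merely
// contain "..".
func validateLearnPath(p string) error {
	clean := filepath.Clean(p)
	if filepath.IsAbs(clean) || strings.HasPrefix(clean, "..") {
		return fmt.Errorf("invalid file path: %s (path traversal not allowed)", p)
	}
	return nil
}

func main() {
	for _, p := range []string{"docs/foo..bar.md", "../etc/passwd", "/etc/passwd", "foo/../bar"} {
		fmt.Println(p, validateLearnPath(p) == nil)
	}
	// docs/foo..bar.md true   (dotted filename allowed)
	// ../etc/passwd false     (traversal rejected)
	// /etc/passwd false       (absolute path rejected)
	// foo/../bar true         (cleans to "bar", stays inside root)
}
```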


121-135: Consider adding a size limit for large files.

While the system handles empty files, there's no upper bound on file size. Very large files could cause memory issues during embedding generation or exceed API limits.

💡 Optional: Add maximum file size check
```diff
 	if len(content) == 0 {
 		log.Printf("Warning: File %s is empty, skipping", learnFile)
 		return
 	}
+
+	const maxFileSize = 100 * 1024 // 100KB limit
+	if len(content) > maxFileSize {
+		log.Printf("Warning: File %s exceeds size limit (%d > %d bytes), truncating",
+			learnFile, len(content), maxFileSize)
+		content = content[:maxFileSize]
+	}

 	contentStr := string(content)
```

Comment on lines +138 to +141

```go
definition := r.Definition
if len(definition) > 2000 {
	definition = definition[:2000] + "\n... (truncated)"
}
```

⚠️ Potential issue | 🟡 Minor

Use rune-based truncation for UTF-8 safety.

The truncation here uses byte-length (len(definition)) and byte slicing (definition[:2000]), which can corrupt multi-byte UTF-8 characters. The truncate() helper defined at lines 13-20 already handles this correctly.

🛠️ Proposed fix

```diff
-			definition := r.Definition
-			if len(definition) > 2000 {
-				definition = definition[:2000] + "\n... (truncated)"
-			}
+			definition := truncate(r.Definition, 2000)
+			if definition != r.Definition {
+				definition += "\n(truncated)"
+			}
```

Note: The truncate() function already appends "...", so you may want to adjust the suffix or modify truncate() to accept a custom suffix parameter.
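A suffix-aware variant of such a helper might look like the sketch below. The name `truncateWithSuffix` is illustrative — the actual truncate() in prompts.go may differ in name and default suffix — but it shows the rune-based cut that avoids splitting multi-byte characters:

```go
package main

import "fmt"

// truncateWithSuffix cuts s at a rune boundary (never mid-character) and
// appends the caller's suffix only when truncation actually happened.
func truncateWithSuffix(s string, max int, suffix string) string {
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	return string(r[:max]) + suffix
}

func main() {
	fmt.Println(truncateWithSuffix("héllo wörld", 5, "...")) // héllo...
}
```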


Comment on lines +141 to +158

```go
// If multiple docs for same repo, concatenate with size limit
if existing, ok := repoDefinitions[repoKey]; ok {
	combined := existing + "\n\n" + newDoc
	// Enforce max total size per repo
	if len(combined) > maxTotalSizePerRepo {
		log.Printf("[llm_router] Truncating %s definition (size: %d > %d)",
			repoKey, len(combined), maxTotalSizePerRepo)
		combined = combined[:maxTotalSizePerRepo] + "\n... (truncated)"
	}
	repoDefinitions[repoKey] = combined
} else {
	// First document for this repo
	if len(newDoc) > maxTotalSizePerRepo {
		newDoc = newDoc[:maxTotalSizePerRepo] + "\n... (truncated)"
	}
	repoDefinitions[repoKey] = newDoc
}
```

⚠️ Potential issue | 🟡 Minor

Byte-based truncation may corrupt UTF-8 characters.

The truncation at lines 145-148 and 153-154 uses byte slicing (combined[:maxTotalSizePerRepo]), which can split multi-byte UTF-8 characters. This is the same issue as in prompts.go.

🛠️ Proposed fix using rune-safe truncation

```diff
+// truncateRunes safely truncates a string to maxLen runes.
+func truncateRunes(s string, maxLen int) string {
+	runes := []rune(s)
+	if len(runes) <= maxLen {
+		return s
+	}
+	return string(runes[:maxLen])
+}

 // Inside the enrichment loop:
 						// Enforce max total size per repo
-						if len(combined) > maxTotalSizePerRepo {
+						if len([]rune(combined)) > maxTotalSizePerRepo {
 							log.Printf("[llm_router] Truncating %s definition (size: %d > %d)",
 								repoKey, len(combined), maxTotalSizePerRepo)
-							combined = combined[:maxTotalSizePerRepo] + "\n... (truncated)"
+							combined = truncateRunes(combined, maxTotalSizePerRepo) + "\n... (truncated)"
 						}
```

@Kavirubc Kavirubc merged commit 7b1ab83 into main Feb 5, 2026
4 checks passed
@Kavirubc Kavirubc deleted the feature/hybrid-index-learn branch February 5, 2026 05:51
@Kavirubc Kavirubc linked an issue Feb 5, 2026 that may be closed by this pull request
Kavirubc added a commit to similigh/docs that referenced this pull request Feb 5, 2026
Add reference documentation for the new 'simili learn' command used to index
repository documentation for hybrid routing.

Update TransferConfig schema to document new fields:
- high_confidence: threshold for auto-transferring issues
- medium_confidence: threshold for suggesting transfers
- repo_collection: Qdrant collection for repository documentation

Related PRs: similigh/simili-bot#32 (hybrid routing implementation)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
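The documented TransferConfig fields can be sketched as a Go struct. The field names and the "simili_repos" default come from this PR; the yaml tags and the threshold values in the example are assumptions about the on-disk schema, not confirmed code:

```go
package main

import "fmt"

// TransferConfig mirrors the fields documented above.
type TransferConfig struct {
	HighConfidence   float64 `yaml:"high_confidence"`   // auto-transfer threshold
	MediumConfidence float64 `yaml:"medium_confidence"` // suggest-transfer threshold
	RepoCollection   string  `yaml:"repo_collection"`   // Qdrant collection for repo docs
}

// applyDefaults fills the documented default collection name when unset.
func applyDefaults(c TransferConfig) TransferConfig {
	if c.RepoCollection == "" {
		c.RepoCollection = "simili_repos"
	}
	return c
}

func main() {
	// Threshold values here are illustrative.
	c := applyDefaults(TransferConfig{HighConfidence: 0.85, MediumConfidence: 0.6})
	fmt.Println(c.RepoCollection) // simili_repos
}
```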

Successfully merging this pull request may close these issues.

Feature: Hybrid Routing with Repository Knowledge Base
