feat: implement hybrid routing with repository documentation learning #32
Add `RepoCollection` field to `TransferConfig` to support a separate Qdrant collection for repository documentation. Defaults to `"simili_repos"` and includes proper merge logic for config inheritance.
Add `Definition` field to store full repository documentation for enhanced LLM routing decisions in the hybrid routing system.
Add new `learn` command that fetches repository documentation (README.md, CONTRIBUTING.md, etc.) from GitHub and indexes it into a separate Qdrant collection for hybrid routing. Supports:
- Fetching any file from the GitHub API
- Embedding with Gemini
- Indexing to a configurable repo collection
- Dry-run mode for testing
- Rich payload with org, repo, file, text, timestamp

This enables routing from Day 1 without historical issues.
Update `buildRouteIssuePrompt` to:
- Include full repository documentation when available
- Add qualitative 60/40 weighting guidance (docs vs history)
- Truncate long documentation to prevent token explosion
- Gracefully fall back to descriptions when no docs are available
- Add `indentText` helper for proper formatting

This enables the LLM to prioritize repository documentation as the authoritative source for routing decisions while still considering historical precedents.
Update `LLMRouter` to:
- Add embedder and vector store dependencies
- Fetch repo definitions via semantic search before routing
- Support multiple documentation files per repository
- Populate the `Definition` field in candidates for the enhanced LLM prompt
- Include comprehensive error handling with graceful fallbacks

The router now:
1. Checks if the repo collection exists
2. Generates an embedding for the issue
3. Searches for relevant repo documentation (top 10, threshold 0.5)
4. Concatenates multiple docs per repo with separators
5. Passes definitions to the LLM for the hybrid routing decision

All operations are non-blocking with detailed logging.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant GitHub as GitHub API
    participant Embedder as Gemini Embedder
    participant Qdrant as Qdrant
    User->>Embedder: simili learn (org, repo, file)
    Embedder->>GitHub: fetch file content
    GitHub-->>Embedder: file content
    Embedder->>Embedder: generate embedding
    Embedder->>Qdrant: ensure collection & upsert point (id, embedding, payload)
    Qdrant-->>Embedder: upsert result
    Embedder-->>User: success / synopsis (dry-run)
```
```mermaid
sequenceDiagram
    participant Issue as User Issue
    participant Router as LLMRouter
    participant Embedder as Gemini Embedder
    participant Qdrant as Qdrant
    participant LLM as LLM Engine
    Issue->>Router: route request
    Router->>Embedder: generate issue embedding
    Embedder-->>Router: embedding vector
    Router->>Qdrant: search repo collection (embedding)
    Qdrant-->>Router: matching repo documents
    Router->>Router: aggregate/truncate per-repo definitions
    Router->>LLM: invoke with enriched candidates
    LLM-->>Router: routing decision
    Router-->>Issue: routed repo
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: 2 passed, 1 failed (warning)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/steps/llm_router.go (1)
66-79: ⚠️ Potential issue | 🟠 Major

Allow repos without descriptions so repo-docs can drive routing.

The `repo.Description == ""` filter drops repos that only have documentation in the repo collection, which defeats the new doc-based routing path. Consider allowing enabled repos and relying on definitions (or descriptions) later.

🔧 Suggested fix

```diff
-	if !repo.Enabled || repoName == currentRepo || repo.Description == "" {
+	if !repo.Enabled || repoName == currentRepo {
 		continue
 	}
```
Pull request overview
This PR implements a hybrid routing system that enhances issue routing by combining repository documentation (learned via a new simili learn command) with historical issue patterns. The system addresses the "cold start problem" by enabling effective routing even when no historical issues exist.
Changes:
- Added a new `simili learn` command to index repository documentation (README.md, CONTRIBUTING.md, etc.) into a separate Qdrant collection
- Enhanced the LLM router to perform semantic search on repository documentation and include full definitions in routing prompts
- Updated routing prompts with 60/40 weighting guidance (60% documentation, 40% historical patterns)
- Added configuration support for the repository documentation collection with proper defaults and merge logic
- Extended the `RepositoryCandidate` data structure to include full documentation definitions
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| cmd/simili/commands/learn.go | New command implementation for indexing repository documentation files from GitHub into Qdrant |
| internal/steps/llm_router.go | Enhanced router with semantic search to fetch and include repository definitions in routing decisions |
| internal/integrations/gemini/prompts.go | Updated routing prompt with documentation support, weighting guidance, and new helper functions (truncate moved, indentText added) |
| internal/integrations/gemini/llm.go | Added Definition field to RepositoryCandidate struct for storing full repository documentation |
| internal/core/config/config.go | Added RepoCollection field to TransferConfig with default value and merge support |
```go
	point := &qdrant.Point{
		ID:     uuid.New().String(),
		Vector: embedding,
		Payload: map[string]interface{}{
			"org":        learnOrg,
			"repo":       learnRepo,
			"file":       learnFile,
			"text":       contentStr,
			"indexed_at": time.Now().Format(time.RFC3339),
			"type":       "repo_doc",
		},
	}
```
The UUID generation for point IDs could lead to duplicate documents in the collection if the same file is learned multiple times. Each time "simili learn" is run for the same org/repo/file combination, it creates a new point with a new UUID, resulting in duplicate entries. Consider using a deterministic ID based on org/repo/file (e.g., hash or concatenation) to enable idempotent operations where re-learning the same file updates the existing entry rather than creating duplicates.
```go
	// 6. Generate Embedding
	log.Printf("Generating embedding...")
	embedding, err := embedder.Embed(ctx, contentStr)
	if err != nil {
		log.Fatalf("Failed to generate embedding: %v", err)
	}
	log.Printf("Generated embedding with %d dimensions", len(embedding))
```
The learn command doesn't implement rate limiting for the Gemini embedding API. If users run multiple learn commands in quick succession or try to index many files, they could hit API rate limits and experience failures. Consider adding retry logic with exponential backoff or at least documenting the rate limit considerations in the command's help text. The index command likely has similar considerations since it's shown importing the same packages.
```go
	if len(definition) > 2000 {
		definition = definition[:2000] + "\n... (truncated)"
	}
```
The same UTF-8 truncation issue exists here. When truncating the definition to 2000 characters using definition[:2000], you may split a multi-byte UTF-8 character, resulting in an invalid UTF-8 string. This could cause issues when the truncated definition is used in the prompt. Consider using a rune-aware truncation approach to ensure UTF-8 safety.
```go
	// First document for this repo
	if file != "README.md" {
		repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
	} else {
		repoDefinitions[repoKey] = text
	}
```
The code uses hardcoded file name matching logic that could be fragile. The special case for "README.md" (line 141) means that if the first file encountered is README.md, it gets no header, but if a non-README is encountered first, then README.md would get a header when added later. This inconsistency could make the output unpredictable. Consider either always including file headers for consistency, or ordering the results to ensure README.md is always processed first if present.
Suggested change:

```diff
-	// First document for this repo
-	if file != "README.md" {
-		repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
-	} else {
-		repoDefinitions[repoKey] = text
-	}
+	// First document for this repo; include file header for consistency
+	repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
```
```go
		qdrantClient, err = qdrant.NewClient(cfg.Qdrant.URL, cfg.Qdrant.APIKey)
		if err != nil {
			log.Fatalf("Failed to initialize Qdrant client: %v", err)
		}
		defer qdrantClient.Close()
	}
```
Missing error handling for Qdrant client initialization. If qdrant.NewClient returns an error but the client is still non-nil (though this is unlikely), the defer statement could attempt to close a partially initialized client. While the current Qdrant client likely handles this correctly, it would be more defensive to check the error before setting up the defer.
```go
// truncate limits a string to a maximum length.
func truncate(s string, maxLen int) string {
	if len(s) <= maxLen {
		return s
	}
	return s[:maxLen] + "..."
}
```
The truncate function may split multi-byte UTF-8 characters, resulting in invalid UTF-8 strings. When slicing a string in Go by byte index (s[:maxLen]), if maxLen falls in the middle of a multi-byte character, the resulting string will be corrupted. This could cause issues when the truncated text is used in prompts or stored in the database. Consider using a rune-aware truncation approach or the strings.Cut function to ensure you don't break UTF-8 characters.
Suggested change:

```diff
-// truncate limits a string to a maximum length.
-func truncate(s string, maxLen int) string {
-	if len(s) <= maxLen {
-		return s
-	}
-	return s[:maxLen] + "..."
-}
+// truncate limits a string to a maximum length in runes, preserving valid UTF-8.
+func truncate(s string, maxLen int) string {
+	runes := []rune(s)
+	if len(runes) <= maxLen {
+		return s
+	}
+	return string(runes[:maxLen]) + "..."
+}
```
```go
// Author: Kaviru Hapuarachchi
// GitHub: https://github.com/kavirubc
// Created: 2026-02-05
// Last Modified: 2026-02-05

package commands

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/google/uuid"
	similiConfig "github.com/similigh/simili-bot/internal/core/config"
	"github.com/similigh/simili-bot/internal/integrations/gemini"
	similiGithub "github.com/similigh/simili-bot/internal/integrations/github"
	"github.com/similigh/simili-bot/internal/integrations/qdrant"
	"github.com/spf13/cobra"
)

var (
	learnOrg    string
	learnRepo   string
	learnFile   string
	learnToken  string
	learnDryRun bool
)

// learnCmd represents the learn command
var learnCmd = &cobra.Command{
	Use:   "learn",
	Short: "Index repository documentation for hybrid routing",
	Long: `Learn repository documentation by fetching files from GitHub (README.md,
CONTRIBUTING.md, etc.) and indexing them into a separate Qdrant collection.

This enables the bot to understand what each repository is responsible for,
improving routing decisions even with zero historical issues (cold start).

Examples:
  simili learn --org my-org --repo backend --file README.md
  simili learn --org my-org --repo backend --file CONTRIBUTING.md
  simili learn --org my-org --repo backend --file docs/ARCHITECTURE.md --dry-run`,
	Run: runLearn,
}

func init() {
	rootCmd.AddCommand(learnCmd)

	learnCmd.Flags().StringVar(&learnOrg, "org", "", "Organization name (required)")
	learnCmd.Flags().StringVar(&learnRepo, "repo", "", "Repository name (required)")
	learnCmd.Flags().StringVar(&learnFile, "file", "README.md", "File path to learn (default: README.md)")
	learnCmd.Flags().StringVar(&learnToken, "token", "", "GitHub token (optional, defaults to GITHUB_TOKEN env var)")
	learnCmd.Flags().BoolVar(&learnDryRun, "dry-run", false, "Simulate without writing to database")

	if err := learnCmd.MarkFlagRequired("org"); err != nil {
		log.Fatalf("Failed to mark org flag as required: %v", err)
	}
	if err := learnCmd.MarkFlagRequired("repo"); err != nil {
		log.Fatalf("Failed to mark repo flag as required: %v", err)
	}
}

func runLearn(cmd *cobra.Command, args []string) {
	ctx := context.Background()

	// 1. Load Config
	cfgPath := similiConfig.FindConfigPath(cfgFile)
	if cfgPath == "" {
		log.Fatalf("Config file not found. Please verify your setup.")
	}
	cfg, err := similiConfig.Load(cfgPath)
	if err != nil {
		log.Fatalf("Failed to load config: %v", err)
	}

	// 2. Initialize GitHub Client
	token := learnToken
	if token == "" {
		token = os.Getenv("GITHUB_TOKEN")
	}
	if token == "" {
		log.Fatalf("GitHub token is required (use --token or GITHUB_TOKEN env var)")
	}

	ghClient := similiGithub.NewClient(ctx, token)

	// 3. Initialize Embedder
	embedder, err := gemini.NewEmbedder(cfg.Embedding.APIKey, cfg.Embedding.Model)
	if err != nil {
		log.Fatalf("Failed to initialize Gemini embedder: %v", err)
	}
	defer embedder.Close()

	// 4. Initialize Qdrant Client (unless dry-run)
	var qdrantClient *qdrant.Client
	if !learnDryRun {
		qdrantClient, err = qdrant.NewClient(cfg.Qdrant.URL, cfg.Qdrant.APIKey)
		if err != nil {
			log.Fatalf("Failed to initialize Qdrant client: %v", err)
		}
		defer qdrantClient.Close()
	}

	// 5. Fetch File from GitHub
	log.Printf("Fetching %s from %s/%s...", learnFile, learnOrg, learnRepo)
	content, err := ghClient.GetFileContent(ctx, learnOrg, learnRepo, learnFile, "")
	if err != nil {
		log.Fatalf("Failed to fetch %s from %s/%s: %v", learnFile, learnOrg, learnRepo, err)
	}

	if len(content) == 0 {
		log.Printf("Warning: File %s is empty, skipping", learnFile)
		return
	}

	contentStr := string(content)
	log.Printf("Fetched %d characters from %s", len(contentStr), learnFile)

	// 6. Generate Embedding
	log.Printf("Generating embedding...")
	embedding, err := embedder.Embed(ctx, contentStr)
	if err != nil {
		log.Fatalf("Failed to generate embedding: %v", err)
	}
	log.Printf("Generated embedding with %d dimensions", len(embedding))

	// 7. Create Point with Rich Payload
	point := &qdrant.Point{
		ID:     uuid.New().String(),
		Vector: embedding,
		Payload: map[string]interface{}{
			"org":        learnOrg,
			"repo":       learnRepo,
			"file":       learnFile,
			"text":       contentStr,
			"indexed_at": time.Now().Format(time.RFC3339),
			"type":       "repo_doc",
		},
	}

	// 8. Dry Run Check
	if learnDryRun {
		fmt.Printf("🔍 [DRY RUN] Would index %s/%s/%s\n", learnOrg, learnRepo, learnFile)
		fmt.Printf("📊 Collection: %s\n", cfg.Transfer.RepoCollection)
		fmt.Printf("📝 Content: %d characters\n", len(contentStr))
		fmt.Printf("🔢 Embedding: %d dimensions\n", len(embedding))
		return
	}

	// 9. Ensure Collection Exists
	repoCollection := cfg.Transfer.RepoCollection
	log.Printf("Ensuring collection '%s' exists...", repoCollection)
	if err := qdrantClient.CreateCollection(ctx, repoCollection, cfg.Embedding.Dimensions); err != nil {
		log.Fatalf("Failed to create/verify collection: %v", err)
	}

	// 10. Upsert to Qdrant
	log.Printf("Indexing to collection '%s'...", repoCollection)
	if err := qdrantClient.Upsert(ctx, repoCollection, []*qdrant.Point{point}); err != nil {
		log.Fatalf("Failed to index document: %v", err)
	}

	// Success Output
	fmt.Printf("✅ Successfully indexed %s/%s/%s\n", learnOrg, learnRepo, learnFile)
	fmt.Printf("📊 Collection: %s\n", repoCollection)
	fmt.Printf("📝 Content: %d characters\n", len(contentStr))
	fmt.Printf("🔢 Embedding: %d dimensions\n", len(embedding))
}
```
The code lacks test coverage for the new hybrid routing functionality. While there are existing tests for prompts (gemini_test.go), there are no tests for: (1) the new learn command, (2) the enhanced LLM router with semantic search and multi-file support, (3) the new helper functions (indentText), or (4) the updated routing prompt with weighting guidance. Adding tests would help ensure the hybrid routing feature works correctly and prevent regressions.
```go
	// If multiple docs for same repo, concatenate with separators
	if existing, ok := repoDefinitions[repoKey]; ok {
		// Add file header for clarity
		repoDefinitions[repoKey] = existing + fmt.Sprintf("\n\n--- %s ---\n\n", file) + text
	} else {
		// First document for this repo
		if file != "README.md" {
			repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
		} else {
			repoDefinitions[repoKey] = text
		}
	}
```
Potential unbounded memory growth when concatenating multiple documentation files for the same repository. If a repository has many documentation files that match the semantic search (e.g., 10 files with 2000 characters each), the concatenated definition could grow to 20,000+ characters. While individual files are truncated to 2000 characters, there's no limit on the total concatenated size. Consider adding a maximum total size limit for concatenated definitions to prevent token explosion and memory issues.
```go
	// Fetch repository definitions from repo collection (if available)
	repoDefinitions := make(map[string]string) // map[org/repo]full_definition

	if s.embedder != nil && s.store != nil && ctx.Config.Transfer.RepoCollection != "" {
		// Check if repo collection exists
		exists, err := s.store.CollectionExists(ctx.Ctx, ctx.Config.Transfer.RepoCollection)
		if err != nil {
			log.Printf("[llm_router] Error checking repo collection: %v (non-blocking)", err)
		} else if !exists {
			log.Printf("[llm_router] Repo collection '%s' not found, using config descriptions",
				ctx.Config.Transfer.RepoCollection)
		} else {
			// Generate embedding for issue to find relevant repos
			issueContent := fmt.Sprintf("%s\n\n%s", ctx.Issue.Title, ctx.Issue.Body)
			issueEmbedding, err := s.embedder.Embed(ctx.Ctx, issueContent)

			if err != nil {
				log.Printf("[llm_router] Error generating issue embedding: %v (non-blocking)", err)
			} else {
				// Search for relevant repository definitions
				// Use broader threshold (0.5) and higher limit (10) to get context
				results, err := s.store.Search(
					ctx.Ctx,
					ctx.Config.Transfer.RepoCollection,
					issueEmbedding,
					10,  // Get top 10 relevant repo docs (more than we'll route to)
					0.5, // Lower threshold than issues (0.65-0.7) for broader context
				)

				if err != nil {
					log.Printf("[llm_router] Error searching repo collection: %v (non-blocking)", err)
				} else if len(results) == 0 {
					log.Printf("[llm_router] No repository definitions found")
				} else {
					// Collect definitions per repo (handle multiple files per repo)
					for _, res := range results {
						org, _ := res.Payload["org"].(string)
						repo, _ := res.Payload["repo"].(string)
						text, _ := res.Payload["text"].(string)
						file, _ := res.Payload["file"].(string)

						if org == "" || repo == "" || text == "" {
							log.Printf("[llm_router] Invalid repo definition payload, skipping")
							continue
						}

						repoKey := fmt.Sprintf("%s/%s", org, repo)

						// If multiple docs for same repo, concatenate with separators
						if existing, ok := repoDefinitions[repoKey]; ok {
							// Add file header for clarity
							repoDefinitions[repoKey] = existing + fmt.Sprintf("\n\n--- %s ---\n\n", file) + text
						} else {
							// First document for this repo
							if file != "README.md" {
								repoDefinitions[repoKey] = fmt.Sprintf("--- %s ---\n\n", file) + text
							} else {
								repoDefinitions[repoKey] = text
							}
						}
					}
					log.Printf("[llm_router] Loaded %d repository definitions from %d documents",
						len(repoDefinitions), len(results))
				}
			}
		}
	}

	// Enhance candidates with full definitions
	for i := range candidates {
		repoKey := fmt.Sprintf("%s/%s", candidates[i].Org, candidates[i].Repo)
		if def, ok := repoDefinitions[repoKey]; ok {
			candidates[i].Definition = def
		}
	}
```
The semantic search for repository definitions only searches in the repo collection, which means repositories that haven't been indexed via the learn command won't be found. However, all enabled repositories from the config are included in the candidates list (lines 70-80), so unindexed repositories will still be considered for routing but will only have their short config descriptions available. This is by design for graceful degradation, but it might be worth documenting this behavior more clearly in the code comments to help future maintainers understand the fallback logic.
```go
	// 5. Fetch File from GitHub
	log.Printf("Fetching %s from %s/%s...", learnFile, learnOrg, learnRepo)
	content, err := ghClient.GetFileContent(ctx, learnOrg, learnRepo, learnFile, "")
	if err != nil {
		log.Fatalf("Failed to fetch %s from %s/%s: %v", learnFile, learnOrg, learnRepo, err)
	}
```
The code doesn't validate whether the specified file path is within the repository bounds, potentially allowing path traversal attempts. While the GitHub API itself likely validates this, adding explicit validation that the file path doesn't contain patterns like "../" would make the security boundary clearer and prevent potential issues if the GitHub client implementation changes. Consider adding path validation before calling GetFileContent.
Fixes based on Copilot and CodeRabbit review:
- Use deterministic IDs (SHA-256 hash of org/repo/file) for idempotent learn operations
- Add path validation to prevent path traversal attempts
- Fix UTF-8 safe truncation using runes instead of bytes
- Always include file headers for consistency in multi-file concatenation
- Add max total size limit (5000 chars) per repo to prevent unbounded memory growth
- Improve logging for truncation events

These changes ensure:
- Re-learning the same file updates the existing entry instead of creating duplicates
- No UTF-8 character corruption in truncated strings
- Consistent formatting for all documentation files
- Protection against token explosion from large concatenated definitions
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@internal/integrations/gemini/prompts.go`:
- Around line 138-141: The truncation of r.Definition uses byte-length and
slicing which can break UTF-8; replace the manual byte-based check and slice
(len(definition) and definition[:2000]) with the existing rune-safe helper
truncate(), e.g. call truncate(definition, 2000) (or adjust truncate to accept a
custom suffix if you want "... (truncated)" instead of its default) so that
Definition is truncated safely without corrupting multibyte characters.
In `@internal/steps/llm_router.go`:
- Around line 141-158: The code currently truncates strings with byte slicing
(combined[:maxTotalSizePerRepo] and newDoc[:maxTotalSizePerRepo]) which can
split multi-byte UTF-8 runes; replace those slices with a rune-safe truncation
helper (e.g., truncateToByteLimitSafely) that iterates runes and accumulates
byte length until adding the next rune would exceed maxTotalSizePerRepo, then
returns the substring up to the last whole rune plus the "\n... (truncated)"
suffix; call this helper for both the combined and newDoc truncation branches
(referencing repoDefinitions, combined, newDoc, and maxTotalSizePerRepo to
locate the changes).
🧹 Nitpick comments (2)
cmd/simili/commands/learn.go (2)
108-113: Path validation may be insufficient on some edge cases.

The current validation uses `filepath.Clean` followed by string checks. However, `filepath.Clean` may normalize paths differently across platforms, and some edge cases like `foo/../bar` would be cleaned to `bar` (passing the check), which is fine, but paths like `..foo` (a literal filename starting with dots) would be incorrectly rejected. Consider using `filepath.IsAbs` and checking whether the cleaned path escapes the intended root:

♻️ More robust path validation

```diff
 	// 5. Validate and fetch file from GitHub
 	// Clean path and validate (prevent path traversal)
 	cleanPath := filepath.Clean(learnFile)
-	if strings.Contains(cleanPath, "..") || strings.HasPrefix(cleanPath, "/") {
+	if filepath.IsAbs(cleanPath) || strings.HasPrefix(cleanPath, "..") {
 		log.Fatalf("Invalid file path: %s (path traversal not allowed)", learnFile)
 	}
```

This allows legitimate filenames containing `..` (e.g., `docs/foo..bar.md`) while still blocking traversal attempts.
121-135: Consider adding a size limit for large files.

While the system handles empty files, there's no upper bound on file size. Very large files could cause memory issues during embedding generation or exceed API limits.

💡 Optional: Add maximum file size check

```diff
 	if len(content) == 0 {
 		log.Printf("Warning: File %s is empty, skipping", learnFile)
 		return
 	}
+
+	const maxFileSize = 100 * 1024 // 100KB limit
+	if len(content) > maxFileSize {
+		log.Printf("Warning: File %s exceeds size limit (%d > %d bytes), truncating",
+			learnFile, len(content), maxFileSize)
+		content = content[:maxFileSize]
+	}

 	contentStr := string(content)
```
```go
	definition := r.Definition
	if len(definition) > 2000 {
		definition = definition[:2000] + "\n... (truncated)"
	}
```
Use rune-based truncation for UTF-8 safety.
The truncation here uses byte length (`len(definition)`) and byte slicing (`definition[:2000]`), which can corrupt multi-byte UTF-8 characters. The `truncate()` helper defined at lines 13-20 already handles this correctly.
🛠️ Proposed fix

```diff
-	definition := r.Definition
-	if len(definition) > 2000 {
-		definition = definition[:2000] + "\n... (truncated)"
-	}
+	definition := truncate(r.Definition, 2000)
+	if definition != r.Definition {
+		definition += "\n(truncated)"
+	}
```

Note: The `truncate()` function already appends "...", so you may want to adjust the suffix or modify `truncate()` to accept a custom suffix parameter.
```go
	// If multiple docs for same repo, concatenate with size limit
	if existing, ok := repoDefinitions[repoKey]; ok {
		combined := existing + "\n\n" + newDoc
		// Enforce max total size per repo
		if len(combined) > maxTotalSizePerRepo {
			log.Printf("[llm_router] Truncating %s definition (size: %d > %d)",
				repoKey, len(combined), maxTotalSizePerRepo)
			combined = combined[:maxTotalSizePerRepo] + "\n... (truncated)"
		}
		repoDefinitions[repoKey] = combined
	} else {
		// First document for this repo
		if len(newDoc) > maxTotalSizePerRepo {
			newDoc = newDoc[:maxTotalSizePerRepo] + "\n... (truncated)"
		}
		repoDefinitions[repoKey] = newDoc
	}
```
Byte-based truncation may corrupt UTF-8 characters.
The truncation at lines 145-148 and 153-154 uses byte slicing (`combined[:maxTotalSizePerRepo]`), which can split multi-byte UTF-8 characters. This is the same issue as in `prompts.go`.
🛠️ Proposed fix using rune-safe truncation

```diff
+// truncateRunes safely truncates a string to maxLen runes.
+func truncateRunes(s string, maxLen int) string {
+	runes := []rune(s)
+	if len(runes) <= maxLen {
+		return s
+	}
+	return string(runes[:maxLen])
+}

 // Inside the enrichment loop:
 // Enforce max total size per repo
-	if len(combined) > maxTotalSizePerRepo {
+	if len([]rune(combined)) > maxTotalSizePerRepo {
 		log.Printf("[llm_router] Truncating %s definition (size: %d > %d)",
 			repoKey, len(combined), maxTotalSizePerRepo)
-		combined = combined[:maxTotalSizePerRepo] + "\n... (truncated)"
+		combined = truncateRunes(combined, maxTotalSizePerRepo) + "\n... (truncated)"
 	}
```
Add reference documentation for the new `simili learn` command used to index repository documentation for hybrid routing.

Update the `TransferConfig` schema to document new fields:
- `high_confidence`: threshold for auto-transferring issues
- `medium_confidence`: threshold for suggesting transfers
- `repo_collection`: Qdrant collection for repository documentation

Related PRs: similigh/simili-bot#32 (hybrid routing implementation)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Overview
Implements a hybrid routing system that combines repository documentation (~60% weight) with historical issue patterns (~40% weight) to solve the cold start problem and enable effective routing from Day 1.
Changes
1. Configuration Foundation: added `RepoCollection` field to `TransferConfig` (default: `"simili_repos"`)
2. Data Structure Enhancement: added `Definition` field to `RepositoryCandidate` for full repo documentation
3. Learn Command Implementation: new `simili learn` command to index repository documentation
4. Enhanced System Prompt: routing prompt updated with qualitative 60/40 weighting guidance
5. LLM Router Enhancement: semantic search over repository documentation before routing
Key Features
✅ Cold Start Support: Routes effectively even with zero historical issues
✅ Multi-File Support: Concatenates multiple docs per repo (README + CONTRIBUTING + docs/*)
✅ Semantic Search: Uses issue embedding to find relevant repos (top 10, threshold 0.5)
✅ Backward Compatible: All changes are additive, existing setups work without modifications
✅ Graceful Degradation: Every error point has fallback path
Usage
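The examples from the command's own help text cover the typical invocations:

```shell
simili learn --org my-org --repo backend --file README.md
simili learn --org my-org --repo backend --file CONTRIBUTING.md
simili learn --org my-org --repo backend --file docs/ARCHITECTURE.md --dry-run
```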
Configuration
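A config fragment enabling the repo collection might look like this. The key names are assumptions inferred from the docs commit (`high_confidence`, `medium_confidence`, `repo_collection`), the nesting under `transfer` mirrors `cfg.Transfer` in the code, and the threshold values are illustrative:

```yaml
transfer:
  high_confidence: 0.85      # illustrative: threshold for auto-transferring issues
  medium_confidence: 0.65    # illustrative: threshold for suggesting transfers
  repo_collection: simili_repos  # Qdrant collection for repository documentation (default)
```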
Testing
Commits
- `b336d29`: Configuration field for repo collection
- `a97d30a`: Definition field in RepositoryCandidate
- `30db714`: Complete learn command implementation
- `be1c0a6`: Enhanced routing prompt with 60/40 weighting
- `207b0cf`: LLM router with semantic search integration