feat: add batch CLI and web UI for issue analysis #36
Add simili batch command to process multiple issues from JSON files against the Qdrant vector database in dry-run mode. This enables testing bot logic on historical data without making any GitHub writes.

Features:
- Process issues from JSON array input
- Support JSON and CSV output formats
- Concurrent processing with worker pool pattern
- Override collection, thresholds, and top-k via flags
- Force dry-run mode to prevent any side effects

Use cases:
- Test bot logic on historical issues
- Generate analysis reports for stakeholders
- Validate similarity search and duplicate detection
- Audit quality assessment without repo write access

Implementation:
- Extracted ExecutePipeline() function for reusability
- Worker pattern following index.go architecture
- Exclude indexer step to prevent VDB writes
- Support configuration overrides via CLI flags

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
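The worker pool pattern named in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in batch.go: `Issue`, `Result`, `analyze`, and `processBatch` are hypothetical names, and `analyze` stands in for the dry-run pipeline call.

```go
package main

import (
	"fmt"
	"sync"
)

// Issue and Result are hypothetical stand-ins for the real batch types.
type Issue struct {
	Number int
	Title  string
}

type Result struct {
	Number  int
	Summary string
}

// analyze is a placeholder for the dry-run pipeline call; it performs no writes.
func analyze(issue Issue) Result {
	return Result{Number: issue.Number, Summary: "analyzed: " + issue.Title}
}

// processBatch fans issues out to a fixed number of workers and returns
// the results in input order.
func processBatch(issues []Issue, workers int) []Result {
	if workers < 1 {
		workers = 1
	}
	results := make([]Result, len(issues))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is written by exactly one worker, so no lock is needed.
				results[i] = analyze(issues[i])
			}
		}()
	}
	for i := range issues {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	issues := []Issue{{1, "Crash on start"}, {2, "Typo in docs"}, {3, "Slow search"}}
	for _, r := range processBatch(issues, 2) {
		fmt.Println(r.Number, r.Summary)
	}
}
```

Writing each result to its own slice index keeps the output order stable regardless of which worker finishes first, which matters when the results are later written out as CSV rows.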
Fixed duplicate detection incorrectly marking related issues as duplicates by ensuring full issue bodies are passed to the LLM for analysis, not just titles.

Changes:
- Add Body field to SimilarIssue struct to store full text
- Extract and populate text field from Qdrant payload
- Pass issue bodies to duplicate detection LLM prompt
- Rewrite duplicate detection prompt to be more conservative
- Distinguish between DUPLICATE vs RELATED issues clearly
- Raise duplicate confidence threshold to 0.85
- Add DuplicateReason field to capture LLM reasoning
- Update embedding model to gemini-embedding-001

Root cause: Similarity search was retrieving the text field from Qdrant but discarding it, only passing titles to the duplicate detector. This caused the LLM to mark issues with similar titles as duplicates even when their bodies described different problems.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
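The root cause described in this commit (the text field was fetched but dropped) can be illustrated with a small sketch. It assumes the Qdrant payload has already been decoded into a plain map with `title` and `text` keys; the real client types differ, and `fromPayload` is a hypothetical helper rather than a function from this PR.

```go
package main

import "fmt"

// SimilarIssue mirrors the idea of the struct in the PR; the exact fields
// in the real code may differ.
type SimilarIssue struct {
	Title string
	Body  string // full issue text, previously discarded
	Score float64
}

// fromPayload is a hypothetical helper. It assumes the payload is a plain
// map with "title" and "text" keys and ignores values of the wrong type.
func fromPayload(payload map[string]any, score float64) SimilarIssue {
	issue := SimilarIssue{Score: score}
	if v, ok := payload["title"].(string); ok {
		issue.Title = v
	}
	if v, ok := payload["text"].(string); ok {
		issue.Body = v
	}
	return issue
}

func main() {
	p := map[string]any{
		"title": "Login fails",
		"text":  "Login fails with a 500 after password reset.",
	}
	issue := fromPayload(p, 0.91)
	fmt.Printf("%s (%.2f): %s\n", issue.Title, issue.Score, issue.Body)
}
```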
Update documentation to reflect:
- New simili batch command with usage examples
- Improved duplicate detection with body analysis
- Corrected embedding model name to gemini-embedding-001
- CSV/JSON output format examples

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add simili-web command with minimal shadcn-style web interface for analyzing GitHub issues against the vector database. Provides real-time analysis results without requiring GitHub write permissions.

Features:
- Interactive web form for submitting issues
- Real-time similarity search against indexed issues
- Duplicate detection with confidence scores and reasoning
- Quality assessment with actionable feedback
- Label suggestions based on issue content
- Transfer recommendations with explanations
- Minimal black/white design with green/red/yellow accents
- Embedded static files for single-binary deployment

Technical details:
- Go HTTP server with embedded static files
- Same pipeline as CLI (dry-run mode enforced)
- REST API with /api/analyze endpoint
- No writes to GitHub or vector database
- Supports collection overrides via env vars

Usage:
  export GEMINI_API_KEY=xxx
  export QDRANT_URL=xxx
  export QDRANT_API_KEY=xxx
  export QDRANT_COLLECTION=xxx
  ./simili-web

Then open http://localhost:8080 in browser.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
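A handler for the /api/analyze endpoint might look roughly like the sketch below. The request and response shapes are simplified assumptions (the JSON field names `title` and `body` are not confirmed by this page), the pipeline call is replaced by a comment, and `respond` and `statusFor` are hypothetical helpers used only to exercise the handler in memory.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
)

// IssueRequest and AnalysisResponse are simplified, assumed shapes.
type IssueRequest struct {
	Title string `json:"title"`
	Body  string `json:"body"`
}

type AnalysisResponse struct {
	Success bool   `json:"success"`
	Error   string `json:"error,omitempty"`
}

// handleAnalyze accepts a POSTed issue and replies with JSON.
func handleAnalyze(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		respond(w, http.StatusMethodNotAllowed, AnalysisResponse{Error: "POST required"})
		return
	}
	var req IssueRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		respond(w, http.StatusBadRequest, AnalysisResponse{Error: "Invalid JSON: " + err.Error()})
		return
	}
	// The real server would run the dry-run pipeline on req here.
	respond(w, http.StatusOK, AnalysisResponse{Success: true})
}

// respond writes a JSON body with the given status and logs encode failures.
func respond(w http.ResponseWriter, status int, resp AnalysisResponse) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	if err := json.NewEncoder(w).Encode(resp); err != nil {
		log.Printf("encode response failed: %v", err)
	}
}

// statusFor sends one request through the handler without opening a port.
func statusFor(method, body string) int {
	req := httptest.NewRequest(method, "/api/analyze", strings.NewReader(body))
	rec := httptest.NewRecorder()
	handleAnalyze(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(http.MethodPost, `{"title":"Crash","body":"Steps to reproduce"}`)) // 200
	fmt.Println(statusFor(http.MethodPost, `{not json`))                                     // 400
	fmt.Println(statusFor(http.MethodGet, ""))                                               // 405
}
```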
📝 Walkthrough
Adds a Simili web UI and CLI batch command, extracts pipeline execution into a reusable API, enriches similarity/duplicate detection with full-body context and reasoning, and expands embedding configuration keys. Includes frontend assets, backend server, batch processing, tests, and documentation.
Sequence Diagram
sequenceDiagram
participant Browser
participant WebServer as Web Server
participant Config as Config Loader
participant Deps as Dependencies
participant Pipeline as Pipeline Executor
participant Qdrant as Qdrant Vector DB
participant Gemini as Gemini LLM
participant Response as Response Builder
Browser->>WebServer: POST /api/analyze (IssueRequest)
activate WebServer
WebServer->>Config: Load .simili.yaml and env overrides
Config-->>WebServer: Config
WebServer->>Deps: Initialize embedder, vector store, LLM (dry-run)
Deps-->>WebServer: Dependencies ready
WebServer->>Pipeline: ExecutePipeline(issue, cfg, deps, steps..., silent=true)
activate Pipeline
Pipeline->>Qdrant: Search similar issues
Qdrant-->>Pipeline: Similar issues + metadata
Pipeline->>Gemini: Duplicate analysis (includes bodies)
Gemini-->>Pipeline: DuplicateResult + reasoning
Pipeline-->>WebServer: Pipeline.Result
deactivate Pipeline
WebServer->>Response: Build AnalysisResponse JSON
Response-->>WebServer: Response
WebServer-->>Browser: 200 OK + AnalysisResponse
deactivate WebServer
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
No actionable comments were generated in the recent review. 🎉
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/steps/duplicate_detector.go (1)
94-98: ⚠️ Potential issue | 🟠 Major
Align code default with LLM prompt threshold.
The LLM prompt in `internal/integrations/gemini/prompts.go` line 309 instructs the model to use a 0.85 confidence threshold ("ONLY set is_duplicate to true if confidence >= 0.85"), but the code default in `internal/steps/duplicate_detector.go` line 97 and `internal/core/config/config.go` line 226 falls back to 0.8. Update the default to 0.85 to match the prompt guidance.
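One way to keep the fallback and the prompt in step is to define the default once and apply it wherever the configured value is unset. The sketch below is an assumption about how that could look; `defaultDuplicateThreshold`, `effectiveThreshold`, and `isDuplicate` are hypothetical names, and treating zero or out-of-range values as "unset" is a guess about the config semantics.

```go
package main

import "fmt"

// defaultDuplicateThreshold matches the 0.85 cut-off stated in the prompt.
const defaultDuplicateThreshold = 0.85

// effectiveThreshold returns the configured value, or the default when the
// value is unset (zero) or outside the (0, 1] range.
func effectiveThreshold(configured float64) float64 {
	if configured <= 0 || configured > 1 {
		return defaultDuplicateThreshold
	}
	return configured
}

// isDuplicate applies the threshold to the confidence reported by the model.
func isDuplicate(confidence, configured float64) bool {
	return confidence >= effectiveThreshold(configured)
}

func main() {
	fmt.Println(isDuplicate(0.82, 0))   // false: default 0.85 applies
	fmt.Println(isDuplicate(0.90, 0))   // true
	fmt.Println(isDuplicate(0.82, 0.8)) // true: explicit override
}
```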
🤖 Fix all issues with AI agents
In `@cmd/simili-web/main.go`:
- Around line 160-163: The JSON encoding calls in handlers like handleHealth
(and the other handlers around lines referenced) ignore errors; update each
json.NewEncoder(w).Encode(...) to capture its returned error, and if non-nil
respond with an appropriate HTTP error and log the failure (e.g., using
log.Printf or the existing logger) before returning. Specifically, in
handleHealth and the handlers that call json.NewEncoder(w).Encode (the ones
noted in the comment), assign the result to err, check if err != nil, write an
http.Error (or set proper status and message), and log the encoding error with
contextual info including the handler name.
- Around line 105-107: The Println call that outputs the stop message includes
an explicit trailing newline which is redundant and triggers go vet; update the
output in main (the fmt.Println(" Press Ctrl+C to stop\n") statement) to
remove the trailing "\n" (either use fmt.Println(" Press Ctrl+C to stop") or
fmt.Print without the extra newline) so that only one newline is emitted; ensure
you modify the fmt.Println invocation near the port/collection prints (the lines
that call fmt.Printf for port and cfg.Qdrant.Collection) to resolve the CI/vet
failure.
In `@cmd/simili-web/static/app.js`:
- Around line 116-123: The issue is that raw issue.URL values are inserted into
the href and can be crafted (e.g., javascript:) to cause XSS; add a helper
function named sanitizeUrl (placed near the existing escapeHtml helper) that
parses the URL with new URL(url, window.location.origin), returns parsed.href
only if parsed.protocol is 'http:' or 'https:', and otherwise returns '#'
(catching parse errors and returning '#'); then replace the template insertion
of ${issue.URL} in the similar-issue anchor with a call to
sanitizeUrl(issue.URL) so only whitelisted http/https links are used as hrefs.
In `@cmd/simili-web/static/index.html`:
- Around line 89-91: The external anchor element that opens in a new tab (the
<a> element with text "Simili Bot" and attribute target="_blank") should include
rel="noopener noreferrer" to prevent tabnabbing and ensure safe behavior; update
that anchor tag to add the rel attribute with both noopener and noreferrer
values while keeping the existing href and target attributes unchanged.
In `@cmd/simili/commands/batch.go`:
- Around line 222-246: The parameter name in loadIssues(filepath string) shadows
the imported path/filepath package; rename the parameter (for example to
filePath, path, inputPath, or filename) and update all references inside
loadIssues to the new name so the function no longer conflicts with the filepath
package and future uses of filepath.Join/etc. will compile correctly.
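The shadowing problem in the last item can be shown in a few lines. This sketch is illustrative only: `resolveInput` is a hypothetical function, not `loadIssues` from the PR.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// With a parameter named filepath, the package is hidden inside the function:
//
//	func resolve(filepath string) string { return filepath.Clean(filepath) }
//
// That version does not compile, because filepath now refers to the string.

// resolveInput uses a non-conflicting parameter name, so the package
// remains usable inside the function body.
func resolveInput(filePath string) string {
	return filepath.ToSlash(filepath.Clean(filePath))
}

func main() {
	fmt.Println(resolveInput("data/../issues.json")) // issues.json
}
```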
🧹 Nitpick comments (6)
README.md (1)
165-168: Add language specifier to fenced code block.
The code block is missing a language specifier. Since this shows shell output, consider adding `bash` or `text` as the language.
📝 Proposed fix
```diff
-```
+```bash
 # 4. Review results
 cat analysis.csv
```
cmd/simili-web/README.md (1)
90-112: Consider documenting the `Body` field in the API response.
The `SimilarIssue` struct now includes a `Body` field (per changes in `internal/core/pipeline/pipeline.go`), but it's not shown in the example response. If the API exposes this field, consider adding it to the example for completeness.
cmd/simili-web/static/styles.css (1)
96-185: Add explicit focus-visible styles for keyboard accessibility.
Hover styles are present, but focus indicators aren’t defined; adding them improves keyboard navigation without altering visual design for mouse users.
Suggested CSS
```diff
+button[type="submit"]:focus-visible,
+.similar-title a:focus-visible,
+footer a:focus-visible {
+  outline: 2px solid #fff;
+  outline-offset: 2px;
+}
```
cmd/simili/commands/batch_test.go (1)
163-190: Reset package-level flags to avoid cross-test leakage.
These tests mutate global flags; restoring them with `t.Cleanup` prevents state bleed into other tests in this package.
Suggested change
```diff
 t.Run(tt.name, func(t *testing.T) {
+	prevCollection := batchCollection
+	prevThreshold := batchThreshold
+	prevDuplicate := batchDuplicateThresh
+	prevTopK := batchTopK
+	t.Cleanup(func() {
+		batchCollection = prevCollection
+		batchThreshold = prevThreshold
+		batchDuplicateThresh = prevDuplicate
+		batchTopK = prevTopK
+	})
+
 	// Set global flags
 	batchCollection = tt.collection
 	batchThreshold = tt.threshold
 	batchDuplicateThresh = tt.duplicateThresh
 	batchTopK = tt.topK
```
cmd/simili/commands/batch.go (2)
296-299: Verbose log may show empty model name.
When `cfg.Embedding.Model` is not specified in the config, this log will print an empty string, even though `NewEmbedder` internally defaults to `"text-embedding-004"`. This could confuse users running with `--verbose`.
💡 Proposed fix
```diff
 deps.Embedder = embedder
 if verbose {
-	fmt.Printf("✓ Initialized Gemini Embedder with model: %s\n", cfg.Embedding.Model)
+	model := cfg.Embedding.Model
+	if model == "" {
+		model = "text-embedding-004" // default
+	}
+	fmt.Printf("✓ Initialized Gemini Embedder with model: %s\n", model)
 }
```
340-350: Redundant nil check for geminiKey.
The check `if geminiKey != ""` on line 341 is always true at this point, since lines 288-290 already return an error if `geminiKey` is empty.
🧹 Proposed fix
```diff
 // Initialize LLM Client
-if geminiKey != "" {
-	llm, err := gemini.NewLLMClient(geminiKey)
-	if err != nil {
-		return nil, fmt.Errorf("failed to initialize Gemini LLM client: %w", err)
-	}
-	deps.LLMClient = llm
-	if verbose {
-		fmt.Println("✓ Initialized Gemini LLM client")
-	}
+llm, err := gemini.NewLLMClient(geminiKey)
+if err != nil {
+	return nil, fmt.Errorf("failed to initialize Gemini LLM client: %w", err)
+}
+deps.LLMClient = llm
+if verbose {
+	fmt.Println("✓ Initialized Gemini LLM client")
 }
```
func handleHealth(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}
Check Encode errors to satisfy errcheck and improve reliability.
Lint currently fails because JSON encoding errors are ignored. Handle or log the errors in both handlers.
Proposed fix (apply pattern to all Encode calls)
- json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
+ if err := json.NewEncoder(w).Encode(map[string]string{"status": "ok"}); err != nil {
+     log.Printf("encode health response failed: %v", err)
+ }

- json.NewEncoder(w).Encode(AnalysisResponse{
+ if err := json.NewEncoder(w).Encode(AnalysisResponse{
      Success: false,
      Error: "Invalid JSON: " + err.Error(),
- })
+ }); err != nil {
+     log.Printf("encode analyze error response failed: %v", err)
+ }

Also applies to: 182-187, 192-196, 223-227, 231-242
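Since the same pattern applies to several Encode calls, one option is to centralize it in a helper. The sketch below is only a suggestion under stated assumptions: `writeJSON` and `encodeSucceeds` are hypothetical names and are not part of the reviewed code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// writeJSON is a hypothetical helper: it encodes v as JSON, logs any failure
// together with the handler name, and returns the error to the caller.
func writeJSON(w http.ResponseWriter, handler string, v any) error {
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(v); err != nil {
		log.Printf("%s: encode response failed: %v", handler, err)
		return err
	}
	return nil
}

// encodeSucceeds runs writeJSON against an in-memory recorder.
func encodeSucceeds(v any) bool {
	return writeJSON(httptest.NewRecorder(), "demo", v) == nil
}

func main() {
	fmt.Println(encodeSucceeds(map[string]string{"status": "ok"})) // true
	fmt.Println(encodeSucceeds(make(chan int)))                    // false: channels cannot be encoded
}
```

Returning the error lets each handler decide whether logging is enough; once part of the body has been written, the status code can no longer be changed, so logging is often the only practical response.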
🧰 Tools
🪛 GitHub Check: Lint
[failure] 162-162:
Error return value of (*encoding/json.Encoder).Encode is not checked (errcheck)
🤖 Prompt for AI Agents
In `@cmd/simili-web/main.go` around lines 160 - 163, The JSON encoding calls in
handlers like handleHealth (and the other handlers around lines referenced)
ignore errors; update each json.NewEncoder(w).Encode(...) to capture its
returned error, and if non-nil respond with an appropriate HTTP error and log
the failure (e.g., using log.Printf or the existing logger) before returning.
Specifically, in handleHealth and the handlers that call
json.NewEncoder(w).Encode (the ones noted in the comment), assign the result to
err, check if err != nil, write an http.Error (or set proper status and
message), and log the encoding error with contextual info including the handler
name.
Fix all issues identified by CodeRabbit review and CI checks:

1. CI/Vet: Remove redundant newline in fmt.Println
   - Line 107 already adds newline, extra \n triggered go vet
2. Errcheck: Handle JSON encoding errors in all handlers
   - Added error checking for all json.Encode() calls
   - Log errors instead of silently ignoring them
3. Security: Sanitize URLs before injecting into href attributes
   - Added sanitizeUrl() function to whitelist http(s) protocols
   - Prevents XSS via javascript: URLs
   - Also added rel="noopener noreferrer" to external links
4. Code quality: Fix parameter shadowing in loadIssues
   - Renamed filepath -> filePath to avoid shadowing imported package

All CI checks should now pass.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
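The sanitizeUrl() fix in this commit lives in the browser code (app.js). The same whitelist idea can be sketched on the server side in Go, which is the analogue shown here; `sanitizeURL` is a hypothetical function that is not part of this PR, and unlike the browser version it rejects relative URLs instead of resolving them against the page origin.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sanitizeURL returns the URL unchanged in meaning if it is an absolute
// http or https URL with a host, and "#" for anything else.
func sanitizeURL(raw string) string {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "#"
	}
	switch strings.ToLower(u.Scheme) {
	case "http", "https":
		if u.Host == "" {
			return "#"
		}
		return u.String()
	default:
		// javascript:, data:, relative paths, and empty schemes all land here.
		return "#"
	}
}

func main() {
	fmt.Println(sanitizeURL("https://github.com/org/repo/issues/1"))
	fmt.Println(sanitizeURL("javascript:alert(1)"))
}
```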
Summary
This PR introduces two major features for testing and analyzing issues against the Simili Bot vector database:
Both features run in dry-run mode and never make writes to GitHub or the vector database.
Changes
🔧 Batch CLI Command (`simili batch`)
Extracted `ExecutePipeline()` function from process command
Use Cases:
Example:
🌐 Web UI (`simili-web`)
REST API with `/api/analyze` endpoint
Features:
Usage:
🐛 Bug Fixes
Duplicate Detection Improvements:
Extract and populate `text` field from Qdrant payload
Add `DuplicateReason` field to capture LLM reasoning
Update embedding model to `gemini-embedding-001`
Before: Issue chain #8640 → #8641 → #8642 incorrectly marked as duplicates
After: Issues correctly identified as related but not duplicates
Testing
Batch CLI
Web UI
Documentation
Breaking Changes
None. All changes are additive.
Checklist
🤖 Generated with Claude Sonnet 4.5
Summary by CodeRabbit
New Features
Documentation