RepoPilot

Inspiration

Shipping time is precious, yet we watched senior engineers burn hours triaging a growing graveyard of GitHub issues instead of shipping the next feature. Modern LLMs can already reason about source code, so why not let the repo heal itself while humans keep building? That question gave rise to RepoPilot: an autonomous teammate that tackles both fresh and long-standing issues, generates a patch that builds and passes tests locally, then opens a polished pull request for human review, all without a single line of code leaving your private cloud.

What it does

Spins up a sandbox – For each incoming or backlogged GitHub issue, RepoPilot checks the issue body and labels for a branch name or tag (e.g., release-v2.3.1). It then clones that exact branch or tagged snapshot into an isolated container and installs dependencies from the corresponding lockfile so the sandbox faithfully matches the user-specified environment.
Understands the failure – The issue text, stack traces (when available), and the sandboxed codebase are fed to a locally hosted Qwen-2.5-Coder (swappable for any model). The LLM pinpoints the root cause across files.
Generates a fix – Inside the same sandbox, the agent edits code, runs the project’s test suite (or reproduces the reported steps), and iterates until everything passes.
Explains the patch – It writes a plain-English summary of what went wrong and how the fix resolves it, linking to relevant lines.
Hands off for review – Pushes the updated branch and opens a pull request with the diff, passing-test badge, and explanation attached.
Tracks impact – Logs “issue-to-PR” time, sandbox runtime, and success rate so teams can quantify hours saved.
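
The first step above, finding the branch or tag named in an issue, could be sketched roughly as follows. This is a minimal illustration rather than RepoPilot's actual code: the function name `extract_ref`, the regular expressions, the "branch: name" label convention, and the fallback to `main` are all assumptions made for the example.

```python
import re

# Hypothetical patterns: "branch: foo" / "tag: v1.2" hints, or release-vX.Y.Z names.
_REF_PATTERNS = [
    re.compile(r"\b(?:branch|tag)\s*[:=]\s*`?([\w./-]+)`?", re.IGNORECASE),
    re.compile(r"\b(release-v\d+(?:\.\d+)*)\b"),
]


def extract_ref(body: str, labels: list[str], default: str = "main") -> str:
    """Return the first branch/tag hinted at in the labels or body, else `default`."""
    # Labels are checked first because they are usually more deliberate than free text.
    for text in [*labels, body]:
        for pattern in _REF_PATTERNS:
            match = pattern.search(text)
            if match:
                # Drop sentence punctuation that may trail the name.
                return match.group(1).rstrip(".,")
    return default
```

The returned name could then be handed to `git clone --branch <ref>`, which accepts tags as well as branches.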

How we built it
Architecture – Docker Compose stack (Flask + MongoDB + Docker) running in a single VM on any cloud; the model is served through Ollama.
Model layer – Qwen-2.5-Coder is our current workhorse, but the interface is pluggable—swap in bigger local models or remote APIs as policy allows.
Agent loop – Built on Aider, in three stages: a retriever pulls only the changed files plus their dependency graph, a planner decomposes multi-file bugs into atomic tasks, and an executor applies edits and runs pytest.
Security – All processing stays inside the customer’s VPC; GitHub tokens are stored in Vault; no outbound model calls.
Latency tricks – Parallel diff generation & test runs; caching embeddings between agent cycles.
Frontend – Lightweight React dashboard that streams live logs, shows queue status, and draws the “minutes-saved” counter.
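
The pluggable model layer and the executor's edit-and-test cycle described above might be wired together along these lines. This is a sketch under stated assumptions, not Aider's internals or RepoPilot's real code: the `CodeModel` protocol, `propose_patch`, `apply_patch`, `run_tests`, and `fix_until_green` are hypothetical names, and the test command simply defaults to `pytest -q`.

```python
import subprocess
from pathlib import Path
from typing import Callable, Protocol, Sequence


class CodeModel(Protocol):
    """Hypothetical interface: any local or remote model can implement it."""

    def propose_patch(self, issue: str, feedback: str) -> str: ...


def run_tests(repo: Path, cmd: Sequence[str] = ("pytest", "-q")) -> tuple[bool, str]:
    """Run the test command inside the sandbox; return (passed, combined output)."""
    proc = subprocess.run(list(cmd), cwd=repo, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def fix_until_green(
    model: CodeModel,
    issue: str,
    repo: Path,
    apply_patch: Callable[[Path, str], None],
    test_cmd: Sequence[str] = ("pytest", "-q"),
    max_rounds: int = 5,
) -> bool:
    """Request a patch, apply it, run the tests, and feed failures back to the model."""
    feedback = ""
    for _ in range(max_rounds):
        patch = model.propose_patch(issue, feedback)
        apply_patch(repo, patch)
        passed, output = run_tests(repo, test_cmd)
        if passed:
            return True
        # Keep only the tail of the log so the next prompt stays small.
        feedback = output[-4000:]
    return False
```

Because the loop depends only on the `CodeModel` protocol, swapping the model means supplying a different object with a `propose_patch` method; the `max_rounds` cap keeps a stubborn issue from consuming the sandbox indefinitely.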

Challenges we ran into

LLM hallucinations – Early on, small language models would invent APIs or rename variables, breaking compilation. Upgrading to Qwen-2.5-Coder-32B cut hallucinations by ~60 %. We’ve verified that proprietary models (e.g., GPT-4-class) push accuracy even higher, but their licenses kept them out of the hackathon demo.
Huge codebases – Some repos exceed a million lines, and issues often include full stack traces or logs, blowing past a 32k-token context window. Moving to long-context models lets us raise that limit to 128k tokens.
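
Whatever the window size, a prompt still has to fit inside it. One common approach is to rank candidate files and pack them greedily into a token budget. The sketch below is illustrative only: `pack_context`, the relevance scores, and the rough 4-characters-per-token estimate are assumptions, not RepoPilot's retriever or a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: about 4 characters per token; swap in a real tokenizer."""
    return max(1, len(text) // 4)


def pack_context(files: dict[str, str], scores: dict[str, float], budget: int) -> list[str]:
    """Greedily pick the highest-scoring files whose estimated tokens fit the budget."""
    chosen: list[str] = []
    used = 0
    # Visit files from most to least relevant; unscored files count as 0.0.
    for path in sorted(files, key=lambda p: scores.get(p, 0.0), reverse=True):
        cost = estimate_tokens(files[path])
        if used + cost <= budget:
            chosen.append(path)
            used += cost
    return chosen
```

A file that does not fit is skipped rather than ending the loop, so smaller, lower-ranked files can still use the leftover budget.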

Accomplishments that we're proud of

End-to-end working demo completed in less than two weeks.
Cleared a 35-issue backlog of a sample OSS repo in 2 h with an 87 % first-pass success rate.
Mean “issue-to-PR” latency: 7 min 12 s (20× faster than manual triage in our benchmark).
Zero code ever left the demo VPC—proving enterprise-grade privacy on day one.
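
Figures such as first-pass success rate and mean issue-to-PR latency could be derived from per-issue run records along these lines. The `Run` record, its fields, and the definition of "first pass" as a success on the first attempt are assumptions made for this sketch, and it contains none of the demo's data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    issue_id: int
    seconds_to_pr: float  # time from picking up the issue to opening the PR
    attempts: int         # fix/test rounds used; 1 means it passed first time
    succeeded: bool


def summarize(runs: list[Run]) -> dict[str, float]:
    """Aggregate first-pass success rate and mean latency over successful runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    first_pass = sum(1 for r in runs if r.succeeded and r.attempts == 1)
    done = [r.seconds_to_pr for r in runs if r.succeeded]
    return {
        "first_pass_rate": first_pass / len(runs),
        "mean_seconds_to_pr": mean(done) if done else float("nan"),
    }
```

Failed runs are excluded from the latency mean because no PR was opened for them, but they still count in the denominator of the success rate.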

What we learned

LLMs are surprisingly good at small, surgical fixes; complex refactors still need humans—so “human-in-the-loop PR” is the right boundary.
Quality of test suites is the ceiling; bad or missing tests equal uncertain fixes.
Fine-tuning isn’t always worth it—prompt engineering plus smart retrieval covered 80 % of cases.
Simple UX cues (a live log stream, a green badge) dramatically boost user trust in autonomous code edits.

What's next for RepoPilot

Broaden the fix-zone
Auto-repair beyond “bug” labels—handle documentation typos, feature-request scaffolding, and even CI-failure patches. One agent, many chores, slashing even more context-switch time for engineers.
Self-generated regression tests
Before touching code, RepoPilot will synthesize unit or integration tests around the failing path, hardening weak suites and guarding against future regressions.
Pluggable model layer
Keep the open-source default (Qwen-2.5-Coder) but allow drop-in swaps to larger local models or remote proprietary APIs whenever policy and budget permit.
Usage-based savings dashboard
A finance-friendly panel that converts “minutes saved” into real dollars reclaimed, broken down by repo, label, and sprint—so leadership sees ROI at a glance.
Marketplace of skills
Community-contributed “skill packs” (e.g., React refactor rules, Terraform lint fixes) that RepoPilot can pull in on demand, making the agent smarter with every install.