Conversation
👍 Looks good to me! Reviewed everything up to 96f1ede in 12 seconds
More details
- Looked at 376 lines of code in 6 files
- Skipped 1 file when reviewing.
- Skipped posting 4 drafted comments based on config settings.
1. `gptme/eval/swebench/utils.py:10`
   - Draft comment: The import statement for `DownloadMode` is repeated. Remove the duplicate import to clean up the code.
   - Reason this comment was not posted: Confidence changes required: 50%. The import statement for `DownloadMode` is repeated, which is unnecessary and can be removed.
2. `gptme/eval/swebench/utils.py:46`
   - Draft comment: The `current_file` variable is initialized but never used. Consider removing it to clean up the code.
   - Reason this comment was not posted: Confidence changes required: 50%. The `get_file_spans_from_patch` function initializes `current_file` but never uses it, which is unnecessary and can be removed.
3. `gptme/eval/swebench/utils.py:74`
   - Draft comment: Using `os.chdir` to change the working directory can have side effects. Consider using a context manager to temporarily change the directory.
   - Reason this comment was not posted: Confidence changes required: 50%. The `setup_github_repo` function changes the current working directory using `os.chdir`, which can have side effects. It's better to use a context manager to temporarily change the directory.
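The context-manager pattern the draft comment suggests can be sketched as follows. This is an illustration, not the code in `utils.py`; the name `pushd` is made up for this example, and on Python 3.11+ the stdlib `contextlib.chdir` provides the same behavior.

```python
import os
from contextlib import contextmanager
from collections.abc import Iterator


@contextmanager
def pushd(path: str) -> Iterator[None]:
    """Temporarily change the working directory, restoring it on exit."""
    prev = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Restore even if the body raises, so callers are never
        # accidentally left inside the repo directory.
        os.chdir(prev)
```

Unlike a bare `os.chdir`, this guarantees the previous directory is restored on any exit path, including exceptions.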
4. `gptme/eval/swebench/main.py:86`
   - Draft comment: The `write_results` function is called but not defined in the provided code. Ensure that it is implemented or imported correctly.
   - Reason this comment was not posted: Confidence changes required: 50%. The `write_results` function is called but not defined in the provided code. Ensure that it is implemented or imported correctly.
Workflow ID: wflow_QDiWSjoiJJC7dGXD
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
Codecov Report: ❌ Patch coverage is …
👍 Looks good to me! Incremental review on 4e9b48a in 6 seconds
More details
- Looked at 21 lines of code in 1 file
- Skipped 0 files when reviewing.
- Skipped posting 1 drafted comment based on config settings.
1. `gptme/eval/swebench/main.py:4`
   - Draft comment: The import `EvalResult` is unused and can be removed to clean up the code.
   - Reason this comment was not posted: Confidence changes required: 50%. The import statement for `EvalResult` is not used in the code, which is unnecessary and should be removed to keep the code clean.
Workflow ID: wflow_RhT1myKfhHe3YANu
Anthropic announced that Claude 3.5 (new), aka Claude "3.6", performs 49% on SWE-Bench Verified with a simple harness: https://www.anthropic.com/research/swe-bench-sonnet

I think optimizing for the particular benchmark might become less and less necessary over time, unless you want to squeeze performance out of smaller models. Would be cool to make a proper run and get listed on the SWE-Bench leaderboard, though.
I got it kinda working with swe-agent and this dataset, which contains many more issues: https://huggingface.co/datasets/nebius/SWE-bench-extra

Might also integrate https://swe-rex.com/latest/ which seems pretty useful.

My branch is a giant mess atm though 😭
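For reference, pulling instances from that dataset with the `datasets` library might look roughly like this. This is a sketch: the `train` split name and the per-instance fields are assumptions based on the standard SWE-bench schema, not verified against this dataset, and `filter_by_repo` is a made-up helper for illustration.

```python
def filter_by_repo(instances: list[dict], repo: str) -> list[dict]:
    """Keep only instances targeting the given GitHub repo.

    Assumes each instance dict has a `repo` field, as in the standard
    SWE-bench schema (instance_id, repo, base_commit, problem_statement,
    patch); verify field names against the dataset card.
    """
    return [inst for inst in instances if inst.get("repo") == repo]


def load_swebench_extra(split: str = "train"):
    """Fetch the dataset; requires `pip install datasets` and network access.

    The split name is an assumption; check the dataset card on Hugging Face.
    """
    from datasets import load_dataset

    return load_dataset("nebius/SWE-bench-extra", split=split)
```

Once loaded, `filter_by_repo(list(ds), "some/repo")` would narrow the set to a single project for quicker iteration.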
@bjsi Would be very interested to get it working if you can find the time to extract the relevant changes 🙏
I also just found SWE-Gym: https://arxiv.org/abs/2412.21139
I'll try to get back on this soon! At the very least I'll just share a gist which shows how to get the deps to install properly etc. |
Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework (originally by @ErikBjare, PR gptme#142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)
- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis (originally by @bjsi, PR gptme#424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known limitations:

- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
Recovered and combined with #424 into clean PR #1489. Closing as requested by @ErikBjare. |
…424) (#1489)

* feat(eval): add SWE-bench evaluation modules

  Combines work from two WIP branches into a single recoverable base:

  - **gptme/eval/swebench/**: Core SWE-bench evaluation framework (originally by @ErikBjare, PR #142)
    - Instance loading, repository setup utilities
    - Evaluation runner with gptme agent integration
    - CLI entry point (`gptme-eval-swebench`)
  - **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis (originally by @bjsi, PR #424)
    - Setup scripts for running SWE-bench evals
    - Data loading and analysis utilities (top-50 easiest instances)
    - Test specification generation
    - SWE-bench constants and harness integration

  Both modules are WIP and require further integration work. Key known limitations:

  - `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
  - `swe_extra` references `SWEBenchAgent` (not yet merged)
  - `datasets` and `swebench` added as optional eval dependencies

  Adds mypy overrides for WIP modules to defer type checking.

* fix(eval): remove os.chdir() and duplicate import in swebench utils

  - Replace os.chdir(repo_dir) with cwd=repo_dir in subprocess calls to preserve working directory persistence (per revert in #1487)
  - Consolidate duplicate 'from datasets import' into single import line

* fix(eval): address greptile review - fix repo_dir in files dict and missing model arg

  - evaluate.py: replace incorrect {"repo_dir": repo_dir} Files dict with proper prompt-embedded context; add missing log_dir/workspace_dir to EvalResult instances
  - run_swe_extra.py: add --model CLI arg so cli() can call main() correctly for non-resume evaluations

* fix(eval): address greptile review feedback on swebench module

* fix(eval): copy repo to agent workspace and capture diff via git

* fix(eval): enable mypy for swebench modules per review

  Remove WIP mypy ignore_errors for gptme.eval.swebench.* — code is clean. Keep ignore_errors for gptme.eval.swe_extra.* (depends on unmerged SWEBenchInfo).

* feat(eval): integrate SWEBenchInfo/SWEBenchAgent and fix swe_extra for current swebench API

  Per review, integrate bjsi's SWEBenchInfo and SWEBenchAgent code from PR #424:

  - Add SWEBenchInfo frozen dataclass to logmanager.py for evaluation metadata
  - Add SWEBenchAgent class in eval/agents/swebench.py with multi-stage pipeline
  - Convert eval/agents.py to package (eval/agents/__init__.py + swebench.py)
  - Update swe_bench_constants.py for current swebench library API
  - Fix swe_bench_test_spec.py imports for restructured swebench package
  - Fix type errors in swe_bench_extra_data.py and run_swe_extra.py
  - Add matplotlib to mypy ignore_missing_imports
  - Remove WIP ignore_errors override for swe_extra modules

* fix(eval): restore cwd after os.chdir in SWEBenchAgent.replay()

* refactor(eval): move SWEBenchInfo from logmanager to eval/swebench/info.py

  Per Erik's review: SWEBenchInfo doesn't belong in logmanager.py; move it into the eval/swebench module where it belongs.

  - New: gptme/eval/swebench/info.py with SWEBenchInfo dataclass
  - Updated gptme/eval/swebench/__init__.py to export SWEBenchInfo
  - Removed SWEBenchInfo from gptme/logmanager.py
  - Updated all imports in eval/agents/swebench.py, swe_extra/{run_swe_extra,swe_bench_test_spec,swe_bench_extra_data}.py

---------

Co-authored-by: TimeToBuildBob <TimeToBuildBob@users.noreply.github.com>
Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
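The `cwd=repo_dir` change described in the commits above is the standard alternative to `os.chdir` for running commands inside a repository. A minimal sketch of the pattern (the helper name is illustrative, not the actual code in this PR):

```python
import subprocess


def run_in_repo(cmd: list[str], repo_dir: str) -> str:
    """Run a command inside repo_dir without changing the process-wide cwd.

    Unlike os.chdir(), the cwd= argument only affects the child process,
    so concurrent code and later calls in the parent are unaffected.
    """
    result = subprocess.run(
        cmd, cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout
```

With `check=True`, a non-zero exit raises `CalledProcessError`, which keeps failures from being silently swallowed when, e.g., capturing a diff via git.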
Implemented with gptme, given moatless-tools and aider as reference implementations.
Important
Introduces SWE-bench evaluation framework in `gptme` with new modules for instance loading, repository setup, and evaluation execution, along with CLI support and updated dependencies.

- New `gptme/eval/swebench` package.
- `run_swebench_evaluation()` in `evaluate.py` to evaluate instances using an `Agent`.
- `main.py` for running evaluations with options for model, dataset, split, instance, and verbosity.
- `utils.py` provides functions for loading instances, setting up repositories, and extracting file spans from patches.
- `gptme-eval-swebench` script entry in `pyproject.toml`.
- `datasets` and `fsspec` added as dependencies in `pyproject.toml`.

This description was created by Ellipsis for 4e9b48a. It will automatically update as commits are pushed.