
feat: started working on SWE-bench evals #142

Closed
ErikBjare wants to merge 2 commits into master from dev/swebench

Conversation

Member

@ErikBjare ErikBjare commented Sep 30, 2024

Implemented with gptme, given moatless-tools and aider as reference implementations.

  • Set up harness
  • Get a single eval instance passing
    • Gets stuck at installing deps for repos
      • moatless-tools doesn't support running tests?
      • aider depends on the Docker env?
  • Try making our own eval instance?

Important

Introduces SWE-bench evaluation framework in gptme with new modules for instance loading, repository setup, and evaluation execution, along with CLI support and updated dependencies.

  • New Features:
    • Introduces SWE-bench evaluation framework in gptme/eval/swebench.
    • Implements run_swebench_evaluation() in evaluate.py to evaluate instances using an Agent.
    • Adds CLI command in main.py for running evaluations with options for model, dataset, split, instance, and verbosity.
  • Utilities:
    • utils.py provides functions for loading instances, setting up repositories, and extracting file spans from patches.
  • Configuration:
    • Adds gptme-eval-swebench script entry in pyproject.toml.
    • Adds datasets and fsspec as dependencies in pyproject.toml.

This description was created by Ellipsis for 4e9b48a. It will automatically update as commits are pushed.
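The utilities summary mentions extracting file spans from patches. As a rough idea of what that involves, here is a minimal stdlib sketch that pulls the changed file paths out of a unified diff — the function name and details are illustrative, not the actual `utils.py` implementation:

```python
import re


def files_changed_in_patch(patch: str) -> list[str]:
    """Extract paths of files touched by a unified diff.

    Hypothetical sketch: the real get_file_spans_from_patch in
    gptme/eval/swebench/utils.py may work differently.
    """
    files = []
    for line in patch.splitlines():
        # unified diffs introduce each changed file with a '+++ b/<path>' header
        m = re.match(r"^\+\+\+ b/(.+)$", line)
        if m:
            files.append(m.group(1))
    return files


patch = """\
--- a/gptme/eval/swebench/utils.py
+++ b/gptme/eval/swebench/utils.py
@@ -1,3 +1,2 @@
-from datasets import DownloadMode
 from datasets import load_dataset
"""
print(files_changed_in_patch(patch))  # ['gptme/eval/swebench/utils.py']
```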

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


👍 Looks good to me! Reviewed everything up to 96f1ede in 12 seconds

More details
  • Looked at 376 lines of code in 6 files
  • Skipped 1 file when reviewing.
  • Skipped posting 4 drafted comments based on config settings.
1. gptme/eval/swebench/utils.py:10
  • Draft comment:
    The import statement for DownloadMode is repeated. Remove the duplicate import to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The import statement for DownloadMode is repeated, which is unnecessary and can be removed.
2. gptme/eval/swebench/utils.py:46
  • Draft comment:
    The current_file variable is initialized but never used. Consider removing it to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The get_file_spans_from_patch function initializes current_file but never uses it, which is unnecessary and can be removed.
3. gptme/eval/swebench/utils.py:74
  • Draft comment:
    Using os.chdir to change the working directory can have side effects. Consider using a context manager to temporarily change the directory.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The setup_github_repo function changes the current working directory using os.chdir, which can have side effects. It's better to use a context manager to temporarily change the directory.
4. gptme/eval/swebench/main.py:86
  • Draft comment:
    The write_results function is called but not defined in the provided code. Ensure that it is implemented or imported correctly.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The write_results function is called but not defined in the provided code. Ensure that it is implemented or imported correctly.
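Draft comment 3 above suggests a context manager instead of a bare `os.chdir`. A minimal stdlib sketch of that pattern (the name `working_directory` is illustrative; Python 3.11+ also ships `contextlib.chdir` built in):

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily chdir into `path`, restoring the old cwd even on error."""
    prev = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(prev)
```

A later fix commit in this thread went further and dropped `os.chdir` entirely in favor of passing `cwd=` to subprocess calls, which avoids mutating process-global state at all.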

Workflow ID: wflow_QDiWSjoiJJC7dGXD


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.


codecov-commenter commented Sep 30, 2024

Codecov Report

❌ Patch coverage is 0% with 142 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| gptme/eval/swebench/evaluate.py | 0.00% | 62 Missing ⚠️ |
| gptme/eval/swebench/utils.py | 0.00% | 47 Missing ⚠️ |
| gptme/eval/swebench/main.py | 0.00% | 29 Missing ⚠️ |
| gptme/eval/swebench/__init__.py | 0.00% | 3 Missing ⚠️ |
| gptme/eval/swebench/__main__.py | 0.00% | 1 Missing ⚠️ |


@ErikBjare ErikBjare mentioned this pull request Sep 30, 2024
Contributor

@ellipsis-dev ellipsis-dev bot left a comment


👍 Looks good to me! Incremental review on 4e9b48a in 6 seconds

More details
  • Looked at 21 lines of code in 1 file
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comment based on config settings.
1. gptme/eval/swebench/main.py:4
  • Draft comment:
    The import EvalResult is unused and can be removed to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The import statement for EvalResult is not used in the code, which is unnecessary and should be removed to keep the code clean.

Workflow ID: wflow_RhT1myKfhHe3YANu



@ErikBjare ErikBjare mentioned this pull request Oct 2, 2024
Member Author

ErikBjare commented Nov 1, 2024

Anthropic announced that Claude 3.5 (new), aka Claude "3.6", performs 49% on SWE-Bench Verified, with a simple harness: https://www.anthropic.com/research/swe-bench-sonnet

I think optimizing for the particular benchmark might become less and less necessary over time, unless you want to squeeze performance out of smaller models.

Would be cool to make a proper run and get listed on the SWE-Bench leaderboard, though.

Contributor

bjsi commented Dec 27, 2024

I got it kinda working with swe-agent and this dataset which contains many more issues: https://huggingface.co/datasets/nebius/SWE-bench-extra

Might also integrate https://swe-rex.com/latest/ which seems pretty useful

My branch is a giant mess atm though 😭

@ErikBjare
Member Author

@bjsi Would be very interested to get it working if you can find the time to extract the relevant changes 🙏

@ErikBjare
Member Author

I also just found SWE-Gym: https://arxiv.org/abs/2412.21139

Contributor

bjsi commented Jan 14, 2025

I'll try to get back on this soon! At the very least I'll just share a gist which shows how to get the deps to install properly etc.

TimeToBuildBob added a commit to TimeToBuildBob/gptme that referenced this pull request Feb 25, 2026
Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework
  (originally by @ErikBjare, PR gptme#142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)

- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis
  (originally by @bjsi, PR gptme#424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known
limitations:
- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
@TimeToBuildBob
Member

Recovered and combined with #424 into clean PR #1489. Closing as requested by @ErikBjare.

TimeToBuildBob added a commit that referenced this pull request Feb 26, 2026
TimeToBuildBob added a commit to TimeToBuildBob/gptme that referenced this pull request Feb 26, 2026
ErikBjare pushed a commit that referenced this pull request Feb 26, 2026
…424) (#1489)

* feat(eval): add SWE-bench evaluation modules

Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework
  (originally by @ErikBjare, PR #142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)

- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis
  (originally by @bjsi, PR #424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known
limitations:
- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>

* fix(eval): remove os.chdir() and duplicate import in swebench utils

- Replace os.chdir(repo_dir) with cwd=repo_dir in subprocess calls to
  preserve working directory persistence (per revert in #1487)
- Consolidate duplicate 'from datasets import' into single import line

* fix(eval): address greptile review - fix repo_dir in files dict and missing model arg

- evaluate.py: replace incorrect {"repo_dir": repo_dir} Files dict with
  proper prompt-embedded context; add missing log_dir/workspace_dir to
  EvalResult instances
- run_swe_extra.py: add --model CLI arg so cli() can call main() correctly
  for non-resume evaluations

* fix(eval): address greptile review feedback on swebench module

* fix(eval): copy repo to agent workspace and capture diff via git

* fix(eval): enable mypy for swebench modules per review

Remove WIP mypy ignore_errors for gptme.eval.swebench.* — code is clean.
Keep ignore_errors for gptme.eval.swe_extra.* (depends on unmerged SWEBenchInfo).

* feat(eval): integrate SWEBenchInfo/SWEBenchAgent and fix swe_extra for current swebench API

Per review, integrate bjsi's SWEBenchInfo and SWEBenchAgent code from PR #424:

- Add SWEBenchInfo frozen dataclass to logmanager.py for evaluation metadata
- Add SWEBenchAgent class in eval/agents/swebench.py with multi-stage pipeline
- Convert eval/agents.py to package (eval/agents/__init__.py + swebench.py)
- Update swe_bench_constants.py for current swebench library API
- Fix swe_bench_test_spec.py imports for restructured swebench package
- Fix type errors in swe_bench_extra_data.py and run_swe_extra.py
- Add matplotlib to mypy ignore_missing_imports
- Remove WIP ignore_errors override for swe_extra modules

* fix(eval): restore cwd after os.chdir in SWEBenchAgent.replay()

* refactor(eval): move SWEBenchInfo from logmanager to eval/swebench/info.py

Per Erik's review: SWEBenchInfo doesn't belong in logmanager.py, move it
into the eval/swebench module where it belongs.

- New: gptme/eval/swebench/info.py with SWEBenchInfo dataclass
- Updated gptme/eval/swebench/__init__.py to export SWEBenchInfo
- Removed SWEBenchInfo from gptme/logmanager.py
- Updated all imports in eval/agents/swebench.py, swe_extra/{run_swe_extra,
  swe_bench_test_spec,swe_bench_extra_data}.py

---------

Co-authored-by: TimeToBuildBob <TimeToBuildBob@users.noreply.github.com>
Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
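The `os.chdir` fix described in the merged commit above — passing `cwd=repo_dir` to subprocess calls instead of changing the parent's working directory — can be illustrated with a self-contained sketch (`repo_dir` here is just a temp directory, not the actual eval setup):

```python
import os
import subprocess
import sys
import tempfile


def child_cwd(workdir: str) -> str:
    # cwd= scopes the directory change to the child process only;
    # the parent's working directory is never touched.
    out = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=workdir, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


with tempfile.TemporaryDirectory() as repo_dir:
    before = os.getcwd()
    assert os.path.samefile(child_cwd(repo_dir), repo_dir)  # child ran inside repo_dir
    assert os.getcwd() == before  # parent directory unchanged
```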
@TimeToBuildBob TimeToBuildBob deleted the dev/swebench branch March 28, 2026 04:30