
feat: started working on SWE-bench evals #142

Closed
ErikBjare wants to merge 2 commits into master from dev/swebench

Conversation

Member

@ErikBjare ErikBjare commented Sep 30, 2024

Implemented with gptme, given moatless-tools and aider as reference implementations.

  • Set up harness
  • Get a single eval instance passing
    • Gets stuck at installing deps for repos
      • moatless-tools doesn't support running tests?
      • aider depends on the Docker env?
  • Try making our own eval instance?

Important

Introduces SWE-bench evaluation framework in gptme with new modules for instance loading, repository setup, and evaluation execution, along with CLI support and updated dependencies.

  • New Features:
    • Introduces SWE-bench evaluation framework in gptme/eval/swebench.
    • Implements run_swebench_evaluation() in evaluate.py to evaluate instances using an Agent.
    • Adds CLI command in main.py for running evaluations with options for model, dataset, split, instance, and verbosity.
  • Utilities:
    • utils.py provides functions for loading instances, setting up repositories, and extracting file spans from patches.
  • Configuration:
    • Adds gptme-eval-swebench script entry in pyproject.toml.
    • Adds datasets and fsspec as dependencies in pyproject.toml.

This description was created by Ellipsis for 4e9b48a. It will automatically update as commits are pushed.
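The utilities summary mentions extracting file spans from patches. As a rough idea of what that involves, here is a minimal stdlib sketch that pulls the changed file paths out of a unified diff — the function name and details are illustrative, not the actual `utils.py` implementation:

```python
import re


def files_changed_in_patch(patch: str) -> list[str]:
    """Extract paths of files touched by a unified diff.

    Hypothetical sketch: the real get_file_spans_from_patch in
    gptme/eval/swebench/utils.py may work differently.
    """
    files = []
    for line in patch.splitlines():
        # unified diffs introduce each changed file with a '+++ b/<path>' header
        m = re.match(r"^\+\+\+ b/(.+)$", line)
        if m:
            files.append(m.group(1))
    return files


patch = """\
--- a/gptme/eval/swebench/utils.py
+++ b/gptme/eval/swebench/utils.py
@@ -1,3 +1,2 @@
-from datasets import DownloadMode
 from datasets import load_dataset
"""
print(files_changed_in_patch(patch))  # ['gptme/eval/swebench/utils.py']
```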

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


👍 Looks good to me! Reviewed everything up to 96f1ede in 12 seconds

More details
  • Looked at 376 lines of code in 6 files
  • Skipped 1 file when reviewing.
  • Skipped posting 4 drafted comments based on config settings.
1. gptme/eval/swebench/utils.py:10
  • Draft comment:
    The import statement for DownloadMode is repeated. Remove the duplicate import to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The import statement for DownloadMode is repeated, which is unnecessary and can be removed.
2. gptme/eval/swebench/utils.py:46
  • Draft comment:
    The current_file variable is initialized but never used. Consider removing it to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The get_file_spans_from_patch function initializes current_file but never uses it, which is unnecessary and can be removed.
3. gptme/eval/swebench/utils.py:74
  • Draft comment:
    Using os.chdir to change the working directory can have side effects. Consider using a context manager to temporarily change the directory.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The setup_github_repo function changes the current working directory using os.chdir, which can have side effects. It's better to use a context manager to temporarily change the directory.
4. gptme/eval/swebench/main.py:86
  • Draft comment:
    The write_results function is called but not defined in the provided code. Ensure that it is implemented or imported correctly.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The write_results function is called but not defined in the provided code. Ensure that it is implemented or imported correctly.
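Draft comment 3 above suggests a context manager instead of a bare `os.chdir`. A minimal stdlib sketch of that pattern (the name `working_directory` is illustrative; Python 3.11+ also ships `contextlib.chdir` built in):

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily chdir into `path`, restoring the old cwd even on error."""
    prev = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(prev)
```

A later fix commit in this thread went further and dropped `os.chdir` entirely in favor of passing `cwd=` to subprocess calls, which avoids mutating process-global state at all.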

Workflow ID: wflow_QDiWSjoiJJC7dGXD


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.


codecov-commenter commented Sep 30, 2024

Codecov Report

❌ Patch coverage is 0% with 142 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| gptme/eval/swebench/evaluate.py | 0.00% | 62 Missing ⚠️ |
| gptme/eval/swebench/utils.py | 0.00% | 47 Missing ⚠️ |
| gptme/eval/swebench/main.py | 0.00% | 29 Missing ⚠️ |
| gptme/eval/swebench/__init__.py | 0.00% | 3 Missing ⚠️ |
| gptme/eval/swebench/__main__.py | 0.00% | 1 Missing ⚠️ |


@ErikBjare ErikBjare mentioned this pull request Sep 30, 2024
Contributor

@ellipsis-dev ellipsis-dev bot left a comment


👍 Looks good to me! Incremental review on 4e9b48a in 6 seconds

More details
  • Looked at 21 lines of code in 1 file
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comment based on config settings.
1. gptme/eval/swebench/main.py:4
  • Draft comment:
    The import EvalResult is unused and can be removed to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The import statement for EvalResult is not used in the code, which is unnecessary and should be removed to keep the code clean.

Workflow ID: wflow_RhT1myKfhHe3YANu



@ErikBjare ErikBjare mentioned this pull request Oct 2, 2024
Member Author

ErikBjare commented Nov 1, 2024

Anthropic announced that Claude 3.5 (new), aka Claude "3.6", performs 49% on SWE-Bench Verified, with a simple harness: https://www.anthropic.com/research/swe-bench-sonnet

I think optimizing for the particular benchmark might become less and less necessary over time, unless you want to squeeze performance out of smaller models.

Would be cool to make a proper run and get listed on the SWE-Bench leaderboard, though.

Contributor

bjsi commented Dec 27, 2024

I got it kinda working with swe-agent and this dataset which contains many more issues: https://huggingface.co/datasets/nebius/SWE-bench-extra

Might also integrate https://swe-rex.com/latest/ which seems pretty useful

My branch is a giant mess atm though 😭

@ErikBjare
Member Author

@bjsi Would be very interested to get it working if you can find the time to extract the relevant changes 🙏

@ErikBjare
Member Author

I also just found SWE-Gym: https://arxiv.org/abs/2412.21139

Contributor

bjsi commented Jan 14, 2025

I'll try to get back on this soon! At the very least I'll just share a gist which shows how to get the deps to install properly etc.

TimeToBuildBob added a commit to TimeToBuildBob/gptme that referenced this pull request Feb 25, 2026
Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework
  (originally by @ErikBjare, PR gptme#142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)

- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis
  (originally by @bjsi, PR gptme#424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known
limitations:
- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
@TimeToBuildBob
Member

Recovered and combined with #424 into clean PR #1489. Closing as requested by @ErikBjare.

TimeToBuildBob added a commit that referenced this pull request Feb 26, 2026
TimeToBuildBob added a commit to TimeToBuildBob/gptme that referenced this pull request Feb 26, 2026
ErikBjare pushed a commit that referenced this pull request Feb 26, 2026
…424) (#1489)

* feat(eval): add SWE-bench evaluation modules

Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework
  (originally by @ErikBjare, PR #142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)

- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis
  (originally by @bjsi, PR #424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known
limitations:
- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>

* fix(eval): remove os.chdir() and duplicate import in swebench utils

- Replace os.chdir(repo_dir) with cwd=repo_dir in subprocess calls to
  preserve working directory persistence (per revert in #1487)
- Consolidate duplicate 'from datasets import' into single import line

* fix(eval): address greptile review - fix repo_dir in files dict and missing model arg

- evaluate.py: replace incorrect {"repo_dir": repo_dir} Files dict with
  proper prompt-embedded context; add missing log_dir/workspace_dir to
  EvalResult instances
- run_swe_extra.py: add --model CLI arg so cli() can call main() correctly
  for non-resume evaluations

* fix(eval): address greptile review feedback on swebench module

* fix(eval): copy repo to agent workspace and capture diff via git

* fix(eval): enable mypy for swebench modules per review

Remove WIP mypy ignore_errors for gptme.eval.swebench.* — code is clean.
Keep ignore_errors for gptme.eval.swe_extra.* (depends on unmerged SWEBenchInfo).

* feat(eval): integrate SWEBenchInfo/SWEBenchAgent and fix swe_extra for current swebench API

Per review, integrate bjsi's SWEBenchInfo and SWEBenchAgent code from PR #424:

- Add SWEBenchInfo frozen dataclass to logmanager.py for evaluation metadata
- Add SWEBenchAgent class in eval/agents/swebench.py with multi-stage pipeline
- Convert eval/agents.py to package (eval/agents/__init__.py + swebench.py)
- Update swe_bench_constants.py for current swebench library API
- Fix swe_bench_test_spec.py imports for restructured swebench package
- Fix type errors in swe_bench_extra_data.py and run_swe_extra.py
- Add matplotlib to mypy ignore_missing_imports
- Remove WIP ignore_errors override for swe_extra modules

* fix(eval): restore cwd after os.chdir in SWEBenchAgent.replay()

* refactor(eval): move SWEBenchInfo from logmanager to eval/swebench/info.py

Per Erik's review: SWEBenchInfo doesn't belong in logmanager.py, move it
into the eval/swebench module where it belongs.

- New: gptme/eval/swebench/info.py with SWEBenchInfo dataclass
- Updated gptme/eval/swebench/__init__.py to export SWEBenchInfo
- Removed SWEBenchInfo from gptme/logmanager.py
- Updated all imports in eval/agents/swebench.py, swe_extra/{run_swe_extra,
  swe_bench_test_spec,swe_bench_extra_data}.py

---------

Co-authored-by: TimeToBuildBob <TimeToBuildBob@users.noreply.github.com>
Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
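The `os.chdir` fix described in the merged commit above — passing `cwd=repo_dir` to subprocess calls instead of changing the parent's working directory — can be illustrated with a self-contained sketch (`repo_dir` here is just a temp directory, not the actual eval setup):

```python
import os
import subprocess
import sys
import tempfile


def child_cwd(workdir: str) -> str:
    # cwd= scopes the directory change to the child process only;
    # the parent's working directory is never touched.
    out = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=workdir, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


with tempfile.TemporaryDirectory() as repo_dir:
    before = os.getcwd()
    assert os.path.samefile(child_cwd(repo_dir), repo_dir)  # child ran inside repo_dir
    assert os.getcwd() == before  # parent directory unchanged
```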
@TimeToBuildBob TimeToBuildBob deleted the dev/swebench branch March 28, 2026 04:30