SWE-fficiency benchmark implementation by 18jeffreyma · Pull Request #11716 · OpenHands/OpenHands

18jeffreyma · 2025-11-12T14:17:09Z

Summary of PR

Adds the SWE-fficiency inference implementation to the OpenHands benchmarks. https://swefficiency.com/

Change Type

Bug fix
[ x ] New feature
Breaking change
Refactor
Other (dependency update, docs, typo fixes, etc.)

Checklist

[ x ] I have read and reviewed the code and I understand what the code is doing.
I have tested the code to the best of my ability and ensured it works as expected.

Fixes

In addition to adding the benchmark implementation, this change also fixes an issue with parallel evaluation, where one evaluation finishing would crash the others. The cause was that a finished eval terminated the other evals' containers when rm_all_containers is True.
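The failure mode described above can be illustrated with a minimal sketch. It is not the code from this PR: the `Container` record, the `containers_to_remove` helper, and the `eval-runtime-` prefix are all hypothetical names chosen for illustration. The only assumption taken from the description is that cleanup with rm_all_containers set to True removes containers belonging to other evals that are still running.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    """Hypothetical stand-in for a runtime container record."""

    name: str


def containers_to_remove(
    containers: list[Container],
    own_name: str,
    shared_prefix: str,
    rm_all_containers: bool,
) -> list[str]:
    """Return the names a finishing eval would remove (illustrative only)."""
    if rm_all_containers:
        # Matches every container with the shared prefix, including those
        # still in use by other parallel evals. This is the crash scenario.
        return [c.name for c in containers if c.name.startswith(shared_prefix)]
    # Scoped cleanup: only the container owned by the finishing eval.
    return [c.name for c in containers if c.name == own_name]


running = [Container('eval-runtime-a'), Container('eval-runtime-b')]

# Eval "a" finishes while eval "b" is still running.
print(containers_to_remove(running, 'eval-runtime-a', 'eval-runtime-', True))
# ['eval-runtime-a', 'eval-runtime-b']
print(containers_to_remove(running, 'eval-runtime-a', 'eval-runtime-', False))
# ['eval-runtime-a']
```

With the flag enabled, the first call also selects eval "b"'s container, which is why one finished eval could take down the others.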

Release Notes

[ x ] Include this change in the Release Notes.

enyst · 2025-11-12T20:23:10Z

Hey @18jeffreyma this is great and so fast, thank you! However, please note that this is the directory for OpenHands V0. We are transitioning to the next major version, OpenHands V1, and we have another repo, benchmarks, for V1. I understand that SWE-fficiency was run on V0? At this time, we are already using the new repo…

I’m just concerned that under the current transition, maybe you wouldn’t actually want to put work in adapting to this repo… 😓 unless I’m missing something and you are sure this should be here.

18jeffreyma · 2025-11-12T20:25:22Z

ah thanks for the catch! I forked OpenHands back in March well before the migration. Is there a guide to the new repository I can instead rebase against?

Happy to implement a V1 version and keep this fork for historical purposes, and instead have a new implementation be part of the new V1, if you can share more details on the differences?

enyst · 2025-11-12T20:35:34Z

V1 is a major architectural rebuild of OpenHands, based on agent-sdk

the benchmarks are in this benchmarks repo
arxiv paper documenting V1 agent-sdk: The OpenHands Software Agent SDK

Cc: @juanmichelini and @xingyaoww know better what changes the benchmark code adaptation needs

neubig

Hi @18jeffreyma , I think we can merge in this for now. We're eventually going to port over the benchmarks to the new https://github.com/OpenHands/benchmarks repo, but I think having this documented is better than not (and porting might take a little bit).

Would you mind splitting the fixes of the docker containers into a separate PR? I think we can merge in the evaluation/benchmarks/swefficiency changes right away, since they're low risk but we might need to take a closer look at the docker runtime changes to make sure they don't have unintended consequences.

neubig · 2025-11-13T13:56:16Z

I think this is an unintended change, could we revert it?

18jeffreyma · 2025-11-13

sounds good, will clean up this PR and reply back when done!

mamoodi · 2025-11-24T18:59:05Z

@neubig should this be moved over to the benchmarks repo?

enyst · 2025-11-25T10:53:05Z

+        except Exception as e:
+            logger.warning(f'Error during base image cleanup: {e}')
+
+    def _find_dependent_images(self, docker_client: docker.DockerClient, base_image: str) -> list[str]:


Could we perhaps leave the docker changes out of this PR?

I think you are correct that there were issues, but we are transitioning to a better runtime handling in V1. I think it's like, anytime now this docker code will likely be deleted anyway, and on the PR docker tests are failing so they are blocking merge...
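For readers unfamiliar with what a helper like `_find_dependent_images` might do, here is a minimal sketch of one common approach: treating an image as dependent on a base image when its layer list starts with the base's layers and adds at least one more. This is not the implementation from the PR, whose body is not shown above. The function below is hypothetical and takes a plain mapping of image tag to layer digests rather than a `docker.DockerClient`; with the Docker SDK for Python, such a mapping could presumably be built from each image's `attrs['RootFS']['Layers']`, but that wiring is an assumption and is left out so the example runs without a Docker daemon.

```python
def find_dependent_images(
    image_layers: dict[str, list[str]], base_image: str
) -> list[str]:
    """Return tags of images built on top of base_image (illustrative only).

    image_layers maps an image tag to its ordered list of layer digests.
    """
    base = image_layers[base_image]
    dependents = []
    for tag, layers in image_layers.items():
        if tag == base_image:
            continue
        # A dependent image begins with the base's layers and adds more.
        if len(layers) > len(base) and layers[: len(base)] == base:
            dependents.append(tag)
    return sorted(dependents)


# Hypothetical tags and digests, for illustration.
images = {
    'base:latest': ['sha256:a', 'sha256:b'],
    'runtime:one': ['sha256:a', 'sha256:b', 'sha256:c'],
    'runtime:two': ['sha256:a', 'sha256:b', 'sha256:d', 'sha256:e'],
    'unrelated:latest': ['sha256:x'],
}
print(find_dependent_images(images, 'base:latest'))
# ['runtime:one', 'runtime:two']
```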

18jeffreyma · 2025-11-27T01:20:58Z

Removed the docker implementation fixes + fixed previous merge issue, should be ready for review!

Apologies for the delay!

enyst · 2025-11-27T06:23:51Z

@OpenHands CI lint / python lint job is failing on this PR. Run pre-commit run --all-files --show-diff-on-failure --config ./dev_config/python/.pre-commit-config.yaml to autofix it, and push your changes

Note that its branch is 18jeffreyma:main so from a fork, use the right remote.

openhands-ai · 2025-11-27T06:24:02Z

I'm on it! enyst can track my progress at all-hands.dev

openhands-ai · 2025-11-27T06:31:08Z

Summary of work completed

Actions taken:

Checked out the PR head branch from the fork: created local branch pr-11716 tracking fork/main (18jeffreyma:main).
Ran the requested linting exactly:
- poetry run pre-commit run --all-files --show-diff-on-failure --config ./dev_config/python/.pre-commit-config.yaml
- Addressed auto-fixes and re-ran to confirm all hooks passed (ruff, ruff-format, mypy, etc.).
Committed and pushed fixes to the correct remote/branch (18jeffreyma:main).
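The actions above can be sketched as a command sequence. The helper below only builds and prints the commands; it does not run them. The remote name `fork`, the local branch `pr-11716`, and the pre-commit invocation come from this thread. The fork URL, the `fork_lint_commands` name, and the exact `git commit` step are assumptions for illustration.

```python
import shlex


def fork_lint_commands(
    fork_url: str,
    remote: str = 'fork',
    fork_branch: str = 'main',
    local_branch: str = 'pr-11716',
) -> list[list[str]]:
    """Build the commands to lint a PR whose head branch lives on a fork."""
    return [
        # Register the contributor's fork and fetch its branches.
        ['git', 'remote', 'add', remote, fork_url],
        ['git', 'fetch', remote],
        # Local branch tracking the fork's branch (here fork/main).
        ['git', 'checkout', '-b', local_branch, '--track', f'{remote}/{fork_branch}'],
        # The lint command requested in the thread.
        [
            'poetry', 'run', 'pre-commit', 'run', '--all-files',
            '--show-diff-on-failure',
            '--config', './dev_config/python/.pre-commit-config.yaml',
        ],
        # Assumed commit step; the message matches the commit listed in this PR.
        ['git', 'commit', '-am', 'Fix lint via pre-commit for SWE-fficiency files'],
        # Push the local branch to the fork's branch, not to origin.
        ['git', 'push', remote, f'{local_branch}:{fork_branch}'],
    ]


# Assumed URL for the fork named in this PR.
for cmd in fork_lint_commands('https://github.com/18jeffreyma/OpenHands.git'):
    print(shlex.join(cmd))
```

The final push uses an explicit `local:remote` refspec because the local branch name differs from the fork's branch name.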

Files changed (lint-only changes):

evaluation/benchmarks/swefficiency/README.md (removed trailing newline/whitespace)
evaluation/benchmarks/swefficiency/run_infer.py (formatting: quote/style, minor whitespace, import order)

Commit:

0debe8e Fix lint via pre-commit for SWE-fficiency files

Checklist:

Ran pre-commit with the specified flags and config.
Auto-fixes applied; re-run passed cleanly.
Pushed changes to 18jeffreyma:main (fork), updating PR SWE-fficiency benchmark implementation #11716.
Only minimal, necessary lint/formatting changes; no extraneous edits.

Next step:

CI “python lint” job should now pass. Let me know if you want me to monitor CI and address any follow-ups.


enyst

Thank you!

It would be awesome if you could consider submitting it to the OpenHands benchmarks repository for the V1 version of OpenHands. It's similar, and it likewise has SWE-bench integrated to help document the tweaks needed to add other benchmarks, but we hope it's much easier both to work with the agent loop and to run.

18jeffreyma · 2025-11-28T01:20:07Z

@enyst would love to! I'll work on a PR and get that up soon!

18jeffreyma added 30 commits June 2, 2025 01:57

sweperf initial eval code

367d562

working benchmark run

e6f8d17

sweperf eval

68e37d8

Merge branch 'All-Hands-AI:main' into main

52f764a

updated with cmd to run

269f605

added directions

0555dca

latest

828fd11

updated prompt

cf4a0ee

fix infer and sid

acc69a1

latest instruction

ef75475

latest

cf3c0a9

latest concurrent fix

ba910e2

test implementaiton (didn't work)

d121465

Revert "test implementaiton (didn't work)"

b0329c4

This reverts commit d121465.

latest remote runtime changes

84d0516

add remote

99ffa83

Merge branch 'main' of github.com:18jeffreyma/OpenHands

919fe14

try a fix for docker nat tables

c0ca008

Merge branch 'main' of github.com:18jeffreyma/OpenHands

31b7eac

add sweperf fixes

e027811

added reverts

eaa2259

try setting higher webserver priority

15d678e

attempted fix for parallelization

3440b79

revert noise

e41a32a

add back semaphore

aec9633

latest action execution client

e6b8580

revert to just only add swefficiency

96489e6

latest

a084096

remove lifecycle lock

d459083

latest infer

52d5e96

18jeffreyma added 7 commits November 10, 2025 21:59

move terms

977578c

update

56a7cae

readd hooks

0af1041

chmod

60fab21

remove files

3404870

fix

2731842

Merge branch 'OpenHands:main' into main

0afdd7b

18jeffreyma requested review from neubig and xingyaoww as code owners November 12, 2025 14:17

neubig reviewed Nov 13, 2025

Merge branch 'OpenHands:main' into main

8b7781a

enyst reviewed Nov 25, 2025

18jeffreyma added 3 commits November 27, 2025 01:19

revert

20ca65d

rename

7fab698

revert containers too

a9bbb98

Merge branch 'main' into main

d8cb48a

Fix lint via pre-commit for SWE-fficiency files\n\nCo-authored-by: op…

0debe8e

…enhands <openhands@all-hands.dev>

enyst approved these changes Nov 27, 2025

enyst merged commit 974bcdf into OpenHands:main Nov 27, 2025
24 of 25 checks passed

TuringND mentioned this pull request Dec 9, 2025

openhands upgrade ParentSquare/OpenHands#17

Open

8 tasks

Copilot AI mentioned this pull request Jan 9, 2026

Add SWE-fficiency benchmark implementation 18jeffreyma/benchmarks#1

Draft

19 tasks
