Skip to content

SWE-fficiency benchmark implementation#11716

Merged
enyst merged 54 commits intoOpenHands:mainfrom
18jeffreyma:main
Nov 27, 2025
Merged

SWE-fficiency benchmark implementation#11716
enyst merged 54 commits intoOpenHands:mainfrom
18jeffreyma:main

Conversation

@18jeffreyma
Copy link
Copy Markdown
Contributor

Summary of PR

Adding SWE-fficiency inference implementation to OpenHands benchmark. https://swefficiency.com/

Change Type

  • Bug fix
  • [ x ] New feature
  • Breaking change
  • Refactor
  • Other (dependency update, docs, typo fixes, etc.)

Checklist

  • [ x ] I have read and reviewed the code and I understand what the code is doing.
  • I have tested the code to the best of my ability and ensured it works as expected.

Fixes

In addition to addding benchmark implementation, this change also fixes an issue with parallel evaluation (where one finish evaluation would crash others). This was due to one eval finishing and terminating other containers when rm_all_containers is True.

Release Notes

  • [ x ] Include this change in the Release Notes.

@enyst
Copy link
Copy Markdown
Collaborator

enyst commented Nov 12, 2025

Hey @18jeffreyma this is great and so fast, thank you! However, please note that this is the directory for OpenHands V0. We are transitioning to the next major version, OpenHands V1, and we have another repo, benchmarks, for V1. I understand that SWE-fficiency was run on V0? At this time, we are already using the new repo…

I’m just concerned that under the current transition, maybe you wouldn’t actually want to put work in adapting to this repo… 😓 unless I’m missing something and you are sure this should be here.

@18jeffreyma
Copy link
Copy Markdown
Contributor Author

ah thanks for the catch! I forked OpenHands back in March well before the migration. Is there a guide to the new repository I can instead rebase against?

Happy to implement a V1 version and keep this fork for historical purposes, and instead have a new implementation this be part of the new V1 if you can share more details on differences?

@enyst
Copy link
Copy Markdown
Collaborator

enyst commented Nov 12, 2025

V1 is a major architectural rebuild of OpenHands, based on agent-sdk

Cc: @juanmichelini and @xingyaoww know better what changes the benchmark code adaptation needed

Copy link
Copy Markdown
Contributor

@neubig neubig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @18jeffreyma , I think we can merge in this for now. We're eventually going to port over the benchmarks to the new https://github.com/OpenHands/benchmarks repo, but I think having this documented is better than not (and porting might take a little bit).

Would you mind splitting the fixes of the docker containers into a separate PR? I think we can merge in the evaluation/benchmarks/swefficiency changes right away, since they're low risk but we might need to take a closer look at the docker runtime changes to make sure they don't have unintended consequences.

Comment thread .openhands/precommit.sh
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an unintended change, could we revert it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good will clean up this PR and reply back when done!

@mamoodi
Copy link
Copy Markdown
Collaborator

mamoodi commented Nov 24, 2025

@neubig should this be moved over to the benchmarks repo?

except Exception as e:
logger.warning(f'Error during base image cleanup: {e}')

def _find_dependent_images(self, docker_client: docker.DockerClient, base_image: str) -> list[str]:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we perhaps let the docker changes out of this PR?

I think you are correct that there were issues, but we are transitioning to a better runtime handling in V1. I think it's like, anytime now this docker code will likely be deleted anyway, and on the PR docker tests are failing so they are blocking merge...

@18jeffreyma
Copy link
Copy Markdown
Contributor Author

18jeffreyma commented Nov 27, 2025

Removed the docker implementation fixes + fixed previous merge issue, should be ready for review!

Apologies for the delay!

@enyst
Copy link
Copy Markdown
Collaborator

enyst commented Nov 27, 2025

@OpenHands CI lint / python lint job is failing on this PR. Run pre-commit run --all-files --show-diff-on-failure --config ./dev_config/python/.pre-commit-config.yaml to autofix it, and push your changes

Note that its branch is 18jeffreyma:main so from a fork, use the right remote.

@openhands-ai
Copy link
Copy Markdown

openhands-ai Bot commented Nov 27, 2025

I'm on it! enyst can track my progress at all-hands.dev

@openhands-ai
Copy link
Copy Markdown

openhands-ai Bot commented Nov 27, 2025

Summary of work completed

Actions taken:

  • Checked out the PR head branch from the fork: created local branch pr-11716 tracking fork/main (18jeffreyma:main).
  • Ran the requested linting exactly:
    • poetry run pre-commit run --all-files --show-diff-on-failure --config ./dev_config/python/.pre-commit-config.yaml
    • Addressed auto-fixes and re-ran to confirm all hooks passed (ruff, ruff-format, mypy, etc.).
  • Committed and pushed fixes to the correct remote/branch (18jeffreyma:main).

Files changed (lint-only changes):

  • evaluation/benchmarks/swefficiency/README.md (removed trailing newline/whitespace)
  • evaluation/benchmarks/swefficiency/run_infer.py (formatting: quote/style, minor whitespace, import order)

Commit:

  • 0debe8e Fix lint via pre-commit for SWE-fficiency files

Checklist:

  • Ran pre-commit with the specified flags and config.
  • Auto-fixes applied; re-run passed cleanly.
  • Pushed changes to 18jeffreyma:main (fork), updating PR SWE-fficiency benchmark implementation #11716.
  • Only minimal, necessary lint/formatting changes; no extraneous edits.

Next step:

  • CI “python lint” job should now pass. Let me know if you want me to monitor CI and address any follow-ups.

View full conversation

Copy link
Copy Markdown
Collaborator

@enyst enyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

It would be awesome if you could consider to submit it to OpenHands benchmarks repository for the V1 version of OpenHands. It's similar, and of course similarly it had SWE-bench integrated to help document the tweaks to add other benchmarks, but we hope it's much easier both to work with the agent loop and to run.

@enyst enyst merged commit 974bcdf into OpenHands:main Nov 27, 2025
24 of 25 checks passed
@18jeffreyma
Copy link
Copy Markdown
Contributor Author

@enyst would love to! I'll work on a PR and get that up soon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants