SWE-fficiency benchmark implementation#11716
Conversation
This reverts commit d121465.
|
Hey @18jeffreyma this is great and so fast, thank you! However, please note that this is the directory for OpenHands V0. We are transitioning to the next major version, OpenHands V1, and we have another repo, benchmarks, for V1. I understand that SWE-fficiency was run on V0? At this time, we are already using the new repo… I’m just concerned that under the current transition, maybe you wouldn’t actually want to put work in adapting to this repo… 😓 unless I’m missing something and you are sure this should be here. |
|
ah thanks for the catch! I forked OpenHands back in March well before the migration. Is there a guide to the new repository I can instead rebase against? Happy to implement a V1 version and keep this fork for historical purposes, and instead have a new implementation this be part of the new V1 if you can share more details on differences? |
|
V1 is a major architectural rebuild of OpenHands, based on agent-sdk
Cc: @juanmichelini and @xingyaoww know better what changes the benchmark code adaptation needed |
neubig
left a comment
There was a problem hiding this comment.
Hi @18jeffreyma , I think we can merge in this for now. We're eventually going to port over the benchmarks to the new https://github.com/OpenHands/benchmarks repo, but I think having this documented is better than not (and porting might take a little bit).
Would you mind splitting the fixes of the docker containers into a separate PR? I think we can merge in the evaluation/benchmarks/swefficiency changes right away, since they're low risk but we might need to take a closer look at the docker runtime changes to make sure they don't have unintended consequences.
There was a problem hiding this comment.
I think this is an unintended change, could we revert it?
There was a problem hiding this comment.
sounds good will clean up this PR and reply back when done!
|
@neubig should this be moved over to the benchmarks repo? |
| except Exception as e: | ||
| logger.warning(f'Error during base image cleanup: {e}') | ||
|
|
||
| def _find_dependent_images(self, docker_client: docker.DockerClient, base_image: str) -> list[str]: |
There was a problem hiding this comment.
Could we perhaps let the docker changes out of this PR?
I think you are correct that there were issues, but we are transitioning to a better runtime handling in V1. I think it's like, anytime now this docker code will likely be deleted anyway, and on the PR docker tests are failing so they are blocking merge...
|
Removed the docker implementation fixes + fixed previous merge issue, should be ready for review! Apologies for the delay! |
|
@OpenHands CI lint / python lint job is failing on this PR. Run pre-commit run --all-files --show-diff-on-failure --config ./dev_config/python/.pre-commit-config.yaml to autofix it, and push your changes Note that its branch is |
|
I'm on it! enyst can track my progress at all-hands.dev |
…enhands <openhands@all-hands.dev>
|
Summary of work completed Actions taken:
Files changed (lint-only changes):
Commit:
Checklist:
Next step:
|
enyst
left a comment
There was a problem hiding this comment.
Thank you!
It would be awesome if you could consider to submit it to OpenHands benchmarks repository for the V1 version of OpenHands. It's similar, and of course similarly it had SWE-bench integrated to help document the tweaks to add other benchmarks, but we hope it's much easier both to work with the agent loop and to run.
|
@enyst would love to! I'll work on a PR and get that up soon! |
Summary of PR
Adding SWE-fficiency inference implementation to OpenHands benchmark. https://swefficiency.com/
Change Type
Checklist
Fixes
In addition to addding benchmark implementation, this change also fixes an issue with parallel evaluation (where one finish evaluation would crash others). This was due to one eval finishing and terminating other containers when rm_all_containers is True.
Release Notes