
add swe bench / swe extra setup scripts #424

Closed
bjsi wants to merge 1 commit into gptme:master from bjsi:swe-bench-stuff

Conversation

@bjsi
Contributor

@bjsi bjsi commented Jan 26, 2025

I extracted all of the changes from my branch and dumped them in this folder. It will still require a bit of work to integrate into the main branch, but the challenging part of installing requirements and dependencies should be taken care of.


Important

Add SWE-bench evaluation setup with scripts for running, data handling, and test specifications in gptme/eval/swe_extra.

  • Behavior:
    • Adds run_swe_extra.py to execute SWE-bench evaluations with options to resume from previous runs and clear branches.
    • Introduces swe_bench_extra_data.py for loading and analyzing SWE-bench task instances and trajectories.
    • Implements caching for top 50 easiest task instances in swe_bench_extra_data.py.
  • Constants and Configurations:
    • Defines placeholders and default mappings in swe_bench_constants.py for requirements paths, test frameworks, and version installations.
  • Test Specifications:
    • Adds swe_bench_test_spec.py to create and manage test specifications for SWE-bench instances.
    • Provides functions to generate setup and evaluation scripts for repositories in swe_bench_test_spec.py.
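
The on-disk caching of the top-50 easiest instances could look roughly like this (a minimal sketch; the function name, `cache_path` parameter, and `"difficulty"` key are assumptions for illustration, not the actual `swe_bench_extra_data.py` API):

```python
import json
from pathlib import Path

def load_top_50_easiest(fetch_instances, cache_path: Path) -> list[dict]:
    """Return the 50 easiest task instances, caching the result on disk.

    `fetch_instances` is any callable returning a list of instance dicts;
    the "difficulty" key used for ranking is illustrative.
    """
    if cache_path.exists():
        # Cache hit: skip the (potentially expensive) dataset fetch entirely.
        return json.loads(cache_path.read_text())
    instances = sorted(fetch_instances(), key=lambda i: i.get("difficulty", 0))[:50]
    cache_path.write_text(json.dumps(instances, indent=2))
    return instances
```

Subsequent calls read the JSON file instead of re-fetching and re-ranking the dataset.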

This description was created by Ellipsis for 49cc0d7. It will automatically update as commits are pushed.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


👍 Looks good to me! Reviewed everything up to 49cc0d7 in 1 minute and 26 seconds

More details
  • Looked at 771 lines of code in 5 files
  • Skipped 0 files when reviewing.
  • Skipped posting 8 drafted comments based on config settings.
1. gptme/eval/swe_extra/swe_bench_test_spec.py:128
  • Draft comment:
    Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoid shell=True if possible.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    While using shell=True can indeed be a security risk, this code is part of a test harness that already needs to execute arbitrary shell commands to run tests. The scripts being run are constructed from trusted sources (git commands, test frameworks, etc). Converting these to use argument lists would be complex and wouldn't meaningfully improve security since the code needs to execute shell commands anyway. The comment is technically correct but not practically useful in this context.
    I might be underestimating the security risk. Even in a test environment, command injection could potentially be dangerous if the input data is not fully trusted.
    The code is specifically designed to run arbitrary test commands in a controlled environment. The security risk from shell=True is minimal compared to the inherent risk of running tests, and making this change would add complexity without meaningful benefit.
    While technically correct, this comment suggests a change that would add complexity without meaningful security benefits in this specific context. The comment should be removed.
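
For reference, the kind of change the draft comment is gesturing at looks like this (a generic sketch, not the actual `swe_bench_test_spec.py` code):

```python
import subprocess
import sys

# With shell=True, the command string is parsed by the shell, so any
# untrusted substring can inject extra commands:
#   subprocess.run(f"git checkout {branch}", shell=True)  # risky if branch is untrusted

# With an argument list, each element is passed to the program verbatim
# and no shell parsing happens:
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

As the review notes, this matters less when the command strings come from trusted sources, which is why the comment was dropped here.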
2. gptme/eval/swe_extra/swe_bench_test_spec.py:134
  • Draft comment:
    Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoid shell=True if possible.
  • Reason this comment was not posted:
    Marked as duplicate.
3. gptme/eval/swe_extra/swe_bench_test_spec.py:140
  • Draft comment:
    Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoid shell=True if possible.
  • Reason this comment was not posted:
    Marked as duplicate.
4. gptme/eval/swe_extra/swe_bench_extra_data.py:115
  • Draft comment:
    Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
5. gptme/eval/swe_extra/swe_bench_extra_data.py:116
  • Draft comment:
    Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
6. gptme/eval/swe_extra/swe_bench_extra_data.py:117
  • Draft comment:
    Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
7. gptme/eval/swe_extra/swe_bench_extra_data.py:118
  • Draft comment:
    Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
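
The concern in these drafted comments boils down to: when assigning a whole column from a list, the list length must equal the row count. A library-agnostic sketch of that guard (pandas itself raises `ValueError` on a mismatch; this just makes the check explicit, and `assign_column` is a hypothetical helper, not project code):

```python
def assign_column(rows: list[dict], name: str, values: list) -> None:
    """Attach `values` as a new column on `rows`, refusing length mismatches.

    Mirrors what pandas enforces for `df.loc[:, name] = values`:
    exactly one value per row.
    """
    if len(values) != len(rows):
        raise ValueError(
            f"column {name!r}: got {len(values)} values for {len(rows)} rows"
        )
    for row, value in zip(rows, values):
        row[name] = value
```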
8. gptme/eval/swe_extra/swe_bench_test_spec.py:37
  • Draft comment:
    Using os.path.join for URLs can lead to incorrect URL formation. Consider using string concatenation or urllib.parse.urljoin for URLs.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    While technically correct that os.path.join isn't ideal for URLs (especially on Windows where it uses backslashes), in this specific case it's likely working fine since:
  1. The URLs are GitHub raw URLs which use forward slashes
  2. The code is clearly working as evidenced by the requests.get() calls
  3. The paths being joined are simple segments without special characters
  4. The code is part of a test harness, not production code
    The comment raises a valid technical point - os.path.join can cause issues with URLs, especially on Windows systems. This could potentially break in some environments.
    While technically correct, the current implementation appears to be working fine for its purpose in a test harness. The benefit of changing it doesn't outweigh the cost given the low risk in this specific context.
    Delete the comment. While technically correct, the current implementation is working fine for its purpose and the suggested change would add complexity without significant benefit in this context.
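
The Windows caveat in this last draft comment is easy to demonstrate with the stdlib: `ntpath.join` (what `os.path.join` resolves to on Windows) inserts a backslash, while `urllib.parse.urljoin` is separator-safe on every platform (the URL below is illustrative):

```python
import ntpath
from urllib.parse import urljoin

# On Windows, os.path.join is ntpath.join; without a trailing slash it
# inserts a backslash separator, corrupting the URL:
broken = ntpath.join("https://example.com/repo/main", "requirements.txt")
assert "\\" in broken

# urljoin is platform-independent; note the base needs a trailing slash,
# otherwise the last path segment is replaced rather than extended:
url = urljoin("https://example.com/repo/main/", "requirements.txt")
assert url == "https://example.com/repo/main/requirements.txt"
```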

Workflow ID: wflow_XyI8KJdp6my1PswB


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Member

@ErikBjare ErikBjare left a comment


Looks nice, want to get this semi-working, then merged in some shape (doesn't have to be perfect), and then I'll refine it from there in a new PR.

Could you provide the missing things? :)

from pathlib import Path
import glob
import os
from gptme.eval.agents.swebench import SWEBenchAgent
Member


Can I see this plz? :)

Contributor Author


It was quite janky lol

import json
import logging
import os
from pathlib import Path
import time
import uuid

from gptme.cli import get_name
from gptme.dirs import get_logs_dir
from gptme.eval.agents.fix import Fix
from gptme.eval.agents.reproduce import Reproduce
from gptme.eval.agents.understand import Understand
from gptme.eval.swe_extra.swe_bench_test_spec import instance_to_trajectory_info, make_test_spec
from gptme.llm import set_stop
from gptme.logmanager import LogManager, SWEBenchInfo 
from gptme.message import print_msg
from gptme.tools import execute_msg, init_tools
from gptme.tools.read import reset_file_read_cache, save_file_read_cache
from swebench.harness.constants import SWEbenchInstance

logger = logging.getLogger(__name__)

class SWEBenchAgent:
    stages = ["understand", "reproduce", "fix"]
    def act(
        self,
        model: str,
        instance: SWEbenchInstance,
        repo_dir: str,
        log_dir: str,
        resume: bool = False,
        start_stage: str = "understand",
        **kwargs
    ):

        # Initialize or load trajectory info
        trajectory_info = instance_to_trajectory_info(
            instance, 
            model, 
            repo_dir=repo_dir,
            log_dir=log_dir if resume else None
        )
        
        if not resume:
            trajectory_info.save_to_log_dir(log_dir)
            
        # Understand
        if self.stages.index(start_stage) <= self.stages.index("understand"): 
            Understand().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("understand", {}))

        set_stop(["</planning>"])

        # Reproduce
        if self.stages.index(start_stage) <= self.stages.index("reproduce"):
            Reproduce().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("reproduce", {}))

        set_stop(["</planning>"])
        # Fix
        if self.stages.index(start_stage) <= self.stages.index("fix"):
            Fix().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("fix", {}))
            
        # reset_file_read_cache() # maybe remove
        return trajectory_info.artifacts
    
    def get_resume_stage(self, log_dir: str) -> str:
        understand_manager = LogManager.load(log_dir, lock=False, create=True, branch="understand")
        reproduce_manager = LogManager.load(log_dir, lock=False, create=True, branch="reproduce")
        fix_manager = LogManager.load(log_dir, lock=False, create=True, branch="fix")
        if not understand_manager.log.messages:
            return "understand"
        elif not reproduce_manager.log.messages:
            return "reproduce"
        elif not fix_manager.log.messages:
            return "fix"
        return "understand"
    
    def replay(self, log_dir: str):
        logger.info(f"Replaying from log directory: {log_dir}")
        info = SWEBenchInfo.load_from_log_dir(log_dir)
        os.chdir(info.repo_dir)
        init_tools()
        understand_manager = LogManager.load(log_dir, lock=False, create=True, branch="understand")
        reproduce_manager = LogManager.load(log_dir, lock=False, create=True, branch="reproduce")
        fix_manager = LogManager.load(log_dir, lock=False, create=True, branch="fix")
        for msg in understand_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)
        files = {}
        save_file_read_cache(ignore_files=["understanding.md", "read_cache.json"])
        read_file_json = Path(info.repo_dir) / "read_cache.json"
        with open(read_file_json, "r") as f:
            files.update({"read_cache.json": json.load(f)})
        info.artifacts.update(files)
        info.save_to_log_dir(log_dir)
        for msg in reproduce_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)
        for msg in fix_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)

    def evaluate_instance(
        self,
        instance: SWEbenchInstance,
        model: str = "openrouter/qwen/qwen-2.5-coder-32b-instruct",
        resume_dir: Path | None = None,
        **kwargs
    ):
        instance_id = instance["instance_id"]
        problem_statement = instance["problem_statement"]
        info = SWEBenchInfo.load_from_log_dir(resume_dir) if resume_dir else None
        if resume_dir and not info:
            raise ValueError(f"No info found in {resume_dir}")

        test_spec = make_test_spec(instance, info.repo_dir if info else None)

        logger.info(f"Evaluating instance: {instance_id}")
        logger.debug(f"Problem statement: {problem_statement}")

        if resume_dir:
            log_dir = resume_dir
            logger.info(f"Resuming from log directory: {log_dir}")
            test_spec.reset_repo()
            self.replay(log_dir)
            repo_dir = info.repo_dir
        else:
            _id = uuid.uuid4().hex[:8]
            name = get_name(f"gptme-evals-{model.replace('/', '--')}-{_id}")
            log_dir = get_logs_dir() / name
            repo_dir = test_spec.setup_repo()

        start_time = time.time()
        try:
            logger.info(f"Executing agent for instance {instance_id}")
            logger.info(f"Setting up repo for instance {instance_id}")
            logger.info(f"Finished setting up repo for instance {instance_id} {repo_dir}")
            
            self.act(
                model=model, 
                instance=instance, 
                repo_dir=repo_dir, 
                log_dir=log_dir,
                resume=bool(resume_dir),
                start_stage=self.get_resume_stage(log_dir) if resume_dir else "understand",
                **kwargs
            )
            
            gen_time = time.time() - start_time
            logger.info(
                f"Agent execution completed for instance {instance_id} in {gen_time:.2f} seconds"
            )
            passed = test_spec.eval_repo()
            logger.info(f"Evaluation completed for instance {instance_id}. Passed: {passed}")
        except Exception as e:
            import traceback
            logger.error(f"Error during agent execution for instance {instance_id}: {e}\n{''.join(traceback.format_tb(e.__traceback__))}")
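
The `stages.index(start_stage) <= stages.index(...)` checks above are the whole resume mechanism: every stage at or after the start stage reruns. Isolated as a standalone sketch (the helper name is mine, not part of the PR):

```python
STAGES = ["understand", "reproduce", "fix"]

def stages_to_run(start_stage: str, stages: list[str] = STAGES) -> list[str]:
    """Return the stages that execute when resuming at `start_stage`.

    A stage runs iff its position is at or after the start stage, matching
    the index comparisons in SWEBenchAgent.act().
    """
    start = stages.index(start_stage)
    return [stage for i, stage in enumerate(stages) if start <= i]
```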

from gptme.eval.agents.swebench import SWEBenchAgent
from gptme.eval.swe_extra.swe_bench_extra_data import load_instance_by_id, load_top_50_easiest_task_instances
from gptme.dirs import get_logs_dir
from gptme.logmanager import LogManager, SWEBenchInfo
Member


And SWEBenchInfo?

Contributor Author


import json
from dataclasses import dataclass, field
from datetime import datetime
from os import PathLike
from pathlib import Path
from typing import Optional

@dataclass(frozen=True)
class SWEBenchInfo:
    instance_id: str
    model_name: str
    target: bool  # Changed from int to bool to match dataset format
    exit_status: str | None = None
    generated_patch: str | None = None
    eval_logs: str | None = None
    artifacts: dict[str, str] = field(default_factory=dict)
    timestamp: datetime = field(default_factory=datetime.now)
    repo_dir: str | None = None

    def to_dict(self) -> dict:
        """Convert to dictionary format matching SWE-bench dataset."""
        return {
            "instance_id": self.instance_id,
            "model_name": self.model_name,
            "target": self.target,
            "exit_status": self.exit_status,
            "generated_patch": self.generated_patch,
            "eval_logs": self.eval_logs,
            "timestamp": self.timestamp.isoformat(),
            "artifacts": self.artifacts,
            "repo_dir": self.repo_dir,
        }

    @classmethod
    def from_dict(cls, d: dict) -> "SWEBenchInfo":
        """Create from dictionary, handling optional fields."""
        # Convert timestamp string to datetime if present
        if "timestamp" in d:
            d = d.copy()  # Make a copy to avoid modifying input
            d["timestamp"] = datetime.fromisoformat(d["timestamp"])
        return cls(**d)
    
    @classmethod
    def load_from_log_dir(cls, log_dir: PathLike) -> Optional["SWEBenchInfo"]:
        log_dir = Path(log_dir)
        swe_bench_info_file = log_dir / "swe_bench_info.json"
        if not swe_bench_info_file.exists():
            return None
        with swe_bench_info_file.open() as f:
            return cls.from_dict(json.load(f))
    
    def save_to_log_dir(self, log_dir: PathLike) -> None:
        log_dir = Path(log_dir)
        swe_bench_info_file = log_dir / "swe_bench_info.json"
        swe_bench_info_file.parent.mkdir(parents=True, exist_ok=True)
        with swe_bench_info_file.open("w") as f:
            json.dump(self.to_dict(), f, indent=2)
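
A quick standalone sanity check of the serialization pattern above, trimmed to the timestamp handling, which is the only non-trivial part (class and instance names are illustrative):

```python
import json
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(frozen=True)
class Info:
    instance_id: str
    timestamp: datetime = field(default_factory=datetime.now)

    def to_dict(self) -> dict:
        # datetime is not JSON-serializable; store ISO 8601 text instead.
        return {"instance_id": self.instance_id,
                "timestamp": self.timestamp.isoformat()}

    @classmethod
    def from_dict(cls, d: dict) -> "Info":
        d = d.copy()  # avoid mutating the caller's dict
        d["timestamp"] = datetime.fromisoformat(d["timestamp"])
        return cls(**d)

# Round-tripping through JSON preserves the datetime exactly,
# since isoformat()/fromisoformat() keep microsecond precision.
original = Info("example__repo-1")
restored = Info.from_dict(json.loads(json.dumps(original.to_dict())))
assert restored == original
```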

@bjsi
Contributor Author

bjsi commented Jan 29, 2025

I was using the branch system to save each step in the understand, reproduce, fix cycle separately so I could restore from checkpoints

@ErikBjare
Member

@bjsi Nice! Was the branch system a mess or did it make sense to you/work well?

@bjsi
Contributor Author

bjsi commented Jan 29, 2025

@ErikBjare I didn't do extensive testing, but it seemed to work pretty well. I didn't notice any bugs.

@ErikBjare
Member

@TimeToBuildBob Recover this and #142 into a new clean PR (preserving credit for @bjsi), then close both.

TimeToBuildBob added a commit to TimeToBuildBob/gptme that referenced this pull request Feb 25, 2026
Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework
  (originally by @ErikBjare, PR gptme#142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)

- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis
  (originally by @bjsi, PR gptme#424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known
limitations:
- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
@TimeToBuildBob
Member

Recovered and combined with #142 into clean PR #1489. Closing as requested by @ErikBjare.

TimeToBuildBob added a commit to TimeToBuildBob/gptme that referenced this pull request Feb 25, 2026
…ent swebench API

Per Erik's review: integrate the missing SWEBenchInfo dataclass and
SWEBenchAgent class from bjsi's original SWE-bench work (PR gptme#424).

Changes:
- Add SWEBenchInfo dataclass to logmanager.py (where swe_extra imports it)
- Convert agents.py to agents/ package for extensibility
- Add agents/swebench.py with SWEBenchAgent orchestration skeleton
  (understand/reproduce/fix stages are stubs pending sub-agent implementation)
- Update swe_bench_constants.py for current swebench API (MAP_REPO_VERSION_TO_SPECS_PY
  replaces MAP_VERSION_TO_INSTALL/MAP_REPO_TO_TEST_FRAMEWORK)
- Fix swe_bench_test_spec.py import (get_test_directives moved to
  swebench.harness.test_spec.python)
- Fix mypy errors across swe_extra modules
- Remove WIP mypy ignore_errors override (no longer needed)
TimeToBuildBob added a commit that referenced this pull request Feb 25, 2026
…r current swebench API

Per review, integrate bjsi's SWEBenchInfo and SWEBenchAgent code from PR #424:

- Add SWEBenchInfo frozen dataclass to logmanager.py for evaluation metadata
- Add SWEBenchAgent class in eval/agents/swebench.py with multi-stage pipeline
- Convert eval/agents.py to package (eval/agents/__init__.py + swebench.py)
- Update swe_bench_constants.py for current swebench library API
- Fix swe_bench_test_spec.py imports for restructured swebench package
- Fix type errors in swe_bench_extra_data.py and run_swe_extra.py
- Add matplotlib to mypy ignore_missing_imports
- Remove WIP ignore_errors override for swe_extra modules
TimeToBuildBob added a commit that referenced this pull request Feb 26, 2026
Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework
  (originally by @ErikBjare, PR #142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)

- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis
  (originally by @bjsi, PR #424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known
limitations:
- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
TimeToBuildBob added a commit that referenced this pull request Feb 26, 2026
…r current swebench API

Per review, integrate bjsi's SWEBenchInfo and SWEBenchAgent code from PR #424:

- Add SWEBenchInfo frozen dataclass to logmanager.py for evaluation metadata
- Add SWEBenchAgent class in eval/agents/swebench.py with multi-stage pipeline
- Convert eval/agents.py to package (eval/agents/__init__.py + swebench.py)
- Update swe_bench_constants.py for current swebench library API
- Fix swe_bench_test_spec.py imports for restructured swebench package
- Fix type errors in swe_bench_extra_data.py and run_swe_extra.py
- Add matplotlib to mypy ignore_missing_imports
- Remove WIP ignore_errors override for swe_extra modules
TimeToBuildBob added a commit to TimeToBuildBob/gptme that referenced this pull request Feb 26, 2026
Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework
  (originally by @ErikBjare, PR gptme#142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)

- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis
  (originally by @bjsi, PR gptme#424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known
limitations:
- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
TimeToBuildBob added a commit to TimeToBuildBob/gptme that referenced this pull request Feb 26, 2026
…r current swebench API

Per review, integrate bjsi's SWEBenchInfo and SWEBenchAgent code from PR gptme#424:

- Add SWEBenchInfo frozen dataclass to logmanager.py for evaluation metadata
- Add SWEBenchAgent class in eval/agents/swebench.py with multi-stage pipeline
- Convert eval/agents.py to package (eval/agents/__init__.py + swebench.py)
- Update swe_bench_constants.py for current swebench library API
- Fix swe_bench_test_spec.py imports for restructured swebench package
- Fix type errors in swe_bench_extra_data.py and run_swe_extra.py
- Add matplotlib to mypy ignore_missing_imports
- Remove WIP ignore_errors override for swe_extra modules
ErikBjare pushed a commit that referenced this pull request Feb 26, 2026
…424) (#1489)

* feat(eval): add SWE-bench evaluation modules

Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework
  (originally by @ErikBjare, PR #142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)

- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis
  (originally by @bjsi, PR #424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known
limitations:
- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>

* fix(eval): remove os.chdir() and duplicate import in swebench utils

- Replace os.chdir(repo_dir) with cwd=repo_dir in subprocess calls to
  preserve working directory persistence (per revert in #1487)
- Consolidate duplicate 'from datasets import' into single import line

* fix(eval): address greptile review - fix repo_dir in files dict and missing model arg

- evaluate.py: replace incorrect {"repo_dir": repo_dir} Files dict with
  proper prompt-embedded context; add missing log_dir/workspace_dir to
  EvalResult instances
- run_swe_extra.py: add --model CLI arg so cli() can call main() correctly
  for non-resume evaluations

* fix(eval): address greptile review feedback on swebench module

* fix(eval): copy repo to agent workspace and capture diff via git

* fix(eval): enable mypy for swebench modules per review

Remove WIP mypy ignore_errors for gptme.eval.swebench.* — code is clean.
Keep ignore_errors for gptme.eval.swe_extra.* (depends on unmerged SWEBenchInfo).

* feat(eval): integrate SWEBenchInfo/SWEBenchAgent and fix swe_extra for current swebench API

Per review, integrate bjsi's SWEBenchInfo and SWEBenchAgent code from PR #424:

- Add SWEBenchInfo frozen dataclass to logmanager.py for evaluation metadata
- Add SWEBenchAgent class in eval/agents/swebench.py with multi-stage pipeline
- Convert eval/agents.py to package (eval/agents/__init__.py + swebench.py)
- Update swe_bench_constants.py for current swebench library API
- Fix swe_bench_test_spec.py imports for restructured swebench package
- Fix type errors in swe_bench_extra_data.py and run_swe_extra.py
- Add matplotlib to mypy ignore_missing_imports
- Remove WIP ignore_errors override for swe_extra modules

* fix(eval): restore cwd after os.chdir in SWEBenchAgent.replay()

* refactor(eval): move SWEBenchInfo from logmanager to eval/swebench/info.py

Per Erik's review: SWEBenchInfo doesn't belong in logmanager.py, move it
into the eval/swebench module where it belongs.

- New: gptme/eval/swebench/info.py with SWEBenchInfo dataclass
- Updated gptme/eval/swebench/__init__.py to export SWEBenchInfo
- Removed SWEBenchInfo from gptme/logmanager.py
- Updated all imports in eval/agents/swebench.py, swe_extra/{run_swe_extra,
  swe_bench_test_spec,swe_bench_extra_data}.py

---------

Co-authored-by: TimeToBuildBob <TimeToBuildBob@users.noreply.github.com>
Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
TimeToBuildBob added a commit that referenced this pull request Mar 7, 2026
Erik suggested [agent.links] as the canonical section name to align with
how pyproject.toml and Cargo.toml group links (gptme/gptme-contrib#382).

Changes:
- AgentConfig: add `links: dict[str, str] | None = None` as canonical field
- AgentConfig: keep `urls` as deprecated fallback for backwards compat
- logmanager: prefer `agent.links` over `agent.urls` when building agent_urls
- docs/config.rst: document [agent.links] + deprecation note on [agent.urls]
- tests: add test_agent_links_canonical_key and test_agent_links_preferred_over_urls

The webui already reads `agent_urls` from the backend and shows a
Dashboard link in the AgentsList panel — this makes [agent.links] work
end-to-end alongside the gptme-contrib dashboard generator (PR #424).

3 participants