
add swe bench / swe extra setup scripts #424

Closed
bjsi wants to merge 1 commit into gptme:master from bjsi:swe-bench-stuff

Conversation

@bjsi
Contributor

@bjsi bjsi commented Jan 26, 2025

I extracted all of the changes from my branch and dumped them in this folder. It will still require a bit of work to integrate into the main branch, but the challenging part of installing requirements and dependencies should be taken care of.


Important

Add SWE-bench evaluation setup with scripts for running, data handling, and test specifications in gptme/eval/swe_extra.

  • Behavior:
    • Adds run_swe_extra.py to execute SWE-bench evaluations with options to resume from previous runs and clear branches.
    • Introduces swe_bench_extra_data.py for loading and analyzing SWE-bench task instances and trajectories.
    • Implements caching for top 50 easiest task instances in swe_bench_extra_data.py.
  • Constants and Configurations:
    • Defines placeholders and default mappings in swe_bench_constants.py for requirements paths, test frameworks, and version installations.
  • Test Specifications:
    • Adds swe_bench_test_spec.py to create and manage test specifications for SWE-bench instances.
    • Provides functions to generate setup and evaluation scripts for repositories in swe_bench_test_spec.py.
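
The on-disk caching of the top-50 easiest instances could look roughly like this (a minimal sketch; the function name, `cache_path` parameter, and `"difficulty"` key are assumptions for illustration, not the actual `swe_bench_extra_data.py` API):

```python
import json
from pathlib import Path

def load_top_50_easiest(fetch_instances, cache_path: Path) -> list[dict]:
    """Return the 50 easiest task instances, caching the result on disk.

    `fetch_instances` is any callable returning a list of instance dicts;
    the "difficulty" key used for ranking is illustrative.
    """
    if cache_path.exists():
        # Cache hit: skip the (potentially expensive) dataset fetch entirely.
        return json.loads(cache_path.read_text())
    instances = sorted(fetch_instances(), key=lambda i: i.get("difficulty", 0))[:50]
    cache_path.write_text(json.dumps(instances, indent=2))
    return instances
```

Subsequent calls read the JSON file instead of re-fetching and re-ranking the dataset.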

This description was created by Ellipsis for 49cc0d7. It will automatically update as commits are pushed.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


👍 Looks good to me! Reviewed everything up to 49cc0d7 in 1 minute and 26 seconds

More details
  • Looked at 771 lines of code in 5 files
  • Skipped 0 files when reviewing.
  • Skipped posting 8 drafted comments based on config settings.
1. gptme/eval/swe_extra/swe_bench_test_spec.py:128
  • Draft comment:
    Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoid shell=True if possible.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    While using shell=True can indeed be a security risk, this code is part of a test harness that already needs to execute arbitrary shell commands to run tests. The scripts being run are constructed from trusted sources (git commands, test frameworks, etc). Converting these to use argument lists would be complex and wouldn't meaningfully improve security since the code needs to execute shell commands anyway. The comment is technically correct but not practically useful in this context.
    I might be underestimating the security risk. Even in a test environment, command injection could potentially be dangerous if the input data is not fully trusted.
    The code is specifically designed to run arbitrary test commands in a controlled environment. The security risk from shell=True is minimal compared to the inherent risk of running tests, and making this change would add complexity without meaningful benefit.
    While technically correct, this comment suggests a change that would add complexity without meaningful security benefits in this specific context. The comment should be removed.
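
For reference, the kind of change the draft comment is gesturing at looks like this (a generic sketch, not the actual `swe_bench_test_spec.py` code):

```python
import subprocess
import sys

# With shell=True, the command string is parsed by the shell, so any
# untrusted substring can inject extra commands:
#   subprocess.run(f"git checkout {branch}", shell=True)  # risky if branch is untrusted

# With an argument list, each element is passed to the program verbatim
# and no shell parsing happens:
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

As the review notes, this matters less when the command strings come from trusted sources, which is why the comment was dropped here.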
2. gptme/eval/swe_extra/swe_bench_test_spec.py:134
  • Draft comment:
    Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoid shell=True if possible.
  • Reason this comment was not posted:
    Marked as duplicate.
3. gptme/eval/swe_extra/swe_bench_test_spec.py:140
  • Draft comment:
    Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoid shell=True if possible.
  • Reason this comment was not posted:
    Marked as duplicate.
4. gptme/eval/swe_extra/swe_bench_extra_data.py:115
  • Draft comment:
    Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
5. gptme/eval/swe_extra/swe_bench_extra_data.py:116
  • Draft comment:
    Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
6. gptme/eval/swe_extra/swe_bench_extra_data.py:117
  • Draft comment:
    Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
7. gptme/eval/swe_extra/swe_bench_extra_data.py:118
  • Draft comment:
    Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
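
The concern in these drafted comments boils down to: when assigning a whole column from a list, the list length must equal the row count. A library-agnostic sketch of that guard (pandas itself raises `ValueError` on a mismatch; this just makes the check explicit, and `assign_column` is a hypothetical helper, not project code):

```python
def assign_column(rows: list[dict], name: str, values: list) -> None:
    """Attach `values` as a new column on `rows`, refusing length mismatches.

    Mirrors what pandas enforces for `df.loc[:, name] = values`:
    exactly one value per row.
    """
    if len(values) != len(rows):
        raise ValueError(
            f"column {name!r}: got {len(values)} values for {len(rows)} rows"
        )
    for row, value in zip(rows, values):
        row[name] = value
```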
8. gptme/eval/swe_extra/swe_bench_test_spec.py:37
  • Draft comment:
    Using os.path.join for URLs can lead to incorrect URL formation. Consider using string concatenation or urllib.parse.urljoin for URLs.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    While technically correct that os.path.join isn't ideal for URLs (especially on Windows where it uses backslashes), in this specific case it's likely working fine since:
  1. The URLs are GitHub raw URLs which use forward slashes
  2. The code is clearly working as evidenced by the requests.get() calls
  3. The paths being joined are simple segments without special characters
  4. The code is part of a test harness, not production code
    The comment raises a valid technical point - os.path.join can cause issues with URLs, especially on Windows systems. This could potentially break in some environments.
    While technically correct, the current implementation appears to be working fine for its purpose in a test harness. The benefit of changing it doesn't outweigh the cost given the low risk in this specific context.
    Delete the comment. While technically correct, the current implementation is working fine for its purpose and the suggested change would add complexity without significant benefit in this context.
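
The Windows caveat in this last draft comment is easy to demonstrate with the stdlib: `ntpath.join` (what `os.path.join` resolves to on Windows) inserts a backslash, while `urllib.parse.urljoin` is separator-safe on every platform (the URL below is illustrative):

```python
import ntpath
from urllib.parse import urljoin

# On Windows, os.path.join is ntpath.join; without a trailing slash it
# inserts a backslash separator, corrupting the URL:
broken = ntpath.join("https://example.com/repo/main", "requirements.txt")
assert "\\" in broken

# urljoin is platform-independent; note the base needs a trailing slash,
# otherwise the last path segment is replaced rather than extended:
url = urljoin("https://example.com/repo/main/", "requirements.txt")
assert url == "https://example.com/repo/main/requirements.txt"
```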

Workflow ID: wflow_XyI8KJdp6my1PswB


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Member

@ErikBjare ErikBjare left a comment


Looks nice, want to get this semi-working, then merged in some shape (doesn't have to be perfect), and then I'll refine it from there in a new PR.

Could you provide the missing things? :)

from pathlib import Path
import glob
import os
from gptme.eval.agents.swebench import SWEBenchAgent
Member


Can I see this plz? :)

Contributor Author


It was quite janky lol

import json
import logging
import os
from pathlib import Path
import time
import uuid

from gptme.cli import get_name
from gptme.dirs import get_logs_dir
from gptme.eval.agents.fix import Fix
from gptme.eval.agents.reproduce import Reproduce
from gptme.eval.agents.understand import Understand
from gptme.eval.swe_extra.swe_bench_test_spec import instance_to_trajectory_info, make_test_spec
from gptme.llm import set_stop
from gptme.logmanager import LogManager, SWEBenchInfo 
from gptme.message import print_msg
from gptme.tools import execute_msg, init_tools
from gptme.tools.read import reset_file_read_cache, save_file_read_cache
from swebench.harness.constants import SWEbenchInstance

logger = logging.getLogger(__name__)

class SWEBenchAgent:
    stages = ["understand", "reproduce", "fix"]
    def act(
        self,
        model: str,
        instance: SWEbenchInstance,
        repo_dir: str,
        log_dir: str,
        resume: bool = False,
        start_stage: str = "understand",
        **kwargs
    ):

        # Initialize or load trajectory info
        trajectory_info = instance_to_trajectory_info(
            instance, 
            model, 
            repo_dir=repo_dir,
            log_dir=log_dir if resume else None
        )
        
        if not resume:
            trajectory_info.save_to_log_dir(log_dir)
            
        # Understand
        if self.stages.index(start_stage) <= self.stages.index("understand"): 
            Understand().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("understand", {}))

        set_stop(["</planning>"])

        # Reproduce
        if self.stages.index(start_stage) <= self.stages.index("reproduce"):
            Reproduce().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("reproduce", {}))

        set_stop(["</planning>"])
        # Fix
        if self.stages.index(start_stage) <= self.stages.index("fix"):
            Fix().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("fix", {}))
            
        # reset_file_read_cache() # maybe remove
        return trajectory_info.artifacts
    
    def get_resume_stage(self, log_dir: str) -> str:
        understand_manager = LogManager.load(log_dir, lock=False, create=True, branch="understand")
        reproduce_manager = LogManager.load(log_dir, lock=False, create=True, branch="reproduce")
        fix_manager = LogManager.load(log_dir, lock=False, create=True, branch="fix")
        if not understand_manager.log.messages:
            return "understand"
        elif not reproduce_manager.log.messages:
            return "reproduce"
        elif not fix_manager.log.messages:
            return "fix"
        return "understand"
    
    def replay(self, log_dir: str):
        logger.info(f"Replaying from log directory: {log_dir}")
        info = SWEBenchInfo.load_from_log_dir(log_dir)
        os.chdir(info.repo_dir)
        init_tools()
        understand_manager = LogManager.load(log_dir, lock=False, create=True, branch="understand")
        reproduce_manager = LogManager.load(log_dir, lock=False, create=True, branch="reproduce")
        fix_manager = LogManager.load(log_dir, lock=False, create=True, branch="fix")
        for msg in understand_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)
        files = {}
        save_file_read_cache(ignore_files=["understanding.md", "read_cache.json"])
        read_file_json = Path(info.repo_dir) / "read_cache.json"
        with open(read_file_json, "r") as f:
            files.update({"read_cache.json": json.load(f)})
        info.artifacts.update(files)
        info.save_to_log_dir(log_dir)
        for msg in reproduce_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)
        for msg in fix_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)

    def evaluate_instance(
        self,
        instance: SWEbenchInstance,
        model: str = "openrouter/qwen/qwen-2.5-coder-32b-instruct",
        resume_dir: Path | None = None,
        **kwargs
    ):
        instance_id = instance["instance_id"]
        problem_statement = instance["problem_statement"]
        info = SWEBenchInfo.load_from_log_dir(resume_dir) if resume_dir else None
        if resume_dir and not info:
            raise ValueError(f"No info found in {resume_dir}")

        test_spec = make_test_spec(instance, info.repo_dir if info else None)

        logger.info(f"Evaluating instance: {instance_id}")
        logger.debug(f"Problem statement: {problem_statement}")

        if resume_dir:
            log_dir = resume_dir
            logger.info(f"Resuming from log directory: {log_dir}")
            test_spec.reset_repo()
            self.replay(log_dir)
            repo_dir = info.repo_dir
        else:
            _id = uuid.uuid4().hex[:8]
            name = get_name(f"gptme-evals-{model.replace('/', '--')}-{_id}")
            log_dir = get_logs_dir() / name
            repo_dir = test_spec.setup_repo()

        start_time = time.time()
        try:
            logger.info(f"Executing agent for instance {instance_id}")
            logger.info(f"Setting up repo for instance {instance_id}")
            logger.info(f"Finished setting up repo for instance {instance_id} {repo_dir}")
            
            self.act(
                model=model, 
                instance=instance, 
                repo_dir=repo_dir, 
                log_dir=log_dir,
                resume=bool(resume_dir),
                start_stage=self.get_resume_stage(log_dir) if resume_dir else "understand",
                **kwargs
            )
            
            gen_time = time.time() - start_time
            logger.info(
                f"Agent execution completed for instance {instance_id} in {gen_time:.2f} seconds"
            )
            passed = test_spec.eval_repo()
            logger.info(f"Evaluation completed for instance {instance_id}. Passed: {passed}")
        except Exception as e:
            import traceback
            logger.error(f"Error during agent execution for instance {instance_id}: {e}\n{''.join(traceback.format_tb(e.__traceback__))}")
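
The `stages.index(start_stage) <= stages.index(...)` checks above are the whole resume mechanism: every stage at or after the start stage reruns. Isolated as a standalone sketch (the helper name is mine, not part of the PR):

```python
STAGES = ["understand", "reproduce", "fix"]

def stages_to_run(start_stage: str, stages: list[str] = STAGES) -> list[str]:
    """Return the stages that execute when resuming at `start_stage`.

    A stage runs iff its position is at or after the start stage, matching
    the index comparisons in SWEBenchAgent.act().
    """
    start = stages.index(start_stage)
    return [stage for i, stage in enumerate(stages) if start <= i]
```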

from gptme.eval.agents.swebench import SWEBenchAgent
from gptme.eval.swe_extra.swe_bench_extra_data import load_instance_by_id, load_top_50_easiest_task_instances
from gptme.dirs import get_logs_dir
from gptme.logmanager import LogManager, SWEBenchInfo
Member


And SWEBenchInfo?

Contributor Author


import json
from dataclasses import dataclass, field
from datetime import datetime
from os import PathLike
from pathlib import Path
from typing import Optional

@dataclass(frozen=True)
class SWEBenchInfo:
    instance_id: str
    model_name: str
    target: bool  # Changed from int to bool to match dataset format
    exit_status: str | None = None
    generated_patch: str | None = None
    eval_logs: str | None = None
    artifacts: dict[str, str] = field(default_factory=dict)
    timestamp: datetime = field(default_factory=datetime.now)
    repo_dir: str | None = None

    def to_dict(self) -> dict:
        """Convert to dictionary format matching SWE-bench dataset."""
        return {
            "instance_id": self.instance_id,
            "model_name": self.model_name,
            "target": self.target,
            "exit_status": self.exit_status,
            "generated_patch": self.generated_patch,
            "eval_logs": self.eval_logs,
            "timestamp": self.timestamp.isoformat(),
            "artifacts": self.artifacts,
            "repo_dir": self.repo_dir,
        }

    @classmethod
    def from_dict(cls, d: dict) -> "SWEBenchInfo":
        """Create from dictionary, handling optional fields."""
        # Convert timestamp string to datetime if present
        if "timestamp" in d:
            d = d.copy()  # Make a copy to avoid modifying input
            d["timestamp"] = datetime.fromisoformat(d["timestamp"])
        return cls(**d)
    
    @classmethod
    def load_from_log_dir(cls, log_dir: PathLike) -> Optional["SWEBenchInfo"]:
        log_dir = Path(log_dir)
        swe_bench_info_file = log_dir / "swe_bench_info.json"
        if not swe_bench_info_file.exists():
            return None
        with swe_bench_info_file.open() as f:
            return cls.from_dict(json.load(f))
    
    def save_to_log_dir(self, log_dir: PathLike) -> None:
        log_dir = Path(log_dir)
        swe_bench_info_file = log_dir / "swe_bench_info.json"
        swe_bench_info_file.parent.mkdir(parents=True, exist_ok=True)
        with swe_bench_info_file.open("w") as f:
            json.dump(self.to_dict(), f, indent=2)
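
A quick standalone sanity check of the serialization pattern above, trimmed to the timestamp handling, which is the only non-trivial part (class and instance names are illustrative):

```python
import json
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(frozen=True)
class Info:
    instance_id: str
    timestamp: datetime = field(default_factory=datetime.now)

    def to_dict(self) -> dict:
        # datetime is not JSON-serializable; store ISO 8601 text instead.
        return {"instance_id": self.instance_id,
                "timestamp": self.timestamp.isoformat()}

    @classmethod
    def from_dict(cls, d: dict) -> "Info":
        d = d.copy()  # avoid mutating the caller's dict
        d["timestamp"] = datetime.fromisoformat(d["timestamp"])
        return cls(**d)

# Round-tripping through JSON preserves the datetime exactly,
# since isoformat()/fromisoformat() keep microsecond precision.
original = Info("example__repo-1")
restored = Info.from_dict(json.loads(json.dumps(original.to_dict())))
assert restored == original
```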

@bjsi
Contributor Author

bjsi commented Jan 29, 2025

I was using the branch system to save each step in the understand, reproduce, fix cycle separately so I could restore from checkpoints

@ErikBjare
Member

@bjsi Nice! Was the branch system a mess or did it make sense to you/work well?

@bjsi
Contributor Author

bjsi commented Jan 29, 2025

@ErikBjare I didn't do extensive testing, but it seemed to work pretty well. I didn't notice any bugs.

@ErikBjare
Member

@TimeToBuildBob Recover this and #142 into a new clean PR (preserving credit for @bjsi), then close both.

TimeToBuildBob added a commit to TimeToBuildBob/gptme that referenced this pull request Feb 25, 2026
Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework
  (originally by @ErikBjare, PR gptme#142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)

- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis
  (originally by @bjsi, PR gptme#424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known
limitations:
- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
@TimeToBuildBob
Member

Recovered and combined with #142 into clean PR #1489. Closing as requested by @ErikBjare.

TimeToBuildBob added a commit to TimeToBuildBob/gptme that referenced this pull request Feb 25, 2026
…ent swebench API

Per Erik's review: integrate the missing SWEBenchInfo dataclass and
SWEBenchAgent class from bjsi's original SWE-bench work (PR gptme#424).

Changes:
- Add SWEBenchInfo dataclass to logmanager.py (where swe_extra imports it)
- Convert agents.py to agents/ package for extensibility
- Add agents/swebench.py with SWEBenchAgent orchestration skeleton
  (understand/reproduce/fix stages are stubs pending sub-agent implementation)
- Update swe_bench_constants.py for current swebench API (MAP_REPO_VERSION_TO_SPECS_PY
  replaces MAP_VERSION_TO_INSTALL/MAP_REPO_TO_TEST_FRAMEWORK)
- Fix swe_bench_test_spec.py import (get_test_directives moved to
  swebench.harness.test_spec.python)
- Fix mypy errors across swe_extra modules
- Remove WIP mypy ignore_errors override (no longer needed)
TimeToBuildBob added a commit that referenced this pull request Feb 25, 2026
…r current swebench API

Per review, integrate bjsi's SWEBenchInfo and SWEBenchAgent code from PR #424:

- Add SWEBenchInfo frozen dataclass to logmanager.py for evaluation metadata
- Add SWEBenchAgent class in eval/agents/swebench.py with multi-stage pipeline
- Convert eval/agents.py to package (eval/agents/__init__.py + swebench.py)
- Update swe_bench_constants.py for current swebench library API
- Fix swe_bench_test_spec.py imports for restructured swebench package
- Fix type errors in swe_bench_extra_data.py and run_swe_extra.py
- Add matplotlib to mypy ignore_missing_imports
- Remove WIP ignore_errors override for swe_extra modules
TimeToBuildBob added a commit that referenced this pull request Feb 26, 2026
Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework
  (originally by @ErikBjare, PR #142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)

- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis
  (originally by @bjsi, PR #424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known
limitations:
- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
TimeToBuildBob added a commit that referenced this pull request Feb 26, 2026
…r current swebench API

Per review, integrate bjsi's SWEBenchInfo and SWEBenchAgent code from PR #424:

- Add SWEBenchInfo frozen dataclass to logmanager.py for evaluation metadata
- Add SWEBenchAgent class in eval/agents/swebench.py with multi-stage pipeline
- Convert eval/agents.py to package (eval/agents/__init__.py + swebench.py)
- Update swe_bench_constants.py for current swebench library API
- Fix swe_bench_test_spec.py imports for restructured swebench package
- Fix type errors in swe_bench_extra_data.py and run_swe_extra.py
- Add matplotlib to mypy ignore_missing_imports
- Remove WIP ignore_errors override for swe_extra modules
TimeToBuildBob added a commit to TimeToBuildBob/gptme that referenced this pull request Feb 26, 2026
Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework
  (originally by @ErikBjare, PR gptme#142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)

- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis
  (originally by @bjsi, PR gptme#424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known
limitations:
- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
TimeToBuildBob added a commit to TimeToBuildBob/gptme that referenced this pull request Feb 26, 2026
…r current swebench API

Per review, integrate bjsi's SWEBenchInfo and SWEBenchAgent code from PR gptme#424:

- Add SWEBenchInfo frozen dataclass to logmanager.py for evaluation metadata
- Add SWEBenchAgent class in eval/agents/swebench.py with multi-stage pipeline
- Convert eval/agents.py to package (eval/agents/__init__.py + swebench.py)
- Update swe_bench_constants.py for current swebench library API
- Fix swe_bench_test_spec.py imports for restructured swebench package
- Fix type errors in swe_bench_extra_data.py and run_swe_extra.py
- Add matplotlib to mypy ignore_missing_imports
- Remove WIP ignore_errors override for swe_extra modules
ErikBjare pushed a commit that referenced this pull request Feb 26, 2026
…424) (#1489)

* feat(eval): add SWE-bench evaluation modules

Combines work from two WIP branches into a single recoverable base:

- **gptme/eval/swebench/**: Core SWE-bench evaluation framework
  (originally by @ErikBjare, PR #142)
  - Instance loading, repository setup utilities
  - Evaluation runner with gptme agent integration
  - CLI entry point (`gptme-eval-swebench`)

- **gptme/eval/swe_extra/**: SWE-bench setup scripts and data analysis
  (originally by @bjsi, PR #424)
  - Setup scripts for running SWE-bench evals
  - Data loading and analysis utilities (top-50 easiest instances)
  - Test specification generation
  - SWE-bench constants and harness integration

Both modules are WIP and require further integration work. Key known
limitations:
- `swe_extra` imports `SWEBenchInfo` from logmanager (not yet in main)
- `swe_extra` references `SWEBenchAgent` (not yet merged)
- `datasets` and `swebench` added as optional eval dependencies

Adds mypy overrides for WIP modules to defer type checking.

Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>

* fix(eval): remove os.chdir() and duplicate import in swebench utils

- Replace os.chdir(repo_dir) with cwd=repo_dir in subprocess calls to
  preserve working directory persistence (per revert in #1487)
- Consolidate duplicate 'from datasets import' into single import line

* fix(eval): address greptile review - fix repo_dir in files dict and missing model arg

- evaluate.py: replace incorrect {"repo_dir": repo_dir} Files dict with
  proper prompt-embedded context; add missing log_dir/workspace_dir to
  EvalResult instances
- run_swe_extra.py: add --model CLI arg so cli() can call main() correctly
  for non-resume evaluations

* fix(eval): address greptile review feedback on swebench module

* fix(eval): copy repo to agent workspace and capture diff via git

* fix(eval): enable mypy for swebench modules per review

Remove WIP mypy ignore_errors for gptme.eval.swebench.* — code is clean.
Keep ignore_errors for gptme.eval.swe_extra.* (depends on unmerged SWEBenchInfo).

* feat(eval): integrate SWEBenchInfo/SWEBenchAgent and fix swe_extra for current swebench API

Per review, integrate bjsi's SWEBenchInfo and SWEBenchAgent code from PR #424:

- Add SWEBenchInfo frozen dataclass to logmanager.py for evaluation metadata
- Add SWEBenchAgent class in eval/agents/swebench.py with multi-stage pipeline
- Convert eval/agents.py to package (eval/agents/__init__.py + swebench.py)
- Update swe_bench_constants.py for current swebench library API
- Fix swe_bench_test_spec.py imports for restructured swebench package
- Fix type errors in swe_bench_extra_data.py and run_swe_extra.py
- Add matplotlib to mypy ignore_missing_imports
- Remove WIP ignore_errors override for swe_extra modules

* fix(eval): restore cwd after os.chdir in SWEBenchAgent.replay()

* refactor(eval): move SWEBenchInfo from logmanager to eval/swebench/info.py

Per Erik's review: SWEBenchInfo doesn't belong in logmanager.py, move it
into the eval/swebench module where it belongs.

- New: gptme/eval/swebench/info.py with SWEBenchInfo dataclass
- Updated gptme/eval/swebench/__init__.py to export SWEBenchInfo
- Removed SWEBenchInfo from gptme/logmanager.py
- Updated all imports in eval/agents/swebench.py, swe_extra/{run_swe_extra,
  swe_bench_test_spec,swe_bench_extra_data}.py

---------

Co-authored-by: TimeToBuildBob <TimeToBuildBob@users.noreply.github.com>
Co-authored-by: bjsi <44608421+bjsi@users.noreply.github.com>
TimeToBuildBob added a commit that referenced this pull request Mar 7, 2026
Erik suggested [agent.links] as the canonical section name to align with
how pyproject.toml and Cargo.toml group links (gptme/gptme-contrib#382).

Changes:
- AgentConfig: add `links: dict[str, str] | None = None` as canonical field
- AgentConfig: keep `urls` as deprecated fallback for backwards compat
- logmanager: prefer `agent.links` over `agent.urls` when building agent_urls
- docs/config.rst: document [agent.links] + deprecation note on [agent.urls]
- tests: add test_agent_links_canonical_key and test_agent_links_preferred_over_urls

The webui already reads `agent_urls` from the backend and shows a
Dashboard link in the AgentsList panel — this makes [agent.links] work
end-to-end alongside the gptme-contrib dashboard generator (PR #424).

3 participants