Nemotron-CORTEXA: Enhancing LLM Agents for Software Engineering Tasks via Improved Localization and Solution Diversity

This is the official codebase for the ICML 2025 paper: Nemotron-CORTEXA: Enhancing LLM Agents for Software Engineering Tasks via Improved Localization and Solution Diversity. Please see this blogpost for a high-level overview.

Requirements:

Python 3.11 or higher

Installation

Clone the repository and install it in editable mode with

pip install -e .

Localization Stage

There are two steps in code localization: file localization and entity localization.

File Localization

We have developed NV-EmbedCode, a code embedding model that specializes in mapping bug descriptions to faulty code. The model is available on HuggingFace and as a NIM.

The following command runs file localization using the NV-EmbedCode NIM:

python -m cortexa.retrieval.embed_retrieve \
        --model_name nvidia/nv-embedcode-7b-v1 \
        --base_url https://integrate.api.nvidia.com/v1  \
        --log_dir ./logs \
        --repo_playground ./repos \
        --batch_size 16 \
        --max_length 450 \
        --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
        --query_type llmsummary \
        --instance_id astropy__astropy-12907

The instance_id argument accepts a comma-separated list of instances to run. To run the entire benchmark, omit the argument entirely.

You can measure the accuracy of retrieval by running:

python -m cortexa.retrieval.file_retrieval_eval \
        --log_dir ./logs \
        --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
        --query_type llmsummary

For the SWE-bench Lite and Verified sets, we generated file localization results and made them available at src/cortexa/retrieval/files/cortexa_all_llmsummary_ordered_files.pickle. The pickle file contains a dictionary with results for 707 instances (the union of the Lite and Verified sets). You can access the result for each instance by its instance ID. Each result is a tuple with two elements: the first is the ranked list of files predicted by our NV-EmbedCode model, and the second is the list of files that were modified in the golden patch from the dataset.
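The structure described above can be consumed with a few lines of Python. The following is a minimal sketch, not the repository's evaluation script: the top_k_recall helper and the toy data are illustrative assumptions, and the metric shown (an instance counts as a hit only if all of its gold files appear in the top k) may differ from what file_retrieval_eval reports. The sketch also assumes the file entries are plain path strings. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import pickle
from pathlib import Path


def top_k_recall(results, k):
    """Fraction of instances whose gold files all appear in the top-k ranked files.

    `results` maps instance id -> (ranked_files, gold_files), the shape
    described for the released pickle. The metric itself is an assumption
    made for illustration.
    """
    if not results:
        return 0.0
    hits = 0
    for ranked_files, gold_files in results.values():
        if set(gold_files) <= set(ranked_files[:k]):
            hits += 1
    return hits / len(results)


# Toy data in the documented shape (hypothetical instance ids and file names).
toy_results = {
    "example__repo-1": (["a.py", "b.py", "c.py"], ["b.py"]),
    "example__repo-2": (["x.py", "y.py", "z.py"], ["z.py"]),
}
print(top_k_recall(toy_results, k=2))  # 0.5

# If the released pickle is present, inspect one real instance.
pickle_path = Path(
    "src/cortexa/retrieval/files/cortexa_all_llmsummary_ordered_files.pickle"
)
if pickle_path.exists():
    with pickle_path.open("rb") as f:
        results = pickle.load(f)
    ranked, gold = results["astropy__astropy-12907"]
    print(ranked[:5], gold)
```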

Entity Localization

To obtain more granular localization results, you can use our localization agent. It uses the file ranking results from the previous step and returns a list of relevant entities, such as functions and classes.

To generate entity localization results:

python -m cortexa.localize.entity_localization \
       --log_dir ./logs \
       --repo_playground ./repos \
       --num_turns 5 \
       --num_top_files 6 \
       --embed_results cortexa_verified_llmsummary_ordered_files.pickle \
       --instance_id astropy__astropy-12907

embed_results should be the name of the file generated in the previous step. By default, this agent uses the model config file at model_config.yml. For each model, you need to specify a url for API access, an api_key_name, and a model_name. For security reasons, the code populates the actual API key value at runtime from os.environ[api_key_name], so make sure the API key is available as an environment variable under that name.
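Before launching a long run, it can help to check that every configured key is actually set. The sketch below is an assumption-laden illustration: it represents each model entry as a plain dictionary with the three fields named above, and the missing_api_keys helper, the example URLs, and the variable names are hypothetical rather than part of the codebase. The actual layout of model_config.yml may differ.

```python
import os


def missing_api_keys(model_entries, environ=None):
    """Return the sorted environment variable names that are unset or empty.

    Each entry is assumed to carry `url`, `api_key_name`, and `model_name`.
    """
    env = os.environ if environ is None else environ
    return sorted(
        {
            entry["api_key_name"]
            for entry in model_entries
            if not env.get(entry["api_key_name"])
        }
    )


# Hypothetical entries; replace them with the values from your own config.
example_entries = [
    {
        "url": "https://example.invalid/v1",
        "api_key_name": "EXAMPLE_KEY_A",
        "model_name": "example-model-a",
    },
    {
        "url": "https://example.invalid/v1",
        "api_key_name": "EXAMPLE_KEY_B",
        "model_name": "example-model-b",
    },
]

# A fake environment is passed here so the example does not depend on your shell.
fake_env = {"EXAMPLE_KEY_A": "not-a-real-key"}
print(missing_api_keys(example_entries, environ=fake_env))  # ['EXAMPLE_KEY_B']
```

Calling missing_api_keys(example_entries) without the environ argument checks the real process environment instead.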

Empirically, we found that running entity localization with different models and temperatures and then merging their results increases recall. You can follow the example in src/cortexa/localize/configs.py to add more model configurations.
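To see why merging helps recall, consider the toy sketch below. It is not the repository's merging code: the merge_entities and recall helpers are hypothetical, the entity strings use a made-up "path::name" notation, and the merge is assumed to be a simple order-preserving union. Each run alone finds only one of the two gold entities, while the union finds both.

```python
def merge_entities(*runs):
    """Order-preserving union of entity lists from several localization runs."""
    merged = []
    seen = set()
    for run in runs:
        for entity in run:
            if entity not in seen:
                seen.add(entity)
                merged.append(entity)
    return merged


def recall(predicted, gold):
    """Share of gold entities that appear among the predictions."""
    gold_set = set(gold)
    if not gold_set:
        return 1.0
    return len(gold_set & set(predicted)) / len(gold_set)


# Hypothetical outputs of two runs with different models or temperatures.
run_a = ["pkg/mod.py::foo", "pkg/mod.py::Bar"]
run_b = ["pkg/mod.py::Bar", "pkg/util.py::baz"]
gold = ["pkg/mod.py::foo", "pkg/util.py::baz"]

merged = merge_entities(run_a, run_b)
print(recall(run_a, gold), recall(run_b, gold), recall(merged, gold))  # 0.5 0.5 1.0
```

The trade-off is that a union also grows the candidate list, so precision can drop as more runs are merged.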

To evaluate entity localization results:

python src/cortexa/localize/entity_localization_eval.py --loc_f=LOC_RES_F --target_set=verified

LOC_RES_F must be a jsonl file following the format of src/cortexa/retrieval/files/cortexa_all_llmsummary_LA_DP_entity.jsonl. The entity localization script above produces three jsonl files in this format: one each for the direct prompt, the localization agent, and the merged results.

Update get_default_loc_log_file_map in src/cortexa/repair/config.py to point to the results of the previous two steps. By default, it reads our pre-processed retrieved files and entities. This file also defines the candidates for how to generate the patches and reproduction tests. If you want to try other candidates, modify their respective functions.

Repair Stage

In the repair stage, we generate patches and reproduction tests. We then run the patches through the generated tests and the final unit tests. The final unit test results are used only for reporting the resolution rate. Finally, we filter the patches based on the results of the reproduction tests and output a single patch for each instance.
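The filtering idea can be illustrated with a deliberately simplified stand-in. The sketch below is not the repository's run_filtering logic and does not implement the llm_judge vote mode: it assumes each candidate is a pair of patch text and the number of reproduction tests it passed, keeps the candidates with the highest pass count, and breaks ties by a plain majority vote over whitespace-stripped patch text. The select_patch name and the input shape are hypothetical.

```python
from collections import Counter


def select_patch(candidates):
    """Pick one patch from (patch_text, num_repro_tests_passed) pairs.

    Step 1: keep only candidates with the highest pass count.
    Step 2: majority vote over the stripped patch text of the survivors;
            ties go to the patch seen first.
    """
    if not candidates:
        return None
    best = max(passed for _, passed in candidates)
    survivors = [patch.strip() for patch, passed in candidates if passed == best]
    return Counter(survivors).most_common(1)[0][0]


# Hypothetical candidates for one instance.
candidates = [
    ("diff A", 2),
    ("diff B", 3),
    ("diff C", 3),
    ("diff B\n", 3),  # same patch as "diff B" after stripping whitespace
]
print(select_patch(candidates))  # diff B
```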

The agent uses the OpenAI API to run the models and reads model details from the model_config.yml file. You can configure the patch and test generation candidates in src/cortexa/repair/configs.py.

To run patch generation, test generation, and evaluation, use the following command. It optionally accepts --instance_id with a comma-separated list of instance IDs if you wish to run a subset. To run the entire benchmark, omit the --instance_id argument entirely.

python -m cortexa.evaluate.run_evaluation  \
        --repro_test_dir ./repro_tests/merged \
        --repo_playground ./repos \
        --log_dir ./logs \
        --run_patch_generation \
        --run_test_generation \
        --num_workers_patch_gen 2 \
        --num_workers_test_gen 2 \
        --num_workers_eval 4 \
        --modes eval,repro \
        --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
        --instance_id astropy__astropy-12907

Once the previous step is done, run the following command to filter the generated patches:

python -m cortexa.evaluate.run_filtering  \
        --log_dir ./logs \
        --repo_playground ./repos \
        --run_normalization \
        --vote_mode llm_judge \
        --model_name deepseek-v3-0324 \
        --dataset_name_or_path princeton-nlp/SWE-bench_Verified

This produces {log_dir}/final_patches.json, which contains the final selected patches in diff format, and {log_dir}/results.json, which summarizes the resolved and unresolved instances.

Alternatively, if you want to run reproduction test generation, patch generation, and evaluation separately, follow the steps below.

Reproduction Test Generation

We use reproduction tests to filter patch candidates. Run

python -m cortexa.repair.repro_test_gen \
       --log_dir ./logs \
       --out_inference_file summary_test_gen.jsonl \
       --repro_test_dir ./reproduction_tests \
       --max_round 3 \
       --num_workers 2 \
       --repo_playground ./repos \
       --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
       --instance_id astropy__astropy-12907

Repair Generation

Run

python -m cortexa.repair.repair_gen \
       --log_dir ./logs \
       --out_inference_file summary_inference.jsonl \
       --num_workers 2 \
       --repo_playground ./repos \
       --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
       --instance_id astropy__astropy-12907

Evaluation

You can evaluate the generated patches with the generated reproduction tests and the final SWE-bench unit tests.

Run

python -m cortexa.evaluate.run_evaluation  \
        --repro_test_dir ./repro_tests/merged \
        --modes eval,repro \
        --log_dir ./logs \
        --out_inference_file summary_inference.jsonl \
        --num_workers_eval 4 \
        --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
        --instance_id astropy__astropy-12907

If you wish to run the final evaluation for only one patch per instance, regardless of the earlier patch and test generation steps, run

python -m cortexa.evaluate.run_evaluation  \
        --modes eval \
        --log_dir ./logs \
        --out_inference_file summary_inference.jsonl \
        --num_workers_eval 4 \
        --dataset_name_or_path princeton-nlp/SWE-bench_Verified \
        --instance_id astropy__astropy-12907 \
        --single_patch_eval

out_inference_file is a jsonl file in which each line holds the result for one instance. Each line must include at least the following attributes:

instance_id
model_patch: the patch for the instance in git diff format
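A quick check of the file before evaluation can catch malformed lines early. The sketch below only verifies the two attributes listed above; the validate_inference_lines helper and the sample patch text are illustrative assumptions and are not part of the codebase.

```python
import json

REQUIRED_FIELDS = ("instance_id", "model_patch")


def validate_inference_lines(lines):
    """Return (line_number, problem) pairs for lines that would not be usable."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append((number, "missing " + ", ".join(missing)))
    return problems


# One well-formed line and one line without a patch (placeholder patch text).
good_line = json.dumps(
    {
        "instance_id": "astropy__astropy-12907",
        "model_patch": "diff --git a/f.py b/f.py\n",
    }
)
bad_line = json.dumps({"instance_id": "astropy__astropy-12907"})
print(validate_inference_lines([good_line, bad_line]))  # [(2, 'missing model_patch')]
```

To check a real file, pass an open file handle, for example validate_inference_lines(open("summary_inference.jsonl")), adjusting the path to wherever your run wrote it.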

Citation

If you find our work useful, please cite our ICML 2025 paper

@inproceedings{
sohrabizadeh2025nemotroncortexa,
title={Nemotron-{CORTEXA}: Enhancing {LLM} Agents for Software Engineering Tasks via Improved Localization and Solution Diversity},
author={Atefeh Sohrabizadeh and Jialin Song and Mingjie Liu and Rajarshi Roy and Chankyu Lee and Jonathan Raiman and Bryan Catanzaro},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=k6p8UKRdH7}
}
