Nemotron-CORTEXA: Enhancing LLM Agents for Software Engineering Tasks via Improved Localization and Solution Diversity
This is the official codebase for the ICML 2025 paper: Nemotron-CORTEXA: Enhancing LLM Agents for Software Engineering Tasks via Improved Localization and Solution Diversity. Please see this blogpost for a high-level overview.
- Python 3.11 or higher
Download the repo and install it with:
pip install -e .
There are two steps in code localization: file localization and entity localization.
We have developed NV-EmbedCode, a code embedding model that specializes in mapping bug descriptions to faulty code. The model is available on HuggingFace and as a NIM.
The following command runs file localization using NV-EmbedCode's NIM:
python -m cortexa.retrieval.embed_retrieve \
--model_name nvidia/nv-embedcode-7b-v1 \
--base_url https://integrate.api.nvidia.com/v1 \
--log_dir ./logs \
--repo_playground ./repos \
--batch_size 16 \
--max_length 450 \
--dataset_name_or_path princeton-nlp/SWE-bench_Verified \
--query_type llmsummary \
--instance_id astropy__astropy-12907
The instance_id argument accepts a comma-separated list of instances to run. To run the entire benchmark, omit the argument entirely.
You can measure the accuracy of retrieval by running:
python -m cortexa.retrieval.file_retrieval_eval \
--log_dir ./logs \
--dataset_name_or_path princeton-nlp/SWE-bench_Verified \
--query_type llmsummary
For the SWE-bench Lite and Verified sets, we generated file localization results and made them available at src/cortexa/retrieval/files/cortexa_all_llmsummary_ordered_files.pickle. The pickle file contains a dictionary with results for 707 instances (the union of the Lite and Verified sets). You can access the result for each instance using its instance ID. Each result is a tuple with two elements: the first is the ranked list of files predicted by our NV-EmbedCode model, and the second is the list of files that were modified in the golden patch from the dataset.
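As a rough illustration of how a dictionary of this shape could be consumed, the sketch below computes a top-k file recall. The helper name top_k_recall, the toy entries, and the exact metric (an instance counts as a hit only if all golden files appear in the top k) are assumptions made for illustration; the evaluation script in this repo may define accuracy differently. With the real file, you would load the dictionary with pickle.load instead of building it by hand.

```python
import pickle
from typing import Dict, List, Tuple

# Assumed shape, following the description above:
# instance_id -> (ranked predicted files, files modified in the golden patch)
Results = Dict[str, Tuple[List[str], List[str]]]


def top_k_recall(results: Results, k: int) -> float:
    """Fraction of instances whose golden files all appear in the top-k predictions."""
    if not results:
        return 0.0
    hits = 0
    for ranked_files, gold_files in results.values():
        if set(gold_files) <= set(ranked_files[:k]):
            hits += 1
    return hits / len(results)


# Toy data standing in for the real pickle (instance IDs and paths are made up).
toy_results: Results = {
    "example__repo-1": (["pkg/a.py", "pkg/b.py", "pkg/c.py"], ["pkg/b.py"]),
    "example__repo-2": (["pkg/x.py", "pkg/y.py", "pkg/z.py"], ["pkg/z.py"]),
}

# Round-trip through pickle to mirror how the real file would be read.
loaded: Results = pickle.loads(pickle.dumps(toy_results))

print(top_k_recall(loaded, k=2))  # 0.5
print(top_k_recall(loaded, k=3))  # 1.0
```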
To obtain more granular localization results, you can use our localization agent. It uses the file ranking results from the previous step and returns a list of relevant entities, such as functions and classes.
To generate entity localization results:
python -m cortexa.localize.entity_localization \
--log_dir ./logs \
--repo_playground ./repos \
--num_turns 5 \
--num_top_files 6 \
--embed_results cortexa_verified_llmsummary_ordered_files.pickle \
--instance_id astropy__astropy-12907
embed_results should be the name of the file generated in the previous step. By default, this agent uses the model config file at model_config.yml. For each model, you need to specify a url for API access, an api_key_name, and a model_name. For security reasons, the code populates the actual API key value at runtime and assumes it is available via os.environ[api_key_name], so make sure the API key is accessible as an environment variable.
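The environment-variable lookup described above can be sketched as follows. This is a minimal illustration, not the code used by the agent: the function name resolve_api_key, the variable name EXAMPLE_API_KEY, and the model name example-model are hypothetical, and the entry is written as a plain dictionary rather than parsed from model_config.yml, whose exact layout is not shown here.

```python
import os
from typing import Dict


def resolve_api_key(model_entry: Dict[str, str]) -> str:
    """Look up the API key named by `api_key_name` in the environment."""
    name = model_entry["api_key_name"]
    try:
        return os.environ[name]
    except KeyError:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError(f"Environment variable {name!r} is not set") from None


# Hypothetical model entry with the three fields the README mentions.
example_entry = {
    "url": "https://integrate.api.nvidia.com/v1",
    "api_key_name": "EXAMPLE_API_KEY",
    "model_name": "example-model",
}

# For illustration only: in practice you would export the variable in your shell.
os.environ.setdefault("EXAMPLE_API_KEY", "dummy-key-for-illustration")

api_key = resolve_api_key(example_entry)
print(f"Resolved a key of length {len(api_key)} for {example_entry['model_name']}")
```

Keeping only the variable name in the config file means the file can be committed without exposing the key itself.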
Empirically, we found that running entity localization with different models and temperatures and then merging their results increases recall. You can follow the example in src/cortexa/localize/configs.py to add more model configurations.
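To see why merging can raise recall, consider the small sketch below, which takes the order-preserving union of entity lists from several runs. The merge_entities helper, the "file::entity" identifier format, and the toy data are assumptions for illustration; the merging logic in this repo may differ.

```python
from typing import Iterable, List, Set


def merge_entities(runs: Iterable[List[str]]) -> List[str]:
    """Union of entity lists, keeping first-seen order and dropping duplicates."""
    merged: List[str] = []
    seen: Set[str] = set()
    for run in runs:
        for entity in run:
            if entity not in seen:
                seen.add(entity)
                merged.append(entity)
    return merged


def recall(predicted: List[str], gold: Set[str]) -> float:
    """Fraction of gold entities that appear among the predictions."""
    return len(gold & set(predicted)) / len(gold)


# Toy outputs from two hypothetical runs (different models or temperatures).
run_a = ["a.py::foo", "a.py::Bar"]
run_b = ["a.py::Bar", "b.py::baz"]
gold = {"a.py::foo", "b.py::baz"}

merged = merge_entities([run_a, run_b])
print(merged)                # ['a.py::foo', 'a.py::Bar', 'b.py::baz']
print(recall(run_a, gold))   # 0.5
print(recall(run_b, gold))   # 0.5
print(recall(merged, gold))  # 1.0
```

Each run alone finds half of the gold entities, while the union finds both, at the cost of a longer candidate list.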
To evaluate entity localization results:
python src/cortexa/localize/entity_localization_eval.py --loc_f=LOC_RES_F --target_set=verified
LOC_RES_F must be a jsonl file following the format of src/cortexa/retrieval/files/cortexa_all_llmsummary_LA_DP_entity.jsonl. The entity localization script produces three jsonl files in this format: one each for the direct prompt, the localization agent, and the merged results.
Modify get_default_loc_log_file_map in src/cortexa/repair/config.py to point to the results of the previous two stages. By default, it reads our pre-processed retrieved files and entities. This file also defines the candidates for generating patches and reproduction tests. If you want to try other candidates, modify their respective functions.
In the repair stage, we generate patches and reproduction tests. We then run the patches through the generated tests and the final unit tests. The final unit test results are used only for reporting the resolution rate. Finally, we filter the patches based on the results of the reproduction tests and output a single patch for each instance.
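The filter-then-select idea can be sketched as below. This is a simplified stand-in, not the selection logic of this repo: it keeps the candidates that pass the most reproduction tests and then takes a plain majority vote over whitespace-stripped patch text. The Candidate class, the select_patch function, and the toy patches are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    patch: str         # patch text in diff format
    repro_passed: int  # number of generated reproduction tests the patch passes


def select_patch(candidates: List[Candidate]) -> str:
    """Keep the candidates with the best reproduction score, then majority-vote."""
    if not candidates:
        raise ValueError("no candidates to select from")
    best = max(c.repro_passed for c in candidates)
    # Strip surrounding whitespace so trivially different copies vote together.
    survivors = [c.patch.strip() for c in candidates if c.repro_passed == best]
    counts = Counter(survivors)
    top = max(counts.values())
    # Ties are broken in favor of the earliest surviving candidate.
    return next(p for p in survivors if counts[p] == top)


toy_candidates = [
    Candidate("diff A", 2),
    Candidate("diff B", 3),
    Candidate("diff C", 3),
    Candidate("diff B\n", 3),
    Candidate("diff D", 1),
]

print(select_patch(toy_candidates))  # diff B
```

Note that the final unit test results play no role in this selection, matching the statement above that they are used only for reporting.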
The agent uses the OpenAI API to run the models and reads model details from the model_config.yml file. You can configure each patch and test generation candidate in src/cortexa/repair/configs.py.
To run patch generation, test generation, and evaluation, use the following command. It optionally accepts --instance_id with a comma-separated list of instance IDs if you wish to run a subset. To run the entire benchmark, omit the --instance_id argument entirely.
python -m cortexa.evaluate.run_evaluation \
--repro_test_dir ./repro_tests/merged \
--repo_playground ./repos \
--log_dir ./logs \
--run_patch_generation \
--run_test_generation \
--num_workers_patch_gen 2 \
--num_workers_test_gen 2 \
--num_workers_eval 4 \
--modes eval,repro \
--dataset_name_or_path princeton-nlp/SWE-bench_Verified \
--instance_id astropy__astropy-12907
Once the previous step is done, run the following command to filter the generated patches:
python -m cortexa.evaluate.run_filtering \
--log_dir ./logs \
--repo_playground ./repos \
--run_normalization \
--vote_mode llm_judge \
--model_name deepseek-v3-0324 \
--dataset_name_or_path princeton-nlp/SWE-bench_Verified
This produces a file at {log_dir}/final_patches.json with the final selected patches in diff format and {log_dir}/results.json with a summary of the resolved and unresolved instances.
Alternatively, to run reproduction test generation, patch generation, and evaluation separately, use the following commands.
We use reproduction tests to filter patch candidates. To generate them, run:
python -m cortexa.repair.repro_test_gen \
--log_dir ./logs \
--out_inference_file summary_test_gen.jsonl \
--repro_test_dir ./reproduction_tests \
--max_round 3 \
--num_workers 2 \
--repo_playground ./repos \
--dataset_name_or_path princeton-nlp/SWE-bench_Verified \
--instance_id astropy__astropy-12907
To generate patches, run:
python -m cortexa.repair.repair_gen \
--log_dir ./logs \
--out_inference_file summary_inference.jsonl \
--num_workers 2 \
--repo_playground ./repos \
--dataset_name_or_path princeton-nlp/SWE-bench_Verified \
--instance_id astropy__astropy-12907
You can evaluate the generated patches with the generated reproduction tests and the final SWE-bench unit tests.
Run:
python -m cortexa.evaluate.run_evaluation \
--repro_test_dir ./repro_tests/merged \
--modes eval,repro \
--log_dir ./logs \
--out_inference_file summary_inference.jsonl \
--num_workers_eval 4 \
--dataset_name_or_path princeton-nlp/SWE-bench_Verified \
--instance_id astropy__astropy-12907
If you wish to run the final evaluation for only one patch per instance, irrespective of the previous patch and test generation steps, run:
python -m cortexa.evaluate.run_evaluation \
--modes eval \
--log_dir ./logs \
--out_inference_file summary_inference.jsonl \
--num_workers_eval 4 \
--dataset_name_or_path princeton-nlp/SWE-bench_Verified \
--instance_id astropy__astropy-12907 \
--single_patch_eval
out_inference_file is a jsonl file in which each line holds the results for one instance. Each line must have at least the following attributes:
- instance_id
- model_patch: the patch for the instance in git diff format
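Before running the evaluation, it can help to check that a file meets this minimum format. The sketch below is one way to do that; parse_inference_lines and the toy records are hypothetical and not part of the codebase, and any fields beyond instance_id and model_patch are simply ignored.

```python
import json
from typing import Dict, Iterable

REQUIRED_KEYS = ("instance_id", "model_patch")


def parse_inference_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map each instance_id to its patch, raising if a required key is missing."""
    patches: Dict[str, str] = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        patches[record["instance_id"]] = record["model_patch"]
    return patches


# Toy records standing in for the contents of a jsonl file.
toy_lines = [
    json.dumps({
        "instance_id": "astropy__astropy-12907",
        "model_patch": "diff --git a/f.py b/f.py\n",
    }),
    json.dumps({
        "instance_id": "example__repo-1",
        "model_patch": "diff --git a/g.py b/g.py\n",
    }),
]

patches = parse_inference_lines(toy_lines)
print(sorted(patches))  # ['astropy__astropy-12907', 'example__repo-1']
```

Because an open file object iterates line by line, the same function can be called as parse_inference_lines(f) inside a with open(...) block.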
If you find our work useful, please cite our ICML 2025 paper
@inproceedings{
sohrabizadeh2025nemotroncortexa,
title={Nemotron-{CORTEXA}: Enhancing {LLM} Agents for Software Engineering Tasks via Improved Localization and Solution Diversity},
author={Atefeh Sohrabizadeh and Jialin Song and Mingjie Liu and Rajarshi Roy and Chankyu Lee and Jonathan Raiman and Bryan Catanzaro},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=k6p8UKRdH7}
}