# Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router

![framework](R2_main.png)

## Abstract

Chain-of-thought prompting has proven essential for enhancing the complex reasoning abilities of Large Language Models (LLMs), but it also incurs high computational costs. Recent advances have explored routing queries among multiple models and shown it to be a promising approach. However, previous works operate directly at the task level, i.e., assigning user queries to suitable LLMs, which does not allow hybrid LLMs to truly collaborate on finer-grained sub-tasks. Collaboration at the level of intermediate reasoning steps (thoughts) could enable more efficient coordination, but it also poses significant challenges for router scheduling, placing immense demands on the quality of task decomposition and the precision of the router. To address this, we propose R2-Reasoner, a novel framework centered around a Reinforced Model Router designed to efficiently scale LLM reasoning. The router orchestrates collaboration across 9 heterogeneous models, whose parameter scales range from under 1B to hundreds of billions, by first breaking a complex query into subtasks with a decomposer and then assigning each subtask to the optimal model with a subtask allocator, balancing performance with cost. Training this router involves a two-stage alternating process for the decomposer and the allocator, integrating supervised fine-tuning with reinforcement learning to enable effective self-supervised refinement. Extensive experiments across six challenging reasoning benchmarks demonstrate that R2-Reasoner reduces API costs by 84.46\% compared with state-of-the-art baselines while maintaining competitive reasoning accuracy. Our framework paves the way for more scalable and efficient reasoning systems.


## Experiment:


### OpenAI Key Setup:

Please put your OpenAI key and base URL in the function *setOpenAi()* in the file */Router/utils.py*:


```python
def setOpenAi(keyid):
    
    if keyid == 0:
        client = AzureOpenAI(
            api_key = "",
            api_version = "",
            azure_endpoint = ""  
        )

    # deepseekClient = setOpenAi(keyid = 1)
    if keyid == 1 :
        api_key = ""
        client = OpenAI(api_key=api_key, base_url="")

    # qwenOnClient = setOpenAi(keyid = 2)
    if keyid == 2:
        api_key = ""
        client = OpenAI(api_key=api_key, base_url="")

   
    # qwenOff_1Client = setOpenAi(keyid = 31), 3b
    if keyid == 31:
        openai_api_key = ""
        openai_api_base = ""
        client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)

    # qwenOff_2Client = setOpenAi(keyid = 32), 1.5b
    if keyid == 32:
        openai_api_key = ""
        openai_api_base = ""
        client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)

    # qwenOff_3Client = setOpenAi(keyid = 33), 0.5b
    if keyid == 33:
        openai_api_key = ""
        openai_api_base = ""
        client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)

    # llllamaClient = setOpenAi(keyid = 4)
    if keyid == 4:
        openai_api_key = "" #  
        client = OpenAI(api_key=openai_api_key,base_url="")

    addtoken(-1)
    return client
```

Each benchmark also has an *RL_utils.py* file inside its *IterativeRL_{}* folder. Please set your OpenAI key there in the same way. Note that each of these *RL_utils.py* files contains an additional branch for the subtask allocator, as shown below:


```python
    # AlloClient = setOpenAi(keyid = 5)
    if keyid == 5:
        openai_api_key = ""
        openai_api_base = ""
        client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)
```

This branch is used when the pretrained subtask allocator is served behind an API. If you instead deploy the allocator locally with Transformers or vLLM in a Python script, you can ignore it.


The folder *Baselines* also contains files named *{}_utils.py*. To run these baselines, set the OpenAI key in the same way as above.



**NOTE:** The LLM client definition needs to meet the following calling format: `client.chat.completions.create(model, messages=messages)`, so that it can smoothly support the calling format of the `askLLM` function in `utils.py`. If the question-answer format of your LLM deployment does not comply with the OpenAI client call interface, please make sure to modify the `askLLM` function accordingly.
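If your deployment does not follow the OpenAI client interface, one option is to wrap it in a thin adapter that exposes `chat.completions.create(model, messages=...)` and returns an object with the same response shape. The sketch below is illustrative only: `LocalBackend` and its `generate` method are hypothetical stand-ins for your own deployment, not part of this repository.

```python
from types import SimpleNamespace

class LocalBackend:
    """Hypothetical stand-in for a local model deployment."""
    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

class ChatCompletionsAdapter:
    """Adapter exposing an OpenAI-style chat.completions.create interface."""
    def __init__(self, backend):
        self._backend = backend

    def create(self, model, messages):
        # Concatenate the chat messages into a single prompt for the backend.
        prompt = "\n".join(m["content"] for m in messages)
        text = self._backend.generate(prompt)
        # Mimic the response shape that askLLM reads:
        # response.choices[0].message.content
        return SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=text))]
        )

class AdapterClient:
    """Drop-in client: supports client.chat.completions.create(...)."""
    def __init__(self, backend):
        self.chat = SimpleNamespace(completions=ChatCompletionsAdapter(backend))

client = AdapterClient(LocalBackend())
response = client.chat.completions.create(
    model="local-model", messages=[{"role": "user", "content": "hello"}]
)
print(response.choices[0].message.content)  # → echo: hello
```

With such an adapter, `askLLM` in `utils.py` can stay unchanged, since it only relies on the call signature and response shape shown above.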


### Environment Setup:

All packages required for the R2-Reasoner environment are listed in the *requirements.txt* file.
To install the required packages, run the following command in the terminal:
```
pip install -r requirements.txt
```


### About the Data and Checkpoints:

We provide the data and checkpoints only for the MATH, CSQA, P3, and SCAN benchmarks, for reference. For other benchmarks such as CHAMP, MuSiQue, etc., you can adapt the provided code to fit their requirements.

The data and checkpoints for R2-reasoner can be obtained from the following link:
```
https://drive.google.com/drive/folders/1xawSCeIYUIR2d5m27a31wnKZYTygHGpU?usp=sharing
```

The data and checkpoints are organized to mirror the structure of the Python script folders. After downloading them, please place them in the corresponding folders.



#### Task Datasets
The benchmark datasets are stored in the folder path *Reasoner/Task_Datasets*.


#### The SFT training data
The SFT training data for the task decomposer is stored in *Reasoner/{Name of Benchmark}_Trys/Decom_training_data* for each benchmark.
The SFT training data for the subtask allocator is stored in *Reasoner/{Name of Benchmark}_Trys/Allo_training_data* for each benchmark.


#### The SFT checkpoint for task decomposer
The SFT checkpoint for the task decomposer is stored with each benchmark, in files named *Reasoner/{Name of Benchmark}_Trys/lora_finetuned_decom_model...*.
The SFT training configuration can be checked in the file *Decom_train_LoRA_01_Noneval.py* of each benchmark.


#### The SFT checkpoint for subtask allocator
The SFT checkpoint for the subtask allocator is stored in the path *LLaMA-Factory/ALLO/{Name of the Benchmark}*.
The SFT training configuration can be checked in the *LLaMA-Factory* folder.


#### The RL training data
The RL training data for both the task decomposer and the subtask allocator is stored in *Reasoner/{Name of Benchmark}_Trys/IterativeRL_{Name of the Benchmark}/Data* for each benchmark.
Because we use an iterative RL method, the RL training data needs to be regenerated after each round of RL training.
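The iterative update can be sketched as below. This is a toy illustration of the ordering only: all function names (`regenerate_rl_data`, `train_decomposer_rl`, `train_allocator_rl`) and the round count are hypothetical placeholders for the repository's actual scripts, and each round consumes data produced after the previous round's checkpoints.

```python
def regenerate_rl_data(round_idx):
    """Placeholder: rebuild RL training data with the latest checkpoints."""
    return {"round": round_idx, "samples": [f"trajectory-{round_idx}"]}

def train_decomposer_rl(data):
    """Placeholder: one RL round for the task decomposer."""
    return f"decom-ckpt-round-{data['round']}"

def train_allocator_rl(data):
    """Placeholder: one RL round for the subtask allocator."""
    return f"allo-ckpt-round-{data['round']}"

checkpoints = []
for round_idx in range(3):  # the number of rounds is a tunable choice
    data = regenerate_rl_data(round_idx)      # data reflects latest checkpoints
    decom_ckpt = train_decomposer_rl(data)    # update the decomposer
    allo_ckpt = train_allocator_rl(data)      # then update the allocator
    checkpoints.append((decom_ckpt, allo_ckpt))

print(checkpoints[-1])  # checkpoints from the final round
```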


#### The RL checkpoint for task decomposer
The RL checkpoint for the task decomposer is stored in *Reasoner/{Name of Benchmark}_Trys/IterativeRL_{Name of the Benchmark}/outputs_task_decomposition_lora* for each benchmark.
The RL training configuration can be checked in *Reasoner/{Name of Benchmark}_Trys/IterativeRL_{Name of the Benchmark}/main_decom.py* for each benchmark.


#### The RL checkpoint for subtask allocator
The RL checkpoint for the subtask allocator is stored in *Reasoner/{Name of Benchmark}_Trys/IterativeRL_{Name of the Benchmark}/outputs_model_choice_group_reward_lora* for each benchmark.
The RL training configuration can be checked in *Reasoner/{Name of Benchmark}_Trys/IterativeRL_{Name of the Benchmark}/main_allo.py* for each benchmark.


#### The final evaluation data
We use rejection sampling when calling the task decomposer to break each task into a sequence of subtasks.
The original 3 samples for each task and the rejection-sampling results are all stored in the folder *Reasoner/{Name of Benchmark}_Trys/final_evaluation_Decom_data* of each benchmark.
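The selection step of rejection sampling can be sketched as follows. The scoring heuristic here is a hypothetical placeholder; the actual selection criterion lives in *data_process_rejspl.py*.

```python
def score_decomposition(subtasks):
    """Hypothetical quality score: count non-empty subtask steps."""
    return len([s for s in subtasks if s.strip()])

def rejection_sample(candidates):
    """Keep the best-scoring decomposition among the sampled candidates."""
    return max(candidates, key=score_decomposition)

# Three sampled decompositions of one task (toy data).
samples = [
    ["solve for x"],
    ["isolate x", "simplify", "check the solution"],
    ["rewrite equation", ""],
]
best = rejection_sample(samples)
print(best)  # → ['isolate x', 'simplify', 'check the solution']
```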




### Running the R2-reasoner:

We provide examples only for the MATH, CSQA, P3, and SCAN benchmarks, for reference. For other benchmarks such as CHAMP, MuSiQue, etc., you can adapt the provided code to fit their requirements.

To run the R2-reasoner, use the following commands (assuming you have already set up the task decomposer and subtask allocator properly):

```
cd MATH_Trys
python build_data_for_final_evaluation.py
# run the script above 3 times, then move the generated files into the folder *final_evaluation_Decom_data*
cd final_evaluation_Decom_data
python data_process_rejspl.py
cd ..
python final_test_Allo.py
cd ..

cd CSQA_Trys
python build_data_for_final_evaluation.py
# run the script above 3 times, then move the generated files into the folder *final_evaluation_Decom_data*
cd final_evaluation_Decom_data
python data_process_rejspl.py
cd ..
python final_test_Allo.py
cd ..

cd P3_Trys
python build_data_for_final_evaluation.py
# run the script above 3 times, then move the generated files into the folder *final_evaluation_Decom_data*
cd final_evaluation_Decom_data
python data_process_rejspl.py
cd ..
python final_test_Allo.py
cd ..

cd SCAN_Trys
python build_data_for_final_evaluation.py
# run the script above 3 times, then move the generated files into the folder *final_evaluation_Decom_data*
cd final_evaluation_Decom_data
python data_process_rejspl.py
cd ..
python final_test_Allo.py
```

Before running these commands, please open the corresponding Python scripts to check and set the proper model path, tokenizer path, input file path, token file name, log file name, etc.


After running the commands, you can find the accuracy and model call counts in the output log file, and the tokens consumed by each model in the token file. Based on the token file, you can calculate the API costs.
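As an illustration, the API cost can be computed from the per-model token counts and a price sheet. The prices and token counts below are made-up placeholders; substitute your providers' actual per-token rates and the numbers from your token file.

```python
# Hypothetical per-1K-token prices (USD); replace with your providers' real rates.
PRICE_PER_1K = {
    "gpt-4o": {"prompt": 0.005, "completion": 0.015},
    "qwen-plus": {"prompt": 0.001, "completion": 0.002},
}

# Token counts as might be read from the token file (toy numbers).
usage = {
    "gpt-4o": {"prompt": 12000, "completion": 3000},
    "qwen-plus": {"prompt": 50000, "completion": 20000},
}

def total_cost(usage, prices):
    """Sum prompt and completion costs over all models."""
    cost = 0.0
    for model, toks in usage.items():
        p = prices[model]
        cost += toks["prompt"] / 1000 * p["prompt"]
        cost += toks["completion"] / 1000 * p["completion"]
    return cost

print(f"${total_cost(usage, PRICE_PER_1K):.4f}")  # → $0.1950
```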

