AdaptiveLLM: A Framework for Selecting Optimal Cost-Efficient LLM for Code-Generation Based on CoT Length

Introduction

Large Language Models (LLMs) have advanced code generation but struggle to balance performance and inference costs across diverse tasks. Dynamically selecting the optimal LLM based on task difficulty and resource constraints offers a solution, yet existing methods are resource-intensive, costly, and rely on human-annotated difficulty labels, which are often unavailable or misaligned with LLMs' perception.

We introduce AdaptiveLLM, a framework that dynamically selects the optimal LLM for each task by automatically assessing task difficulty. It estimates difficulty from the Chain-of-Thought (CoT) lengths produced by reasoning models, clusters tasks into three difficulty levels with k-means, and fine-tunes CodeBERT to embed difficulty-aware features. An XGBoost classifier then selects the best model for each task, optimizing the performance-cost trade-off.
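The clustering step can be illustrated with a minimal, standard-library sketch of k-means on scalar CoT lengths. It is an illustration under stated assumptions, not the repository's implementation: the function names, the quantile-style initialisation, the level names `easy`/`medium`/`hard`, and the sample lengths are all hypothetical.

```python
LEVELS = ["easy", "medium", "hard"]  # hypothetical names for the three levels


def kmeans_1d(values, k=3, iters=100):
    """Lloyd's algorithm on scalar values; returns (labels, centroids)."""
    ordered = sorted(values)
    # Deterministic initialisation: spread the starting centroids over the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


def difficulty_labels(cot_lengths):
    """Map each CoT length to a difficulty level; a longer CoT means a harder task."""
    labels, centroids = kmeans_1d(cot_lengths, k=3)
    # Rank clusters by centroid so that the smallest centroid maps to "easy".
    order = sorted(range(3), key=lambda c: centroids[c])
    rank = {cluster: r for r, cluster in enumerate(order)}
    return [LEVELS[rank[lab]] for lab in labels]


# Made-up CoT lengths for nine tasks.
cot_lengths = [120, 150, 135, 900, 950, 1020, 4000, 4200, 3900]
print(difficulty_labels(cot_lengths))
```

Ranking clusters by centroid keeps the label order independent of the arbitrary cluster indices that k-means returns.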

Repository Structure

Baseline/

This folder is used to store the experimental code for the baseline method ComplexityNet. In the ComplexityNet framework, the model pool consists of CodeLlama, GPT-3.5, and GPT-4o, and the selector used for fine-tuning is Qwen2.5-7B-Instruct.

Consistency_Check/

compare/

This folder contains the box plot comparison between difficulty annotations based on CoT length and human-annotated difficulty levels. We conducted the comparison on two datasets: LeetCodeSample and CodeContests.

confusion_matrix/

This folder contains the confusion matrix comparing difficulty annotations based on CoT length with human-annotated difficulty levels, aimed at exploring the differences between the two classification methods. The comparison was also performed on the LeetCodeSample and CodeContests datasets.
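As a rough illustration of this comparison, the sketch below builds a confusion matrix from two parallel label lists using only the standard library. The level names and the six sample labels are made up for the example; they are not taken from the repository's data.

```python
from collections import Counter


def confusion_matrix(human, cot, levels):
    """Rows are human-annotated levels, columns are CoT-based levels."""
    if len(human) != len(cot):
        raise ValueError("label lists must have the same length")
    counts = Counter(zip(human, cot))
    return [[counts[(h, c)] for c in levels] for h in levels]


levels = ["easy", "medium", "hard"]  # hypothetical level names
human = ["easy", "easy", "medium", "hard", "hard", "medium"]
cot = ["easy", "medium", "medium", "hard", "medium", "medium"]

matrix = confusion_matrix(human, cot, levels)
# Fraction of tasks on which both annotation methods agree (the diagonal).
agreement = sum(matrix[i][i] for i in range(len(levels))) / len(human)

for level, row in zip(levels, matrix):
    print(level, row)
print("agreement:", agreement)
```

Off-diagonal cells show where the two annotation methods disagree, for example tasks that humans rate as hard but that fall into the medium CoT cluster.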

Processed_Data/

This folder contains the original datasets as well as the datasets annotated with difficulty labels based on CoT length.

prompts_en_extra_is_freeform.jsonl: HumanEval dataset
prompts_python_en_test.jsonl: CodeContests dataset
prompts.jsonl: LeetCodeSample dataset

K-means/

This folder contains the three datasets merged into a single dataset, annotated with CoT-based difficulty labels derived from the DeepSeek-R1-Distill-Qwen-32B model.

Result/

This folder contains the generation results and code produced by invoking the models from the model pool on the three datasets.

Thinking_Length/

This folder contains the CoT lengths generated by the DeepSeek R1 distilled models with parameter sizes of 1.5B, 7B, 14B, and 32B, along with their corresponding clustering results.
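One possible way to measure a CoT length is sketched below. It assumes the reasoning is wrapped in `<think>...</think>` tags, as DeepSeek-R1-style models emit, and it counts whitespace-separated words as a proxy. The repository may count tokenizer tokens instead, so `cot_length` and the sample response are illustrative only.

```python
import re

# Non-greedy match so that only the first reasoning block is captured.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def cot_length(response: str) -> int:
    """Count whitespace-separated words inside the first <think> block."""
    match = THINK_BLOCK.search(response)
    if match is None:
        return 0  # no reasoning block found
    return len(match.group(1).split())


# Made-up model response: reasoning first, then the generated code.
sample = (
    "<think>We need a loop over the list.\n"
    "Then sum the even values.</think>"
    "def f(xs): return sum(x for x in xs if x % 2 == 0)"
)
print(cot_length(sample))
```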

Train/

This folder contains the code and results for fine-tuning CodeBERT and training the XGBoost classifier.

CodeBert_finetune.py: Training code for fine-tuning CodeBERT.
data_split.py: Code for splitting the dataset into training and testing sets.
score.py: Formula for calculating the cost-performance score of models.
Classifier.py: Code for training the XGBoost classifier.
test_data.jsonl: Test dataset.
train_data.jsonl: Training dataset.
predictions_1.jsonl: Prediction results from AdaptiveLLM.
predictions_2.jsonl: Prediction results from AdaptiveLLM (without fine-tuning).
xgboost_model_1.pkl: Trained XGBoost classifier from the AdaptiveLLM framework.
xgboost_model_2.pkl: Trained XGBoost classifier from AdaptiveLLM (without fine-tuning).
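The train/test split can be sketched as follows. This is not the code in data_split.py: the 80/20 ratio, the seed, the helper names, and the record fields `task_id` and `difficulty` are assumptions made for illustration.

```python
import json
import random


def split_records(records, test_ratio=0.2, seed=42):
    """Shuffle a copy of the records and return (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


# Made-up records with hypothetical fields.
records = [{"task_id": i, "difficulty": i % 3} for i in range(10)]
train, test = split_records(records)
print(len(train), len(test))

# To persist the split, one could call, for example:
# write_jsonl("train_data.jsonl", train)
# write_jsonl("test_data.jsonl", test)
```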

Model candidate pool

LLM	Size	Link	Price
Yi-Coder-1.5B-Chat	1.5B	https://huggingface.co/01-ai/Yi-Coder-1.5B-Chat	$0.14 / M tokens
Qwen2.5-Coder-1.5B-Instruct	1.5B	https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct	$0.14 / M tokens
CodeLlama-7b-Instruct-hf	7B	https://huggingface.co/meta-llama/CodeLlama-7b-Instruct-hf	$0.42 / M tokens
starcoder2-15b-instruct-v0.1	15B	https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1	$0.72 / M tokens
deepseek-coder-v2-lite-instruct	16B	https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct	$0.72 / M tokens
Codestral-22B-v0.1	22B	https://huggingface.co/mistralai/Codestral-22B-v0.1	$0.95 / M tokens
deepseek-coder-33b-instruct	33B	https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct	$1.26 / M tokens
Qwen2.5-Coder-32B-Instruct	32B	https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct	$1.26 / M tokens
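To make the price column concrete, the sketch below estimates the inference cost of a request from the per-million-token prices listed above. It is not the cost-performance score in score.py. The helper names are hypothetical, and it is an assumption that the listed price applies uniformly to all tokens in a request.

```python
# Prices in USD per million tokens, copied from the table above.
PRICE_PER_M_TOKENS = {
    "Yi-Coder-1.5B-Chat": 0.14,
    "Qwen2.5-Coder-1.5B-Instruct": 0.14,
    "CodeLlama-7b-Instruct-hf": 0.42,
    "starcoder2-15b-instruct-v0.1": 0.72,
    "deepseek-coder-v2-lite-instruct": 0.72,
    "Codestral-22B-v0.1": 0.95,
    "deepseek-coder-33b-instruct": 1.26,
    "Qwen2.5-Coder-32B-Instruct": 1.26,
}


def inference_cost(model: str, tokens: int) -> float:
    """Estimated cost in USD for processing `tokens` tokens with `model`."""
    return PRICE_PER_M_TOKENS[model] * tokens / 1_000_000


def cheapest(models):
    """Return the lowest-priced model among the given candidates."""
    return min(models, key=PRICE_PER_M_TOKENS.__getitem__)


print(inference_cost("CodeLlama-7b-Instruct-hf", 500_000))
print(cheapest(["Codestral-22B-v0.1", "CodeLlama-7b-Instruct-hf"]))
```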

Reasoning model

LLM	Size	Link
DeepSeek-R1-Distill-Qwen-1.5B	1.5B	https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
DeepSeek-R1-Distill-Qwen-7B	7B	https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
DeepSeek-R1-Distill-Qwen-14B	14B	https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
DeepSeek-R1-Distill-Qwen-32B	32B	https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
DeepSeek-R1	671B	https://huggingface.co/deepseek-ai/DeepSeek-R1
