This project was developed with the assistance of modern AI-powered development tools, including Cursor IDE and Tongyi Qianwen. All code has been carefully reviewed to ensure originality and compliance with best practices. The implementation represents original work by the authors.
The data and code for the paper FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging.
FinanceReasoning is a a novel benchmark designed to evaluate the reasoning capabilities of large reasoning models (LRMs) in financial numerical reasoning problems.
Based on the difficulty of reasoning, we divided the problems into three subsets: Easy (1,000 examples), Medium (1,000 examples), and Hard (238 examples).
The dataset is provided in json format and contains the following attributes at the data directory:
{
"question_id": "[string] Unique identifier for the question",
"question": "[string] The question text, typically a financial data analysis problem",
"context": "[string] Background information for the question, including tabular data in Markdown format",
"statistics": {
"number_statistics": "[object] Statistics about numbers, including count of numbers in the question",
"operator_statistics": "[object] Statistics about operator usage, tracking frequency of different operators",
"code_statistics": "[object] Code-related statistics, such as number of code lines"
},
"python_solution": "[string] Python solution code written by financial experts, with clear variable names and execution logic",
"ground_truth": "[number / boolean] The standard answer, typically the result of executing the Python solution",
"difficulty": "[float] Difficulty coefficient of the question, higher values indicate greater difficulty",
"level": "[string] Difficulty level classification of the question (e.g., hard, medium, easy)",
"source": "[string] Source identifier of the question"
}The financial functions library is a collection of financial functions that are used to solve the financial numerical reasoning problems. It is provided in json format and contains the following attributes at the data/functions directory:
{
"function_id": "[string] Unique identifier for the function",
"function": "[string] The function code",
"function_docstring": "[string] The docstring of the function"
}The financial documents library is a collection of financial documents that are used to solve the financial numerical reasoning problems. It is provided in json format and contains the following attributes at the data/documents directory:
{
"document_id": "[string] Unique identifier for the document",
"document": "[string] The document text",
"document_docstring": "[string] The docstring of the document"
}You can install the dependencies by the following command:
pip install -r requirements.txtThe config/config.yaml file controls all aspects of inference and evaluation:
- Inference settings (e.g., dataset, subset, model, prompt type)
- Evaluation settings
- Model configurations (API keys, base URLs, sampling parameters)
We support inference with various LLM models through two approaches:
-
Configuration-based Inference
python inference.py --config config/config.yaml
This method uses the configuration file to specify model settings, dataset parameters, and inference options.
-
Batch API Inference
python utils/openai_batch.py \ --dataset "FinanceReasoning" \ --subset "hard" \ --prompt "cot" \ --model "your_model_id" \ --api_key "your_api_key" \ --base_url "your_base_url"
This method allows you to get 50% discount on the openai inference cost.
Inference results are stored in the results directory, organized by:
- Dataset name
- Dataset subset (
hard,medium,easy) - Prompt type
- Model name
Evaluate model performance using:
python evaluation.py --config config/config.yamlReplace the API_KEY and BASE_URL in the .env file with your own API key and base URL.
python serve_retriever_function.pypython serve_retriever_section.pypython rag_parallel.pyYou can set the arguments as follows:
dataset = 'FinanceReasoning' subset = 'hard' prompt_type = 'cot_rag' model_name = 'gpt-4o-2024-11-20' model_name_file = 'gpt-4o-2024-11-20' llm_instruct = True use_article = True use_reasoning = False use_retrieved_cache = False retrieved_type = 'function' judge_useful_functions = True use_useful_cache = True top_k = 30 input_file = f'./data/{dataset}/{subset}.json'
The CoT outputs are stored in the results/FinanceReasoning/hard/raw_cot_outputs and results/FinanceReasoning/hard/processed_cot_outputs directory.
The CoT results are stored in the results/FinanceReasoning/hard/results/hard_cot_results.json
The PoT outputs are stored in the results/FinanceReasoning/hard/raw_pot_outputs and results/FinanceReasoning/hard/processed_pot_outputs directory.
The PoT results are stored in the results/FinanceReasoning/hard/results/hard_pot_results.json