A unified inference framework for large language models with tool-integrated reasoning, supporting high concurrency and custom tools.
This framework provides a clean, modular implementation for tool-using agent inference, including three built-in tools (Search, Visit, Python) and reward calculation capabilities.
Key features:

- Unified Framework: Support for multiple tools in a single inference pipeline
- High Concurrency: Efficient batch evaluation with configurable concurrency control
- Custom Tools: Easy integration of custom tools through YAML configuration
- Multi-turn Conversations: Support for complex multi-turn tool-using interactions
- Flexible Evaluation: Configurable reward calculation and evaluation metrics
Project structure:

```
tool-reasoning-framework/
├── tools/ # Tool implementations
│ ├── __init__.py
│ ├── base_tool.py # Base class for all tools
│ ├── search_tool.py # GoogleSearchTool - web search
│ ├── visit_tool.py # VisitTool - webpage visiting
│ └── python_tool.py # PythonTool - code execution
├── reward/ # Reward calculation
│ ├── __init__.py
│ ├── base_reward_calculator.py
│ └── reward_calculator.py # LLM-based semantic reward
├── evaluator/ # Evaluation pipeline
│ ├── __init__.py
│ ├── base_evaluator.py # Main evaluation orchestrator
│ └── base_interaction.py # Interaction management
├── utils/ # Utility functions
│ ├── __init__.py
│ ├── run_evaluation.py # Command-line evaluation script
│ ├── tool_loader.py # Tool loading from YAML
│ ├── class_loader.py # Dynamic class loading
│ └── format_time.py # Time utilities
├── schemas/ # Schema definitions
│ ├── __init__.py
│ └── tool_schemas.py # OpenAI function tool schemas
├── core/ # Core utilities
│ ├── __init__.py
│ └── rollout_trace.py # Rollout tracing decorator
├── configs/ # Tool configurations
│ ├── search_visit_tool_config.yaml
│ ├── python_tool_config.yaml
│ └── search_visit_python_tool_config.yaml
├── scripts/ # Evaluation scripts
│ ├── run_search_visit_evaluation.sh
│ ├── run_python_evaluation.sh
│ └── run_search_visit_python_evaluation.sh
├── examples/ # Usage examples
│ └── example_usage.py
├── data/ # Evaluation datasets
│ ├── simpleqa.json
│ ├── aime25.json
│ ├── aime24.json
│ ├── math500.json
│ └── webinstruct.json
├── result_analysis/ # Result analysis tools
│ ├── calculate_pte.py # PTE cost analysis
│ ├── analyze_timing.py # Timing correlation analysis
│ └── README.md
├── requirements.txt
└── README.md
```
GoogleSearchTool performs web searches using the Google Search API (via the Serper API).
Features:
- Supports single or multiple queries
- Filters out HuggingFace datasets/spaces
- Returns formatted search results with titles, links, and snippets
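As an illustration of the query parameter (an array of strings, as defined in the tool schema under configs/), a single call can batch several searches. The Python dict below mirrors the function-call arguments a model would emit; the exact wrapping depends on your model's tool-calling format:

```python
# Function-call arguments for google_search_tool, following the schema in configs/:
# "query" is an array of strings, so one call can issue several searches at once.
search_call = {
    "name": "google_search_tool",
    "arguments": {
        "query": ["capital of France", "population of Paris"],
    },
}
```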
VisitTool visits web pages and extracts relevant information based on the user's goal.
Features:
- Fetches webpage content via proxy
- Uses LLM to extract relevant information
- Blocks HuggingFace dataset/space URLs
- Returns structured information (evidence, summary)
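For intuition, a hypothetical call and result are sketched below; the argument and field names here are illustrative assumptions, and the authoritative parameter names live in the tool schema under configs/:

```python
# Illustrative only: "url" and "goal" are assumed parameter names, not the exact schema.
visit_call = {
    "name": "visit_tool",
    "arguments": {
        "url": "https://en.wikipedia.org/wiki/Paris",
        "goal": "Find when the Eiffel Tower was completed",
    },
}
# The tool returns structured information extracted by the summary model,
# conceptually along the lines of:
visit_result = {
    "evidence": "…text quoted from the page that supports the goal…",
    "summary": "…short answer distilled from the evidence…",
}
```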
PythonTool executes Python code in a sandboxed environment.
Features:
- Executes Python code remotely via sandbox_fusion
- Returns stdout and stderr
- Supports timeout detection
- Maintains execution history
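A typical call passes the code to run as a string; the sketch below assumes the parameter is named code (check configs/python_tool_config.yaml for the exact schema):

```python
# Illustrative function-call arguments for the Python tool.
python_call = {
    "name": "python_tool",
    "arguments": {
        "code": "print(sum(i * i for i in range(10)))",
    },
}
# A successful execution returns the captured streams, e.g. stdout "285\n" and an empty stderr.
```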
The reward calculator (GoogleandPythonRewardCalculator) uses LLM-based semantic matching to evaluate the correctness of model outputs against ground truth answers.
Features:
- Extracts answers from the model output using <ANSWER> tags
- Uses an LLM to judge semantic correctness
- Returns a score between 0 and 1
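The extraction step itself is simple string processing; a minimal sketch, assuming the answer is wrapped in <ANSWER>...</ANSWER> tags as described above (the LLM-based correctness judgment happens separately):

```python
import re

def extract_answer(model_output: str):
    """Return the text inside the last <ANSWER>...</ANSWER> block, or None if absent."""
    matches = re.findall(r"<ANSWER>(.*?)</ANSWER>", model_output, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

print(extract_answer("Reasoning... <ANSWER>Paris</ANSWER>"))  # -> Paris
```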
```bash
pip install -r requirements.txt
```

All dependencies, including sandbox_fusion, are included in requirements.txt.
Set the following environment variables:
```bash
# OpenAI API (for LLM calls)
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="http://your-api-url/v1"
export OPENAI_MODEL="gpt-oss"
# Search Tool Service
export SERPER_SERVICE_URL="http://your-search-service-url/serper" # Your search service URL (FastAPI/Flask)
# Visit Tool Service
export JINA_SERVICE_URL="http://your-jina-service-url/jina" # Your Jina service URL (FastAPI/Flask)
export VISIT_SUMMARY_API_KEY="your-api-key" # API key for summary extraction model
export VISIT_SUMMARY_API_URL="http://your-api-url/v1" # API URL for summary extraction
export VISIT_SUMMARY_MODEL="gpt-oss" # Model name for summary extraction
# Python Tool Service
export PYTHON_SERVICE_URL="http://your-python-service-url" # Your Python execution service URL (FastAPI/Flask)
# Stress Test Tools (No Summary)
export SEARCH_SERVICE_URL="http://your-search-service-url/serper_nosummary" # Search service (no summary)
export JINA_SERVICE_URL="http://your-jina-service-url/jina_nosummary" # Jina service (no summary)Before running evaluations, you need to set up the tool services. We provide a FastAPI service for Search and Visit tools, and you can use SandboxFusion for Python execution.
We provide a ready-to-use FastAPI service for Search (Serper) and Visit (Jina) tools:
```bash
# Set your API keys
export SERPER_API_KEY="your-serper-api-key" # Get from https://serper.dev
export JINA_API_KEY="your-jina-api-key" # Get from https://jina.ai
# Run the service (default port 5003)
python services/fastapi_serper_jina.py
# Or with custom port and workers
export SERVICE_PORT=5003
export SERVICE_WORKERS=4
uvicorn services.fastapi_serper_jina:app --host 0.0.0.0 --port 5003 --workers 4
```

The service provides two endpoints:
- POST /serper: Google Search via the Serper API
- GET/POST /jina: Webpage content extraction via the Jina API
Health Check: GET /health to verify the service is running.
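For a quick check from Python (derived from the SERPER_SERVICE_URL configured above; adjust if your service runs elsewhere):

```python
import os
import requests

# Strip the trailing "/serper" path to get the service root, then hit /health.
serper_url = os.environ.get("SERPER_SERVICE_URL", "http://localhost:5003/serper")
base_url = serper_url.rsplit("/", 1)[0]
resp = requests.get(f"{base_url}/health", timeout=10)
print(resp.status_code, resp.text)  # expect 200 when the service is up
```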
For Python code execution, we recommend using SandboxFusion by ByteDance.
Quick Start with Docker (recommended):
```bash
# Clone SandboxFusion
git clone https://github.com/bytedance/SandboxFusion.git
cd SandboxFusion
# Build and run with Docker
docker build -f ./scripts/Dockerfile.base -t code_sandbox:base .
sed -i '1s/.*/FROM code_sandbox:base/' ./scripts/Dockerfile.server
docker build -f ./scripts/Dockerfile.server -t code_sandbox:server .
docker run -d --rm --privileged -p 8080:8080 code_sandbox:server make run-online
```

Manual Installation:
```bash
# Follow the installation guide at:
# https://github.com/bytedance/SandboxFusion#installation
```

The service will be available at http://localhost:8080 by default.
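Once the container is running, you can sanity-check it from Python. The sketch below assumes SandboxFusion's run_code endpoint and request format; consult its README if your deployment differs:

```python
import requests

resp = requests.post(
    "http://localhost:8080/run_code",
    json={"code": "print('sandbox ok')", "language": "python"},
    timeout=30,
)
print(resp.status_code)
print(resp.json())  # the response includes the run status and captured stdout/stderr
```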
After starting the services, update the environment variables in the evaluation scripts or set them globally:
# Search & Visit service (FastAPI)
export SERPER_SERVICE_URL="http://localhost:5003/serper"
export JINA_SERVICE_URL="http://localhost:5003/jina"
# Python service (SandboxFusion)
export PYTHON_SERVICE_URL="http://localhost:8080"
```

Note: The evaluation scripts use localhost by default. Modify the scripts if your services are running on different hosts or ports.
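Before launching a run, it can help to confirm that the variables your chosen tools need are actually set. A small check, using the variable names listed above (trim the list to the tools you use):

```python
import os

required = [
    "OPENAI_API_KEY", "OPENAI_BASE_URL", "OPENAI_MODEL",
    "SERPER_SERVICE_URL", "JINA_SERVICE_URL", "PYTHON_SERVICE_URL",
]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("Tool service configuration looks complete.")
```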
All evaluation scripts are located in the scripts/ directory. Each script automatically configures the required tool service URLs. Simply navigate to the repository root and run the desired script.
Evaluates using the Search and Visit tools on the SimpleQA dataset:

```bash
cd tool-reasoning-framework
bash scripts/run_search_visit_evaluation.sh
```

Dataset: data/simpleqa.json
Output: ./results/simpleqa_<model_name>/
Tools Used: GoogleSearchTool, VisitTool (with summary extraction)
Evaluates using the Python tool on the AIME and Math500 datasets:

```bash
cd tool-reasoning-framework
bash scripts/run_python_evaluation.sh
```

Datasets:
- data/aime25.json
- data/aime24.json
- data/math500.json
Output: ./results/<dataset_name>_<model_name>/ for each dataset
Tools Used: PythonTool
Evaluates using all three tools on the WebInstruct dataset:

```bash
cd tool-reasoning-framework
bash scripts/run_search_visit_python_evaluation.sh
```

Dataset: data/webinstruct.json
Output: ./results/webinstruct_<model_name>/
Tools Used: GoogleSearchTool, VisitTool (with summary extraction), PythonTool
High-concurrency stress testing with Search + Visit tools (no summary extraction):
```bash
cd tool-reasoning-framework
bash scripts/run_stress_test_nosummary.sh
```

Features:
- No summary extraction for faster processing
- High concurrency support (up to 256 concurrent requests)
- Timing logs included in results
- Optimized for performance testing
Output: ./results/<dataset_name>_<model_name>_timing_log/
Tools Used: ESSearchToolNoSummary, VisitToolNoSummary
Edit the scripts in the scripts/ directory to customize:
- API configuration (API_KEY, API_URL, API_MODEL)
- Dataset paths
- Output directories
- Maximum turns and concurrency
- Resume from previous results
You can also use the command-line interface directly:
```bash
python -m utils.run_evaluation \
--dataset-path /path/to/dataset.json \
--output-dir ./results \
--api-key "your-api-key" \
--api-url "http://your-api-url/v1" \
--api-model "gpt-oss" \
--reward-calculator-class "reward.reward_calculator.GoogleandPythonRewardCalculator" \
--tool-config ./configs/search_visit_tool_config.yaml \
--max-assistant-turns 10 \
--max-user-turns 99999 \
--max-concurrent 10 \
--verbose
```

For programmatic use:

```python
import asyncio
from evaluator.base_evaluator import BaseEvaluator
from reward.reward_calculator import GoogleandPythonRewardCalculator
async def main():
    reward_calculator = GoogleandPythonRewardCalculator()
    evaluator = BaseEvaluator(
        api_key="your-api-key",
        reward_calculator=reward_calculator,
        api_url="http://your-api-url/v1",
        api_model="gpt-oss",
        max_assistant_turns=10,
        max_user_turns=10,
    )
    input_data = {
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "reward_model": {"ground_truth": "Paris"},
        "extra_info": {
            "tools_kwargs": {
                "google_search_tool": {"create_kwargs": {}}
            }
        }
    }
    results = await evaluator.run_evaluation(
        dataset=[input_data],
        yaml_tool_path="configs/search_visit_tool_config.yaml",
        output_dir="./results",
        max_concurrent=1
    )
    print(f"Score: {results[0]['score']}")
asyncio.run(main())
```

Tools are configured via YAML files. See examples in configs/:
```yaml
tools:
  - class_name: "tools.search_tool.GoogleSearchTool"
    config:
      type: "native"
    tool_schema:
      type: "function"
      function:
        name: "google_search_tool"
        description: "Performs web searches..."
        parameters:
          type: "object"
          properties:
            query:
              type: "array"
              items:
                type: "string"
          required: ["query"]
```

After running evaluations, you can analyze the results using scripts in result_analysis/:
Calculate PTE (Prefill-Token-Equivalent) cost metrics:
```bash
python result_analysis/calculate_pte.py \
--jsonl-path ./results/dataset_model/eval_results.jsonl \
--tokenizer-path /path/to/tokenizer \
--gamma 0.00329 \
--model-name "ModelName" \
--output-csv output.csv
```

Analyze timing correlations (requires timing logs from the stress test):
```bash
cd tool-reasoning-framework
python result_analysis/analyze_timing.py \
--input-file ./results/dataset_model_timing_log/model/eval_results.jsonl \
--tokenizer-path /path/to/tokenizer \
--gamma 0.000718 \
--output-image latency_correlation.png
```

Example: Analyzing sample timing data
We provide a sample timing data file for testing:
```bash
cd tool-reasoning-framework
python result_analysis/analyze_timing.py \
--input-file ./result_analysis/example_timing_data.jsonl \
--tokenizer-path /path/to/tokenizer \
--gamma 0.000718 \
--output-image ./result_analysis/example_latency_correlation.png \
--max-lines 100
```

This will generate correlation analysis plots showing the relationship between per-step latency, tokens, and PTE metrics.
See result_analysis/README.md for detailed usage.
BaseEvaluator is the main evaluation class that orchestrates the inference pipeline:
- Manages tool execution
- Handles multi-turn conversations
- Calculates rewards
- Supports batch evaluation with concurrency control
- Supports resume from previous results
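Concurrency control follows the standard asyncio pattern of bounding in-flight work with a semaphore. The sketch below illustrates the idea behind max_concurrent; it is not the framework's actual implementation:

```python
import asyncio

async def run_all(samples, evaluate_one, max_concurrent=10):
    """Evaluate samples concurrently, keeping at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(sample):
        async with semaphore:
            return await evaluate_one(sample)

    return await asyncio.gather(*(guarded(s) for s in samples))
```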
The tool base class (tools/base_tool.py) defines the interface that every tool must implement:
- create(): Create a tool instance
- execute(): Execute the tool with the given parameters
- calc_reward(): Calculate a tool-specific reward
- release(): Release the tool instance
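A skeleton of a custom tool is sketched below. It assumes the base class is named BaseTool and that the methods are async where shown; the exact signatures may differ, so treat tools/base_tool.py and the built-in tools as the authoritative reference:

```python
from tools.base_tool import BaseTool  # assumed class name; see tools/base_tool.py

class EchoTool(BaseTool):
    """Hypothetical tool that simply echoes its input back to the model."""

    async def create(self, instance_id=None, **kwargs):
        # Set up per-rollout state and return an instance identifier.
        return instance_id or "echo-instance"

    async def execute(self, instance_id, parameters, **kwargs):
        # Perform the tool action and return the observation shown to the model.
        return f"echo: {parameters.get('text', '')}"

    def calc_reward(self, instance_id, **kwargs):
        # Tool-specific reward contribution; 0.0 when the tool does not affect scoring.
        return 0.0

    async def release(self, instance_id, **kwargs):
        # Clean up any resources tied to this instance.
        pass
```

To use it, point class_name in a YAML tool config at the new class (e.g. tools.echo_tool.EchoTool) and give it a tool_schema, following the configuration format shown above.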
The reward calculator base class (reward/base_reward_calculator.py) must implement:
- extract_output(): Extract the answer from the model output
- _verify_correction(): Verify correctness against the ground truth
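A custom calculator follows the same pattern. The sketch below assumes the base class in reward/base_reward_calculator.py is named BaseRewardCalculator and that both methods are synchronous; adjust to the actual signatures:

```python
import re

from reward.base_reward_calculator import BaseRewardCalculator  # assumed class name

class ExactMatchRewardCalculator(BaseRewardCalculator):
    """Hypothetical calculator that replaces the LLM judge with exact string matching."""

    def extract_output(self, model_output: str):
        # Reuse the <ANSWER> tag convention described above.
        matches = re.findall(r"<ANSWER>(.*?)</ANSWER>", model_output, flags=re.DOTALL)
        return matches[-1].strip() if matches else None

    def _verify_correction(self, extracted_output, ground_truth):
        # Return a score in [0, 1]; here 1.0 only on a case-insensitive exact match.
        if extracted_output is None:
            return 0.0
        return float(extracted_output.strip().lower() == str(ground_truth).strip().lower())
```

Select it at run time with --reward-calculator-class, in the same way GoogleandPythonRewardCalculator is referenced above.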
Evaluation results are saved in JSONL format with the following structure:
```json
{
  "input": {...},
  "messages": [...],
  "extracted_output": "...",
  "ground_truth": "...",
  "score": 1.0,
  "reached_max_turns": false,
  "turn_record": {...},
  "success": true,
  "time_stats": {
    "summary": {
      "total_llm_inference_seconds": 10.5,
      "total_tool_execution_seconds": 5.2,
      "total_wall_time_seconds": 15.7
    },
    "timeline": [...]
  },
  "evaluation_config": {...}
}
```

License: see the LICENSE file for details.