An intelligent multi-agent system that automatically reproduces research paper experiments and evaluates their reproducibility. This project uses smolagents (HuggingFace's lightweight agent framework) to orchestrate specialized agents that parse papers, find code repositories, run experiments, and evaluate results.
This system uses LLM intelligence for decision-making, not procedural code:
```
┌──────────────────────────────────────────────────────────────┐
│                 Reproducibility Orchestrator                 │
│                    (Pipeline Coordinator)                    │
└───────┬──────────────┬───────────────┬───────────────┬──────┘
        │              │               │               │
┌───────▼──────┐ ┌─────▼─────┐ ┌──────▼───────┐ ┌─────▼─────┐
│    Paper     │ │   Repo    │ │  Experiment  │ │ Evaluator │
│    Parser    │ │  Finder   │ │    Runner    │ │   Agent   │
│    Agent     │ │   Agent   │ │    Agent     │ │           │
│              │ │           │ │              │ │           │
│ LLM: Extract │ │ LLM: Pick │ │ LLM: Analyze │ │ LLM: Eval │
│  structured  │ │ best repo │ │  & build cmd │ │  results  │
│     info     │ │           │ │              │ │           │
└───────┬──────┘ └─────┬─────┘ └──────┬───────┘ └─────┬─────┘
        │              │              │               │
  ┌─────▼────┐   ┌─────▼────┐   ┌─────▼────┐    ┌─────▼────┐
  │   PDF    │   │  GitHub  │   │   Code   │    │  Result  │
  │  Parser  │   │          │   │ Executor │    │Comparator│
  │   Tool   │   │          │   │   Tool   │    │   Tool   │
  └──────────┘   └──────────┘   └──────────┘    └──────────┘
```
Key Principle: Agents THINK (using LLM), Tools DO (using code)
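That split can be sketched in a few lines. This is a hypothetical illustration of the pattern, not the project's actual classes: the tool is plain deterministic code, while the agent decides which tool to invoke (here the "decision" is hard-coded so the sketch runs offline without an LLM).

```python
def word_count_tool(text: str) -> int:
    """A tool DOES: deterministic code, no LLM involved."""
    return len(text.split())

class SketchAgent:
    """An agent THINKS: it chooses which registered tool to call."""

    def __init__(self):
        self.tools = {}

    def register_tool(self, name, fn):
        self.tools[name] = fn

    def execute(self, task: str):
        # A real agent would ask the LLM to pick the tool and its input;
        # we hard-code the choice to keep the sketch runnable.
        chosen_tool = "word_count"
        return self.tools[chosen_tool](task)

agent = SketchAgent()
agent.register_tool("word_count", word_count_tool)
print(agent.execute("reproduce the paper experiments"))  # 4
```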
**Paper Parser Agent**
- Tools: Read PDF, extract text
- LLM: Intelligently extract datasets, hyperparameters, methodology
**Repo Finder Agent**
- Tools: Build search queries and URLs for repository discovery
- LLM: Generate search queries based on paper metadata
**Experiment Runner Agent**
- Tools: Read files, run commands
- LLM: Understand repository, determine how to run code, construct command with all required arguments
**Evaluator Agent**
- Tools: Calculate metric differences, vision model for semantic plot comparison, figure/table metric extraction
- LLM: Analyze WHY results differ, semantically compare figures, assess quality, provide recommendations
- NEW: Uses vision-language models (GPT-4V, Claude 3) to:
- Understand plots semantically, not just pixel-wise
- Extract numerical metrics directly from paper figures AND tables when text extraction fails
- Vision model parses figures/charts/plots (images)
- LLM intelligently parses standalone tables (text)
- Automatically handles all three: text tables, figure images, and tables-in-figures
- Clone and set up the environment:

```bash
cd ~/esilabs
python -m venv .venv
source .venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Configure environment variables:

```bash
cp .env.example .env
# Edit .env with your credentials
```

- Launch the pipeline:

```bash
python launch_agent.py
# or with uv
uv run launch_agent.py
```

This project now mirrors the official smolagents quick demo:
- Pick any model/provider via `.env`: `MODEL_PROVIDER` accepts `openai`, `azure`, `anthropic`, or `ollama`.
- Use `MODEL_NAME` to point at the exact model (for example `gpt-4o`, `claude-3-sonnet`, or a local `llama3.1:8b`).
- Optional: set `LLM_CODE_RETRIES` to tell the agent how many times it should gently remind smaller models to answer with Python-only tool calls (default `3`).
- Credentials stay outside the repo: supply the right API key for the provider you choose (OpenAI, Anthropic, Azure, etc.) or keep everything local with Ollama/transformers.
Because smolagents is model-agnostic, you can swap providers without touching the codebase: update the environment variables and rerun `launch_agent.py`.
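For concreteness, a `.env` for a fully local setup might look like this (the values below are illustrative, not defaults shipped with the repo):

```
MODEL_PROVIDER=ollama
MODEL_NAME=llama3.1:8b
LLM_CODE_RETRIES=3

# For a hosted provider instead, e.g.:
# MODEL_PROVIDER=openai
# MODEL_NAME=gpt-4o
# OPENAI_API_KEY=<your key>
```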
Key Concepts:
- Abstract base class pattern
- Tool registration system
- LLM client integration
- Execution history tracking
Example:

```python
class MyAgent(BaseAgent):
    def __init__(self):
        super().__init__(
            agent_name="my_agent",
            system_prompt="Your system prompt here"
        )
        self.register_tool("my_tool", my_tool_function)

    def execute(self, task):
        result = self.call_llm(messages=[...])
        self.log_execution("step_name", result)
        return result
```

Each agent works autonomously using tools:
- Tools: `parse_pdf` (reads PDF content)
- Autonomous Behavior: Agent reads the PDF, analyzes content, extracts structured information (title, authors, datasets, hyperparameters, methodology, experiment results, metrics)
- Result: Structured JSON with all paper information
- Tools: Repository search query builder
- Autonomous Behavior: Agent generates search queries based on paper title and authors, provides GitHub search URLs
- Result: Search queries and URLs to help users find relevant repositories
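The query-building step can be sketched as follows. The function name and URL format are illustrative assumptions about what such a tool does, not the project's actual implementation:

```python
from urllib.parse import quote_plus

def build_github_search_urls(title: str, authors: list[str]) -> list[str]:
    """Hypothetical sketch: combine paper metadata into GitHub search URLs."""
    queries = [title]
    if authors:
        # A second, narrower query including the first author's name
        queries.append(f"{title} {authors[0]}")
    base = "https://github.com/search?type=repositories&q="
    return [base + quote_plus(q) for q in queries]

urls = build_github_search_urls("Attention Is All You Need", ["Vaswani"])
print(urls[0])
# https://github.com/search?type=repositories&q=Attention+Is+All+You+Need
```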
When automated repository search fails, the system can ask you to provide one manually:
```python
from scientist.main import run_reproducibility_pipeline

# Run with interactive fallback (enabled by default)
result = run_reproducibility_pipeline(
    pdf_path="paper.pdf"
)
```

The terminal then prompts you:

```
NO REPOSITORY FOUND - INTERACTIVE MODE
Choose [1/2] or [q]uit:
1. GitHub Repository URL
2. Local ZIP file containing code
```

- Tools: `read_file_contents`, `list_directory_files`, `run_command_in_repo`, `create_file_or_directory`, `extract_metrics`
- Autonomous Behavior: Agent explores the repo, reads the README, runs scripts with `--help`, determines required arguments, creates needed files, installs dependencies, executes experiments, extracts results
- Result: Successfully executed experiments with metrics
- Example: Agent sees `usage: script.py [-h] --text-path TEXT_PATH --out-dir OUT_DIR`, creates `input.txt` and `output/`, then runs with both args
- Smart Caching: Uses a hybrid venv strategy for fast, isolated environments
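The argument-spotting step in that example can be sketched with a simple heuristic: in an argparse usage line, flags wrapped in `[...]` are optional, so whatever `--flags` remain after stripping brackets are required. This is an illustrative simplification, not the agent's actual logic:

```python
import re

def required_args_from_usage(usage: str) -> list[str]:
    """Sketch: find required --flags in an argparse --help usage line."""
    # Drop bracketed (optional) groups, then collect the remaining flags.
    without_optional = re.sub(r"\[[^\]]*\]", "", usage)
    return re.findall(r"--[\w-]+", without_optional)

usage = "usage: script.py [-h] --text-path TEXT_PATH --out-dir OUT_DIR"
print(required_args_from_usage(usage))  # ['--text-path', '--out-dir']
```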
- Tools: `extract_metrics` (extracts numerical values), `extract_table_metrics` (from images), `analyze_plot_semantics` (vision model)
- Autonomous Behavior: Agent compares original and reproduced results, analyzes the significance of differences, identifies likely causes, provides recommendations
- Enhanced Metric Extraction (Multi-Source):
  - First attempts text-based extraction from paper results
  - If no metrics are found, automatically extracts from paper figures AND tables:
    - Figures (vision model): Uses GPT-4V/Claude to parse charts, plots, and tables embedded in figures
    - Tables (LLM parsing): Intelligently extracts from text-based tables (CSV, markdown, plain text)
  - Prefixes extracted metrics with source context (e.g., "Figure_1_Recall@10", "Table_2_MRR")
  - Handles all scenarios: standalone tables, standalone figures, and tables-in-figures
- Result: Comprehensive reproducibility report with scores, analysis, and actionable insights
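The source-prefixing step above is a small transformation; a minimal sketch (the function name is an assumption, not the project's API):

```python
def prefix_metrics(source: str, metrics: dict[str, float]) -> dict[str, float]:
    """Tag each extracted metric with where it came from, so downstream
    comparison keeps provenance (figure vs. table vs. text)."""
    return {f"{source}_{name}": value for name, value in metrics.items()}

figure_metrics = prefix_metrics("Figure_1", {"Recall@10": 0.83})
table_metrics = prefix_metrics("Table_2", {"MRR": 0.41})
print({**figure_metrics, **table_metrics})
# {'Figure_1_Recall@10': 0.83, 'Table_2_MRR': 0.41}
```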
Each tool encapsulates a specific capability:
```python
from scientist.tools.pdf_parser import PDFParser

parser = PDFParser()
content = parser.parse_pdf("paper.pdf")
print(content.title, content.abstract)
```

```python
from scientist.tools.code_executor import CodeExecutor

executor = CodeExecutor(sandbox_mode=True, max_timeout=300)
result = executor.execute_command("python train.py")
print(result.stdout, result.duration_seconds)
```

Agents dynamically load their system prompts from `config/agent_instructions.yaml`. You can customize agent behavior by editing this file:
```yaml
my_agent:
  system_prompt: |
    You are an expert at...
    Your task is to...
    Be careful to...
```

To add a new tool:

- Create the tool in `src/tools/`
- Register it in the agent: `self.register_tool("tool_name", tool_function)`
- Use it in the agent: `result = self.tools["tool_name"](...)`
The pipeline generates a comprehensive report package in `data/outputs/<run_id>/`:

```
data/outputs/20251107_120530/
├── report_20251107_120530.json                   # Raw data (machine-readable)
├── report_20251107_120530.txt                    # Human-readable report
├── reproducibility_statement_20251107_120530.md  # Journal-ready statement
├── visualizations/                               # Charts and graphs
│   ├── visualizations.html                       # Interactive dashboard
│   ├── overall_performance.png                   # Summary scores
│   ├── baseline_vs_reproduced.png                # Metric comparison
│   └── deviation_distribution.png                # Error distribution
└── detailed_comparison.csv                       # Data for meta-analysis
```
Open `visualizations/visualizations.html` in your browser for an interactive view:
- Overall Score: Visual reproducibility assessment
- Metric Comparison: Side-by-side paper vs reproduced values
- Figure Mapping: Paper figures matched to reproduced outputs
- Recommendations: Actionable insights for improvement
```json
{
  "run_id": "20251107_120530",
  "pipeline": {
    "paper_id": "My Paper Title",
    "parsed_paper": {...},
    "found_repo_url": "https://github.com/...",
    "experiment_results": [...],
    "evaluation": {
      "final_reproducibility_score": 0.85,
      "metrics_matched": 8,
      "total_metrics": 10,
      "visual_score": 0.92,
      "figure_mapping": [
        {
          "paper_figure": "Figure 1",
          "reproduced_file": "output/accuracy_plot.png",
          "semantic_analysis": "Both plots show accuracy improving from 0.6 to 0.9...",
          "match": true
        }
      ],
      "issues_found": [...],
      "recommendations": [...]
    }
  }
}
```

The `detailed_comparison.csv` file contains all metrics in a structured format, perfect for:
- Meta-analysis across multiple papers
- Statistical analysis in R/Python
- Journal supplementary materials
- Reproducibility databases
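A meta-analysis pass over that CSV can be sketched with the standard library alone. The column names below are assumptions for illustration, not a documented schema; check the generated file for the actual headers:

```python
import csv
import io

# Stand-in for opening data/outputs/<run_id>/detailed_comparison.csv;
# column names are hypothetical.
sample = io.StringIO(
    "metric,paper_value,reproduced_value\n"
    "accuracy,0.90,0.88\n"
    "f1,0.85,0.86\n"
)

rows = list(csv.DictReader(sample))
deviations = {
    r["metric"]: round(abs(float(r["paper_value"]) - float(r["reproduced_value"])), 4)
    for r in rows
}
print(deviations)  # {'accuracy': 0.02, 'f1': 0.01}
```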
The system now uses vision-language models to deeply understand and compare plots:
Our Enhanced Approach (Semantic):
- ✅ Understands what the plot shows, not just how it looks
- ✅ Compares trends, patterns, and numerical values
- ✅ Handles style variations gracefully
- ✅ Provides human-like analysis: "Both plots show accuracy improving from 0.6 to 0.9 over epochs"
- Agent Design Patterns: How to structure autonomous agents
- Tool Integration: Creating and managing agent capabilities
- LLM Integration: Using modern language models in applications
- Error Handling: Robust error management across pipeline stages
- Configuration Management: Environment-based configuration
- Testing Strategies: Testing multi-agent systems
- DevOps Concepts: Environment management, logging, monitoring
This project includes security features:
- Sandbox Execution: Code runs in isolated environments
- Command Validation: Forbidden commands are blocked
- Timeout Protection: Execution limits prevent infinite loops
- Environment Isolation: Virtual environments for each experiment using smart caching
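The command-validation idea can be sketched as a simple blocklist check. The list below is illustrative, not the project's actual forbidden-command set:

```python
# Hypothetical blocklist; a real validator would be more thorough
# (argument parsing, path checks, allowlists).
FORBIDDEN = ("rm -rf", "sudo", "mkfs", "shutdown")

def is_command_allowed(command: str) -> bool:
    """Reject commands containing a forbidden substring."""
    lowered = command.lower()
    return not any(bad in lowered for bad in FORBIDDEN)

print(is_command_allowed("python train.py"))  # True
print(is_command_allowed("sudo rm -rf /"))    # False
```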
The system uses a hybrid approach for managing Python environments:
- Isolation: Each experiment gets its own `.venv` directory
- Caching: Identical `requirements.txt` → reuse the cached venv (fast!)
- Performance: First run ~30s, cached runs ~0.1s
- Efficiency: Disk space saved via symlinks
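The caching step hinges on turning `requirements.txt` content into a stable cache key. A minimal sketch of that idea (hashing normalized content, so ordering and whitespace don't defeat the cache; the function name and key length are assumptions):

```python
import hashlib

def venv_cache_key(requirements_text: str) -> str:
    """Hash normalized requirements so identical dependency sets map to
    the same cached venv, regardless of line order or blank lines."""
    normalized = "\n".join(
        sorted(line.strip() for line in requirements_text.splitlines() if line.strip())
    )
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

a = venv_cache_key("numpy==1.26\npandas==2.2\n")
b = venv_cache_key("pandas==2.2\nnumpy==1.26")  # same deps, different order
print(a == b)  # True
```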