owususamuel/esilabs
Research Paper Reproducibility Agent

An intelligent multi-agent system that automatically reproduces research paper experiments and evaluates their reproducibility. This project uses smolagents (Hugging Face's lightweight agent framework) to orchestrate specialized agents that parse papers, find code repositories, run experiments, and evaluate results.

πŸ—οΈ Architecture

LLM-Powered AI Agent System

This system uses LLM intelligence for decision-making, not procedural code:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 Reproducibility Orchestrator                 β”‚
β”‚                    (Pipeline Coordinator)                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
           β”‚              β”‚              β”‚              β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
    β”‚ Paper       β”‚  β”‚  Repo   β”‚  β”‚ Experiment  β”‚  β”‚Evaluatorβ”‚
    β”‚ Parser      β”‚  β”‚ Finder  β”‚  β”‚  Runner     β”‚  β”‚  Agent  β”‚
    β”‚ Agent       β”‚  β”‚ Agent   β”‚  β”‚  Agent      β”‚  β”‚         β”‚
    β”‚             β”‚  β”‚         β”‚  β”‚             β”‚  β”‚         β”‚
    β”‚ LLM: Extractβ”‚  β”‚LLM: Pickβ”‚  β”‚LLM: Analyze β”‚  β”‚LLM: Evalβ”‚
    β”‚ structured  β”‚  β”‚best repoβ”‚  β”‚& build cmd  β”‚  β”‚results  β”‚
    β”‚ info        β”‚  β”‚         β”‚  β”‚             β”‚  β”‚         β”‚
    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
           β”‚              β”‚              β”‚              β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”
    β”‚  PDF    β”‚    β”‚GitHub    β”‚   β”‚  Code   β”‚   β”‚ Result  β”‚
    β”‚ Parser  β”‚    β”‚          β”‚   β”‚Executor β”‚   β”‚Comparator
    β”‚ Tool    β”‚    β”‚          β”‚   β”‚  Tool   β”‚   β”‚ Tool    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Principle: Agents THINK (using LLM), Tools DO (using code)

How Each Agent Uses LLM

  1. Paper Parser Agent

    • Tools: Read PDF, extract text
    • LLM: Intelligently extract datasets, hyperparameters, methodology
  2. Repo Finder Agent

    • Tools: Build search queries and URLs for repository discovery
    • LLM: Generate search queries based on paper metadata
  3. Experiment Runner Agent

    • Tools: Read files, run commands
    • LLM: Understand repository, determine how to run code, construct command with all required arguments
  4. Evaluator Agent

    • Tools: Calculate metric differences, vision model for semantic plot comparison, figure/table metric extraction
    • LLM: Analyze WHY results differ, semantically compare figures, assess quality, provide recommendations
    • NEW: Uses vision-language models (GPT-4V, Claude 3) to:
      • Understand plots semantically, not just pixel-wise
      • Extract numerical metrics directly from paper figures AND tables when text extraction fails
      • Vision model parses figures/charts/plots (images)
      • LLM intelligently parses standalone tables (text)
      • Automatically handles all three: text tables, figure images, and tables-in-figures

Installation

  1. Clone and set up the environment:
cd ~/esilabs
python -m venv .venv
source .venv/bin/activate
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure environment variables:
cp .env.example .env
# Edit .env with your credentials

Quick Start

python launch_agent.py

# or with uv
uv run launch_agent.py

LLM Configuration (smolagents style)

This project now mirrors the official smolagents quick demo:

  • Pick any model/provider via .env β€” MODEL_PROVIDER accepts openai, azure, anthropic, or ollama.
  • Use MODEL_NAME to point at the exact model (for example gpt-4o, claude-3-sonnet, or a local llama3.1:8b).
  • Optional: set LLM_CODE_RETRIES to tell the agent how many times it should gently remind smaller models to answer with Python-only tool calls (default 3).
  • Credentials stay outside the repo: supply the right API key for the provider you choose (OpenAI, Anthropic, Azure, etc.) or keep everything local with Ollama/transformers.

Because smolagents is model-agnostic, you can swap providers without touching the codebase β€” update the environment variables and rerun launch_agent.py.
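A minimal .env might look like the sketch below. The variable names come from the list above; the values are illustrative placeholders, and the API key line depends on which provider you pick.

```ini
# .env — illustrative values only
MODEL_PROVIDER=openai        # openai | azure | anthropic | ollama
MODEL_NAME=gpt-4o
OPENAI_API_KEY=your-key-here # use the key matching MODEL_PROVIDER
LLM_CODE_RETRIES=3           # optional; default is 3
```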

πŸ“š Understanding the Code

1. Base Agent Framework (scientist/agents/base_agent.py)

Key Concepts:

  • Abstract base class pattern
  • Tool registration system
  • LLM client integration
  • Execution history tracking

Example:

class MyAgent(BaseAgent):
    def __init__(self):
        super().__init__(
            agent_name="my_agent",
            system_prompt="Your system prompt here"
        )
        self.register_tool("my_tool", my_tool_function)
    
    def execute(self, task):
        result = self.call_llm(messages=[...])
        self.log_execution("step_name", result)
        return result

2. Specialized Agents

Each agent works autonomously using tools:

Paper Parser Agent

  • Tools: parse_pdf (reads PDF content)
  • Autonomous Behavior: Agent reads PDF, analyzes content, extracts structured information (title, authors, datasets, hyperparameters, methodology, experiment results, metrics)
  • Result: Structured JSON with all paper information

Repo Finder Agent

  • Tools: Repository search query builder
  • Autonomous Behavior: Agent generates search queries based on paper title and authors, provides GitHub search URLs
  • Result: Search queries and URLs to help users find relevant repositories

Interactive Mode 🎯

When automated repository search fails, the system can ask you to provide one manually:

from scientist.main import run_reproducibility_pipeline

# Run with interactive fallback (enabled by default)
result = run_reproducibility_pipeline(
    pdf_path="paper.pdf"
)

# Terminal prompts you:
# NO REPOSITORY FOUND - INTERACTIVE MODE
#   1. GitHub Repository URL
#   2. Local ZIP file containing code
# Choose [1/2] or [q]uit:

Experiment Runner Agent

  • Tools: read_file_contents, list_directory_files, run_command_in_repo, create_file_or_directory, extract_metrics
  • Autonomous Behavior: Agent explores repo, reads README, runs scripts with --help, determines required arguments, creates needed files, installs dependencies, executes experiments, extracts results
  • Result: Successfully executed experiments with metrics
  • Example: Agent sees usage: script.py [-h] --text-path TEXT_PATH --out-dir OUT_DIR, creates input.txt and output/, then runs with both args
  • Smart Caching: Uses hybrid venv strategy for fast, isolated environments
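The --help example above can be sketched roughly as follows. This is a hypothetical illustration of the flow, not the agent's actual code: script.py, input.txt, and output/ mirror the example in the bullets and are not real project files.

```python
import pathlib
import subprocess
import sys

# Probe the script's CLI; the agent's LLM reads this output to discover
# which arguments are required (e.g. --text-path and --out-dir).
help_text = subprocess.run(
    [sys.executable, "script.py", "--help"],
    capture_output=True, text=True,
).stdout

# Create the inputs the usage string asks for, then build the full command.
pathlib.Path("output").mkdir(exist_ok=True)            # satisfies --out-dir
pathlib.Path("input.txt").write_text("sample text\n")  # satisfies --text-path

cmd = [sys.executable, "script.py",
       "--text-path", "input.txt", "--out-dir", "output"]
```

In the real agent, the LLM (not hard-coded logic) decides which files and arguments are needed from the help text and README.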

Evaluator Agent

  • Tools: extract_metrics (extracts numerical values), extract_table_metrics (from images), analyze_plot_semantics (vision model)
  • Autonomous Behavior: Agent compares original and reproduced results, analyzes significance of differences, identifies likely causes, provides recommendations
  • Enhanced Metric Extraction (Multi-Source):
    • First attempts text-based extraction from paper results
    • If no metrics found, automatically extracts from paper figures AND tables:
      • Figures (vision model): Uses GPT-4V/Claude to parse charts, plots, and tables embedded in figures
      • Tables (LLM parsing): Intelligently extracts from text-based tables (CSV, markdown, plain text)
    • Prefixes extracted metrics with source context (e.g., "Figure_1_Recall@10", "Table_2_MRR")
    • Handles all scenarios: standalone tables, standalone figures, and tables-in-figures
  • Result: Comprehensive reproducibility report with scores, analysis, and actionable insights
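The numerical side of this comparison can be sketched as below. The metric names follow the source-prefix convention described above ("Table_2_MRR", "Figure_1_Recall@10"); the 5% relative tolerance is an illustrative assumption, not the agent's actual setting.

```python
# Minimal sketch of per-metric comparison between paper and reproduced values.
def compare_metrics(original, reproduced, rel_tol=0.05):
    """Return relative deviation and a match flag for each original metric."""
    report = {}
    for name, orig_val in original.items():
        repro_val = reproduced.get(name)
        if repro_val is None:
            report[name] = {"deviation": None, "match": False}
            continue
        deviation = abs(repro_val - orig_val) / max(abs(orig_val), 1e-9)
        report[name] = {"deviation": deviation, "match": deviation <= rel_tol}
    return report

report = compare_metrics(
    {"Table_2_MRR": 0.81, "Figure_1_Recall@10": 0.92},
    {"Table_2_MRR": 0.70, "Figure_1_Recall@10": 0.92},
)
# Recall@10 matches exactly; MRR deviates ~13.6% and is flagged.
```

The real Evaluator layers LLM analysis on top of numbers like these to explain *why* a metric deviates, not just that it does.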

3. Tools

Each tool encapsulates a specific capability:

PDF Parser Tool

from scientist.tools.pdf_parser import PDFParser

parser = PDFParser()
content = parser.parse_pdf("paper.pdf")
print(content.title, content.abstract)

Code Executor Tool

from scientist.tools.code_executor import CodeExecutor

executor = CodeExecutor(sandbox_mode=True, max_timeout=300)
result = executor.execute_command("python train.py")
print(result.stdout, result.duration_seconds)

Modifying Agent Behavior

Agents dynamically load their system prompts from config/agent_instructions.yaml. You can customize agent behavior by editing this file:

my_agent:
  system_prompt: |
    You are an expert at...
    Your task is to...
    Be careful to...

Adding New Tools

  1. Create the tool in scientist/tools/
  2. Register in agent: self.register_tool("tool_name", tool_function)
  3. Use in agent: result = self.tools["tool_name"](...)
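The three steps above can be sketched end to end. BaseAgent here is a simplified stand-in for scientist/agents/base_agent.py, and word_count is a made-up example tool; only the register/use pattern mirrors the project.

```python
# Simplified stand-in for the project's BaseAgent (illustration only).
class BaseAgent:
    def __init__(self, agent_name, system_prompt):
        self.agent_name = agent_name
        self.system_prompt = system_prompt
        self.tools = {}

    def register_tool(self, name, fn):
        self.tools[name] = fn

def word_count(text):                        # step 1: create the tool
    """Example tool: count words in a string."""
    return len(text.split())

class CountingAgent(BaseAgent):
    def __init__(self):
        super().__init__("counting_agent", "You count words.")
        self.register_tool("word_count", word_count)   # step 2: register

    def execute(self, task):
        return self.tools["word_count"](task)          # step 3: use

agent = CountingAgent()
n = agent.execute("reproduce this paper")
```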

πŸ“Š Output & Visualizations

The pipeline generates a comprehensive report package in data/outputs/<run_id>/:

data/outputs/20251107_120530/
β”œβ”€β”€ report_20251107_120530.json           # Raw data (machine-readable)
β”œβ”€β”€ report_20251107_120530.txt            # Human-readable report
β”œβ”€β”€ reproducibility_statement_20251107_120530.md  # Journal-ready statement
└── visualizations/                        # πŸ“Š Charts and graphs
    β”œβ”€β”€ visualizations.html                # 🌐 Interactive dashboard
    β”œβ”€β”€ overall_performance.png            # Summary scores
    β”œβ”€β”€ baseline_vs_reproduced.png         # Metric comparison
    β”œβ”€β”€ deviation_distribution.png         # Error distribution
    └── detailed_comparison.csv            # Data for meta-analysis

Interactive Dashboard

Open visualizations/visualizations.html in your browser for an interactive view:

  • Overall Score: Visual reproducibility assessment
  • Metric Comparison: Side-by-side paper vs reproduced values
  • Figure Mapping: Paper figures matched to reproduced outputs
  • Recommendations: Actionable insights for improvement

JSON Report Structure

{
  "run_id": "20251107_120530",
  "pipeline": {
    "paper_id": "My Paper Title",
    "parsed_paper": {...},
    "found_repo_url": "https://github.com/...",
    "experiment_results": [...],
    "evaluation": {
      "final_reproducibility_score": 0.85,
      "metrics_matched": 8,
      "total_metrics": 10,
      "visual_score": 0.92,
      "figure_mapping": [
        {
          "paper_figure": "Figure 1",
          "reproduced_file": "output/accuracy_plot.png",
          "semantic_analysis": "Both plots show accuracy improving from 0.6 to 0.9...",
          "match": true
        }
      ],
      "issues_found": [...],
      "recommendations": [...]
    }
  }
}
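Downstream scripts can consume the report with plain json. The snippet below embeds a trimmed version of the structure shown above; the field names match the documented report, while the values are the example ones.

```python
import json

# Trimmed-down report matching the documented structure above.
report_text = """{
  "run_id": "20251107_120530",
  "pipeline": {
    "evaluation": {
      "final_reproducibility_score": 0.85,
      "metrics_matched": 8,
      "total_metrics": 10
    }
  }
}"""

report = json.loads(report_text)
evaluation = report["pipeline"]["evaluation"]
score = evaluation["final_reproducibility_score"]
matched = evaluation["metrics_matched"]
```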

Export for Meta-Analysis

The detailed_comparison.csv file contains all metrics in a structured format perfect for:

  • Meta-analysis across multiple papers
  • Statistical analysis in R/Python
  • Journal supplementary materials
  • Reproducibility databases
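Reading the CSV into an analysis script is straightforward with the standard library. The column names below ("metric", "paper_value", "reproduced_value") are assumptions for illustration, not the tool's documented schema — check the header row of your own detailed_comparison.csv.

```python
import csv
import io

# Stand-in for open("detailed_comparison.csv") with assumed column names.
sample = io.StringIO(
    "metric,paper_value,reproduced_value\n"
    "accuracy,0.90,0.88\n"
)
rows = list(csv.DictReader(sample))
deviation = abs(float(rows[0]["paper_value"]) -
                float(rows[0]["reproduced_value"]))
```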

🎨 Semantic Visual Comparison (NEW!)

The system now uses vision-language models to deeply understand and compare plots:

Our Enhanced Approach (Semantic):

  • βœ… Understands what the plot shows, not just how it looks
  • βœ… Compares trends, patterns, and numerical values
  • βœ… Handles style variations gracefully
  • βœ… Provides human-like analysis: "Both plots show accuracy improving from 0.6 to 0.9 over epochs"

πŸŽ“ Learning Objectives

  1. Agent Design Patterns: How to structure autonomous agents
  2. Tool Integration: Creating and managing agent capabilities
  3. LLM Integration: Using modern language models in applications
  4. Error Handling: Robust error management across pipeline stages
  5. Configuration Management: Environment-based configuration
  6. Testing Strategies: Testing multi-agent systems
  7. DevOps Concepts: Environment management, logging, monitoring

πŸ” Security Considerations

This project includes security features:

  • Sandbox Execution: Code runs in isolated environments
  • Command Validation: Forbidden commands are blocked
  • Timeout Protection: Execution limits prevent infinite loops
  • Environment Isolation: Virtual environments for each experiment using smart caching
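Command validation amounts to screening commands before execution. The sketch below illustrates the idea; the actual block list lives in the CodeExecutor and the patterns here are assumptions.

```python
# Illustrative command screen; the real forbidden list may differ.
FORBIDDEN_PATTERNS = ("rm -rf", "sudo", "shutdown", "mkfs")

def is_command_allowed(cmd):
    """Reject any command containing a forbidden pattern."""
    return not any(pattern in cmd for pattern in FORBIDDEN_PATTERNS)
```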

Virtual Environment Strategy

The system uses a hybrid approach for managing Python environments:

  • βœ… Isolation: Each experiment gets its own .venv directory
  • ♻️ Caching: Identical requirements.txt β†’ reuse cached venv (fast!)
  • πŸš€ Performance: First run ~30s, cached runs ~0.1s
  • πŸ’Ύ Efficiency: Disk space saved via symlinks
