Alxandria: AI-Powered ML Research Assistant

https://github.com/Tyronita/Alxandria.git

Inspiration

As ML researchers and practitioners, we constantly face the same frustrating bottleneck: translating curiosity into action. Reading dozens of papers, finding the right datasets, understanding SOTA benchmarks, and setting up experiments takes hours or days. We envisioned Alxandria—named after the ancient library of knowledge—as a tool that compresses weeks of research literature review into minutes, automatically generating executable Kaggle notebooks with battle-tested code, proper citations, and real datasets ready to run.

The inspiration came from watching researchers spend 80% of their time on setup and only 20% on actual experimentation. We wanted to flip that ratio.

What it does

Alxandria transforms a simple research query (like "medical image classification" or "transformer-based time series forecasting") into a comprehensive, executable ML research package in under 5 minutes:

  1. Intelligent Literature Review: Searches academic sources (arXiv, Papers with Code, GitHub) and synthesizes the top 3 most relevant papers with working links, contributions, and code repositories
  2. Gap Analysis: Identifies 2-3 specific research gaps with difficulty ratings, expected impact, and proposed solution approaches
  3. Dataset Discovery: Finds 3-4 relevant datasets from Kaggle/HuggingFace with size, format, SOTA performance metrics, and direct access links
  4. Implementation Roadmap: Generates a technical checklist with environment setup, data pipeline architecture, baseline models, evaluation metrics, and current SOTA benchmarks
  5. One-Click Kaggle Deployment: Automatically pushes a pre-populated Jupyter notebook to your Kaggle account with:
    • Full research background with citations
    • Identified gaps and opportunities
    • Executable dataset loading code (not comments!) that downloads, extracts, and loads data
    • PyTorch model templates with training loops
    • Evaluation functions and submission helpers

The entire workflow is conversational and guided—users simply enter a topic and click through 4 research steps, then get a shareable Kaggle link instantly.

How we built it

Architecture

Frontend: React with Tailwind CSS provides a minimal, conversational UI. Users progress through a 5-step wizard (Papers → Gaps → Datasets → Implementation → Ship), with real-time loading indicators since each Perplexity API call takes 10-30 seconds.

Backend: FastAPI serves a RESTful API with MongoDB for session persistence. The core /api/research/step endpoint handles the multi-turn research flow, while /api/ship/push-to-kaggle orchestrates notebook generation and deployment.
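The step endpoint is essentially a thin dispatcher: validate the request, run the Perplexity call for that step, persist the result, and return the content. A minimal sketch of that flow (the request model and helper names here are illustrative, not the actual implementation):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class ResearchStepRequest:
    session_id: str
    topic: str
    step: int  # 1=papers, 2=gaps, 3=datasets, 4=implementation

async def run_perplexity_step(step: int, topic: str) -> str:
    # Stand-in for the real Perplexity call shown in the next section.
    return f"[step {step} research for: {topic}]"

async def research_step(req: ResearchStepRequest) -> dict:
    """Core of the /api/research/step flow (sketch)."""
    if req.step not in (1, 2, 3, 4):
        raise ValueError("step must be between 1 and 4")
    content = await run_perplexity_step(req.step, req.topic)
    # ... persist `content` to MongoDB under req.session_id ...
    return {"session_id": req.session_id, "step": req.step, "content": content}
```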

Perplexity API Integration (The Brain)

The system leverages Perplexity's Chat Completions API (https://api.perplexity.ai/chat/completions) using the sonar-pro model, which combines real-time web search with LLM reasoning:

from openai import OpenAI

perplexity_client = OpenAI(
    api_key=PERPLEXITY_API_KEY,
    base_url="https://api.perplexity.ai"
)

response = perplexity_client.chat.completions.create(
    model="sonar-pro",
    messages=[
        {"role": "system", "content": structured_prompt},
        {"role": "user", "content": user_query}
    ],
    extra_body={
        "search_domain_filter": [
            "arxiv.org", 
            "github.com", 
            "paperswithcode.com",
            "kaggle.com"
        ]
    }
)

Each research step uses carefully crafted system prompts that enforce output structure (markdown tables for papers, detailed gap analysis with difficulty ratings, dataset comparisons with SOTA metrics). The search_domain_filter ensures high-quality academic and technical sources rather than general web content.
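For illustration, a step-1 system prompt in this style might look like the following (the wording is a hypothetical reconstruction, not the production prompt):

```python
STEP1_SYSTEM_PROMPT = """You are an ML research assistant.
Find the 3 most relevant, recent papers for the user's topic.
Output ONLY a markdown table with columns:
| # | Paper | Authors | Contribution | Links |
Every row must include a working arXiv link and, where available,
a GitHub or Papers with Code repository link."""

# Passed as the "system" message in the Chat Completions call above.
messages = [
    {"role": "system", "content": STEP1_SYSTEM_PROMPT},
    {"role": "user", "content": "medical image classification"},
]
```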

Critical Design Decision: We chose sonar-pro over sonar-deep-research after discovering the latter caused 60+ second timeouts. The Pro model balances depth with response time (10-30s per step), maintaining user engagement while delivering comprehensive results.

MongoDB Session Management

Each research session stores all four steps in a single MongoDB document:

  • Step 1 content → research papers and analysis
  • Step 2 content → gaps (stored as update to step 1 doc)
  • Step 3 content → datasets (stored as update)
  • Step 4 content → implementation plan (stored as update)

This allows the notebook generator to reconstruct the entire research journey from a single session_id.

Notebook Generation Engine

The generate_notebook_from_research() function is the culmination of all research steps:

async def generate_notebook_from_research(
    session_id: str, 
    topic: str, 
    dataset: str
) -> dict:
    # Fetch ALL research data
    research_doc = await db.research.find_one({
        "session_id": session_id, 
        "step": 1
    })

    # Extract content from each step
    research_content = research_doc.get('content', '')
    gaps_content = research_doc.get('gaps', '')
    dataset_content = research_doc.get('datasets', '')
    implementation_content = research_doc.get('implementation', '')

    # Build Jupyter notebook JSON structure
    notebook = {
        "cells": [
            # Markdown cells with research background
            {
                "cell_type": "markdown",
                "source": [research_content]  # Full Perplexity response
            },
            # CODE cell with ACTUAL executable dataset loading
            {
                "cell_type": "code",
                "source": [
                    "import zipfile, pandas as pd\n",
                    "from pathlib import Path\n",
                    f"!kaggle datasets download -d {dataset} --force\n",
                    "zip_file = next(Path('.').glob('*.zip'))\n",
                    "with zipfile.ZipFile(zip_file, 'r') as zip_ref:\n",
                    "    zip_ref.extractall('./data')\n",
                    "csv_files = list(Path('./data').glob('**/*.csv'))\n",
                    "df = pd.read_csv(csv_files[0])\n",
                    "print(df.head())"
                ]
            },
            # Model training, evaluation, submission cells...
        ]
    }
    return notebook

Key Innovation: Unlike most notebook generators that use placeholder comments (# TODO: Load your data), Alxandria generates fully executable Python code that actually downloads the Kaggle dataset, extracts it, lists files, and auto-loads CSV data—all without user intervention.
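For Jupyter and Kaggle to accept the file, the generated dict also needs top-level nbformat metadata alongside the cells. A minimal valid skeleton (following the nbformat v4 schema) looks roughly like:

```python
import json

notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Research background"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('hello')"]},
    ],
    "metadata": {"kernelspec": {"name": "python3",
                                "display_name": "Python 3",
                                "language": "python"}},
    "nbformat": 4,
    "nbformat_minor": 5,
}

with open("notebook.ipynb", "w") as f:
    json.dump(notebook, f)
```

Code cells additionally require `execution_count` and `outputs` fields, which is easy to miss when building the JSON by hand.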

Kaggle CLI Integration

Pushing to Kaggle required deep understanding of their API constraints:

  1. Metadata Requirements: Kaggle's kernel-metadata.json needs specific fields (id, title, code_file, kernel_type, language, is_private, etc.) and the title must resolve to the same slug as the ID
  2. Slug Generation: Topic "Using Transformers to Detect Illegal Deforestation" must convert to alxandria-using-transformers-to-detect-i-{timestamp} (shortened to avoid 50-char limit)
  3. Authentication: Uses environment variables (KAGGLE_USERNAME, KAGGLE_KEY) and adds /root/.venv/bin to PATH for CLI access
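The CLI wrapper used in the push workflow can be sketched as a subprocess call with that PATH adjustment applied (a minimal sketch; the real helper may differ):

```python
import os
import subprocess

def run_kaggle_command(args: list[str]) -> str:
    """Run a Kaggle CLI command with credentials taken from the environment."""
    env = os.environ.copy()
    # The CLI binary lives in the backend's virtualenv, so prepend it to PATH.
    env["PATH"] = "/root/.venv/bin:" + env.get("PATH", "")
    result = subprocess.run(args, capture_output=True, text=True, env=env)
    if result.returncode != 0:
        raise RuntimeError(f"Kaggle CLI failed: {result.stderr.strip()}")
    return result.stdout
```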

The push workflow:

# Generate safe kernel slug (timestamp guarantees uniqueness)
timestamp = int(time.time())
safe_topic = ''.join(c if c.isalnum() else '-' for c in topic.lower())
safe_topic = '-'.join(filter(None, safe_topic.split('-')))[:30]
kernel_slug = f"alxandria-{safe_topic}-{timestamp}"

# Create metadata that matches Kaggle's requirements
metadata = {
    "id": f"{username}/{kernel_slug}",
    "title": f"Alxandria {safe_topic.replace('-', ' ').title()} {timestamp}",
    "code_file": "notebook.ipynb",
    "language": "python",
    "kernel_type": "notebook",
    "is_private": False,
    "enable_gpu": True,
    "enable_internet": True
}

# Push via CLI
run_kaggle_command(['kaggle', 'kernels', 'push', '-p', temp_dir])

# Return shareable link
return f"https://www.kaggle.com/code/{username}/{kernel_slug}"
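Before the push, the notebook and its kernel-metadata.json are written into a temporary directory that `kaggle kernels push -p` consumes. A sketch of that staging step (the helper name is illustrative):

```python
import json
import tempfile
from pathlib import Path

def stage_kernel_dir(metadata: dict, notebook: dict) -> str:
    """Write kernel-metadata.json and notebook.ipynb into a temp dir
    laid out the way `kaggle kernels push -p <dir>` expects."""
    temp_dir = tempfile.mkdtemp()
    (Path(temp_dir) / "kernel-metadata.json").write_text(
        json.dumps(metadata, indent=2))
    (Path(temp_dir) / "notebook.ipynb").write_text(json.dumps(notebook))
    return temp_dir
```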

Challenges we ran into

1. The 403 Forbidden Kaggle Mystery

Problem: Backend returned success messages, generated perfect Kaggle links, but every link led to 404 errors. The notebooks didn't actually exist.

Investigation:

  • First attempt: "Maybe metadata format is wrong?" → Fixed language field issues
  • Second attempt: "Maybe the slug doesn't match?" → Fixed title/slug alignment
  • Third attempt: "Maybe it's the dataset format?" → Added validation
  • Root cause (after 3 hours): Kaggle API requires phone verification on the account before allowing programmatic kernel creation via API

Solution: User verified their phone number on Kaggle settings, and immediately the 403 errors became 200 success responses. Notebooks started appearing on Kaggle within seconds.

Lesson: Always check API service-level requirements (account verification, rate limits, permissions) before debugging code.

2. Perplexity API Timeouts

Problem: Initial implementation used sonar-deep-research model, which took 60-90 seconds per request. Frontend axios calls timed out at 30 seconds, causing "Failed to load step" errors despite successful backend responses.

Solution:

  • Switched to sonar-pro model (10-30s response time)
  • Added an explicit 60-second timeout to the frontend axios call:

    const response = await axios.post(`${API}/research/step`, payload, {
        timeout: 60000 // Critical for long-running Perplexity calls
    });
  • Added proper error messages for timeout vs network failures

3. Kaggle Slug Title Mismatch (400 Bad Request)

Problem: Kaggle API rejected notebooks with "Your kernel title does not resolve to the specified id" errors. Titles like "Alxandria: Detecting Illegal Deforestation with Transformers" didn't convert to the slug alxandria-detecting-illegal-deforestation-with-transformers-.

Root Cause:

  • Trailing hyphens in slugs
  • Title contained special characters (:) that Kaggle strips
  • Slug exceeded 50-character limit

Solution: Implemented proper slug sanitization:

# Remove special chars, collapse hyphens, trim length
safe_topic = ''.join(c if c.isalnum() else '-' for c in topic.lower())
safe_topic = '-'.join(filter(None, safe_topic.split('-')))[:30]

# Add timestamp for uniqueness
kernel_slug = f"alxandria-{safe_topic}-{timestamp}"

# Match title to slug format
kernel_title = f"Alxandria {safe_topic.replace('-', ' ').title()} {timestamp}"
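Applied to the deforestation topic from earlier, the sanitizer behaves like this (standalone sketch wrapping the same logic; the timestamp is an arbitrary example):

```python
def make_kernel_slug(topic: str, timestamp: str) -> str:
    # Replace non-alphanumerics with hyphens, collapse runs, trim to 30 chars.
    safe = ''.join(c if c.isalnum() else '-' for c in topic.lower())
    safe = '-'.join(filter(None, safe.split('-')))[:30]
    return f"alxandria-{safe}-{timestamp}"

slug = make_kernel_slug("Using Transformers to Detect Illegal Deforestation",
                        "1700000000")
print(slug)  # → alxandria-using-transformers-to-detect-i-1700000000
```

Note the 30-character truncation can cut mid-word ("detect-i"), but that is preferable to exceeding Kaggle's 50-character slug limit once the prefix and timestamp are added.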

4. Empty Notebooks on Kaggle

Problem: Early versions pushed notebooks successfully but they contained placeholder text: "Research data not found. Using baseline template."

Root Cause: The notebook generation function was fetching from MongoDB but the research data wasn't being stored properly during the Perplexity API calls.

Solution: Added explicit logging and verified that each research step's content field was being stored:

await db.research.insert_one({
    "session_id": session_id,
    "step": 1,
    "topic": topic,
    "content": response.choices[0].message.content,  # Perplexity response
    "timestamp": datetime.now(timezone.utc).isoformat()
})

# For subsequent steps, update the document
await db.research.update_one(
    {"session_id": session_id, "step": 1},
    {"$set": {"gaps": content}}  # Add gaps, datasets, implementation
)

This ensured notebooks contained full research data (6000+ characters per section).

Accomplishments that we're proud of

1. True End-to-End Automation

Most "notebook generators" create skeleton code with TODOs. Alxandria generates fully executable code that:

  • Downloads real Kaggle datasets: !kaggle datasets download -d {dataset} --force
  • Automatically extracts zip files: zipfile.ZipFile(zip_file).extractall('./data')
  • Detects and loads CSV files: pd.read_csv(csv_files[0])
  • Lists all extracted files for user reference
  • Includes proper error handling and progress messages

Users can literally click "Run All" in Kaggle and watch their entire experiment execute without writing a single line of code.

2. Research Quality with Citations

Unlike generic LLM responses, Alxandria provides:

  • Verifiable sources: Every claim links back to arXiv papers, GitHub repos, or Papers with Code
  • Working links: We validate that paper URLs resolve (no broken links)
  • Structured analysis: Tables comparing papers, gap analysis with difficulty ratings, dataset comparisons with SOTA benchmarks
  • Up-to-date information: Perplexity's real-time search ensures recent papers (2023-2024) appear in results

Example output quality:

## Top 3 Research Papers
| # | Paper | Authors | Contribution | Links |
|---|-------|---------|--------------|-------|
| 1 | Vision Transformer for Small-Size Datasets | Lee et al. | Shifted Patch Tokenization for low-data regimes | [Paper](arxiv.org/...) · [Code](github.com/...) |

3. 5-Minute Research → Production Pipeline

Traditional ML research workflow:

  • Day 1-2: Literature review (reading 10-20 papers)
  • Day 3: Finding datasets and understanding formats
  • Day 4-5: Setting up environment, writing boilerplate code
  • Day 6+: Actual experimentation

Alxandria workflow:

  • Minute 1: Enter research topic
  • Minute 2-3: Review papers, gaps, datasets (4 guided steps)
  • Minute 4: Click "Push to Kaggle"
  • Minute 5: Open notebook, click "Run All", start experimenting

95% time reduction on research setup.

4. Robust Error Handling with User Guidance

When things fail, Alxandria provides actionable feedback:

  • 403 Forbidden → "Your account needs phone verification. Go to kaggle.com/settings"
  • Timeout errors → "Request timed out. Perplexity API takes 20-30 seconds, please wait"
  • Invalid dataset format → "Dataset must be in format: username/dataset-name"
  • Slug mismatch → Automatically fixes by adding timestamps and sanitizing titles

This turns frustrating debugging sessions into guided fixes.
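The mapping above can be centralized in a simple lookup (a sketch; messages paraphrased from the UI copy):

```python
ERROR_GUIDANCE = {
    "kaggle_403": "Your account needs phone verification. "
                  "Go to kaggle.com/settings, verify, then retry.",
    "timeout": "Request timed out. Perplexity API takes 20-30 seconds, please wait.",
    "bad_dataset": "Dataset must be in format: username/dataset-name.",
}

def guidance_for(error_key: str) -> str:
    # Fall back to a generic retry message for unknown failures.
    return ERROR_GUIDANCE.get(error_key, "Unexpected error - please retry.")
```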

What we learned

1. Real-time LLM APIs Require Patience

Perplexity's search-augmented generation takes 10-30 seconds per request because it's:

  1. Searching the web for relevant sources
  2. Filtering by domain (academic, technical)
  3. Analyzing and synthesizing information
  4. Generating structured markdown with citations

Lesson: Always set frontend timeouts 2-3x longer than expected backend response time. Show progress indicators with realistic time estimates ("This may take 20-30 seconds") to manage user expectations.

2. API Documentation ≠ API Reality

Kaggle's official docs say language is a valid metadata field. Reality: it causes Invalid field 'language' errors in 2024. Phone verification requirement isn't mentioned anywhere in API docs.

Lesson: When facing mysterious API failures:

  1. Check GitHub issues for the API library
  2. Search recent Stack Overflow questions (last 3-6 months)
  3. Test with minimal examples before complex implementations
  4. Use web search tools to find recent changes/requirements

3. Notebook Code Must Be Executable, Not Educational

Early versions had "educational" code with comments explaining concepts:

# First, we need to load the data
# You can use pandas for CSV files:
# df = pd.read_csv('your_data.csv')

Users wanted runnable code:

import zipfile
import pandas as pd
from pathlib import Path

!kaggle datasets download -d {dataset} --force
with zipfile.ZipFile(f"{dataset.split('/')[-1]}.zip", 'r') as zip_ref:
    zip_ref.extractall('./data')
csv_files = list(Path('./data').glob('**/*.csv'))
df = pd.read_csv(csv_files[0])
print(f"Loaded {len(df)} rows")

Lesson: Code generation tools should prioritize "copy-paste-run" over "read-and-understand". Documentation can be in markdown cells, but code cells must execute.

4. Domain Filtering Dramatically Improves LLM Output Quality

Generic web search returns blog posts, tutorials, and outdated content. Filtering to arxiv.org, github.com, paperswithcode.com ensures:

  • Peer-reviewed papers (not Medium articles)
  • Working code repositories (not broken links)
  • SOTA benchmarks (not inflated marketing claims)
  • Recent research (2023-2024 papers)

Lesson: When using search-augmented LLMs, always constrain the search space to authoritative sources for your domain.

What's next for Alxandria

1. Multi-Dataset Experiments

Currently, Alxandria generates notebooks for single datasets. Next version will support:

  • Comparative analysis: Automatically test the same model on 3-4 related datasets
  • Cross-dataset validation: Train on Dataset A, test on Dataset B to check generalization
  • Ensemble strategies: Combine predictions from multiple dataset-specific models

2. Automated Hyperparameter Tuning

Generate notebooks with Optuna/Ray Tune integration:

import optuna

def objective(trial):
    lr = trial.suggest_float('lr', 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical('batch_size', [16, 32, 64])
    # Model training with these hyperparameters
    return validation_accuracy

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

3. GitHub Integration for Version Control

Instead of just Kaggle:

  • Create GitHub repo with notebook, README, and requirements.txt
  • Set up GitHub Actions for automated training on commits
  • Generate reproducible experiment tracking with MLflow/Weights & Biases

4. Smart Dataset Recommendations

Use embeddings to match research topics to datasets:

  1. Embed user query: "medical image classification with limited labels"
  2. Embed dataset descriptions from Kaggle/HuggingFace
  3. Rank by semantic similarity + metadata (size, license, recent updates)
  4. Prioritize datasets with active competitions or high engagement
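The ranking step (3) reduces to cosine similarity between the query embedding and each dataset embedding. A minimal sketch, assuming the embeddings themselves come from some external model:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_datasets(query_vec, candidates):
    """candidates: list of (name, embedding) pairs.
    Returns (name, score) pairs sorted by descending similarity."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates]
    return sorted(scored, key=lambda t: -t[1])
```

In practice the similarity score would then be blended with metadata signals (size, license, recency) before the final ranking.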

5. Live Experiment Monitoring

After pushing to Kaggle:

  • Poll notebook execution status
  • Display training metrics in real-time (loss curves, accuracy)
  • Send alerts when training completes or fails
  • Auto-generate comparison tables if user runs multiple experiments

6. Research Paper Upload

Allow users to upload specific papers (PDFs):

  • Extract methodology sections with PyPDF2/LlamaParse
  • Generate notebooks that replicate the paper's approach
  • Include proper citations and attribution
  • Highlight differences between original paper code and our implementation

7. Multi-Modal Research Support

Extend beyond computer vision:

  • NLP: Tokenization, transformer fine-tuning, evaluation metrics (BLEU, ROUGE)
  • Time Series: ARIMA, Prophet, LSTM templates with proper train/val/test splits
  • Reinforcement Learning: Environment setup, agent training loops, reward visualization
  • Audio/Speech: Librosa for feature extraction, wav2vec models

Built With

  • fastapi
  • kaggle
  • openai
  • perplexity
  • react