¡alacard!: The open source instruction layer for AI

Imagine a new model drops hours before your livestream or client call. You need working code, not slides. Today that means a dozen tabs, broken examples, and guesswork.

What we built

¡alacard! turns model cards and tool docs into runnable, traceable, self-improving notebooks. You pick a Recipe. The agent composes the notebook, runs it safely in Daytona, and records every step in Weights & Biases Weave. When a run fails, it detects the failure, proposes a minimal patch, and reruns until the notebook passes. You share the verified notebook, remix it, or deploy the same Recipe on Vertex AI Agent Engine.

Dual innovation

Cookbook Hub (value today): an open hub for cross-provider Recipes that combine modular Cards like Weave, Tavily, Daytona, and Vertex into tested workflows that generate pinned notebooks.
RL/RLHF Data Foundation (value tomorrow): each execution logs a structured state-action-reward tuple. We record pass or fail, retries, latency, error class, and patch choice to build a dataset for future policy learning.
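
As a rough illustration of what one such tuple could look like, here is a minimal sketch. The class name `RunRecord`, its field names, and the JSON Lines serialization are assumptions made for this example, not the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """One state-action-reward tuple for a single notebook execution (hypothetical schema)."""
    recipe_id: str    # state: which Recipe was executed
    error_class: str  # state: normalized error class seen before the patch, "" if none
    patch: str        # action: patch chosen for this run, "" if none
    passed: bool      # outcome: did the notebook run to green
    retries: int      # outcome: reruns needed so far
    latency_s: float  # outcome: wall-clock seconds for this run
    reward: float     # scalar reward computed from the outcome fields


def to_jsonl(record: RunRecord) -> str:
    """Serialize a record as one JSON line, ready to append to a log file."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only, not measured results.
record = RunRecord("demo-recipe", "timeout", "timeout", True, 1, 42.5, 0.8)
line = to_jsonl(record)
```

Keeping each record flat and immutable makes it easy to append to a log and load later as a dataset for policy learning.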

How it works

Choose a Recipe: search → browse → execute → score.
Generate a pinned notebook with labeled ops.
Run in Daytona. Open the Weave trace so you see every call.
Fail on purpose. Show the error class.
Apply a policy patch: selector and wait, timeout, dependency pin, or temperature.
Rerun to green.
Report two metrics out loud: time to green and retries.
Save and Remix. The improvement policy carries over.
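
The fail, patch, rerun cycle above can be sketched as a small loop. This is a simplified illustration, not the project's Improver agent: the error class names, the `PATCH_FOR_ERROR` table, and the `run` callable (a stand-in for executing the notebook) are all assumptions. Only the four patch kinds come from the list above.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

# Hypothetical mapping from a normalized error class to one of the four
# patch kinds named above. The error class names are illustrative.
PATCH_FOR_ERROR = {
    "selector_not_found": "selector_and_wait",
    "timeout": "timeout",
    "import_error": "dependency_pin",
    "unstable_output": "temperature",
}

# A run takes the set of applied patches and returns an error class,
# or None when the notebook passes.
RunFn = Callable[[FrozenSet[str]], Optional[str]]


def run_to_green(run: RunFn, max_retries: int = 3) -> Tuple[bool, int, List[str]]:
    """Rerun with one new patch per failure; return (passed, retries, patches)."""
    patches: List[str] = []
    retries = 0
    while True:
        error = run(frozenset(patches))
        if error is None:
            return True, retries, patches
        patch = PATCH_FOR_ERROR.get(error)
        # Stop when no patch is known, the patch was already tried,
        # or the retry budget is spent.
        if patch is None or patch in patches or retries == max_retries:
            return False, retries, patches
        patches.append(patch)
        retries += 1


def fake_notebook(applied: FrozenSet[str]) -> Optional[str]:
    """Stand-in notebook that fails with a timeout until the timeout patch is applied."""
    return None if "timeout" in applied else "timeout"


passed, retries, patches = run_to_green(fake_notebook)
```

Time to green can be measured by wrapping the `run_to_green` call with `time.perf_counter()`.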

Why now

Models and tools evolve faster than adoption. Teams waste time wiring brittle demos. Vendors lose users when examples fail to run. ¡alacard! converts documentation into execution and makes integrations fast, observable, and reusable.

What inspired us

We have felt the panic of a last-minute demo. We wanted a system that ships a working notebook on demand and proves behavior with a trace.

Challenges

Pinning dependencies across Cards without bloating install time
Normalizing error classes so one policy can choose the right patch
Keeping the live flow under two minutes with clear Weave spans
Making the first run fail in a safe and repeatable way
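
To make the error-class normalization challenge concrete, here is a minimal sketch that maps raw stderr text to a small, fixed set of classes. The class names and regular expressions are assumptions for illustration. A real system would likely need a richer set.

```python
import re

# Hypothetical patterns, checked in order; the first match wins.
ERROR_PATTERNS = [
    ("import_error", re.compile(r"ModuleNotFoundError|ImportError")),
    ("timeout", re.compile(r"TimeoutError|timed out", re.IGNORECASE)),
    ("selector_not_found", re.compile(r"selector|element not found", re.IGNORECASE)),
]


def classify_error(stderr: str) -> str:
    """Return a normalized error class for raw stderr text, or "unknown"."""
    for name, pattern in ERROR_PATTERNS:
        if pattern.search(stderr):
            return name
    return "unknown"
```

With a closed set of classes, a single policy can key its patch choice on the class rather than on free-form error text.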

How we built it

Google ADK coordinates Composer, Executor, and Improver agents.
Daytona executes notebooks in an isolated sandbox and returns exit codes, stdout, and stderr.
Weights & Biases Weave traces each Card op with @weave.op and logs structured signals.
Vertex AI serves as the optional deployment target from the same Recipe.
Tavily powers the research Card.
Local Postgres stores Recipes, runs, reward history, and Remix lineage.
Google Colab delivers notebooks with one-click run links.
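
The Executor contract described above (exit code, stdout, stderr) can be illustrated with a local stand-in. This sketch does not use the Daytona SDK and provides no isolation: it runs a Python snippet in a child process through the standard library, and the names `ExecResult` and `run_snippet` are hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ExecResult:
    exit_code: int
    stdout: str
    stderr: str


def run_snippet(code: str, timeout_s: float = 30.0) -> ExecResult:
    """Run Python code in a child process. Local stand-in only, NOT a sandbox."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired if exceeded
    )
    return ExecResult(proc.returncode, proc.stdout, proc.stderr)


ok = run_snippet("print('green')")
bad = run_snippet("import sys; sys.exit(3)")
```

A nonzero `exit_code` together with `stderr` is the signal an Improver-style agent would need in order to pick a patch.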

What we learned

Judges trust numbers they can see. We keep the Weave trace open and report time to green and retries the same way every time.
Small, deterministic policy patches beat complicated logic in a hackathon.
A verified notebook beats a perfect slide every time.

Accomplishments

Live fail → patch → rerun to green with visible metric delta
Pinned, shareable notebook others can run and remix
Complete state-action-reward log for future policy learning
One QR to a Weave permalink so judges can inspect the trace

What is next

Learn a policy from the collected tuples and compare to the greedy baseline
Expand Card types and Recipe packs
Add more evals and publish reward dashboards
Grow contribution paths for Cards and Recipes

Reward function

We combine pass, latency, retries, and clarity:

$$ R = w_1 \cdot \text{pass} + w_2 \cdot (1 - \text{latency}_{\text{norm}}) + w_3 \cdot (1 - \text{retries}_{\text{norm}}) + w_4 \cdot \text{clarity} $$

We accept a patch only if:

$$ \Delta R > 0 $$
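
A minimal sketch of the reward and the acceptance rule follows. The weight values, the normalization caps (`max_latency_s`, `max_retries`), and the assumption that clarity is already a score in [0, 1] are all illustrative choices, not values from the project.

```python
def reward(
    passed: bool,
    latency_s: float,
    retries: int,
    clarity: float,
    *,
    weights=(0.5, 0.2, 0.2, 0.1),  # assumed w_1..w_4
    max_latency_s: float = 120.0,  # assumed cap used to normalize latency
    max_retries: int = 3,          # assumed cap used to normalize retries
) -> float:
    """R = w1*pass + w2*(1 - latency_norm) + w3*(1 - retries_norm) + w4*clarity."""
    w1, w2, w3, w4 = weights
    latency_norm = min(latency_s / max_latency_s, 1.0)
    retries_norm = min(retries / max_retries, 1.0)
    return (
        w1 * float(passed)
        + w2 * (1.0 - latency_norm)
        + w3 * (1.0 - retries_norm)
        + w4 * clarity
    )


def accept_patch(reward_before: float, reward_after: float) -> bool:
    """Accept a patch only when it strictly improves the reward (delta R > 0)."""
    return reward_after - reward_before > 0


# Illustrative inputs only: a failing run, then a passing rerun after a patch.
before = reward(False, 120.0, 3, 0.5)
after = reward(True, 60.0, 1, 0.5)
accepted = accept_patch(before, after)
```

Clamping the normalized terms to 1.0 keeps a very slow or heavily retried run from pushing those terms negative.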

Built with

Google’s Agent Development Kit (orchestration)
Weights & Biases Weave (observability)
Daytona (sandboxed execution)
Vertex AI (deployment and scaling)
Tavily (research and documentation)

The result

When a new model drops, ¡alacard! gives you a working notebook before the hype even starts.

Built With

ag-ui
browserbase
copilotkit
daytona
google-adk
google-colab
jupyter
mastra
papermill
postgresql
python
stagehand
tavily
typescript
vertex-ai
weave
weights-and-biases

Submitted to

Weavehacks-2 - Self Improving Agents w/ Google Cloud

Created by

I worked mainly on the first main piece, a deep research agentic system that collects documentation and example code from several sources (Tavily API, Context7 MCP, Deepwiki MCP). It runs in parallel for each selected technology in the stack and for each selected AI model, using the HF API to find model cards. It uses ADK orchestration and Weave observability, with an LLM-as-a-judge in the research loops reporting signals back (a confidence score plus a sufficient / insufficient verdict on the findings). A HistoryAnalysis agent can learn by retrieving similar tool calls.

Csaba Toth
Generative AI Engineer at Booz Allen Hamilton, GDG Fresno lead, NJUG co-founder
rliang7
Daniel Green
Amine Lbath

Updates

Daniel Green started this project — Oct 12, 2025 04:18 PM EDT
