kAIra

AI that watches you work once — then does it for you forever.


Inspiration

Arca Continental — one of the largest Coca-Cola bottlers in the world — operates in a reality where EDI integration with retail chains like Walmart, Soriana, and Chedraui takes months of negotiation per client. In the meantime, operations teams sit in front of two screens copying purchase orders by hand: one tab with the retailer portal, one tab with the internal system.

The cost of this process scales linearly with volume. If processing one order takes $t$ minutes of human time, then processing $n$ orders per day for each of $k$ retail chains costs:

$$C_{\text{manual}} = n \cdot k \cdot t \text{ minutes/day}$$

This approach does not scale operationally: every additional order costs the same human time, manual copying generates transcription errors, and there is no short-term alternative. Because EDI takes months per client, the gap between "portal exists" and "data flows automatically" is filled entirely by humans.

We wanted to close that gap without waiting for EDI, by teaching an AI agent to do exactly what the human does — learning from observation rather than pre-programmed rules.


What It Does

kAIra is a three-phase AI system that eliminates manual data transfer between web portals.

Phase 1 — Connect

A Chrome extension silently attaches to two browser tabs: the source portal (e.g. a retailer's order system) and the destination system (Arca Continental's internal platform). No configuration required.

Phase 2 — Observe

The user performs the task once, manually. The extension records every click, keystroke, copy, and paste across both tabs, along with screenshots. When the user clicks Stop + Learn, the trace is sent to Gemini, which infers a natural-language task description and field mapping — no hardcoded rules, no CSS selectors.

Phase 3 — Automate

Clicking Play Agent launches a Playwright browser controlled by Gemini Computer Use. The agent sees the screen like a human, navigates between tabs, reads orders from the source, and fills in the destination — completely autonomously, even with data it has never seen before.

The agent's accuracy per field can be modeled as:

$$P(\text{correct field}) = P(\text{correct mapping}) \cdot P(\text{correct transcription} \mid \text{correct mapping})$$

Because Gemini infers mappings from semantic context rather than positional heuristics, $P(\text{correct mapping})$ remains high even when field names differ between systems, which is the core hard case.

Real-time dashboard

A live dashboard at /dashboard auto-refreshes every 1.5 seconds while the agent runs, showing each turn as it completes: a screenshot of the browser at that moment, Gemini's reasoning, the actions executed (click_at, type_text_at, etc.), the active URL, and the exact timestamp. A pulsing green indicator signals an active run; a final summary appears when the agent finishes.

Voice narration

Every agent action is narrated aloud in real time via the ElevenLabs text-to-speech API. As the agent clicks, types, and navigates, a voice describes what it is doing — making the automation legible to anyone in the room watching the demo, not just the person reading the terminal.


How We Built It

The system has four layers working together.

Chrome Extension (Manifest V3)

A content script injected into both tabs captures DOM events — clicks, inputs, copies, pastes, navigation, form submits — with rich context: element selectors, labels, data attributes, copied/pasted values, and periodic screenshots. A background service worker forwards all events and screenshots to the backend over HTTP.

Express Backend (Node.js)

Stores incoming events and screenshots, then on Stop + Learn assembles a multimodal prompt combining the compact event log with up to 3 screenshots and calls Gemini 2.5 Flash. Gemini returns a structured workflow.json with a taskDescription and fieldMappings written in natural language. If Gemini fails, a deterministic heuristic extracts copy-paste mappings as a fallback.
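
The deterministic fallback can be pictured as pairing each copy with the paste that follows it. The sketch below is written in Python purely for illustration (the backend itself is Node.js). The event keys (`type`, `label`, `value`) and the output keys (`sourceField`, `destinationField`, `exampleValue`) are assumptions, not the project's actual schema.

```python
from typing import Dict, Iterable, List, Optional


def extract_copy_paste_mappings(events: Iterable[Dict]) -> List[Dict]:
    """Pair each paste with the most recent copy that carried the same value."""
    mappings: List[Dict] = []
    last_copy: Optional[Dict] = None
    for event in events:
        if event.get("type") == "copy":
            last_copy = event
        elif event.get("type") == "paste" and last_copy is not None:
            # Only trust the pair when the pasted text matches what was copied.
            if event.get("value") == last_copy.get("value"):
                mappings.append({
                    "sourceField": last_copy.get("label"),       # hypothetical key
                    "destinationField": event.get("label"),      # hypothetical key
                    "exampleValue": event.get("value"),          # hypothetical key
                })
    return mappings
```

A heuristic like this only recovers fields the user literally copied and pasted, which is why it would serve as a fallback rather than a replacement for the inferred mapping.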

The total token count sent to Gemini is bounded by:

$$T = T_{\text{events}} + 3 \cdot T_{\text{screenshot}} \leq T_{\text{max}}$$

where $T_{\text{events}}$ is the token count of the compressed event log and $T_{\text{screenshot}}$ is the token cost per image. We cap screenshots at 3 to stay well within context limits while still giving the model enough visual context.
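
As a rough illustration of the cap and the budget check, the Python sketch below picks at most three screenshots and tests the inequality above. Which screenshots are kept is not stated here, so the even spacing is an assumption, and both function names are hypothetical.

```python
from typing import List, Sequence


def pick_screenshots(shots: Sequence[str], limit: int = 3) -> List[str]:
    """Pick at most `limit` screenshots, spread evenly across the recording."""
    if limit <= 0 or not shots:
        return []
    if len(shots) <= limit:
        return list(shots)
    if limit == 1:
        return [shots[0]]
    last = len(shots) - 1
    # Evenly spaced indices that always include the first and last frame.
    indices = [round(i * last / (limit - 1)) for i in range(limit)]
    return [shots[i] for i in indices]


def fits_budget(event_tokens: int, image_tokens: int, images: int, max_tokens: int) -> bool:
    """Check T_events + images * T_screenshot <= T_max."""
    return event_tokens + images * image_tokens <= max_tokens
```

Keeping the first and last frames means the model sees both the starting state and the finished form, which is one plausible way to spend a three-image budget.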

A /logs/current endpoint streams the active run state — turn index, screenshots, Gemini reasoning, actions, and URL — which the /dashboard page polls every 1.5 seconds to render the live view.

Computer Use Agent (Python)

Reads the workflow and launches a Playwright browser with a persistent profile. Opens one tab per URL, then runs a turn-based loop:

$$\text{screenshot} \xrightarrow{\text{Gemini}} \text{action(s)} \xrightarrow{\text{Playwright}} \text{new screenshot} \xrightarrow{} \cdots$$
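
The loop above can be sketched in Python with the model call and the browser injected as plain callables. The stop condition (the model returning no actions), the `max_turns` cap, and all names here are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

Action = Dict[str, Any]


def run_agent_loop(
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[bytes], List[Action]],
    execute: Callable[[Action], None],
    on_turn: Callable[[int, List[Action]], None],
    max_turns: int = 50,
) -> int:
    """Run screenshot -> model -> actions until the model returns no actions.

    Returns the number of turns that executed at least one action.
    """
    turns = 0
    while turns < max_turns:
        screenshot = take_screenshot()
        actions = ask_model(screenshot)
        if not actions:  # Assumed signal that the task is finished.
            break
        for action in actions:
            execute(action)
        on_turn(turns, actions)  # e.g. report the turn so a dashboard can show it
        turns += 1
    return turns
```

Injecting the callables keeps the loop testable without a browser or an API key.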

Tab switching is handled natively by intercepting navigate actions and bringing the matching tab to the front — no reload, no new window. After each action is executed, the agent posts the turn data to the backend so the dashboard can display it in real time.
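
A minimal sketch of that tab routing, assuming tabs are matched by origin (scheme plus host); the real matching rule may differ. `find_open_tab` and `route_navigate` are hypothetical names, and the page objects only need a `url` attribute and a `bring_to_front()` method, both of which Playwright's Python `Page` provides.

```python
from typing import Iterable, Optional, Protocol
from urllib.parse import urlsplit


class PageLike(Protocol):
    url: str

    def bring_to_front(self) -> None: ...


def find_open_tab(pages: Iterable[PageLike], target_url: str) -> Optional[PageLike]:
    """Return the already-open page whose origin matches target_url, if any."""
    target = urlsplit(target_url)
    for page in pages:
        current = urlsplit(page.url)
        if (current.scheme, current.netloc) == (target.scheme, target.netloc):
            return page
    return None


def route_navigate(pages: Iterable[PageLike], target_url: str) -> bool:
    """Bring a matching tab to the front; return False if a real navigation is needed."""
    page = find_open_tab(pages, target_url)
    if page is None:
        return False
    page.bring_to_front()
    return True
```

When `route_navigate` returns False, the caller would fall back to an ordinary page navigation.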

Voice layer (ElevenLabs)

Each action name and its parameters are converted to a short natural-language description and sent to the ElevenLabs API. The audio plays back through the system speakers as the agent works, giving the automation a spoken presence that makes it immediately understandable to non-technical stakeholders watching the demo.
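
The conversion from action to sentence might look like the Python sketch below. The action names come from the agent's output, but the parameter keys (`x`, `y`, `text`, `url`) and the wording of each phrase are assumptions. The text-to-speech call itself is deliberately left out.

```python
from typing import Any, Dict


def describe_action(name: str, args: Dict[str, Any]) -> str:
    """Turn an action name and its parameters into a short sentence to be spoken."""
    if name == "click_at":
        return f"Clicking at position {args.get('x')}, {args.get('y')}"
    if name == "type_text_at":
        return f"Typing {args.get('text', '')}"
    if name == "navigate":
        return f"Navigating to {args.get('url', 'the next page')}"
    # Fallback: make the raw action name readable.
    return name.replace("_", " ").capitalize()
```

For example, `describe_action("type_text_at", {"text": "ACME"})` produces the string "Typing ACME", which would then be handed to the speech API.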


Challenges We Faced

Multi-tab coordination without hardcoding

Playwright's default mode opens a new detached window. We switched to a persistent context so the agent operates in the same browser session the user sees, and built a tab-routing layer that maps navigation actions to already-open pages rather than triggering redundant loads.

Page flicker and premature screenshots

Using domcontentloaded caused the agent to capture screenshots mid-render, giving Gemini a partially loaded view that led to repeated retries. Switching to the load event with a short stabilization pause ($\Delta t = 300\text{ ms}$) eliminated the visual noise entirely.
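
With Playwright's synchronous Python API, the wait-then-capture step could be sketched as follows. The helper name `stable_screenshot` is hypothetical; `wait_for_load_state`, `wait_for_timeout`, and `screenshot` are standard `Page` methods.

```python
def stable_screenshot(page, settle_ms: int = 300) -> bytes:
    """Wait for the load event, pause briefly, then capture the viewport."""
    page.wait_for_load_state("load")   # full load event rather than domcontentloaded
    page.wait_for_timeout(settle_ms)   # short stabilization pause
    return page.screenshot()
```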

Windows encoding

The Python script printed non-ASCII Unicode symbols that raised UnicodeEncodeError on Windows terminals using the cp1252 code page, crashing the agent before it could take a single action. We audited every print statement and replaced all non-ASCII characters with plain ASCII equivalents.
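
A defensive alternative to auditing by hand is to route output through a helper that cannot raise on any code page. This is a sketch of that guard, not the fix described above, and `to_ascii` and `safe_print` are hypothetical names.

```python
def to_ascii(text: str) -> str:
    """Replace every non-ASCII character with '?' so any terminal can print it."""
    return text.encode("ascii", errors="replace").decode("ascii")


def safe_print(text: str) -> None:
    """Print text after stripping it down to plain ASCII."""
    print(to_ascii(text))
```

The trade-off is that symbols degrade to question marks instead of readable ASCII equivalents, so hand-picked replacements still give nicer logs.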

Keeping the dashboard and voice layer in sync

The agent loop, the log endpoint, and the ElevenLabs calls all run concurrently. Ensuring the dashboard reflects the correct turn state — and that audio doesn't queue up and lag behind the actual browser — required careful sequencing so each action is logged and narrated before the next one begins.

Convincing Gemini the learning is genuine

The prompt must give Gemini both the field-mapping reference and the full natural-language task description — not just coordinates or selectors. This is what makes the agent robust to layout variations and unseen data. Getting that prompt structure right required several iterations.


What We Learned

The biggest insight was how much a vision-language model can infer from a raw event trace when the trace is semantically rich. By preserving element labels, data attributes, and copied text — not just CSS selectors and pixel coordinates — we gave Gemini enough signal to write a task description that generalizes beyond the specific order used during recording.

We also learned that the boundary between observation and replay is thinner than expected. The same model family that learns the workflow from the recorded trace and screenshots also replays it, so improving recording quality directly improves replay accuracy with no additional engineering.

Adding voice narration via ElevenLabs changed how we thought about the demo entirely. When stakeholders can hear the agent reasoning through a task — "navigating to orders page", "filling in customer name" — the technology stops feeling like a script running in a terminal and starts feeling like a colleague doing the work. The legibility of AI actions matters as much as their correctness.

Finally, the smallest details have outsized impact on agent reliability. Assuming steps fail independently, the relationship between per-step failure probability $p$ and end-to-end success over $n$ steps is:

$$P(\text{success}) = (1 - p)^n$$

For a 20-step task, even a small per-step failure rate like $p = 0.05$ yields only $(0.95)^{20} \approx 0.36$ end-to-end success. This makes it clear why fixing flicker, encoding bugs, and premature screenshots, each reducing $p$ by a few percent, has such a large compounding effect on the overall result.
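
The compounding effect is easy to check numerically. The short Python sketch below evaluates the formula for a 20-step task. The values 0.02 and 0.01 are illustrative inputs, not measured failure rates.

```python
def end_to_end_success(p: float, n: int) -> float:
    """Probability that all n independent steps succeed when each fails with probability p."""
    return (1.0 - p) ** n


for p in (0.05, 0.02, 0.01):
    print(f"p = {p:.2f} -> P(success) = {end_to_end_success(p, 20):.2f}")
# p = 0.05 -> P(success) = 0.36
# p = 0.02 -> P(success) = 0.67
# p = 0.01 -> P(success) = 0.82
```

Cutting the per-step failure rate from 5% to 1% more than doubles the chance that a 20-step run finishes cleanly.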


What's Next

The current version handles one workflow at a time. The immediate next step is a workflow library — letting operators save, name, and switch between multiple learned processes (one per retail chain) without re-recording.

Beyond that:

  • Scheduled execution — the agent runs automatically when new orders appear, with no human trigger required
  • Self-healing — when a page layout changes slightly between runs, the agent detects the mismatch and re-learns the affected step rather than failing silently
  • Zero-dev onboarding — any operations team can onboard a new web portal in under 5 minutes by doing the task once, with no developer involvement
  • Richer voice feedback — moving from action narration to full conversational status updates, so the agent can explain why it made a decision, not just what it did

The longer-term vision is to turn kAIra into the practical bridge between manual operations and full EDI — deployable in an afternoon instead of a quarter.
