Inspiration

Most automation products start by asking you or your team to describe your work in flowcharts, prompts, or brittle scripts. But real operations happen across tabs, tools, and people, and the "workflow" only exists in what humans actually do on-screen. AutoAutomation.ai is inspired by the idea that, in the Action Era, the fastest way to build reliable automations is to watch reality first: record a team doing the work once, let an agent infer intent and handoffs, then execute the same outcome repeatedly with evidence. We wanted automation to feel like: "Just show me how you or your team works."

What it does

No setup. No flowcharts. Just show how your team works.

AutoAutomation.ai turns screen recordings into executable, verifiable automations:

  • Record your screen while completing real work (email, ticketing, admin panels, internal tools).
  • Gemini 3 analyzes the video to produce a workflow summary, step-by-step intent, and automation opportunities.
  • It detects collaborators (emails/mentions) to reveal multi-person processes and suggest inviting the right teammates.
  • It converts the discovered workflow into an executable plan that prefers governed tool calls (MCP tools and Google APIs such as Gmail, plus Gemini's googleSearch tool and function calling through Gemini) and falls back to browser automation (Playwright) when no API exists.
  • Every run produces verification artifacts (evidence) such as logs/responses and browser screenshots, enabling auditability and confidence scoring.
  • A user credit system is already in place to monetize the project in the future; credits are consumed based on video length and the number of automation runs.
  • An optional "keep human in the loop" approval step when a workflow is executed.
  • The admin panel shows the return on investment: how much of the team's work is automated and how much time it saves.
  • Notifications are delivered via email and in-app on the website.
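The plan that discovery produces can be pictured as a small typed structure. A minimal TypeScript sketch (the field names and example workflow are illustrative, not our actual schema):

```typescript
// Illustrative shape of a discovered workflow plan (not the real schema).
type StepKind = "tool" | "browser" | "llm";

interface WorkflowStep {
  kind: StepKind;             // governed tool call, Playwright fallback, or LLM reasoning
  intent: string;             // what the human was trying to do in this step
  tool?: string;              // e.g. "gmail.listMessages" when kind === "tool"
  requiresApproval?: boolean; // "keep human in the loop" hook
}

interface WorkflowPlan {
  summary: string;
  collaborators: string[];    // emails/mentions detected in the recording
  steps: WorkflowStep[];
}

// Prefer governed tool calls; fall back to browser automation otherwise.
function executionStrategy(step: WorkflowStep): string {
  if (step.kind === "tool" && step.tool) return `api:${step.tool}`;
  if (step.kind === "llm") return "llm";
  return "playwright";
}

const example: WorkflowPlan = {
  summary: "Triage the support inbox and escalate billing issues",
  collaborators: ["ops@example.com"],
  steps: [
    { kind: "tool", intent: "Read unread emails", tool: "gmail.listMessages" },
    { kind: "llm", intent: "Classify each email by topic" },
    { kind: "browser", intent: "Update status in the admin panel", requiresApproval: true },
  ],
};
```

The `executionStrategy` helper mirrors the hybrid preference described above: governed APIs first, Playwright only when nothing else fits.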

How we built it

We built a full end-to-end SaaS product with a web recorder + an orchestration backend:

  • Frontend web app records the screen (getDisplayMedia/MediaRecorder) and uploads video securely.
  • Note that two Gemini calls are involved: a first call analyzes the video, identifies automatable steps, and selects the necessary subset of tools; then, when the workflow is executed, a second call passes Gemini only the subset of MCPs/functions required for that workflow (identified during the first video analysis).
  • Backend orchestrator (Node.js/TypeScript) stores workflows, steps, runs, and evidence in Postgres, and stores artifacts in cloud storage.
  • Gemini 3 (gemini-3-pro-preview) performs multimodal video understanding and returns structured JSON: summary, collaborators, automations, and steps.
  • Execution engine supports multiple step types:
    • Tool-driven execution via Gemini function calling (Google APIs such as Gmail: list emails, read one email, send an email, etc., plus MCP tools).
    • Browser automation via Playwright when APIs/tools aren’t available.
    • LLM steps for reasoning, extraction, summarization, and decisions.
  • Multi-step workflow execution calls Gemini again multiple times, passing it the history (and thought signatures) of previous steps. Getting this right was a major challenge.
  • Security + reliability: encrypted session/cookie handling, rate limits/quotas, and evidence capture for reproducible runs.
  • Secure OAuth authorization to access Google services (e.g. Gmail), which can then be called during workflow executions.
  • Registration and login are done using Firebase Auth.
  • Multilingual.
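The two-phase Gemini call above hinges on narrowing the tool catalog between the analysis call and the execution call. A hypothetical sketch of that narrowing (the catalog entries and helper are illustrative, not our real code):

```typescript
// Hypothetical: narrow the full tool catalog to the subset the video
// analysis identified, before making the execution call to Gemini.
interface ToolDeclaration {
  name: string;
  description: string;
}

const toolCatalog: ToolDeclaration[] = [
  { name: "gmail.listMessages", description: "Read emails from the inbox" },
  { name: "gmail.sendMessage", description: "Send an email" },
  { name: "mcp.jira.createIssue", description: "Create a Jira issue via MCP" },
];

function selectTools(catalog: ToolDeclaration[], required: string[]): ToolDeclaration[] {
  const wanted = new Set(required);
  const subset = catalog.filter((t) => wanted.has(t.name));
  // Fail loudly if the analysis asked for a tool we don't actually have.
  const missing = required.filter((n) => !subset.some((t) => t.name === n));
  if (missing.length > 0) throw new Error(`Unknown tools: ${missing.join(", ")}`);
  return subset;
}
```

Passing only this subset keeps the execution prompt small and limits what the model can invoke to tools the workflow was actually observed to need.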
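The execution engine's dispatch over step types, plus the history threading described above, can be sketched roughly like this (handlers are stubbed; the real engine calls Gemini, Playwright, and Google APIs):

```typescript
// Hypothetical sketch of the execution loop: each step runs through a
// handler for its type, and the result (with any thought signature) is
// appended to the history passed along to later steps.
type StepType = "tool" | "browser" | "llm";

interface Step { type: StepType; intent: string }
interface StepResult { intent: string; output: string; thoughtSignature?: string }

type Handler = (step: Step, history: StepResult[]) => StepResult;

const handlers: Record<StepType, Handler> = {
  tool: (s) => ({ intent: s.intent, output: `tool-call:${s.intent}` }),
  browser: (s) => ({ intent: s.intent, output: `playwright:${s.intent}` }),
  llm: (s, h) => ({
    intent: s.intent,
    output: `llm(ctx=${h.length}):${s.intent}`,
    thoughtSignature: "sig-opaque", // must be echoed back on the next model call
  }),
};

function runWorkflow(steps: Step[]): StepResult[] {
  const history: StepResult[] = [];
  for (const step of steps) {
    history.push(handlers[step.type](step, history));
  }
  return history;
}
```

The key point is that every step result, including opaque thought signatures, is threaded forward so later Gemini calls see what already happened.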

Challenges we ran into

  • Gemini's built-in googleSearch tool is not compatible with other function calls in the same request. As a workaround, we ran googleSearch in a separate Gemini instance, which the workflow-executing Gemini invokes through a regular function call. This way, googleSearch and function calls never coexist in the same Gemini instance; the workflow instance sees only function calls.
  • Designing prompts/outputs that are structured enough to execute, but flexible across apps and teams.
  • Safely bridging autonomy with governance (authorized tools only, human approval hooks where needed).
  • Tooling complexity: multiple tool sources (Google APIs, MCP servers, Playwright).
  • Making verification first-class: capturing artifacts that prove the automation did the right thing.
  • Handling "thought signatures" when sending function-call responses back was a pain. We tried both the chat option in the genai SDK and a manual implementation.
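The googleSearch workaround above boils down to exposing the search-enabled Gemini instance behind an ordinary function name. A hypothetical sketch with the search call stubbed out (the function names and router are illustrative):

```typescript
// Hypothetical: the workflow model only ever sees function declarations.
// One of them ("googleSearch") is backed by a separate Gemini instance
// configured with the googleSearch tool; here it is injected as a stub.
type SearchFn = (query: string) => string;

function makeFunctionRouter(search: SearchFn) {
  // Maps function-call names emitted by the workflow model to implementations.
  return function route(name: string, args: Record<string, string>): string {
    switch (name) {
      case "googleSearch":
        // Delegates to the dedicated search-enabled Gemini instance.
        return search(args.query);
      case "gmail.sendMessage":
        return `sent:${args.to}`;
      default:
        throw new Error(`Unknown function: ${name}`);
    }
  };
}
```

Because search lives behind the router, the workflow-executing instance is configured with function declarations only, sidestepping the incompatibility.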

Accomplishments that we're proud of

  • A true end-to-end loop: record → discover → generate workflow → execute → store evidence → review results.
  • Multimodal workflow discovery that includes collaborator detection for multi-person processes.
  • Hybrid execution strategy (MCP/Google APIs preferred + Playwright fallback) with evidence artifacts for every run.
  • LLM steps integrated as first-class workflow actions (with tests covering LLM execution paths).
  • A dockerized, reproducible prototype designed for judges to run and evaluate consistently.
  • 200+ unit tests, linting, gitleaks, environment-variable configuration, Docker, and rate limits, to increase the quality, security, and maintainability of the project.
  • Once workflows are running, teams need visibility. Auto Automation provides a live dashboard showing who’s onboarded, which workflows were discovered, and how much time the team has saved.

What we learned

  • "Automation quality" is mostly about verification: evidence artifacts + clear intent mapping beat clever prompts.
  • Function calling is the right abstraction for safe autonomy: the model plans, tools execute, and the system audits.
  • Progressive disclosure reduces setup friction: ask users for structured inputs only when execution truly needs them.
  • Multi-tenant security, authorization boundaries, and rate limits must be designed early, even in a hackathon prototype.

What's next for Auto Automation

  • Add a feedback loop: users correct discovered steps once, and the agent generalizes improvements across runs.
  • Improve discovery fidelity with timestamped steps and stronger grounding (what changed on-screen, what outcome occurred).
  • Expand MCP ecosystem (Slack/Jira/Gmail/CRM/internal tools).
  • Ship a “trust layer”: policy rules, approvals for high-risk actions, and compliance-friendly audit exports. Currently, only the "keep human in the loop" option, execution logs, and execution evidence are in place.
  • Complete the Chrome Extension.
