OpenAdapt Capture is the data collection component of the OpenAdapt GUI automation ecosystem.
Capture platform-agnostic GUI interaction streams with time-aligned screenshots and audio for training ML models or replaying workflows.
Status: Pre-alpha. See docs/DESIGN.md for architecture discussion.
```
OpenAdapt GUI Automation Pipeline
=================================

+-------------------+           +-------------------+           +-------------------+
|                   |           |                   |           |                   |
|   openadapt-      |  ------>  |   openadapt-ml    |  ------>  |   Deploy          |
|   capture         |  Convert  |   (Train & Eval)  |  Export   |   (Inference)     |
|                   |           |                   |           |                   |
+-------------------+           +-------------------+           +-------------------+
          |                               |                               |
          v                               v                               v
  - Record GUI                    - Fine-tune VLMs                - Run trained
    interactions                  - Evaluate on                     agent on new
  - Mouse, keyboard,                benchmarks (WAA)                tasks
    screen, audio                 - Compare models                - Real-time
  - Privacy scrubbing             - Cloud GPU training              automation
```
| Component | Purpose | Repository |
|---|---|---|
| openadapt-capture | Record human demonstrations | GitHub |
| openadapt-ml | Train and evaluate GUI automation models | GitHub |
| openadapt-privacy | PII scrubbing for recordings | Coming soon |
```bash
uv add openadapt-capture
```

This includes everything needed to capture and replay GUI interactions (mouse, keyboard, screen recording).
For audio capture with Whisper transcription (large download):
uv add "openadapt-capture[audio]"from openadapt_capture import Recorder
# Record GUI interactions
with Recorder("./my_capture", task_description="Demo task") as recorder:
# Captures mouse, keyboard, and screen until context exits
input("Press Enter to stop recording...")
print(f"Captured {recorder.event_count} events")from openadapt_capture import Capture
Load a capture and iterate over time-aligned events:

```python
from openadapt_capture import Capture

capture = Capture.load("./my_capture")

for action in capture.actions():
    # Each action has an associated screenshot
    print(f"{action.timestamp}: {action.type} at ({action.x}, {action.y})")
    screenshot = action.screenshot  # PIL Image at time of action
```
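Since each screenshot is a PIL image, you can, for instance, dump the frame for every action to disk. A small usage sketch built only on the API shown above:

```python
from pathlib import Path

from openadapt_capture import Capture

capture = Capture.load("./my_capture")
out_dir = Path("./frames")
out_dir.mkdir(exist_ok=True)

# Save the screenshot associated with each processed action as a PNG
for i, action in enumerate(capture.actions()):
    action.screenshot.save(out_dir / f"action_{i:04d}.png")
```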
For lower-level control, create storage and write raw events directly:

```python
from openadapt_capture import (
    create_capture, process_events,
    MouseDownEvent, MouseButton,
)

# Create storage (platform and screen size auto-detected)
capture, storage = create_capture("./my_capture")

# Write raw events
storage.write_event(MouseDownEvent(timestamp=1.0, x=100, y=200, button=MouseButton.LEFT))

# Query and process
raw_events = storage.get_events()
actions = process_events(raw_events)  # Merges clicks, drags, typed text
```

Raw events (captured):
- `mouse.move`, `mouse.down`, `mouse.up`, `mouse.scroll`
- `key.down`, `key.up`
- `screen.frame`, `audio.chunk`
Actions (processed):
- `mouse.singleclick`, `mouse.doubleclick`, `mouse.drag`
- `key.type` (merged keystrokes → text)
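For example, `process_events` should collapse a matched press/release pair at one position into a single click. A minimal sketch, assuming a `MouseUpEvent` counterpart to the `MouseDownEvent` shown above:

```python
from openadapt_capture import (
    process_events,
    MouseDownEvent, MouseUpEvent, MouseButton,  # MouseUpEvent assumed by analogy
)

# A press/release pair at the same coordinates...
raw_events = [
    MouseDownEvent(timestamp=1.0, x=100, y=200, button=MouseButton.LEFT),
    MouseUpEvent(timestamp=1.1, x=100, y=200, button=MouseButton.LEFT),
]

# ...merges into a single click action
actions = process_events(raw_events)
print([a.type for a in actions])  # expected: ["mouse.singleclick"]
```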
Each capture is stored as a directory:

```
capture_directory/
├── capture.db    # SQLite: events, metadata
├── video.mp4     # Screen recording
└── audio.flac    # Audio (optional)
```
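Because events and metadata live in plain SQLite, a capture can also be inspected with Python's standard library. A read-only sketch; the table layout is whatever openadapt-capture defines, so we list it rather than assume it:

```python
import sqlite3

# Open the capture database read-only
conn = sqlite3.connect("file:my_capture/capture.db?mode=ro", uri=True)

# List the tables this capture actually contains
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print(tables)

conn.close()
```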
Track event write latency and analyze capture performance:
```python
from openadapt_capture import Recorder

with Recorder("./my_capture") as recorder:
    input("Press Enter to stop...")

# Access performance statistics
summary = recorder.stats.summary()
print(f"Mean latency: {summary['mean_latency_ms']:.1f}ms")

# Generate performance plot
recorder.stats.plot(output_path="performance.png")
```

Compare extracted video frames against original images to verify lossless capture:
```python
from openadapt_capture import compare_video_to_images, plot_comparison

# Compare frames
report = compare_video_to_images(
    "capture/video.mp4",
    [(timestamp, image) for timestamp, image in captured_frames],
)
print(f"Mean diff: {report.mean_diff_overall:.2f}")
print(f"Lossless: {report.is_lossless}")

# Visualize comparison
plot_comparison(report, output_path="comparison.png")
```

Generate animated demos and interactive viewers from recordings:
```python
from openadapt_capture import Capture, create_demo

capture = Capture.load("./my_capture")
create_demo(capture, output="demo.gif", fps=10, max_duration=15)
```
```python
from openadapt_capture import Capture, create_html

capture = Capture.load("./my_capture")
create_html(capture, output="viewer.html", include_audio=True)
```

The HTML viewer includes:
- Timeline scrubber with event markers
- Frame-by-frame navigation
- Synchronized audio playback
- Event list with details panel
- Keyboard shortcuts (Space, arrows, Home/End)
```bash
uv run python scripts/generate_readme_demo.py --duration 10
```

| Extra | Features |
|---|---|
| `audio` | Audio capture + Whisper transcription |
| `privacy` | PII scrubbing (openadapt-privacy) |
| `all` | Everything |
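For example, to install with every optional feature enabled:

```bash
uv add "openadapt-capture[all]"
```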
Captured recordings can be used to train vision-language models with openadapt-ml.
```bash
# 1. Capture a workflow demonstration
uv run python -c "
from openadapt_capture import Recorder
with Recorder('./my_capture', task_description='Turn off Night Shift') as recorder:
    input('Perform the task, then press Enter to stop...')
"

# 2. Train a model on the capture (requires openadapt-ml)
uv pip install openadapt-ml
uv run python -m openadapt_ml.cloud.local train \
    --capture ./my_capture \
    --open  # Opens training dashboard

# 3. Compare human vs model predictions
uv run python -m openadapt_ml.scripts.compare \
    --capture ./my_capture \
    --checkpoint checkpoints/model \
    --open
```

For faster training with cloud GPUs:
```bash
# Train on Lambda Labs A10 (~$0.75/hr)
uv run python -m openadapt_ml.cloud.lambda_labs train \
    --capture ./my_capture \
    --goal "Turn off Night Shift"
```

See the openadapt-ml documentation for cloud setup.
OpenAdapt-ML converts captures to its Episode format automatically:
```python
from openadapt_ml.ingest.capture import capture_to_episode

episode = capture_to_episode("./my_capture")
print(f"Loaded {len(episode.steps)} steps")
print(f"Instruction: {episode.instruction}")
```

The conversion maps capture event types to ML action types:
- `mouse.singleclick` / `mouse.click` → `CLICK`
- `mouse.doubleclick` → `DOUBLE_CLICK`
- `mouse.drag` → `DRAG`
- `mouse.scroll` → `SCROLL`
- `key.type` → `TYPE`
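Conceptually, this amounts to a lookup table like the sketch below; names are illustrative, and the authoritative conversion is implemented inside `openadapt_ml.ingest.capture`:

```python
# Illustrative only: restates the documented event-to-action mapping.
CAPTURE_TO_ML_ACTION = {
    "mouse.singleclick": "CLICK",
    "mouse.click": "CLICK",
    "mouse.doubleclick": "DOUBLE_CLICK",
    "mouse.drag": "DRAG",
    "mouse.scroll": "SCROLL",
    "key.type": "TYPE",
}
```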
To develop locally:

```bash
uv sync --dev
uv run pytest
```

Related projects:

- openadapt-ml - Train and evaluate GUI automation models
- Windows Agent Arena - Benchmark for Windows GUI agents
License: MIT

