Desktest

Desktest is a general computer-use CLI for automated, virtualised end-to-end testing of desktop applications using LLM-powered agents. It spins up a disposable 🐳 Docker container (Linux) or Tart VM (macOS) with a desktop environment, deploys your app, and runs a computer-use agent that interacts with it based on your prompt. Desktest is built with coding agents in mind as first-class users.

Once you're happy → convert the agent trajectories to deterministic CI code.

⚠️ Warning: Desktest is beta software under active development. APIs, task schema, and CLI flags may change between releases.

🤖 Agent Quickstart

Copy-paste the following prompt into Claude Code/Cursor/Codex (or any coding agent) to install desktest and set up the agent skill:

📋 Copy this prompt into your agent 📋
Install the desktest CLI by running `curl -fsSL https://raw.githubusercontent.com/Edison-Watch/desktest/master/install.sh | sh`. Then copy `skills/desktest-skill.md` from the desktest repo (https://raw.githubusercontent.com/Edison-Watch/desktest/master/skills/desktest-skill.md) to `~/.claude/skills/desktest/SKILL.md` so you have context on how to use it.

Features

  • Prompt → computer use: flexible evaluation metrics (see task definitions)
  • Observability: live monitoring dashboard, video recordings, desktest logs for agents
  • Virtualized OS: Linux, macOS, Windows (WIP), plus any Docker image you want
  • CI integration: run a suite of tests as codified, deterministic agent trajectories
  • QA agent (--qa): autonomous QA reports via Slack webhooks/markdown
  • SSH monitoring: access the dashboard and VNC from another machine via SSH or direct network access

Use Cases

Workflow 1: Prompt → Human monitors computer use → Deterministic CI

  1. Define the task & config in task_name.json
  2. Monitor your agent using the computer/desktop app: desktest run task_name.json --monitor
  3. Repeat steps 1-2 until you're happy with the agent's computer use:
    • if ✅ → codify into a deterministic Python script (reusable for CI/CD): desktest codify trajectory.jsonl
    • if ❌ → debug with coding agents via desktest logs desktest_artifacts/
  4. desktest run task_name.json --replay (deterministic replay, reusing the agent trajectory as PyAutoGUI code)
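Each run logs the agent's steps to trajectory.jsonl, which the codify and logs commands consume. As a rough sketch, reading such a file is one JSON object per line; note the "step"/"action" field names below are assumptions for illustration, not desktest's documented schema:

```python
import json

def load_trajectory(path):
    """Parse a trajectory .jsonl file into a list of step dicts.

    Illustrative only: the "step" and "action" keys used in the demo
    are assumed, not taken from desktest's actual log format.
    """
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical two-step trajectory, then read it back.
with open("trajectory.jsonl", "w") as f:
    f.write('{"step": 1, "action": "pyautogui.click(100, 200)"}\n')
    f.write('{"step": 2, "action": "pyautogui.press(\'enter\')"}\n')
steps = load_trajectory("trajectory.jsonl")
```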

Workflow 2: QA mode → open-ended exploration → reports any bugs it encounters on Slack

  1. Define the task & config in task_name.json
  2. Monitor your agent using the computer/desktop app: desktest run task_name.json --monitor --qa
  3. Bugs are reported via Slack & markdown!

Requirements

TLDR: Run desktest doctor to verify your setup.


To run tests (Linux, the default):

  • Linux or macOS host
  • Docker daemon running (Docker Desktop, OrbStack, Colima, etc.)
  • An LLM API key (OpenAI, Anthropic, or compatible), or a CLI-based provider: Claude Code (--provider claude-cli) or Codex CLI (--provider codex-cli). Not needed for --replay mode.

To run tests (macOS apps):

  • Apple Silicon Mac (M1 or later) running macOS 13+
  • Tart installed (brew install cirruslabs/cli/tart)
  • sshpass installed (brew install hudochenkov/sshpass/sshpass), used for golden image provisioning
  • A golden image prepared via desktest init-macos (handles Python, PyAutoGUI, the a11y helper, TCC permissions, and SSH key setup automatically)
  • An LLM API key (same as Linux), or --provider claude-cli to use your Claude Code subscription
  • 2-VM limit: Apple's macOS SLA and Virtualization.framework permit at most 2 macOS VMs running simultaneously per Mac. See macOS Support for details and Apple TOS compliance.

To run tests (Windows apps, planned):

  • Windows VM support is planned but not yet designed. It is expected to use QEMU/libvirt or Hyper-V with Windows VMs, RDP or VNC for display access, and UI Automation APIs for accessibility. Details TBD.

To build from source (optional):

  • Rust toolchain (cargo)
  • Git
  • Xcode Command Line Tools (for the macOS a11y helper binary; macOS only)
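The checks desktest doctor performs can be approximated with a few stdlib calls. A hypothetical sketch for a Linux host (the real command almost certainly checks more, e.g. Tart and the golden image on macOS):

```python
import os
import shutil

def check_prereqs():
    """Rough approximation of what `desktest doctor` verifies on Linux.

    Illustrative only, not desktest's actual logic. Returns a list of
    human-readable problem descriptions (empty means ready to run).
    """
    problems = []
    # Docker daemon access requires at least the docker CLI on PATH.
    if shutil.which("docker") is None:
        problems.append("docker CLI not found on PATH")
    # Any one of the supported key variables is enough.
    key_vars = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "LLM_API_KEY")
    if not any(os.environ.get(v) for v in key_vars):
        problems.append("no LLM API key in the environment (OK for --replay)")
    return problems
```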

Installation

One-line install (pre-built binary)

curl -fsSL https://raw.githubusercontent.com/Edison-Watch/desktest/master/install.sh | sh
βš™οΈ Building from source
# Or build from source
git clone https://github.com/Edison-Watch/desktest.git
cd desktest
make install_cli

Example Commands

TLDR: See interactive examples in /examples/README.md

# Validate a task file
desktest validate elcalc-test.json

# Run a single test
desktest run elcalc-test.json

# Run a test suite
desktest suite tests/

# Interactive debugging (starts container, prints VNC info, pauses)
desktest interactive elcalc-test.json

# Step-by-step mode (pause after each agent action)
desktest interactive elcalc-test.json --step

CLI Commands

TLDR: desktest --help

desktest [OPTIONS] <COMMAND>

Commands:
  run           Run a single test from a task JSON file (supports --replay for deterministic mode)
  suite         Run all *.json task files in a directory
  interactive   Start container and pause for debugging
  attach        Attach to an existing running container (supports --replay)
  validate      Check task JSON against schema without running
  codify        Convert trajectory to deterministic Python replay script
  review        Generate interactive HTML trajectory viewer
  logs          View trajectory logs in the terminal (supports --steps N, N-M, or N,M,X-Y)
  monitor       Start a persistent monitor server for multi-phase runs
  init-macos    Prepare a macOS golden image for Tart VM testing
  doctor        Check that all prerequisites are installed and configured
  update        Update desktest to the latest release from GitHub

Options:
  --config <FILE>            Config JSON file (optional; API key can come from env vars)
  --output <DIR>             Output directory for results (default: ./test-results/)
  --debug                    Enable debug logging
  --verbose                  Include full LLM responses in trajectory logs
  --record                   Enable video recording
  --monitor                  Enable live monitoring web dashboard
  --monitor-port <PORT>      Port for the monitoring dashboard (default: 7860)
  --monitor-bind-addr <ADDR> Bind address for dashboard (default: 127.0.0.1, use 0.0.0.0 for remote)
  --resolution <WxH>         Display resolution (e.g., 1280x720, 1920x1080, or preset: 720p, 1080p)
  --artifacts-dir <DIR>      Directory for trajectory logs, screenshots, and a11y snapshots
  --no-artifacts             Skip artifact collection entirely
  --artifacts-timeout <SECS> Timeout for artifact collection (default: 120, 0 = no limit)
  --artifacts-exclude <GLOB> Glob patterns to exclude from artifact collection (repeatable)
  --qa                       Enable QA mode: agent reports app bugs during testing
  --with-bash                Allow the agent to run bash commands inside the container (disabled by default)
  --no-network               Disable outbound network from the container (Docker network mode "none")
  --provider <PROVIDER>      LLM provider: anthropic, openai, openrouter, cerebras, gemini, claude-cli, codex-cli, custom
  --model <MODEL>            LLM model name (overrides config file)
  --api-key <KEY>            API key for the LLM provider (prefer env vars to avoid shell history exposure)
  --llm-max-retries <N>      Max retry attempts for retryable LLM API failures

Computer Use Agent Task Definition


Tests are defined in JSON files. Here's a complete example that tests a calculator app (the inline // comments are annotations; remove them in a real JSON file):

{
  "schema_version": "1.0",        // Required: task schema version
  "id": "elcalc-addition",        // Unique test identifier
  "instruction": "Using the calculator app, compute 42 + 58.",  // What the agent should do
  "completion_condition": "The calculator display shows 100 as the result.",  // Success criteria (optional)
  "app": {
    "type": "appimage",            // How to deploy the app (see App Types below)
    "path": "./elcalc-2.0.3-x86_64.AppImage"
  },
  "evaluator": {
    "mode": "llm",                 // Validation mode: "llm", "programmatic", or "hybrid"
    "llm_judge_prompt": "Does the calculator display show the number 100 as the result? Answer pass or fail."
  },
  "timeout": 120                   // Max seconds before the test is aborted
}

The optional completion_condition field lets you define the success criteria separately from the task instruction. When present, it's appended to the instruction sent to the agent, and rendered as a collapsible section in the review and live dashboards.

See examples/ for more examples including folder deploys and custom Docker images.
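For a quick structural sanity check before handing a file to desktest validate, something like the following works; the required-field list is inferred from the example above, not the official schema:

```python
import json

# Top-level fields every example in this README includes.
# Inferred for illustration, not desktest's authoritative schema.
REQUIRED_FIELDS = ("schema_version", "id", "instruction", "app")

def validate_task(path):
    """Raise ValueError if the task file is missing an expected field."""
    with open(path) as f:
        task = json.load(f)
    missing = [k for k in REQUIRED_FIELDS if k not in task]
    if missing:
        raise ValueError(f"{path}: missing fields {missing}")
    return task
```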

App Types

| Type | Description |
|------|-------------|
| appimage | Deploy a single AppImage file |
| folder | Deploy a directory with an entrypoint script |
| docker_image | Use a pre-built custom Docker image |
| vnc_attach | Attach to an existing running desktop (see Attach Mode) |
| macos_tart | macOS app in a Tart VM; isolated, destroyed after the test (see macOS Support) |
| macos_native | macOS app on the host desktop, no VM isolation (see macOS Support) |
| windows | (Planned) Windows app in a VM; details TBD |

Electron apps: Add "electron": true to your app config to use the desktest-desktop:electron image with Node.js pre-installed. See examples/ELECTRON_QUICKSTART.md.

Evaluation Metrics

| Metric | Description |
|--------|-------------|
| file_compare | Compare a container file against an expected file (exact or normalized) |
| file_compare_semantic | Parse and compare structured files (JSON, YAML, XML, CSV) |
| command_output | Run a command, check stdout (contains, equals, regex) |
| file_exists | Check if a file exists (or doesn't) in the container |
| exit_code | Run a command, check its exit code |
| script_replay | Run a Python replay script, check for REPLAY_COMPLETE + exit 0 |
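For instance, file_compare_semantic for JSON amounts to parsing both sides so that formatting and key order don't matter. A minimal sketch (the real evaluator also handles YAML, XML, and CSV):

```python
import json

def semantic_equal(actual_text, expected_text):
    """Compare two JSON documents by parsed value, not by text.

    Sketch of a file_compare_semantic-style check: whitespace and key
    order differences disappear once both sides are parsed.
    """
    return json.loads(actual_text) == json.loads(expected_text)

# Key order and formatting differ, but the parsed values match:
semantic_equal('{"a": 1, "b": 2}', '{ "b": 2, "a": 1 }')  # True
```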

Live Monitoring

TLDR: Run desktest run task_name.json --monitor for a real-time agent monitoring dashboard; use desktest review for a post-run dashboard.


Add --monitor to any run or suite command to launch a real-time web dashboard that streams the agent's actions as they happen:

# Watch a single test live
desktest run task.json --monitor

# Watch a test suite with progress tracking
desktest suite tests/ --monitor

# Use a custom port
desktest run task.json --monitor --monitor-port 8080

Open http://localhost:7860 in your browser to see:

  • Live step feed: screenshots, agent thoughts, and action code appear as each step completes
  • Test info header: test ID, instruction, VNC link, and max steps
  • Suite progress: progress bar showing completed/total tests during suite runs
  • Status indicator: pulsing dot shows connection state (live vs disconnected)

The dashboard uses the same UI as desktest review: a sidebar with step navigation and a main panel with screenshot/thought/action details. The difference is that steps stream in via Server-Sent Events (SSE) instead of being loaded from a static file.
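SSE framing itself is simple: events are blank-line-delimited blocks of field: value lines. A minimal parser sketch; the event/data contents in the sample are hypothetical, not desktest's actual stream schema:

```python
def parse_sse(stream_text):
    """Parse Server-Sent Events framing into a list of field dicts.

    Events are separated by blank lines; each line is "field: value".
    The sample payload below is invented for illustration.
    """
    events = []
    for chunk in stream_text.split("\n\n"):
        event = {}
        for line in chunk.splitlines():
            if ":" in line:
                field, _, value = line.partition(":")
                event[field.strip()] = value.strip()
        if event:
            events.append(event)
    return events

sample = 'event: step\ndata: {"step": 1}\n\nevent: step\ndata: {"step": 2}\n'
parse_sse(sample)  # two events, each with "event" and "data" fields
```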

QA Mode

TLDR: Let the agent report bugs in your application on Slack, with some guidance


Add --qa to any run, suite, or attach command to enable bug reporting. The agent will complete its task as normal, but also watch for application bugs and report them as markdown files:

# Run a test with QA bug reporting
desktest run task.json --qa

# QA mode in a test suite
desktest suite tests/ --qa

When --qa is enabled:

  • The agent gains a BUG command to report application bugs it discovers
  • Bash access is automatically enabled for diagnostic investigation (log files, process state, etc.)
  • Bug reports are written to desktest_artifacts/bugs/BUG-001.md, BUG-002.md, etc.
  • Each report includes: summary, description, screenshot reference, accessibility tree state
  • The agent continues its task after reporting, so multiple bugs can be found per run
  • Bug count is included in results.json and the test output
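The BUG-NNN.md naming scheme above can be sketched as follows; the markdown layout inside the report is a guess based on the fields listed, not the exact format desktest emits:

```python
from pathlib import Path

def write_bug_report(bugs_dir, n, summary, description, screenshot):
    """Write a bug report using the BUG-NNN.md naming scheme.

    The report structure (heading, body, screenshot reference) is an
    assumed layout for illustration only.
    """
    bugs = Path(bugs_dir)
    bugs.mkdir(parents=True, exist_ok=True)
    path = bugs / f"BUG-{n:03d}.md"
    path.write_text(
        f"# {summary}\n\n{description}\n\nScreenshot: {screenshot}\n"
    )
    return path
```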

Slack Notifications


Optionally send bug reports to Slack as they're discovered. Add an integrations section to your config JSON:

{
  "integrations": {
    "slack": {
      "webhook_url": "https://hooks.slack.com/services/T.../B.../xxx",
      "channel": "#qa-bugs"
    }
  }
}

Or set the DESKTEST_SLACK_WEBHOOK_URL environment variable (takes precedence over config). The channel field is optional; webhooks already target a default channel. Notifications are fire-and-forget and never block the test.
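The precedence rule and payload shape can be sketched as follows. The payload here is the generic Slack Incoming Webhook format (text plus optional channel); desktest's actual message formatting may differ:

```python
import json
import os

def slack_webhook_url(config):
    """Resolve the webhook URL with the documented precedence:
    the DESKTEST_SLACK_WEBHOOK_URL env var overrides the config file."""
    return os.environ.get("DESKTEST_SLACK_WEBHOOK_URL") or (
        config.get("integrations", {}).get("slack", {}).get("webhook_url")
    )

def bug_payload(summary, channel=None):
    """Build a minimal Incoming Webhook JSON body; `text` is the only
    required key, `channel` is optional."""
    payload = {"text": summary}
    if channel:
        payload["channel"] = channel
    return json.dumps(payload)
```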

Architecture

Developer writes task.json
        │
        ▼
   ┌───────────────┐
   │ desktest CLI  │  validate / run / suite / interactive
   └───────┬───────┘
           │
   ┌─── Linux ────────────────────┐   ┌─── macOS ────────────────────┐
   │  Docker Container            │   │  Tart VM (or native host)    │
   │  Xvfb + XFCE + x11vnc        │   │  Native macOS desktop        │
   │  PyAutoGUI (X11)             │   │  PyAutoGUI (Quartz)          │
   │  pyatspi (AT-SPI2)           │   │  a11y-helper (AXUIElement)   │
   │  scrot (screenshot)          │   │  screencapture (screenshot)  │
   └──────────────┬───────────────┘   └──────────────┬───────────────┘
                  │  screenshot + a11y tree          │
                  └────────────────┬─────────────────┘
                                   ▼
                      ┌──────────────────┐
                      │  LLM Agent Loop  │  observe → think → act → repeat
                      │  (PyAutoGUI code)│
                      └────────┬─────────┘
                               │
                               ▼
                      ┌──────────────────┐
                      │  Evaluator       │  programmatic checks / LLM judge / hybrid
                      └────────┬─────────┘
                               │
                               ▼
                      results.json + recording.mp4 + trajectory.jsonl
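The agent loop in the diagram reduces to a simple skeleton. All four callables below are stand-ins for the real observation, LLM, and execution machinery, so this is a shape sketch, not desktest's implementation:

```python
def agent_loop(observe, think, act, done, max_steps=50):
    """Observe → think → act loop skeleton.

    `observe` returns the current observation (e.g. screenshot + a11y
    tree), `think` asks the LLM for action code (e.g. PyAutoGUI),
    `act` executes it, and `done` decides when to stop. All stand-ins.
    """
    trajectory = []
    for step in range(1, max_steps + 1):
        observation = observe()
        action = think(observation)
        act(action)
        trajectory.append({"step": step, "action": action})
        if done(observation):
            break
    return trajectory
```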

File Artifacts

Files generated as a result of a desktest run.


Each test run produces:

test-results/
  results.json                # Structured test results (always)

desktest_artifacts/
  recording.mp4               # Video of the test session (with --record)
  trajectory.jsonl            # Step-by-step agent log (always)
  agent_conversation.json     # Full LLM conversation (always)
  step_001.png                # Screenshot per step (always)
  step_001_a11y.txt           # Accessibility tree per step (always)
  bugs/                       # Bug reports (with --qa)
    BUG-001.md                # Individual bug report (with --qa)

Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Test passed |
| 1 | Test failed |
| 2 | Configuration error |
| 3 | Infrastructure error |
| 4 | Agent error |
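In CI, these codes let a wrapper distinguish real test failures from environment problems. An illustrative policy (the retry choice here is ours, not desktest guidance):

```python
# Exit code table from above, as a lookup for CI reporting.
EXIT_CODES = {
    0: "test passed",
    1: "test failed",
    2: "configuration error",
    3: "infrastructure error",
    4: "agent error",
}

def should_retry(code):
    """Illustrative CI policy: infrastructure errors (3) are often
    transient and worth retrying; the other codes are not."""
    return code == 3
```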

Environment Variables

TLDR: LLM API keys + Webhooks for QA mode

| Variable | Description |
|----------|-------------|
| OPENAI_API_KEY | OpenAI API key |
| ANTHROPIC_API_KEY | Anthropic API key |
| OPENROUTER_API_KEY | OpenRouter API key |
| CEREBRAS_API_KEY | Cerebras API key |
| GEMINI_API_KEY | Gemini API key |
| CODEX_API_KEY | Codex CLI API key (alternative to ChatGPT login) |
| LLM_API_KEY | Fallback API key for any provider |
| DESKTEST_SLACK_WEBHOOK_URL | Slack Incoming Webhook URL for QA bug notifications (overrides config) |
| GITHUB_TOKEN | GitHub token (used by desktest update) |
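Key resolution with the LLM_API_KEY fallback can be sketched as follows; whether desktest resolves keys in exactly this order is an assumption:

```python
import os

# Provider-specific variables from the table above.
PROVIDER_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "cerebras": "CEREBRAS_API_KEY",
    "gemini": "GEMINI_API_KEY",
}

def resolve_api_key(provider, env=None):
    """Look up the provider-specific variable, falling back to
    LLM_API_KEY. Exact precedence is assumed for illustration."""
    env = os.environ if env is None else env
    specific = env.get(PROVIDER_KEY_VARS.get(provider, ""), "")
    return specific or env.get("LLM_API_KEY")
```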

About

🖥️ desktest CLI, a "Playwright for full computer tests": prompt what to test → an agent tests your app E2E in a Docker container → review the trajectory and, once happy, codify it into deterministic scripts for CI.
