Experimental support to capture screen contents and return b64 encoded image #21

Merged: jph00 merged 7 commits into main from austinvhuang/multimodal-tool-support on Jul 18, 2025

austinvhuang (Contributor) commented Jul 14, 2025

Provides an experimental capture_screen() function that captures the contents of the screen and returns them as a base64-encoded image.

Depends on solveit in two ways:

  • (minimally) to provide communication between JS and the Python REPL (currently using push/pop data)
  • for tool-use integration, to let tool calls return structured values rather than strings only

TODO:

  • Use streaming video instead of screenshot capture so that permission is requested only once.
  • Add more flexibility for resolution (currently fixed at 512x512). This flexibility is needed for screen reading to be useful, but it requires making the async frontend logic more robust and possibly optimizing the image representation (e.g. reducing the color depth).
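As a rough illustration of why resolution and color depth matter for the second TODO item, the sketch below estimates how many base64 characters an uncompressed frame would occupy. The helper names `raw_frame_bytes` and `b64_len` are hypothetical and not part of this PR, and a real capture would most likely be a compressed PNG or JPEG, so these figures are only an upper-bound style estimate.

```python
import base64


def raw_frame_bytes(width: int, height: int, bits_per_pixel: int = 24) -> int:
    """Bytes in an uncompressed frame (hypothetical helper, no row padding)."""
    return (width * height * bits_per_pixel + 7) // 8


def b64_len(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes bytes (4 chars per 3 bytes)."""
    return 4 * ((n_bytes + 2) // 3)


# Compare the current fixed size with a lower color depth and a larger frame.
for w, h, bpp in [(512, 512, 24), (512, 512, 8), (1024, 1024, 24)]:
    raw = raw_frame_bytes(w, h, bpp)
    print(f"{w}x{h} @ {bpp} bpp: {raw} raw bytes, {b64_len(raw)} base64 chars")

# Sanity check of the formula against the standard library encoder.
assert b64_len(1000) == len(base64.b64encode(bytes(1000)))
```

Base64 adds roughly one third to the payload, so under these assumptions dropping from 24 to 8 bits per pixel cuts the encoded size by about a factor of three at the same resolution.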

austinvhuang changed the title from "WIP do not merge" to "Experimental support to capture screen contents and return b64 encoded image (WIP do not merge)" Jul 14, 2025
austinvhuang marked this pull request as ready for review July 15, 2025 15:45
austinvhuang changed the title from "Experimental support to capture screen contents and return b64 encoded image (WIP do not merge)" to "Experimental support to capture screen contents and return b64 encoded image" Jul 15, 2025
austinvhuang requested a review from jph00 July 16, 2025 13:04
austinvhuang marked this pull request as draft July 17, 2025 16:53
austinvhuang force-pushed the austinvhuang/multimodal-tool-support branch from c315eb5 to 77a6534 July 17, 2025 18:06
austinvhuang (Contributor, Author) commented

(rebased to current main branch state)

…act, poll until navigator.mediaDevices is available, ToolImageResult -> claudette.ToolResult
jph00 (Contributor) commented Jul 18, 2025

This requires the new version of claudette too. I'll add it to settings.ini now.

jph00 merged commit e954e76 into main Jul 18, 2025

jph00 (Contributor) commented Jul 18, 2025

Hmmm actually I'd rather not add a large new dep here. I've merged this for now so we can start using it, but @austinvhuang could we instead return a dict with a special mimetype key or something, instead of requiring claudette?
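One possible shape for the dict-based alternative suggested above, as a minimal sketch only. The key names `mimetype` and `data`, and the helpers `image_result` and `is_image_result`, are assumptions made for illustration; the PR does not specify them, and the calling side would need to agree on whatever convention is chosen.

```python
import base64


def image_result(img_bytes: bytes, mimetype: str = "image/png") -> dict:
    """Wrap raw image bytes in a plain dict (hypothetical convention, no claudette needed)."""
    return {
        "mimetype": mimetype,  # assumed marker key telling the caller this is not plain text
        "data": base64.b64encode(img_bytes).decode("ascii"),
    }


def is_image_result(value) -> bool:
    """Check whether a tool's return value follows the assumed dict convention."""
    return (
        isinstance(value, dict)
        and str(value.get("mimetype", "")).startswith("image/")
        and "data" in value
    )


# The 8-byte PNG signature stands in for real screenshot bytes here.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
res = image_result(PNG_SIGNATURE)
print(res["mimetype"], is_image_result(res), is_image_result("plain string result"))
```

A plain dict keeps the tool's return value serializable and avoids importing a result type from another package, at the cost of the consumer having to recognize the marker key itself.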

jph00 added the enhancement label Jul 18, 2025