fix(agent): harden _extract_start_url URL gating — skip local paths (incl. quoted) and match file extensions on the path, not as a substring (#4794)#4983

Open
r266-tech wants to merge 3 commits into browser-use:main from r266-tech:fix-extract-start-url-local-paths-4794

@r266-tech r266-tech commented Jun 7, 2026

Contributor

Fixes #4794.

_extract_start_url (used when directly_open_url=True, the default) scans the task text and force-injects an initial navigate action before the LLM acts. This PR hardens that URL gating against two classes of wrong decision, in both the agent and rust copies of the function, with regression tests covering both implementations.

1. Local filesystem paths were force-navigated (#4794).
A local file path in the task (e.g. an uploaded /app/x_capabilities.html) matched the domain regex: the domain character class [a-zA-Z0-9-] excludes _, so x_capabilities.html is captured as capabilities.html. The agent navigated to that candidate as its first action, the SecurityWatchdog blocked the navigation, and the run was derailed. Such candidates are now skipped. Quoted or parenthesised local paths (e.g. "/app/x.html") are skipped too, because leading quote/paren delimiters are stripped before the local-path check.

2. Everyday sites were wrongly excluded as "files".
The excluded-extension filter tested f'.{ext}' as a substring of the whole URL, so any host or path that merely contained a short extension token was dropped from auto-navigation: docs.python.org (.py), www.python.org (.py), my.docs.google.com (.doc), and any .css/.js/.md host. Exclusion is now decided from the final path segment (scheme/query/fragment stripped, trailing slash removed, percent-decoded, ;path-params split). Genuine downloadable-file URLs are still excluded (report.pdf, report.pdf/, data.json;v=1, report%2Epdf, archive.tar.gz), while real pages are kept (index.html, and example.com/report.pdf/view, which is a page, not a file).

Deliberate scope note: exclusion keys on the path, not the query string. A URL like example.com/download?file=report.pdf (a download endpoint) or example.com/view?doc=report.pdf (a viewer page) is kept — the old substring check dropped both, which over-excluded navigable viewer/search pages. Navigation policy is still enforced downstream by the SecurityWatchdog.
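
For readers skimming the thread, here is a minimal sketch of the path-based check described above. It is not the code from the diff: `has_excluded_extension`, `old_substring_check` and the `EXCLUDED_EXTENSIONS` subset are hypothetical names chosen for illustration, and the real `excluded_extensions` list is not reproduced here.

```python
from urllib.parse import unquote, urlsplit

# Assumption: an illustrative subset, not the project's real excluded_extensions list.
EXCLUDED_EXTENSIONS = {"pdf", "json", "gz", "doc", "py", "css", "js", "md"}


def old_substring_check(url: str) -> bool:
    """Reconstruction of the old behaviour: '.ext' anywhere in the URL excludes it."""
    return any(f".{ext}" in url for ext in EXCLUDED_EXTENSIONS)


def has_excluded_extension(url: str) -> bool:
    """Decide exclusion from the final path segment only."""
    # Bare candidates such as "example.com/report.pdf" carry no scheme;
    # add one so urlsplit separates the host from the path.
    if "://" not in url:
        url = "https://" + url
    # .path drops scheme, host, query and fragment; then drop a trailing slash.
    path = urlsplit(url).path.rstrip("/")
    # Take the last segment, percent-decode it, and split off ;path-params.
    segment = unquote(path.rsplit("/", 1)[-1]).split(";", 1)[0]
    if "." not in segment:
        return False
    return segment.rsplit(".", 1)[-1].lower() in EXCLUDED_EXTENSIONS
```

With this sketch, `docs.python.org` and `example.com/download?file=report.pdf` are kept although `old_substring_check` rejects both, while `example.com/report%2Epdf` and `example.com/data.json;v=1` are still excluded. Keying on the path alone is what keeps `example.com/report.pdf/view` navigable.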

…_start_url (browser-use#4794)

A local file path in the task (e.g. an uploaded /app/x_capabilities.html)
matched the domain regex and was force-navigated as the agent's first
action. Skip URL candidates whose whitespace-delimited token is a local
filesystem path (/, ~/, ./, ../, or a drive). Fixes both the agent and
rust _extract_start_url copies; adds a regression test. Avoids the
extension-only approach, which would regress legit .html/.htm URLs.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="browser_use/agent/service.py">

<violation number="1" location="browser_use/agent/service.py:2391">
P2: Quoted/parenthesized local file paths bypass the local-path filter because surrounding delimiters are included in `local_path_token`, causing the anchored regex to fail.</violation>
</file>

Comment thread browser_use/agent/service.py
…ract_start_url

The excluded-extension filter tested f'.{ext}' as a SUBSTRING of the whole
URL, so everyday sites whose host/path merely contains a short extension
token were wrongly dropped from auto-navigation (e.g. '.py' in
docs.python.org, '.doc' in my.docs.google.com, '.js'/'.css' hosts). Decide
exclusion from the final path segment instead (scheme/query/fragment
stripped, trailing slash removed, percent-decoded, path-params split), so
genuine downloadable-file URLs (report.pdf, data.json;v=1, archive.tar.gz)
still drop. Also strip leading quote/paren delimiters before the local-path
guard so quoted local paths are skipped too. Mirrored in agent + rust copies;
regression tests exercise both implementations.
@r266-tech r266-tech changed the title fix(agent): don't auto-navigate to local filesystem paths in _extract_start_url (#4794) fix(agent): harden _extract_start_url URL gating — skip local paths (incl. quoted) and match file extensions on the path, not as a substring (#4794) Jun 7, 2026

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found across 3 files (changes from recent commits).

Comment thread browser_use/rust/service.py Outdated
…_extract_start_url

The local-path guard lstrip()'d quote/paren delimiters ('"(<[{) but not the
Markdown backtick, so a backtick-wrapped local path (e.g. `/app/x.html`) kept
its leading backtick, the anchored filesystem-path regex failed, and since
'html' is intentionally not an excluded extension the path was auto-navigated
as a bare URL. Add ` to the stripped delimiter set in both agent + rust copies;
regression tests cover backtick-wrapped local paths in both implementations.
@r266-tech

Contributor Author

Good catch, addressed in 4267497. The local-path guard's delimiter strip set covered '"(<[{ but not the Markdown backtick, so a backtick-wrapped local path (e.g. `/app/x_capabilities.html`) kept its leading backtick and the anchored filesystem-path regex didn't match. Since html is intentionally not in excluded_extensions, the path was then treated as a URL and auto-navigated. Added the backtick to the stripped delimiter set in both the agent and rust copies, with regression tests covering backtick-wrapped local paths in both implementations.
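
To make the failure mode concrete, a small sketch of the delimiter strip follows. The delimiter characters are the ones named in this thread; `LEADING_DELIMITERS`, `LOCAL_PATH_RE` and `is_local_path_token` are hypothetical names, and the regex is an assumed approximation of the anchored filesystem-path pattern, not a copy of it.

```python
import re

# Assumption: approximates the anchored filesystem-path regex (/, ~/, ./, ../, drive).
LOCAL_PATH_RE = re.compile(r"^(?:/|~/|\.{1,2}/|[a-zA-Z]:[\\/])")

# Quote/paren delimiters from the earlier commit, plus the Markdown backtick.
LEADING_DELIMITERS = "'\"(<[{`"


def is_local_path_token(token: str) -> bool:
    """Strip leading delimiters, then test the anchored local-path pattern."""
    return bool(LOCAL_PATH_RE.match(token.lstrip(LEADING_DELIMITERS)))
```

Without the backtick in `LEADING_DELIMITERS`, the token `` `/app/x.html` `` still starts with a backtick after stripping, so the `^`-anchored pattern cannot match and the guard lets the path through.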

Development

Successfully merging this pull request may close these issues.

Bug: uploading an html file causes the agent to try to navigate to a web page
