fix(agent): harden _extract_start_url URL gating: skip local paths (incl. quoted) and match file extensions on the path, not as a substring (#4794) #4983
…_start_url (browser-use#4794)
A local file path in the task (e.g. an uploaded /app/x_capabilities.html) matched the domain regex and was force-navigated as the agent's first action. Skip URL candidates whose whitespace-delimited token is a local filesystem path (/, ~/, ./, ../, or a drive). Fixes both the agent and rust _extract_start_url copies; adds a regression test. Avoids the extension-only approach, which would regress legit .html/.htm URLs.
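A minimal sketch of the local-path guard described above, not the PR's actual code. The names `LOCAL_PATH_RE`, `enclosing_token`, and `is_local_path_candidate` are hypothetical, and the pattern is an assumption based only on the prefixes listed in the commit message (/, ~/, ./, ../, or a drive).

```python
import re

# Hypothetical pattern: a token that starts like a filesystem path.
# Covers /, ~/, ./, ../ and a Windows drive prefix such as C:\ or C:/.
LOCAL_PATH_RE = re.compile(r'^(?:/|~/|\./|\.\./|[A-Za-z]:[\\/])')


def enclosing_token(text: str, start: int, end: int) -> str:
    """Return the whitespace-delimited token that contains text[start:end]."""
    left = start
    while left > 0 and not text[left - 1].isspace():
        left -= 1
    right = end
    while right < len(text) and not text[right].isspace():
        right += 1
    return text[left:right]


def is_local_path_candidate(text: str, start: int, end: int) -> bool:
    """True if the URL candidate at text[start:end] sits inside a local path."""
    return LOCAL_PATH_RE.match(enclosing_token(text, start, end)) is not None
```

The check looks at the whole surrounding token rather than the regex match itself, because the match can begin partway through the path (after the underscore in `x_capabilities.html`, for instance) and would not look like a path on its own.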
1 issue found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="browser_use/agent/service.py">
<violation number="1" location="browser_use/agent/service.py:2391">
P2: Quoted/parenthesized local file paths bypass the local-path filter because surrounding delimiters are included in `local_path_token`, causing the anchored regex to fail.</violation>
</file>
…ract_start_url
The excluded-extension filter tested f'.{ext}' as a SUBSTRING of the whole
URL, so everyday sites whose host/path merely contains a short extension
token were wrongly dropped from auto-navigation (e.g. '.py' in
docs.python.org, '.doc' in my.docs.google.com, '.js'/'.css' hosts). Decide
exclusion from the final path segment instead (scheme/query/fragment
stripped, trailing slash removed, percent-decoded, path-params split), so
genuine downloadable-file URLs (report.pdf, data.json;v=1, archive.tar.gz)
still drop. Also strip leading quote/paren delimiters before the local-path
guard so quoted local paths are skipped too. Mirrored in agent + rust copies;
regression tests exercise both implementations.
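The final-path-segment rule can be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: `EXCLUDED_EXTENSIONS` is a made-up subset (the real list is not shown here), and `final_path_segment` and `is_excluded_file_url` are hypothetical names.

```python
import re
from urllib.parse import unquote

# Illustrative subset only; the project's real excluded-extension list may differ.
EXCLUDED_EXTENSIONS = {'pdf', 'json', 'gz', 'zip', 'py', 'doc', 'js', 'css', 'md'}

_SCHEME_RE = re.compile(r'^[A-Za-z][A-Za-z0-9+.-]*://')


def final_path_segment(url: str) -> str:
    """Return the last path segment of a URL, or '' for a bare host."""
    url = _SCHEME_RE.sub('', url)                # strip scheme
    url = url.split('#', 1)[0].split('?', 1)[0]  # strip fragment, then query
    _host, sep, path = url.partition('/')
    if not sep:
        return ''                                # no path at all
    segment = path.rstrip('/').rsplit('/', 1)[-1]
    # Percent-decode, then drop any ;path-params.
    return unquote(segment).split(';', 1)[0]


def is_excluded_file_url(url: str) -> bool:
    """True if the final path segment ends in an excluded extension."""
    segment = final_path_segment(url)
    if '.' not in segment:
        return False
    return segment.rsplit('.', 1)[-1].lower() in EXCLUDED_EXTENSIONS
```

Under these assumptions, `docs.python.org` and `my.docs.google.com` are kept because the host is never inspected, while `example.com/report.pdf` and `example.com/data.json;v=1` are still treated as files.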
1 issue found across 3 files (changes from recent commits).
…_extract_start_url
The local-path guard lstrip()'d quote/paren delimiters ('"(<[{) but not the
Markdown backtick, so a backtick-wrapped local path (e.g. `/app/x.html`) kept
its leading backtick, the anchored filesystem-path regex failed, and since
'html' is intentionally not an excluded extension the path was auto-navigated
as a bare URL. Add ` to the stripped delimiter set in both agent + rust copies;
regression tests cover backtick-wrapped local paths in both implementations.
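To illustrate why the missing backtick mattered, here is a small sketch with hypothetical names (`LOCAL_PATH_RE`, `LEADING_DELIMITERS`, `looks_like_local_path`). The path pattern is an assumption; the delimiter set is the one quoted in the commit message plus the backtick.

```python
import re

# Hypothetical anchored pattern for filesystem-like tokens.
LOCAL_PATH_RE = re.compile(r'^(?:/|~/|\./|\.\./|[A-Za-z]:[\\/])')

# Quote/paren delimiters from the commit message, plus the Markdown backtick.
LEADING_DELIMITERS = '\'"(<[{`'


def looks_like_local_path(token: str) -> bool:
    """Strip leading delimiters, then apply the anchored path check."""
    return LOCAL_PATH_RE.match(token.lstrip(LEADING_DELIMITERS)) is not None
```

Because the pattern is anchored with `^`, any leading character that is not stripped first makes the match fail, which is how a backtick-wrapped path slipped through before this change.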
Good catch, addressed in 4267497. The local-path guard's delimiter strip set covered
Fixes #4794.
`_extract_start_url` (used when `directly_open_url=True`, the default) scans the task text and force-injects an initial `navigate` action before the LLM acts. This PR hardens that URL gating against two false-decision classes, in both the agent and rust copies, with regression tests covering both implementations.

1. Local filesystem paths were force-navigated (#4794).

A local file path in the task (e.g. an uploaded `/app/x_capabilities.html`) matched the domain regex: the domain char class `[a-zA-Z0-9-]` excludes `_`, so `x_capabilities.html` is captured as `capabilities.html`. The path was then navigated to as the agent's first action and blocked by the SecurityWatchdog, derailing the run. Such candidates are now skipped. Quoted/parenthesised local paths (e.g. `"/app/x.html"`) are skipped too: leading quote/paren delimiters are stripped before the local-path check.

2. Everyday sites were wrongly excluded as "files".

The excluded-extension filter tested `f'.{ext}'` as a substring of the whole URL, so any host/path merely containing a short extension token was dropped from auto-navigation: `docs.python.org` (`.py`), `www.python.org` (`.py`), `my.docs.google.com` (`.doc`), and any `.css`/`.js`/`.md` host. Exclusion is now decided from the final path segment (scheme/query/fragment stripped, trailing slash removed, percent-decoded, `;` path-params split), so genuine downloadable-file URLs still drop (`report.pdf`, `report.pdf/`, `data.json;v=1`, `report%2Epdf`, `archive.tar.gz`) while real pages are kept (`index.html`, and `example.com/report.pdf/view`, which is a page, not a file).

Deliberate scope note: exclusion keys on the path, not the query string. A URL like `example.com/download?file=report.pdf` (a download endpoint) or `example.com/view?doc=report.pdf` (a viewer page) is kept. The old substring check dropped both, which over-excluded navigable viewer/search pages. Navigation policy is still enforced downstream by the SecurityWatchdog.