
Headless/Headful mode #356

Merged
NGTmeaty merged 77 commits into main from headless-new
Aug 26, 2025

@yzqzss

@yzqzss yzqzss commented Jun 24, 2025

Collaborator

close: #347

@CorentinB

Collaborator

Incredible work!

@yzqzss yzqzss marked this pull request as ready for review June 27, 2025 03:59
@yzqzss

yzqzss commented Jun 27, 2025

Collaborator Author

I'll add PR details later.
Code is ready for review.

@NGTmeaty NGTmeaty left a comment

Collaborator

Looks great so far! I have a couple of small comments but primarily I need to do some validation of the final WARC to ensure all relevant headers are being recorded correctly!

Comment thread go.mod
Comment thread internal/pkg/archiver/headless/archiver.go
Comment thread internal/pkg/archiver/headless/archiver.go
Comment thread internal/pkg/archiver/headless/archiver.go Outdated
Comment thread internal/pkg/archiver/headless/archiver.go Outdated
Comment thread cmd/get.go Outdated
Comment thread cmd/get.go Outdated
@NGTmeaty

Collaborator

This is more of a question than anything else, but `--max-hops` does not seem to be respected in headless mode (which is fine). I assume autoclick emulates that behavior? However, most of the time when running with any `--headless-behaviors` selected, I get the following error.
Run with: `./Zeno get url --headless --headfull --headless-dev-tools --max-hops 2 --headless-behaviors autoclick "https://jakel.rocks/"`

ERROR archiver.go:221     | unable to run behaviors script component=archiver.archiveHeadless.page item_id=a10d0 seed_id=a10d0 url=https://jakel.rocks/ error={-32000 Inspected target navigated or closed }

Is there any way to set "max-hops" or allow it to discover further URLs yet?

Thanks again for this PR! It's looking great so far!

@yzqzss yzqzss force-pushed the headless-new branch 3 times, most recently from 905bba1 to 87167d4 Compare July 1, 2025 09:45
@yzqzss yzqzss changed the title from "Headless/Headfull mode" to "Headless/Headful mode" Jul 1, 2025
@CorentinB

Collaborator

> Is there any way to set "max-hops" or allow it to discover further URLs yet?

Just my 2 cents regarding @NGTmeaty's message: I think using headless should absolutely give the same hop behavior as not using headless. Basically, the only differences between headless and non-headless should be GET vs. full browser, and assets being captured while the page loads as opposed to being extracted and then GET-ed. The rest should be on par with "regular usage" of Zeno as much as possible. If not, it should be clearly documented.

@yzqzss

yzqzss commented Jul 14, 2025

Collaborator Author

outlinks crawling (--max-hops > 0) now works, but not reliably. :(

2025-07-15.05-39-09.mp4

@CorentinB

Collaborator

> outlinks crawling (--max-hops > 0) now works, but not reliably. :(
>
> 2025-07-15.05-39-09.mp4

Look mom, I'm on TV!

Can you please describe exactly what's unreliable?

NGTmeaty previously approved these changes Aug 18, 2025

@NGTmeaty NGTmeaty left a comment

Collaborator

Looking good! I have a few comments below, but it is very close to the finish line, I think!

@willmhowes should be completing his review soon as well.

Comment thread internal/pkg/postprocessor/extractor/link_header_test.go Outdated
Comment thread internal/pkg/postprocessor/outlinks.go
Comment thread internal/pkg/archiver/headless/archiver.go Outdated
Comment thread internal/pkg/archiver/headless/archiver.go Outdated
@yzqzss

yzqzss commented Aug 19, 2025

Collaborator Author

So, let me pin the default Chromium revision first.

I think in a production environment we can consider using the release version of the Chromium binary from the OS distribution (`--headless-chromium-bin`).
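
As a rough illustration of how these options could interact, the sketch below resolves a binary path and a revision number into a single choice. The flag names `--headless-chromium-bin` and `--headless-chromium-revision`, and the meaning of `-1` as "latest stable", come from this PR; the `browserChoice` type, the `chooseBrowser` function, the precedence of the binary path over the revision, and the sample revision number are all assumptions made for the example.

```go
package main

import "fmt"

// browserChoice is a hypothetical description of which Chromium to launch.
type browserChoice struct {
	Kind     string // "system-binary", "latest-stable", or "pinned-revision"
	BinPath  string // set only for "system-binary"
	Revision int    // set only for "pinned-revision"
}

// chooseBrowser resolves the two options into one choice.
// Assumption: an explicit binary path takes precedence over any revision.
func chooseBrowser(binPath string, revision int) browserChoice {
	switch {
	case binPath != "":
		return browserChoice{Kind: "system-binary", BinPath: binPath}
	case revision == -1:
		return browserChoice{Kind: "latest-stable"}
	default:
		return browserChoice{Kind: "pinned-revision", Revision: revision}
	}
}

func main() {
	// The path and revision number below are made-up sample values.
	fmt.Printf("%+v\n", chooseBrowser("/usr/bin/chromium", 0))
	fmt.Printf("%+v\n", chooseBrowser("", -1))
	fmt.Printf("%+v\n", chooseBrowser("", 123456))
}
```

Pinning a revision keeps captures reproducible across machines, while a distribution-packaged binary trades that for security updates managed by the OS.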

@willmhowes willmhowes left a comment

Collaborator

This is really solid work @yzqzss! I'm still wondering if there is room to improve the structure of the internal/pkg/archiver directory so that it's clearer what code drives the "standard mode" crawls vs. "headless mode" crawls, but overall I'm very impressed.

Comment thread internal/pkg/postprocessor/item.go
Comment thread internal/pkg/archiver/body/body.go
Comment thread internal/pkg/archiver/body/body.go
@yzqzss yzqzss requested review from NGTmeaty and willmhowes August 22, 2025 19:52

@willmhowes willmhowes left a comment

Collaborator

Love the changes! Once @NGTmeaty finishes final review, I'm happy to see this merged

@NGTmeaty NGTmeaty left a comment

Collaborator

Looks great! Thanks again for your hard work here!

@NGTmeaty NGTmeaty merged commit 599c566 into main Aug 26, 2025
5 checks passed
CorentinB pushed a commit that referenced this pull request Aug 27, 2025
* feat: headless!

* fix: use cobra StringSlice

* typo: headfull -> headful

* bump behaviors.js to v0.9.0

* add `--headless-chromium-revision` option to custom chromium version

* update: merge main

* headless: discard hooks

* headless: retry on clientDo() errors (config.Get().MaxRetry)

* disable assets extraction for headless

* expose `GlobalPreprocessor.Seenchecker`

* get `Document` from the page and store it in item

* seencheck headless sub requests

* make postprocessor headless compatible

* refactor: move our non-headless archiver code into `general` package

* update

* feat: set `--headless-chromium-revision` to `-1` to use latest stable Chromium version

* failfast if the request is canceled

* headless: add bucketmanager

* feat: prevent most outlinks extractors from running in headless mode

* feat: indicate headless mode in warcinfo

* feat: add `software` section to warcinfo

* chore: reorganize command flags for better structure

* chore: small explanation for what `headful` actually does

* feat: drop request not in http/https scheme

* add stats and WARCWriteAsync

* typo

* Update cmd/get.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update internal/pkg/postprocessor/extractor/css.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* typo

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* chore: complete the comment for `LogFunc`

* panic for unsupported DevTools-Protocol version

* chore: clean up temporary debug code

* chore: rename

* remove unnecessary `CloseIdleConnections()`

* go mod tidy

* fix merge conflict

* use the latest Stable Chromium by default

* chore: add SeencheckerFunc type

* set mimetype

* feat: drop discardable seen GET resources

* feat: `headless-page-timeout` hard timeout for page

* chore: log mimetype

* refactor: `ExtractURLsFromHeader()` using interfaces

* placeholder URL for headless seencheck

* add e2e test

* CI: disable AppArmor unprivileged user namespaces limitations

* rename `headless-post-load-delay` to `headless-page-post-load-delay`

* chore

* doc: add a nice Gantt diagram to explain various timeout settings for headless

* dependency: upgrade browsertrix-behaviors to v0.9.2

* revert: remove bx_logger log suppressing

* use the second newest revision by default to avoid Chromium build not existing

* FIX: URLs appear randomly !!!

* update test

* fix merge conflict

* chore: remove outdated comment

* chore: remove my debug code

* chore: `ExtractLink()`: do not return `nil` placeholder error

* fix: pin Chromium revision

* fix merge conflect

* doc: add more details to headless README

* doc: AI polishing :)

* chore: nil check for item's response in case

* chore: update comments

* doc: update

* doc: add help for outlinks extractor

* rename body package to connutil

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
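
One item in the list above, "drop request not in http/https scheme", lends itself to a short illustration. A browser issues sub-requests such as `data:` or `blob:` URLs that have no place in a WARC, so a filter along these lines could sit in front of the archiving step. This is only a sketch: `isArchivable` is a hypothetical helper, not the function added in this PR.

```go
package main

import (
	"fmt"
	"net/url"
)

// isArchivable reports whether rawURL uses the http or https scheme.
// Hypothetical helper: unparsable URLs are treated as not archivable.
func isArchivable(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	// url.Parse lowercases the scheme, so a plain comparison is enough.
	return u.Scheme == "http" || u.Scheme == "https"
}

func main() {
	requests := []string{
		"https://example.com/index.html",
		"http://example.com/style.css",
		"data:text/plain,hello",
		"blob:https://example.com/1234",
	}
	for _, r := range requests {
		fmt.Println(r, isArchivable(r))
	}
}
```

Only the first two sample requests pass the check; the `data:` and `blob:` URLs are dropped because their scheme is neither http nor https.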
@CorentinB CorentinB deleted the headless-new branch August 27, 2025 18:04
@CorentinB

Collaborator

You dropped that, King: 👑

@yzqzss yzqzss mentioned this pull request Jan 18, 2026

Labels

enhancement (New feature or request), GSoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[GSoC] Headless

6 participants