Headless/Headful mode #356
Incredible work!
I'll add PR details later.
NGTmeaty left a comment
Looks great so far! I have a couple of small comments but primarily I need to do some validation of the final WARC to ensure all relevant headers are being recorded correctly!
More of a question than anything else, but is there any way to set "max-hops" or allow it to discover further URLs yet? Thanks again for this PR! It's looking great so far!
Force-pushed from 905bba1 to 87167d4
Just my 2cts regarding @NGTmeaty's msg: I think using headless should absolutely give the same hop behavior as not using headless. Basically, the only difference between using headless and not using it should be a plain GET vs. a full browser, with assets being captured while the page loads as opposed to being extracted and GET-ed. The rest should be on par with "regular usage" of Zeno as much as possible. If not, it should be clearly documented.
Outlink crawling (`--max-hops` > 0) now works, but not reliably. :( 2025-07-15.05-39-09.mp4
Look mom, I'm on TV! Can you please describe exactly what's unreliable?
NGTmeaty left a comment
Looking good! I have a few comments below, but it is very close to the finish line, I think!
@willmhowes should be completing his review soon as well.
So, let me pin the default Chromium revision first. I think in a production environment, we can consider using the release version of the Chromium binary from the OS distribution. (
willmhowes left a comment
This is really solid work @yzqzss! I'm still wondering if there is room to improve the structure of the internal/pkg/archiver directory so that it's clearer which code drives "standard mode" crawls vs. "headless mode" crawls, but overall I'm very impressed.
willmhowes left a comment
Love the changes! Once @NGTmeaty finishes final review, I'm happy to see this merged
NGTmeaty left a comment
Looks great! Thanks again for your hard work here!
* feat: headless!
* fix: use cobra StringSlice
* typo: headfull -> headful
* bump behaviors.js to v0.9.0
* add `--headless-chromium-revision` option to custom chromium version
* update: merge main
* headless: discard hooks
* headless: retry on clientDo() errors (config.Get().MaxRetry)
* disable assets extraction for headless
* expose `GlobalPreprocessor.Seenchecker`
* get `Document` from the page and store it in item
* seencheck headless sub requests
* make postprocessor headless compatible
* refactor: move our non-headless archiver code into `general` package
* update
* feat: set `--headless-chromium-revision` to `-1` to use latest stable Chromium version
* failfast if the request is canceled
* headless: add bucketmanager
* feat: prevent most outlinks extractors from running in headless mode
* feat: indicate headless mode in warcinfo
* feat: add `software` section to warcinfo
* chore: reorganize command flags for better structure
* chore: small explanation for what `headful` actually does
* feat: drop request not in http/https scheme
* add stats and WARCWriteAsync
* typo
* Update cmd/get.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update internal/pkg/postprocessor/extractor/css.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* typo Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* chore: complete the comment for `LogFunc`
* panic for unsupported DevTools-Protocol version
* chore: clean up temporary debug code
* chore: rename
* remove unnecessary `CloseIdleConnections()`
* go mod tidy
* fix merge conflict
* use the latest Stable Chromium by default
* chore: add SeencheckerFunc type
* set mimetype
* feat: drop discardable seen GET resources
* feat: `headless-page-timeout` hard timeout for page
* chore: log mimetype
* refactor: `ExtractURLsFromHeader()` using interfaces
* placeholder URL for headless seencheck
* add e2e test
* CI: disable AppArmor unprivileged user namespaces limitations
* rename `headless-post-load-delay` to `headless-page-post-load-delay`
* chore
* doc: add a nice Gantt diagram to explain various timeout settings for headless
* dependency: upgrade browsertrix-behaviors to v0.9.2
* revert: remove bx_logger log suppressing
* use the second newest revision by default to avoid Chromium build not existing
* FIX: URLs appear randomly !!!
* update test
* fix merge conflict
* chore: remove outdated comment
* chore: remove my debug code
* chore: `ExtractLink()`: do not return `nil` placeholder error
* fix: pin Chromium revision
* fix merge conflect
* doc: add more details to headless README
* doc: AI polishing :)
* chore: nil check for item's response in case
* chore: update comments
* doc: update
* doc: add help for outlinks extractor
* rename body package to connutil

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
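For readers skimming the commit list, the "drop request not in http/https scheme" item could look roughly like the sketch below. This is not the code from this PR: `isArchivableScheme` is a hypothetical helper name, and the sketch only assumes that a headless browser may issue sub-requests for `data:`, `blob:`, or extension URLs that have no place in a WARC.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isArchivableScheme reports whether rawURL parses and uses the http or
// https scheme. Hypothetical helper for illustration only; it is not
// taken from the Zeno codebase.
func isArchivableScheme(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	switch strings.ToLower(u.Scheme) {
	case "http", "https":
		return true
	default:
		return false
	}
}

func main() {
	// Example sub-request URLs a browser might emit while loading a page.
	for _, raw := range []string{
		"https://example.com/page",
		"data:text/plain;base64,SGVsbG8=",
		"blob:https://example.com/1234",
		"chrome-extension://abc/script.js",
	} {
		fmt.Printf("%-40s keep=%v\n", raw, isArchivableScheme(raw))
	}
}
```

A filter like this would presumably run before the request reaches the seencheck or the WARC writer, so dropped requests cost nothing downstream.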
You dropped that, King: 👑
close: #347