Skip to content

Conversation

@benmuth
Copy link
Contributor

@benmuth benmuth commented Feb 27, 2024

Summary

This PR lets the title extractor use the COOKIES_FILE, if available. This helps avoid extracting the titles of captcha or login pages.

Related issues

#761

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

@pirate pirate added touches: configuration why: functionality Intended to improve ArchiveBox functionality or features status: wip Work is in-progress / has already been partially completed type: enhancement expected: next release labels Feb 29, 2024
@pirate
Copy link
Member

pirate commented Feb 29, 2024

Down to merge this if you've tested it! Just change it to Ready for Review when you're ready :)

@benmuth
Copy link
Contributor Author

benmuth commented Mar 2, 2024

I've been testing it and I'm actually having trouble verifying that this works. I thought it was working when I opened this PR, but it seems I was wrong. For my test case, the title isn't correct if the title extractor is the only enabled extractor. It is correct when wget is enabled, so the cookie file is being read correctly by ArchiveBox, but my fix here doesn't seem to be working. Not sure where it's failing yet.

@benmuth
Copy link
Contributor Author

benmuth commented Mar 5, 2024

After throwing some stuff at the wall, I got to a solution that works for me. For some reason, the cookie file only seems to work with requests when I set each cookie one by one, like

        for cookie in cookie_jar:
            session.cookies.set(cookie.name, cookie.value, domain=cookie.domain, path=cookie.path)

It didn't work for me when I tried session.cookies = cookie_jar or session.cookies.update(cookie_jar), and from the docs I can't tell why. If anyone could tell me what I'm missing, I'd appreciate it.

@benmuth benmuth marked this pull request as ready for review March 5, 2024 08:10
@pirate pirate merged commit 48f4b12 into ArchiveBox:dev Mar 14, 2024
@benmuth benmuth deleted the title-cookies-file branch March 15, 2024 23:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

expected: next release status: wip Work is in-progress / has already been partially completed touches: configuration why: functionality Intended to improve ArchiveBox functionality or features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants