Conversation

@prnake
Contributor

@prnake prnake commented Feb 8, 2022

Summary

This PR tries to get the title from offline HTML first (such as the SingleFile output), following the same idea as the readability extractor. The goal is to handle anti-scraping pages like https://xz.aliyun.com/t/10870, whose title cannot be fetched with wget or curl.
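To illustrate the idea, here is a minimal sketch (not the actual diff in this PR) of reading a title from an already-saved HTML file. The names `title_from_html` and `title_from_snapshot` are hypothetical, and the default file name `singlefile.html` is an assumption about the snapshot layout. It only uses the Python standard library.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Optional


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element only."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.done = False   # set once the first <title> has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def title_from_html(html: str) -> Optional[str]:
    """Return the whitespace-normalized page title, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return text or None


def title_from_snapshot(snapshot_dir: Path,
                        candidates=("singlefile.html",)) -> Optional[str]:
    """Try each saved HTML file in order (hypothetical helper).

    The candidate file names are assumptions; returning None lets the
    caller fall back to fetching the title over the network.
    """
    for name in candidates:
        path = snapshot_dir / name
        if path.is_file():
            html = path.read_text(encoding="utf-8", errors="replace")
            title = title_from_html(html)
            if title:
                return title
    return None
```

A caller could try `title_from_snapshot(...)` first and only fall back to the wget/curl-based fetch when it returns `None`, so a missing or broken offline file does not leave the URL without a title.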

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

@lgtm-com

lgtm-com bot commented Feb 8, 2022

This pull request introduces 1 alert when merging de8e22e into bf432d4 - view on LGTM.com

new alerts:

  • 1 for Unused import

@pirate
Member

pirate commented Mar 16, 2022

This is a good idea, but I'm mildly concerned that putting title extraction so late in the process means any failures during archiving will leave many URLs without titles. Have you tested this with a big import of several hundred URLs?

Thanks for this work! Excited to merge it.

@pirate pirate merged commit 950b5cb into ArchiveBox:dev May 10, 2022