improve title extractor #924

prnake · 2022-02-08T15:23:34Z

Summary

This PR tries to get the title from offline HTML output (such as singlefile) first, following the same idea as the readability extractor. This handles some anti-scraping pages, like https://xz.aliyun.com/t/10870, whose title cannot be retrieved with wget or curl.
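For illustration, here is a minimal sketch of the "offline HTML first" idea using only the Python standard library. It is not the code in this PR: the helper names (`TitleParser`, `title_from_html`, `title_from_snapshot`), the `candidates` file list, and the `fetch_remote` callback are all hypothetical, and the output file name `singlefile.html` is an assumption about how the snapshot folder is laid out.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.inside_title = False
        self.done = False

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones (e.g. inside inline SVG) are ignored.
        if tag == "title" and not self.done:
            self.inside_title = True

    def handle_data(self, data):
        if self.inside_title:
            self.title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self.inside_title:
            self.inside_title = False
            self.done = True

    @property
    def title(self):
        # Collapse runs of whitespace; return None when nothing was found.
        text = " ".join("".join(self.title_parts).split())
        return text or None


def title_from_html(html):
    """Return the <title> text of an HTML string, or None if absent."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def title_from_snapshot(snapshot_dir, candidates=("singlefile.html",), fetch_remote=None):
    """Prefer already-saved offline HTML; fall back to a live fetch only if needed.

    `candidates` and `fetch_remote` are hypothetical parameters for this sketch:
    `fetch_remote` is any zero-argument callable that returns HTML text.
    """
    for name in candidates:
        path = Path(snapshot_dir) / name
        if path.is_file():
            html = path.read_text(encoding="utf-8", errors="replace")
            title = title_from_html(html)
            if title:
                return title
    if fetch_remote is not None:
        return title_from_html(fetch_remote())
    return None
```

The offline copy is tried first because it was rendered by a real browser, so a page that blocks wget or curl can still yield a title. The network fetch stays as a fallback for snapshots where no offline output exists.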

Changes these areas

lgtm-com · 2022-02-08T16:52:59Z

This pull request introduces 1 alert when merging de8e22e into bf432d4 - view on LGTM.com

new alerts:

1 for Unused import

pirate · 2022-03-16T20:18:30Z

This is a good idea, but I'm mildly concerned that putting the title extractor so late in the process means any failures during archiving will leave many URLs without titles. Have you tested this with a big import of several hundred URLs?

Thanks for this work! Excited to merge it.

improve title extractor

de8e22e

remove unused import

011bd10

pirate merged commit 950b5cb into ArchiveBox:dev May 10, 2022
