Open
New Extractor Idea: scihub-dl to auto-detect inline DOI numbers and download academic paper PDFs #720
Enhancement
Labels
expected: release after next
size: medium
status: wip (Work is in-progress / has already been partially completed)
touches: API/CLI/Spec
touches: configuration
touches: dependencies/packaging (Issues or changes that add/remove/affect dependencies)
touches: docs
why: functionality (Intended to improve ArchiveBox functionality or features)
Description
New extractor idea: SCIHUB
Take this academic paper, for example: https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1
If a full paper PDF is available on Sci-Hub, e.g. https://sci-hub.se/https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1, it could be downloaded to a ./archive/<timestamp>/scihub/ output folder.
# try downloading via verbatim URL first
$ scihub.py -d 'https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1'
DEBUG:Sci-Hub:Successfully downloaded file with identifier https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1
We could also look for a DOI number in the page URL or page HTML contents (e.g. 10.1016/j.cub.2019.11.030) using a regex and try downloading that.
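A minimal sketch of that regex step, to make the idea concrete. The pattern is a commonly used heuristic for modern DOIs, not something specified in this issue, and the names `DOI_RE` and `find_dois` are hypothetical. It will miss unusual DOIs and may need tuning against real pages.

```python
import re
from typing import List

# Assumption: a heuristic pattern for modern DOIs ("10.", a 4-9 digit
# registrant code, "/", then a suffix). Not an official or exhaustive rule.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text: str) -> List[str]:
    """Return unique DOI-like strings found in a URL or HTML body, in order."""
    found: List[str] = []
    for match in DOI_RE.finditer(text):
        # Trailing sentence punctuation is allowed by the character class,
        # so trim it off the end of the match.
        doi = match.group(0).rstrip(".,;:")
        # Drop a closing parenthesis only when it is unbalanced, since some
        # publisher DOIs legitimately contain "(...)".
        while doi.endswith(")") and doi.count(")") > doi.count("("):
            doi = doi[:-1].rstrip(".,;:")
        if doi not in found:
            found.append(doi)
    return found


if __name__ == "__main__":
    print(find_dois("Full text: https://doi.org/10.1016/j.cub.2019.11.030."))
```

Each candidate returned here could then be passed to the downloader one at a time, after the verbatim URL has been tried.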
# otherwise try downloading via any regex-extracted bare DOI numbers on the page or in the URL
$ scihub.py -d '10.1016/j.cub.2019.11.030'
DEBUG:Sci-Hub:Successfully downloaded file with identifier 10.1016/j.cub.2019.11.030
$ ls
c28dc1242df6f931c29b9cd445a55597-.cub.2019.11.030.pdf
New Dependencies:
New Extractors:
extractors/scihub.py
New Config Options:
SAVE_SCIHUB=True
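One possible shape for the proposed extractor, sketched under several assumptions: the function names `run_scihub` and `save_scihub`, the `enabled` parameter standing in for `SAVE_SCIHUB`, and the 60-second timeout are all hypothetical, and this is not ArchiveBox's actual extractor interface. It only relies on the `scihub.py -d <identifier>` invocation shown above, which writes the PDF into the current working directory.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, Optional


def run_scihub(identifier: str, out_dir: Path) -> bool:
    """Run `scihub.py -d <identifier>` inside out_dir; True if a new PDF appears."""
    before = set(out_dir.glob("*.pdf"))
    try:
        # Assumption: scihub.py is on PATH and saves into the working directory.
        subprocess.run(["scihub.py", "-d", identifier],
                       cwd=out_dir, timeout=60, check=False)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return bool(set(out_dir.glob("*.pdf")) - before)


def save_scihub(url: str,
                dois: Iterable[str],
                snapshot_dir: str,
                enabled: bool = True,
                download: Callable[[str, Path], bool] = run_scihub) -> Optional[str]:
    """Try the verbatim URL first, then each DOI; return the identifier that worked."""
    if not enabled:  # stands in for SAVE_SCIHUB=False
        return None
    out_dir = Path(snapshot_dir) / "scihub"
    out_dir.mkdir(parents=True, exist_ok=True)
    for identifier in [url, *dois]:
        if download(identifier, out_dir):
            return identifier
    return None
```

Passing the downloader in as a parameter keeps the fallback order testable without network access or a scihub.py install.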