Labels: help wanted, size: easy, why: functionality (intended to improve ArchiveBox functionality or features)
Describe the bug
I'm setting up Pocket importing for the first time, which means I'm importing a lot of old links, some of them on now-defunct websites. When one of those links fails, the entire import fails, but the `since` value in pocket_api.db is still set, so when I try to re-import my Pocket feed it only retrieves new items, leaving me with no URLs archived.
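One way to fix the stranded-cursor half of this bug would be to advance the stored `since` value only after the import succeeds. A minimal sketch of that pattern, using hypothetical names (`fetch_items`, `load_cursor`, `save_cursor`, and a JSON stand-in for pocket_api.db), not ArchiveBox's actual internals:

```python
# Sketch: persist the Pocket `since` cursor only after the whole import
# succeeds, so a failed run can simply be retried with the old cursor.
# All names here are hypothetical stand-ins, not ArchiveBox's real API.
import json
from pathlib import Path

CURSOR_FILE = Path("pocket_api_cursor.json")  # stand-in for pocket_api.db

def load_cursor() -> int:
    """Return the last saved `since` timestamp, or 0 for a full import."""
    if CURSOR_FILE.exists():
        return json.loads(CURSOR_FILE.read_text())["since"]
    return 0

def save_cursor(since: int) -> None:
    CURSOR_FILE.write_text(json.dumps({"since": since}))

def import_pocket(fetch_items, new_since: int) -> list:
    """fetch_items(since) returns a list of URLs and may raise on failure."""
    since = load_cursor()
    urls = fetch_items(since)  # if this raises, the cursor is left untouched
    save_cursor(new_since)     # only advance the cursor once we have the URLs
    return urls
```

With this ordering, the failed run described above would leave `since` at its old value, and the retry would still see the full backlog.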
Steps to reproduce
- Set up the Pocket config, as described in #528 (Add parser for Pocket API)
- Have a URL in Pocket on a domain that refuses connections or no longer exists
- Import from Pocket:
```
$ archivebox add --depth=1 pocket://myUserName
[+] [2021-04-28 14:00:05] Adding 1 links to index (crawl depth=1)...
    > Saved verbatim input to sources/1619618411-import.txt
    > Parsed 169 URLs from input (Pocket API)
[*] Starting crawl of 169 sites 1 hop out from starting point
    > Downloading http://my-working-url.com/ contents
    > Saved verbatim input to sources/1619618411-crawl-my-working-url.com.txt
    > Parsed 12 URLs from input (Generic TXT)
    > Downloading http://my-defunct-url.com/ contents
[!] Failed to download http://my-defunct-url.com/
    HTTPConnectionPool(host='my-defunct-url.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xb4dbf9b8>: Failed to establish a new connection: [Errno -2] Name or service not known'))
```
- Remove the broken URL from Pocket
- Try importing again:
```
$ archivebox add --depth=1 pocket://myUserName
[+] [2021-04-28 14:39:35] Adding 1 links to index (crawl depth=1)...
    > Saved verbatim input to sources/1619620775-import.txt
    0.1% (0/240sec)
[X] No links found using Pocket API parser
    Hint: Try a different parser or double check the input?
    > Parsed 0 URLs from input (Pocket API)
    > Found 0 new URLs not already in index
[*] [2021-04-28 14:39:35] Writing 0 links to main index...
    √ ./index.sqlite3
```
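The other half of the bug is that a single dead domain aborts the whole depth=1 crawl. A minimal sketch of per-URL error handling that would let the crawl log the failure and continue, where `download` and the URL list are stand-ins rather than ArchiveBox internals:

```python
# Sketch: wrap each per-URL download in try/except so one dead domain is
# logged and skipped instead of aborting the entire crawl.
# `download` is a hypothetical stand-in, not an ArchiveBox function.

def crawl(urls, download):
    """Return (succeeded, failed) instead of dying on the first error."""
    succeeded, failed = [], []
    for url in urls:
        try:
            succeeded.append((url, download(url)))
        except Exception as exc:  # e.g. ConnectionError for a defunct host
            print(f"[!] Failed to download {url}: {exc}")
            failed.append(url)
    return succeeded, failed
```

Combined with advancing the `since` cursor only on success, a run like the one above would archive the 168 working sites and report the one failure, instead of leaving nothing archived.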
ArchiveBox version
```
ArchiveBox v0.6.2
Cpython Linux Linux-5.4.79-v7l+-armv7l-with-glibc2.28 armv7l
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid  /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.4          valid  /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.8          valid  /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid  /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid  /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid  /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid  /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid  /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid  /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid  /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.04.07     valid  /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v89.0.4389.114  valid  /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid  /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid  /app/archivebox
 √  TEMPLATES_DIR         3 files         valid  /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid  /data
 √  SOURCES_DIR           28 files        valid  ./sources
 √  LOGS_DIR              1 files         valid  ./logs
 √  ARCHIVE_DIR           0 files         valid  ./archive
 √  CONFIG_FILE           204.0 Bytes     valid  ./ArchiveBox.conf
 √  SQL_INDEX             204.0 KB        valid  ./index.sqlite3
```
Reported by bltavares