Description
Describe the bug
I have a setup where (through a cronjob) ArchiveBox fetches an RSS feed of my archived (aka read) articles from Wallabag.it and imports them. This way I have a redundant archive of everything I read in Wallabag. Overkill? Maybe.
Somewhere around 2022-03-31 the parsing of this RSS feed started to fail with the following error:
[+] [2022-05-01 10:03:42] Adding 5023 links to index (crawl depth=0)...
> Saved verbatim input to sources/1651399422-import.txt
[X] Error while loading link! [1642338726.0] <link rel="via" "The Tragic Optimism of Software Projects"
Traceback (most recent call last):
[ ... redacted ...]
File "/app/archivebox/index/schema.py", line 165, in typecheck
assert isinstance(self.url, str) and '://' in self.url
AssertionError
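For reference, the failing check can be reproduced in isolation. This is only a sketch: the condition is copied from the `typecheck` line in the traceback, the first test value is the URL field printed in the error line above, and the helper name `passes_url_check` is made up for illustration.

```python
def passes_url_check(url: object) -> bool:
    """Mirror the condition asserted in schema.py (per the traceback)."""
    return isinstance(url, str) and "://" in url


# Value the parser produced for the URL field, taken from the error line.
print(passes_url_check('<link rel="via"'))              # False
# A normal article URL (hypothetical example) passes the same check.
print(passes_url_check("https://example.com/article"))  # True
```

So the parser seems to be handing the raw start of the `<link>` tag to `Link` instead of the article URL.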
Not long before 2022-03-31, Wallabag released a new version, https://github.com/wallabag/wallabag/releases/tag/2.4.3, which includes a PR that modifies the formatting of the RSS feed it provides: wallabag/wallabag#5347. I suspect this to be the culprit.
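As a possible workaround, the article URLs could be pulled out of the feed with a real XML parser, which should not care how the `<link rel="via">` element is laid out. This is a rough sketch, not ArchiveBox's parser. It assumes the feed declares the standard Atom namespace and that the article URL is either the text of the `via` link or its `href` attribute; I have not confirmed which form the new Wallabag release emits. `extract_via_urls` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: the feed uses the standard Atom namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def extract_via_urls(feed_xml: str) -> list:
    """Return the rel="via" URL of every <entry> in an Atom feed string."""
    root = ET.fromstring(feed_xml)
    urls = []
    for entry in root.findall("atom:entry", ATOM_NS):
        for link in entry.findall("atom:link", ATOM_NS):
            if link.get("rel") != "via":
                continue
            # Accept the URL as an href attribute or as element text
            # (both layouts are assumptions about the feed format).
            url = (link.get("href") or link.text or "").strip()
            if "://" in url:
                urls.append(url)
    return urls
```

Printing the result one URL per line and piping that into `archivebox add` would sidestep the `wallabag_atom` parser entirely, at the cost of losing the titles and timestamps from the feed.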
I am not exactly sure where the responsibility for fixing this lies, but I want to at least document that I ran into this in case someone else experiences a similar issue.
Steps to reproduce
I have created a test account with one archived article on Wallabag.it. This account will expire in 14 days, but you can easily create a new one for testing purposes.
curl https://app.wallabag.it/feed/dokafad/TDzxV9ejsZiWMq/archive | archivebox add --parser=wallabag_atom
Screenshots or log output
Full log
[+] [2022-05-01 10:03:42] Adding 5023 links to index (crawl depth=0)...
> Saved verbatim input to sources/1651399422-import.txt
[X] Error while loading link! [1642338726.0] <link rel="via" "The Tragic Optimism of Software Projects"
Traceback (most recent call last):
File "/usr/local/bin/archivebox", line 33, in <module>
sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
File "/app/archivebox/cli/__init__.py", line 140, in main
run_subcommand(
File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
module.main(args=subcommand_args, stdin=stdin, pwd=pwd) # type: ignore
File "/app/archivebox/cli/archivebox_add.py", line 103, in main
add(
File "/app/archivebox/util.py", line 114, in typechecked_function
return func(*args, **kwargs)
File "/app/archivebox/main.py", line 588, in add
new_links += parse_links_from_source(write_ahead_log, root_url=None, parser=parser)
File "/app/archivebox/util.py", line 114, in typechecked_function
return func(*args, **kwargs)
File "/app/archivebox/index/__init__.py", line 275, in parse_links_from_source
raw_links, parser_name = parse_links(source_path, root_url=root_url, parser=parser)
File "/app/archivebox/util.py", line 114, in typechecked_function
return func(*args, **kwargs)
File "/app/archivebox/parsers/__init__.py", line 101, in parse_links
links, parser = run_parser_functions(file, timer, root_url=root_url, parser=parser)
File "/app/archivebox/parsers/__init__.py", line 115, in run_parser_functions
parsed_links = list(parser_func(to_parse, root_url=root_url))
File "/app/archivebox/parsers/wallabag_atom.py", line 51, in parse_wallabag_atom_export
yield Link(
File "<string>", line 11, in __init__
File "/app/archivebox/index/schema.py", line 141, in __post_init__
self.typecheck()
File "/app/archivebox/index/schema.py", line 165, in typecheck
assert isinstance(self.url, str) and '://' in self.url
AssertionError
ArchiveBox version
ArchiveBox v0.6.2
Cpython Linux Linux-5.13.0-39-generic-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep
[i] Dependency versions:
√ ARCHIVEBOX_BINARY v0.6.2 valid /usr/local/bin/archivebox
√ PYTHON_BINARY v3.9.5 valid /usr/local/bin/python3.9
√ DJANGO_BINARY v3.1.10 valid /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
√ CURL_BINARY v7.64.0 valid /usr/bin/curl
√ WGET_BINARY v1.20.1 valid /usr/bin/wget
√ NODE_BINARY v15.14.0 valid /usr/bin/node
√ SINGLEFILE_BINARY v0.3.16 valid /node/node_modules/single-file/cli/single-file
√ READABILITY_BINARY v0.0.2 valid /node/node_modules/readability-extractor/readability-extractor
√ MERCURY_BINARY v1.0.0 valid /node/node_modules/@postlight/mercury-parser/cli.js
√ GIT_BINARY v2.20.1 valid /usr/bin/git
√ YOUTUBEDL_BINARY v2021.04.26 valid /usr/local/bin/youtube-dl
√ CHROME_BINARY v90.0.4430.93 valid /usr/bin/chromium
√ RIPGREP_BINARY v0.10.0 valid /usr/bin/rg
[i] Source-code locations:
√ PACKAGE_DIR 23 files valid /app/archivebox
√ TEMPLATES_DIR 3 files valid /app/archivebox/templates
- CUSTOM_TEMPLATES_DIR - disabled
[i] Secrets locations:
- CHROME_USER_DATA_DIR - disabled
- COOKIES_FILE - disabled
[i] Data locations:
√ OUTPUT_DIR 8 files valid /data
√ SOURCES_DIR 948 files valid ./sources
√ LOGS_DIR 1 files valid ./logs
√ ARCHIVE_DIR 168 files valid ./archive
√ CONFIG_FILE 81.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 1.9 MB valid ./index.sqlite3