
Bug: Parsing Wallabag RSS feed fails #971

@peterrus


Describe the bug

I have a setup where ArchiveBox (through a cronjob) fetches an RSS feed of my archived (i.e. read) articles from Wallabag.it and imports them. This gives me a redundant archive of everything I read in Wallabag. Overkill? Maybe.

Somewhere around 2022-03-31 the parsing of this RSS feed started to fail with the following error:

[+] [2022-05-01 10:03:42] Adding 5023 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1651399422-import.txt
[X] Error while loading link! [1642338726.0] <link rel="via" "The Tragic Optimism of Software Projects"
Traceback (most recent call last):
[ ... redacted ...]
  File "/app/archivebox/index/schema.py", line 165, in typecheck
    assert isinstance(self.url, str) and '://' in self.url
AssertionError
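To illustrate why the assertion trips, here is a minimal sketch that mirrors the check shown in the traceback. It assumes the value printed after the timestamp in the error line (`<link rel="via"`) is what ended up in the link's `url` field; `looks_like_url` is a hypothetical helper, not part of ArchiveBox.

```python
def looks_like_url(value):
    """Same condition as the assert in the traceback: a str containing '://'."""
    return isinstance(value, str) and '://' in value


# Value taken from the error line above (assumed to be the parsed "url").
bad_value = '<link rel="via"'
# Placeholder URL, for comparison only.
good_value = 'https://example.com/article'

print(looks_like_url(bad_value))
print(looks_like_url(good_value))
```

If that assumption holds, the parser extracted a fragment of the XML markup instead of the article URL, and the schema check rejected it.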

Shortly before 2022-03-31, Wallabag released a new version (https://github.com/wallabag/wallabag/releases/tag/2.4.3) that includes a PR modifying the format of the RSS feed it provides: wallabag/wallabag#5347. I suspect this is the culprit.

I am not exactly sure where the responsibility for fixing this lies, but I want to at least document that I ran into this in case someone else experiences a similar issue.
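As a possible direction, the sketch below reads the link of each feed entry with a real XML parser instead of relying on the exact text layout of the feed. It is only a sketch under assumptions: the sample entry is made up (the actual Wallabag 2.4.3 output may differ), the URLs are placeholders, `entry_urls` is a hypothetical helper, and the preference for `rel="via"` over `rel="alternate"` is a guess at which link points to the original article.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample entry; the real Wallabag feed may look different.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>The Tragic Optimism of Software Projects</title>
    <link rel="alternate" type="text/html" href="https://wallabag.example/view/1"/>
    <link rel="via" href="https://example.com/tragic-optimism"/>
  </entry>
</feed>"""


def entry_urls(feed_xml, preferred_rels=("via", "alternate")):
    """Yield (title, url) per entry, picking the first preferred rel that has an href."""
    root = ET.fromstring(feed_xml)
    for entry in root.findall("atom:entry", ATOM_NS):
        # Map each link's rel (Atom defaults a missing rel to "alternate") to its href.
        links = {
            link.get("rel", "alternate"): link.get("href")
            for link in entry.findall("atom:link", ATOM_NS)
        }
        url = next((links[rel] for rel in preferred_rels if links.get(rel)), None)
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        # Skip entries without a usable URL instead of failing later.
        if url and "://" in url:
            yield title, url


for title, url in entry_urls(SAMPLE_FEED):
    print(title, url)
```

Reading the `href` attribute through the parser means attribute order or line breaks in the feed would not change the result, which is the kind of formatting change the Wallabag PR may have introduced.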

Steps to reproduce

I have created a test account with one archived article on Wallabag.it. This account will expire in 14 days, but you can easily create a new one for testing purposes.

  1. curl https://app.wallabag.it/feed/dokafad/TDzxV9ejsZiWMq/archive | archivebox add --parser=wallabag_atom

Screenshots or log output

Full log

[+] [2022-05-01 10:03:42] Adding 5023 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1651399422-import.txt
[X] Error while loading link! [1642338726.0] <link rel="via" "The Tragic Optimism of Software Projects"
Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
  File "/app/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/app/archivebox/cli/archivebox_add.py", line 103, in main
    add(
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/main.py", line 588, in add
    new_links += parse_links_from_source(write_ahead_log, root_url=None, parser=parser)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/index/__init__.py", line 275, in parse_links_from_source
    raw_links, parser_name = parse_links(source_path, root_url=root_url, parser=parser)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/parsers/__init__.py", line 101, in parse_links
    links, parser = run_parser_functions(file, timer, root_url=root_url, parser=parser)
  File "/app/archivebox/parsers/__init__.py", line 115, in run_parser_functions
    parsed_links = list(parser_func(to_parse, root_url=root_url))
  File "/app/archivebox/parsers/wallabag_atom.py", line 51, in parse_wallabag_atom_export
    yield Link(
  File "<string>", line 11, in __init__
  File "/app/archivebox/index/schema.py", line 141, in __post_init__
    self.typecheck()
  File "/app/archivebox/index/schema.py", line 165, in typecheck
    assert isinstance(self.url, str) and '://' in self.url
AssertionError

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.13.0-39-generic-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9                                                    
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            8 files         valid     /data                                                                       
 √  SOURCES_DIR           948 files       valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           168 files       valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             1.9 MB          valid     ./index.sqlite3    
