
PSA: archiving large wallabag collections #233

@anarcat

Description

Wiki Page URL

I'm not sure where to put this. I found a mention of Wallabag in https://github.com/pirate/ArchiveBox/wiki/Usage#CLI-Usage and also in https://github.com/pirate/ArchiveBox#can-import-links-from-many-formats

Suggested Edit

I'm not sure how to phrase this either, but the former says I can export my list of URLs from Wallabag using the "Export" button. It doesn't say which format should be used, but in my case I have over 10,000 links archived in Wallabag, which makes the export impractical, to say the least: every one of the export buttons just fails with a blank page.

So I made this horrid script to pull all the links into JSON files. It's atrocious, but it works.

page=1
while http --check-status GET "https://lib3.net/wallabag/api/entries.json?perPage=100&page=$page" \
        'Authorization:Bearer [REDACTED]' > "entries-p$page.json"; do
    sleep 1
    page=$((page + 1))
    echo "fetching page $page"
done

How to get the Bearer token is explained under "How to create my first app" in the very intuitively named "API clients management" section, e.g. http://wallabag.example.com/developer/howto/first-app
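
If you want to script that part too, here is a minimal sketch of the token dance using the OAuth2 password grant the wallabag API provides; every value below is a placeholder (the client_id and client_secret come from that "API clients management" page):

#!/usr/bin/python3
# Minimal sketch: fetch a wallabag API token via the OAuth2 password
# grant. All credentials below are placeholders; client_id and
# client_secret come from the "API clients management" page.
import json
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({
    'grant_type': 'password',
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'username': 'YOUR_USER',
    'password': 'YOUR_PASSWORD',
}).encode()

with urllib.request.urlopen('https://wallabag.example.com/oauth/v2/token', data) as resp:
    print(json.load(resp)['access_token'])  # use as the Bearer token above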

Without --check-status, this loop would never complete, because the http client is too dumb to return a proper exit code when it hits a 404, and I was too lazy to work around that in the script. With the flag, just let it run for a while: http exits non-zero on the first page past the end and the loop stops on its own. --check-status FTW!
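
Note that the API response also carries its own pagination counters (page, limit, pages and total at the top level, if I read the output right), so a fetcher can stop on its own instead of probing for the 404. A rough Python equivalent of the loop above, under that assumption and with the same placeholder token:

#!/usr/bin/python3
# Rough equivalent of the shell loop: page through /api/entries.json,
# save each page as entries-pN.json, and stop at the 'pages' count the
# API reports (an assumption about the response shape; the instance
# URL and token are placeholders).
import json
import time
import urllib.request

URL = 'https://wallabag.example.com/api/entries.json?perPage=100&page=%d'
HEADERS = {'Authorization': 'Bearer [REDACTED]'}

page = pages = 1
while page <= pages:
    req = urllib.request.Request(URL % page, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        blob = json.load(resp)
    with open('entries-p%d.json' % page, 'w') as fp:
        json.dump(blob, fp)
    pages = blob['pages']  # total number of pages, per the API
    print('fetched page %d of %d' % (page, pages))
    page += 1
    time.sleep(1)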

Then the real fun begins! While the README says archivebox can import links from wallabag, it can't actually parse that JSON!

$ archivebox add pages/entries-p2.json 
    > ./sources/entries-p2.json-1557178820.txt

[*] [2019-05-06 21:40:20] Parsing new links from output/sources/entries-p2.json-1557178820.txt...
    > Parsed 0 links as Failed to parse (0 new links added)                                                                                                            

[*] [2019-05-06 21:40:20] Writing 101 links to main index...
    √ /srv/backup/archive/archivebox/index.sqlite3                                                                                                                     
    √ /srv/backup/archive/archivebox/index.json                                                                                                                        
    √ /srv/backup/archive/archivebox/index.html                                                                                                                        

[▶] [2019-05-06 21:40:21] Updating content for 0 matching pages in archive...

[√] [2019-05-06 21:40:21] Update of 0 pages complete (0.00 sec)
    - 0 links skipped
    - 0 links updated
    - 0 links had errors

    To view your archive, open:
        /srv/backup/archive/archivebox/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2019-05-06 21:40:21] Writing 101 links to main index...
    √ /srv/backup/archive/archivebox/index.sqlite3                                                                                                                     
    √ /srv/backup/archive/archivebox/index.json                                                                                                                        
    √ /srv/backup/archive/archivebox/index.html                         
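
For the curious, the failure is one of shape: the wallabag API wraps everything in pagination metadata and nests the actual entries under _embedded.items, roughly like this hand-written sketch (all values made up):

{
  "page": 2,
  "limit": 100,
  "pages": 42,
  "total": 4150,
  "_embedded": {
    "items": [
      {"url": "https://example.com/some-article", "title": "…"},
      {"url": "https://example.org/another-one", "title": "…"}
    ]
  }
}

which is presumably nothing like the flat list of links the generic JSON importer expects.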

Undeterred, and unwilling to re-download everything, I made this stupid Python script to extract the list of URLs:

#!/usr/bin/python3
"""Extract article URLs from wallabag API JSON dumps."""

import json
import sys


def find_urls(fp):
    """Yield the URL of each entry in one wallabag JSON page."""
    try:
        blob = json.load(fp)
    except json.decoder.JSONDecodeError:
        # skip files that aren't valid JSON (e.g. a failed fetch)
        return
    if '_embedded' not in blob:
        return
    for item in blob['_embedded']['items']:
        yield item['url']


def open_files(paths):
    """Yield URLs from each of the given JSON files, in order."""
    for path in paths:
        with open(path) as fp:
            yield from find_urls(fp)


def main():
    for url in open_files(sys.argv[1:]):
        print(url)


if __name__ == '__main__':
    main()

... which you call like this:

../wtf-wallabag.py $(ls | sort -tp -k2 -n) > ../wallabag.list

(the sort pipeline is there to order the files by page number: a plain ls would sort entries-p10.json before entries-p2.json, another oversight of my poor design)
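
Alternatively, the script could sort its own arguments; a hypothetical tweak to main() above, keyed on the entries-pN.json naming:

import re

def page_number(path):
    # pull N out of a .../entries-pN.json path; sort unmatched files first
    match = re.search(r'-p(\d+)\.json$', path)
    return int(match.group(1)) if match else 0

def main():
    for url in open_files(sorted(sys.argv[1:], key=page_number)):
        print(url)

after which a plain ../wtf-wallabag.py entries-p*.json > ../wallabag.list does the right thing.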

Then you can throw archivebox at wallabag.list (archivebox add wallabag.list) and let it churn for a long time.

I know this is not very useful in itself; it would be best to have this in a wiki page. But I truly didn't know where to put it in your carefully crafted wiki (nor whether I could edit it!), so I figured it would still be useful to post it here.
