
PSA: archiving large wallabag collections #233

@anarcat

Description

Wiki Page URL

I'm not sure where to put this. I found a mention of Wallabag in https://github.com/pirate/ArchiveBox/wiki/Usage#CLI-Usage and also in https://github.com/pirate/ArchiveBox#can-import-links-from-many-formats

Suggested Edit

I'm not sure how to phrase this either, but the former says I can export my list of URLs from Wallabag using the "Export" button. It doesn't say which format should be used, but in my case I have over 10,000 links archived in Wallabag, which makes the export impractical, to say the least: every one of the export buttons just fails with a blank page.

So I made this horrid script to pull all the links into JSON files. It's atrocious, but it works.

page=1
while http --check-status GET "https://lib3.net/wallabag/api/entries.json?perPage=100&page=$page" \
        'Authorization:Bearer [REDACTED]' > "entries-p$page.json"; do
    sleep 1
    page=$((page + 1))
    echo "fetching page $page"
done

How to get the Bearer token is explained under "How to create my first app" in the very intuitively named "API clients management" section, e.g. http://wallabag.example.com/developer/howto/first-app
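
If you want to script that part too, here is a minimal sketch of the token dance using the OAuth2 password grant the wallabag API provides; every value below is a placeholder (the client_id and client_secret come from that "API clients management" page):

#!/usr/bin/python3
# Minimal sketch: fetch a wallabag API token via the OAuth2 password
# grant. All credentials below are placeholders; client_id and
# client_secret come from the "API clients management" page.
import json
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({
    'grant_type': 'password',
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'username': 'YOUR_USER',
    'password': 'YOUR_PASSWORD',
}).encode()

with urllib.request.urlopen('https://wallabag.example.com/oauth/v2/token', data) as resp:
    print(json.load(resp)['access_token'])  # use as the Bearer token above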

Without --check-status, this loop would never complete, because the http client is too dumb to return a proper exit code when it hits a 404, and I was too lazy to work around that in the script. With the flag, just let it run for a while: http exits non-zero on the first page past the end and the loop stops on its own. --check-status FTW!
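
Note that the API response also carries its own pagination counters (page, limit, pages and total at the top level, if I read the output right), so a fetcher can stop on its own instead of probing for the 404. A rough Python equivalent of the loop above, under that assumption and with the same placeholder token:

#!/usr/bin/python3
# Rough equivalent of the shell loop: page through /api/entries.json,
# save each page as entries-pN.json, and stop at the 'pages' count the
# API reports (an assumption about the response shape; the instance
# URL and token are placeholders).
import json
import time
import urllib.request

URL = 'https://wallabag.example.com/api/entries.json?perPage=100&page=%d'
HEADERS = {'Authorization': 'Bearer [REDACTED]'}

page = pages = 1
while page <= pages:
    req = urllib.request.Request(URL % page, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        blob = json.load(resp)
    with open('entries-p%d.json' % page, 'w') as fp:
        json.dump(blob, fp)
    pages = blob['pages']  # total number of pages, per the API
    print('fetched page %d of %d' % (page, pages))
    page += 1
    time.sleep(1)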

Then the real fun begins! While the README says archivebox can import links from wallabag, it can't actually parse that JSON!

$ archivebox add pages/entries-p2.json 
    > ./sources/entries-p2.json-1557178820.txt

[*] [2019-05-06 21:40:20] Parsing new links from output/sources/entries-p2.json-1557178820.txt...
    > Parsed 0 links as Failed to parse (0 new links added)                                                                                                            

[*] [2019-05-06 21:40:20] Writing 101 links to main index...
    √ /srv/backup/archive/archivebox/index.sqlite3                                                                                                                     
    √ /srv/backup/archive/archivebox/index.json                                                                                                                        
    √ /srv/backup/archive/archivebox/index.html                                                                                                                        

[▶] [2019-05-06 21:40:21] Updating content for 0 matching pages in archive...

[√] [2019-05-06 21:40:21] Update of 0 pages complete (0.00 sec)
    - 0 links skipped
    - 0 links updated
    - 0 links had errors

    To view your archive, open:
        /srv/backup/archive/archivebox/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2019-05-06 21:40:21] Writing 101 links to main index...
    √ /srv/backup/archive/archivebox/index.sqlite3                                                                                                                     
    √ /srv/backup/archive/archivebox/index.json                                                                                                                        
    √ /srv/backup/archive/archivebox/index.html                         
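
For the curious, the failure is one of shape: the wallabag API wraps everything in pagination metadata and nests the actual entries under _embedded.items, roughly like this hand-written sketch (all values made up):

{
  "page": 2,
  "limit": 100,
  "pages": 42,
  "total": 4150,
  "_embedded": {
    "items": [
      {"url": "https://example.com/some-article", "title": "…"},
      {"url": "https://example.org/another-one", "title": "…"}
    ]
  }
}

which is presumably nothing like the flat list of links the generic JSON importer expects.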

Undeterred, and unwilling to re-download everything, I made this stupid Python script to extract the list of URLs:

#!/usr/bin/python3
"""Extract article URLs from wallabag API JSON dumps."""

import json
import sys


def find_urls(fp):
    """Yield the URL of each entry in one wallabag JSON page."""
    try:
        blob = json.load(fp)
    except json.decoder.JSONDecodeError:
        # skip files that aren't valid JSON (e.g. a failed fetch)
        return
    if '_embedded' not in blob:
        return
    for item in blob['_embedded']['items']:
        yield item['url']


def open_files(paths):
    """Yield URLs from each of the given JSON files, in order."""
    for path in paths:
        with open(path) as fp:
            yield from find_urls(fp)


def main():
    for url in open_files(sys.argv[1:]):
        print(url)


if __name__ == '__main__':
    main()

... which you call like this:

../wtf-wallabag.py $(ls | sort -tp -k2 -n) > ../wallabag.list

(the sort pipeline is there to order the files by page number: a plain ls would sort entries-p10.json before entries-p2.json, another oversight of my poor design)
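
Alternatively, the script could sort its own arguments; a hypothetical tweak to main() above, keyed on the entries-pN.json naming:

import re

def page_number(path):
    # pull N out of a .../entries-pN.json path; sort unmatched files first
    match = re.search(r'-p(\d+)\.json$', path)
    return int(match.group(1)) if match else 0

def main():
    for url in open_files(sorted(sys.argv[1:], key=page_number)):
        print(url)

after which a plain ../wtf-wallabag.py entries-p*.json > ../wallabag.list does the right thing.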

Then you can throw archivebox at wallabag.list (archivebox add wallabag.list) and let it churn for a long time.

I know this is not very useful in itself; it would be best to have this in a wiki page. But I truly didn't know where to put it in your carefully crafted wiki (nor whether I could edit it!), so I figured it would still be useful to post it here.
