Wiki Page URL
I'm not sure where to put this... I found a mention of Wallabag in https://github.com/pirate/ArchiveBox/wiki/Usage#CLI-Usage, but also in https://github.com/pirate/ArchiveBox#can-import-links-from-many-formats
Suggested Edit
I'm not sure how to phrase this either, but the former says I can export my list of URLs from Wallabag using the "Export" button. It doesn't say which format should be used, and in my case I have over 10,000 links archived in Wallabag, which makes the export impractical, to say the least: every export button just fails with a blank page.
So I made this horrid script to pull all the links into JSON files. It's atrocious, but it works.
```sh
page=1
while http --check-status GET 'https://lib3.net/wallabag/api/entries.json?perPage=100&page='$page 'Authorization:Bearer [REDACTED]' > entries-p$page.json; do
    sleep 1
    page=$(($page + 1))
    echo "fetching page $page"
done
```
How to get the Bearer token is explained under "How to create my first app" in the very intuitively named "API clients management" section of your Wallabag instance, e.g. http://wallabag.example.com/developer/howto/first-app.
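If you'd rather script that step too, something like this should work against a Wallabag 2.x instance (a rough, untested sketch using the requests library; the client_id / client_secret / username / password values are placeholders you get from that API clients page):

```python
# Sketch: fetch a Bearer token from Wallabag's OAuth endpoint.
# All credential values below are placeholders, not real ones.
import requests

WALLABAG_URL = "http://wallabag.example.com"

resp = requests.post(
    WALLABAG_URL + "/oauth/v2/token",
    data={
        "grant_type": "password",
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "username": "YOUR_USERNAME",
        "password": "YOUR_PASSWORD",
    },
)
resp.raise_for_status()
print(resp.json()["access_token"])
```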
This loop will never complete, because the HTTP client is too dumb to return proper exit codes when it hits a 404, and I was too lazy to fix that in the script. Just let it run for a while until you start seeing files like this appear. --check-status FTW!
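For what it's worth, the same pagination can stop on its own if you check whether a page actually came back with items. A rough Python sketch of the idea (same endpoint and entries-pN.json naming as the shell loop above, token redacted):

```python
# Sketch: page through /api/entries.json and stop when a page is empty
# or the API stops returning 200, instead of relying on exit codes.
import json
import requests

WALLABAG_URL = "https://lib3.net/wallabag"
TOKEN = "REDACTED"

page = 1
while True:
    resp = requests.get(
        WALLABAG_URL + "/api/entries.json",
        params={"perPage": 100, "page": page},
        headers={"Authorization": "Bearer " + TOKEN},
    )
    if resp.status_code != 200:
        break
    blob = resp.json()
    items = blob.get("_embedded", {}).get("items", [])
    if not items:
        break
    with open("entries-p%d.json" % page, "w") as fp:
        json.dump(blob, fp)
    print("fetched page", page)
    page += 1
```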
Then the real fun begins! While the docs say ArchiveBox can parse stuff from Wallabag, it can't actually parse that JSON:
```
$ archivebox add pages/entries-p2.json
    > ./sources/entries-p2.json-1557178820.txt
[*] [2019-05-06 21:40:20] Parsing new links from output/sources/entries-p2.json-1557178820.txt...
    > Parsed 0 links as Failed to parse (0 new links added)
[*] [2019-05-06 21:40:20] Writing 101 links to main index...
    √ /srv/backup/archive/archivebox/index.sqlite3
    √ /srv/backup/archive/archivebox/index.json
    √ /srv/backup/archive/archivebox/index.html
[▶] [2019-05-06 21:40:21] Updating content for 0 matching pages in archive...
[√] [2019-05-06 21:40:21] Update of 0 pages complete (0.00 sec)
    - 0 links skipped
    - 0 links updated
    - 0 links had errors
    To view your archive, open:
        /srv/backup/archive/archivebox/index.html
    Or run the built-in webserver:
        archivebox server
[*] [2019-05-06 21:40:21] Writing 101 links to main index...
    √ /srv/backup/archive/archivebox/index.sqlite3
    √ /srv/backup/archive/archivebox/index.json
    √ /srv/backup/archive/archivebox/index.html
```
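For context, each of those entries-pN.json files is one page of a paginated API response, and the URLs live under _embedded.items. Roughly this shape (a trimmed sketch with made-up numbers; the real payload has many more fields):

```python
# Approximate shape of one page from /api/entries.json (illustrative values only):
example_page = {
    "page": 2,
    "limit": 100,
    "pages": 105,
    "total": 10432,
    "_embedded": {
        "items": [
            {"id": 1234, "url": "https://example.com/some-article", "title": "..."},
            # ...up to perPage entries per page
        ]
    },
}
```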
Undeterred, and unwilling to re-download everything, I made this stupid Python script to generate a list of URLs:
```python
#!/usr/bin/python3
import json
import sys


def find_urls(fp):
    """Yield the URL of every entry in one page of Wallabag's JSON export."""
    try:
        blob = json.load(fp)
    except json.decoder.JSONDecodeError:
        return
    if '_embedded' not in blob:
        return
    for item in blob['_embedded']['items']:
        yield item['url']


def open_files(paths):
    """Yield URLs from every file path given."""
    for path in paths:
        with open(path) as fp:
            for url in find_urls(fp):
                yield url


def main():
    for url in open_files(sys.argv[1:]):
        print(url)


if __name__ == '__main__':
    main()
```
... which you call like this:
```sh
../wtf-wallabag.py $(ls | sort -tp -k2 -n) > ../wallabag.list
```
(the sort pipeline is there to sort the list of files by page number, another oversight of my poor design)
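If I were less lazy, the script could do the numeric sort itself, something like this (a quick sketch assuming the entries-pN.json naming from the download loop; in the real script you'd pass the sorted paths to open_files()):

```python
#!/usr/bin/python3
# Sketch: sort the entries-pN.json paths numerically inside the script,
# so a plain glob (../wtf-wallabag.py entries-p*.json) would work without
# the sort pipeline.
import re
import sys

def page_number(path):
    match = re.search(r"entries-p(\d+)\.json$", path)
    return int(match.group(1)) if match else 0

for path in sorted(sys.argv[1:], key=page_number):
    print(path)
```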
Then you can throw archivebox add at wallabag.list and let it churn for a long time.
I know this is not very useful in itself - it would be best to have this in a wiki page. But I truly didn't know where to put it in your carefully crafted wiki (nor whether I could!), so I figured it would still be useful to post it here.