30k articles wallabag import report #1119

Open
opened 2026-02-26 17:39:36 +01:00 by anarcat · 2 comments

What happened?

I have just migrated from Wallabag to Readeck, and I'm really happy!

I'm not exactly sure how to file this: I was told to summarize the "stream of consciousness" comments I made on Matrix somewhere, and figured this was as good a place as any...

The context is that I was using a wallabag instance self-hosted by a friend at another location, on another residential connection. The import went like this:

  • 15:24 import started
  • 15:46 28043/28166 imported, extraction in progress (+22 minutes)
  • 20:28 import finished (+5h 4min)

I have the import log on file here, but I'm a little hesitant to share it publicly because (1) it's huge and (2) it probably contains some private information (like the stuff I read!). Here's a quick analysis of what it contains:

root@marcos:/etc/docker/readeck# cat wallabag-import-2026-02-26T01\:04Z.log | grep ' ERR ' | sed 's/.* ERR //;s/@.*//' | sort | uniq -c | sort
      4 no body found in document 
     22 error during extraction 
    109 content script error 
    191 could not extract content 
    822 cannot load resource 
root@marcos:/etc/docker/readeck# cat wallabag-import-2026-02-26T01\:04Z.log | grep ' WRN ' | sed 's/.* WRN //;s/@.*//' | sort | uniq -c | sort
      2 documentLoaded 
      2 unsupported JSON-LD structure 
      4 Could not load notes. 
      4 error decoding JSON-LD 
      4 parse microdata 
      9 oembed error 
     93 extract link read error 
    107 processMeta 
    286 open image 
    591 failed to fetch resource 
   2319 cannot load picture 
   3795 invalid image size 
  29338 couldn't find image candidate 
  40417 cannot load image 
  87616 extract link fetch error 

None of those counts matches the "123" figure of missing bookmarks (detailed below).

What outcome was expected?

Overall, I'm really impressed with the whole thing. Everything went smoothly: the remote host didn't experience any trouble, and everything went much faster than I expected (I thought it would take at least 12h if not multiple days).

I have a couple of comments regarding the usability of Readeck in general, and the import in particular. I am not sure if those should be filed as separate issues; quite frankly, perhaps this whole issue can be closed, as it's mostly a "wow, that works really well!" issue. Do let me know.

Here are the issues that could be improved:

  • not all bookmarks appear to have been imported: 123 are missing, and I don't know which. How can I make a list of all the URLs in readeck (and wallabag, for that matter)? I still have the logs of the import, but they're massive, so I'm not sure they hold that information.
  • the mobile readeck interface could refocus the text field after tapping "add": right now, to add labels to a bookmark in the web interface on my phone, I need to:
    1. type the label
    2. tap "add label"
    3. tap back in the text field to refocus
    4. go back to step 1
      I think step 3 could be skipped. Note that this doesn't affect desktop usage when using only the keyboard: if you hit "enter" to add the label, you keep the focus.
  • when the import completes, I was hoping there would be feedback on the number of articles imported, how many failed, and so on. Instead, the web interface only says "completed" and this line is logged: 2026-02-26T01:03:04.796261183Z INF import finished @id=08bb0286/c949-00000021 bookmark_id=25496. This could include the total run time, how many articles were imported, extracted, failed to extract, and so on. The web interface could also trickle all of that up to the user.
  • the import progress bar looked... weird. It went forward a bit, then got stuck there and essentially stopped for 5 hours. I don't actually know if it moved, but it certainly looked stuck for at least half an hour. It could have given progress on the different stages of the import: for example, above, all the articles were effectively there after 20 minutes, and we were in the "extraction" stage. This doesn't matter for "small" imports like mine, but it will be much more significant for larger collections.

Again, let me know if I can/should file issues about this, and I welcome any comments or help, particularly about those pesky 123 items missing. ;)

Note that during my research, I also found a few (existing) bugs with eckard as well, which I note here for my own posterity:

  • overlap in the UI: https://codeberg.org/gollyhatch/eckard/issues/3
  • add a justified formatting: https://codeberg.org/gollyhatch/eckard/issues/19
  • more metadata display: https://codeberg.org/gollyhatch/eckard/issues/20

Relevant log output

No response

OS you installed Readeck on

Linux

Member

Hi, thanks for the feedback! I'll address more of the items you brought up in due time, but for now:

  • how can i make a list of all the URLs in readeck? (and wallabag, for that matter?)

Here's a tiny wallabag API script for you https://gist.github.com/mislav/0ebe0da24f9510f609db2a86a9d711f4 (requires curl, jq)

You could invoke it like so to extract certain fields, like entry id, article url, and the size of article contents:

bash api.sh 'entries?perPage=100' | \
  jq -r '.[] | [.id, .url, (if (.content | test("^wallabag")) then 0 else (.content | length) end)] | @tsv'

This snippet iterates through all of your wallabag entries and outputs their information in tab-separated format. You can pipe that to a file and load it up in spreadsheet software if you want an overview of your library.
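To make the filter concrete, here's what it does on two made-up entries: the test("^wallabag") clause reports a length of 0 when the cached content starts with the literal string "wallabag" (presumably a fetch-failure placeholder), and the real content length otherwise:

```shell
# Demo of the jq filter on two fabricated entries: entry 1 simulates a
# wallabag fetch-failure placeholder (length reported as 0), entry 2
# has real cached content (length 9).
echo '[{"id":1,"url":"http://a.example","content":"wallabag placeholder"},
      {"id":2,"url":"http://b.example","content":"<p>hi</p>"}]' |
  jq -r '.[] | [.id, .url, (if (.content | test("^wallabag")) then 0 else (.content | length) end)] | @tsv'
```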

Of course, I don't expect you to ever share your full reading list, but I'd be particularly interested in the number of articles in your library that have no cached article contents. That would cause Readeck to refetch the original url during import, and if that failed due to HTTP 404 or network or other issues, it would lead to some of the errors you were seeing in your logs:

  • "could not extract content"
  • "cannot load resource"
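If it helps narrow that down, a hypothetical follow-up to the pipeline above could count those entries directly (rows where the third TSV column is 0):

```shell
# Count wallabag entries with no cached content, i.e. rows where the
# third tab-separated column (content length) is 0.
bash api.sh 'entries?perPage=100' | \
  jq -r '.[] | [.id, .url, (if (.content | test("^wallabag")) then 0 else (.content | length) end)] | @tsv' | \
  awk -F'\t' '$3 == 0 { n++ } END { print n + 0 }'
```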

"content script error"

I'm not sure if it's desirable for content scripts to run on cached versions of articles from wallabag 🤔 Wallabag stores readable versions of articles, not the original HTML markup, so some content scripts are bound to fail since they were designed to run on the original markup of the page. Maybe we should disable content scripts when importing cached contents?

Author

Here's a tiny wallabag API script for you https://gist.github.com/mislav/0ebe0da24f9510f609db2a86a9d711f4 (requires curl, jq)

Fantastic, that works well!

... but what about the readeck side? :)

It seems like it's something like:

#!/bin/sh

set -e

offset=0

tmpfile=$(mktemp)
while true; do
    curl -sSL -f -X GET "https://readeck.anarc.at/api/bookmarks?limit=100&offset=$offset" \
         -H 'accept: application/json' -H "Authorization: Bearer $TOKEN" > "$tmpfile"
    # stop once a page comes back empty, otherwise this loops forever
    [ "$(jq length < "$tmpfile")" -gt 0 ] || break
    jq -r '.[].url' < "$tmpfile"
    offset=$((offset + 100))
done
rm -f "$tmpfile"

Unfortunately, that doesn't quite work, as it seems the wallabag API flakes out:

> wc -l wallabag-urls.txt readeck-urls.txt
  15916 wallabag-urls.txt
  28034 readeck-urls.txt

i.e. the API script only finds 15k articles. I had a little more success with the simpler:

bash api.sh 'entries?perPage=100' | jq -r .[].url > urls-wallabag2.txt

With that, I get:

  28034 urls-readeck.txt
  28169 urls-wallabag2.txt

Maybe I just don't understand your jq invocation there. :)

One thing that's interesting here is that I feel the readeck count is wrong. Remember how I reported readeck imported 28043 items from wallabag? That's essentially the number above. This sounds great until you know (like me) that I have actually added a lot more URLs in there since I started! In other words, there should be more than 28034 URLs in readeck now, even though the user interface reports that number.

So that's really strange, and a bit concerning. If I add https://example.com to readeck right now, the count does increment to 28035, so I'm not sure what's going on.

Here's another interesting thing: I tried to sort the URL lists and then remove duplicates. Below, the -sorted files are just sorted, and the -sorted-uniq files are deduplicated:

anarcat@angela:~/bin> wc -l urls-*
  27957 urls-readeck-sorted-uniq.txt
  28034 urls-readeck-sorted.txt
  28034 urls-readeck.txt
  28042 urls-wallabag2-sorted-uniq.txt
  28169 urls-wallabag2-sorted.txt
  28169 urls-wallabag2.txt

So, of course, just sorting the files doesn't change anything, but removing duplicates deduplicates more entries in readeck than in wallabag, which I find just bizarre.
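As an aside, for the actual set difference between the two deduplicated lists, comm is a bit easier to read than diff; both inputs must be sorted, which the -sorted-uniq files are:

```shell
# URLs present in the wallabag list but absent from the readeck list:
# -2 suppresses readeck-only lines, -3 suppresses lines common to both.
comm -23 urls-wallabag2-sorted-uniq.txt urls-readeck-sorted-uniq.txt
```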

Grepping around the diff, I can see a few reasons for the discrepancies:

  1. wallabag and readeck normalize URIs differently, for example:
     • wallabag: http://yro.slashdot.org/story/10/01/09/0341208/Politicians-Worldwide-Asking-Questions-About-ACTA?from=rss&amp;utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed%3A+Slashdot%2Fslashdot+%28Slashdot%29
     • readeck: http://yro.slashdot.org/story/10/01/09/0341208/Politicians-Worldwide-Asking-Questions-About-ACTA?from=rss&amp;utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed:+Slashdot/slashdot+(Slashdot)
  2. readeck removes anchors, for example:
     • wallabag: http://xtalk.msk.su/~ott/en/writings/emacs-devenv/EmacsCedet.html#sec8
     • readeck: http://xtalk.msk.su/~ott/en/writings/emacs-devenv/EmacsCedet.html
  3. readeck normalized on https, for example:
     • wallabag: http://kedpm.sourceforge.net
     • readeck: http://kedpm.sourceforge.net/
  4. readeck normalizes on a trailing slash, but not always!
  5. readeck normalized some redirections

Another mind-boggling thing is that readeck seems to have URLs that are not in wallabag, and that I don't even remember adding. For example, I have a bookmark for some random control panel that I added on 12 April 2007, according to readeck, yet it's not in the URL list according to wallabag's API.

It's pretty hard to come up with a definitive list of what's missing. Here's the final diff, after removing anchors (but not encoding issues, those are harder):

anarcat@angela:~/bin[1]> diff -u urls-wallabag2-sorted-uniq-noanchor.txt urls-readeck-sorted-uniq.txt | diffstat 
 urls-readeck-sorted-uniq.txt | 1617 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------------------------------------------------------------------
 1 file changed, 774 insertions(+), 843 deletions(-)

That's over 800 bookmarks to inspect! After filtering some of the trailing slash and https stuff, I got down to:

anarcat@angela:~/bin[1]> diff -u urls-wallabag2-sorted-uniq-noanchor-https.txt urls-readeck-sorted-uniq-https.txt | diffstat
 urls-readeck-sorted-uniq-https.txt | 1212 +++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------
 1 file changed, 572 insertions(+), 640 deletions(-)

... which is still a lot!
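For the record, that kind of filtering (strip anchors, unify the scheme, drop trailing slashes) can be approximated with sed; the file name is just an example here, and these rules are rougher than whatever readeck actually does:

```shell
# Rough URL normalization before diffing: drop #fragments, force
# https, strip a trailing slash, then sort and deduplicate.
sed -e 's/#.*//' -e 's|^http:|https:|' -e 's|/$||' urls.txt | sort -u
```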

I suspect a lot of the failed links are links that readeck can't read anymore. For example, one of the missing links is https://framagit.org/abakkk/DrawOnYourScreen. That, indeed, does not load anymore and is not in the URL list returned by the readeck API. But, much more strangely, if I search for DrawOnYourScreen in readeck, I do find a bookmark! There, it's tagged as https://framagit.org/users/sign_in.

So I guess one thing we could do, when importing, is to avoid following redirects: those are likely dead links that, instead of returning a proper 403 or 404, return you to a login or parked domain page.

At this point, I'm not sure I want to dig into this any deeper. It's pretty hard to figure out exactly what's going on between the two, and I still can't quite figure out whether I even really have 123 links missing.

Again, having better counts in the import process would help a lot in dealing with those issues...
