30k articles wallabag import report #1119
What happened?
I have just migrated from Wallabag to Readeck, and I'm really happy!
I'm not exactly sure how to file this: I was told to summarize the "stream of consciousness" comments I made on Matrix somewhere, and figured this was as good a place as any...
The context is that I was using a Wallabag instance self-hosted by a friend at another location, on another residential connection. The import went like this:
I have the import log on file here, but I'm a little hesitant to share it publicly because (1) it's huge and (2) it probably contains some private information (like the stuff I read!). Here's a quick analysis of what it contains:
None of those match the "123" count.
What outcome was expected?
Overall, I'm really impressed with the whole thing. Everything went smoothly: the remote host didn't experience any trouble, and everything went much faster than I expected (I thought it would take at least 12h if not multiple days).
I have a couple of comments regarding the usability of Readeck in general, and the import in particular. I'm not sure whether those should be filed as separate issues and, quite frankly, perhaps this whole issue can be closed, as it's mostly a "wow, that works really well!" issue. Do let me know.
Here are the issues that could be improved:
I think step 3 could be skipped. Note that this doesn't affect desktop usage when using only the keyboard: if you hit "enter" to add the label, you keep the focus.
The final log message is also quite terse:

2026-02-26T01:03:04.796261183Z INF import finished @id=08bb0286/c949-00000021 bookmark_id=25496

It could include the total run time, and how many articles were imported, extracted, failed to extract, and so on. The web interface could also surface all of that.

Again, let me know if I can/should file issues about this, and I welcome any comments or help, particularly about those pesky 123 missing items. ;)
Note that during my research, I also found a few (existing) bugs in Readeck as well, which I note here for my own posterity:
Relevant log output
No response
OS you installed Readeck on
Linux
Hi, thanks for the feedback! I'll address more of the items you brought up in due time, but for now:
Here's a tiny wallabag API script for you: https://gist.github.com/mislav/0ebe0da24f9510f609db2a86a9d711f4 (requires curl, jq). You could invoke it to extract certain fields, like entry id, article url, and the size of article contents. The script iterates through all of your wallabag entries and outputs their information in tab-separated format. You can pipe that to a file and load it up in spreadsheet software if you want an overview of your library.
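In case the gist ever disappears, the idea is a paginated loop over the wallabag entries API. A minimal sketch (this is a paraphrase, not the actual gist), assuming your instance URL in $WALLABAG_URL and an OAuth bearer token in $TOKEN:

```shell
#!/bin/sh
# Page through the wallabag entries API and print one TSV line per entry:
# id, url, and the size of the cached article contents.
WALLABAG_URL=${WALLABAG_URL:-https://wallabag.example.com}  # your instance
TOKEN=${TOKEN:-}                                            # OAuth bearer token

page=1
while :; do
  body=$(curl -sf --max-time 30 \
    -H "Authorization: Bearer $TOKEN" \
    "$WALLABAG_URL/api/entries.json?page=$page&perPage=100") || break
  # One tab-separated line per entry: id, url, content length.
  printf '%s\n' "$body" |
    jq -r '._embedded.items[] | [.id, .url, (.content // "" | length)] | @tsv'
  # Stop after the last page.
  [ "$page" -ge "$(printf '%s\n' "$body" | jq '.pages')" ] && break
  page=$((page + 1))
done
```

Redirect the output to a file and open it in your spreadsheet of choice.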
Of course, I don't expect you to ever share your full reading list, but I'd be particularly interested in the number of articles in your library that have no cached article contents. That would cause Readeck to refetch the original URL during import, and if that failed due to HTTP 404, network, or other issues, it would lead to some of the errors you were seeing in your logs.

I'm not sure it's desirable for content scripts to run on cached versions of articles from wallabag 🤔 Wallabag stores readable versions of articles, not the original HTML markup, so some content scripts are bound to fail, since they were designed to run on the original markup of the page. Maybe we should disable content scripts when importing cached contents?
fantastic, that works well!
... but what about the readeck side? :)
it seems like it's something like:
unfortunately, that doesn't quite work, as it seems the wallabag API flakes out:
i.e. the API script only finds 15k articles. I had a little more success with the simpler:
With that, I get:
Maybe I just don't understand your jq invocation there. :)

One thing that's interesting here is that I feel the readeck count is wrong. Remember how I reported that readeck imported 28034 items from wallabag? That's exactly the number above. This sounds great until you know (like me) that I have actually added a lot more URLs in there since I started! There should be, in other words, more than 28034 URLs in readeck now, even though the user interface reports that number.

So that's really strange, and a bit concerning. If I add https://example.com to readeck right now, the count does increment to 28035, so I'm not sure what's going on.

Here's another interesting thing: I tried to sort the URL lists and then remove duplicates. Below, the -sorted files are just sorted, and the -sorted-uniq files are deduplicated:

So, of course, just sorting the files doesn't change anything, but removing duplicates drops more entries from the readeck list than from the wallabag one, which I find just bizarre.
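For the record, the comparison above boils down to something like this (with toy stand-in files here; my real exports are one URL per line, ~28k lines each):

```shell
# Toy stand-ins for the two URL exports (one URL per line).
printf 'http://a/\nhttp://a/\nhttp://b\n' > readeck-urls.txt
printf 'http://a/\nhttp://b\nhttp://c\n' > wallabag-urls.txt

# Sort each list, then deduplicate; the difference between the line counts
# of the -sorted and -sorted-uniq files is the number of exact duplicates.
sort    readeck-urls.txt  > readeck-urls-sorted.txt
sort -u readeck-urls.txt  > readeck-urls-sorted-uniq.txt
sort    wallabag-urls.txt > wallabag-urls-sorted.txt
sort -u wallabag-urls.txt > wallabag-urls-sorted-uniq.txt
wc -l readeck-urls-sorted*.txt wallabag-urls-sorted*.txt

# To list the URLs that are actually duplicated in a file:
sort readeck-urls.txt | uniq -d
```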
Grepping around the diff, I can see a few reasons for the discrepancies:

readeck decodes some percent-escapes:

http://yro.slashdot.org/story/10/01/09/0341208/Politicians-Worldwide-Asking-Questions-About-ACTA?from=rss&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Slashdot%2Fslashdot+%28Slashdot%29
http://yro.slashdot.org/story/10/01/09/0341208/Politicians-Worldwide-Asking-Questions-About-ACTA?from=rss&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+Slashdot/slashdot+(Slashdot)

readeck strips anchors:

http://xtalk.msk.su/~ott/en/writings/emacs-devenv/EmacsCedet.html#sec8
http://xtalk.msk.su/~ott/en/writings/emacs-devenv/EmacsCedet.html

readeck normalizes on a trailing slash, but not always!

http://kedpm.sourceforge.net
http://kedpm.sourceforge.net/
readeck normalized some redirections
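To factor those differences out before diffing, I ended up with roughly this kind of sed normalization (a crude sketch that only covers the cases above; it is not what readeck actually does internally):

```shell
# Crude URL canonicalization: strip #anchors, drop one trailing slash, and
# decode the percent-escapes seen in the examples above.
# (A sketch only; not readeck's real normalization logic.)
normalize_urls() {
  sed -E \
    -e 's/#.*$//' \
    -e 's|/$||' \
    -e 's/%3A/:/g; s/%2F/\//g; s/%28/(/g; s/%29/)/g'
}

printf '%s\n' \
  'http://kedpm.sourceforge.net/' \
  'http://xtalk.msk.su/~ott/en/writings/emacs-devenv/EmacsCedet.html#sec8' |
  normalize_urls
```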
Another mind-boggling thing is that readeck seems to have URLs that are not in wallabag, and that I don't even remember adding. For example, I have a bookmark for some random control panel that I added on 12 April 2007, according to readeck, yet it's not in the URL list according to wallabag's API.
It's pretty hard to come up with a definitive list of what's missing. Here's the final diff, after removing anchors (but not encoding issues, those are harder):
That's over 800 bookmarks to inspect! After filtering out some of the trailing-slash and https stuff, I got down to:
... which is still a lot!
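The filtering itself looked roughly like this (toy files again; the https:// folding and anchor stripping are my guesses at safe canonicalizations, not readeck's rules):

```shell
# Toy stand-ins for the two exports (the real files have ~28k lines each).
printf 'http://a/x\nhttps://b/y/\nhttp://c/z#top\n' > wallabag.txt
printf 'http://a/x\nhttp://b/y\nhttp://c/z\nhttp://only-in-readeck\n' > readeck.txt

# Canonicalize both sides: drop anchors and trailing slashes, and fold
# https:// down to http:// so scheme upgrades don't count as differences.
canon() { sed -E -e 's/#.*$//' -e 's|/$||' -e 's|^https://|http://|' "$1" | sort -u; }

canon wallabag.txt > wallabag.canon
canon readeck.txt  > readeck.canon

# comm -3 keeps only the lines unique to either side: the bookmarks that
# genuinely differ between the two exports.
comm -3 wallabag.canon readeck.canon
```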
I suspect a lot of the failed links are links that readeck can't read anymore. For example, one of the missing links is https://framagit.org/abakkk/DrawOnYourScreen. That, indeed, does not load anymore and is not in the URL list returned by the readeck API. But, much more strangely, if I search for DrawOnYourScreen in readeck, I do find a bookmark! There, it's tagged as https://framagit.org/users/sign_in.

So I guess one thing we could do, when importing, is to avoid following redirects: those are likely dead links that, instead of returning a proper 403 or 404, send you to a login or parked-domain page.
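A quick way to spot those soft-dead links from a URL list, without following the redirect (the file name is made up):

```shell
# Print "status<TAB>redirect-target" for a URL *without* following the
# redirect: a 3xx pointing at a sign-in or parked page is a strong hint
# that the original article is gone.
check_url() {
  curl -sS -o /dev/null --max-time 10 \
       -w '%{http_code}\t%{redirect_url}\n' "$1"
}

# e.g.: while read -r u; do printf '%s\t' "$u"; check_url "$u"; done < missing-urls.txt
```

If my theory is right, https://framagit.org/abakkk/DrawOnYourScreen would show a 3xx with the sign-in page as the redirect target, while a healthy bookmark returns 200.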
At this point, I'm not sure I want to dig into this any deeper. It's pretty hard to figure out exactly what's going on between the two, and I still can't quite tell whether I really have 123 links missing.
Again, having better counts in the import process would help a lot in dealing with those issues...