Skip to content

Unbreak MD5 on ARM#1077

Merged
Frenzie merged 1 commit intokoreader:masterfrom
NiLuJe:master
Apr 14, 2020
Merged

Unbreak MD5 on ARM#1077
Frenzie merged 1 commit intokoreader:masterfrom
NiLuJe:master

Conversation

@NiLuJe
Copy link
Copy Markdown
Member

@NiLuJe NiLuJe commented Apr 13, 2020

Regression since #1065

Probably because of implicit type conversions to/from Lua
and the fact that Lua numbers are double internally?
Or something.


This change is Reviewable

Regression since koreader#1065

Probably because of implicit type conversions to/from Lua
and the fact that Lua numbers are double internally?
Or something.
@NiLuJe
Copy link
Copy Markdown
Member Author

NiLuJe commented Apr 13, 2020

This leads me to wonder what the hell we use MD5 for in the first place, since we failed to notice visible breakage despite everything being hashed as ffffffffffffffffffffffffffffffff for a month ;D.

@NiLuJe
Copy link
Copy Markdown
Member Author

NiLuJe commented Apr 13, 2020

(Caught this purely by accident while checking out a metadata record manually while testing koreader/koreader#3785 (comment) :D).

@Frenzie Frenzie merged commit 4fc1cda into koreader:master Apr 14, 2020
@Frenzie
Copy link
Copy Markdown
Member

Frenzie commented Apr 14, 2020

I think it's mainly for progress sync, which I personally don't use.

@poire-z
Copy link
Copy Markdown
Contributor

poire-z commented Apr 14, 2020

Mhhh, I think I just witnessed (still with v2020.03.2-24-g5f771ed_2020-04-05) something with md5, which is also used by the statistics plugin where a book is uniquely identified bt (title, authors, md5).

These are all different books, except this morning, 1078 appeared, which is a duplicate of 1074 (so, my stats for this book are splitted in 2).
Also, 1073, 1074 and 1075 are 3 different books, sharing the same md5 !

sqlite> select id, pages, md5, total_read_time from book order by id desc limit 10;
id          pages       md5                               total_read_time
----------  ----------  --------------------------------  ---------------
1078        700         985835031646affe3f3962f501aa81d2  1528
1077        296         00000000abe5be96b0ae184f00000000  NulL
1076        42          1e0fcc4556567e3650ccdb24eeb59519  NulL
1075        24          26176cd752b358196c04f64beb7e997e  546
1074        696         26176cd752b358196c04f64beb7e997e  70295
1073        491         26176cd752b358196c04f64beb7e997e  NulL
1072        8           00000000000000000000000000000000  1160
1071        291         ffffffffffffffffffffffffffffffff  35451
1070        26          69cd9f97494cb37fec65c2bfec7009fb  3713
1069        430         49c4cdad5fb709cd602a924bc3dd7689  40365

Is this related to this issue & fix ?

@Frenzie
Copy link
Copy Markdown
Member

Frenzie commented Apr 14, 2020

Oh right, statistics too. 👍

@NiLuJe
Copy link
Copy Markdown
Member Author

NiLuJe commented Apr 14, 2020

The ffffffff... one is certainly what I saw before this PR. So I'd assume that the 00000000... one could be that, too.

The mixed stuff bracketed in 00000000 on both edges looks like the CI's Clang failures, so, :?.

No idea about the clashes.

After a quick look at my own db (on the H2O, which I don't really use), the only collisions I have are for djvu documents (I actually have a bunch of different id's for the same djvu document (the one from the testsuite)?!

@poire-z
Copy link
Copy Markdown
Contributor

poire-z commented Apr 14, 2020

Looking at my statistics.sqlite3, my first duplicates (same md5 shared by 2 different books) has a last_open 2019/11/20.
I only have ~10 duplicated md5 for 20 entries, so that's around 2% of my stats.
So, we probably have a bug somewhere.

When opening a book, purging manually the .sdr/, re-opening, the md5sum in metadata.lua is consistent, both on x86 and arm.
So, possibly a md5 from one document leaking when opening another in some conditions...

@NiLuJe
Copy link
Copy Markdown
Member Author

NiLuJe commented Apr 14, 2020

I'll double-check on the Forma, but given that I rarely document swap, I don't expect to have anything much to add to the hunt :/.

@Frenzie
Copy link
Copy Markdown
Member

Frenzie commented Apr 14, 2020

There's a function called partial MD5 somewhere. It only takes the first few hundred kilobytes or something. Even so 2% seems absurdly high, that should be more like increasing the chance of collision from minuscule to extremely unlikely.

@poire-z
Copy link
Copy Markdown
Contributor

poire-z commented Apr 14, 2020

Probably caused by and since koreader/koreader#5528.

self.data being a single object attached to the constant/created at parse time local ReaderStatistics and not re-created on each instance/book load:
https://github.com/koreader/koreader/blob/9a789335c3904081d61dab2160e186185793d3ce/plugins/statistics.koplugin/main.lua#L34-L65

So, we might be re-using the one from the previous book, and when a new book is opened, it's not computed ? Or something... (because if true, that should have happened quite more often...)

edit: unless plugins are fully re-parsed: https://github.com/koreader/koreader/blob/f756a6aaa75b81201fbf302adc9d21cef560c810/frontend/pluginloader.lua#L80

@Frenzie
Copy link
Copy Markdown
Member

Frenzie commented Apr 14, 2020

I'm not fully sure how Lua deals with that in the context of pcall. Normally speaking it shouldn't be.

@poire-z
Copy link
Copy Markdown
Contributor

poire-z commented Apr 14, 2020

OK, they are pcall(dofile(...) once, and this is cached.
Then, on each ReaderUI new instance, PluginLoader:createPluginInstance() is called, which does pcall(plugin.new, plugin, attr).
So, at least, it's coherent with how the other modules are handled (parsed once, :init()/:new() called for each new instance).

Confirmed with some logs, switching

$ koreader-emu |grep md5
ReaderStatistics:init(): self.data.md5:
ReaderStatistics:initData 1(): self.data.md5: ad5479c77be4b209c4cbd7c6f7a224b9
ReaderStatistics:initData 2(): self.data.md5: ad5479c77be4b209c4cbd7c6f7a224b9
ReaderStatistics:init(): self.data.md5:
ReaderStatistics:initData 1(): self.data.md5: ad5479c77be4b209c4cbd7c6f7a224b9
ReaderStatistics:initData 2(): self.data.md5: ad5479c77be4b209c4cbd7c6f7a224b9
ReaderStatistics:init(): self.data.md5:
ReaderStatistics:initData 1(): self.data.md5: ad5479c77be4b209c4cbd7c6f7a224b9
ReaderStatistics:initData 2(): self.data.md5: ad5479c77be4b209c4cbd7c6f7a224b9

Cleaning sdr/s , starting with the 2nd book

$ koreader-emu |grep md5
ReaderStatistics:init(): self.data.md5:
ReaderStatistics:initData 1(): self.data.md5:
ReaderStatistics:initData 2(): self.data.md5: 9a21757d899b17344089600d90bfcba0
ReaderStatistics:init(): self.data.md5: 9a21757d899b17344089600d90bfcba0
ReaderStatistics:initData 1(): self.data.md5: 9a21757d899b17344089600d90bfcba0
ReaderStatistics:initData 2(): self.data.md5: 9a21757d899b17344089600d90bfcba0
ReaderStatistics:init(): self.data.md5: 9a21757d899b17344089600d90bfcba0
ReaderStatistics:initData 1(): self.data.md5: 9a21757d899b17344089600d90bfcba0
ReaderStatistics:initData 2(): self.data.md5: 9a21757d899b17344089600d90bfcba0

Cleaning sdr/s , opening one:

$ koreader-emu |grep md5
ReaderStatistics:init(): self.data.md5:
ReaderStatistics:initData 1(): self.data.md5:
ReaderStatistics:initData 2(): self.data.md5: ad5479c77be4b209c4cbd7c6f7a224b9

Quitting, relauching on this one just opened, and loading the other one (with no sdr yet):

$ koreader-emu |grep md5
ReaderStatistics:init(): self.data.md5:
ReaderStatistics:initData 1(): self.data.md5: ad5479c77be4b209c4cbd7c6f7a224b9
ReaderStatistics:initData 2(): self.data.md5: ad5479c77be4b209c4cbd7c6f7a224b9
ReaderStatistics:init(): self.data.md5:
ReaderStatistics:init(): self.data.md5:
ReaderStatistics:initData 1(): self.data.md5:
ReaderStatistics:initData 2(): self.data.md5: 9a21757d899b17344089600d90bfcba0

@poire-z
Copy link
Copy Markdown
Contributor

poire-z commented Apr 14, 2020

I guess this is just enough to solve the issue:

--- a/plugins/statistics.koplugin/main.lua
+++ b/plugins/statistics.koplugin/main.lua
@@ -1960,11 +1960,11 @@ function ReaderStatistics:saveSettings()
     }
     G_reader_settings:saveSetting("statistics", settings)
 end

 function ReaderStatistics:onReadSettings(config)
-    self.data = config.data.stats
+    self.data = config.data.stats or {}
 end

 function ReaderStatistics:onReaderReady()
     -- we have correct page count now, do the actual initialization work
     self:initData()

On a newly opened book, the doc_setting.stats does not exist yet, so self.data (it is assigned nil, so not created and not masking the base object one) is the "global/base object" one, that is filled with the md5 (and saved to the doc_setting);
On an existing book (reopening of the previous one), the one saved in the existing doc_setting is used: self.data is a new object proper to this ReaderStatistics instance.
On another new book, the doc_setting.stats does not exist yet, so self.data (it is assigned nil, so not created and not masking the base object one) is the "global/base object" one, that was filled with the md5 of the first new book.

$ koreader-emu |grep md5
ReaderStatistics:init(): self.data.md5:
ReaderStatistics:initData 1(): self.data.md5:
ReaderStatistics:initData 2(): self.data.md5: ad5479c77be4b209c4cbd7c6f7a224b9
ReaderStatistics:init(): self.data.md5:
ReaderStatistics:initData 1(): self.data.md5:
ReaderStatistics:initData 2(): self.data.md5: 9a21757d899b17344089600d90bfcba0
ReaderStatistics:init(): self.data.md5:
ReaderStatistics:initData 1(): self.data.md5: ad5479c77be4b209c4cbd7c6f7a224b9
ReaderStatistics:initData 2(): self.data.md5: ad5479c77be4b209c4cbd7c6f7a224b9
ReaderStatistics:init(): self.data.md5:
ReaderStatistics:initData 1(): self.data.md5: 9a21757d899b17344089600d90bfcba0
ReaderStatistics:initData 2(): self.data.md5: 9a21757d899b17344089600d90bfcba0

@poire-z
Copy link
Copy Markdown
Contributor

poire-z commented Apr 14, 2020

OK, pushed a fix, that looks sane enough.

I'm still unsure how this happened to me for these books: I guess the first entry might have had a wrong md5 leaked from a previous book - but then, one need to have Purge .sdr to start from blank and have the md5 recomputed (or leaked from another book) - and I don't remember having done that this morning...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants