Some fixes for multithreaded Context parsing #1073
Conversation
Bump. This fixes a real bug and a race condition.
```julia
function findnextnewline(pos, stop, buf, opts)
    while pos < stop
```
`stop` here seems to be the last byte of the current chunk; why is it `<` here instead of `<=`? I understand the change below from `len = ranges[i + 1]` to `len = ranges[i + 1] - 1`, but it seems like we still want to check each byte in the chunk?
(For example, on line 372 we're doing `while pos <= len`...)
Thanks for the review!
This function `findnextnewline` only returns the position of the first encountered newline, defaulting to `stop` otherwise. So whether there is no newline at all, or the newline happens to be on the last byte, it returns `stop` either way, and since the function has no side effects we might as well skip checking that last byte.
The default return value comes from the assumption that each chunk ends with a newline. On line 450 this assumption is guaranteed by the preprocessing of `ranges` on lines 466-480, while on the call on line 470 `stop` is simply the end of the buffer.
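To make the boundary argument concrete, here is a small Python sketch of the same scan (a hypothetical analogue of the Julia helper, not the actual CSV.jl code):

```python
def find_next_newline(buf: bytes, pos: int, stop: int) -> int:
    """Return the index of the first newline in buf[pos:stop],
    or `stop` if none is found before it.

    Scanning with `pos < stop` instead of `pos <= stop` is safe here:
    if the newline sits on the last byte (index `stop`), the fallback
    return value `stop` is already the correct answer.
    """
    while pos < stop:
        if buf[pos] == ord("\n"):
            return pos
        pos += 1
    return stop

# Newline in the middle: found by the scan.
print(find_next_newline(b"ab\ncd\n", 0, 5))  # -> 2
# Newline on the last byte: never inspected, but the default is right.
print(find_next_newline(b"abcde\n", 0, 5))  # -> 5
```

The caller cannot distinguish "newline at `stop`" from "no newline before `stop`", which is exactly why checking the last byte would be redundant.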
quinnj left a comment
This looks pretty good IMO; I left one question though. I'm also not sure why CI is not running at all here? We definitely want to make sure we have a good test run before merging.
Codecov Report

Patch coverage:

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1073      +/-   ##
==========================================
+ Coverage   90.22%   90.40%   +0.18%
==========================================
  Files           9        9
  Lines        2270     2293      +23
==========================================
+ Hits         2048     2073      +25
+ Misses        222      220       -2
==========================================
```

☔ View full report in Codecov by Sentry.
Thanks @Liozou!
You're welcome, thanks for the review and merge! I believe this should also fix #1089; I checked locally that the bug occurs when using e.g.
* Reinitialize column types after failed chunking
* Fix race condition on ranges in findchunkrowstart
* Reinitialize column type after limit row start detection
* Preprocess ranges to avoid cutting values in chunk row start detection
* Fix using too large limit
* Fix off-by-one in chunk row detection for last line
* Add tests
* Make multithreaded parsing failure loud
I had a CSV file with only `Float64` columns that were sometimes parsed as `String`, sometimes as `Float64`, when using multiple threads. Whether it happened and which columns were affected changed from one execution to the next, which made the issue difficult to track. This PR fixes this behaviour and a couple of issues encountered in the same area of the code. Most notably, a race condition in `findchunkrowstart`, where thread `i` reads `ranges[i+1]` and writes to `ranges[i]` (`ntasks` value #1010).

I also made multithreaded parsing failure a loud `@error` to prevent performance pitfalls from going unnoticed (last commit). This can be easily reverted if you think it could lead to too many spurious logs.

Fix #1010, fix #1047, and added tests (which now include one occurrence of the new `@error`).
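The fix pattern for the race can be sketched in Python (a hypothetical illustration of the idea, not the CSV.jl implementation): each worker reads only the immutable input boundaries and writes only its own slot of a fresh output list, so no thread ever reads a slot that another thread may be writing.

```python
import threading

def adjust_chunk_starts(ranges):
    """Race-free sketch: worker i may read any slot of the immutable
    input `ranges` (including ranges[i+1]) but writes only slot i of
    a separate output list `adjusted`.

    In the buggy version, workers would write adjusted starts back
    into the shared `ranges` array while neighbours were still
    reading it, making the result depend on thread scheduling.
    """
    adjusted = [None] * len(ranges)

    def worker(i):
        # A real implementation would align ranges[i] to the next row
        # boundary here; the identity stands in for that adjustment.
        adjusted[i] = ranges[i]

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(ranges))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return adjusted

print(adjust_chunk_starts([0, 10, 20]))  # -> [0, 10, 20]
```

The key design choice is the same as in the PR's preprocessing of `ranges`: compute shared boundary data before (or separately from) the parallel phase, rather than mutating it concurrently.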