
Use bufio.Reader for large lines processing#23

Merged
satyrius merged 9 commits into satyrius:master from przmv:long_token
Nov 23, 2015

Conversation

@przmv

@przmv przmv commented Nov 15, 2015

This PR introduces the use of bufio.Reader (instead of bufio.Scanner) for processing long lines.

@satyrius
Owner

Please explain how it helps. Why is it better?

@przmv
Author

przmv commented Nov 15, 2015

When a log file line is longer than the bufio.Scanner buffer (MaxScanTokenSize = 64 * 1024), Scan fails with ErrTooLong. That means you can't process lines longer than 64 KB with bufio.Scanner.

With bufio.Reader you can check whether the buffer contains the whole line; if not, you do one more buffered read and append the next fragment.

So this solution is better because it makes it possible to process files with long (> 64 KB) lines.

@satyrius
Owner

Oh man... logs with lines over 64 KB long :/ What are you parsing?

Could you benchmark Scanner vs Reader? Just to be sure we have no noticeable speed degradation on normal-size logs.

reader_test.go Outdated
Owner


Generate this string instead of pasting a lot of kilobytes of text, please.

@przmv
Author

przmv commented Nov 15, 2015

There's no speed difference between Scanner and Reader:

$ make bench 
go test -bench .
..................................
34 total assertions

..............
48 total assertions

.........
57 total assertions

.....
62 total assertions

........................................
102 total assertions

PASS
BenchmarkParseSimpleLogRecord-4   200000          6291 ns/op
BenchmarkParseLogRecord-4          50000         30310 ns/op
BenchmarkScannerReader-4        2000000000           0.00 ns/op
BenchmarkReaderReader-4         2000000000           0.00 ns/op
ok      github.com/pshevtsov/gonx   3.275s

@przmv
Author

przmv commented Nov 15, 2015

Oh man... logs with lines over 64 KB long :/ What are you parsing?

Yup. It's not an ordinary log file but a huge TSV dump with a few such long lines.

@satyrius
Owner

The benchmark tests look strange. First of all, a bench test should measure one single operation (reading a line, in our case), but you read the whole file. How many lines are in this file? And at the end, the results are a little confusing... 0 ns/op?

@przmv
Author

przmv commented Nov 16, 2015

How many lines are in this file?

It reads this file. But all right, I'm going to redo the test cases to read just one line.

And at the end, results are a little confusing... 0 ns/op?

Yes, I also found it a bit weird. Probably that's because the whole file was loaded into memory. I'm going to check.

@przmv
Author

przmv commented Nov 16, 2015

Oh, I definitely need more sleep (or more ☕ )

I've just corrected the benchmarks:

BenchmarkScannerReader-4          500000          2040 ns/op
BenchmarkReaderReader-4           500000          2206 ns/op

@satyrius
Owner

So we get ~2.0 µs to read a line with Scanner and ~2.2 µs with Reader. 10% slower on reading.

@przmv
Author

przmv commented Nov 18, 2015

$ go test -bench Reader -benchmem -benchtime 1m
..................................
34 total assertions

..............
48 total assertions

.........
57 total assertions

.....
62 total assertions

........................................
102 total assertions

PASS
BenchmarkScannerReader-4        50000000          2296 ns/op        4096 B/op          1 allocs/op
BenchmarkReaderReaderAppend-4   50000000          2625 ns/op        4096 B/op          1 allocs/op
BenchmarkReaderReaderBuffer-4   30000000          2818 ns/op        4208 B/op          2 allocs/op
ok      github.com/pshevtsov/gonx   392.334s

@satyrius
Owner

Good job! Do you want to add something, or can we merge it?

@przmv
Author

przmv commented Nov 18, 2015

Please hold on for some time. I'm working on some improvements — e.g. no need for a loop when reading short lines — so both techniques will work the same for the most common use cases (i.e. reading short lines). Also I'd like to test which technique (append or bytes.Buffer) will work better for long lines.

I'll let you know soon.
Cheers

@przmv
Author

przmv commented Nov 20, 2015

Hey,

It turns out that for reading long lines, using bytes.Buffer is better:

BenchmarkScannerReader-4              500000          2679 ns/op        4305 B/op          5 allocs/op
BenchmarkReaderReaderAppend-4         500000          2650 ns/op        4305 B/op          5 allocs/op
BenchmarkReaderReaderBuffer-4         500000          2640 ns/op        4305 B/op          5 allocs/op
BenchmarkLongReaderReaderAppend-4        500       2588516 ns/op     5666680 B/op         29 allocs/op
BenchmarkLongReaderReaderBuffer-4       1000       2276376 ns/op     4076730 B/op         20 allocs/op

@przmv
Author

przmv commented Nov 21, 2015

Hey @satyrius when are you going to merge this PR? I have a blocking issue because of inability to read long lines in the application I'm currently working on. Thanks!

@satyrius
Owner

Well done! Going to master.

satyrius added a commit that referenced this pull request Nov 23, 2015
Use bufio.Reader for large lines processing
@satyrius satyrius merged commit 7a047c4 into satyrius:master Nov 23, 2015
@przmv przmv deleted the long_token branch December 1, 2015 16:56