Faster tokenization by converting byte arrays to strings #6898

Closed
GuptaManan100 wants to merge 4 commits into vitessio:master from GuptaManan100:parser-modification

GuptaManan100 commented Oct 19, 2020

Overview of the enhancement

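The tokenizer currently stores the SQL input in a byte array, although the input arrives as a string, so the input has to be copied before tokenization starts. This change makes the tokenizer operate on the string directly, which eliminates that copy. Benchmark results for the original implementation and for two string-based variants follow.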
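As a rough illustration of the copy being discussed, the sketch below contrasts a tokenizer that converts its input to `[]byte` with one that keeps the string. The type and function names (`byteTokenizer`, `stringTokenizer`, `setupAllocs`) are hypothetical and are not taken from the Vitess code; only the allocation behavior of the Go conversion itself is being shown.

```go
package main

import (
	"fmt"
	"testing"
)

// Package-level sinks force the values to escape, so the compiler cannot
// keep the byte slice on the stack and hide the allocation.
var (
	byteSink   byteTokenizer
	stringSink stringTokenizer
)

// byteTokenizer is a hypothetical stand-in for a tokenizer backed by []byte.
type byteTokenizer struct {
	buf []byte
	pos int
}

// stringTokenizer has the same shape but keeps the input string as is.
type stringTokenizer struct {
	buf string
	pos int
}

func newByteTokenizer(sql string) byteTokenizer {
	// The conversion allocates a new backing array and copies the input.
	return byteTokenizer{buf: []byte(sql)}
}

func newStringTokenizer(sql string) stringTokenizer {
	// Storing the string copies only its header (pointer and length).
	return stringTokenizer{buf: sql}
}

// setupAllocs reports the average number of heap allocations needed to
// construct each tokenizer for the given input.
func setupAllocs(sql string) (bytesImpl, stringImpl float64) {
	bytesImpl = testing.AllocsPerRun(100, func() { byteSink = newByteTokenizer(sql) })
	stringImpl = testing.AllocsPerRun(100, func() { stringSink = newStringTokenizer(sql) })
	return bytesImpl, stringImpl
}

func main() {
	b, s := setupAllocs("select id, name from users where id = 42")
	fmt.Printf("[]byte tokenizer: %.0f allocs, string tokenizer: %.0f allocs\n", b, s)
}
```

Reading a byte with `buf[i]` costs the same in both variants; the difference is limited to the up-front copy, which grows with the length of the query.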

Initial Byte Array Implementation
BenchmarkParse1-16          13350778     10810 ns/op     2431 B/op      76 allocs/op
BenchmarkParse2-16           4257296     33834 ns/op     8672 B/op     266 allocs/op
BenchmarkParse2Parallel-16  25955780      6434 ns/op     5904 B/op     175 allocs/op
BenchmarkParse3-16             47076   2968535 ns/op  6337672 B/op     359 allocs/op
BenchmarkParseBigQuery-16       8679  16439766 ns/op  2541376 B/op  133468 allocs/op
BenchmarkWithNormalizer-16     40436   7523121 ns/op  7450373 B/op     638 allocs/op

Strings Implementation
BenchmarkParse1-16          13014916     11534 ns/op     2375 B/op      91 allocs/op
BenchmarkParse2-16           4180118     34032 ns/op     8511 B/op     315 allocs/op
BenchmarkParse2Parallel-16  25814864      6391 ns/op     5785 B/op     224 allocs/op
BenchmarkParse3-16             40479   3474247 ns/op  6293153 B/op     405 allocs/op
BenchmarkParseBigQuery-16       8089  17375202 ns/op  2663017 B/op  213834 allocs/op
BenchmarkWithNormalizer-16     33951   5906691 ns/op  8474301 B/op     695 allocs/op

Strings With Copy On Write Implementation
BenchmarkParse1-16          13779513     10704 ns/op     2167 B/op      56 allocs/op
BenchmarkParse2-16           4595913     31356 ns/op     7557 B/op     160 allocs/op
BenchmarkParse2Parallel-16  29976123      5207 ns/op     4834 B/op      69 allocs/op
BenchmarkParse3-16             39510   3700988 ns/op  6128685 B/op     314 allocs/op
BenchmarkParseBigQuery-16      10000  14835516 ns/op  1614678 B/op   52400 allocs/op
BenchmarkWithNormalizer-16     29342   5871095 ns/op  8309850 B/op     604 allocs/op

CPU and memory profiles for these benchmarks are available at https://drive.google.com/drive/folders/1jTFyySRiVyTSYdFV3U_H8HGDUHF3443Y?usp=sharing
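To make the three result sets above easier to compare, the following sketch computes the relative change of each string-based variant against the byte array baseline. The figures are copied from the BenchmarkParse3-16 and BenchmarkParseBigQuery-16 rows; the `result` type and the helper names are illustrative only.

```go
package main

import "fmt"

// result holds the three per-operation figures from one benchmark line.
type result struct {
	nsPerOp, bytesPerOp, allocsPerOp float64
}

// pctChange returns the relative change from base to cur in percent.
// Negative values mean cur is lower than base.
func pctChange(base, cur float64) float64 {
	return (cur - base) / base * 100
}

// rows holds figures copied from the benchmark output for two benchmarks:
// base is the byte array implementation, plain is the strings
// implementation, and cow is strings with copy on write.
var rows = []struct {
	name             string
	base, plain, cow result
}{
	{
		name:  "BenchmarkParse3-16",
		base:  result{2968535, 6337672, 359},
		plain: result{3474247, 6293153, 405},
		cow:   result{3700988, 6128685, 314},
	},
	{
		name:  "BenchmarkParseBigQuery-16",
		base:  result{16439766, 2541376, 133468},
		plain: result{17375202, 2663017, 213834},
		cow:   result{14835516, 1614678, 52400},
	},
}

// report formats the change of cur relative to base for all three figures.
func report(label string, base, cur result) string {
	return fmt.Sprintf("%-14s time %+6.1f%%  bytes %+6.1f%%  allocs %+6.1f%%",
		label,
		pctChange(base.nsPerOp, cur.nsPerOp),
		pctChange(base.bytesPerOp, cur.bytesPerOp),
		pctChange(base.allocsPerOp, cur.allocsPerOp))
}

func main() {
	for _, r := range rows {
		fmt.Println(r.name)
		fmt.Println("  " + report("strings", r.base, r.plain))
		fmt.Println("  " + report("copy on write", r.base, r.cow))
	}
}
```

For BenchmarkParseBigQuery-16, allocations per operation rise by roughly 60% with plain strings and fall by roughly 61% with copy on write, while BenchmarkParse3-16 takes longer per operation in both string-based variants than in the baseline.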

GuptaManan100 added the Type: Enhancement and Component: Parser labels Oct 19, 2020
Signed-off-by: GuptaManan100 <manan@planetscale.com>

sougou commented Nov 21, 2020

Closing this. But let's keep the branch alive in case we want to revisit this.

sougou closed this Nov 21, 2020
vmg mentioned this pull request Mar 5, 2021

Labels

Component: Query Serving, Status: Won't Fix, Type: Enhancement
