
Improve kv-ir deserialization speed #1541

@LinZhihao-723


Request

The current kv-ir deserialization speed is not ideal. In our profiling experiment, around 30-40% of the time is spent constructing and destructing the variable dictionary list (which is parsed and stored here: https://github.com/y-scope/clp/blob/main/components/core/src/clp/ffi/ir_stream/decoding_methods.cpp#L388).
We should come up with a way to avoid this memory allocation overhead.

Possible implementation

We should improve how EncodedTextAst is implemented. Instead of storing var strings as a vector of strings, we can store all var strings and the logtype in a single concatenated string. For example:

logtype: "id=%s, passwd=%s"
vars: ["x", "y"]

We can store it as:

string_buffer (as a single string): "xyid=%s, passwd=%s"

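with a position vector that records the start offset of each substring (each substring ends where the next one starts, and the last one ends at the end of the buffer):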

pos = [0, 1, 2]

This way, we don't need to allocate a separate string for each var string and the logtype, and we improve spatial locality. At the same time, we can still randomly access any var string or the logtype in the string buffer.
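
The layout above can be sketched roughly as follows. This is only a minimal illustration, assuming a class named StringBlob (the name comes from the milestones below); the method names `append`, `get`, `buffer`, `offsets`, and `clear` are hypothetical and not taken from the CLP codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch of the proposed string buffer; not the actual CLP API.
class StringBlob {
public:
    // Appends `str` to the shared buffer and records its start offset.
    // Note: this may reallocate the buffer and invalidate views returned by `get`.
    auto append(std::string_view str) -> void {
        m_offsets.push_back(m_buffer.size());
        m_buffer.append(str);
    }

    // Returns the number of substrings stored.
    [[nodiscard]] auto size() const -> std::size_t { return m_offsets.size(); }

    // Returns a view of the substring at `idx` without copying it.
    // A substring ends where the next one starts; the last one ends at the buffer's end.
    [[nodiscard]] auto get(std::size_t idx) const -> std::string_view {
        assert(idx < m_offsets.size());
        auto const begin = m_offsets[idx];
        auto const end = (idx + 1 < m_offsets.size()) ? m_offsets[idx + 1] : m_buffer.size();
        return std::string_view{m_buffer}.substr(begin, end - begin);
    }

    [[nodiscard]] auto buffer() const -> std::string const& { return m_buffer; }

    [[nodiscard]] auto offsets() const -> std::vector<std::size_t> const& { return m_offsets; }

    // Empties the blob so it can be reused for the next log event. Standard library
    // implementations typically keep the allocated capacity after `clear()`.
    auto clear() -> void {
        m_buffer.clear();
        m_offsets.clear();
    }

private:
    std::string m_buffer;                // All substrings, concatenated
    std::vector<std::size_t> m_offsets;  // Start offset of each substring
};
```

Appending "x", "y", and then "id=%s, passwd=%s" to such a blob would produce the buffer and position vector from the example above. Reusing one blob across log events via `clear()` is one possible way to avoid repeated allocations; whether the real implementation does this is an assumption here.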

An early-stage experiment shows that this implementation leads to a 1.67x speedup, tested on two datasets.

Milestones:

  • Implement the string buffer (StringBlob).
  • Use the string buffer to re-implement the encoded text AST.
  • Implement decoding methods based on the string buffer implementation.
  • Replace the existing AST implementation with the new encoded text AST implementation.
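
To illustrate the decoding milestone, here is a rough sketch of rebuilding a message directly from the string buffer and the position vector. It assumes the layout from the example above (var strings first, logtype last) and uses "%s" as the placeholder purely because the example does; the actual placeholder encoding in CLP logtypes may differ. The names `substring_at` and `decode_message` are hypothetical.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: returns the substring that starts at pos[idx]. It ends at the
// next start offset, or at the end of the buffer for the last substring.
auto substring_at(
        std::string_view buffer,
        std::vector<std::size_t> const& pos,
        std::size_t idx
) -> std::string_view {
    auto const begin = pos[idx];
    auto const end = (idx + 1 < pos.size()) ? pos[idx + 1] : buffer.size();
    return buffer.substr(begin, end - begin);
}

// Hypothetical sketch: rebuilds the message by replacing each "%s" in the logtype
// (the last substring) with the var strings (all preceding substrings), in order.
// Returns std::nullopt if the number of placeholders and var strings differ.
auto decode_message(std::string_view buffer, std::vector<std::size_t> const& pos)
        -> std::optional<std::string> {
    if (pos.empty()) {
        return std::nullopt;
    }
    constexpr std::string_view cPlaceholder{"%s"};  // Assumption taken from the example
    auto const num_vars = pos.size() - 1;
    auto const logtype = substring_at(buffer, pos, num_vars);

    std::string message;
    message.reserve(buffer.size());
    std::size_t var_idx{0};
    std::size_t cursor{0};
    while (true) {
        auto const found = logtype.find(cPlaceholder, cursor);
        if (std::string_view::npos == found) {
            break;
        }
        if (var_idx >= num_vars) {
            return std::nullopt;  // More placeholders than var strings
        }
        // Copy the static text before the placeholder, then the var string itself.
        message.append(logtype.substr(cursor, found - cursor));
        message.append(substring_at(buffer, pos, var_idx));
        ++var_idx;
        cursor = found + cPlaceholder.size();
    }
    if (var_idx != num_vars) {
        return std::nullopt;  // More var strings than placeholders
    }
    message.append(logtype.substr(cursor));
    return message;
}
```

With the buffer "xyid=%s, passwd=%s" and pos = [0, 1, 2], this sketch yields "id=x, passwd=y". Only the output message is allocated; the var strings and the logtype are read in place as views into the buffer.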

Labels

enhancement (New feature or request)
