Improve kv-ir deserialization speed

### Request

The current kv-ir deserialization speed is not ideal. Our profiling shows that around 30-40% of the time is spent constructing and destructing the variable dictionary list (which is parsed and stored here: https://github.com/y-scope/clp/blob/main/components/core/src/clp/ffi/ir_stream/decoding_methods.cpp#L388).
We should come up with a workaround to avoid this memory allocation overhead.

### Possible implementation

We should improve how `EncodedTextAst` is implemented. Instead of storing var strings as a vector of strings, we can store all var strings and the logtype in a single concatenated string. For example:
```
logtype: "id=%s, passwd=%s"
vars: ["x", "y"]
```
We can store it as:
```
string_buffer (as a string): ["xyid=%s, passwd=%s"]
```
with a position vector that records the start offset of each substring (each var string, followed by the logtype):
```
pos = [0, 1, 2]
```
This way, we don't need to allocate a separate string for each var string and the logtype, and we improve spatial locality. At the same time, we can still randomly access any var string or the logtype in the string buffer.
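To make the layout concrete, here is a minimal sketch of what such a string buffer could look like. The class name follows the `StringBlob` milestone, but the method names (`append`, `get`, `size`, `clear`) and the choice to return `std::string_view` are assumptions for illustration, not the actual CLP interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch of the proposed string buffer; not the actual CLP implementation.
class StringBlob {
public:
    // Appends `str` to the shared buffer and records the offset where it starts.
    auto append(std::string_view str) -> void {
        m_offsets.push_back(m_buffer.size());
        m_buffer.append(str);
    }

    // Returns the number of substrings stored.
    [[nodiscard]] auto size() const -> std::size_t { return m_offsets.size(); }

    // Returns a view of the substring at `idx` without copying it.
    // The view is invalidated by any later call to `append` or `clear`.
    [[nodiscard]] auto get(std::size_t idx) const -> std::string_view {
        assert(idx < m_offsets.size());
        auto const begin = m_offsets[idx];
        // A substring ends where the next one starts, or at the end of the buffer.
        auto const end = idx + 1 < m_offsets.size() ? m_offsets[idx + 1] : m_buffer.size();
        return std::string_view(m_buffer).substr(begin, end - begin);
    }

    // Drops all substrings; implementations typically keep the allocated
    // capacity, so the blob can be reused across log events.
    auto clear() -> void {
        m_buffer.clear();
        m_offsets.clear();
    }

private:
    std::string m_buffer;                // e.g. "xyid=%s, passwd=%s"
    std::vector<std::size_t> m_offsets;  // e.g. {0, 1, 2}
};
```

Appending `"x"`, `"y"`, and `"id=%s, passwd=%s"` in that order reproduces the buffer and position vector from the example above. Reusing one blob across log events (via `clear`) is one way the per-string allocations could be avoided, assuming the caller does not hold views across events.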

An early-stage experiment shows that this implementation leads to a 1.67x speedup, tested on two datasets.

Milestones:
* [ ] Implement the string buffer (StringBlob).
* [ ] Use the string buffer to re-implement the encoded text AST.
* [ ] Implement decoding methods based on the string buffer implementation.
* [ ] Replace the existing AST implementation with the new encoded text AST implementation.
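As a rough illustration of the decoding milestone, the sketch below rebuilds a message from the flat layout. Everything here is hypothetical: `FlatEncodedText` and `decode` are made-up names, and the literal `%s` placeholder is taken from the example in this issue rather than from the real logtype encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical flat layout: all var strings first, then the logtype, in one buffer.
struct FlatEncodedText {
    std::string buffer;                // e.g. "xyid=%s, passwd=%s"
    std::vector<std::size_t> offsets;  // e.g. {0, 1, 2}; the last entry starts the logtype

    [[nodiscard]] auto num_vars() const -> std::size_t {
        return offsets.empty() ? 0 : offsets.size() - 1;
    }

    // Returns a non-owning view of the substring at `idx`.
    [[nodiscard]] auto substr(std::size_t idx) const -> std::string_view {
        auto const begin = offsets.at(idx);
        auto const end = idx + 1 < offsets.size() ? offsets[idx + 1] : buffer.size();
        return std::string_view(buffer).substr(begin, end - begin);
    }

    // The logtype is the last substring; at least one offset must exist.
    [[nodiscard]] auto logtype() const -> std::string_view {
        assert(!offsets.empty());
        return substr(offsets.size() - 1);
    }
};

// Rebuilds the message by substituting each "%s" with the next var string.
// The "%s" placeholder is an assumption borrowed from the example above.
auto decode(FlatEncodedText const& text) -> std::string {
    constexpr std::string_view cPlaceholder{"%s"};
    auto const logtype = text.logtype();
    std::string message;
    message.reserve(text.buffer.size());
    std::size_t var_idx = 0;
    std::size_t pos = 0;
    while (pos < logtype.size()) {
        auto const next = logtype.find(cPlaceholder, pos);
        if (std::string_view::npos == next || var_idx >= text.num_vars()) {
            // No more placeholders (or no more vars): copy the rest verbatim.
            message.append(logtype.substr(pos));
            break;
        }
        message.append(logtype.substr(pos, next - pos));
        message.append(text.substr(var_idx++));
        pos = next + cPlaceholder.size();
    }
    return message;
}
```

The only allocation during decoding in this sketch is the output string itself; the var strings and the logtype are read as views into the single buffer.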

Improve kv-ir deserialization speed #1541
