ffi: Add SchemaTree implementation to support the next IR stream format. by LinZhihao-723 · Pull Request #411 · y-scope/clp

LinZhihao-723 · 2024-05-22T03:58:02Z

References

Description

This PR introduces a streaming schema tree implementation designed for IR v2. It has the following components:

SchemaTreeNode: A class that specifies the node information for each node in the schema tree, including a unique ID, a parent ID, key name, type, and children IDs.
SchemaTree: A tree built with SchemaTreeNode, with methods to insert/get tree nodes.
Unit test: unit test cases to cover basic functionality.
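For readers who want a picture of the components listed above, here is a minimal sketch of how the node and tree could be laid out. It is not the code from this PR: `NodeType`, `NodeId`, the member names, the choice of an `Obj` root with ID 0, and the "ID equals vector index" layout are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; the PR's actual declarations may differ.
enum class NodeType : uint8_t { Int, Float, Str, Bool, UnstructuredArray, Obj };

using NodeId = size_t;

struct SchemaTreeNode {
    NodeId id;
    std::optional<NodeId> parent_id;  // empty only for the root
    std::string key_name;
    NodeType type;
    std::vector<NodeId> children_ids;
};

class SchemaTree {
public:
    // Assumption: the tree starts with an unnamed object root whose ID is 0.
    SchemaTree() { m_nodes.push_back({0, std::nullopt, "", NodeType::Obj, {}}); }

    // Adds a child under `parent_id` and returns the new node's ID.
    NodeId insert_node(NodeId parent_id, std::string key_name, NodeType type) {
        if (parent_id >= m_nodes.size()) { throw std::out_of_range("unknown parent"); }
        NodeId const id = m_nodes.size();
        m_nodes.push_back({id, parent_id, std::move(key_name), type, {}});
        m_nodes[parent_id].children_ids.push_back(id);
        return id;
    }

    SchemaTreeNode const& get_node(NodeId id) const { return m_nodes.at(id); }
    size_t get_size() const { return m_nodes.size(); }

private:
    std::vector<SchemaTreeNode> m_nodes;  // a node's ID is its index in this vector
};

// Example usage: build the path root -> "a" (object) -> "b" (integer).
inline void schema_tree_example() {
    SchemaTree tree;
    auto const a = tree.insert_node(0, "a", NodeType::Obj);
    auto const b = tree.insert_node(a, "b", NodeType::Int);
    assert(tree.get_node(b).parent_id == a);
}
```

Storing nodes contiguously keeps the memory footprint easy to estimate, which matches the motivation given in the description, but the real class may organize things differently.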

Notice that we already have a schema tree implementation in clp-s. The reasons to re-implement the schema tree are the following:

The schema tree node maintains different information to track a tree node:
- The types of schema tree nodes in the IR format differ from those in the Archive format. The types we support in IR format are the following:
  - Integer
  - Float
  - String
  - Boolean
  - Unstructured Array
  - Object
- There is no need to track the count of each node.
The schema tree is designed to be used in our new IR stream. Compared to the existing implementation, this PR makes it more lightweight:
- The schema tree does not maintain a hash map for existing nodes. This reduces memory usage and means absl::flat_hash_map is not required when building our FFI libraries. As a tradeoff, finding a node is O(n) in the worst case instead of O(1), since the lookup traverses the children of a parent. However, existing profiling results show that this change has a negligible effect on IR v2 stream serialization/deserialization, even when the tree's max depth and max width are large.
- By making these changes, the memory used by the schema tree can be approximated more simply. This can be helpful if we need to build a heuristic to determine when to rotate an IR stream.
When IR serialization fails, the tree needs to be restored to its state before the serialization started, meaning that all nodes inserted during the failed serialization must be removed. SchemaTree has an efficient implementation for this scenario.
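The description does not say how the rollback is implemented. One way it could be made cheap, assuming nodes live in a vector and IDs are assigned in insertion order, is to remember the tree size before serialization and pop nodes off the back on failure. `RevertibleTree`, `take_snapshot`, and `revert` below are hypothetical names for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch; the PR does not spell out how its rollback works.
class RevertibleTree {
public:
    RevertibleTree() { m_nodes.push_back({0, "", {}}); }  // node 0 is the root

    size_t insert_node(size_t parent_id, std::string key_name) {
        if (parent_id >= m_nodes.size()) { throw std::out_of_range("unknown parent"); }
        size_t const id = m_nodes.size();
        m_nodes.push_back({parent_id, std::move(key_name), {}});
        m_nodes[parent_id].children_ids.push_back(id);
        return id;
    }

    // Remembers the current size so a later revert() can return to it.
    void take_snapshot() { m_snapshot_size = m_nodes.size(); }

    // Drops every node inserted since take_snapshot().
    void revert() {
        if (!m_snapshot_size.has_value()) { throw std::logic_error("no snapshot"); }
        while (m_nodes.size() > *m_snapshot_size) {
            // IDs grow with insertion order, so the node being removed is always
            // the most recently added child of its parent.
            auto& siblings = m_nodes[m_nodes.back().parent_id].children_ids;
            assert(!siblings.empty() && siblings.back() == m_nodes.size() - 1);
            siblings.pop_back();
            m_nodes.pop_back();
        }
        m_snapshot_size.reset();
    }

    size_t get_size() const { return m_nodes.size(); }
    size_t get_num_children(size_t id) const { return m_nodes.at(id).children_ids.size(); }

private:
    struct Node {
        size_t parent_id;
        std::string key_name;
        std::vector<size_t> children_ids;
    };

    std::vector<Node> m_nodes;
    std::optional<size_t> m_snapshot_size;
};
```

Under these assumptions the revert costs time proportional to the number of nodes added since the snapshot, and needs no bookkeeping beyond a single saved size.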

Validation performed

Passed clang-tidy linter check.
Ensured the code can be successfully built with unitTest.
Ensured new unit tests passed.

gibber9809

Nice work! Broadly speaking this looks good to me, I just have a few questions.

If we want to mix structured and unstructured logs in the same stream how would that be represented in this schema tree?

In clp-s we're taking the approach that the root of the tree is a node '-1' that has no type, and each different type of log has an unnamed node of the correct type that is a child of that '-1' node (e.g. for JSON logs this would be an unnamed node of type object, and for unstructured logs this would be an unnamed node of type clp string).

Are tree insertions/lookups all O(1) in the context of reading back a stream? It looks like this should be the case but just want to confirm.

gibber9809 · 2024-05-24T01:06:26Z

+     * the parent id, the key name, and the node type to locate a unique tree node. This class wraps
+     * the location information as a non-integer identifier to locate a unique node in the tree.
+     */
+    class TreeNodeLocator {


This comment could probably be rephrased to be more clear. Maybe something like

"When constructing the schema tree we uniquely identify the location of a node being appended to the tree by the unique triple of parent id, key name, and node type. This class stores that triple, and can act as a unique identifier for a node in the tree."

How about "appended" -> "inserted"? "Appended" is more implementation-specific, and the doc string doesn't necessarily expose this detail.

Shall we add a sentence to explain why the triple is unique? Essentially, it's because key name + node type should not have any ambiguity for a parent node.

Those both sound good to me.
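To make the triple discussed in this thread concrete, here is a small sketch of what such a locator could look like. The member names, getters, and the `NodeType` enum are assumptions for illustration; only the class name and the three fields (parent ID, key name, node type) come from the diff and the comments above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

enum class NodeType : uint8_t { Int, Float, Str, Bool, UnstructuredArray, Obj };

// Hypothetical shape for the locator; member names are illustrative.
class TreeNodeLocator {
public:
    TreeNodeLocator(size_t parent_id, std::string_view key_name, NodeType type)
            : m_parent_id{parent_id},
              m_key_name{key_name},
              m_type{type} {}

    size_t get_parent_id() const { return m_parent_id; }
    std::string_view get_key_name() const { return m_key_name; }
    NodeType get_type() const { return m_type; }

    // Two locators name the same node only if all three parts match.
    bool operator==(TreeNodeLocator const& rhs) const {
        return m_parent_id == rhs.m_parent_id && m_key_name == rhs.m_key_name
               && m_type == rhs.m_type;
    }

private:
    size_t m_parent_id;
    std::string_view m_key_name;  // non-owning: the caller keeps the key alive
    NodeType m_type;
};

// Same parent and key but different types: two distinct nodes.
inline void locator_example() {
    TreeNodeLocator const as_int{0, "latency", NodeType::Int};
    TreeNodeLocator const as_float{0, "latency", NodeType::Float};
    assert(!(as_int == as_float));
}
```

The example shows why the type has to be part of the identifier: a key such as "latency" may appear as an integer in one log event and as a float in another, and each variant gets its own node under the same parent.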

LinZhihao-723 · 2024-05-30T07:04:59Z

Nice work! Broadly speaking this looks good to me, I just have a few questions.

If we want to mix structured and unstructured logs in the same stream how would that be represented in this schema tree?

In clp-s we're taking the approach that the root of the tree is a node '-1' that has no type, and each different type of log has an unnamed node of the correct type that is a child of that '-1' node (e.g. for JSON logs this would be an unnamed node of type object, and for unstructured logs this would be an unnamed node of type clp string).

Are tree insertions/lookups all O(1) in the context of reading back a stream? It looks like this should be the case but just want to confirm.

For unstructured logs, our higher-level APIs (FFI) should structure the log event. For example, a normal unstructured log event should be serialized to something like: {"timestamp": 100000, "log_level": "INFO", "log_message": "xxxxx"}. At the IR level, we don't differentiate whether the input source is a structured or unstructured log. The only thing that might need special handling in the future is the timestamp.
What does "reading back a stream" mean? Did you mean deserializing the stream? In general, node insertion and lookup are not O(1). For lookup, we traverse all children of a parent since we don't have a hash map storing the location-to-node-ID mapping. For insertion, we add a sanity check to ensure the node at the given location doesn't already exist (which requires a lookup). Based on my previous benchmark, this shouldn't be the bottleneck for either serialization or deserialization. I don't think this check can be skipped during deserialization since the stream might be corrupted: users could abuse our format to generate an illegal stream. In both cases, the dominant cost is traversing the children to check whether the {key_name, type} pair already exists. If this becomes a bottleneck in the future, we can optimize the implementation by introducing a hash map when the number of children exceeds some threshold.
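The lookup and the checked insertion described in this reply can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `LinearLookupTree`, `try_get_node_id`, the exception type, and the storage layout are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

enum class NodeType { Int, Float, Str, Bool, UnstructuredArray, Obj };

// Hypothetical sketch of lookup by scanning a parent's children.
class LinearLookupTree {
public:
    LinearLookupTree() { m_nodes.push_back({"", NodeType::Obj, {}}); }  // root, ID 0

    // Scans the parent's children for a matching {key_name, type} pair.
    // Cost is linear in the number of children; no hash map is involved.
    std::optional<size_t>
    try_get_node_id(size_t parent_id, std::string_view key_name, NodeType type) const {
        for (size_t const child_id : m_nodes.at(parent_id).children_ids) {
            auto const& child = m_nodes[child_id];
            if (child.key_name == key_name && child.type == type) { return child_id; }
        }
        return std::nullopt;
    }

    // Rejects duplicates, so a corrupted stream cannot insert the same node twice.
    size_t insert_node(size_t parent_id, std::string_view key_name, NodeType type) {
        if (try_get_node_id(parent_id, key_name, type).has_value()) {
            throw std::invalid_argument("node already exists");
        }
        size_t const id = m_nodes.size();
        m_nodes.push_back({std::string{key_name}, type, {}});
        m_nodes[parent_id].children_ids.push_back(id);
        return id;
    }

private:
    struct Node {
        std::string key_name;
        NodeType type;
        std::vector<size_t> children_ids;
    };

    std::vector<Node> m_nodes;
};

// Example usage: the same key under the same parent matches only with the same type.
inline void lookup_example() {
    LinearLookupTree tree;
    auto const id = tree.insert_node(0, "level", NodeType::Str);
    assert(tree.try_get_node_id(0, "level", NodeType::Str) == id);
    assert(!tree.try_get_node_id(0, "level", NodeType::Int).has_value());
}
```

If the scan ever became a bottleneck, the hash map mentioned above for parents with many children could be added inside `try_get_node_id` without changing callers.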

gibber9809

LGTM. PR title is also good for commit message.

kirkrodrigues

Mainly docs + style changes with a few concerns about logic.

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>

kirkrodrigues

A few touch-ups. For the PR title, how about:

ffi: Add SchemaTree implementation to support the next IR stream format.

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>

LinZhihao-723 · 2024-06-04T00:45:26Z

A few touch-ups. For the PR title, how about:

ffi: Add SchemaTree implementation to support the next IR stream format.

The commit message lgtm

LinZhihao-723 added 2 commits May 21, 2024 23:56

Implement schema tree

2a1938e

Clean headers

a9f7c2f

gibber9809 requested changes May 24, 2024

LinZhihao-723 and others added 2 commits May 31, 2024 17:35

Update components/core/src/clp/ffi/SchemaTree.hpp

cb885eb

Update tree node locator description

d97b2e9

LinZhihao-723 requested a review from gibber9809 June 1, 2024 21:11

gibber9809 previously approved these changes Jun 3, 2024

Refactoring...

d02968a

LinZhihao-723 dismissed gibber9809’s stale review via d02968a June 3, 2024 18:00

kirkrodrigues requested changes Jun 3, 2024

LinZhihao-723 and others added 2 commits June 3, 2024 17:00

Apply suggestions from code review

7afbbd2

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>

Apply code review changes

2623c8f

kirkrodrigues previously approved these changes Jun 3, 2024

Apply suggestions from code review

0c8a09a

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>

LinZhihao-723 dismissed kirkrodrigues’s stale review via 0c8a09a June 4, 2024 00:44

kirkrodrigues approved these changes Jun 4, 2024

LinZhihao-723 changed the title from "FFI: Add support for schema tree." to "ffi: Add SchemaTree implementation to support the next IR stream format." Jun 4, 2024

LinZhihao-723 merged commit 3e00d50 into y-scope:main Jun 4, 2024

junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026

ffi: Add SchemaTree implementation to support the next IR stream form…

4ebd9d2

…at. (y-scope#411)
