ffi: Add SchemaTree implementation to support the next IR stream format.#411
Conversation
gibber9809
left a comment
There was a problem hiding this comment.
Nice work! Broadly speaking this looks good to me, I just have a few questions.
- If we want to mix structured and unstructured logs in the same stream how would that be represented in this schema tree?
In clp-s we're taking the approach that the root of the tree is a node '-1' that has no type, and each different type of log has an unnamed node of the correct type that is a child of that '-1' node (e.g. for JSON logs this would be an unnamed node of type object, and for unstructured logs this would be an unnamed node of type clp string).
- Are tree insertions/lookups all O(1) in the context of reading back a stream? It looks like this should be the case but just want to confirm.
| * the parent id, the key name, and the node type to locate a unique tree node. This class wraps | ||
| * the location information as a non-integer identifier to locate a unique node in the tree. | ||
| */ | ||
| class TreeNodeLocator { |
There was a problem hiding this comment.
This comment could probably be rephrased to be more clear. Maybe something like
"When constructing the schema tree we uniquely identify the location of a node being appended to the try by the unique triple of parent id, key name, and node type. This class
stores that triple, and can act as a unique identifier for a node in the tree."
There was a problem hiding this comment.
- How about "appended" -> "inserted"? "Appended" is more implementation-specific, and the doc string doesn't necessarily expose this detail.
- Shall we add a sentence to explain why the triple is unique? Essentially, it's because key name + node type should not have any ambiguity for a parent node
There was a problem hiding this comment.
Those both sound good to me.
|
gibber9809
left a comment
There was a problem hiding this comment.
LGTM. PR title is also good for commit message.
kirkrodrigues
left a comment
There was a problem hiding this comment.
Mainly docs + style changes with a few concerns about logic.
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
kirkrodrigues
left a comment
There was a problem hiding this comment.
A few touch-ups. For the PR title, how about:
ffi: Add SchemaTree implementation to support the next IR stream format.
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
The commit message lgtm |
References
Description
This PR introduces a streaming schema tree implementation designed for IR v2. It has the following components:
SchemaTreeNode: A class that specifies the node information for each node in the schema tree, including a unique ID, a parent ID, key name, type, and children IDs.SchemaTree: A tree built with SchemaTreeNode, with methods to insert/get tree nodes.Notice that we already have a schema tree implementation in
clp-s. The reasons to re-implement the schema tree are the following:absl::flat_hash_mapwhen building our FFI libraries. As a tradeoff, the worst-case time complexity of node finding takes O(n) instead of O(1). However, from existing profiling results, this change has negligible influence on the IR v2 stream serialization/deserialization, even when the tree max depth and max width are large.SchemaTreehas an efficient implementation for this scenario.Validation performed
clang-tidylinter check.unitTest.