docs(IRv2): Add documentation for current IRv2 implementation#707
docs(IRv2): Add documentation for current IRv2 implementation#707AVMatthews wants to merge 23 commits into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
LinZhihao-723
left a comment
There was a problem hiding this comment.
Some high-level comments before we proceed:
- In the md file, let's wrap the line to ensure each line doesn't exceed 100-char limit. This helps increase the readability when directly editing the md file.
- iiuc, "IRv2" is the internal name we used. For the formal release, we should use "key-value pair IR stream" (like we did here: https://github.com/y-scope/clp-ffi-py?tab=readme-ov-file#using-key-value-pair-ir-streams)
- @kirkrodrigues do we need spec of how packets/IR units are serialized in this doc?
| separate. Each IR stream maintains auto-generated keys and user-generated keys in two independent | ||
| schema trees, ensuring clear differentiation and avoiding unintended interference. | ||
|
|
||
| #### Example |
There was a problem hiding this comment.
We need to refactor this example to showcase auto-gen and user-gen kv-pairs
kirkrodrigues
left a comment
There was a problem hiding this comment.
Ideally, these docs should explain the motivation behind the KV-pair IR format both at the high level as well as the individual pieces. Think about it from the perspective of a developer who wants to modify/work with the format---what is the minimum amount of information they need to understand why the format is what it is and how to extend it.
Although the PR already contains a lot of the pieces necessary for the docs, they're:
- missing some info (e.g., the schema tree diagram doesn't actually explain what the nodes are); remember that developers are time-constrained, so they likely won't read the clp-s paper.
- going bottom-up rather than top-down.
- missing the motivation for why we designed things in the way we did.
I would suggest structuring things roughly as follows:
- A concise but sufficient intro to clp-s' compression algorithm
- This should explain how to convert the example JSON records into clp-s' format.
- Overview of the KV-pair IR format
- Motivation and specification of each part of the format (if this ends up being too convoluted, we could defer the specification to another page that we link to).
In the specification of each part, I would highlight the differences between the format and clp-s (e.g., the type differences).
Since this is a big task, I would suggest going piece by piece (e.g., update the intro to clp-s' compression algorithm first) and then we can review that before moving onto to the next section.
LinZhihao-723
left a comment
There was a problem hiding this comment.
- Left some comments on the intro part. Will need Kirk to give more detailed feedback on the CLP-S intro section.
- Also reviewed other modified parts to correct some terms/concepts.
- It would be helpful if u could run a spelling check before asking for review comments.
| units and log event IR units, one set for each log message. The log event unit contains a set of | ||
| [key and value IR packets](#auto-generated-schema-tree-node-id-value-pairs) for the auto | ||
| generated schema nodes, and then a set of | ||
| [key IR packets and corresponding value IR packets](#user-generated-schema-tree-node-id-value-pairs) | ||
| for each user generated key value pair in the log entry (???). To differnetiate between the auto | ||
| and user generated node ids we use the 1's complement to encode the auto generated node ids. | ||
| Rather than organizing the IR units so that each value comes directly after its key, we group the | ||
| keys and values together. We found that in practice the similarities between keys and between | ||
| values allows for better compression ratios when stored together. If a log record does not result | ||
| in any new schema tree nodes, there will be no new schema tree growth IR units for that record. | ||
| While we stream the IR units we keep the schema tree we’ve built in memory for easy reference, so | ||
| all packets in the stream can reference the same set of schema tree nodes. |
There was a problem hiding this comment.
As an overview, we might not need to cover these details.
|
@AVMatthews Could you do the following?
|
AVMatthews
left a comment
There was a problem hiding this comment.
Content looks pretty good to me. Most of my comments are about the details of how it looks or reads. It's hard to say if I think we should add anything until I see the draft of the rest of the content.
| :gutter: 2 | ||
|
|
||
| :::{grid-item-card} | ||
| :link: kv-ir/background |
There was a problem hiding this comment.
This link is broken ... goes to /dev-guide/kv-ir/kv-ir/background instead of /dev-guide/kv-ir/background.html.
|
|
||
| The key-value pair internal representation (abbreviated as kv-ir) stream format is a storage format | ||
| for dynamically structured (e.g. JSON) logs. Compared to a JSON file, the kv-ir stream format is | ||
| smaller and faster for clp-s to compress. Compared to clp-s' archive format, the kv-ir stream format |
There was a problem hiding this comment.
I know kv-ir is smaller, but is it faster. Do we have updated numbers on that. If so we should make sure to put that in the blog or somewhere.
|
|
||
| To understand the format, it's necessary to briefly understand: | ||
|
|
||
| * how clp-s encodes dynamically-structured log events and serializes them into an archive; |
There was a problem hiding this comment.
Maybe capitalize the start of all the bullets.
| ## Encoding & storing structured log events in an archive | ||
|
|
||
| The processing of encoding a log event can conceptually be broken down into the following steps. | ||
|
|
There was a problem hiding this comment.
I'd add a list of the steps here and link to the below sections. That way they can be skimmed before getting into details.
|
|
||
| ### 1. Determine the log event's schema | ||
|
|
||
| clp-s first examines the log event's schema--i.e., the key and value type for each kv-pair in the |
There was a problem hiding this comment.
Should we capitalize clp-s at the beginning of sentences (or in general). It feels a little strange reading it without a capital at the beginning at least.
| ### 1. Determine the log event's schema | ||
|
|
||
| clp-s first examines the log event's schema--i.e., the key and value type for each kv-pair in the | ||
| log event. For example, Figure 1 shows two log events, while Tables 1 & 2 show the schemas for these |
There was a problem hiding this comment.
Can we add links from the figures and table references to the figures and tables?
| "<key>: <type>" and each arrow is from a parent to a child node. | ||
| :::: | ||
|
|
||
| Notice that although both `level` and `mesage` are strings, clp-s assigns them more specific types |
There was a problem hiding this comment.
| Notice that although both `level` and `mesage` are strings, clp-s assigns them more specific types | |
| Notice that although both `level` and `message` are strings, clp-s assigns them more specific types |
| timersObj --> timersStage2Null | ||
| ::: | ||
| +++ | ||
| **Figure 3**: The archive schema tree after adding log event #1 in Figure 1. Each node's label is of |
There was a problem hiding this comment.
Since this figure is essentially the same as figure 2, I think we could say that after adding the first log event its just the schema tree of log event 1 and then wait to show how the archive schema tree grows below.
| **Table 4**: The ERT generated by encoding log event #2 in Figure 1. | ||
| ::: | ||
|
|
||
| ## Parsing & encoding unstructured text |
There was a problem hiding this comment.
Do we need this background for kv-ir? Is it just to reference how it is different later?
| For instance, consider the structured Go log printing statement (LPS) in Figure 6 (that uses the | ||
| [Zap](https://github.com/uber-go/zap) logging library) and its output in Figure 7. | ||
|
|
||
| :::{card} |
There was a problem hiding this comment.
Generally I think this is a good example, but do auto generated keys always end up in the message like this? How do we distinguish them in the log message and so if you have two timestamps for example do they both show up in the JSON message.
Description
Draft of documentation for IRv2 Implementation.
Pulled Materials from IRv2 Design Doc and draft IRv2 Blog Post
Still need many images converted to Mermaid and some of the Mermaid visuals need to be cleaned up to be more readable.
Some visuals need auto generated keys to be integrated.
Checklist
breaking change.
Validation performed