Skip to content

docs(IRv2): Add documentation for current IRv2 implementation#707

Draft
AVMatthews wants to merge 23 commits into
y-scope:mainfrom
AVMatthews:IRv2-Documentation
Draft

docs(IRv2): Add documentation for current IRv2 implementation#707
AVMatthews wants to merge 23 commits into
y-scope:mainfrom
AVMatthews:IRv2-Documentation

Conversation

@AVMatthews

@AVMatthews AVMatthews commented Jan 30, 2025

Copy link
Copy Markdown
Contributor

Description

Draft of documentation for IRv2 Implementation.

Pulled Materials from IRv2 Design Doc and draft IRv2 Blog Post

Still need many images converted to Mermaid and some of the Mermaid visuals need to be cleaned up to be more readable.

Some visuals need auto generated keys to be integrated.

Checklist

  • The PR satisfies the contribution guidelines.
  • [x ] This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • [ x] Necessary docs have been updated, OR no docs need to be updated.

Validation performed


@coderabbitai

coderabbitai Bot commented Jan 30, 2025

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some high-level comments before we proceed:

separate. Each IR stream maintains auto-generated keys and user-generated keys in two independent
schema trees, ensuring clear differentiation and avoiding unintended interference.

#### Example

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to refactor this example to showcase auto-gen and user-gen kv-pairs

@kirkrodrigues kirkrodrigues left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, these docs should explain the motivation behind the KV-pair IR format both at the high level as well as the individual pieces. Think about it from the perspective of a developer who wants to modify/work with the format---what is the minimum amount of information they need to understand why the format is what it is and how to extend it.

Although the PR already contains a lot of the pieces necessary for the docs, they're:

  • missing some info (e.g., the schema tree diagram doesn't actually explain what the nodes are); remember that developers are time-constrained, so they likely won't read the clp-s paper.
  • going bottom-up rather than top-down.
  • missing the motivation for why we designed things in the way we did.

I would suggest structuring things roughly as follows:

  • A concise but sufficient intro to clp-s' compression algorithm
    • This should explain how to convert the example JSON records into clp-s' format.
  • Overview of the KV-pair IR format
  • Motivation and specification of each part of the format (if this ends up being too convoluted, we could defer the specification to another page that we link to).

In the specification of each part, I would highlight the differences between the format and clp-s (e.g., the type differences).

Since this is a big task, I would suggest going piece by piece (e.g., update the intro to clp-s' compression algorithm first) and then we can review that before moving onto to the next section.

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Left some comments on the intro part. Will need Kirk to give more detailed feedback on the CLP-S intro section.
  • Also reviewed other modified parts to correct some terms/concepts.
  • It would be helpful if u could run a spelling check before asking for review comments.

Comment thread docs/src/dev-guide/design-key-value-pair-ir-stream.md Outdated
Comment thread docs/src/dev-guide/design-key-value-pair-ir-stream.md Outdated
Comment thread docs/src/dev-guide/design-key-value-pair-ir-stream.md Outdated
Comment thread docs/src/dev-guide/design-key-value-pair-ir-stream.md Outdated
Comment thread docs/src/dev-guide/design-key-value-pair-ir-stream.md Outdated
Comment thread docs/src/dev-guide/design-key-value-pair-ir-stream.md Outdated
Comment thread docs/src/dev-guide/design-key-value-pair-ir-stream.md Outdated
Comment thread docs/src/dev-guide/design-key-value-pair-ir-stream.md Outdated
Comment on lines +110 to +121
units and log event IR units, one set for each log message. The log event unit contains a set of
[key and value IR packets](#auto-generated-schema-tree-node-id-value-pairs) for the auto
generated schema nodes, and then a set of
[key IR packets and corresponding value IR packets](#user-generated-schema-tree-node-id-value-pairs)
for each user generated key value pair in the log entry (???). To differnetiate between the auto
and user generated node ids we use the 1's complement to encode the auto generated node ids.
Rather than organizing the IR units so that each value comes directly after its key, we group the
keys and values together. We found that in practice the similarities between keys and between
values allows for better compression ratios when stored together. If a log record does not result
in any new schema tree nodes, there will be no new schema tree growth IR units for that record.
While we stream the IR units we keep the schema tree we’ve built in memory for easy reference, so
all packets in the stream can reference the same set of schema tree nodes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an overview, we might not need to cover these details.

Comment thread docs/src/dev-guide/design-key-value-pair-ir-stream.md Outdated
@kirkrodrigues

Copy link
Copy Markdown
Member

@AVMatthews Could you do the following?

  1. Re-add the png files from docs/src/dev-guide/images as follows:

    • Install git-lfs if you don't have it already.
    • Copy the images to a temporary location.
    • git rm -f all the images.
    • Commit the change.
    • Move the images back from the temporary location.
    • Add the images again.
    • Commit the change.
  2. Review the new docs I added.

@AVMatthews AVMatthews left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Content looks pretty good to me. Most of my comments are about the details of how it looks or reads. It's hard to say if I think we should add anything until I see the draft of the rest of the content.

:gutter: 2

:::{grid-item-card}
:link: kv-ir/background

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link is broken ... goes to /dev-guide/kv-ir/kv-ir/background instead of /dev-guide/kv-ir/background.html.


The key-value pair internal representation (abbreviated as kv-ir) stream format is a storage format
for dynamically structured (e.g. JSON) logs. Compared to a JSON file, the kv-ir stream format is
smaller and faster for clp-s to compress. Compared to clp-s' archive format, the kv-ir stream format

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know kv-ir is smaller, but is it faster. Do we have updated numbers on that. If so we should make sure to put that in the blog or somewhere.


To understand the format, it's necessary to briefly understand:

* how clp-s encodes dynamically-structured log events and serializes them into an archive;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe capitalize the start of all the bullets.

## Encoding & storing structured log events in an archive

The processing of encoding a log event can conceptually be broken down into the following steps.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a list of the steps here and link to the below sections. That way they can be skimmed before getting into details.


### 1. Determine the log event's schema

clp-s first examines the log event's schema--i.e., the key and value type for each kv-pair in the

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we capitalize clp-s at the beginning of sentences (or in general). It feels a little strange reading it without a capital at the beginning at least.

### 1. Determine the log event's schema

clp-s first examines the log event's schema--i.e., the key and value type for each kv-pair in the
log event. For example, Figure 1 shows two log events, while Tables 1 & 2 show the schemas for these

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add links from the figures and table references to the figures and tables?

"<key>: <type>" and each arrow is from a parent to a child node.
::::

Notice that although both `level` and `mesage` are strings, clp-s assigns them more specific types

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Notice that although both `level` and `mesage` are strings, clp-s assigns them more specific types
Notice that although both `level` and `message` are strings, clp-s assigns them more specific types

timersObj --> timersStage2Null
:::
+++
**Figure 3**: The archive schema tree after adding log event #1 in Figure 1. Each node's label is of

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this figure is essentially the same as figure 2, I think we could say that after adding the first log event its just the schema tree of log event 1 and then wait to show how the archive schema tree grows below.

**Table 4**: The ERT generated by encoding log event #2 in Figure 1.
:::

## Parsing & encoding unstructured text

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this background for kv-ir? Is it just to reference how it is different later?

For instance, consider the structured Go log printing statement (LPS) in Figure 6 (that uses the
[Zap](https://github.com/uber-go/zap) logging library) and its output in Figure 7.

:::{card}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I think this is a good example, but do auto generated keys always end up in the message like this? How do we distinguish them in the log message and so if you have two timestamps for example do they both show up in the JSON message.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants