docs(IRv2): Add documentation for current IRv2 implementation by AVMatthews · Pull Request #707 · y-scope/clp

AVMatthews · 2025-01-30T21:42:21Z

Description

Draft of documentation for IRv2 Implementation.

Pulled Materials from IRv2 Design Doc and draft IRv2 Blog Post

Still need many images converted to Mermaid and some of the Mermaid visuals need to be cleaned up to be more readable.

Some visuals need auto generated keys to be integrated.

Checklist

The PR satisfies the contribution guidelines.
[x ] This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
[ x] Necessary docs have been updated, OR no docs need to be updated.

Validation performed

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1212851383737245

coderabbitai · 2025-01-30T21:42:29Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

LinZhihao-723

Some high-level comments before we proceed:

In the md file, let's wrap the line to ensure each line doesn't exceed 100-char limit. This helps increase the readability when directly editing the md file.
iiuc, "IRv2" is the internal name we used. For the formal release, we should use "key-value pair IR stream" (like we did here: https://github.com/y-scope/clp-ffi-py?tab=readme-ov-file#using-key-value-pair-ir-streams)
@kirkrodrigues do we need spec of how packets/IR units are serialized in this doc?

LinZhihao-723 · 2025-02-09T04:56:10Z

+separate. Each IR stream maintains auto-generated keys and user-generated keys in two independent
+schema trees, ensuring clear differentiation and avoiding unintended interference.
+
+#### Example


We need to refactor this example to showcase auto-gen and user-gen kv-pairs

kirkrodrigues

Ideally, these docs should explain the motivation behind the KV-pair IR format both at the high level as well as the individual pieces. Think about it from the perspective of a developer who wants to modify/work with the format---what is the minimum amount of information they need to understand why the format is what it is and how to extend it.

Although the PR already contains a lot of the pieces necessary for the docs, they're:

missing some info (e.g., the schema tree diagram doesn't actually explain what the nodes are); remember that developers are time-constrained, so they likely won't read the clp-s paper.
going bottom-up rather than top-down.
missing the motivation for why we designed things in the way we did.

I would suggest structuring things roughly as follows:

A concise but sufficient intro to clp-s' compression algorithm
- This should explain how to convert the example JSON records into clp-s' format.
Overview of the KV-pair IR format
Motivation and specification of each part of the format (if this ends up being too convoluted, we could defer the specification to another page that we link to).

In the specification of each part, I would highlight the differences between the format and clp-s (e.g., the type differences).

Since this is a big task, I would suggest going piece by piece (e.g., update the intro to clp-s' compression algorithm first) and then we can review that before moving onto to the next section.

LinZhihao-723

Left some comments on the intro part. Will need Kirk to give more detailed feedback on the CLP-S intro section.
Also reviewed other modified parts to correct some terms/concepts.
It would be helpful if u could run a spelling check before asking for review comments.

LinZhihao-723 · 2025-02-13T03:43:06Z

+units and log event IR units, one set for each log message. The log event unit contains a set of 
+[key and value IR packets](#auto-generated-schema-tree-node-id-value-pairs) for the auto 
+generated schema nodes, and then a set of 
+[key IR packets and corresponding value IR packets](#user-generated-schema-tree-node-id-value-pairs)
+ for each user generated key value pair in the log entry (???). To differnetiate between the auto 
+ and user generated node ids we use the 1's complement to encode the auto generated node ids. 
+ Rather than organizing the IR units so that each value comes directly after its key, we group the 
+ keys and values together. We found that in practice the similarities between keys and between 
+ values allows for better compression ratios when stored together. If a log record does not result 
+ in any new schema tree nodes, there will be no new schema tree growth IR units for that record. 
+ While we stream the IR units we keep the schema tree we’ve built in memory for easy reference, so 
+ all packets in the stream can reference the same set of schema tree nodes. 


As an overview, we might not need to cover these details.

kirkrodrigues · 2025-04-14T12:23:29Z

@AVMatthews Could you do the following?

Re-add the png files from docs/src/dev-guide/images as follows:
- Install git-lfs if you don't have it already.
- Copy the images to a temporary location.
- git rm -f all the images.
- Commit the change.
- Move the images back from the temporary location.
- Add the images again.
- Commit the change.
Review the new docs I added.

AVMatthews

Content looks pretty good to me. Most of my comments are about the details of how it looks or reads. It's hard to say if I think we should add anything until I see the draft of the rest of the content.

AVMatthews · 2025-04-15T20:26:09Z

+:gutter: 2
+
+:::{grid-item-card}
+:link: kv-ir/background


This link is broken ... goes to /dev-guide/kv-ir/kv-ir/background instead of /dev-guide/kv-ir/background.html.

AVMatthews · 2025-04-15T20:27:43Z

+
+The key-value pair internal representation (abbreviated as kv-ir) stream format is a storage format
+for dynamically structured (e.g. JSON) logs. Compared to a JSON file, the kv-ir stream format is
+smaller and faster for clp-s to compress. Compared to clp-s' archive format, the kv-ir stream format


I know kv-ir is smaller, but is it faster. Do we have updated numbers on that. If so we should make sure to put that in the blog or somewhere.

AVMatthews · 2025-04-15T20:29:31Z

+
+To understand the format, it's necessary to briefly understand:
+
+* how clp-s encodes dynamically-structured log events and serializes them into an archive;


Maybe capitalize the start of all the bullets.

AVMatthews · 2025-04-15T20:31:16Z

+## Encoding & storing structured log events in an archive
+
+The processing of encoding a log event can conceptually be broken down into the following steps.
+


I'd add a list of the steps here and link to the below sections. That way they can be skimmed before getting into details.

AVMatthews · 2025-04-15T20:33:37Z

+
+### 1. Determine the log event's schema
+
+clp-s first examines the log event's schema--i.e., the key and value type for each kv-pair in the


Should we capitalize clp-s at the beginning of sentences (or in general). It feels a little strange reading it without a capital at the beginning at least.

AVMatthews · 2025-04-15T20:42:13Z

+### 1. Determine the log event's schema
+
+clp-s first examines the log event's schema--i.e., the key and value type for each kv-pair in the
+log event. For example, Figure 1 shows two log events, while Tables 1 & 2 show the schemas for these


Can we add links from the figures and table references to the figures and tables?

AVMatthews · 2025-04-15T20:44:20Z

+"<key>: <type>" and each arrow is from a parent to a child node.
+::::
+
+Notice that although both `level` and `mesage` are strings, clp-s assigns them more specific types


Suggested change

Notice that although both `level` and `mesage` are strings, clp-s assigns them more specific types

Notice that although both `level` and `message` are strings, clp-s assigns them more specific types

AVMatthews · 2025-04-15T20:47:41Z

+  timersObj --> timersStage2Null
+:::
+++
+**Figure 3**: The archive schema tree after adding log event #1 in Figure 1. Each node's label is of


Since this figure is essentially the same as figure 2, I think we could say that after adding the first log event its just the schema tree of log event 1 and then wait to show how the archive schema tree grows below.

AVMatthews · 2025-04-15T20:52:07Z

+**Table 4**: The ERT generated by encoding log event #2 in Figure 1.
+:::
+
+## Parsing & encoding unstructured text


Do we need this background for kv-ir? Is it just to reference how it is different later?

AVMatthews · 2025-04-15T21:00:06Z

+For instance, consider the structured Go log printing statement (LPS) in Figure 6 (that uses the
+[Zap](https://github.com/uber-go/zap) logging library) and its output in Figure 7.
+
+:::{card}   


Generally I think this is a good example, but do auto generated keys always end up in the message like this? How do we distinguish them in the log message and so if you have two timestamps for example do they both show up in the JSON message.

AVMatthews and others added 2 commits January 30, 2025 16:25

first draft IRv2 docs

816f8ee

Merge branch 'y-scope:main' into IRv2-Documentation

4590f18

LinZhihao-723 requested changes Jan 31, 2025

View reviewed changes

AVMatthews and others added 3 commits February 3, 2025 11:50

rename IR format and reformat line length

cb532fc

Rewriting intro

fc54e85

Rewrite design overview.

ad14c83

LinZhihao-723 reviewed Feb 9, 2025

View reviewed changes

LinZhihao-723 added 10 commits February 9, 2025 00:00

...

b3dcedc

Add format specification section

79de30f

Add overview title

a556fc4

Add int and float packet spec

2967fbe

Add packet spec

1894958

Add IR units skeleton

7e73f5c

Add definition for two IR units..

e07ad64

Add Log Event Unit

73b30b2

Add IR stream section.

8aed402

Remove the old IR unit section.

01954fd

kirkrodrigues requested changes Feb 11, 2025

View reviewed changes

reorder sections and itergrate versions part 1: Introduction

2b8c4b0

LinZhihao-723 requested changes Feb 13, 2025

View reviewed changes

AVMatthews and others added 5 commits February 13, 2025 05:14

Update first paragraph and introduction section

c73a92b

Example and explanations added

ca6cfa3

add missing picture

095043e

Merge branch 'main' into IRv2-Documentation

ae1c707

Start new kv-ir docs with new background section.

fcf01bf

AVMatthews added 2 commits April 15, 2025 15:55

remove images temporarily

188f016

Re-add images

003c000

AVMatthews commented Apr 15, 2025

View reviewed changes

kirkrodrigues mentioned this pull request Apr 17, 2025

docs(kv-ir): Add background necessary to understand the kv-ir stream format. #833

Merged

3 tasks

kirkrodrigues mentioned this pull request May 8, 2025

docs(kv-ir): Explain support for auto-generated KV pairs. #888

Merged

3 tasks

junhaoliao added this to the Backlog milestone Jan 19, 2026

junhaoliao force-pushed the main branch from 4505f8c to 8977316 Compare May 13, 2026 22:48


		To understand the format, it's necessary to briefly understand:

		* how clp-s encodes dynamically-structured log events and serializes them into an archive;

		## Encoding & storing structured log events in an archive

		The processing of encoding a log event can conceptually be broken down into the following steps.


		### 1. Determine the log event's schema

		clp-s first examines the log event's schema--i.e., the key and value type for each kv-pair in the

	Notice that although both `level` and `mesage` are strings, clp-s assigns them more specific types
	Notice that although both `level` and `message` are strings, clp-s assigns them more specific types

Conversation

AVMatthews commented Jan 30, 2025 • edited by junhaoliao Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Uh oh!

coderabbitai Bot commented Jan 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

LinZhihao-723 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kirkrodrigues left a comment

Choose a reason for hiding this comment

Uh oh!

LinZhihao-723 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kirkrodrigues commented Apr 14, 2025

Uh oh!

AVMatthews left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

AVMatthews commented Jan 30, 2025 •

edited by junhaoliao

Loading

coderabbitai Bot commented Jan 30, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)