feat: support character based column number by HerringtonDarkholme · Pull Request #1634 · ast-grep/ast-grep

HerringtonDarkholme · 2024-12-01T03:40:21Z

BREAKING CHANGE: column now returns a character-based offset within the line; previously it returned a byte-based offset. Fixes #1594.

Summary by CodeRabbit

New Features
- Introduced a new byte_offset field in the Position struct for improved source code position representation.
- Added get_char_column method to the Content trait for character column calculations, accommodating UTF-8 and UTF-16 encodings.
Bug Fixes
- Updated position calculations in SgNode methods for consistency and clarity.
Tests
- Enhanced test_unicode function to ensure the correctness of column index assertions.
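
To make the breaking change concrete, here is a minimal sketch of how a byte-based column and a character-based column diverge on a line containing a multi-byte character. It uses only the Rust standard library; `columns_of` is a hypothetical helper written for this illustration and is not part of ast-grep's API.

```rust
/// Hypothetical helper: returns (byte_column, char_column) of the first
/// occurrence of `needle` in a single line of UTF-8 text.
fn columns_of(line: &str, needle: char) -> Option<(usize, usize)> {
    // `find` returns a byte index into the UTF-8 string.
    let byte_col = line.find(needle)?;
    // Counting `char`s (Unicode scalar values) before that index
    // gives the character-based column.
    let char_col = line[..byte_col].chars().count();
    Some((byte_col, char_col))
}

fn main() {
    // The crab emoji is 4 bytes in UTF-8, the space is 1 byte.
    let (byte_col, char_col) = columns_of("\u{1F980} x", 'x').expect("line contains x");
    // prints "byte column = 5, char column = 2"
    println!("byte column = {byte_col}, char column = {char_col}");
}
```

Callers that previously sliced the source text with the reported column would, under this change, need a byte offset instead of the column value.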


coderabbitai · 2024-12-01T03:40:28Z

Walkthrough

This pull request adds a byte_offset field to the Position struct in node.rs and updates the constructor and related methods to use it. A new get_char_column method is added to the Content trait, with implementations for the String type and for the Wrapper struct in the napi crate. The position calculations in SgNode are simplified accordingly, and the test cases are updated to validate column reporting for Unicode text.

Changes

| File Path | Change Summary |
| --- | --- |
| crates/core/src/node.rs | Added `byte_offset` field to `Position` struct. Updated `new`, `column`, `start_pos`, and `end_pos` methods to use `byte_offset`. |
| crates/core/src/source.rs | Added `get_char_column` method to `Content` trait and its implementation for `String`. |
| crates/napi/src/doc.rs | Added `get_char_column` method to `Content` trait implementation for `Wrapper`. |
| crates/napi/src/sg_node.rs | Updated `to_pos` and `replace` methods to simplify column calculations. |
| crates/pyo3/tests/test_range.py | Uncommented assertion in `test_unicode` to validate column index of a node's range. |

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Ensure accurate end column reporting for special characters (#1594) | ✅ | |

Possibly related PRs

feat: return char offset in pyo3 API #1178: Introduces a new method get_char_column in the Content trait, related to the new column calculations in the Position struct.
test: add test for unicode range test #1630: Adds tests for the Position struct, ensuring validation of the new functionality introduced in this PR.
refactor: expose new struct Position for abstracting column #1632: Introduces a new Position struct that abstracts column handling, closely related to the modifications made in this PR.

Poem

🐰 In the code where rabbits hop,
New offsets help us never stop.
With every byte, we find our way,
Columns dance and play all day!
Unicode characters, bold and bright,
Make our positions feel just right!


coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (5)

crates/pyo3/tests/test_range.py (1)

51-51: LGTM! Consider adding more test cases.

The uncommented assertion correctly validates the character-based column calculation for Unicode text. This aligns with the PR's objective to fix column reporting for special characters.

Consider adding more test cases with:

Mixed ASCII and non-ASCII characters

Different Unicode character widths (e.g., emojis, combining characters)

Multi-line text with special characters
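
The suggested cases can be sketched in Rust as follows. This is an illustration only: `char_column` is a hypothetical stand-alone helper (not the `get_char_column` method from this PR), it assumes 0-based columns counted in Unicode scalar values, and it assumes `byte_offset` falls on a character boundary.

```rust
/// Hypothetical helper: 0-based column of `byte_offset`, counted in Unicode
/// scalar values from the start of the line that contains it.
/// Panics if `byte_offset` is not on a UTF-8 character boundary.
fn char_column(src: &str, byte_offset: usize) -> usize {
    // The line starts right after the last newline before the offset, or at 0.
    let line_start = src[..byte_offset].rfind('\n').map_or(0, |i| i + 1);
    src[line_start..byte_offset].chars().count()
}

fn main() {
    // Mixed ASCII and non-ASCII: U+00E9 is 2 bytes but one character.
    let mixed = "let \u{e9} = 1;";
    assert_eq!(char_column(mixed, mixed.find('=').unwrap()), 6);

    // Emoji: each crab is 4 bytes in UTF-8.
    let emoji = "\u{1F980}\u{1F980}x";
    assert_eq!(char_column(emoji, emoji.find('x').unwrap()), 2);

    // Combining character: 'e' + U+0301 renders as one glyph but is two
    // scalar values, so the column after it is 2, not 1.
    let combining = "e\u{301}x";
    assert_eq!(char_column(combining, combining.find('x').unwrap()), 2);

    // Multi-line: the column restarts after the newline.
    let multi = "a\n  \u{1F980}b";
    assert_eq!(char_column(multi, multi.find('b').unwrap()), 3);
}
```

The combining-character case shows that a character-based column is not the same as a visual (grapheme) column.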

crates/napi/src/doc.rs (1)

110-113: LGTM! Consider improving documentation.

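The implementation correctly handles UTF-16 column calculation with O(1) complexity. The division by 2 is appropriate because UTF-16 uses 2 bytes per code unit; note that characters outside the Basic Multilingual Plane occupy two code units (a surrogate pair), so the result counts code units rather than code points.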

Consider adding:

Documentation about edge cases (e.g., surrogate pairs)

Example usage in doc comments

Link to UTF-16 specification or reference
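
As a sketch of the surrogate-pair edge case mentioned above: dividing a UTF-16 byte column by 2 yields a count of 16-bit code units, which differs from the code point count for characters outside the BMP. The helpers `to_utf16_bytes` and `utf16_column` are hypothetical names for this illustration, and the little-endian byte layout is an assumption, not something taken from the napi crate.

```rust
/// Hypothetical helper: encodes `text` as UTF-16 and returns the raw bytes
/// (2 bytes per code unit; little-endian chosen for the illustration).
fn to_utf16_bytes(text: &str) -> Vec<u8> {
    text.encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}

/// Hypothetical helper: converts a byte column inside a UTF-16 encoded line
/// into a column counted in UTF-16 code units.
fn utf16_column(byte_column: usize) -> usize {
    byte_column / 2
}

fn main() {
    let line = "\u{1F980}x";
    // Everything before `x` on the line.
    let prefix = &line[..line.find('x').unwrap()];

    // The crab emoji needs a surrogate pair: two code units, four bytes.
    let byte_column = to_utf16_bytes(prefix).len();
    assert_eq!(byte_column, 4);

    // Division by 2 gives code units ...
    assert_eq!(utf16_column(byte_column), 2);
    // ... while the number of code points before `x` is 1.
    assert_eq!(prefix.chars().count(), 1);
}
```

Counting code units matches how JavaScript indexes strings, which may be why it is a natural unit for a Node.js binding; that rationale is an inference, not something stated in this PR.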

crates/core/src/source.rs (3)

149-150: LGTM! Consider documenting return value semantics.

The trait method signature is clean and purpose-clear.

Add documentation specifying:

Return value meaning (0-based or 1-based column number)

Parameter requirements/constraints

195-212: Performance optimization opportunities available.

The implementation correctly handles UTF-8 encoding, but as the TODO comment in the code hints, the backward scan is O(n) in the number of bytes between the start of the line and the offset, which could be slow for very long lines.

Consider these optimizations:

Cache column calculations for frequently accessed positions

Implement the suggested SIMD optimization for parallel byte processing

Add early exit when reaching maximum possible column value

 fn get_char_column(&self, _col: usize, offset: usize) -> usize {
+    // Early exit at the start of the source: nothing precedes offset 0
+    if offset == 0 {
+        return 0;
+    }
     let src = self.as_bytes();
     let mut col = 0;
-    // TODO: is it possible to use SIMD here???
+    // TODO(performance): Implement SIMD optimization
+    // Example: Use wide loads to process multiple bytes in parallel
     for &b in src[..offset].iter().rev() {
       if b == b'\n' {
         break;
       }
       // https://en.wikipedia.org/wiki/UTF-8#Description
       if b & 0b1100_0000 != 0b1000_0000 {
         col += 1;
       }
     }
     col
 }

206-209: Consider adding explanatory comments for the bit manipulation.

The UTF-8 continuation byte check is correct but could benefit from more detailed explanation.

Add comments explaining:

Why this bit pattern identifies UTF-8 continuation bytes

What other UTF-8 byte patterns exist

Link to the UTF-8 specification section about continuation bytes
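
The bit patterns in question can be sketched as follows. In UTF-8, continuation bytes always have the form 10xxxxxx, so masking with 0b1100_0000 and comparing against 0b1000_0000 identifies them; every other byte starts a new character. `Utf8Byte`, `classify`, and `count_chars` are hypothetical names for this illustration, and `classify` looks at the bit pattern only (it does not reject bytes such as 0xC0 that never appear in well-formed UTF-8).

```rust
/// Hypothetical classification of a single byte by its leading bits.
#[derive(Debug, PartialEq)]
enum Utf8Byte {
    Ascii,        // 0xxxxxxx: a complete one-byte character
    Continuation, // 10xxxxxx: second, third or fourth byte of a character
    Lead2,        // 110xxxxx: starts a two-byte character
    Lead3,        // 1110xxxx: starts a three-byte character
    Lead4,        // 11110xxx: starts a four-byte character
    Invalid,      // 11111xxx: never used by UTF-8
}

fn classify(b: u8) -> Utf8Byte {
    match b {
        0x00..=0x7F => Utf8Byte::Ascii,
        0x80..=0xBF => Utf8Byte::Continuation,
        0xC0..=0xDF => Utf8Byte::Lead2,
        0xE0..=0xEF => Utf8Byte::Lead3,
        0xF0..=0xF7 => Utf8Byte::Lead4,
        _ => Utf8Byte::Invalid,
    }
}

/// Counts characters by counting every byte that is not a continuation byte,
/// using the same mask as the reviewed code.
fn count_chars(bytes: &[u8]) -> usize {
    bytes
        .iter()
        .filter(|&&b| b & 0b1100_0000 != 0b1000_0000)
        .count()
}

fn main() {
    // The crab emoji encodes as F0 9F A6 80: one lead byte, three continuations.
    let src = "a\u{1F980}b";
    assert_eq!(classify(src.as_bytes()[1]), Utf8Byte::Lead4);
    assert_eq!(classify(src.as_bytes()[2]), Utf8Byte::Continuation);
    assert_eq!(count_chars(src.as_bytes()), 3);
}
```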

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b87dad7 and aaf2f82.

📒 Files selected for processing (5)

crates/core/src/node.rs (3 hunks)
crates/core/src/source.rs (2 hunks)
crates/napi/src/doc.rs (1 hunks)
crates/napi/src/sg_node.rs (1 hunks)
crates/pyo3/tests/test_range.py (1 hunks)

🔇 Additional comments (3)

crates/core/src/node.rs (2)

22-23: LGTM: Good separation of byte and character-based offsets

The addition of byte_offset field helps distinguish between byte-based and character-based positions, which is crucial for correct handling of Unicode characters.

759-765: LGTM: Comprehensive Unicode test coverage

The added test cases thoroughly verify the handling of multi-byte Unicode characters (🦀) in different contexts, including indentation.

crates/napi/src/sg_node.rs (1)

46-46: LGTM: Simplified column calculation

The removal of byte-based division aligns with the transition to character-based column reporting. This change correctly uses the new column method that handles Unicode characters properly.

crates/core/src/node.rs

codecov · 2024-12-01T03:42:52Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.08%. Comparing base (b87dad7) to head (aaf2f82).
Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1634      +/-   ##
==========================================
+ Coverage   87.00%   87.08%   +0.07%     
==========================================
  Files          95       95              
  Lines       15310    15334      +24     
==========================================
+ Hits        13320    13353      +33     
+ Misses       1990     1981       -9


feat: support character based column number

aaf2f82


coderabbitai bot reviewed Dec 1, 2024


HerringtonDarkholme added this pull request to the merge queue Dec 1, 2024

Merged via the queue into main with commit 1266fc7 Dec 1, 2024

HerringtonDarkholme deleted the fix-range branch December 1, 2024 04:28

coderabbitai bot mentioned this pull request Dec 2, 2024

feat: Add support for range matcher #1626

Merged

coderabbitai bot mentioned this pull request Dec 14, 2024

fix: rename range rule's row to line #1664

Merged

HerringtonDarkholme merged 1 commit into main from fix-range