
feat: support character based column number#1634

Merged
HerringtonDarkholme merged 1 commit into main from fix-range
Dec 1, 2024

Conversation

@HerringtonDarkholme HerringtonDarkholme commented Dec 1, 2024

BREAKING CHANGE: column now returns a character-based offset within the line; previously it returned a byte-based offset. Fixes #1594.
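To illustrate the difference this change makes, here is a minimal sketch that contrasts a byte-based column with a character-based one on a line containing a multi-byte character. The helper names byte_column and char_column are hypothetical and are not part of ast-grep's API.

```rust
/// Byte-based column: number of bytes between the start of the line
/// and `byte_offset` (the behavior before this change).
fn byte_column(src: &str, byte_offset: usize) -> usize {
    let line_start = src[..byte_offset].rfind('\n').map_or(0, |i| i + 1);
    byte_offset - line_start
}

/// Character-based column: number of Unicode scalar values in the same
/// span (the behavior after this change).
fn char_column(src: &str, byte_offset: usize) -> usize {
    let line_start = src[..byte_offset].rfind('\n').map_or(0, |i| i + 1);
    src[line_start..byte_offset].chars().count()
}
```

For the source "let a = 1;\n🦀 crab", the word crab starts at byte column 5 but character column 2, because 🦀 occupies four bytes in UTF-8.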

Summary by CodeRabbit

  • New Features

    • Introduced a new byte_offset field in the Position struct for improved source code position representation.
    • Added get_char_column method to the Content trait for character column calculations, accommodating UTF-8 and UTF-16 encodings.
  • Bug Fixes

    • Updated position calculations in SgNode methods for consistency and clarity.
  • Tests

    • Enhanced test_unicode function to ensure the correctness of column index assertions.


coderabbitai bot commented Dec 1, 2024

Walkthrough

The pull request introduces significant modifications to the Position struct and its methods in the node.rs file, adding a byte_offset field to enhance position representation in source code. The constructor and several methods are updated to accommodate this new field. Additionally, a new get_char_column method is added to the Content trait, affecting both the String type and the Wrapper struct. The SgNode struct is also modified to simplify position calculations. Finally, test cases are updated to validate these changes, particularly concerning Unicode character handling.

Changes

| File Path | Change Summary |
| --- | --- |
| crates/core/src/node.rs | Added byte_offset field to Position struct; updated new, column, start_pos, and end_pos methods to use byte_offset. |
| crates/core/src/source.rs | Added get_char_column method to Content trait and its implementation for String. |
| crates/napi/src/doc.rs | Added get_char_column method to Content trait implementation for Wrapper. |
| crates/napi/src/sg_node.rs | Updated to_pos and replace methods to simplify column calculations. |
| crates/pyo3/tests/test_range.py | Uncommented assertion in test_unicode to validate column index of a node's range. |
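The relationship between the stored byte_offset and the character column can be sketched roughly as follows. This is a simplified illustration, not the actual ast-grep code: the Content trait here carries only the one method, and the Position fields other than byte_offset, along with the generic column signature, are assumptions.

```rust
/// Hypothetical, simplified stand-in for the Content trait; only the
/// method discussed in this PR is sketched.
trait Content {
    fn get_char_column(&self, byte_column: usize, byte_offset: usize) -> usize;
}

impl Content for String {
    fn get_char_column(&self, _byte_column: usize, byte_offset: usize) -> usize {
        // Walk back to the start of the line, counting every byte that is
        // not a UTF-8 continuation byte (bit pattern 10xxxxxx).
        self.as_bytes()[..byte_offset]
            .iter()
            .rev()
            .take_while(|&&b| b != b'\n')
            .filter(|&&b| b & 0b1100_0000 != 0b1000_0000)
            .count()
    }
}

/// Assumed field layout: only byte_offset is confirmed by the summary above.
struct Position {
    row: usize,
    byte_column: usize,
    byte_offset: usize,
}

impl Position {
    fn new(row: usize, byte_column: usize, byte_offset: usize) -> Self {
        Self { row, byte_column, byte_offset }
    }

    fn row(&self) -> usize {
        self.row
    }

    /// The character column is derived on demand from the content,
    /// so the encoding-specific logic stays inside the Content impl.
    fn column<C: Content>(&self, content: &C) -> usize {
        content.get_char_column(self.byte_column, self.byte_offset)
    }
}
```

Keeping the byte offset in the position and delegating the conversion to the content type lets each encoding (UTF-8 for String, UTF-16 for the napi Wrapper) supply its own calculation.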

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Ensure accurate end column reporting for special characters (#1594) | | |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (5)
crates/pyo3/tests/test_range.py (1)

51-51: LGTM! Consider adding more test cases.

The uncommented assertion correctly validates the character-based column calculation for Unicode text. This aligns with the PR's objective to fix column reporting for special characters.

Consider adding more test cases with:

  • Mixed ASCII and non-ASCII characters
  • Different Unicode character widths (e.g., emojis, combining characters)
  • Multi-line text with special characters
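The suggested cases could be sketched as a small table-driven check. The project's real test lives in Python (test_range.py); this Rust version is only an illustration, and char_column, CASES, and check_cases are hypothetical names. It counts Unicode scalar values, so a combining sequence such as e followed by U+0301 counts as two columns, not one.

```rust
/// Character column of `byte_offset`: counts non-continuation bytes
/// back to the previous newline (or the start of the source).
fn char_column(src: &str, byte_offset: usize) -> usize {
    src.as_bytes()[..byte_offset]
        .iter()
        .rev()
        .take_while(|&&b| b != b'\n')
        .filter(|&&b| b & 0b1100_0000 != 0b1000_0000)
        .count()
}

/// (source, needle, expected character column of the needle)
const CASES: &[(&str, char, usize)] = &[
    ("héllo x", 'x', 6),         // mixed ASCII and non-ASCII
    ("🦀🦀 x", 'x', 3),          // emoji, four bytes each in UTF-8
    ("e\u{301} x", 'x', 3),      // combining character: two scalar values
    ("first\n日本語 x", 'x', 4), // multi-line text with special characters
];

fn check_cases() {
    for &(src, needle, expected) in CASES {
        let offset = src.find(needle).expect("needle must occur in the source");
        assert_eq!(char_column(src, offset), expected, "source: {src:?}");
    }
}
```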
crates/napi/src/doc.rs (1)

110-113: LGTM! Consider improving documentation.

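The implementation computes the UTF-16 column in O(1). Dividing by 2 converts a byte offset into a UTF-16 code-unit offset, because each UTF-16 code unit is 2 bytes wide; note that characters outside the Basic Multilingual Plane occupy two code units (a surrogate pair).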

Consider adding:

  • Documentation about edge cases (e.g., surrogate pairs)
  • Example usage in doc comments
  • Link to UTF-16 specification or reference
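The surrogate-pair edge case can be made concrete with a short sketch. Both functions below are hypothetical and are not the Wrapper implementation; they assume the source is held as a slice of UTF-16 code units and that positions arrive as byte offsets into that data.

```rust
/// Column measured in UTF-16 code units. Each code unit is 2 bytes,
/// so the byte offset is halved first.
fn utf16_unit_column(units: &[u16], byte_offset: usize) -> usize {
    let unit_offset = byte_offset / 2;
    units[..unit_offset]
        .iter()
        .rev()
        .take_while(|&&u| u != u16::from(b'\n'))
        .count()
}

/// Column measured in Unicode scalar values: low (trailing) surrogates
/// are skipped so that a surrogate pair counts once.
fn utf16_char_column(units: &[u16], byte_offset: usize) -> usize {
    let unit_offset = byte_offset / 2;
    units[..unit_offset]
        .iter()
        .rev()
        .take_while(|&&u| u != u16::from(b'\n'))
        .filter(|&&u| !(0xDC00..=0xDFFF).contains(&u))
        .count()
}
```

For "a\n🦀x" encoded as UTF-16, x sits at byte offset 8; its column is 2 in code units but 1 in scalar values, because 🦀 is encoded as a surrogate pair.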
crates/core/src/source.rs (3)

149-150: LGTM! Consider documenting return value semantics.

The trait method signature is clean and purpose-clear.

Add documentation specifying:

  • Return value meaning (0-based or 1-based column number)
  • Parameter requirements/constraints
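One way such documentation might look is sketched below. The trait name CharColumn and the parameter name byte_column are hypothetical, and the stated constraints are assumptions made for the illustration; they are not taken from the ast-grep source.

```rust
/// Hypothetical trait mirroring the shape of the reviewed method.
trait CharColumn {
    /// Returns the 0-based character column of the position at `offset`.
    ///
    /// * `byte_column`: 0-based byte column of the position within its
    ///   line; must not exceed `offset`.
    /// * `offset`: byte offset from the start of the source; must be at
    ///   most the source length and lie on a character boundary.
    fn get_char_column(&self, byte_column: usize, offset: usize) -> usize;
}

impl CharColumn for str {
    fn get_char_column(&self, byte_column: usize, offset: usize) -> usize {
        debug_assert!(byte_column <= offset);
        debug_assert!(self.is_char_boundary(offset));
        // The line starts `byte_column` bytes before `offset`.
        self[offset - byte_column..offset].chars().count()
    }
}
```

Stating the 0-based convention and the boundary requirement in the doc comment lets implementers for other encodings follow the same contract.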

195-212: Performance optimization opportunities available.

The implementation correctly handles UTF-8 encoding, but as noted in the comments, the O(n) complexity could be problematic for large files.

Consider these optimizations:

  1. Cache column calculations for frequently accessed positions
  2. Implement the suggested SIMD optimization for parallel byte processing
  3. Add early exit when reaching maximum possible column value
 fn get_char_column(&self, _col: usize, offset: usize) -> usize {
+    // Early exit at offset 0 (start of the source), where the column is 0
+    if offset == 0 {
+        return 0;
+    }
     let src = self.as_bytes();
     let mut col = 0;
-    // TODO: is it possible to use SIMD here???
+    // TODO(performance): Implement SIMD optimization
+    // Example: Use wide loads to process multiple bytes in parallel
     for &b in src[..offset].iter().rev() {
       if b == b'\n' {
         break;
       }
       // https://en.wikipedia.org/wiki/UTF-8#Description
       if b & 0b1100_0000 != 0b1000_0000 {
         col += 1;
       }
     }
     col
 }
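Short of hand-written SIMD, one possible direction is to split the work into two simple passes over the line: locate the line start, then count non-continuation bytes with a branch-free predicate. The sketch below is an assumption-laden illustration (char_column_two_pass is a hypothetical name); whether the compiler actually vectorizes it, and whether it is faster in practice, has not been measured here.

```rust
/// Character column of `offset` in UTF-8 bytes, computed in two passes.
fn char_column_two_pass(src: &[u8], offset: usize) -> usize {
    // Pass 1: find the byte just after the previous newline, if any.
    let line_start = src[..offset]
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    // Pass 2: count bytes that start a character. Continuation bytes
    // (0x80..=0xBF) are -128..=-65 when viewed as i8, so everything
    // at or above -64 is a leading or ASCII byte.
    src[line_start..offset]
        .iter()
        .filter(|&&b| (b as i8) >= -0x40)
        .count()
}
```

The result matches the loop in the diff above for any offset; only the shape of the iteration changes.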

206-209: Consider adding explanatory comments for the bit manipulation.

The UTF-8 continuation byte check is correct but could benefit from more detailed explanation.

Add comments explaining:

  • Why this bit pattern identifies UTF-8 continuation bytes
  • What other UTF-8 byte patterns exist
  • Link to the UTF-8 specification section about continuation bytes
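As a rough illustration of what such comments would explain, the sketch below classifies every possible byte by its leading bits. Utf8ByteKind, classify, and is_continuation are hypothetical names introduced for this example.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Utf8ByteKind {
    Ascii,        // 0xxxxxxx: a complete one-byte character
    Continuation, // 10xxxxxx: second, third, or fourth byte of a character
    Lead2,        // 110xxxxx: first byte of a two-byte character
    Lead3,        // 1110xxxx: first byte of a three-byte character
    Lead4,        // 11110xxx: first byte of a four-byte character
    Invalid,      // 11111xxx: never appears in UTF-8
}

/// Classifies a byte by bit pattern only. Some lead patterns
/// (0xC0, 0xC1, 0xF5..=0xF7) never occur in well-formed UTF-8.
fn classify(b: u8) -> Utf8ByteKind {
    match b {
        0x00..=0x7F => Utf8ByteKind::Ascii,
        0x80..=0xBF => Utf8ByteKind::Continuation,
        0xC0..=0xDF => Utf8ByteKind::Lead2,
        0xE0..=0xEF => Utf8ByteKind::Lead3,
        0xF0..=0xF7 => Utf8ByteKind::Lead4,
        _ => Utf8ByteKind::Invalid,
    }
}

/// The check used in the reviewed code: mask the top two bits and
/// compare them with 10.
fn is_continuation(b: u8) -> bool {
    b & 0b1100_0000 == 0b1000_0000
}
```

Because every character has exactly one non-continuation byte, counting the bytes for which is_continuation is false yields the number of characters.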
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b87dad7 and aaf2f82.

📒 Files selected for processing (5)
  • crates/core/src/node.rs (3 hunks)
  • crates/core/src/source.rs (2 hunks)
  • crates/napi/src/doc.rs (1 hunks)
  • crates/napi/src/sg_node.rs (1 hunks)
  • crates/pyo3/tests/test_range.py (1 hunks)
🔇 Additional comments (3)
crates/core/src/node.rs (2)

22-23: LGTM: Good separation of byte and character-based offsets

The addition of byte_offset field helps distinguish between byte-based and character-based positions, which is crucial for correct handling of Unicode characters.


759-765: LGTM: Comprehensive Unicode test coverage

The added test cases thoroughly verify the handling of multi-byte Unicode characters (🦀) in different contexts, including indentation.

crates/napi/src/sg_node.rs (1)

46-46: LGTM: Simplified column calculation

The removal of byte-based division aligns with the transition to character-based column reporting. This change correctly uses the new column method that handles Unicode characters properly.


codecov bot commented Dec 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.08%. Comparing base (b87dad7) to head (aaf2f82).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1634      +/-   ##
==========================================
+ Coverage   87.00%   87.08%   +0.07%     
==========================================
  Files          95       95              
  Lines       15310    15334      +24     
==========================================
+ Hits        13320    13353      +33     
+ Misses       1990     1981       -9     


@HerringtonDarkholme HerringtonDarkholme added this pull request to the merge queue Dec 1, 2024
Merged via the queue into main with commit 1266fc7 Dec 1, 2024
@HerringtonDarkholme HerringtonDarkholme deleted the fix-range branch December 1, 2024 04:28
Development

Successfully merging this pull request may close these issues.

[bug] ast-grep scan: special characters lead to different end columns
