feat: support character based column number#1634
Conversation
BREAKING CHANGE: now column returns character based offset in line previously it returns byte based offset. fix #1594
WalkthroughThe pull request introduces significant modifications to the Changes
Assessment against linked issues
Possibly related PRs
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (5)
crates/pyo3/tests/test_range.py (1)
51-51: LGTM! Consider adding more test cases.The uncommented assertion correctly validates the character-based column calculation for Unicode text. This aligns with the PR's objective to fix column reporting for special characters.
Consider adding more test cases with:
- Mixed ASCII and non-ASCII characters
- Different Unicode character widths (e.g., emojis, combining characters)
- Multi-line text with special characters
crates/napi/src/doc.rs (1)
110-113: LGTM! Consider improving documentation.The implementation correctly handles UTF-16 character column calculation with O(1) complexity. The division by 2 is appropriate as UTF-16 uses 2 bytes per character.
Consider adding:
- Documentation about edge cases (e.g., surrogate pairs)
- Example usage in doc comments
- Link to UTF-16 specification or reference
crates/core/src/source.rs (3)
149-150: LGTM! Consider documenting return value semantics.The trait method signature is clean and purpose-clear.
Add documentation specifying:
- Return value meaning (0-based or 1-based column number)
- Parameter requirements/constraints
195-212: Performance optimization opportunities available.The implementation correctly handles UTF-8 encoding, but as noted in the comments, the O(n) complexity could be problematic for large files.
Consider these optimizations:
- Cache column calculations for frequently accessed positions
- Implement the suggested SIMD optimization for parallel byte processing
- Add early exit when reaching maximum possible column value
fn get_char_column(&self, _col: usize, offset: usize) -> usize { + // Early exit for empty lines + if offset == 0 { + return 0; + } let src = self.as_bytes(); let mut col = 0; - // TODO: is it possible to use SIMD here??? + // TODO(performance): Implement SIMD optimization + // Example: Use wide loads to process multiple bytes in parallel for &b in src[..offset].iter().rev() { if b == b'\n' { break; } // https://en.wikipedia.org/wiki/UTF-8#Description if b & 0b1100_0000 != 0b1000_0000 { col += 1; } } col }
206-209: Consider adding explanatory comments for the bit manipulation.The UTF-8 continuation byte check is correct but could benefit from more detailed explanation.
Add comments explaining:
- Why this bit pattern identifies UTF-8 continuation bytes
- What other UTF-8 byte patterns exist
- Link to the UTF-8 specification section about continuation bytes
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
crates/core/src/node.rs(3 hunks)crates/core/src/source.rs(2 hunks)crates/napi/src/doc.rs(1 hunks)crates/napi/src/sg_node.rs(1 hunks)crates/pyo3/tests/test_range.py(1 hunks)
🔇 Additional comments (3)
crates/core/src/node.rs (2)
22-23: LGTM: Good separation of byte and character-based offsets
The addition of byte_offset field helps distinguish between byte-based and character-based positions, which is crucial for correct handling of Unicode characters.
759-765: LGTM: Comprehensive Unicode test coverage
The added test cases thoroughly verify the handling of multi-byte Unicode characters (🦀) in different contexts, including indentation.
crates/napi/src/sg_node.rs (1)
46-46: LGTM: Simplified column calculation
The removal of byte-based division aligns with the transition to character-based column reporting. This change correctly uses the new column method that handles Unicode characters properly.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1634 +/- ##
==========================================
+ Coverage 87.00% 87.08% +0.07%
==========================================
Files 95 95
Lines 15310 15334 +24
==========================================
+ Hits 13320 13353 +33
+ Misses 1990 1981 -9 ☔ View full report in Codecov by Sentry. |
BREAKING CHANGE: now column returns character based offset in line previously it returns byte based offset. fix #1594
Summary by CodeRabbit
New Features
byte_offsetfield in thePositionstruct for improved source code position representation.get_char_columnmethod to theContenttrait for character column calculations, accommodating UTF-8 and UTF-16 encodings.Bug Fixes
SgNodemethods for consistency and clarity.Tests
test_unicodefunction to ensure the correctness of column index assertions.