feat(core): add UTF-8 boundary validation to Fix constructors (#463)#505
feat(core): add UTF-8 boundary validation to Fix constructors (#463)#505
Conversation
Add debug_assert!(start <= end) to Fix::replace, replace_with_confidence, delete, and delete_with_confidence to catch reversed byte ranges in debug builds. Introduce six new _checked constructor variants (replace_checked, replace_with_confidence_checked, insert_checked, insert_with_confidence_checked, delete_checked, delete_with_confidence_checked) that additionally validate UTF-8 char boundary alignment via debug_assert!. These are zero-cost in release builds and delegate to the unchecked originals. Add 24 tests covering reversed-range panics, valid boundary usage, mid-codepoint panics (2-byte and 4-byte), and out-of-bounds detection.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Summary of ChangesHello @avifenesh, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the robustness of Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request is a great improvement for robustness. Adding debug_assert! for range checks in existing Fix constructors and introducing new _checked variants for UTF-8 boundary validation will help catch bugs early during development. The test coverage for these new additions is thorough. I have one suggestion to reduce code duplication in the newly added _checked methods by using private helper functions.
crates/agnix-core/src/diagnostics.rs
Outdated
| /// Create a replacement fix, asserting UTF-8 char boundary alignment in debug builds. | ||
| /// | ||
| /// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`. | ||
| /// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::replace`] when `content` is not available. | ||
| pub fn replace_checked( | ||
| content: &str, | ||
| start: usize, | ||
| end: usize, | ||
| replacement: impl Into<String>, | ||
| description: impl Into<String>, | ||
| safe: bool, | ||
| ) -> Self { | ||
| debug_assert!( | ||
| start <= end, | ||
| "Fix::replace_checked: start_byte ({start}) must be <= end_byte ({end})" | ||
| ); | ||
| debug_assert!( | ||
| start <= content.len(), | ||
| "Fix::replace_checked: start_byte ({start}) is out of bounds (len={})", content.len() | ||
| ); | ||
| debug_assert!( | ||
| content.is_char_boundary(start), | ||
| "Fix::replace_checked: start_byte ({start}) is not on a UTF-8 char boundary" | ||
| ); | ||
| debug_assert!( | ||
| end <= content.len(), | ||
| "Fix::replace_checked: end_byte ({end}) is out of bounds (len={})", content.len() | ||
| ); | ||
| debug_assert!( | ||
| content.is_char_boundary(end), | ||
| "Fix::replace_checked: end_byte ({end}) is not on a UTF-8 char boundary" | ||
| ); | ||
| Self::replace(start, end, replacement, description, safe) | ||
| } | ||
|
|
||
| /// Create a replacement fix with explicit confidence, asserting UTF-8 char boundary | ||
| /// alignment in debug builds. | ||
| /// | ||
| /// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`. | ||
| /// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::replace_with_confidence`] when | ||
| /// `content` is not available. | ||
| pub fn replace_with_confidence_checked( | ||
| content: &str, | ||
| start: usize, | ||
| end: usize, | ||
| replacement: impl Into<String>, | ||
| description: impl Into<String>, | ||
| confidence: f32, | ||
| ) -> Self { | ||
| debug_assert!( | ||
| start <= end, | ||
| "Fix::replace_with_confidence_checked: start_byte ({start}) must be <= end_byte ({end})" | ||
| ); | ||
| debug_assert!( | ||
| start <= content.len(), | ||
| "Fix::replace_with_confidence_checked: start_byte ({start}) is out of bounds (len={})", content.len() | ||
| ); | ||
| debug_assert!( | ||
| content.is_char_boundary(start), | ||
| "Fix::replace_with_confidence_checked: start_byte ({start}) is not on a UTF-8 char boundary" | ||
| ); | ||
| debug_assert!( | ||
| end <= content.len(), | ||
| "Fix::replace_with_confidence_checked: end_byte ({end}) is out of bounds (len={})", content.len() | ||
| ); | ||
| debug_assert!( | ||
| content.is_char_boundary(end), | ||
| "Fix::replace_with_confidence_checked: end_byte ({end}) is not on a UTF-8 char boundary" | ||
| ); | ||
| Self::replace_with_confidence(start, end, replacement, description, confidence) | ||
| } | ||
|
|
||
| /// Create an insertion fix, asserting UTF-8 char boundary alignment in debug builds. | ||
| /// | ||
| /// Validates that `position` lands on a UTF-8 char boundary in `content`. | ||
| /// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::insert`] when `content` is not available. | ||
| pub fn insert_checked( | ||
| content: &str, | ||
| position: usize, | ||
| text: impl Into<String>, | ||
| description: impl Into<String>, | ||
| safe: bool, | ||
| ) -> Self { | ||
| debug_assert!( | ||
| position <= content.len(), | ||
| "Fix::insert_checked: position ({position}) is out of bounds (len={})", content.len() | ||
| ); | ||
| debug_assert!( | ||
| content.is_char_boundary(position), | ||
| "Fix::insert_checked: position ({position}) is not on a UTF-8 char boundary" | ||
| ); | ||
| Self::insert(position, text, description, safe) | ||
| } | ||
|
|
||
| /// Create an insertion fix with explicit confidence, asserting UTF-8 char boundary | ||
| /// alignment in debug builds. | ||
| /// | ||
| /// Validates that `position` lands on a UTF-8 char boundary in `content`. | ||
| /// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::insert_with_confidence`] when | ||
| /// `content` is not available. | ||
| pub fn insert_with_confidence_checked( | ||
| content: &str, | ||
| position: usize, | ||
| text: impl Into<String>, | ||
| description: impl Into<String>, | ||
| confidence: f32, | ||
| ) -> Self { | ||
| debug_assert!( | ||
| position <= content.len(), | ||
| "Fix::insert_with_confidence_checked: position ({position}) is out of bounds (len={})", content.len() | ||
| ); | ||
| debug_assert!( | ||
| content.is_char_boundary(position), | ||
| "Fix::insert_with_confidence_checked: position ({position}) is not on a UTF-8 char boundary" | ||
| ); | ||
| Self::insert_with_confidence(position, text, description, confidence) | ||
| } | ||
|
|
||
| /// Create a deletion fix, asserting UTF-8 char boundary alignment in debug builds. | ||
| /// | ||
| /// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`. | ||
| /// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::delete`] when `content` is not available. | ||
| pub fn delete_checked( | ||
| content: &str, | ||
| start: usize, | ||
| end: usize, | ||
| description: impl Into<String>, | ||
| safe: bool, | ||
| ) -> Self { | ||
| debug_assert!( | ||
| start <= end, | ||
| "Fix::delete_checked: start_byte ({start}) must be <= end_byte ({end})" | ||
| ); | ||
| debug_assert!( | ||
| start <= content.len(), | ||
| "Fix::delete_checked: start_byte ({start}) is out of bounds (len={})", content.len() | ||
| ); | ||
| debug_assert!( | ||
| content.is_char_boundary(start), | ||
| "Fix::delete_checked: start_byte ({start}) is not on a UTF-8 char boundary" | ||
| ); | ||
| debug_assert!( | ||
| end <= content.len(), | ||
| "Fix::delete_checked: end_byte ({end}) is out of bounds (len={})", content.len() | ||
| ); | ||
| debug_assert!( | ||
| content.is_char_boundary(end), | ||
| "Fix::delete_checked: end_byte ({end}) is not on a UTF-8 char boundary" | ||
| ); | ||
| Self::delete(start, end, description, safe) | ||
| } | ||
|
|
||
| /// Create a deletion fix with explicit confidence, asserting UTF-8 char boundary | ||
| /// alignment in debug builds. | ||
| /// | ||
| /// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`. | ||
| /// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::delete_with_confidence`] when | ||
| /// `content` is not available. | ||
| pub fn delete_with_confidence_checked( | ||
| content: &str, | ||
| start: usize, | ||
| end: usize, | ||
| description: impl Into<String>, | ||
| confidence: f32, | ||
| ) -> Self { | ||
| debug_assert!( | ||
| start <= end, | ||
| "Fix::delete_with_confidence_checked: start_byte ({start}) must be <= end_byte ({end})" | ||
| ); | ||
| debug_assert!( | ||
| start <= content.len(), | ||
| "Fix::delete_with_confidence_checked: start_byte ({start}) is out of bounds (len={})", content.len() | ||
| ); | ||
| debug_assert!( | ||
| content.is_char_boundary(start), | ||
| "Fix::delete_with_confidence_checked: start_byte ({start}) is not on a UTF-8 char boundary" | ||
| ); | ||
| debug_assert!( | ||
| end <= content.len(), | ||
| "Fix::delete_with_confidence_checked: end_byte ({end}) is out of bounds (len={})", content.len() | ||
| ); | ||
| debug_assert!( | ||
| content.is_char_boundary(end), | ||
| "Fix::delete_with_confidence_checked: end_byte ({end}) is not on a UTF-8 char boundary" | ||
| ); | ||
| Self::delete_with_confidence(start, end, description, confidence) | ||
| } |
There was a problem hiding this comment.
There was a problem hiding this comment.
Pull request overview
Adds debug-only validation to Fix construction to catch invalid byte ranges/UTF-8 boundary issues earlier (during fix creation), while keeping the existing runtime safety net at fix-application time.
Changes:
- Added
debug_assert!(start <= end)to range-basedFixconstructors (replace*,delete*). - Introduced 6 new
_checkedFixconstructors that validate bounds + UTF-8 char boundaries against providedcontent: &strin debug builds. - Added comprehensive tests for reversed ranges, out-of-bounds, and mid-codepoint offsets (debug-only panic tests gated by
#[cfg(debug_assertions)]).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| crates/agnix-core/src/diagnostics.rs | Adds new debug assertions, introduces _checked constructors, and adds extensive unit tests for boundary/range validation. |
| CHANGELOG.md | Documents the new _checked constructors and the new debug range assertions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
crates/agnix-core/src/diagnostics.rs
Outdated
| /// Create a replacement fix, asserting UTF-8 char boundary alignment in debug builds. | ||
| /// | ||
| /// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`. | ||
| /// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::replace`] when `content` is not available. | ||
| pub fn replace_checked( | ||
| content: &str, | ||
| start: usize, | ||
| end: usize, | ||
| replacement: impl Into<String>, | ||
| description: impl Into<String>, | ||
| safe: bool, | ||
| ) -> Self { | ||
| debug_assert!( | ||
| start <= end, | ||
| "Fix::replace_checked: start_byte ({start}) must be <= end_byte ({end})" | ||
| ); | ||
| debug_assert!( | ||
| start <= content.len(), | ||
| "Fix::replace_checked: start_byte ({start}) is out of bounds (len={})", content.len() | ||
| ); | ||
| debug_assert!( | ||
| content.is_char_boundary(start), | ||
| "Fix::replace_checked: start_byte ({start}) is not on a UTF-8 char boundary" | ||
| ); | ||
| debug_assert!( | ||
| end <= content.len(), | ||
| "Fix::replace_checked: end_byte ({end}) is out of bounds (len={})", content.len() | ||
| ); | ||
| debug_assert!( | ||
| content.is_char_boundary(end), | ||
| "Fix::replace_checked: end_byte ({end}) is not on a UTF-8 char boundary" | ||
| ); |
There was a problem hiding this comment.
The _checked constructors repeat the same debug-only range/bounds/UTF-8 boundary assertions across multiple methods. Consider extracting a small private helper to reduce duplication and prevent these checks/error messages from drifting over time.
| /// Create a replacement fix, asserting UTF-8 char boundary alignment in debug builds. | |
| /// | |
| /// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`. | |
| /// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::replace`] when `content` is not available. | |
| pub fn replace_checked( | |
| content: &str, | |
| start: usize, | |
| end: usize, | |
| replacement: impl Into<String>, | |
| description: impl Into<String>, | |
| safe: bool, | |
| ) -> Self { | |
| debug_assert!( | |
| start <= end, | |
| "Fix::replace_checked: start_byte ({start}) must be <= end_byte ({end})" | |
| ); | |
| debug_assert!( | |
| start <= content.len(), | |
| "Fix::replace_checked: start_byte ({start}) is out of bounds (len={})", content.len() | |
| ); | |
| debug_assert!( | |
| content.is_char_boundary(start), | |
| "Fix::replace_checked: start_byte ({start}) is not on a UTF-8 char boundary" | |
| ); | |
| debug_assert!( | |
| end <= content.len(), | |
| "Fix::replace_checked: end_byte ({end}) is out of bounds (len={})", content.len() | |
| ); | |
| debug_assert!( | |
| content.is_char_boundary(end), | |
| "Fix::replace_checked: end_byte ({end}) is not on a UTF-8 char boundary" | |
| ); | |
| /// Internal helper for debug-only validation of byte ranges and UTF-8 char boundaries. | |
| /// | |
| /// This is used by `_checked` constructors to keep assertions and their error | |
| /// messages consistent in one place. It is a no-op in release builds since it | |
| /// only contains `debug_assert!` calls. | |
| fn debug_assert_valid_byte_range_for_content( | |
| content: &str, | |
| start: usize, | |
| end: usize, | |
| context: &str, | |
| ) { | |
| debug_assert!( | |
| start <= end, | |
| "{context}: start_byte ({start}) must be <= end_byte ({end})" | |
| ); | |
| debug_assert!( | |
| start <= content.len(), | |
| "{context}: start_byte ({start}) is out of bounds (len={})", | |
| content.len() | |
| ); | |
| debug_assert!( | |
| content.is_char_boundary(start), | |
| "{context}: start_byte ({start}) is not on a UTF-8 char boundary" | |
| ); | |
| debug_assert!( | |
| end <= content.len(), | |
| "{context}: end_byte ({end}) is out of bounds (len={})", | |
| content.len() | |
| ); | |
| debug_assert!( | |
| content.is_char_boundary(end), | |
| "{context}: end_byte ({end}) is not on a UTF-8 char boundary" | |
| ); | |
| } | |
| /// Create a replacement fix, asserting UTF-8 char boundary alignment in debug builds. | |
| /// | |
| /// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`. | |
| /// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::replace`] when `content` is not available. | |
| pub fn replace_checked( | |
| content: &str, | |
| start: usize, | |
| end: usize, | |
| replacement: impl Into<String>, | |
| description: impl Into<String>, | |
| safe: bool, | |
| ) -> Self { | |
| Self::debug_assert_valid_byte_range_for_content(content, start, end, "Fix::replace_checked"); |
avifenesh
left a comment
There was a problem hiding this comment.
Addressed Copilot and Gemini suggestions: extracted private helpers debug_assert_valid_range() and debug_assert_valid_position() to consolidate duplicate debug_assert! logic across the _checked constructors.
Summary
debug_assert!(start <= end)to the 4 range-basedFixconstructors (replace,replace_with_confidence,delete,delete_with_confidence) to catch inverted byte ranges early in development_checkedconstructor variants (replace_checked,replace_with_confidence_checked,insert_checked,insert_with_confidence_checked,delete_checked,delete_with_confidence_checked) that accept acontent: &strparameter and validate UTF-8 char boundary alignmentWhy
Previously, invalid byte offsets were only detected at apply time in
fixes.rs, which silently skipped bad fixes in production. This shifts detection left - validator bugs that produce mid-codepoint offsets or reversed ranges are now caught at construction time in debug builds, matching the existing pattern inspan_utils.rs:497.All checks use
debug_assert!- zero cost in release builds, no change to release behavior. The existing runtime safety net infixes.rsremains as the production backstop.What the
_checkedvariants addThe
_checkedvariants delegate to the base constructors after asserting:start <= end(ordering)start <= content.len()andcontent.is_char_boundary(start)(UTF-8 boundary)end <= content.len()andcontent.is_char_boundary(end)(UTF-8 boundary)Test coverage
éand 4-byte emoji😀), valid boundaries, out-of-bounds, zero-width replacements, and ASCII content#[cfg(debug_assertions)]Closes #463