Skip to content

feat(core): add UTF-8 boundary validation to Fix constructors (#463)#505

Merged
avifenesh merged 6 commits intomainfrom
feature/utf8-fix-constructors-463
Feb 18, 2026
Merged

feat(core): add UTF-8 boundary validation to Fix constructors (#463)#505
avifenesh merged 6 commits intomainfrom
feature/utf8-fix-constructors-463

Conversation

@avifenesh
Copy link
Collaborator

Summary

  • Add debug_assert!(start <= end) to the 4 range-based Fix constructors (replace, replace_with_confidence, delete, delete_with_confidence) to catch inverted byte ranges early in development
  • Add 6 new _checked constructor variants (replace_checked, replace_with_confidence_checked, insert_checked, insert_with_confidence_checked, delete_checked, delete_with_confidence_checked) that accept a content: &str parameter and validate UTF-8 char boundary alignment

Why

Previously, invalid byte offsets were only detected at apply time in fixes.rs, which silently skipped bad fixes in production. This shifts detection left - validator bugs that produce mid-codepoint offsets or reversed ranges are now caught at construction time in debug builds, matching the existing pattern in span_utils.rs:497.

All checks use debug_assert! - zero cost in release builds, no change to release behavior. The existing runtime safety net in fixes.rs remains as the production backstop.

What the _checked variants add

// Existing (base constructors still work, now have start <= end check):
Fix::replace(start, end, replacement, description, safe)

// New - validates UTF-8 boundaries when content is available:
Fix::replace_checked(content, start, end, replacement, description, safe)
Fix::insert_checked(content, position, text, description, safe)
Fix::delete_checked(content, start, end, description, safe)
// Plus _with_confidence variants for all three

The _checked variants delegate to the base constructors after asserting:

  1. start <= end (ordering)
  2. start <= content.len() and content.is_char_boundary(start) (UTF-8 boundary)
  3. end <= content.len() and content.is_char_boundary(end) (UTF-8 boundary)

Test coverage

  • 34 new tests covering: reversed ranges, mid-codepoint offsets (2-byte é and 4-byte emoji 😀), valid boundaries, out-of-bounds, zero-width replacements, and ASCII content
  • Panic-path tests are gated under #[cfg(debug_assertions)]
  • 2705 debug tests pass, 2931 release tests pass (24 debug-only tests compile out cleanly)

Closes #463

Add debug_assert!(start <= end) to Fix::replace, replace_with_confidence,
delete, and delete_with_confidence to catch reversed byte ranges in debug
builds.

Introduce six new _checked constructor variants (replace_checked,
replace_with_confidence_checked, insert_checked,
insert_with_confidence_checked, delete_checked,
delete_with_confidence_checked) that additionally validate UTF-8 char
boundary alignment via debug_assert!. These are zero-cost in release
builds and delegate to the unchecked originals.

Add 24 tests covering reversed-range panics, valid boundary usage,
mid-codepoint panics (2-byte and 4-byte), and out-of-bounds detection.
Copilot AI review requested due to automatic review settings February 18, 2026 02:22
@chatgpt-codex-connector
Copy link

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @avifenesh, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the robustness of Fix object creation by shifting validation left in the development cycle. It introduces debug-only assertions to catch common errors like inverted byte ranges and misaligned UTF-8 character boundaries at construction time, rather than silently failing at apply time in production. This change helps developers identify and fix issues earlier, leading to more reliable code without impacting the performance of release builds.

Highlights

  • Enhanced Fix Constructor Validation: Added debug_assert!(start <= end) to four existing range-based Fix constructors (replace, replace_with_confidence, delete, delete_with_confidence) to catch inverted byte ranges early in debug builds.
  • New UTF-8 Boundary Checked Constructors: Introduced six new _checked Fix constructor variants (replace_checked, replace_with_confidence_checked, insert_checked, insert_with_confidence_checked, delete_checked, delete_with_confidence_checked) that accept a content: &str parameter and validate UTF-8 character boundary alignment in debug builds.
  • Debug-Only Assertions: All new validations are implemented using debug_assert!, meaning they incur zero cost in release builds and do not alter release behavior, maintaining the existing runtime safety net for production.
  • Comprehensive Test Coverage: Added 34 new tests covering reversed ranges, mid-codepoint offsets (for 2-byte and 4-byte UTF-8 characters), valid boundaries, out-of-bounds scenarios, zero-width replacements, and ASCII content, with panic-path tests gated under #[cfg(debug_assertions)].
Changelog
  • CHANGELOG.md
    • Added a new feature entry for the UTF-8 boundary _checked Fix constructors.
    • Added a new fix entry for the Fix constructor range assertions.
  • crates/agnix-core/src/diagnostics.rs
    • Added debug_assert!(start <= end) to Fix::replace, Fix::replace_with_confidence, Fix::delete, and Fix::delete_with_confidence constructors.
    • Introduced replace_checked, replace_with_confidence_checked, insert_checked, insert_with_confidence_checked, delete_checked, and delete_with_confidence_checked methods for Fix.
    • Added comprehensive test cases for debug_assert! panics and _checked constructor validations, including UTF-8 boundary checks and out-of-bounds scenarios.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a great improvement for robustness. Adding debug_assert! for range checks in existing Fix constructors and introducing new _checked variants for UTF-8 boundary validation will help catch bugs early during development. The test coverage for these new additions is thorough. I have one suggestion to reduce code duplication in the newly added _checked methods by using private helper functions.

Comment on lines +184 to +370
/// Create a replacement fix, asserting UTF-8 char boundary alignment in debug builds.
///
/// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`.
/// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::replace`] when `content` is not available.
pub fn replace_checked(
content: &str,
start: usize,
end: usize,
replacement: impl Into<String>,
description: impl Into<String>,
safe: bool,
) -> Self {
debug_assert!(
start <= end,
"Fix::replace_checked: start_byte ({start}) must be <= end_byte ({end})"
);
debug_assert!(
start <= content.len(),
"Fix::replace_checked: start_byte ({start}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(start),
"Fix::replace_checked: start_byte ({start}) is not on a UTF-8 char boundary"
);
debug_assert!(
end <= content.len(),
"Fix::replace_checked: end_byte ({end}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(end),
"Fix::replace_checked: end_byte ({end}) is not on a UTF-8 char boundary"
);
Self::replace(start, end, replacement, description, safe)
}

/// Create a replacement fix with explicit confidence, asserting UTF-8 char boundary
/// alignment in debug builds.
///
/// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`.
/// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::replace_with_confidence`] when
/// `content` is not available.
pub fn replace_with_confidence_checked(
content: &str,
start: usize,
end: usize,
replacement: impl Into<String>,
description: impl Into<String>,
confidence: f32,
) -> Self {
debug_assert!(
start <= end,
"Fix::replace_with_confidence_checked: start_byte ({start}) must be <= end_byte ({end})"
);
debug_assert!(
start <= content.len(),
"Fix::replace_with_confidence_checked: start_byte ({start}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(start),
"Fix::replace_with_confidence_checked: start_byte ({start}) is not on a UTF-8 char boundary"
);
debug_assert!(
end <= content.len(),
"Fix::replace_with_confidence_checked: end_byte ({end}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(end),
"Fix::replace_with_confidence_checked: end_byte ({end}) is not on a UTF-8 char boundary"
);
Self::replace_with_confidence(start, end, replacement, description, confidence)
}

/// Create an insertion fix, asserting UTF-8 char boundary alignment in debug builds.
///
/// Validates that `position` lands on a UTF-8 char boundary in `content`.
/// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::insert`] when `content` is not available.
pub fn insert_checked(
content: &str,
position: usize,
text: impl Into<String>,
description: impl Into<String>,
safe: bool,
) -> Self {
debug_assert!(
position <= content.len(),
"Fix::insert_checked: position ({position}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(position),
"Fix::insert_checked: position ({position}) is not on a UTF-8 char boundary"
);
Self::insert(position, text, description, safe)
}

/// Create an insertion fix with explicit confidence, asserting UTF-8 char boundary
/// alignment in debug builds.
///
/// Validates that `position` lands on a UTF-8 char boundary in `content`.
/// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::insert_with_confidence`] when
/// `content` is not available.
pub fn insert_with_confidence_checked(
content: &str,
position: usize,
text: impl Into<String>,
description: impl Into<String>,
confidence: f32,
) -> Self {
debug_assert!(
position <= content.len(),
"Fix::insert_with_confidence_checked: position ({position}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(position),
"Fix::insert_with_confidence_checked: position ({position}) is not on a UTF-8 char boundary"
);
Self::insert_with_confidence(position, text, description, confidence)
}

/// Create a deletion fix, asserting UTF-8 char boundary alignment in debug builds.
///
/// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`.
/// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::delete`] when `content` is not available.
pub fn delete_checked(
content: &str,
start: usize,
end: usize,
description: impl Into<String>,
safe: bool,
) -> Self {
debug_assert!(
start <= end,
"Fix::delete_checked: start_byte ({start}) must be <= end_byte ({end})"
);
debug_assert!(
start <= content.len(),
"Fix::delete_checked: start_byte ({start}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(start),
"Fix::delete_checked: start_byte ({start}) is not on a UTF-8 char boundary"
);
debug_assert!(
end <= content.len(),
"Fix::delete_checked: end_byte ({end}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(end),
"Fix::delete_checked: end_byte ({end}) is not on a UTF-8 char boundary"
);
Self::delete(start, end, description, safe)
}

/// Create a deletion fix with explicit confidence, asserting UTF-8 char boundary
/// alignment in debug builds.
///
/// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`.
/// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::delete_with_confidence`] when
/// `content` is not available.
pub fn delete_with_confidence_checked(
content: &str,
start: usize,
end: usize,
description: impl Into<String>,
confidence: f32,
) -> Self {
debug_assert!(
start <= end,
"Fix::delete_with_confidence_checked: start_byte ({start}) must be <= end_byte ({end})"
);
debug_assert!(
start <= content.len(),
"Fix::delete_with_confidence_checked: start_byte ({start}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(start),
"Fix::delete_with_confidence_checked: start_byte ({start}) is not on a UTF-8 char boundary"
);
debug_assert!(
end <= content.len(),
"Fix::delete_with_confidence_checked: end_byte ({end}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(end),
"Fix::delete_with_confidence_checked: end_byte ({end}) is not on a UTF-8 char boundary"
);
Self::delete_with_confidence(start, end, description, confidence)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There's significant duplication of validation logic across the new _checked methods. You can refactor this into private helper functions to improve maintainability and reduce code duplication. One helper can validate spans (start and end) and another can validate single positions.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds debug-only validation to Fix construction to catch invalid byte ranges/UTF-8 boundary issues earlier (during fix creation), while keeping the existing runtime safety net at fix-application time.

Changes:

  • Added debug_assert!(start <= end) to range-based Fix constructors (replace*, delete*).
  • Introduced 6 new _checked Fix constructors that validate bounds + UTF-8 char boundaries against provided content: &str in debug builds.
  • Added comprehensive tests for reversed ranges, out-of-bounds, and mid-codepoint offsets (debug-only panic tests gated by #[cfg(debug_assertions)]).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
crates/agnix-core/src/diagnostics.rs Adds new debug assertions, introduces _checked constructors, and adds extensive unit tests for boundary/range validation.
CHANGELOG.md Documents the new _checked constructors and the new debug range assertions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +184 to +215
/// Create a replacement fix, asserting UTF-8 char boundary alignment in debug builds.
///
/// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`.
/// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::replace`] when `content` is not available.
pub fn replace_checked(
content: &str,
start: usize,
end: usize,
replacement: impl Into<String>,
description: impl Into<String>,
safe: bool,
) -> Self {
debug_assert!(
start <= end,
"Fix::replace_checked: start_byte ({start}) must be <= end_byte ({end})"
);
debug_assert!(
start <= content.len(),
"Fix::replace_checked: start_byte ({start}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(start),
"Fix::replace_checked: start_byte ({start}) is not on a UTF-8 char boundary"
);
debug_assert!(
end <= content.len(),
"Fix::replace_checked: end_byte ({end}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(end),
"Fix::replace_checked: end_byte ({end}) is not on a UTF-8 char boundary"
);
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _checked constructors repeat the same debug-only range/bounds/UTF-8 boundary assertions across multiple methods. Consider extracting a small private helper to reduce duplication and prevent these checks/error messages from drifting over time.

Suggested change
/// Create a replacement fix, asserting UTF-8 char boundary alignment in debug builds.
///
/// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`.
/// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::replace`] when `content` is not available.
pub fn replace_checked(
content: &str,
start: usize,
end: usize,
replacement: impl Into<String>,
description: impl Into<String>,
safe: bool,
) -> Self {
debug_assert!(
start <= end,
"Fix::replace_checked: start_byte ({start}) must be <= end_byte ({end})"
);
debug_assert!(
start <= content.len(),
"Fix::replace_checked: start_byte ({start}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(start),
"Fix::replace_checked: start_byte ({start}) is not on a UTF-8 char boundary"
);
debug_assert!(
end <= content.len(),
"Fix::replace_checked: end_byte ({end}) is out of bounds (len={})", content.len()
);
debug_assert!(
content.is_char_boundary(end),
"Fix::replace_checked: end_byte ({end}) is not on a UTF-8 char boundary"
);
/// Internal helper for debug-only validation of byte ranges and UTF-8 char boundaries.
///
/// This is used by `_checked` constructors to keep assertions and their error
/// messages consistent in one place. It is a no-op in release builds since it
/// only contains `debug_assert!` calls.
fn debug_assert_valid_byte_range_for_content(
content: &str,
start: usize,
end: usize,
context: &str,
) {
debug_assert!(
start <= end,
"{context}: start_byte ({start}) must be <= end_byte ({end})"
);
debug_assert!(
start <= content.len(),
"{context}: start_byte ({start}) is out of bounds (len={})",
content.len()
);
debug_assert!(
content.is_char_boundary(start),
"{context}: start_byte ({start}) is not on a UTF-8 char boundary"
);
debug_assert!(
end <= content.len(),
"{context}: end_byte ({end}) is out of bounds (len={})",
content.len()
);
debug_assert!(
content.is_char_boundary(end),
"{context}: end_byte ({end}) is not on a UTF-8 char boundary"
);
}
/// Create a replacement fix, asserting UTF-8 char boundary alignment in debug builds.
///
/// Validates that both `start` and `end` land on UTF-8 char boundaries in `content`.
/// These checks are no-ops in release builds; the function otherwise behaves identically to its unchecked counterpart. Use [`Self::replace`] when `content` is not available.
pub fn replace_checked(
content: &str,
start: usize,
end: usize,
replacement: impl Into<String>,
description: impl Into<String>,
safe: bool,
) -> Self {
Self::debug_assert_valid_byte_range_for_content(content, start, end, "Fix::replace_checked");

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

@avifenesh avifenesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed Copilot and Gemini suggestions: extracted private helpers debug_assert_valid_range() and debug_assert_valid_position() to consolidate duplicate debug_assert! logic across the _checked constructors.

@avifenesh avifenesh merged commit 88c050f into main Feb 18, 2026
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add UTF-8 boundary validation to Fix constructors

2 participants