fix(fix-flaws): handle UTF-8 characters and HTML entities#395

Merged
caugner merged 10 commits into main from fix-flaws on Dec 3, 2025

@caugner caugner commented Nov 26, 2025

Description

Fixes two issues with the fix-flaws command:

  • Byte offsets and character positions were mixed up.
  • HTML entity encoding caused mismatches.

Motivation

Prevents fix-flaws from failing in translated-content's autofix workflow.

Additional details

Related issues and pull requests

Fixes #394.

Fix critical bug where byte offsets and character positions were mixed
throughout the codebase, causing incorrect position reporting for content
with multi-byte UTF-8 characters (emojis, accented characters, etc.).

Changes:
- Add position_utils module with byte ↔ character conversion functions
- Fix render.rs bug mixing character count with byte offset in end_col
- Convert Issue byte columns to DisplayIssue character columns for output
- Update actual_offset to convert character positions back to bytes
- Improve char boundary checking with proper warnings
- Document all position fields as bytes (internal) or characters (display)
- Verify Comrak uses byte-based sourcepos (1-based)
- Add comprehensive UTF-8 tests with emojis and accented characters
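
A minimal sketch of the byte ↔ character column conversion described in the list above. Only the name `char_to_byte_column` appears in this PR; `byte_to_char_column`, both signatures, and the 0-based convention are assumptions made for illustration, not the actual `position_utils` code.

```rust
/// Convert a 0-based byte column within `line` to a 0-based character column.
/// If `byte_col` falls inside a multi-byte character, it is first moved back
/// to the start of that character. (Hypothetical helper, assumed signature.)
pub fn byte_to_char_column(line: &str, byte_col: usize) -> usize {
    let mut end = byte_col.min(line.len());
    // Index 0 is always a boundary, so this loop cannot underflow.
    while !line.is_char_boundary(end) {
        end -= 1;
    }
    line[..end].chars().count()
}

/// Convert a 0-based character column to a 0-based byte column.
/// Columns past the end of the line clamp to `line.len()`.
/// (Assumed signature; the real function may differ.)
pub fn char_to_byte_column(line: &str, char_col: usize) -> usize {
    line.char_indices()
        .nth(char_col)
        .map(|(byte_idx, _)| byte_idx)
        .unwrap_or(line.len())
}
```

For the line `café 🎉 link`, the space after `é` sits at byte 5 but character 4, and the space after the emoji sits at byte 10 but character 6, which is the kind of drift that mixing the two units produces.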

All 153 existing tests pass. The fix ensures correct position handling
throughout: tree-sitter/Comrak (bytes) → Issue (bytes) → DisplayIssue
(characters) → file operations (bytes).

Fixes panic when byte_offset is in the middle of multi-byte characters
(e.g., inside é). Adjusts to nearest boundary before counting chars.
Verified with French content. All tests pass.

Fixes panics when running `content fix-flaws` on content with multi-byte
UTF-8 characters (e.g., French accented characters like é).

The issue occurred when calculating byte offsets for link replacements:
- `offset - href.len()` could land inside a multi-byte character
- String slicing at invalid boundaries caused panics
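
The failure mode above can be sketched as follows. Both helpers are hypothetical and exist only to illustrate the idea; they are not the functions changed in this PR. The sketch relies only on `str::is_char_boundary` from the standard library.

```rust
/// Move `offset` back to the nearest UTF-8 character boundary at or before it.
/// (Hypothetical helper for illustration.)
pub fn floor_to_char_boundary(s: &str, offset: usize) -> usize {
    let mut i = offset.min(s.len());
    // Index 0 is always a boundary, so this loop terminates without underflow.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Compute where an href of `href_len` bytes ending at byte `end_offset`
/// would start. Returns `None` instead of panicking when the subtraction
/// underflows or either end is not on a character boundary.
/// (Hypothetical helper for illustration.)
pub fn href_start(raw: &str, end_offset: usize, href_len: usize) -> Option<usize> {
    let start = end_offset.checked_sub(href_len)?;
    let valid = end_offset <= raw.len()
        && raw.is_char_boundary(start)
        && raw.is_char_boundary(end_offset);
    valid.then_some(start)
}
```

With `raw = "é](/x)"`, `href_start(raw, 3, 2)` yields `None` because byte 1 lies inside `é`, whereas slicing `&raw[1..3]` directly would panic.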

Changes:
- Add character boundary validation in `collect_suggestions()` to ensure
  href start offsets are on valid UTF-8 boundaries
- Add defensive checks in `apply_suggestions()` for both start and end
  offsets, skipping suggestions with invalid boundaries instead of panicking
- Add char boundary check in `calc_offset()` as additional safety net
- Add comprehensive tests for multi-byte character handling
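
A rough sketch of the "skip instead of panic" behaviour described for `apply_suggestions()`. The `Suggestion` struct, the function signature, and the assumption that suggestions do not overlap are all invented for this example; the real code in rari-tools is not shown in this PR conversation.

```rust
/// Hypothetical suggestion: replace bytes `start..end` with `replacement`.
pub struct Suggestion {
    pub start: usize,
    pub end: usize,
    pub replacement: String,
}

/// Apply suggestions to `raw`, skipping any whose byte range is out of bounds
/// or not aligned to UTF-8 character boundaries.
/// Returns the new text and the number of skipped suggestions.
/// Assumes the suggestions do not overlap.
pub fn apply_suggestions(raw: &str, suggestions: &[Suggestion]) -> (String, usize) {
    // Apply from the end of the text so earlier byte offsets stay valid.
    let mut sorted: Vec<&Suggestion> = suggestions.iter().collect();
    sorted.sort_by(|a, b| b.start.cmp(&a.start));

    let mut out = raw.to_string();
    let mut skipped = 0;
    for s in sorted {
        let valid = s.start <= s.end
            && s.end <= out.len()
            && out.is_char_boundary(s.start)
            && out.is_char_boundary(s.end);
        if !valid {
            // `replace_range` would panic here, so skip the suggestion.
            skipped += 1;
            continue;
        }
        out.replace_range(s.start..s.end, &s.replacement);
    }
    (out, skipped)
}
```

A caller could log a warning whenever the skipped count is non-zero, so a bad offset costs one fix rather than the whole run.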

This ensures robust handling of international characters throughout the
fix-flaws pipeline (French, German, Japanese, emoji, etc.).

Remove the unnecessary +10 byte margin when calculating search_start
position, which could cause finding wrong instances of duplicate hrefs.
The rfind() search is precise enough without the extra margin.
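
To illustrate why an extra margin can pick the wrong duplicate, here is a small sketch. The function name and the assumption that the search window ends exactly at the reported offset are hypothetical; the real `search_start` logic is not visible in this conversation.

```rust
/// Find the byte index of the last occurrence of `href` that lies entirely
/// before `window_end`. The window end is moved back to a character boundary
/// so the slice cannot panic. (Hypothetical helper for illustration.)
pub fn locate_href(raw: &str, href: &str, window_end: usize) -> Option<usize> {
    let mut end = window_end.min(raw.len());
    while !raw.is_char_boundary(end) {
        end -= 1;
    }
    raw[..end].rfind(href)
}
```

For `raw = "[a](/x) [b](/x)"` and the first link's href ending at byte 6, `locate_href(raw, "/x", 6)` returns `Some(4)`. Widening the window by 10 bytes makes `rfind` see the second link and return `Some(12)` instead.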

Additionally, enhance the warning message when an href cannot be located
by including the actual content region that was searched. This makes
debugging much easier by showing what text was examined rather than just
an offset number.

The fix-flaws command was failing to locate and fix broken links when
hrefs contained HTML entities (&#x27; for ', &lt; for <, etc.). This
occurred because:

1. Issue hrefs come from HTML output with encoded entities
2. Suggestion URLs from redirects also contain encoded entities
3. Raw markdown files contain literal characters (', <, etc.)
4. The search/replace logic couldn't match encoded strings against literals

This commit fixes the issue by:

- Adding html-escape dependency to rari-tools
- Decoding both href and suggestion before searching raw markdown
- Decoding href in actual_offset() when calculating positions
- Adding tests for HTML entity handling
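
The mismatch and its fix can be sketched without external crates. The PR uses the html-escape dependency; the decoder below is a hand-rolled stand-in that covers only a handful of entities, and both function names are hypothetical.

```rust
/// Decode a small set of HTML entities. This is a simplified stand-in for a
/// full decoder and handles only the entities listed here.
pub fn decode_basic_entities(s: &str) -> String {
    // `&amp;` is decoded last so that e.g. `&amp;lt;` becomes `&lt;`, not `<`.
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#x27;", "'")
        .replace("&#39;", "'")
        .replace("&amp;", "&")
}

/// Locate an href taken from rendered HTML inside raw markdown by decoding
/// it first, since the markdown contains literal characters.
pub fn find_decoded_href(raw: &str, href_from_html: &str) -> Option<usize> {
    raw.find(&decode_basic_entities(href_from_html))
}
```

Searching `[l'élément](/fr/docs/Web/SVG/l'élément)` for the encoded href `/fr/docs/Web/SVG/l&#x27;élément` fails with a plain `find`, but succeeds once the href is decoded.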

Result: fix-flaws now successfully updates 22 French docs that were
previously failing with "Could not locate href" warnings. The fixes
preserve literal characters in markdown (no entities added).

Fixes issues with French SVG documentation and other files containing
accented characters in URLs.

@caugner caugner force-pushed the fix-flaws branch 2 times, most recently from 5df22b7 to 08e7bc6 November 26, 2025 22:51
@caugner caugner marked this pull request as ready for review November 28, 2025 10:56
@caugner caugner requested review from a team and mdn-bot as code owners November 28, 2025 10:56
@caugner caugner requested review from LeoMcA and argl November 28, 2025 10:56

// Convert character column to byte column
let new_column = if let Some(line_content) = raw.lines().nth(new_line) {
use rari_doc::position_utils::char_to_byte_column;

Move the import to the top of the file?

Fixed in 25bd584.

@argl argl left a comment

small nit, otherwise nice!

@caugner caugner merged commit 63e2439 into main Dec 3, 2025
7 checks passed
@caugner caugner deleted the fix-flaws branch December 3, 2025 12:02

Successfully merging this pull request may close these issues.

fix-flaws duplicates content + fails to apply suggestions
