fix(fix-flaws): handle UTF-8 characters and HTML entities by caugner · Pull Request #395 · mdn/rari

caugner · 2025-11-26T12:10:29Z

Description

Fixes two issues with the fix-flaws command:

Byte and character length were mixed.
HTML entity encoding caused mismatches.

Motivation

Prevents fix-flaws from failing in translated-content's autofix workflow.

Additional details

Related issues and pull requests

Fixes #394.

Fix critical bug where byte offsets and character positions were mixed throughout the codebase, causing incorrect position reporting for content with multi-byte UTF-8 characters (emojis, accented characters, etc.).

Changes:
- Add position_utils module with byte ↔ character conversion functions
- Fix render.rs bug mixing character count with byte offset in end_col
- Convert Issue byte columns to DisplayIssue character columns for output
- Update actual_offset to convert character positions back to bytes
- Improve char boundary checking with proper warnings
- Document all position fields as bytes (internal) or characters (display)
- Verify Comrak uses byte-based sourcepos (1-based)
- Add comprehensive UTF-8 tests with emojis and accented characters

All 153 existing tests pass. The fix ensures correct position handling throughout: tree-sitter/Comrak (bytes) → Issue (bytes) → DisplayIssue (characters) → file operations (bytes).
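To illustrate the byte ↔ character distinction this commit is about, here is a minimal sketch of the two conversions for a single line. The function names echo the `char_to_byte_column` import quoted in the review, but the signatures, the 0-based columns, and the bodies are assumptions for illustration, not the code in position_utils.

```rust
/// Hypothetical helper: convert a 0-based byte column within `line`
/// into a 0-based character column.
/// Returns `None` if `byte_col` is past the end of the line or falls
/// inside a multi-byte character.
fn byte_to_char_column(line: &str, byte_col: usize) -> Option<usize> {
    // `is_char_boundary` is false for offsets beyond `line.len()`.
    if !line.is_char_boundary(byte_col) {
        return None;
    }
    Some(line[..byte_col].chars().count())
}

/// Hypothetical helper: convert a 0-based character column within `line`
/// into a 0-based byte column.
/// Returns `None` if the line has fewer than `char_col` characters.
fn char_to_byte_column(line: &str, char_col: usize) -> Option<usize> {
    line.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        // Allow the one-past-the-end position as well.
        .chain(std::iter::once(line.len()))
        .nth(char_col)
}
```

For the line `café 🎉 link`, the word `link` starts at byte 11 but at character 7, so reporting one where the other is expected points at the wrong place.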

Fixes panic when byte_offset is in the middle of multi-byte characters (e.g., inside é). Adjusts to nearest boundary before counting chars. Verified with French content. All tests pass.
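A rough sketch of the boundary adjustment described above, assuming the offset is rounded down to the previous boundary (the commit message only says "nearest", so the direction is a guess). Both function names are hypothetical.

```rust
/// Hypothetical helper: move `byte_offset` back to the closest UTF-8
/// character boundary at or before it, clamping to the string length.
fn floor_to_char_boundary(s: &str, byte_offset: usize) -> usize {
    let mut idx = byte_offset.min(s.len());
    // Offset 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}

/// Count the characters that precede `byte_offset`, without panicking
/// when the offset lands in the middle of a multi-byte character.
fn chars_before(s: &str, byte_offset: usize) -> usize {
    let idx = floor_to_char_boundary(s, byte_offset);
    s[..idx].chars().count()
}
```

With `s = "é!"`, offset 1 sits inside `é`; slicing `s[..1]` directly would panic, whereas the adjusted offset is 0.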

Fixes panics when running `content fix-flaws` on content with multi-byte UTF-8 characters (e.g., French accented characters like é).

The issue occurred when calculating byte offsets for link replacements:
- `offset - href.len()` could land inside a multi-byte character
- String slicing at invalid boundaries caused panics

Changes:
- Add character boundary validation in `collect_suggestions()` to ensure href start offsets are on valid UTF-8 boundaries
- Add defensive checks in `apply_suggestions()` for both start and end offsets, skipping suggestions with invalid boundaries instead of panicking
- Add char boundary check in `calc_offset()` as additional safety net
- Add comprehensive tests for multi-byte character handling

This ensures robust handling of international characters throughout the fix-flaws pipeline (French, German, Japanese, emoji, etc.).
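The "skip instead of panic" behaviour could look roughly like the sketch below. The `Suggestion` struct, its fields, and the `apply_suggestions_sketch` signature are assumptions made for illustration; they are not the types used in rari-tools, and the sketch assumes suggestions do not overlap.

```rust
/// Hypothetical suggestion: replace bytes `start..end` with `replacement`.
struct Suggestion {
    start: usize,
    end: usize,
    replacement: String,
}

/// Apply suggestions back to front, skipping any whose byte offsets
/// are reversed, out of range, or not on UTF-8 character boundaries.
fn apply_suggestions_sketch(content: &str, mut suggestions: Vec<Suggestion>) -> String {
    // Work from the end so earlier byte offsets stay valid after each edit.
    suggestions.sort_by(|a, b| b.start.cmp(&a.start));
    let mut out = content.to_string();
    for s in suggestions {
        // `is_char_boundary` also returns false for offsets past the end.
        let valid = s.start <= s.end
            && out.is_char_boundary(s.start)
            && out.is_char_boundary(s.end);
        if !valid {
            eprintln!("skipping suggestion with invalid boundaries: {}..{}", s.start, s.end);
            continue;
        }
        out.replace_range(s.start..s.end, &s.replacement);
    }
    out
}
```

Without the check, `replace_range` would panic on an offset such as byte 7 of `Voir [été]`, which falls inside the first `é`.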

Remove the unnecessary +10 byte margin when calculating search_start position, which could cause finding wrong instances of duplicate hrefs. The rfind() search is precise enough without the extra margin. Additionally, enhance the warning message when an href cannot be located by including the actual content region that was searched. This makes debugging much easier by showing what text was examined rather than just an offset number.
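As a hedged illustration of why the extra margin could pick the wrong duplicate, the sketch below searches backwards from a known end offset. `locate_href` and its parameters are hypothetical and only approximate the search described in this commit.

```rust
/// Hypothetical helper: find the last occurrence of `href` that ends at
/// or before `end_offset` (a byte offset into `content`).
fn locate_href(content: &str, href: &str, end_offset: usize) -> Option<usize> {
    // Clamp and back off to a char boundary so slicing cannot panic.
    let mut end = end_offset.min(content.len());
    while !content.is_char_boundary(end) {
        end -= 1;
    }
    let region = &content[..end];
    let found = region.rfind(href);
    if found.is_none() {
        // Show the searched text, not just an offset, to ease debugging.
        eprintln!("Could not locate href {href:?} in region {region:?}");
    }
    found
}
```

In `[a](/x) [b](/x)` the first `/x` ends at byte 6. Searching up to byte 6 returns offset 4, while searching up to byte 6 + 10 would return offset 12, the second link.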

The fix-flaws command was failing to locate and fix broken links when hrefs contained HTML entities (`&apos;` for `'`, `&lt;` for `<`, etc.).

This occurred because:
1. Issue hrefs come from HTML output with encoded entities
2. Suggestion URLs from redirects also contain encoded entities
3. Raw markdown files contain literal characters (`'`, `<`, etc.)
4. The search/replace logic couldn't match encoded strings against literals

This commit fixes the issue by:
- Adding html-escape dependency to rari-tools
- Decoding both href and suggestion before searching raw markdown
- Decoding href in actual_offset() when calculating positions
- Adding tests for HTML entity handling

Result: fix-flaws now successfully updates 22 French docs that were previously failing with "Could not locate href" warnings. The fixes preserve literal characters in markdown (no entities added).

Fixes issues with French SVG documentation and other files containing accented characters in URLs.
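The mismatch can be reproduced in a few lines. The commit relies on the html-escape crate for decoding; to keep this sketch dependency-free it substitutes a tiny hand-written decoder that covers only a handful of entities, so both function names and the entity list are assumptions for illustration.

```rust
/// Hypothetical stand-in for a real entity decoder: handles only a few
/// common entities.
fn decode_basic_entities(input: &str) -> String {
    input
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#x27;", "'")
        .replace("&#39;", "'")
        .replace("&apos;", "'")
        // Decode `&amp;` last so "&amp;lt;" becomes "&lt;", not "<".
        .replace("&amp;", "&")
}

/// Search the raw markdown for an href taken from rendered HTML,
/// decoding entities first so it can match the literal characters.
fn find_href_in_markdown(raw: &str, href_from_html: &str) -> Option<usize> {
    let decoded = decode_basic_entities(href_from_html);
    raw.find(decoded.as_str())
}
```

Searching the markdown for the encoded href directly finds nothing, because the file contains a literal `'` rather than an entity; after decoding, the search matches.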

argl · 2025-12-03T10:47:45Z

crates/rari-tools/src/fix/issues.rs

+
+    // Convert character column to byte column
+    let new_column = if let Some(line_content) = raw.lines().nth(new_line) {
+        use rari_doc::position_utils::char_to_byte_column;


Move the import to the top of the file?

Fixed in 25bd584.

argl

small nit, otherwise nice!

caugner force-pushed the fix-flaws branch from c43997c to ceadd42 Compare November 26, 2025 12:18

caugner added 6 commits November 26, 2025 13:27

fix(position_utils): handle byte offsets not on UTF-8 char boundaries

01f8a9b

Fixes panic when byte_offset is in the middle of multi-byte characters (e.g., inside é). Adjusts to nearest boundary before counting chars. Verified with French content. All tests pass.

ci(test): add workflow with fix-flaws job

abfc030

caugner force-pushed the fix-flaws branch from ceadd42 to abfc030 Compare November 26, 2025 12:29

caugner mentioned this pull request Nov 26, 2025

[fr] auto-fix content issues mdn/translated-content#30709

Merged

caugner force-pushed the fix-flaws branch from f5e258e to 33c5291 Compare November 26, 2025 12:37

style(clippy): fix issues

7b5e1cc

caugner force-pushed the fix-flaws branch 2 times, most recently from 5df22b7 to 08e7bc6 Compare November 26, 2025 22:51

ci(test): show fix-flaws changes

2153d60

caugner force-pushed the fix-flaws branch from 08e7bc6 to 2153d60 Compare November 26, 2025 23:06

caugner marked this pull request as ready for review November 28, 2025 10:56

caugner requested review from a team and mdn-bot as code owners November 28, 2025 10:56

caugner requested review from LeoMcA and argl November 28, 2025 10:56

argl reviewed Dec 3, 2025

View reviewed changes

argl approved these changes Dec 3, 2025

View reviewed changes

caugner added 2 commits December 3, 2025 12:54

Merge branch 'main' into fix-flaws

b611688

refactor(tools): move import to top

25bd584

caugner merged commit 63e2439 into main Dec 3, 2025
7 checks passed

caugner deleted the fix-flaws branch December 3, 2025 12:02

mdn-bot mentioned this pull request Dec 3, 2025

chore(main): release 0.2.6 #408

Merged

caugner mentioned this pull request Dec 4, 2025

feat(fix-flaws): fix slugs in macro parameters #413

Merged

caugner merged 10 commits into main from fix-flaws

2 participants
