feat: word diffs#33

Merged
Byron merged 23 commits into pascalkuthe:master from KnorpelSenf:word-diff
Dec 30, 2025
Conversation

@KnorpelSenf
Contributor

@KnorpelSenf KnorpelSenf commented Sep 27, 2025

Adds a word diffing feature for each hunk. The following questions need to be answered to move this out of draft state:

EDIT: all done!

  1. The current implementation is based on bytes. This is consistent with how strings are used internally in other places. Is this useful for word-diffs? We could also use .chars() or .graphemes(true) (via this crate). It's also possible to offer several of these options. (EDIT: we now use words)
  2. The current implementation reuses the diff, but it reallocates the bytes of the underlying strings for every hunk. Is this acceptable? If no, which alternative should we go with? (EDIT: no longer applies since we no longer use bytes)
  3. The current implementation reuses the diff, but it reallocates the InternedInput for every hunk. Is this acceptable? If no, which alternative should we go with? (EDIT: we now reuse the interner)
  4. The current implementation estimates the token count to be 256 for byte token sources. Please let me know if that is an unreasonable heuristic for byte token sources. (EDIT: no longer applies since we no longer use bytes)
  5. The current implementation always uses the Myers algo. Should we offer a second method which performs a minimal diff? (EDIT: no, we can add that later if requested)
  6. Is there anything else I am missing?

Closes #1.

@KnorpelSenf KnorpelSenf marked this pull request as draft September 27, 2025 11:56
@KnorpelSenf KnorpelSenf marked this pull request as ready for review November 7, 2025 17:21
@KnorpelSenf
Contributor Author

@pascalkuthe @Byron I have addressed all comments. This is ready for review now.

@KnorpelSenf
Contributor Author

@pascalkuthe is there a way for me to make this easier to review?

@pascalkuthe
Owner

left one comment, otherwise LGTM

@pascalkuthe
Owner

this has conflicts now, but as CI passes I am ok with merging it as soon as the conflicts are resolved and @Byron has given his thumbs up as well

@KnorpelSenf
Contributor Author

Oddly enough, GitHub displays "all checks have passed" as well as "no conflicts with base branch" for me. What should I do? Merge main into this branch?

@pascalkuthe
Owner

you need to rebase (`git fetch && git rebase origin/main`) and resolve the conflicts

@KnorpelSenf
Contributor Author

I know how to resolve conflicts; I'm just surprised that GitHub doesn't display any merge conflicts for me but it does for you. But sure, I'll double-check with the CLI tomorrow.

@pascalkuthe
Owner

ah sorry, I misunderstood. Weird, for me it shows:
[screenshot]

@pascalkuthe
Owner

ok, it seems squash + merge / merge + merge commit would work. Probably if you rebased instead of merging locally there would be conflicts. But I am also ok with squash merging after @Byron has had a look

@KnorpelSenf
Contributor Author

Fascinating:
[screenshot]


Copilot AI left a comment


Pull request overview

This PR adds a word-level diffing feature to complement the existing line-level diff functionality. The implementation provides a word_diff method on Hunk that computes word-level differences within changed lines, making it easier to identify specific word changes within modified lines.

Key changes:

  • Implemented a word tokenizer that splits text into words based on alphanumeric sequences, spaces, and individual punctuation characters
  • Added word_diff method to Hunk that performs word-level diffing on hunk contents
  • Fixed a typo in the interner documentation
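
The tokenization rules listed above can be sketched in stand-alone Rust. This is an illustrative reconstruction under the stated rules (runs of alphanumerics plus `_`, runs of spaces, every other character as its own token); the name `tokenize_words` is hypothetical and this is not the crate's actual `words()` implementation:

```rust
// Hypothetical sketch of the described tokenization rules; not the crate's
// actual `words()` implementation.
fn tokenize_words(s: &str) -> Vec<&str> {
    #[derive(PartialEq)]
    enum Kind {
        Word,  // alphanumeric or '_', grouped into runs
        Space, // ' ', grouped into runs
        Other, // everything else, always a single-char token
    }
    fn kind(c: char) -> Kind {
        if c.is_alphanumeric() || c == '_' {
            Kind::Word
        } else if c == ' ' {
            Kind::Space
        } else {
            Kind::Other
        }
    }
    let mut tokens = Vec::new();
    let mut start = 0;
    let mut iter = s.char_indices().peekable();
    while let Some((i, c)) = iter.next() {
        let end = i + c.len_utf8();
        // close the current token on punctuation, on a kind change,
        // or at the end of input
        let split = match iter.peek() {
            Some(&(_, next)) => kind(c) == Kind::Other || kind(next) != kind(c),
            None => true,
        };
        if split {
            tokens.push(&s[start..end]);
            start = end;
        }
    }
    tokens
}

fn main() {
    assert_eq!(
        tokenize_words("let x_1 =  foo();"),
        ["let", " ", "x_1", " ", "=", "  ", "foo", "(", ")", ";"]
    );
    println!("ok");
}
```

Note how the double space stays one token and `x_1` stays one word; those two behaviours are exactly what the review comments below discuss.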

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.

File Description
  • src/sources.rs: Adds the `Words` tokenizer struct and the `words()` function to split text into word-level tokens for diffing
  • src/lib.rs: Implements the `word_diff()` method on `Hunk` to compute word-level diffs within line hunks
  • src/tests.rs: Adds tests for the word tokenizer and `word_diff` functionality, including pure insertion/removal and modification cases
  • src/intern.rs: Fixes a typo in a documentation comment ("a the" → "the")


Comment on lines +17 to +18
/// `char::is_alphanumeric`, or a sequence of just the space character ' '. Any
/// other characters are their own word.

Copilot AI Dec 20, 2025


The documentation states that "a sequence of just the space character ' '" is treated as a word, but this is incomplete. The implementation on line 104-108 treats any sequence of consecutive spaces as a single token. The documentation should clarify that multiple consecutive spaces are grouped together as one token.

Suggested change
/// `char::is_alphanumeric`, or a sequence of just the space character ' '. Any
/// other characters are their own word.
/// `char::is_alphanumeric`, or a sequence of one or more consecutive space
/// characters (' '). Any other characters are their own word.

Comment on lines +16 to +18
/// a sequence of alphanumeric characters as determined by
/// `char::is_alphanumeric`, or a sequence of just the space character ' '. Any
/// other characters are their own word.

Copilot AI Dec 20, 2025


The documentation mentions "a sequence of alphanumeric characters as determined by char::is_alphanumeric" but the implementation on lines 110-113 also includes underscores (_) as part of alphanumeric words. This is inconsistent with the documentation. Either update the documentation to mention that underscores are included in alphanumeric sequences, or remove the special handling of underscores if they should be treated as separate tokens.

Suggested change
/// a sequence of alphanumeric characters as determined by
/// `char::is_alphanumeric`, or a sequence of just the space character ' '. Any
/// other characters are their own word.
/// a sequence of "word" characters (those for which `char::is_alphanumeric`
/// returns `true`, plus the underscore character '_'), or a sequence of just
/// the space character ' '. Any other characters are their own word.

src/lib.rs Outdated
Comment on lines +396 to +402
/// Performs a word-diff of the hunk
pub fn word_diff<'a>(
&self,
input: &InternedInput<&'a str>,
diff_input: &mut InternedInput<&'a str>,
diff: &mut Diff,
) {

Copilot AI Dec 20, 2025


The word_diff API requires callers to pass mutable references to diff_input and diff that are populated by this function. This design is not intuitive and error-prone because:

  1. Callers must create and maintain these objects themselves
  2. The function's return type is (), making it unclear that it populates the diff parameter
  3. The relationship between the parameters is not obvious from the signature

Consider either:

  • Returning the Diff as the function's result
  • Using a builder pattern with a dedicated WordDiff type
  • Adding documentation that clearly explains the expected usage pattern and the relationship between parameters

Collaborator


While Copilot is right in principle, I think complete documentation on the parameters could explain why they need to be mutable.
Docs are definitely needed here.

src/tests.rs Outdated
let mut input = InternedInput::new(before, after);
for algorithm in Algorithm::ALL {
let mut diff_input = InternedInput::default();
let mut d = Diff::default();

Copilot AI Dec 20, 2025


The variable name 'd' is too short and unclear. Consider using a more descriptive name like 'word_diff' or 'inner_diff' to make the test more readable and to clarify that this is a separate diff object from the line-level diff.

src/tests.rs Outdated
let mut input = InternedInput::new(before, after);
for algorithm in Algorithm::ALL {
let mut diff_input = InternedInput::default();
let mut d = Diff::default();

Copilot AI Dec 20, 2025


The variable name 'd' is too short and unclear. Consider using a more descriptive name like 'word_diff' or 'inner_diff' to make the test more readable and to clarify that this is a separate diff object from the line-level diff.

src/tests.rs Outdated
Comment on lines +404 to +405
let mut diff_input = InternedInput::default();
let mut d = Diff::default();

Copilot AI Dec 20, 2025


The test variables diff_input and d are declared once outside the algorithm loop but are reused across all algorithm iterations. While word_diff appears to clear and resize the vectors, this pattern could hide bugs if word_diff doesn't properly reset all state. Consider either:

  1. Moving these declarations inside the loop to ensure a fresh state for each algorithm
  2. Adding an explicit reset/clear call between iterations to make the intent clear

}

fn estimate_tokens(&self) -> u32 {
(self.0.len() / 3) as u32

Copilot AI Dec 20, 2025


The token estimation heuristic divides string length by 3 (assuming average word length of 3 characters). This could result in poor allocation sizing:

  1. For typical English text, average word length is closer to 4-5 characters when including spaces and punctuation
  2. For code with long identifiers, the average could be much higher
  3. The estimate doesn't account for the fact that each punctuation character becomes its own token

Consider using a more conservative estimate like (self.0.len() / 5) or implementing a sampling approach similar to ByteLines::estimate_tokens() that examines the first portion of the text to calculate the actual average.

Suggested change
(self.0.len() / 3) as u32
(self.0.len() / 5) as u32

Collaborator


I don't know about this one, but it seems that heuristics aren't always right and maybe there is a way to make it configurable?
Maybe this word-diff is also tuned to Latin text enough that we should say so in the function name, i.e. rename word_diff to latin_word_diff.
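
A sampling-based alternative to a fixed divisor, as suggested above, could look like the following stand-alone sketch. The function name, the sample size, and the extrapolation are illustrative assumptions, not the crate's `estimate_tokens()` API:

```rust
// Illustrative sketch: estimate the number of word tokens by measuring the
// actual token density over a small prefix, instead of assuming a fixed
// average width. SAMPLE_CHARS and the name are hypothetical choices.
fn estimate_word_tokens(text: &str) -> u32 {
    const SAMPLE_CHARS: usize = 512;
    let mut tokens = 0usize;
    let mut sampled_bytes = 0usize;
    let mut prev: Option<char> = None;
    for c in text.chars().take(SAMPLE_CHARS) {
        sampled_bytes += c.len_utf8();
        let is_word = c.is_alphanumeric() || c == '_';
        // a new token starts at the first char, on any kind change, and at
        // every punctuation character (those are always their own token)
        let starts_token = match prev {
            None => true,
            Some(p) => {
                let p_word = p.is_alphanumeric() || p == '_';
                !(is_word && p_word) && !(c == ' ' && p == ' ')
            }
        };
        if starts_token {
            tokens += 1;
        }
        prev = Some(c);
    }
    if sampled_bytes == 0 {
        return 0;
    }
    // extrapolate the sampled token density over the whole input
    let avg_token_len = sampled_bytes as f64 / tokens.max(1) as f64;
    (text.len() as f64 / avg_token_len).ceil() as u32
}

fn main() {
    // 8 tokens in the sample: "let", " ", "x", " ", "=", " ", "1", ";"
    assert_eq!(estimate_word_tokens("let x = 1;"), 8);
    assert_eq!(estimate_word_tokens(""), 0);
    println!("ok");
}
```

For inputs shorter than the sample this degenerates to an exact count, which is one reason a measured estimate can beat any fixed divisor.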

src/lib.rs Outdated
Comment on lines +396 to +433
/// Performs a word-diff of the hunk
pub fn word_diff<'a>(
&self,
input: &InternedInput<&'a str>,
diff_input: &mut InternedInput<&'a str>,
diff: &mut Diff,
) {
let Hunk { before, after } = self.clone();
diff_input.update_before(
before
.map(|index| input.before[index as usize])
.map(|token| input.interner[token])
.flat_map(|line| words(line)),
);
diff_input.update_after(
after
.map(|index| input.after[index as usize])
.map(|token| input.interner[token])
.flat_map(|line| words(line)),
);
diff.removed.clear();
diff.removed.resize(diff_input.before.len(), false);
diff.added.clear();
diff.added.resize(diff_input.after.len(), false);
if self.is_pure_removal() {
diff.removed.fill(true);
} else if self.is_pure_insertion() {
diff.added.fill(true);
} else {
diff.compute_with(
Algorithm::Myers,
&diff_input.before,
&diff_input.after,
diff_input.interner.num_tokens(),
);
diff.postprocess_no_heuristic(diff_input);
}
}

Copilot AI Dec 20, 2025


The word_diff method doesn't clear the interner between calls, which could lead to memory accumulation if called repeatedly. According to the documentation of update_before and update_after, "this does not erase any tokens from the interner and might therefore be considered a memory leak."

While the comment mentions clearing the interner for long-running processes, for a public API like word_diff, it would be better to either:

  1. Clear the interner at the start of this method
  2. Document this behavior clearly in the word_diff documentation
  3. Provide a way for callers to manage this explicitly

src/tests.rs Outdated
Comment on lines +347 to +348
let mut diff_input = InternedInput::default();
let mut d = Diff::default();

Copilot AI Dec 20, 2025


The test variables diff_input and d are declared once outside the algorithm loop but are reused across all algorithm iterations. While word_diff appears to clear and resize the vectors, this pattern could hide bugs if word_diff doesn't properly reset all state. Consider either:

  1. Moving these declarations inside the loop to ensure a fresh state for each algorithm
  2. Adding an explicit reset/clear call between iterations to make the intent clear

Collaborator


Maybe it's a feature to let state accumulate without resetting it to ensure the call can handle it.

@Byron
Collaborator

Byron commented Dec 20, 2025

Please state the nature of the medical emergency.

Thanks for summoning me :). Unfortunately, I am completely useless for looking at complex algorithms that don't have visual tests. So I really can't review this. But I did bring in a robot which isn't completely useless, and I hope it's true this time as well.

I will take a look at its suggestions too and add my opinion.

Collaborator

@Byron Byron left a comment


Copilot did inspire me to look more closely, and I think some more work is needed.

  • Let's make clear that we only deal with latin words here with a change to the function name.
  • let's document the function and all of its parameters in detail. The &mut ones need extra work to make clear what the user can expect or is expected to do. Particularly what happens to the memory there, will it build up?

I'd go as far as to say that we need a fuzz-test for this one, so could block it on #42 or change the target to merge into that one. All these index operations I wouldn't trust until a fuzzer ran it with a couple million inputs.

Thanks again for making this happen @KnorpelSenf, I hope it's not too much of a burden to go into another round.

@KnorpelSenf
Contributor Author

I skimmed most of the AI review and it largely looks like slop. I'm not really sure if there are serious changes needed, but if you think so, I can spend some more time on this

@Byron
Collaborator

Byron commented Dec 20, 2025

I went through that slop and left comments, while also summarising my takeaway in the comment above. Some more time would thus be appreciated 🙏.

@KnorpelSenf
Contributor Author

@pascalkuthe I added a commit, does GitHub still display conflicts?

@Byron

* Let's make clear that we only deal with latin words here with a change to the function name.

Do you suggest to rename word_diff to latin_word_diff?

* let's document the function and all of its parameters in detail. The `&mut` ones need extra work to make clear what the user can expect or is expected to do. Particularly what happens to the memory there, will it build up?

I can add more docs to word_diff

I'd go as far as to say that we need a fuzz-test for this one, so could block it on #42 or change the target to merge into that one. All these index operations I wouldn't trust until a fuzzer ran it with a couple million inputs.

That approach sounds strange. The indexing operations either work on all inputs or on none. There are definitely places in this crate that would benefit from fuzz-testing, but tokenisation (and vector indexing, for that matter) would be the last places where I'd employ it. Can you elaborate on why you'd want to block this on that?

- rename `word_diff` to `latin_word_diff`
- move all tests that don't validate private state to the integration level
- reorganise fixtures while at it
Collaborator

@Byron Byron left a comment


Thanks a lot!

I made some more adjustments, mostly to the testing so I get it, too 😁.

@Byron Byron merged commit 595c591 into pascalkuthe:master Dec 30, 2025
5 checks passed
@KnorpelSenf KnorpelSenf deleted the word-diff branch December 30, 2025 22:08

Development

Successfully merging this pull request may close these issues.

Add convenience feature for word diff
