feat: word diffs#33

Merged
Byron merged 23 commits into pascalkuthe:master from KnorpelSenf:word-diff
Dec 30, 2025
Conversation

@KnorpelSenf
Contributor

@KnorpelSenf KnorpelSenf commented Sep 27, 2025

Adds a word diffing feature for each hunk. The following questions need to be answered to move this out of draft state:

EDIT: all done!

  1. The current implementation is based on bytes. This is consistent with how strings are used internally in other places. Is this useful for word-diffs? We could also use .chars() or .graphemes(true) (via this crate). It's also possible to offer several of these options. (EDIT: we now use words)
  2. The current implementation reuses the diff, but it reallocates the bytes of the underlying strings for every hunk. Is this acceptable? If no, which alternative should we go with? (EDIT: no longer applies since we no longer use bytes)
  3. The current implementation reuses the diff, but it reallocates the InternedInput for every hunk. Is this acceptable? If no, which alternative should we go with? (EDIT: we now reuse the interner)
  4. The current implementation estimates the token count to be 256 for byte token sources. Please let me know if that is an unreasonable heuristic for byte token sources. (EDIT: no longer applies since we no longer use bytes)
  5. The current implementation always uses the Myers algo. Should we offer a second method which performs a minimal diff? (EDIT: no, we can add that later if requested)
  6. Is there anything else I am missing?

Closes #1.

@KnorpelSenf KnorpelSenf marked this pull request as draft September 27, 2025 11:56
@KnorpelSenf KnorpelSenf marked this pull request as ready for review November 7, 2025 17:21
@KnorpelSenf
Contributor Author

@pascalkuthe @Byron I have addressed all comments. This is ready for review now.

@KnorpelSenf
Contributor Author

@pascalkuthe is there a way for me to make this easier to review?

@pascalkuthe
Owner

left one comment, otherwise LGTM

@pascalkuthe
Owner

this has conflicts now, but as CI passes I am ok with merging it as soon as the conflicts are resolved and @Byron has given his thumbs up as well

@KnorpelSenf
Contributor Author

Oddly enough, GitHub displays "all checks have passed" as well as "no conflicts with base branch" for me. What should I do? Merge main into this branch?

@pascalkuthe
Owner

you need to rebase (`git fetch && git rebase origin/main`) and resolve the conflicts

@KnorpelSenf
Contributor Author

I know how to resolve conflicts; I'm just surprised that GitHub doesn't display any merge conflicts for me but it does for you. But sure, I'll double-check with the CLI tomorrow.

@pascalkuthe
Owner

ah sorry, I misunderstood. Weird, for me it shows:
[screenshot]

@pascalkuthe
Owner

ok, it seems squash + merge / merge + merge commit would work. Probably if you rebased instead of merging locally there would be conflicts. But I am also ok with squash merging after @Byron has had a look

@KnorpelSenf
Contributor Author

Fascinating:
[screenshot]


Copilot AI left a comment


Pull request overview

This PR adds a word-level diffing feature to complement the existing line-level diff functionality. The implementation provides a word_diff method on Hunk that computes word-level differences within changed lines, making it easier to identify specific word changes within modified lines.

Key changes:

  • Implemented a word tokenizer that splits text into words based on alphanumeric sequences, spaces, and individual punctuation characters
  • Added word_diff method to Hunk that performs word-level diffing on hunk contents
  • Fixed a typo in the interner documentation
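
The tokenization rules listed above can be sketched in stand-alone Rust. This is an illustrative reconstruction under the stated rules (runs of alphanumerics plus `_`, runs of spaces, every other character as its own token); the name `tokenize_words` is hypothetical and this is not the crate's actual `words()` implementation:

```rust
// Hypothetical sketch of the described tokenization rules; not the crate's
// actual `words()` implementation.
fn tokenize_words(s: &str) -> Vec<&str> {
    #[derive(PartialEq)]
    enum Kind {
        Word,  // alphanumeric or '_', grouped into runs
        Space, // ' ', grouped into runs
        Other, // everything else, always a single-char token
    }
    fn kind(c: char) -> Kind {
        if c.is_alphanumeric() || c == '_' {
            Kind::Word
        } else if c == ' ' {
            Kind::Space
        } else {
            Kind::Other
        }
    }
    let mut tokens = Vec::new();
    let mut start = 0;
    let mut iter = s.char_indices().peekable();
    while let Some((i, c)) = iter.next() {
        let end = i + c.len_utf8();
        // close the current token on punctuation, on a kind change,
        // or at the end of input
        let split = match iter.peek() {
            Some(&(_, next)) => kind(c) == Kind::Other || kind(next) != kind(c),
            None => true,
        };
        if split {
            tokens.push(&s[start..end]);
            start = end;
        }
    }
    tokens
}

fn main() {
    assert_eq!(
        tokenize_words("let x_1 =  foo();"),
        ["let", " ", "x_1", " ", "=", "  ", "foo", "(", ")", ";"]
    );
    println!("ok");
}
```

Note how the double space stays one token and `x_1` stays one word; those two behaviours are exactly what the review comments below discuss.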

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.

File Description
  • src/sources.rs: Adds the `Words` tokenizer struct and the `words()` function to split text into word-level tokens for diffing
  • src/lib.rs: Implements the `word_diff()` method on `Hunk` to compute word-level diffs within line hunks
  • src/tests.rs: Adds tests for the word tokenizer and `word_diff` functionality, including pure insertion/removal and modification cases
  • src/intern.rs: Fixes a typo in a documentation comment ("a the" → "the")


Comment on lines +17 to +18
/// `char::is_alphanumeric`, or a sequence of just the space character ' '. Any
/// other characters are their own word.

Copilot AI Dec 20, 2025


The documentation states that "a sequence of just the space character ' '" is treated as a word, but this is incomplete. The implementation on line 104-108 treats any sequence of consecutive spaces as a single token. The documentation should clarify that multiple consecutive spaces are grouped together as one token.

Suggested change
/// `char::is_alphanumeric`, or a sequence of just the space character ' '. Any
/// other characters are their own word.
/// `char::is_alphanumeric`, or a sequence of one or more consecutive space
/// characters (' '). Any other characters are their own word.

Comment on lines +16 to +18
/// a sequence of alphanumeric characters as determined by
/// `char::is_alphanumeric`, or a sequence of just the space character ' '. Any
/// other characters are their own word.

Copilot AI Dec 20, 2025


The documentation mentions "a sequence of alphanumeric characters as determined by char::is_alphanumeric" but the implementation on lines 110-113 also includes underscores (_) as part of alphanumeric words. This is inconsistent with the documentation. Either update the documentation to mention that underscores are included in alphanumeric sequences, or remove the special handling of underscores if they should be treated as separate tokens.

Suggested change
/// a sequence of alphanumeric characters as determined by
/// `char::is_alphanumeric`, or a sequence of just the space character ' '. Any
/// other characters are their own word.
/// a sequence of "word" characters (those for which `char::is_alphanumeric`
/// returns `true`, plus the underscore character '_'), or a sequence of just
/// the space character ' '. Any other characters are their own word.

src/lib.rs Outdated
Comment on lines +396 to +402
/// Performs a word-diff of the hunk
pub fn word_diff<'a>(
&self,
input: &InternedInput<&'a str>,
diff_input: &mut InternedInput<&'a str>,
diff: &mut Diff,
) {

Copilot AI Dec 20, 2025


The word_diff API requires callers to pass mutable references to diff_input and diff that are populated by this function. This design is not intuitive and error-prone because:

  1. Callers must create and maintain these objects themselves
  2. The function's return type is (), making it unclear that it populates the diff parameter
  3. The relationship between the parameters is not obvious from the signature

Consider either:

  • Returning the Diff as the function's result
  • Using a builder pattern with a dedicated WordDiff type
  • Adding documentation that clearly explains the expected usage pattern and the relationship between parameters

Collaborator


While Copilot is right in principle, I think complete documentation on the parameters could explain why they need to be mutable.
Docs are definitely needed here.

src/tests.rs Outdated
let mut input = InternedInput::new(before, after);
for algorithm in Algorithm::ALL {
let mut diff_input = InternedInput::default();
let mut d = Diff::default();

Copilot AI Dec 20, 2025


The variable name 'd' is too short and unclear. Consider using a more descriptive name like 'word_diff' or 'inner_diff' to make the test more readable and to clarify that this is a separate diff object from the line-level diff.

src/tests.rs Outdated
let mut input = InternedInput::new(before, after);
for algorithm in Algorithm::ALL {
let mut diff_input = InternedInput::default();
let mut d = Diff::default();

Copilot AI Dec 20, 2025


The variable name 'd' is too short and unclear. Consider using a more descriptive name like 'word_diff' or 'inner_diff' to make the test more readable and to clarify that this is a separate diff object from the line-level diff.

src/tests.rs Outdated
Comment on lines +404 to +405
let mut diff_input = InternedInput::default();
let mut d = Diff::default();

Copilot AI Dec 20, 2025


The test variables diff_input and d are declared once outside the algorithm loop but are reused across all algorithm iterations. While word_diff appears to clear and resize the vectors, this pattern could hide bugs if word_diff doesn't properly reset all state. Consider either:

  1. Moving these declarations inside the loop to ensure a fresh state for each algorithm
  2. Adding an explicit reset/clear call between iterations to make the intent clear

}

fn estimate_tokens(&self) -> u32 {
(self.0.len() / 3) as u32

Copilot AI Dec 20, 2025


The token estimation heuristic divides string length by 3 (assuming average word length of 3 characters). This could result in poor allocation sizing:

  1. For typical English text, average word length is closer to 4-5 characters when including spaces and punctuation
  2. For code with long identifiers, the average could be much higher
  3. The estimate doesn't account for the fact that each punctuation character becomes its own token

Consider using a more conservative estimate like (self.0.len() / 5) or implementing a sampling approach similar to ByteLines::estimate_tokens() that examines the first portion of the text to calculate the actual average.

Suggested change
(self.0.len() / 3) as u32
(self.0.len() / 5) as u32

Collaborator


I don't know about this one, but it seems that heuristics aren't always right and maybe there is a way to make it configurable?
Maybe this word-diff is also tuned to Latin text enough that we should say so in the function name, i.e. rename word_diff to latin_word_diff.
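
A sampling-based alternative to a fixed divisor, as suggested above, could look like the following stand-alone sketch. The function name, the sample size, and the extrapolation are illustrative assumptions, not the crate's `estimate_tokens()` API:

```rust
// Illustrative sketch: estimate the number of word tokens by measuring the
// actual token density over a small prefix, instead of assuming a fixed
// average width. SAMPLE_CHARS and the name are hypothetical choices.
fn estimate_word_tokens(text: &str) -> u32 {
    const SAMPLE_CHARS: usize = 512;
    let mut tokens = 0usize;
    let mut sampled_bytes = 0usize;
    let mut prev: Option<char> = None;
    for c in text.chars().take(SAMPLE_CHARS) {
        sampled_bytes += c.len_utf8();
        let is_word = c.is_alphanumeric() || c == '_';
        // a new token starts at the first char, on any kind change, and at
        // every punctuation character (those are always their own token)
        let starts_token = match prev {
            None => true,
            Some(p) => {
                let p_word = p.is_alphanumeric() || p == '_';
                !(is_word && p_word) && !(c == ' ' && p == ' ')
            }
        };
        if starts_token {
            tokens += 1;
        }
        prev = Some(c);
    }
    if sampled_bytes == 0 {
        return 0;
    }
    // extrapolate the sampled token density over the whole input
    let avg_token_len = sampled_bytes as f64 / tokens.max(1) as f64;
    (text.len() as f64 / avg_token_len).ceil() as u32
}

fn main() {
    // 8 tokens in the sample: "let", " ", "x", " ", "=", " ", "1", ";"
    assert_eq!(estimate_word_tokens("let x = 1;"), 8);
    assert_eq!(estimate_word_tokens(""), 0);
    println!("ok");
}
```

For inputs shorter than the sample this degenerates to an exact count, which is one reason a measured estimate can beat any fixed divisor.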

src/lib.rs Outdated
Comment on lines +396 to +433
/// Performs a word-diff of the hunk
pub fn word_diff<'a>(
&self,
input: &InternedInput<&'a str>,
diff_input: &mut InternedInput<&'a str>,
diff: &mut Diff,
) {
let Hunk { before, after } = self.clone();
diff_input.update_before(
before
.map(|index| input.before[index as usize])
.map(|token| input.interner[token])
.flat_map(|line| words(line)),
);
diff_input.update_after(
after
.map(|index| input.after[index as usize])
.map(|token| input.interner[token])
.flat_map(|line| words(line)),
);
diff.removed.clear();
diff.removed.resize(diff_input.before.len(), false);
diff.added.clear();
diff.added.resize(diff_input.after.len(), false);
if self.is_pure_removal() {
diff.removed.fill(true);
} else if self.is_pure_insertion() {
diff.added.fill(true);
} else {
diff.compute_with(
Algorithm::Myers,
&diff_input.before,
&diff_input.after,
diff_input.interner.num_tokens(),
);
diff.postprocess_no_heuristic(diff_input);
}
}

Copilot AI Dec 20, 2025


The word_diff method doesn't clear the interner between calls, which could lead to memory accumulation if called repeatedly. According to the documentation of update_before and update_after, "this does not erase any tokens from the interner and might therefore be considered a memory leak."

While the comment mentions clearing the interner for long-running processes, for a public API like word_diff, it would be better to either:

  1. Clear the interner at the start of this method
  2. Document this behavior clearly in the word_diff documentation
  3. Provide a way for callers to manage this explicitly

src/tests.rs Outdated
Comment on lines +347 to +348
let mut diff_input = InternedInput::default();
let mut d = Diff::default();

Copilot AI Dec 20, 2025


The test variables diff_input and d are declared once outside the algorithm loop but are reused across all algorithm iterations. While word_diff appears to clear and resize the vectors, this pattern could hide bugs if word_diff doesn't properly reset all state. Consider either:

  1. Moving these declarations inside the loop to ensure a fresh state for each algorithm
  2. Adding an explicit reset/clear call between iterations to make the intent clear

Collaborator


Maybe it's a feature to let state accumulate without resetting it to ensure the call can handle it.

@Byron
Collaborator

Byron commented Dec 20, 2025

Please state the nature of the medical emergency.

Thanks for summoning me :). Unfortunately, I am completely useless for looking at complex algorithms that don't have visual tests. So I really can't review this. But I did bring in a robot which isn't completely useless, and I hope it's true this time as well.

I will take a look at its suggestions too and add my opinion.

Collaborator

@Byron Byron left a comment


Copilot did inspire me to look more closely, and I think some more work is needed.

  • Let's make clear that we only deal with latin words here with a change to the function name.
  • let's document the function and all of its parameters in detail. The &mut ones need extra work to make clear what the user can expect or is expected to do. Particularly what happens to the memory there, will it build up?

I'd go as far as to say that we need a fuzz-test for this one, so could block it on #42 or change the target to merge into that one. All these index operations I wouldn't trust until a fuzzer ran it with a couple million inputs.

Thanks again for making this happen @KnorpelSenf, I hope it's not too much of a burden to go into another round.

@KnorpelSenf
Contributor Author

I skimmed most of the AI review and it largely looks like slop. I'm not really sure if there are serious changes needed, but if you think so, I can spend some more time on this

@Byron
Collaborator

Byron commented Dec 20, 2025

I went through that slop and left comments, while also summarising my takeaway in the comment above. Some more time would thus be appreciated 🙏.

@KnorpelSenf
Contributor Author

@pascalkuthe I added a commit, does GitHub still display conflicts?

@Byron

* Let's make clear that we only deal with latin words here with a change to the function name.

Do you suggest to rename word_diff to latin_word_diff?

* let's document the function and all of its parameters in detail. The `&mut` ones need extra work to make clear what the user can expect or is expected to do. Particularly what happens to the memory there, will it build up?

I can add more docs to word_diff

I'd go as far as to say that we need a fuzz-test for this one, so could block it on #42 or change the target to merge into that one. All these index operations I wouldn't trust until a fuzzer ran it with a couple million inputs.

That approach sounds strange. The indexing operations either work on all inputs or on none. There are definitely places in this crate that would benefit from fuzz-testing, but tokenisation (and vector indexing, for that matter) would be the last places where I'd employ it. Can you elaborate on why you'd want to block this on that?

- rename `word_diff` to `latin_word_diff`
- move all tests that don't validate private state to the integration level
- reorganise fixtures while at it
Collaborator

@Byron Byron left a comment


Thanks a lot!

I made some more adjustments, mostly to the testing so I get it, too 😁.

@Byron Byron merged commit 595c591 into pascalkuthe:master Dec 30, 2025
5 checks passed
@KnorpelSenf KnorpelSenf deleted the word-diff branch December 30, 2025 22:08

Development

Successfully merging this pull request may close these issues.

Add convenience feature for word diff
