Fix normalized string append after clear #1636 by Anantha-Kandrapu · Pull Request #1660 · huggingface/tokenizers

Anantha-Kandrapu · 2024-10-20T04:10:29Z

Rewrite of the append method:
- The previous implementation used character indices and transformations.
- The new implementation directly appends the string and updates alignments.
Update to the clear method:
- Instead of using the transform method, it now directly clears the normalized string and alignments.
- Adds a single alignment point to represent the empty state after clearing.
New test case:
- Added test_append_after_clear to verify the behavior of appending after clearing a NormalizedString.
- This test ensures that appending works correctly after clearing, and that original and normalized lengths are maintained properly.

Narsil · 2025-01-08T11:49:48Z

Can we kept the potential regressions down to a minimum by making the change much smaller:

diff --git a/tokenizers/src/tokenizer/normalizer.rs b/tokenizers/src/tokenizer/normalizer.rs
index 94b56887..91857d8a 100644
--- a/tokenizers/src/tokenizer/normalizer.rs
+++ b/tokenizers/src/tokenizer/normalizer.rs
@@ -517,6 +517,9 @@ impl NormalizedString {
         if let Some((b, prev)) = self.normalized.char_indices().last() {
             let transformations = std::iter::once((prev, 0)).chain(s.chars().map(|c| (c, 1)));
             self.transform_range(Range::Normalized(b..), transformations, 0);
+        } else {
+            let transformations = s.chars().map(|c| (c, 1));
+            self.transform_range(Range::Normalized(..), transformations, 0);
         }
         self
     }
@@ -2284,4 +2287,24 @@ mod tests {
         s.lowercase();
         assert_eq!(s.get(), "a...");
     }
+
+    #[test]
+    fn test_append_after_clear() {
+        let mut n = NormalizedString::from("Hello");
+        assert_eq!(n.get(), "Hello");
+
+        n.clear();
+        assert_eq!(n.get(), "");
+
+        n.append(" World");
+        assert_eq!(n.get(), " World");
+
+        assert_eq!(n.len_original(), 5);
+        assert_eq!(n.len(), 6);
+
+        assert_eq!(n.get_range_original(Range::Original(0..5)), Some("Hello"));
+        assert_eq!(n.get_range_original(Range::Normalized(0..6)), Some(""));
+
+        assert_eq!(n.get_range(Range::Normalized(0..6)), Some(" World"));
+    }
 }

Anantha-Kandrapu · 2025-01-09T18:57:54Z

Oh Thanks, for taking a look, I see this change is fixed now, in #1717 .
Closing PR

Narsil · 2025-01-10T08:57:28Z

I put you as a co-author. Sorry I don´t work on tokenizers regularly that's why I didn't wait super long before getting the thing merged, usually we prefer to merge OP contributors' PR

Fix normalized string append after clear

c28708d

ArthurZucker self-requested a review October 22, 2024 15:16

Narsil mentioned this pull request Jan 9, 2025

Fixing NormalizedString append when normalized is empty. #1717

Merged

Anantha-Kandrapu closed this Jan 9, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix normalized string append after clear #1636#1660

Fix normalized string append after clear #1636#1660
Anantha-Kandrapu wants to merge 1 commit intohuggingface:mainfrom
Anantha-Kandrapu:NormalizedStringBugFix

Anantha-Kandrapu commented Oct 20, 2024

Uh oh!

Narsil commented Jan 8, 2025

Uh oh!

Anantha-Kandrapu commented Jan 9, 2025

Uh oh!

Narsil commented Jan 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Anantha-Kandrapu commented Oct 20, 2024

Uh oh!

Narsil commented Jan 8, 2025

Uh oh!

Anantha-Kandrapu commented Jan 9, 2025

Uh oh!

Narsil commented Jan 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants