Skip to content

Fix normalized string append after clear #1636#1660

Closed
Anantha-Kandrapu wants to merge 1 commit intohuggingface:mainfrom
Anantha-Kandrapu:NormalizedStringBugFix
Closed

Fix normalized string append after clear #1636#1660
Anantha-Kandrapu wants to merge 1 commit intohuggingface:mainfrom
Anantha-Kandrapu:NormalizedStringBugFix

Conversation

@Anantha-Kandrapu
Copy link
Copy Markdown

  1. Rewrite of the append method:

    • The previous implementation used character indices and transformations.
    • The new implementation directly appends the string and updates alignments.
  2. Update to the clear method:

    • Instead of using the transform method, it now directly clears the normalized string and alignments.
    • Adds a single alignment point to represent the empty state after clearing.
  3. New test case:

    • Added test_append_after_clear to verify the behavior of appending after clearing a NormalizedString.
    • This test ensures that appending works correctly after clearing, and that original and normalized lengths are maintained properly.

#1636

@ArthurZucker ArthurZucker self-requested a review October 22, 2024 15:16
@Narsil
Copy link
Copy Markdown
Contributor

Narsil commented Jan 8, 2025

Can we kept the potential regressions down to a minimum by making the change much smaller:

diff --git a/tokenizers/src/tokenizer/normalizer.rs b/tokenizers/src/tokenizer/normalizer.rs
index 94b56887..91857d8a 100644
--- a/tokenizers/src/tokenizer/normalizer.rs
+++ b/tokenizers/src/tokenizer/normalizer.rs
@@ -517,6 +517,9 @@ impl NormalizedString {
         if let Some((b, prev)) = self.normalized.char_indices().last() {
             let transformations = std::iter::once((prev, 0)).chain(s.chars().map(|c| (c, 1)));
             self.transform_range(Range::Normalized(b..), transformations, 0);
+        } else {
+            let transformations = s.chars().map(|c| (c, 1));
+            self.transform_range(Range::Normalized(..), transformations, 0);
         }
         self
     }
@@ -2284,4 +2287,24 @@ mod tests {
         s.lowercase();
         assert_eq!(s.get(), "a...");
     }
+
+    #[test]
+    fn test_append_after_clear() {
+        let mut n = NormalizedString::from("Hello");
+        assert_eq!(n.get(), "Hello");
+
+        n.clear();
+        assert_eq!(n.get(), "");
+
+        n.append(" World");
+        assert_eq!(n.get(), " World");
+
+        assert_eq!(n.len_original(), 5);
+        assert_eq!(n.len(), 6);
+
+        assert_eq!(n.get_range_original(Range::Original(0..5)), Some("Hello"));
+        assert_eq!(n.get_range_original(Range::Normalized(0..6)), Some(""));
+
+        assert_eq!(n.get_range(Range::Normalized(0..6)), Some(" World"));
+    }
 }

@Anantha-Kandrapu
Copy link
Copy Markdown
Author

Oh Thanks, for taking a look, I see this change is fixed now, in #1717 .
Closing PR

@Narsil
Copy link
Copy Markdown
Contributor

Narsil commented Jan 10, 2025

I put you as a co-author. Sorry I don´t work on tokenizers regularly that's why I didn't wait super long before getting the thing merged, usually we prefer to merge OP contributors' PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants