Skip to content

Fix type notation of merges in BPE Python binding#1685

Closed
Coqueue wants to merge 1 commit intohuggingface:mainfrom
Coqueue:bpe_merges_py_notation
Closed

Fix type notation of merges in BPE Python binding#1685
Coqueue wants to merge 1 commit intohuggingface:mainfrom
Coqueue:bpe_merges_py_notation

Conversation

@Coqueue
Copy link
Copy Markdown
Contributor

@Coqueue Coqueue commented Nov 21, 2024

Heya, this change is to correct type notations of merges in Python binding of BPE as the counterpart in Rust implementation expects a file name string or Vec<(String, String)> instead of a mapping from a tuple of integers to another.

@Narsil
Copy link
Copy Markdown
Contributor

Narsil commented Jan 8, 2025

Why do you think the annotations are incorrect ? The 2 things are unlinked and the signature here is correct.

@Narsil Narsil closed this Jan 8, 2025
@Coqueue
Copy link
Copy Markdown
Contributor Author

Coqueue commented Jan 8, 2025

Hi Nicolas, thanks for taking a look.

The 2 things are unlinked and the signature here is correct.

I wonder if this is true.
Taking the Python SentencePieceBPETokenizer class as an example, the merges annotated as Optional[Union[str, Dict[Tuple[int, int], Tuple[int, int]]]] is only used to initialize the BPE class:

if vocab is not None and merges is not None:
        tokenizer = Tokenizer(BPE(vocab, merges, dropout=dropout, unk_token=unk_token, fuse_unk=fuse_unk))

where the BPE class is generated from the Rust implementation mentioned in the original post.

BPE class also contains below comment:

class BPE(Model):
    """
    An implementation of the BPE (Byte-Pair Encoding) algorithm

    Args:
        vocab (:obj:`Dict[str, int]`, `optional`):
            A dictionary of string keys and their ids :obj:`{"am": 0,...}`

        merges (:obj:`List[Tuple[str, str]]`, `optional`):
            A list of pairs of tokens (:obj:`Tuple[str, str]`) :obj:`[("a", "b"),...]`

At least this comment gives a different type annotation for merges.

Why do you think the annotations are incorrect?

I encountered an issue initializing SentencePieceBPETokenizer with the given type annotation, but succeeded following the type annotation given in the comment of BPE class.

The type annotation of merges being Dict[Tuple[int, int], Tuple[int, int]]]] should actually be for MergesMap in Rust, as an alias of HashMap<Pair, (u32, u32)>, which is built internally in Rust from merges.

As you closed the PR as a repo collaborator, I'll provide the above context -- I'll be more than happy to learn more if I was wrong.

@Coqueue
Copy link
Copy Markdown
Contributor Author

Coqueue commented Mar 17, 2025

@Narsil Hi Nicolas, appreciate it if you could take a look at the above reply.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants