4 Comments
Dwayne McDaniel

This is a great write-up. Thank you.

I am curious, though, about the internationalization side of this. How would this fare against Polish or Cyrillic-alphabet languages for tokenization? For example, the English word "luck" is 1 token, but the Polish equivalent, "szczęście", produces 4 tokens according to the tool you shared, which would meet the definition of "high tokenization" while still being a common dictionary word.
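(A minimal sketch of how to reproduce these counts, assuming the tiktoken Python package and the cl100k_base encoding mentioned in the reply below, and assuming Token Efficiency is characters per token as in the article; exact token counts depend on the tokenizer:)

```python
import tiktoken  # assumes tiktoken is installed (pip install tiktoken)

enc = tiktoken.get_encoding("cl100k_base")

for word in ["luck", "szczęście"]:
    tokens = enc.encode(word)
    efficiency = len(word) / len(tokens)  # token efficiency = characters per token
    print(f"{word}: {len(tokens)} token(s), efficiency {efficiency:.2f}")
```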

I am wondering if this means the approach works for English words due to training bias, but for other alphabets entropy might still be needed?

Zachary Rice

Hey Dwayne, thanks for the compliment and question! You are right, Token Efficiency using cl100k_base is biased toward English. There may be another "all-language" model that produces similar token efficiency values for common and rare words across languages. Entropy is still a great filter and is needed in order to squeak out that 0.89 F1 score (the generic rule uses an entropy threshold of 2.73 plus Token Efficiency). But you're definitely right: the way Token Efficiency is shipped today, you would still need an entropy filter for Polish or Cyrillic-alphabet languages.
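(A minimal sketch of how an entropy-plus-Token-Efficiency check could be combined. The 2.73 entropy cutoff comes from the comment above; the 2.5 efficiency cutoff is an illustrative placeholder, not the exact shipped rule:)

```python
import math
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy in bits."""
    if not s:
        return 0.0
    counts = {ch: s.count(ch) for ch in set(s)}
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())

def token_efficiency(s: str) -> float:
    """Characters per cl100k_base token; natural language scores high, random secrets low."""
    return len(s) / max(len(ENC.encode(s)), 1)

def looks_like_secret(candidate: str,
                      entropy_cutoff: float = 2.73,
                      efficiency_cutoff: float = 2.5) -> bool:
    # Flag only candidates that are both high-entropy and token-inefficient.
    # 2.73 is the entropy threshold mentioned above; 2.5 is a placeholder.
    return (shannon_entropy(candidate) > entropy_cutoff
            and token_efficiency(candidate) < efficiency_cutoff)

for candidate in ["szczęście", "Passw0rd-gmail"]:
    print(candidate, shannon_entropy(candidate),
          token_efficiency(candidate), looks_like_secret(candidate))
```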

Feel free to open an issue for this on the Betterleaks repo if you're feeling motivated. Otherwise I'll eventually get to this.

Dmitriy Alergant

Thanks for the acknowledgment, and glad I was able to contribute.

Great article!

I wonder if you looked manually into the TNs in the [2.0, 2.5] token efficiency range. What are they? Are they largely weak, hand-created passwords, or something short? In many use cases, the stakeholder may decide this is not what they even need to protect against. If someone is using `Passw0rd-gmail` as a credential (14/6 = 2.33), they have bigger problems besides it being hardcoded somewhere, and it may not be worth protecting at the scanner level. Potentially the threshold could still be moved to 2.00, or to 2.05-2.10.

P.S. I am also continuing to develop this idea separately, but haven't had a chance to publish formalized research like you did. Congrats! I still like the term 'token density', although in that case the formula needs to be reversed: len(tok)/len(str), with density thresholds lingering in the 0.5-ish range.
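(A sketch of the reversed formula: 'token density' is just the inverse of Token Efficiency, so a ~0.5 density cutoff corresponds to a ~2.0 efficiency cutoff:)

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_density(s: str) -> float:
    # Reversed formula: tokens per character, i.e. 1 / token efficiency.
    return len(enc.encode(s)) / max(len(s), 1)

# Per the numbers above, `Passw0rd-gmail` is 6 tokens over 14 characters,
# so its density would be roughly 6/14 ≈ 0.43 (efficiency ≈ 2.33).
print(token_density("Passw0rd-gmail"))
```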

Henry

Hey, great write-up!

Just because it's also written in the post:

"A quick note on passwords. Token Efficiency does not do well with classifying bad passwords like “password123” or “chibearsfan123”. These passwords are basically natural language which means a high token efficiency value. Pass phrases also don’t do well because those are usually just straight up words."

What do you think is the best way to find these, then? Or is it something to drop from a secret scanner because "whoever uses such a weak password should be pwned anyway"?