feat(languages): Add english-ngrams #109

Merged
max-niederman merged 1 commit into max-niederman:main from heysokam:patch-1
Feb 5, 2024

@heysokam
Contributor

Based on the app and wordlist from:
https://github.com/ranelpadon/ngram-type

Owner

@max-niederman max-niederman left a comment

I agree that this is a useful dictionary to add, but I'm not sure about the naming. "N-gram" can refer to sequences of any kind of symbol, including words. In fact, all of our dictionaries are lists of common 1-grams, where the symbols are words. I suggest english-nchars, unless you have another idea.
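
To make the naming point concrete, here is a minimal sketch of how the same n-gram definition applies to characters and to words alike. The helper name `ngrams` and the sample text are illustrative assumptions; nothing here is taken from this repository or from ngram-type.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def ngrams(symbols: Sequence[T], n: int) -> list[tuple[T, ...]]:
    """Return every contiguous run of n symbols, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows of size n (none if L < n).
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


text = "the cat sat"  # hypothetical sample input

# Symbols are characters: what ngram-type style lists contain.
char_bigrams = ngrams(text.replace(" ", ""), 2)

# Symbols are words: an ordinary word list is the n = 1 case of this.
word_unigrams = ngrams(text.split(), 1)
word_bigrams = ngrams(text.split(), 2)
```

Both calls go through the same function, so the only thing separating the two kinds of dictionary is the choice of symbol.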

@heysokam
Contributor Author

heysokam commented Feb 2, 2024

They are still n-grams, not n-chars, even if they are using characters as their symbols.
The term N-gram for this language set is technically more correct than calling a language dataset a "unigram where each symbol is a word".
The case of n-gram symbols being words is the rare one, not the other way around.

I think the name is intuitive: it shows up on Google for someone who doesn't know what n-grams are, and Wikipedia itself gives the right description of the concept (and even explains the context of unigrams where symbols are words). So I would say the more intuitive, pre-existing meaning should be kept.

@max-niederman
Owner

> They are still n-grams, not n-chars, even if they are using characters as their symbols.
> The term N-gram for this language set is technically more correct than calling a language dataset a "unigram where each symbol is a word".

This is not true; neither is more technically correct because "n-gram" is a very broad term and applies to both. That's why I'm hesitant to call only one "n-gram" as its distinguishing feature.

> I think the name is intuitive: it shows up on Google for someone who doesn't know what n-grams are, and Wikipedia itself gives the right description of the concept (and even explains the context of unigrams where symbols are words). So I would say the more intuitive, pre-existing meaning should be kept.

This is a valid point, though. "N-gram" is more searchable, at the very least because of ngram-type. I'm going to go ahead and merge this, although in v2 I think this'll need to be replaced by n-gram generation, which is already planned.
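
The thread does not say how the planned v2 n-gram generation would work. Purely as a hypothetical sketch of one way it might look, the snippet below derives the most frequent character n-grams from an existing word list instead of shipping them as a separate dictionary. The function name `top_char_ngrams` and the sample words are assumptions made for illustration.

```python
from collections import Counter


def top_char_ngrams(words: list[str], n: int, k: int) -> list[str]:
    """Return the k most frequent character n-grams found inside the given words."""
    counts: Counter[str] = Counter()
    for word in words:
        # Slide a window of n characters across each word; words shorter than n add nothing.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(k)]


sample_words = ["the", "then", "there", "other"]  # hypothetical word list
print(top_char_ngrams(sample_words, 2, 3))
```

Generating the list this way would let any word dictionary double as an n-gram source, which is one reading of why a dedicated english-ngrams list might later become redundant.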

@max-niederman max-niederman merged commit 2bd7555 into max-niederman:main Feb 5, 2024