Description
- Chose picture
- Entered file name "Nintendo sign at Tokyo branch in Taito"
- The proposed categories did not contain anything related to Nintendo or Taito.
- Waiting some time does not change the proposed categories.
That's because the API searches for "Nintendo sign at Tokyo branch in Taito" instead of "Nintendo" and "sign" and "Tokyo" and "branch" and "Taito".
We would need to split the title into words, then remove grammar words such as "the", "is", "first", "to", "into" (these seem to be called stop words) or, more generally, all small words, and then perform a search for each seemingly relevant word.
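The splitting and filtering steps described above could look roughly like the following Python sketch. The `STOP_WORDS` set, the `min_length` threshold, and the `extract_keywords` name are all assumptions made for illustration; nothing here is existing app code, and a real stop-word list would be much longer.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a complete list).
STOP_WORDS = {"a", "an", "the", "is", "at", "in", "on", "of",
              "to", "into", "first", "and", "or"}


def extract_keywords(title, min_length=3):
    """Split a title into words and keep only the seemingly relevant ones."""
    keywords = []
    seen = set()
    for word in re.findall(r"\w+", title):
        key = word.lower()
        # Drop small words, stop words, and repeated words.
        if len(word) < min_length or key in STOP_WORDS or key in seen:
            continue
        seen.add(key)
        keywords.append(word)
    return keywords


print(extract_keywords("Nintendo sign at Tokyo branch in Taito"))
# ['Nintendo', 'sign', 'Tokyo', 'branch', 'Taito']
```

Each returned keyword would then be sent to the category search separately, instead of sending the whole title as one query.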
This is harder for languages written without spaces (like Japanese), but most file names are in space-separated languages, so for now that's not a big problem.
A bigger problem is that most titles are not in English, which means we would first have to guess the language, and then remove stop words in the context of that language. To avoid making the app too big, we could write a multi-language extractor (for instance using nltk) and host it on a Wikimedia server.
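One naive way to guess the language is to count how many stop words of each language appear in the title and pick the language with the most hits. The Python sketch below illustrates that idea only; the per-language lists are tiny hand-written samples, and `guess_language` and `extract_keywords` are hypothetical names. A hosted service would presumably rely on larger lists, such as the stop-word corpora that nltk provides, or on a proper language detector.

```python
import re

# Small hand-written samples (assumption); real lists are much longer.
STOP_WORDS_BY_LANGUAGE = {
    "en": {"the", "is", "at", "in", "of", "to", "and", "a"},
    "fr": {"le", "la", "les", "de", "du", "des", "et", "à", "au", "dans"},
    "de": {"der", "die", "das", "und", "in", "im", "am", "von", "zu"},
}


def guess_language(title, default="en"):
    """Pick the language whose stop words overlap most with the title."""
    words = {w.lower() for w in re.findall(r"\w+", title)}
    best, best_score = default, 0
    for lang, stops in STOP_WORDS_BY_LANGUAGE.items():
        score = len(words & stops)
        if score > best_score:
            best, best_score = lang, score
    return best


def extract_keywords(title):
    """Return the guessed language and the title words that are not stop words."""
    lang = guess_language(title)
    stops = STOP_WORDS_BY_LANGUAGE[lang]
    words = re.findall(r"\w+", title)
    return lang, [w for w in words if w.lower() not in stops]


print(extract_keywords("Enseigne Nintendo dans le quartier de Taito"))
# ('fr', ['Enseigne', 'Nintendo', 'quartier', 'Taito'])
```

Very short titles, or titles with no stop words at all, fall back to the default language here, so this heuristic is only a starting point.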