Improve title-based category search #306

@nicolas-raoul

Description

  1. Chose a picture.
  2. Entered the file name "Nintendo sign at Tokyo branch in Taito".
  3. The proposed categories did not contain anything related to Nintendo or Taito.
  4. Waiting for some time did not change the proposed categories.

That's because the API searches for the whole string "Nintendo sign at Tokyo branch in Taito" instead of searching separately for "Nintendo", "sign", "Tokyo", "branch" and "Taito".

We would need to split the title into words, then remove grammar words such as "the", "is", "first", "to", "into" (these seem to be called stop words), or more generally all short words, and then perform a search for each remaining, seemingly relevant word.
This is harder for languages written without spaces (like Japanese), but most file names are in space-separated languages, so for now that's not a big problem.
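
As a rough illustration of the split-and-filter idea, here is a minimal Python sketch. The stop-word list, the `min_length` threshold and the function name `extract_keywords` are all assumptions made up for this example, not anything the app or the API currently has. Each returned word would then be sent as its own category search.

```python
import re

# Illustrative English stop words only; a real list would be much longer.
STOP_WORDS = {"the", "is", "first", "to", "into", "at", "in", "a", "an", "of", "and", "on"}


def extract_keywords(title, min_length=3):
    """Split a title into words and drop stop words and very short words."""
    keywords = []
    seen = set()
    for word in re.findall(r"\w+", title):
        key = word.lower()
        # Skip short words, stop words, and words already kept.
        if len(key) < min_length or key in STOP_WORDS or key in seen:
            continue
        seen.add(key)
        keywords.append(word)
    return keywords


print(extract_keywords("Nintendo sign at Tokyo branch in Taito"))
# ['Nintendo', 'sign', 'Tokyo', 'branch', 'Taito']
```

Dropping every word shorter than three characters is the crude "all small words" variant; it would also discard short but meaningful words, so the threshold is something to tune.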

A bigger problem is that most titles are not in English, which means we would first have to guess the language, and then remove stop words in the context of that language. To avoid making the app too big, we could write a multi-language extractor (for instance using nltk) and host it on a Wikimedia server.
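
One possible shape for such an extractor, sketched in Python under loud assumptions: the language is guessed simply by counting which language's stop words appear most often in the title, and the tiny per-language lists below are hand-written placeholders (nltk ships much fuller stop-word lists for many languages, which a server-side version could use instead). The names `guess_language` and `extract_keywords` are hypothetical.

```python
import re

# Placeholder stop-word lists; real ones would be far larger.
STOP_WORDS_BY_LANGUAGE = {
    "en": {"the", "is", "at", "in", "of", "and", "to", "into"},
    "fr": {"le", "la", "les", "de", "du", "des", "et", "dans", "un", "une"},
    "de": {"der", "die", "das", "und", "in", "von", "im", "am", "ein", "eine"},
}


def guess_language(title):
    """Return the language whose stop words occur most often, or None."""
    words = [w.lower() for w in re.findall(r"\w+", title)]
    best, best_hits = None, 0
    for lang, stops in STOP_WORDS_BY_LANGUAGE.items():
        hits = sum(1 for w in words if w in stops)
        if hits > best_hits:
            best, best_hits = lang, hits
    return best


def extract_keywords(title):
    """Guess the language, then drop that language's stop words."""
    lang = guess_language(title)
    stops = STOP_WORDS_BY_LANGUAGE.get(lang, set())
    words = [w for w in re.findall(r"\w+", title) if w.lower() not in stops]
    return lang, words


print(extract_keywords("Nintendo sign at Tokyo branch in Taito"))
print(extract_keywords("Panneau de la gare de Lyon"))
```

Titles with no stop words at all (a bare place name, say) give no language signal, so the sketch returns `None` and keeps every word; a real service would need a better fallback.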
