How many languages can we support with Machine Translation? We train a translation model on 1000+ languages, using it to launch 24 new languages on Google Translate without any parallel data for these languages. arxiv.org/abs/2205.03983 Technical 🧵 below: 1/18
iseeaswell꩜bʂky
437 posts
low resource MT, plants, insects, music+sangeetham.
Join TUSL, the Low Resource NLP Discord: discord.gg/z3ya9EUS2U
Joined September 2019
- What do we need to scale NLP research to 1000 languages? We started off with a goal to build a monolingual corpus in 1000 languages by mining data from the web. Here’s our work documenting our struggles with Language Identification (LangID): arxiv.org/abs/2010.14571 1/8
- Excited to announce that 110 languages got added to Google Translate today! Time for context on these languages, especially the communities who helped a lot over the past few years, including Cantonese, NKo, and Faroese volunteers. Also, a 110-language youtube playlist. 🧵
- Does the data used for multilingual modeling really contain content in the languages it says it does? Short answer: sometimes 🙁 arxiv.org/abs/2103.12028 1/n
- Happy to finally be public about my main project over the last few years: adding more languages to Translate! Excited to share some real world results from our effort on building machine translation models for long tail languages. Here's the research paper that describes the approach in more detail: arxiv.org/abs/2205.03983 Tweet 🧵 coming soon :)
- Have you ever wanted a LangID model that works on 1500+ languages? Check out FUN-LangID: github.com/google-researc… !
- I'm excited to open-source GATITOS, a new multilingual lexicon in 26 long-tail languages! arxiv.org/pdf/2303.15265… shows how to use it for an average ChrF boost of +7.0 to +10 over baseline. Open-sourced data here: github.com/google-researc… (1/10)
- Do you want your language to be supported by NLP (like Google Translate) -- or left alone? Please fill this form if you have thoughts you'd like to share with me :) docs.google.com/forms/d/e/1FAI… 1/4
- Just added 84 more languages to GATITOS! It now has a total of 113 languages, many of which have no other public resources 😊
- Replying to @iseeaswell: As a closing note: PLEASE LOOK AT ANY DATA YOU CRAWL OR TRAIN ON. Publicly available LangID and web corpora also have these issues for lower-resource languages. 7/8
- Do you want to help improve translation for the 110 new Google Translate languages? One way is to help correct GATITOS 😼🧵
- Replying to @xkcd: The diagram is great but from a linguist's perspective, I don't think the bottom statement is related. [t] -> [ʔ] is just the allophone of /t/ used word-finally in America, and is not inherently more efficient. (And this is not done by native speakers in Ireland etc.)
- Announcing BREAD, a new benchmark for noisy text detection, and CRED, the scoring functions we open-source to solve the problem!
- Replying to @iseeaswell: For example, did you know how much ᏋᏁᎶᏝᎥᏕᏂ is written in Cherokee syllabics online? Or that because of the common Oromo 4-gram “essa”, a majority of web-crawled “Oromo” may actually be English sentences containing “essay” multiple times? 5/8




