iseeaswell꩜bʂky (@iseeaswell) / X

437 posts

iseeaswell꩜bʂky

@iseeaswell

low resource MT, plants, insects, music+sangeetham. Join TUSL, the Low Resource NLP Discord: discord.gg/z3ya9EUS2U

Joined September 2019

iseeaswell꩜bʂky
@iseeaswell
May 17, 2022
How many languages can we support with Machine Translation? We train a translation model on 1000+ languages, using it to launch 24 new languages on Google Translate without any parallel data for these languages. arxiv.org/abs/2205.03983 Technical 🧵 below: 1/18
iseeaswell꩜bʂky
@iseeaswell
Oct 29, 2020
What do we need to scale NLP research to 1000 languages? We started off with a goal to build a monolingual corpus in 1000 languages by mining data from the web. Here’s our work documenting our struggles with Language Identification (LangID): arxiv.org/abs/2010.14571 1/8
iseeaswell꩜bʂky
@iseeaswell
Jun 27, 2024
Excited to announce that 110 languages got added to Google Translate today! Time for context on these languages, especially the communities who helped a lot over the past few years, including Cantonese, NKo, and Faroese volunteers. Also, a 110-language YouTube playlist. 🧵
iseeaswell꩜bʂky
@iseeaswell
Mar 23, 2021
Does the data used for multilingual modeling really contain content in the languages it says it does? Short answer: sometimes 🙁 arxiv.org/abs/2103.12028 1/n
iseeaswell꩜bʂky
@iseeaswell
May 11, 2022
Happy to finally be public about my main project over the last few years: adding more languages to Translate!
Ankur Bapna
@ankurbpn
May 11, 2022
Excited to share some real world results from our effort on building machine translation models for long tail languages. Here's the research paper that describes the approach in more detail: arxiv.org/abs/2205.03983 Tweet 🧵 coming soon :)
iseeaswell꩜bʂky
@iseeaswell
Sep 25, 2023
Have you ever wanted a LangID model that works on 1500+ languages? Check out FUN-LangID: github.com/google-researc…!
iseeaswell꩜bʂky
@iseeaswell
Mar 28, 2023
I'm excited to open-source GATITOS, a new multilingual lexicon in 26 long-tail languages! arxiv.org/pdf/2303.15265… shows how to use it for an average ChrF boost of +7.0 to +10 over baseline. Open-sourced data here: github.com/google-researc… (1/10)
iseeaswell꩜bʂky
@iseeaswell
Jun 5, 2022
Do you want your language to be supported by NLP (like Google Translate) -- or left alone? Please fill this form if you have thoughts you'd like to share with me :) docs.google.com/forms/d/e/1FAI… 1/4
docs.google.com
Do you speak a language not on Google Translate?
We are looking for people from various communities to talk to and understand whether they would want their language more supported by technology -- e.g. Google Translate -- or not. You may also fill...
iseeaswell꩜bʂky
@iseeaswell
May 9, 2023
Just added 84 more languages to GATITOS! It now has a total of 113 languages, many of which have no other public resources 😊
GitHub - google-research/url-nlp
From github.com
iseeaswell꩜bʂky
@iseeaswell
Oct 29, 2020
Replying to @iseeaswell
As a closing note: PLEASE LOOK AT ANY DATA YOU CRAWL OR TRAIN ON. Publicly available LangID and web corpora also have these issues for lower-resource languages. 7/8
iseeaswell꩜bʂky
@iseeaswell
Jul 1, 2024
Do you want to help improve translation for the 110 new Google Translate languages? One way is to help correct GATITOS 😼🧵
iseeaswell꩜bʂky
@iseeaswell
Jun 6, 2024
Replying to @xkcd
The diagram is great but from a linguist's perspective, I don't think the bottom statement is related. [t] -> [ʔ] is just the allophone of /t/ used word-finally in America, and is not inherently more efficient. (And this is not done by native speakers in Ireland etc.)
iseeaswell꩜bʂky
@iseeaswell
Nov 15, 2023
Announcing BREAD, a new benchmark for noisy text detection, and CRED, the scoring functions we open-source to solve the problem!
arxiv.org
Separating the Wheat from the Chaff with BREAD: An open-source...
Data quality is a problem that perpetually resurfaces throughout the field of NLP, regardless of task, domain, or architecture, and remains especially severe for lower-resource languages. A...
iseeaswell꩜bʂky
@iseeaswell
Oct 29, 2020
Replying to @iseeaswell
For example, did you know how much ᏋᏁᎶᏝᎥᏕᏂ is written in Cherokee syllabics online? Or that because of the common Oromo 4-gram “essa”, a majority of web-crawled “Oromo” may actually be English sentences containing “essay” multiple times? 5/8