Join | Mozilla Data Collective

Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Access more than 470 high-quality global datasets, built by and for the community in a transparent and ethical way.

Datasets

Tamazight NLP

Tamazight Open Speech Dataset

This dataset provides a parsed, formatted, and ready-to-use Amazigh Voice Dataset. It contains voice recordings and corresponding text transcripts in Standard Moroccan Amazigh (ⵜⴰⵎⴰⵣⵉⵖⵜ ⵜⴰⵏⴰⵡⴰⵢⵜ ⵜⴰⵎⵓⵔⴰⴽⵓⵛⵜ) intended for training Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models.

License: Apache-2.0

Locale: zgh

Task: ASR

Format: WAV, JSONL

Size: 459.09 MB

Community

Italian TTS - female voice

A 66-year-old Italian woman reading "Il fu Mattia Pascal" and "L'Argentina vista com'è", works of classic Italian literature.

License: Apache-2.0

Locale: it

Task: TTS

Format: WEBM, TSV

Size: 680.16 MB

CLEAR Global

Read Speech in Kenyan Swahili (6h)

~6 hours of prompted read speech in Kenyan Swahili from a single anonymous male speaker, with transcriptions drawn from Tatoeba-sourced sentences.

License: CC-BY-NC-4.0

Locale: sw

Task: ASR

Format: WAV, TSV

Size: 1.59 GB

RFE/RL

RFE/RL Belarusian News Text Corpus

Longitudinal Belarusian news corpus from Radio Svaboda (1997-2026) with nearly 339,000 articles and 134M tokens.

License: CC-BY-NC-SA-4.0

Locale: be

Task: NLP

Format: TXT

Size: 486.55 MB

RFE/RL

RFE/RL Afghan Dari News Text Corpus

Longitudinal Afghan Dari news corpus from Radio Azadi (2000-2026) featuring over 212,000 articles and 50M tokens.

License: CC-BY-NC-SA-4.0

Locale: prs

Task: NLP

Format: TXT

Size: 115.70 MB

RFE/RL

RFE/RL Afghan Pashto News Text Corpus

This dataset is a longitudinal news corpus for the Pashto language, sourced from Radio Azadi and covering 2000 to 2026. It contains over 204,000 articles (16M tokens).

License: CC-BY-NC-SA-4.0

Locale: ps

Task: NLP

Format: TXT

Size: 117.68 MB

RFE/RL

RFE/RL Armenian News Text Corpus

Longitudinal Armenian & English news corpus from Radio Azatutyun (1999-2026) with over 232,000 articles and 54M tokens.

License: CC-BY-NC-SA-4.0

Locale: hy,en

Task: NLP

Format: TXT

Size: 183.19 MB

CLEAR Global

TWB Parallel Sentence kits - Tigrinya (5k)

5,000 English–Tigrinya parallel sentences from CLEAR Global's Gamayun mini-kit, translated by professional translators from Tatoeba source sentences.

License: CC-BY-SA-4.0

Locale: ti

Task: MT

Format: TSV

Size: 405.06 KB

MirasAI

Sylheti Text Corpus by Haque Publishers

A 524K token Sylheti corpus featuring drama scripts and cultural texts in UTF-8 format for linguistic research and NLP development.

License: CC-BY-NC-4.0

Locale: syl

Task: NLP

Format: TXT

Size: 3.53 MB

MirasAI

Rohingya Literature Corpus

A rare 613.5K token Rohingya corpus written in the Myanmar script, featuring cultural articles and folklore for advanced NLP research and preservation.

License: CC-BY-NC-4.0

Locale: rhg

Task: NLP

Format: TXT, DOCX

Size: 7.01 MB

MirasAI

Noakhalian (নোয়াখাইল্লা) Text Corpus

A 500K token corpus of Noakhalian linguistic data including drama scripts and cultural texts for professional NLP research and language preservation.

License: CC-BY-NC-4.0

Locale: oak

Task: NLP

Format: TXT, DOCX

Size: 3.61 MB

Institute of African Digital Humanities

Tiv-TTS-Dataset

This dataset consists of segmented Tiv speech audio clips paired with text, designed for Text-to-Speech (TTS) applications.

License: NOODL-1.0

Locale: tiv

Task: TTS

Format: MP3, TSV

Size: 311.58 MB

IT'S EASY TO UPLOAD & CONTROL YOUR DATA

Upload your dataset

Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it. You can share openly, using existing licenses, or you can build your own.

JOIN THE MOVEMENT

Join Mozilla Data Collective

Mozilla Data Collective wants to radically reimagine our data as power. We are anti-extractivism, anti-monopoly and deeply, profoundly pro-people. We are a collective of linguists, technologists, activists, researchers and creatives who want AI to be all it promises to be - not all it threatens to be. Here, you can share your datasets on your own terms.

FAQs

Find answers quickly

What is Mozilla Data Collective?

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on, and to make of it what you will. We have roots in two Mozilla projects: Common Voice, a CC0 public dataset to help tech speak your language, and the Data Futures Lab, an experimental space for instigating new approaches to data stewardship challenges. Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it.

How does it work?

We partner with organizations and individuals to make their data available through Mozilla Data Collective. You can share openly, using existing licenses like Creative Commons, or you can build your own. You can open up your data for everyone or just for some types of downloaders, set custom constraints, and ask for exchange, compensation or recognition. You can govern it as an individual, a co-operative, a trust or something else. After all, it’s your data. The people who access your datasets are authenticated and held to legally binding contracts, and we have a number of dataset protection features. If you are interested in hosting data on Mozilla Data Collective, please reach out to us at support@mozilladatacollective.com.

Who is behind Mozilla Data Collective?

We are backed and stewarded by Mozilla Foundation - the non-profit, movement-building, and philanthropy arm of Mozilla.