Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Access more than 470 high-quality global datasets, built by and for the community in a transparent and ethical way.

Datasets

MDC Curators

CoVoST 2 Catalan - English

The dataset includes audio files in Catalan and their translations in English.

License: CC-BY-NC-4.0

Locale: ca, en

Task: MT

Format: MP3, TSV

Size: 4.50 GB

MDC Curators

CoVoST 2 Welsh - English

The dataset includes audio files in Welsh and their translations in English.

License: CC-BY-NC-4.0

Locale: cy, en

Task: MT

Format: MP3, TSV

Size: 95.16 MB

MDC Curators

CoVoST 2 Arabic-English

The dataset includes 5736 audio files in Arabic and their translations in English.

License: CC-BY-NC-4.0

Locale: ar, en

Task: MT

Format: MP3, TSV

Size: 148.21 MB

CLEAR Global

TWB Voice 1.0 - Hausa

Read speech corpus of 58 hours of Hausa audio with transcriptions, collected by CLEAR Global for ASR development.

License: CC-BY-NC-4.0

Locale: ha

Task: ASR

Format: WAV, TSV

Size: 11.88 GB

MDC Community Concierge

VoxForge - German

32 hours (24685 utterances) of read speech in German (Deutsch).

License: GPL-3.0

Locale: de

Task: ASR

Format: WAV, TSV

Size: 11.59 GB

MDC Community Concierge

VoxForge - Greek

4 hours (1377 utterances) of read speech in Greek (Ελληνικά).

License: GPL-3.0

Locale: el

Task: ASR

Format: WAV, TSV

Size: 954.26 MB

MDC Community Concierge

VoxForge - Catalan

39 minutes of Catalan read speech.

License: GPL-3.0

Locale: ca

Task: ASR

Format: WAV, TSV

Size: 143.71 MB

0DIN by Mozilla

Public GenAI Vulnerability Disclosures

Weekly export of 0DIN's public GenAI vulnerability disclosures: jailbreaks, prompt injection, content manipulation, and other model-security findings.

License: CC-BY-4.0

Locale: en

Task: N/A

Format: JSONL

Size: 22.39 KB

MDC Community Concierge

VoxForge - Bulgarian

1 hour of read speech in Bulgarian, collected via the VoxForge project.

License: GPL-3.0

Locale: bg

Task: ASR

Format: WAV, TSV

Size: 234.06 MB

MDC Curators

CoVoST 2 German-English

The dataset includes 154856 audio files in German and their translations in English.

License: CC-BY-NC-4.0

Locale: de, en

Task: MT

Format: MP3, TSV

Size: 5.90 GB

MDC Curators

CoVoST 2 English-Slovenian

The dataset includes 320492 audio files in English and their translations in Slovenian.

License: CC-BY-NC-4.0

Locale: en, sl

Task: MT

Format: MP3, TSV

Size: 12.56 GB

CLEAR Global

TWB Parallel Sentence kits - Congo Swahili (25k)

25,305 French–Congo Swahili parallel sentences from CLEAR Global's Gamayun kits, translated by professional translators from Tatoeba source sentences.

License: CC-BY-4.0

Locale: swc

Task: MT

Format: TSV

Size: 2.18 MB

IT'S EASY TO UPLOAD & CONTROL YOUR DATA

Upload your dataset

Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it. You can share openly, using existing licenses, or you can build your own.

JOIN THE MOVEMENT

Join Mozilla Data Collective

Mozilla Data Collective wants to radically reimagine our data as power. We are anti-extractivism, anti-monopoly and deeply, profoundly pro-people. We are a collective of linguists, technologists, activists, researchers and creatives who want AI to be all it promises to be, not all it threatens to be. Here, you can share your datasets on your own terms.

FAQs

Find answers quickly

What is Mozilla Data Collective?

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on, and make of it what you will. We have roots in two Mozilla projects: Common Voice, a CC0 public dataset to help tech speak your language, and the Data Futures Lab, an experimental space for instigating new approaches to data stewardship challenges. Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it.


How does it work?

We partner with organizations and individuals to make their data available through Mozilla Data Collective. You can share openly, using existing licenses like Creative Commons, or you can build your own. You can open up your data to everyone, or only to some types of downloaders. You can set custom constraints and ask for exchange, compensation or recognition. You can govern it as an individual, a co-operative, a trust or something else. After all, it’s your data. The people who access your datasets are authenticated and bound by legally binding contracts, and we have a number of dataset protection features. If you are interested in hosting data on Mozilla Data Collective, please reach out to us at support@mozilladatacollective.com.


Who is behind Mozilla Data Collective?

We are backed and stewarded by the Mozilla Foundation, the non-profit, movement-building and philanthropy arm of Mozilla.