Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Access over 300 high-quality global datasets, built by and for the community in a transparent and ethical way.

Datasets

Effect AI

Effect AI Scripted Speech 1.0 - English

A collection of scripted spoken phrases in English.

License: CC0-1.0

Locale: en

Task: TTS

Format: CSV, MP3

Size: 663.45 MB

Amara Hub

DataTrust Africa: Speech Corpus of Public Radio Recordings from Northern Uganda

This is an open-access corpus of short clips of public radio content from Mega 100 FM, Q FM, Radio Pacis and Radio Rupiny in Northern Uganda. The online corpus currently has over 350 clips of recordings in English. We also hope to add finely annotated transcripts to them. The dataset is intended for NLP research and other non-commercial use. Upcoming datasets to look out for from Amara Hub are public radio recordings in other languages spoken in the region, such as Acholi, Lango, Lugbara and Akaramajong.

License: NOODL-1.0

Locale: en-US

Task: NLP

Format: MP3

Size: 179.82 MB

Digital Divide Data

Khmer ASR Cultural Dataset

37.62 hours of manually curated speech-text pairs, recorded by native speakers of Khmer, on Cambodian cultural topics. On average, each recording is 8.29 seconds long, with a standard deviation of 3.87 seconds. Speaker metadata (gender, age group, and origin city) is provided.

License: CC-BY-SA-4.0

Locale: khm

Task: ASR

Format: WAV

Size: 12.59 GB
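
The description of the Khmer ASR Cultural Dataset gives a total duration and a mean clip length but no clip count. As a rough back-of-the-envelope sketch, the two quoted figures can be combined to estimate how many recordings to expect. The function name `estimate_clip_count` is hypothetical, and the result is only an approximation derived from rounded figures, not an official count.

```python
# Figures quoted in the dataset description (rounded by the publisher).
TOTAL_HOURS = 37.62
MEAN_CLIP_SECONDS = 8.29


def estimate_clip_count(total_hours: float, mean_clip_seconds: float) -> int:
    """Estimate the number of recordings from total duration and mean clip length."""
    total_seconds = total_hours * 3600
    return round(total_seconds / mean_clip_seconds)


estimated_clips = estimate_clip_count(TOTAL_HOURS, MEAN_CLIP_SECONDS)
print(f"Estimated recordings: about {estimated_clips:,}")
```

The estimate lands in the region of sixteen thousand recordings, which is useful mainly for sizing storage and planning train/validation splits before downloading.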

PT Pancaran Semangat Jaya

Corpus of Panjebar Semangat Javanese-Language Magazine

This dataset is a TXT-format collection compiled from three years of popular articles published in the Javanese-language weekly magazine Panjebar Semangat. It compiles widely read, non-academic Javanese texts reflecting contemporary themes and language use.

License: CC-BY-SA-4.0

Locale: Jav

Task: OTH

Format: TXT

Size: 4.31 MB

Center za jezikovne vire in tehnologije Univerze v Ljubljani

SI-NLI

SI-NLI (Slovene Natural Language Inference Dataset) contains 5,937 human-created Slovene sentence pairs (premise and hypothesis) that are manually labeled "entailment", "contradiction", or "neutral". We created the dataset using sentences that appear in the Slovenian reference corpus ccKres (http://hdl.handle.net/11356/1034). Annotators were tasked with modifying the hypothesis in a candidate pair so that it reflects one of the labels. The dataset is balanced, since the annotators created three modifications (entailment, contradiction, neutral) for each candidate sentence pair. The dataset is split into train, validation, and test sets of 4,392, 547, and 998 pairs, respectively. We used Slovenian pre-trained language models to create the splits, thereby ensuring that difficult and easy instances are evenly distributed across all three subsets. The dataset is released in a tabular TSV format. The README.txt file contains a description of the attributes. Only the hypothesis and premise are given in the test set (i.e. no annotations), since SI-NLI is integrated into the Slovene evaluation framework SloBENCH (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting the test set predictions to SloBENCH to get the evaluation score and see how it compares to others.

License: CC-BY-NC-SA-4.0

Locale: sl

Task: NLU

Format: TSV

Size: 392.44 KB
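
As a quick sanity check on the SI-NLI figures quoted above, the sketch below confirms that the three split sizes add up to the stated total and reports each split's share. The names `SPLIT_SIZES` and `split_fractions` are illustrative only; nothing here reads the actual TSV files, whose column names are documented in the dataset's README.txt.

```python
# Split sizes and total as quoted in the SI-NLI description.
SPLIT_SIZES = {"train": 4392, "validation": 547, "test": 998}
TOTAL_PAIRS = 5937


def split_fractions(sizes: dict[str, int]) -> dict[str, float]:
    """Return each split's share of the total number of sentence pairs."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# The quoted splits should account for every sentence pair.
assert sum(SPLIT_SIZES.values()) == TOTAL_PAIRS

fractions = split_fractions(SPLIT_SIZES)
for name, share in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]} pairs ({share:.1%})")
```

The split works out to roughly 74% train, 9% validation, and 17% test. Because the test set ships without labels, only the train and validation portions can be used for local evaluation.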

Pro Svizra Rumantscha

Vallader Newspaper Corpus

6.2 million tokens in the Vallader variety of Romansh from the daily newspaper "La Quotidiana".

License: CC0-1.0

Locale: rm-vallader

Task: OTH

Format: TSV

Size: 18.71 MB

Kaleem Art Press

Multilingual Religious Parallel Corpus (Kaleem Art Press)

This dataset is a multilingual parallel sentence corpus containing 6,465 aligned sentence units with approximately 0.98 million words, curated from Kaleem Art Press archives. It includes parallel religious text data in Arabic, Urdu, Saraiki (standard and dialectal), Punjabi (Shahmukhi), and English, supporting research in machine translation, comparative linguistics, digital humanities, and low-resource language studies.

License: CC-BY-SA-4.0

Locale: mul

Task: MT

Format: CSV

Size: 2.27 MB

Sindh Line Publishers

Sindh Line Publishers

The corpus contains 1.029 million tokens from Sindh Line, a Sindhi newspaper, covering issues published in 2024-2025. The text consists of the complete newspaper content, including headlines, editorials, finance news and advertisements. The newspaper is published daily in Karachi, Pakistan.

License: CC-BY-SA-4.0

Locale: snd

Task: NLP

Format: TXT

Size: 2.22 MB

Institute of African Digital Humanities

Spoken-Congolese-French-Dataset

The dataset consists of paired audio and text resources on spoken French from the Republic of the Congo. The audio files were extracted from longer recordings of semi-guided interviews conducted in Brazzaville, and orthographic transcriptions were added. The long audio recordings and their corresponding TRJS transcription files were automatically clipped alongside their respective transcriptions. The dataset comprises ten folders containing audio files and ten audio/text mapping files.

License: NOODL-1.0

Locale: fr-CG

Task: NLP

Format: MP3, WAV, TSV

Size: 3.44 GB

Institute of African Digital Humanities

Ewondo_Mbida-Mbani_ALCAM-MultimodalDataset

This dataset comprises a datasheet of Ewondo (Ewo) lexical entries collected in the speech area known as Mbida Mbani. Each entry is accompanied by illustrative sentences, word-by-word glosses and French translations. The resource is enriched with aligned audio recordings, making it suitable for linguistic analysis and speech technology development.

License: NOODL-1.0

Locale: ewo

Task: NLP

Format: MP3, TSV

Size: 19.25 MB

Balochi Academy

Balochi Academy Text Corpus

This corpus contains approximately 500k tokens of text from novels, poetry, articles, riddles, and proverbs, covering both literary and traditional genres. It is intended for linguistic research, NLP tasks (e.g., language modeling and text analysis), and cultural documentation.

License: CC-BY-NC-SA-4.0

Locale: bgn

Task: NLP

Format: TXT

Size: 1.88 MB

Institute of African Digital Humanities

Mada Narratives

This dataset contains 17 transcribed oral narratives in Mada (mxu), a language belonging to the Afro-Asiatic family that is spoken in Cameroon. The texts, derived from audio recordings of oral literature, reflect natural spoken discourse. This dataset can be used for language modelling, text analysis and other natural language processing (NLP) tasks.

License: NOODL-1.0

Locale: mxu

Task: NLP

Format: TXT

Size: 65.04 KB

JOIN THE MOVEMENT

Join Mozilla Data Collective

Mozilla Data Collective wants to radically reimagine our data as power. We are anti-extractivism, anti-monopoly and deeply, profoundly pro-people. We are a collective of linguists, technologists, activists, researchers and creatives who want AI to be all it promises to be - not all it threatens to be. Here, you can share your datasets on your own terms.

FAQs

Find answers quickly

What is Mozilla Data Collective?

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on, and make of it what you will. We have roots in two Mozilla projects: Common Voice, a CC0 public dataset to help tech speak your language, and the Data Futures Lab, an experimental space for instigating new approaches to data stewardship challenges. Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it.


How does it work?

We partner with organizations and individuals to make their data available through Mozilla Data Collective. You can share openly, using existing licenses like Creative Commons, or you can build your own. You can open up your data to everyone or just to some types of downloaders, set custom constraints, and ask for exchange, compensation or recognition. You can govern it as an individual, a co-operative, a trust or something else. After all, it’s your data. The people who access your datasets are authenticated and bound by legally binding contracts, and we have a number of dataset protection features. If you are interested in hosting data on Mozilla Data Collective, please reach out to us at mozilladatacollective@mozillafoundation.org.


Who is behind Mozilla Data Collective?

We are backed and stewarded by Mozilla Foundation - the non-profit, movement-building, and philanthropy arm of Mozilla.