Common Crawl Data

Common Crawl is an open repository of web crawl data, freely available to researchers, developers, and innovators worldwide. The corpus comprises petabytes of data collected since 2008 and is hosted on Amazon S3, with free access under the AWS Open Data Sponsorship Program.
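
For orientation, the sketch below lists a few objects from the public commoncrawl S3 bucket using anonymous (unsigned) access. It assumes boto3 is installed; the crawl identifier in the prefix is only an example.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client: no AWS credentials are required for the public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Crawls are laid out under crawl-data/<CRAWL-ID>/; this crawl ID is illustrative.
resp = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2024-10/segments/",
    MaxKeys=5,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```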

Core Archives

Main Crawl Archives

Comprehensive monthly snapshots of the web containing raw HTML, metadata, and extracted text from billions of pages.
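
As a starting point, the following sketch iterates over the response records of a locally downloaded archive file using the warcio library (an assumed dependency); the filename is hypothetical.

```python
from warcio.archiveiterator import ArchiveIterator

# Path to a locally downloaded, gzipped WARC file (hypothetical name).
with open("CC-MAIN-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # "response" records hold the raw HTTP responses, including page HTML.
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body))
```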

Indexes

Search Indexes

Efficient indexes for searching and querying across the entire Common Crawl corpus at scale.
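
One common entry point is the CDX index server at index.commoncrawl.org. The sketch below assumes the requests package and uses an example crawl label; it looks up captures of a URL and prints where each matching record lives in the archives.

```python
import json
import requests

# Each crawl has its own index endpoint; the crawl label here is only an example.
api = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"
resp = requests.get(api, params={"url": "commoncrawl.org/*", "output": "json"}, timeout=30)
resp.raise_for_status()

# The server returns one JSON object per line: the captured URL plus the
# WARC filename, byte offset, and length needed to fetch that record.
for line in resp.text.splitlines()[:5]:
    capture = json.loads(line)
    print(capture["url"], capture["filename"], capture["offset"], capture["length"])
```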

Network Data

Web Graph Datasets

Hyperlink graphs revealing the structure, connectivity, and relationships across the web.
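
As an illustration, the sketch below loads a small edge-list sample into networkx. The file name and the tab-separated two-column layout are assumptions; check the release notes of the specific web graph dataset for its exact format.

```python
import gzip
import networkx as nx

g = nx.DiGraph()

# Hypothetical sample file: one "<from_id>\t<to_id>" edge per line.
with gzip.open("host-graph-edges-sample.txt.gz", "rt") as edges:
    for line in edges:
        src, dst = line.split()[:2]
        g.add_edge(src, dst)

print(g.number_of_nodes(), "hosts,", g.number_of_edges(), "links")
```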

Experimental

Additional Projects

Experimental datasets and specialized extractions from Common Crawl data.

Third-Party

Contributed Datasets

Specialized datasets and filtered corpora derived from Common Crawl by researchers and organizations.

Development

Developer Resources

Reusable code, tools, and resources for working with the data, including libraries, examples, and hosted integrations.
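
As one example of combining the pieces above, the sketch below fetches a single record over HTTPS with a byte-range request, using the filename, offset, and length that an index lookup returns. It assumes the requests and warcio packages; the concrete path and numbers are placeholders.

```python
import io
import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholder values: in practice these come from an index lookup.
filename = "crawl-data/CC-MAIN-2024-10/segments/.../warc/....warc.gz"
offset, length = 1234567, 8910

# Request only the bytes of the one record we want.
end = offset + length - 1
resp = requests.get(
    f"https://data.commoncrawl.org/{filename}",
    headers={"Range": f"bytes={offset}-{end}"},
    timeout=60,
)
resp.raise_for_status()

# The returned bytes form a self-contained gzipped WARC record.
for record in ArchiveIterator(io.BytesIO(resp.content)):
    print(record.rec_headers.get_header("WARC-Target-URI"))
```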

Further Documentation