An open repository of web crawl data, available free of charge to researchers, developers, and innovators worldwide. The corpus comprises petabytes of data collected since 2008, hosted on Amazon S3 under the AWS Open Data Sponsorship Program.
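Because the bucket is public, the data can be read without an AWS account. A minimal sketch using boto3 with unsigned (anonymous) requests; the crawl prefix shown is illustrative and should be replaced with a current crawl ID:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# No credentials are needed; send unsigned (anonymous) requests.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a handful of objects under one monthly crawl prefix
# (the crawl ID below is a placeholder).
resp = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2024-33/",
    MaxKeys=5,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```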
Comprehensive monthly snapshots of the web, containing raw HTML (WARC), computed metadata (WAT), and extracted plain text (WET) from billions of pages.
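Individual files can be processed as a stream. A minimal sketch reading response records from a downloaded WARC file with the warcio library; the local filename is a placeholder:

```python
from warcio.archiveiterator import ArchiveIterator

# Stream records out of a gzipped WARC file without loading it all at once.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # 'response' records hold the raw HTTP response, including the HTML body.
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body), "bytes")
```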
Efficient indexes for locating individual page captures by URL and for querying the entire Common Crawl corpus at scale.
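For example, the URL index can be queried over HTTP, returning one JSON record per matching capture with the WARC filename, byte offset, and length needed to fetch just that page. A minimal sketch with requests; the crawl ID in the endpoint is illustrative:

```python
import json
import requests

# Query one monthly index for captures of a URL pattern
# (replace the crawl ID with a current one).
endpoint = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"
resp = requests.get(
    endpoint,
    params={"url": "commoncrawl.org/*", "output": "json", "limit": 5},
    timeout=30,
)
resp.raise_for_status()

# The server answers with one JSON object per line.
for line in resp.text.splitlines():
    capture = json.loads(line)
    print(capture["timestamp"], capture["url"], capture["filename"])
```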
Hyperlink graphs revealing the structure, connectivity, and relationships across the web.
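A minimal sketch of one use of such a graph: ranking hosts with PageRank via networkx. The official releases ship as separate vertex and edge files with numeric IDs, so this sketch assumes a small, pre-resolved host-to-host edge list; the filename and tab-separated format are placeholders:

```python
import networkx as nx

# Load a directed host-level link graph from a tab-separated edge list
# (one "<source_host>\t<target_host>" pair per line).
g = nx.DiGraph()
with open("host-edges.tsv") as f:
    for line in f:
        src, dst = line.split("\t")
        g.add_edge(src.strip(), dst.strip())

# Rank hosts by PageRank and show the top ten.
ranks = nx.pagerank(g, alpha=0.85)
for host, score in sorted(ranks.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{score:.6f}  {host}")
```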
Experimental datasets and exploratory extractions built from Common Crawl data.
Filtered corpora and other specialized datasets derived from Common Crawl by researchers and organizations.
Reusable code, tools, and resources for working with the data, including libraries, examples, and hosted integrations.
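As one example of such tooling, the cdx_toolkit package wraps the index API shown above; a minimal sketch, assuming its CDXFetcher interface, with an illustrative query pattern:

```python
import cdx_toolkit

# "cc" points the fetcher at the Common Crawl index rather than another archive.
cdx = cdx_toolkit.CDXFetcher(source="cc")

# Iterate over a few captures matching a URL pattern.
for capture in cdx.iter("commoncrawl.org/*", limit=5):
    print(capture["timestamp"], capture["status"], capture["url"])
```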