Common Crawl Foundation
1,439 posts
Common Crawl is a non-profit foundation dedicated to the Open Web.
- Our friends at Webrecorder have announced the launch of GovArchive.us, a dedicated site for exploring their US Government Web Archive on Browsertrix. More details in their blog post: webrecorder.net/blog/2025-03-2…
- February 2025 Crawl Archive Now Available: The data was crawled between February 6th and February 20th, and contains 2.6 billion web pages. Page captures are from 47.6 million hosts or 38.5 million registered domains and include 1 billion new URLs not visited in any of our prior
- MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl by @stevesalevan bit.ly/vCu8uM
- commoncrawl.org/blog/october-2… The data was crawled between October 3rd and October 16th, and contains 2.49 billion web pages. Page captures are from 47.5 million hosts or 38.3 million registered domains and include 1.03 billion new URLs not visited in any of our prior crawls.
- We are happy to announce cc-downloader, an experimental command-line tool for downloading Common Crawl data via https:
- Check out NVIDIA NeMo Curator. This GPU-accelerated data-curation library includes data download, document deduplication, language identification, filtering, and other features often requested by Common Crawl users. Helpful for preparing large-scale, high-quality datasets for

