Common Crawl Foundation (@CommonCrawl) / X

Common Crawl Foundation

1,439 posts

@CommonCrawl

Common Crawl is a non-profit foundation dedicated to the Open Web.

San Francisco, CA

Joined February 2010

Common Crawl Foundation
@CommonCrawl
Dec 21, 2024
Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, and December 2024
From commoncrawl.org
9.7K
Common Crawl Foundation
@CommonCrawl
Mar 31, 2025
Common Crawl - Blog - Introducing Common Crawl AI Agent by ReadyAI
From commoncrawl.org
9K
Common Crawl Foundation
@CommonCrawl
Mar 25, 2025
Our friends at Webrecorder have announced the launch of GovArchive.us, a dedicated site for exploring their US Government Web Archive on Browsertrix. More details in their blog post: webrecorder.net/blog/2025-03-2…
Webrecorder US Government Web Archive
From govarchive.us
3.4K
Common Crawl Foundation
@CommonCrawl
Feb 23, 2025
February 2025 Crawl Archive Now Available. The data was crawled between February 6th and February 20th, and contains 2.6 billion web pages. Page captures are from 47.6 million hosts or 38.5 million registered domains and include 1 billion new URLs not visited in any of our prior
1.7K
Common Crawl Foundation
@CommonCrawl
Sep 29, 2017
Need 3 billion web pages in WARC, WAT, and WET? Here you go! #opendata
Common Crawl - Blog - September 2017 Crawl Archive Now Available
From commoncrawl.org
Common Crawl Foundation
@CommonCrawl
Sep 29, 2017
Check it out! "Common Crawl And Unlocking Web Archives For Research" via @forbes
Common Crawl And Unlocking Web Archives For Research
From forbes.com
Common Crawl Foundation
@CommonCrawl
Dec 16, 2011
MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl by @stevesalevan bit.ly/vCu8uM
Common Crawl Foundation
@CommonCrawl
Oct 20, 2024
commoncrawl.org/blog/october-2… The data was crawled between October 3rd and October 16th, and contains 2.49 billion web pages. Page captures are from 47.5 million hosts or 38.3 million registered domains and include 1.03 billion new URLs, not visited in any of our prior crawls.
Common Crawl - Blog - October 2024 Crawl Archive Now Available
From commoncrawl.org
2.9K
Common Crawl Foundation
@CommonCrawl
Jan 22, 2025
We are happy to announce cc-downloader, an experimental command-line tool for downloading Common Crawl data via https:
Common Crawl - Blog - Introducing cc-downloader
From commoncrawl.org
1.7K
Common Crawl Foundation
@CommonCrawl
Jun 3, 2024
commoncrawl.org/blog/may-2024-… Our 100th crawl!!
Common Crawl - Blog - May 2024 Crawl Archive Now Available
From commoncrawl.org
3.9K
Common Crawl Foundation
@CommonCrawl
Dec 19, 2024
Common Crawl - Blog - December 2024 Crawl Archive Now Available
From commoncrawl.org
1.4K
Common Crawl Foundation
@CommonCrawl
Jan 10, 2012
This is an awesome idea! @stephen_wolfram on a .data TLD bit.ly/w2mwhc HN discussion bit.ly/w2mwhc
Common Crawl Foundation
@CommonCrawl
Nov 18, 2024
Common Crawl - Blog - November 2024 Crawl Archive Now Available
From commoncrawl.org
893
Common Crawl Foundation
@CommonCrawl
Jun 25, 2024
📷 Check out NVIDIA NeMo Curator - This GPU-accelerated data-curation library includes data download, document deduplication, language identification, filtering, and other features often requested by Common Crawl users. Helpful for preparing large-scale, high-quality datasets for
2.5K