TL;DR: copy the following to your robots.txt (what's robots.txt?):
## AI opt-out rules (see aioptout.dev)

# Amazon
User-agent: Amazonbot
Disallow: /

# Anthropic
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /

# Apple
User-agent: Applebot-Extended
Disallow: /

# ByteDance
User-agent: Bytespider
Disallow: /

# Common Crawl
User-agent: CCBot
Disallow: /

# Facebook
User-agent: FacebookBot
Disallow: /

# Google
User-agent: Google-Extended
Disallow: /

# OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /

# Webz.io
User-agent: omgilibot
Disallow: /
User-agent: omgili
Disallow: /
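If you want to sanity-check the file before deploying it, Python's standard urllib.robotparser can evaluate it locally. A minimal sketch (the file path and the Googlebot control are illustrative):

from urllib.robotparser import RobotFileParser

# Parse the local robots.txt rather than fetching it over HTTP.
rp = RobotFileParser()
with open("robots.txt") as f:
    rp.parse(f.read().splitlines())

# Every AI crawler should print False; Googlebot, which the rules
# above don't mention, should still print True.
for agent in ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Googlebot"):
    print(agent, rp.can_fetch(agent, "/"))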
aioptout.dev is a collection of known user agents that belong to AI scrapers.
You can use the snippet above, or use the TOML or JSON datasets to generate your robots.txt dynamically.
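For example, the robots.txt rules could be regenerated from the JSON dataset on each deploy. A minimal sketch, assuming a dataset shaped like [{"name": ..., "user_agents": [...]}] served at a stable URL; both the URL and the field names below are assumptions, so check aioptout.dev for the actual schema:

import json
import urllib.request

# Hypothetical dataset location -- see aioptout.dev for the real one.
DATASET_URL = "https://aioptout.dev/robots.json"

def build_robots_txt(companies):
    """Render robots.txt rules that disallow every listed user agent."""
    lines = ["## AI opt-out rules (see aioptout.dev)"]
    for company in companies:
        lines.append("")
        lines.append(f"# {company['name']}")
        for agent in company["user_agents"]:
            lines.append(f"User-agent: {agent}")
            lines.append("Disallow: /")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    with urllib.request.urlopen(DATASET_URL) as resp:
        print(build_robots_txt(json.load(resp)))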
Corrections or missing scrapers? Please create an issue or send a pull request on GitHub.
Microsoft is active in the LLM space, but doesn't provide any way to opt out.
The Verge: Microsoft’s AI boss thinks it’s perfectly okay to steal content if it’s on the open web
Amazonbot, no canonical reference yet
While not directly confirmed in the documentation, there are reports that Amazonbot significantly increased its activity around the time everyone started training LLMs. Example one, example two.
ClaudeBot, reference
anthropic-ai, no canonical reference yet
Not directly referenced in Anthropic's documentation.
Claude-Web, no canonical reference yet
Not directly referenced in Anthropic's documentation.
Applebot-Extended, reference
Applebot-Extended doesn't crawl pages directly; instead, its presence in robots.txt is used as a signal to exclude data collected by Applebot from LLM training.
Bytespider, no canonical reference yet
While not directly confirmed, ByteDance seems to use this bot to scrape data for its LLMs. There are reports of Bytespider ignoring robots.txt; for crawlers like that, see the server-level blocking sketch after this list.
CCBot, reference
Common Crawl is not a company per se, but its web crawls are used to train LLMs.
FacebookBot, reference
This is the user agent Facebook uses for LLM training. Link previews are made with a different user agent.
Google-Extended, reference
As of 2024-07-03, "Google-Extended does not impact a site's inclusion or ranking in Google Search."
GPTBot, reference
ChatGPT-User, reference
Crawls on behalf of a ChatGPT user.
As of 2024-07-03, OpenAI claims that having either GPTBot or ChatGPT-User in robots.txt disables OpenAI crawling entirely.
omgilibot, reference
This bot belongs to a company that sells scraped web data to other companies, where it is used for LLM training.
omgili, reference
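robots.txt is purely advisory, so a crawler that ignores it, as Bytespider reportedly does, has to be refused at the server instead. A minimal WSGI middleware sketch for Python web apps; the class name and agent list are illustrative, and an nginx or CDN rule would do the same job:

BLOCKED_AGENTS = ("Bytespider", "GPTBot", "ClaudeBot", "CCBot")

class BlockAIScrapers:
    """Reject any request whose User-Agent mentions a blocked crawler."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(agent in ua for agent in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)

# Usage: wrap your existing WSGI app, e.g. app = BlockAIScrapers(app).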
We shouldn't be chasing user agent strings across the internet.
The open web is underpinned by an unspoken contract: creators pour their hearts into the content they publish, search engines scrape it to build results they can mix with ads, and in return authors get readers on their websites.
LLM-feeding scrapers renege on this contract. Authors publish into the void while their words get scraped and incorporated into LLM weights during training. There will be no attribution, no exposure, no readers coming back to them, just faceless chatbot responses rehashing their insights. To the companies scraping their work, it's just more training material.
AI scraping should be opt-in.