TL;DR: copy the following to your robots.txt (what's robots.txt?):
## AI opt-out rules (see aioptout.dev)

# Amazon
User-agent: Amazonbot
Disallow: /

# Anthropic
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /

# Apple
User-agent: Applebot-Extended
Disallow: /

# ByteDance
User-agent: Bytespider
Disallow: /

# Common Crawl
User-agent: CCBot
Disallow: /

# Facebook
User-agent: FacebookBot
Disallow: /

# Google
User-agent: Google-Extended
Disallow: /

# OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /

# Webz.io
User-agent: omgilibot
Disallow: /
User-agent: omgili
Disallow: /
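If you want to sanity-check the file before deploying it, Python's standard urllib.robotparser can evaluate it locally. A minimal sketch (the file path and the Googlebot control are illustrative):

from urllib.robotparser import RobotFileParser

# Parse the local robots.txt rather than fetching it over HTTP.
rp = RobotFileParser()
with open("robots.txt") as f:
    rp.parse(f.read().splitlines())

# Every AI crawler should print False; Googlebot, which the rules
# above don't mention, should still print True.
for agent in ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Googlebot"):
    print(agent, rp.can_fetch(agent, "/"))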
aioptout.dev is a collection of known user agents that belong to AI scrapers.
You can use the snippet above, or use the TOML or JSON datasets to generate your robots.txt dynamically.
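For example, the robots.txt rules could be regenerated from the JSON dataset on each deploy. A minimal sketch, assuming a dataset shaped like [{"name": ..., "user_agents": [...]}] served at a stable URL; both the URL and the field names below are assumptions, so check aioptout.dev for the actual schema:

import json
import urllib.request

# Hypothetical dataset location -- see aioptout.dev for the real one.
DATASET_URL = "https://aioptout.dev/robots.json"

def build_robots_txt(companies):
    """Render robots.txt rules that disallow every listed user agent."""
    lines = ["## AI opt-out rules (see aioptout.dev)"]
    for company in companies:
        lines.append("")
        lines.append(f"# {company['name']}")
        for agent in company["user_agents"]:
            lines.append(f"User-agent: {agent}")
            lines.append("Disallow: /")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    with urllib.request.urlopen(DATASET_URL) as resp:
        print(build_robots_txt(json.load(resp)))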
Corrections or missing scrapers? Please create an issue or send a pull request on GitHub.
Microsoft is active in the LLM space, but doesn't provide any way to opt out.
The Verge: Microsoft’s AI boss thinks it’s perfectly okay to steal content if it’s on the open web
Amazonbot, no canonical reference yet
While not directly confirmed in the documentation, there are reports that Amazonbot significantly increased its activity around the time everyone started training LLMs. Example one, example two.
ClaudeBot, reference
anthropic-ai, no canonical reference yet
Not directly referenced in Anthropic's documentation.
Claude-Web, no canonical reference yet
Not directly referenced in Anthropic's documentation.
Applebot-Extended, reference
Applebot-Extended doesn't crawl pages directly; instead, its presence in robots.txt is used as a signal to exclude data collected by Applebot from LLM training.
Bytespider, no canonical reference yet
While not directly confirmed, ByteDance seems to use this bot to scrape data for its LLMs. There are reports of Bytespider ignoring robots.txt; for crawlers like that, see the server-level blocking sketch after this list.
CCBot, reference
Common Crawl is not a company per se, but its web crawls are used to train LLMs.
FacebookBot, reference
This is the user agent Facebook uses for LLM training. Link previews are made with a different user agent.
Google-Extended, reference
As of 2024-07-03, "Google-Extended does not impact a site's inclusion or ranking in Google Search."
GPTBot, reference
ChatGPT-User, reference
Crawls on behalf of a ChatGPT user.
As of 2024-07-03, OpenAI claims that having either GPTBot or ChatGPT-User in robots.txt disables OpenAI crawling entirely.
omgilibot, reference
This bot belongs to a company that sells scraped web data to other companies, where it is used for LLM training.
omgili, reference
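robots.txt is purely advisory, so a crawler that ignores it, as Bytespider reportedly does, has to be refused at the server instead. A minimal WSGI middleware sketch for Python web apps; the class name and agent list are illustrative, and an nginx or CDN rule would do the same job:

BLOCKED_AGENTS = ("Bytespider", "GPTBot", "ClaudeBot", "CCBot")

class BlockAIScrapers:
    """Reject any request whose User-Agent mentions a blocked crawler."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(agent in ua for agent in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)

# Usage: wrap your existing WSGI app, e.g. app = BlockAIScrapers(app).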
We shouldn't be chasing user agent strings across the internet.
The open web is underpinned by an unspoken contract: creators pour their hearts into the content they publish, search engines scrape it to build results they can mix with ads, and in return authors get readers on their websites.
LLM-feeding scrapers renege on this contract. Authors publish into the void while their words get scraped and incorporated into LLM weights during training. There will be no attribution, no exposure, no readers coming back to them, just faceless chatbot responses rehashing their insights. To the companies scraping their work, it's just more training material.
AI scraping should be opt-in.