bot-proofing your website

The Internet has been made the playground of giant tech corporations like Google, who divide their time between killing people and making products that don't work, and who are now desperately cannibalising whatever they have that once did work. In their ongoing struggle to keep the wave from breaking, these technologists have thrown themselves into developing Large Language Models (LLMs), computer programmes that steal terabytes of human-written text and remix it to mimic human language, and Text-to-Image Models (TIMs), programmes that similarly steal terabytes of human-made artwork and remix it to match text prompts.

[image: a red circle with a blue toy robot in it, crossed by a big red stripe - a no-bot zone]

Now, let's be clear that LLMs, disingenuously hyped as "artificial intelligence", can only thoughtlessly mimic human language; they do not reason or calculate, and are adept only at producing fluently articulated, grammatically correct nonsense. The TIMs spew out chimeric artwork that is likewise devoid of meaning, appealing only on a lazy, surface-level aesthetic.

Despite this, the bots are currently pulling text and artwork from all over the Web - no one gets a say in whether or not their work gets taken. Blogs, fansites, news sites, everything you can think of: it's all being harvested for the LLMs and the TIMs. How the bots do this is not transparent, and the rules they apply are unknown. The big corporations are the main culprits here, but there are also many smaller initiatives full of unscrupulous people who are happy to steal the work of writers and artists against their explicit wishes, in order to accelerate what they'll euphemistically call the "democratisation of skills and art".

stopgap solutions

All we can do for now is pick between two options: we either remove ourselves and our work from the Web, which would mean ceding our home ground to the corporations - or we use the rules of this place and hope the bots still respect them. I have opted for the latter, and to that end I use five things:

  1. a file called robots.txt, a Web convention that allows us to selectively block crawlers; see the annotated sketch after this list. Be sure to test your robots.txt with a validator to make sure you're not over- or underblocking (or just copy my robots.txt)!
  2. an extension to robots.txt, DisallowAITraining, which lets us forbid AI training on our work, whatever the source. This extension is in the draft stage and awaits adoption.
  3. another extension to robots.txt, Content-Usage, which serves the same purpose. It too is in the draft stage, awaiting adoption.
  4. a custom file called ai.txt, the result of a new initiative to block these specific bots - it also awaits broad adoption.
  5. a meta tag in the page head, <meta name="robots" content="noai, noimageai">, which stops some bots; this again awaits broad adoption.
None of these precautions are foolproof; they rely on the bots abiding by the rules and conventions of the Internet, which they often don't, because they are built by scum with no integrity - but it's our best option for now.
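
To make the first three of these concrete, here is a minimal annotated sketch of how they sit together in a single robots.txt file. GPTBot stands in for any one crawler you want to block; the last two lines follow the draft proposals described above, so their exact syntax may still change before adoption, and their placement mirrors my full file below.

# block one known AI crawler from the whole site
User-agent: GPTBot
Disallow: /

# draft extensions: opt this content out of AI training, whatever the crawler
DisallowAITraining: /
Content-Usage: ai=n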

true solutions

This conundrum highlights the sad fact that, for the discoverability of our websites, we webmasters depend upon malignant corporations like Google, run by baby-brained tech evangelists who fuck up everything they touch. For decades we've opted to write websites that are legible to the robots so they can catalogue and rank us, while their search engines have become universally and intentionally worse over time; it has become clear that we need to build our own networks. Happily, such networks have existed for a long time; sadly, they've fallen out of favour. They need to be reinvigorated, which is why baccyflap.com is part of several webrings and collectives.

In due time, the corporations will abandon LLMs and TIMs for the next shiny bauble, and perhaps the search engines will get better, though they probably won't; whatever the case, we must not sit around waiting for things to magically improve. It's time to abandon the corporations. We can't keep counting on them to build good networks for us - we have to do it ourselves. As such, I decided years ago that if people can't find this site through search engines, that's too bad, but I won't waste any time thinking about it. It's webrings and linkbacks all the way to a brighter future, baby!

human-made content

Another thing we can do is never use this technology. While it's mystifying to me that this stuff appeals to anyone, I see plenty of people use it every day, and all I can say is: don't. Are you really fighting this tide of shit if you occasionally swim laps in it? Avoid it like the plague, curse its name. To prove how strongly I feel about this I've started the no ai webring, which contains only sites that use no AI, no way, no how. See you there!


baccyflap.com's robots.txt currently looks like this; it was last updated on the 22nd of September 2025 and blocks 310 bots. Please feel free to copy it entirely.
User-agent: 2^32$
User-agent: AddSearchBot
User-agent: AdsBot-Google
User-agent: Agentic
User-agent: AhrefsBot
User-agent: .ai
User-agent: AI21 Labs
User-agent: AI2Bot
User-agent: Ai2Bot-Dolma
User-agent: AI2Bot-Dolma
User-agent: AI Article Writer
User-agent: AIBot
User-agent: AI Content Detector
User-agent: AI Dungeon
User-agent: aiHitBot
User-agent: AIMatrix
User-agent: AISearchBot
User-agent: AI Search Engine
User-agent: AI SEO Crawler
User-agent: AI Training
User-agent: AITraining
User-agent: AI Writer
User-agent: Alexa
User-agent: Alpha AI
User-agent: AlphaAI
User-agent: a[mazing]{42}(robot)
User-agent: Amazon Bedrock
User-agent: Amazonbot
User-agent: AmazonBot
User-agent: Amazon Comprehend
User-agent: Amazon-Kendra
User-agent: Amazon Lex
User-agent: Amazon Sagemaker
User-agent: Amazon Silk
User-agent: Amazon Textract
User-agent: Amelia
User-agent: AndersPinkBot
User-agent: Andibot
User-agent: Anthropic
User-agent: anthropic-ai
User-agent: AnyPicker
User-agent: Anyword
User-agent: Applebot
User-agent: Applebot-Extended
User-agent: Aria Browse
User-agent: Articoolo
User-agent: Automated Writer
User-agent: Awario
User-agent: AwarioBot
User-agent: AwarioRssBot
User-agent: AwarioSmartBot
User-agent: Azure
User-agent: BardBot
User-agent: bedrockbot
User-agent: bigsur.ai
User-agent: BLEXBot
User-agent: Brave Leo
User-agent: Brightbot 1.0
User-agent: ByteDance
User-agent: Bytespider
User-agent: CatBoost
User-agent: CCBot
User-agent: CC-Crawler
User-agent: ChatGLM
User-agent: ChatGPT Agent
User-agent: ChatGPT-User
User-agent: ChatGPT-User/2.0
User-agent: Chinchilla
User-agent: Claude
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: claude-web
User-agent: Claude-Web
User-agent: ClearScope
User-agent: CloudVertexBot
User-agent: Cohere
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
User-agent: Common Crawl
User-agent: CommonCrawl
User-agent: ContentAtScale
User-agent: ContentBot
User-agent: Contentedge
User-agent: Content Harmony
User-agent: Content King
User-agent: Content Optimizer
User-agent: Content Samurai
User-agent: Conversion AI
User-agent: Copilot
User-agent: CopyAI
User-agent: Copymatic
User-agent: Copyscape
User-agent: Cotoyogi
User-agent: crawler.with.dots
User-agent: CrawlQ AI
User-agent: Crawlspace
User-agent: Crew AI
User-agent: CrewAI
User-agent: curl|sudo bash
User-agent: DALL-E
User-agent: DataForSeoBot
User-agent: DataProvider
User-agent: Datenbank Crawler
User-agent: DeepAI
User-agent: DeepL
User-agent: DeepMind
User-agent: DeepSeek
User-agent: Devin
User-agent: diffbot
User-agent: Diffbot
User-agent: Doubao AI
User-agent: DuckAssistBot
User-agent: Echobot Bot
User-agent: EchoboxBot
User-agent: Facebookbot
User-agent: FacebookBot
User-agent: facebookexternalhit
User-agent: FacebookExternalHit
User-agent: Factset_spyderbot
User-agent: Falcon
User-agent: Firecrawl
User-agent: FirecrawlAgent
User-agent: Flyriver
User-agent: Frase AI
User-agent: FriendlyCrawler
User-agent: Fuzz Faster U Fool
User-agent: Fuzz Faster U Fool v2.0.0
User-agent: Gemini
User-agent: Gemini-Deep-Research
User-agent: Gemma
User-agent: GenAI
User-agent: Genspark
User-agent: Gigabot
User-agent: GLM
User-agent: GoogleAgent-Mariner
User-agent: Google-CloudVertexBot
User-agent: Google-Extended
User-agent: Google-Firebase
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
User-agent: Goose
User-agent: GPT
User-agent: GPTBot
User-agent: Grammarly
User-agent: Grendizer
User-agent: Grok
User-agent: GT Bot
User-agent: GTBot
User-agent: Hemingway Editor
User-agent: Hugging Face
User-agent: Hypotenuse AI
User-agent: iaskspider
User-agent: iaskspider/2.0
User-agent: ICC-Crawler
User-agent: ImageGen
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: imgproxy
User-agent: Inferkit
User-agent: INK Editor
User-agent: INKforall
User-agent: IntelliSeek
User-agent: ISSCyberRiskCrawler
User-agent: Is this a crawler?
User-agent: JasperAI
User-agent: Kafkai
User-agent: Kangaroo
User-agent: Kangaroo Bot
User-agent: Keyword Density AI
User-agent: Knowledge
User-agent: KomoBot
User-agent: LinerBot
User-agent: LinkedInBot
User-agent: LLaMA
User-agent: LLMs
User-agent: magpie-crawler
User-agent: MarketMuse
User-agent: Meltwater
User-agent: Meta AI
User-agent: Meta-AI
User-agent: MetaAI
User-agent: Meta-External
User-agent: meta-externalagent
User-agent: Meta-ExternalAgent
User-agent: meta-externalfetcher
User-agent: Meta-ExternalFetcher
User-agent: MetaTagBot
User-agent: meta-webindexer
User-agent: Mistral
User-agent: MistralAI-User
User-agent: MistralAI-User/1.0
User-agent: MJ12bot
User-agent: MyCentralAIScraperBot
User-agent: Narrative
User-agent: NeevaBot
User-agent: netEstate Imprint Crawler
User-agent: NeuralSEO
User-agent: Neural Text
User-agent: Nova Act
User-agent: NovaAct
User-agent: Nutch
User-agent: OAI-SearchBot
User-agent: omgili
User-agent: Omgili
User-agent: omgilibot
User-agent: Omgilibot
User-agent: OmniExplorer_Bot
User-agent: Open AI
User-agent: OpenAI
User-agent: OpenBot
User-agent: OpenText AI
User-agent: Operator
User-agent: Outwrite
User-agent: Page Analyzer AI
User-agent: PanguBot
User-agent: Panscient
User-agent: panscient.com
User-agent: Paperlibot
User-agent: Paraphraser.io
User-agent: peer39_crawler
User-agent: Perplexity
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Petalbot
User-agent: PetalBot
User-agent: Phindbot
User-agent: PhindBot
User-agent: PiplBot
User-agent: Poseidon Research Crawler
User-agent: prefetch-proxy
User-agent: ProWritingAid
User-agent: psbot
User-agent: python-requests
User-agent: QualifiedBot
User-agent: QuillBot
User-agent: quillbot.com
User-agent: RobotSpider
User-agent: Robozilla
User-agent: Rytr
User-agent: SaplingAI
User-agent: SBIntuitionsBot
User-agent: Scalenut
User-agent: Scraper
User-agent: Scrapy
User-agent: ScriptBook
User-agent: Seekr
User-agent: SemrushBot-OCOB
User-agent: SemrushBot-SWA
User-agent: sentibot
User-agent: Sentibot
User-agent: SentiBot
User-agent: SEO Content Machine
User-agent: SEO Robot
User-agent: ShapBot
User-agent: Sidetrade
User-agent: Sidetrade indexer bot
User-agent: Simplified AI
User-agent: Sitefinity
User-agent: Skydancer
User-agent: SlickWrite
User-agent: Sonic
User-agent: Spinbot
User-agent: Spin Rewriter
User-agent: Stability
User-agent: StableDiffusionBot
User-agent: star***crawler
User-agent: Sudowrite
User-agent: SummalyBot
User-agent: Super Agent
User-agent: Surfer AI
User-agent: Teoma
User-agent: TerraCotta
User-agent: Text Blaze
User-agent: TextCortex
User-agent: The Knowledge AI
User-agent: Thinkbot
User-agent: ThinkChaos
User-agent: TikTokSpider
User-agent: Timpibot
User-agent: TimpiBot
User-agent: TurnitinBot
User-agent: VelenPublicWebCrawler
User-agent: Vidnami AI
User-agent: WARDBot
User-agent: Webzio
User-agent: webzio-extended
User-agent: Webzio-Extended
User-agent: Whisper
User-agent: WordAI
User-agent: Wordtune
User-agent: WormsGTP
User-agent: wpbot
User-agent: WPBot
User-agent: Writecream
User-agent: WriterZen
User-agent: Writescope
User-agent: Writesonic
User-agent: xAI
User-agent: xBot
User-agent: YaK
User-agent: YandexAdditional
User-agent: YandexAdditionalBot
User-agent: Youbot
User-agent: YouBot
User-agent: Zerochat
User-agent: Zero GTP
User-agent: Zhipu
User-agent: Zimm
Disallow: /
Disallow: *
DisallowAITraining: /
Content-Usage: ai=n
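
If you want to check that your copy of this file actually blocks what you think it blocks, you can also test it programmatically rather than through a web validator. Here is a minimal sketch using Python's standard-library robots.txt parser; note that urllib.robotparser only understands the core User-agent/Allow/Disallow rules, so it silently ignores the draft DisallowAITraining and Content-Usage lines, and FriendlyHumanBot is a made-up agent name used purely for illustration.

from urllib.robotparser import RobotFileParser

# fetch and parse the live robots.txt
rp = RobotFileParser()
rp.set_url("https://baccyflap.com/robots.txt")
rp.read()

# a listed AI crawler should be refused; an unlisted agent is still allowed,
# because this file deliberately has no catch-all "User-agent: *" group
print(rp.can_fetch("GPTBot", "https://baccyflap.com/"))           # expect: False
print(rp.can_fetch("FriendlyHumanBot", "https://baccyflap.com/")) # expect: True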

baccyflap.com's ai.txt currently looks like this. Same as before: copy it to your heart's content.
User-Agent: *
Disallow: /
Disallow: *