
Webspace Invaders

A couple of weeks back, I’m sitting at my desk when a direct message from my frontend friend Kevin Powell pops up. Kevin’s a genuinely kind guy. He makes CSS videos on YouTube and he’s got this way of explaining things that never makes you feel stupid for not knowing them already. He’s one of those folks who still has faith in the Web.

Hey, hope you’re doing well!

I was going to link to something of yours for an article I’m writing for Piccalilli, but I keep getting a “This site can’t be reached” error in Chrome. I tried in Firefox, and it’s saying it can’t establish a secure connection... Andy just left a comment saying it’s working for him, so it might be a regional thing? I have no idea.

Wait. My site is down?

I switch into the browser and load the site. It’s fine. I hit refresh. Still fine. I pick up my phone and load the site there. Also fine. I feel a little lift of relief, followed immediately by confusion: If my site is down for Kevin but not for me, something’s definitely not right. Maybe it really is a regional thing?

So, I do what any legitimate web person does when confronted with a problem: I throw technology at it. VPN time! I connect through Canada, because that’s where Kevin is based, and sure enough – my website doesn’t load. And I get the same results checking availability through check-host.net from a bunch of other locations: Brazil, Czechia, Hong Kong, India, Japan, and the Netherlands all fail to load the site, for instance.

After a few emails back and forth with my shared hosting provider, all-inkl.com, the issue is identified: they’ve indeed installed a filter that blocks entire countries from accessing my site. Without telling me. They just decided on their own that the best solution to whatever was happening was geographic surgery. “Your website is being accessed permanently and en masse. This is presumably a DDoS attack,” they wrote. “We have installed a filter that only allows access from certain countries so that we can protect the server from overload.”

An Alien Invasion

A DDoS attack? A botnet trying to hammer my server into submission? Really? I look at the access logs. No, this is something different. This is methodical. Millions (!) of requests over a couple of days from things with User-Agent strings reading GPTBot, OAI-Searchbot, Claude-SearchBot, or Meta-ExternalAgent, plus a whole bunch of IP addresses from Singapore, Shenzhen, and other parts of Asia scanning my articles and notes sections. That’s not a botnet DDoS attack. That’s an LLM bot “attack” by OpenAI, Meta, Anthropic, and many others.

In their hunger for data to train their large language models, companies from all over the world are systematically harvesting every word I’ve ever published, feeding it into their language models to keep them fresh – and the side effect, the collateral damage, is that Kevin in Montreal now can’t read my articles because my hosting provider decided the solution was to block Canada and half the rest of the world.

I sat there staring at those logs for a while. The irony wasn’t lost on me. This is my little corner of the web. My writing. With my weird little style mixer up there in the top right. And now it is simultaneously being strip-mined by AI companies and effectively made inaccessible to actual humans around the world who might want to read it.

This is where we are in 2026. There’s something happening on the Web at the moment that almost feels like watching that old arcade game Space Invaders play out across our servers. Bots and scrapers marching in formation, attacking our servers wave after wave, systematically requesting page after page, relentlessly filling their data stores while we watch our access logs fill up.

The webspace invaders have arrived.

And the numbers are staggering. About 50 percent of web traffic is now coming from non-humans. Bots, crawlers, and agents are constantly moving across our servers in search of new pieces of content to absorb, and while search engine crawlers are still the highest-traffic bot category, AI/LLM bot traffic is constantly surging, with AI training being the main crawl purpose. According to Cloudflare’s 2025 Year in Review report, AI bots originated 4.2 % of HTML requests. But this number excludes Googlebot, which crawls for both search indexing and AI training, and averaged 4.5 % of requests. Yes, according to Cloudflare, Google’s crawl volume is still bigger than that of all other AI crawlers combined. And in some months of 2025, the numbers were even more extreme. Take April 26, 2025, for example: on that day, human traffic only accounted for 34.5 % of HTTP requests worldwide. AI bots came in at 4.1 %. A number that is already high on its own but is dwarfed by Googlebot with a whopping 11 % of HTTP requests. I repeat: 11 % of HTTP requests worldwide coming from Googlebot alone. The rest of the “Non-AI” bots averaged 50.1 % of requests that day. And this number might still include a fair amount of LLM training bots that hide behind fake User-Agent strings.

The Open Web Subsidizing Big AI?

Here’s the thing about running a personal site on modest hosting, the kind of hosting most of us are on: you notice when things change. The big platforms, they can absorb all this additional traffic without blinking. They have CDNs and server farms and maybe even entire teams of people thinking about handling bot traffic.

But me? I’m running a simple setup. A Craft CMS site on shared hosting, some static files here and there, nothing fancy. Just like the majority of people with a personal site or a blog. And suddenly, I’m getting hammered with enough traffic that my hosting provider is implementing emergency measures. And I’m not the only one. Just recently, Remy not only noticed unusual spikes of traffic on his blog, but his project JS Bin was hit so hard by massive spikes in network inbound traffic that he was struggling to keep the site online. As Jeremy mentioned, the Wikimedia Foundation is also seeing unprecedented amounts of traffic generated by scraper bots. While humans tend to browse with intention, those AI crawlers “binge read” indiscriminately and also visit the less popular pages. 65 % of Wikipedia’s most resource-consuming traffic is now coming from bots.

And when I asked on Mastodon and Bluesky, plenty of people chimed in. Turns out I’m really not alone. People are seeing the same pattern on their own sites – often serious enough that they’ve had to take action. There’s even a classic tell at the moment: waves of “users” visiting from Singapore with suspiciously old Chrome User-Agent strings.

There’s a power imbalance at work here that’s hard to ignore. Large “AI” companies, the ones with billions in venture capital, send their bots to harvest free content. Not only from big publishers or Wikipedia, but from small, independent websites, too. But we, the people running these sites – often as passion projects, as ways to freely share what we’ve learned, as digital gardens we tend in our spare time – we’re the ones paying for the bandwidth and server resources to handle all those additional requests while those companies profit from the training data they extract. It’s an asymmetric battle: small systems absorbing the demands generated at an entirely different, industrial scale.

These companies might not be plotting evil, but they’re externalizing their costs and aren’t particularly concerned about the collateral damage. Sure, some still provide robots.txt compliance. They’ve published documentation about their crawlers, albeit often incomplete. But even compliance and documentation won’t change the fundamental math: when you have dozens – no, hundreds, soon thousands – of companies (just look at the download counts on Hugging Face for any popular model), each running their own scrapers, each one binge-reading your site not once every two weeks like earlier versions of Googlebot used to, but several times a day, the aggregate effect can quickly overwhelm a small site. And as long as the training data keeps flowing and the models keep improving, why would they change anything?

Good Bot, Bad Bot

Alright, it’s bad, obviously. But things get even messier. In Space Invaders, the aliens followed predictable patterns. You could learn their behavior, anticipate their movements, and develop strategies to play the game longer and score higher. The game we’re playing on the Web now? It’s unpredictable.

Because on the Web, we’re now playing against adaptable opponents. Scrapers from Singapore that hide behind spoofed User-Agent strings in an attempt to appear as though they were real browsers instead of identifying themselves honestly. Scrapers that change their IP addresses regularly. Scrapers that obviously don’t give a penny about your cute robots.txt. Scrapers that obviously don’t care about the gentleman’s agreement that the web was built upon.

And we’re just entering the next phase. Agentic AI. Systems that can autonomously explore websites, adapt their scraping strategies based on what defenses they encounter, rotate IP addresses, work around rate limiting. The bots are getting smarter by the minute. More persistent. More capable of treating every attempt to protect our sites as a new puzzle to solve.

This isn’t Space Invaders anymore. Those aliens invading our webspaces are locking on and firing relentlessly, constantly adapting, learning, and getting stronger. And here I am – one human with a keyboard – trying to keep my site accessible to other real humans while it is being harvested and under fire from all directions.

And to cap it all, all of this is happening against a backdrop of genuine geopolitical competition where training data for large language models is being seen as strategic infrastructure. The countries and companies that control the best AI models are said to have real future advantages – economically, militarily, politically. This transforms web scraping from an annoyance into something that’s part of a much larger power struggle. Chinese AI companies might not be that concerned about conventions around robots.txt or polite crawling. But honestly, American AI companies are under pressure to maintain their lead, and when prioritizing growth over everyone else in your path is the business model, they’re perfectly willing to cut corners, too. Everyone’s racing, and the open web looks like free resources up for grabs. The result is that my personal blog – your personal blog – becomes collateral damage in this competition. The scrapers don’t care that you’re not Facebook or the New York Times. Your content is data. Data is valuable. And in this race, the externality of overwhelming small site owners is just the cost of doing business.

“Just Use Cloudflare”

So naturally, there’s a corporate solution being offered. Cloudflare, with its massive network infrastructure, provides (free) AI bot protection now. You add some DNS entries, flip a switch in the Cloudflare dashboard, and let their systems handle the AI crawler traffic for you.

And from what I hear, it works fairly well. Cloudflare can identify and block malicious scrapers at a scale that individual site owners simply cannot. They can distinguish between real browsers and bots masquerading as browsers. They have the data, the machine learning models, the infrastructure, and the threat intelligence.

But I keep coming back to this question: should we all need to route our traffic through Cloudflare? Is that really the sustainable solution to the webspace invaders problem? Is that really the open, independent web we want?

Because here’s the thing: Cloudflare already sees an enormous percentage of web traffic. They’re a neutral actor now. But power structures change. Companies change. Countries and their governments change. Obviously. Every time we centralize infrastructure in response to a threat, we make the web a little less distributed, a little less resilient, and a little less ours. And we all saw last year what happens when half of the web relies on just one single point of failure. Within just one month, Cloudflare experienced two major outages: a global outage on November 18, 2025, took down roughly one in five webpages globally at the height of the incident, and one-third of the world’s 10,000 most popular websites, apps, and services, according to some estimates. Another significant incident occurred on December 5 while Cloudflare attempted to detect and mitigate an industry-wide vulnerability in React Server Components.

I sat with all this for a while, trying to figure out how I felt about it and what to do with it. On one hand, I don’t have the time or expertise to run sophisticated bot detection on my own. I have a day job. I have a life and a family. The idea of spending weekends maintaining IP blocklists and analyzing traffic patterns is exhausting just to think about.

On the other hand, every small site owner routing through Cloudflare means more consolidation, more centralization, more of the web’s infrastructure dependent on a single company’s good intentions and continued solvency.

I don’t have a good answer here. I really don’t. It’s one of those problems where all the solutions feel inadequate, somehow.

Still, I decided to give it one last try. And so, I spent the last few weeks moving my site from my shared hosting plan over to a virtual private server (VPS) where I have more control over the protection layers of my site and where nobody can add a country filter without my knowledge. I’ll write more about other aspects of the setup in future posts.

Activating Your Site Shields

But for now, let’s look at a few practical ways to fight back against those webspace invaders. Because although the bots and agents are getting smarter by the minute, there is still a lot we can do to better protect our sites and hosting bills.

You, Robot?

The first step (still) is to start with robots.txt, even though this won’t stop them all. Create a robots.txt file in your web root if you don’t have one yet, and then explicitly disallow known AI scrapers of your choice:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Bytespider
Disallow: /

If you want to go further, you can block a whole range of known LLM bots with a list like the one at ai.robots.txt, or even serve a robots.txt that continuously updates via Known Agents (formerly known as Dark Visitors – interesting name change, by the way). It’s not a complete solution, but it’s a starting point that takes a few minutes to set up.
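If you want to automate that first step, one low-effort option is a small cron job that regularly pulls the community-maintained list from the ai.robots.txt repository and writes it to your web root. A minimal sketch, assuming the list is still published at that raw GitHub path and that /var/www/example.com is your (hypothetical) web root – and if you have rules of your own, like a Sitemap line or disallowed paths, you’d want to merge them in rather than overwrite the file:

# Hypothetical crontab entry: refresh robots.txt from the ai.robots.txt project every Monday at 04:00
0 4 * * 1 curl -fsSL https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt -o /var/www/example.com/robots.txt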

However, using robots.txt will only work for the actors that actually honour this convention. If a crawler decides to ignore your robots.txt, nothing will stop it. That’s why a lot of people, just like Ethan, have also started blocking bots via .htaccess.

For my new VPS setup, I now use nginx to serve my site. And so, instead of using .htaccess, which only exists on Apache, I added a user agent check to the nginx config, similar to the one you can find in the ai.robots.txt repo:

# Flag requests whose User-Agent matches one of the known AI/LLM crawlers below
set $block 0;

if ($http_user_agent ~* "(AddSearchBot|AI2Bot|AI2Bot\-DeepResearchEval|Ai2Bot\-Dolma|aiHitBot|amazon\-kendra|Amazonbot|AmazonBuyForMe|Andibot|Anomura|anthropic\-ai|Applebot|Applebot\-Extended|atlassian\-bot|Awario|bedrockbot|bigsur\.ai|Bravebot|Brightbot\ 1\.0|BuddyBot|Bytespider|CCBot|Channel3Bot|ChatGLM\-Spider|ChatGPT\ Agent|ChatGPT\-User|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|Cloudflare\-AutoRAG|CloudVertexBot|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawl4AI|Crawlspace|Datenbank\ Crawler|DeepSeekBot|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Gemini\-Deep\-Research|Google\-CloudVertexBot|Google\-Extended|Google\-Firebase|Google\-NotebookLM|GoogleAgent\-Mariner|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iAskBot|iaskspider|iaskspider/2\.0|IbouBot|ICC\-Crawler|ImagesiftBot|imageSpider|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|KlaviyoAIBot|KunatoCrawler|laion\-huggingface\-processor|LAIONDownloader|LCC|LinerBot|Linguee\ Bot|LinkupBot|Manus\-User|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|meta\-webindexer|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|netEstate\ Imprint\ Crawler|NotebookLM|NovaAct|OAI\-SearchBot|omgili|omgilibot|OpenAI|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poggio\-Citations|Poseidon\ Research\ Crawler|QualifiedBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|ShapBot|Sidetrade\ indexer\ bot|Spider|TavilyBot|TerraCotta|Thinkbot|TikTokSpider|Timpibot|TwinAgent|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|webzio\-extended|wpbot|WRTNBot|YaK|YandexAdditional|YandexAdditionalBot|YouBot|ZanistaBot)") {
    set $block 1;
}

# Always allow robots.txt to be fetched, even by bots that would otherwise be blocked
if ($request_uri = "/robots.txt") {
    set $block 0;
}

# Reject flagged requests with 403 Forbidden
if ($block) {
    return 403;
}

If the User-Agent string matches that of one of the bots on the list, the server now returns a 403 Forbidden status code and blocks the request. In case you're running an Eleventy blog on nginx, Robb has written about a similar approach he is using to block bots on his site.

One important note, though: if you want to still allow the Internet Archive to archive your site, which you should, make sure not to block their archive.org_bot – like I did at first. ;)

Rate Limiting

If you control your server configuration, rate limiting also becomes essential. Real humans very rarely request fifty pages per second. But aggressive scrapers definitely do. Therefore, putting a cap on how often someone can repeat an action within a certain timeframe can help stop malicious bot activity and reduce strain on your server. If you’re on nginx, look into limit_req_zone and limit_req, as explained in this post on the nginx blog. Apache has mod_ratelimit for throttling bandwidth, and third-party modules like mod_evasive for limiting request rates.

The tricky part is setting the right thresholds. Too strict and you risk blocking legitimate users or search engines. Too loose and the scrapers slip through easily. I’ve been experimenting with this, and honestly, it’s been a lot of trial and error, and I’m still not 100 % sure if my settings are ideal. The general advice is to start conservative – maybe 5 to 10 requests per second per IP – and adjust based on what you see in your logs.
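To make this concrete, here is what such a cap can look like in nginx. A minimal sketch – the zone name, its size, and the 5 requests per second are assumptions you would tune for your own site, not values from my actual config:

# Track request rates per client IP in a shared memory zone
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    location / {
        # Allow short bursts (a page plus its assets) before rejecting the rest with a 503
        limit_req zone=perip burst=20 nodelay;
        # ...
    }
}

If you would rather answer rejected requests with something other than the default 503, limit_req_status lets you change that.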

My own setup, which complements the nginx config setup above, now distinguishes between three tiers of visitors based on User-Agent strings: the most aggressive bots and crawlers are blocked immediately, less aggressive crawlers – those that respect robots.txt – and user-initiated AI searches are moderately throttled, while everyone else (presumably humans, I hope, I’m still very hopeful, I guess) falls into the least restrictive tier. Using IP addresses would strengthen this approach even further, since, as we know, malicious actors will spoof their User-Agent strings. But for now, this additional rate limiting layer should help catch another significant number of aggressive invaders and cut off the peaks.
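I won’t reproduce my full config here, but the general pattern for such a tiered setup in nginx is a map that sorts visitors into buckets by User-Agent, plus one limit_req_zone per bucket. The bot names, zone names, and rates below are illustrative assumptions, not my exact values:

# In the http context: give known, better-behaved crawlers a rate-limit key.
# Everyone else gets an empty string, and requests with an empty key are not
# counted against the "crawlers" zone at all.
map $http_user_agent $crawler_key {
    default                                           "";
    ~*(Googlebot|bingbot|OAI\-SearchBot|Claude\-User)  $binary_remote_addr;
}

limit_req_zone $crawler_key        zone=crawlers:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=everyone:10m rate=5r/s;

# In the server/location block: both limits apply; the stricter one effectively caps known crawlers
server {
    location / {
        limit_req zone=crawlers burst=10;
        limit_req zone=everyone burst=20 nodelay;
        # ...
    }
}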

Monitoring Your Access Logs

I know, it sounds tedious. And it is. But you need to know what’s actually hitting your site. And that means looking into the access logs of your server or webspace, not just your analytics dashboard. Because tools like Fathom or Plausible will only tell you half of the story. They apply their own filters to keep your analytics clean and bot-free, which is great for understanding human visitors. But that also means that there is no way around looking at your raw logs.

I recently spent hours going through mine, identifying the IPs with the most traffic and blocking them manually as a first – and probably somewhat unsustainable – countermeasure. But it helped me to understand the problem better – and also to check whether my actions actually had an effect.

Tools like GoAccess can be super helpful for this, because they give you real-time insights and even let you set up alerts for unusual traffic spikes. Because ideally, you want to know when something weird is happening, not discover it three days later when your hosting provider has already implemented their own “solution.” This goes double for client sites.
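If you want to give GoAccess a try, a single command is enough to get a first overview. This sketch assumes the default nginx access log location and the standard combined log format – adjust both to your setup, and keep the generated report somewhere that isn’t publicly reachable:

# Build a continuously updating HTML report from the live nginx access log
goaccess /var/log/nginx/access.log --log-format=COMBINED -o report.html --real-time-html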

Using IP Blocklists

IP blocklists are part of any solid firewall setup, but you need to be thoughtful about it. There are well‑established resources that publish lists of IP addresses of known scraping networks, exploiters, spam sources, and other malicious actors – data centers that legitimate traffic would never come from. Useful resources include AbuseIPDB, Spamhaus’s Don’t Route Or Peer Lists (DROP), blocklist.de, or All Cybercrime IP Feeds by FireHOL.

You can then block those IPs with a firewall like Linux’s iptables via ipset. This, however, requires ongoing maintenance because those lists need updating. There are ways to configure your server so that the lists get updated automatically. In my case, I am now using Hestia Control Panel on my server, which makes configuring iptables a bit more comfortable when blocking individual IPs or whole blacklists. But you should still be very careful not to block legitimate traffic or even yourself. It is all about finding the most trustworthy lists and not being overly aggressive. And, again, watching your logs.
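If you’re not using a control panel, the manual version looks roughly like this. A minimal sketch using the Spamhaus DROP list mentioned above – the set name is made up, and you would typically wrap this in a small script or cron job so the list stays fresh:

# Create an ipset for network ranges (no error if it already exists)
ipset create droplist hash:net -exist

# Fetch the Spamhaus DROP list and add every CIDR range to the set
curl -fsSL https://www.spamhaus.org/drop/drop.txt \
  | awk '/^[0-9]/ { print $1 }' \
  | while read -r net; do ipset add droplist "$net" -exist; done

# Drop all incoming traffic from those ranges
iptables -I INPUT -m set --match-set droplist src -j DROP

Note that neither the set nor the iptables rule survives a reboot on its own, so you would also want to persist both with whatever mechanism your distribution or control panel provides.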

Poisoning Well

For the really aggressive invaders, especially if you’re concerned about LLM scrapers harvesting your content without consent, effectively violating your copyright, there’s another approach: serve different content to suspected bots – or, as Heydon put it, poison the well. Feed them garbage data. It won’t stop the bandwidth drain, but it does something about the copyright violation.
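In nginx terms, that could be as simple as swapping the 403 from the config above for a rewrite to a decoy page. A minimal sketch, placed in the same server block as the User-Agent check – /decoy.html is a hypothetical static page full of machine-generated nonsense that you would have to create yourself, and dedicated poisoning tools go much further than this:

# Instead of blocking flagged bots outright, quietly serve them nonsense
if ($block) {
    rewrite ^ /decoy.html last;
}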

WAFs: Blocking the Baddest Baddies

The final layer of protection would be to install a web application firewall (WAF) that includes bot protection. Open-source tools like BunkerWeb, SafeLine, or the really interesting Anubis provide multi-layered defense against bot attacks through CAPTCHA verification or other challenges, dynamic protection, and anti-replay protection. They are basically an additional layer (a reverse-proxy) between your app or website and the internet that only lets the good traffic through.

But the bigger you make the concrete walls around your digital garden, the more likely it is that they also have negative consequences. You might block legitimate crawlers. You might hurt your accessibility and block real humans. And you might spend hours installing and maintaining systems that break in weird ways. To me personally, tools that put a challenge in front of real humans still feel a bit like admitting defeat.

So I haven’t yet tried Anubis, for example – although it might well be the best way to really get rid of the crawlers. But I’ll definitely keep exploring this space and try out this and other tools, now that I have my own virtual private personal website server.

The Web We Wanted

We built the web on optimistic assumptions. We assumed good faith. We assumed people would respect robots.txt because we all understood we were building something great together. We created these protocols and conventions because we believed in mutual respect and shared purpose. The Web was supposed to be for everyone.

But when training data becomes worth billions of dollars, when AI capabilities determine who wins and loses in global economic competition, when scraping is a strategic commercial activity, then those assumptions break down.

Part of me wants to believe that we can still maintain an open web without surrendering to corporate infrastructure or turning every website into a defensive fortress shooting webspace invaders. Part of me wants to believe that international cooperation could establish better norms over time and that the companies building these AI systems will take responsibility for the externalities they’re creating. (Haha.)

But I’m now also old enough to be an optimistic realist. I’m watching my access logs, I’m watching the arms race escalate, and I’m wondering how long places like mine can hold out. How long before every personal website needs enterprise-level protection just to function? How long before the cost of publishing your thoughts online becomes prohibitive for regular people? For some, it might be already.

It feels like we might soon be forced to choose between accessibility and survival. Between openness and sustainability. Between our ideals about what the web should be and the practical reality of keeping our sites online.

An Infinite Game

Stories like mine are now probably happening to more small site owners everywhere. Many of them might not even know why their hosting bills are climbing or why their sites are suddenly down in certain countries. And once they find out, many of them might not be able or willing to set up a VPS and spend hours fiddling with server logs, firewalls, blocklists, and nginx config files.

Yes, the AI companies need to do better. They actually should throttle their scraping to reasonable levels. They actually should respect the limited resources of small sites. They actually should develop industry standards that don’t externalize costs onto individuals who are just trying to share their work. This might even be in their best interest, because ultimately, they also risk destroying the very ecosystem they depend on – after all, what will their LLMs learn from when the independent web has been scraped into extinction? But I’m also not holding my breath for voluntary restraint when there are billions of dollars at stake. In the end, it is also on us to adapt to this new reality – clear-eyed and pragmatic.

So what do we do? We document what’s happening. We block scrapers. We move to better, more capable servers. We share our approaches and learn from each other. We push for better standards and regulations where we can. We make noise about the problem instead of suffering silently. Because this is the Open Web, and the Web was designed so that we can still do all that. That’s the magic of it.

I for one am not giving up my little corner on the Web, of course. I'll continue to document and write about that journey that is my personal website. Over the coming weeks, I will share my experiences with my new VPS setup – how I configured the server, how I set up my continuous deployment pipeline for Craft CMS (yes, I have that now, Manuel! 😉), how I configured caching and other performance improvements, and I’ll also finally write more about the redesign and CSS of my site. Because that’s what our little corners on the Web are all about. That’s what the personal website verse is all about. And that’s what we are trying to protect here.

So, if you’re running a website, check your logs. Set up some basic protections. Share what you find. Be part of this conversation. Write about the future you want. Because the alternative is watching the independent web getting scraped into oblivion while we all wonder what happened to the Web we loved.

Here’s the thing about Space Invaders, the game: it didn’t end when the aliens reached the bottom. It just started over, harder and faster. Maybe that’s where we are now. It’s not game over. Not yet. Just the next level. And we need to get better at playing it.

👾👾👾

What has your experience been with AI scrapers? Have you found approaches that work? Have you “given up” or moved to Cloudflare? I’m genuinely curious what’s happening in your corner of the Web, because I suspect we’re all dealing with variations of the same problem. So let me know via Webmention, Mastodon, Bluesky, email, or maybe even in a response blog post.

~

233 Webmentions

How LLMs & Chatbots Are Bad For the Indie Web

Large language models and their associated bots are bad for the indie web in at least three ways: 1) their logistical consequences are bad for bandwidth, 2) their social consequences are bad for guides, and 3) their citational consequences are bad for surfability. These consequences are worth highlighting in light of how LLM-based chatbots have been used and endorsed on the indie web. The indie web may mean ...
damianwalsh
@matthiasott Thanks for writing this up. I deliberately didn't add analytics to my site because I didn't want to compromise privacy or fall into the trap of creating content that "performs well". That said, Netlify offers a one-hour rolling observability feature on their legacy free plan. Even my small site typically gets around 60% non-browser traffic. I don't want to block genuine users, but this somehow makes ...
svenk
@matthiasott An interesting measure is also a self-hosted proof-of-work captcha like https://anubis.techaro.lol/ PS: Your site is broken in Firefox on Android (everything looks a bit like line-height: 0px) Anubis: Web AI Firewall Utility | Anubis
Steve Woodson, CPACC
What a nightmare, I've been there too with hosting providers making changes indiscriminately and we're left investigating what's going on. Hard to fault them, besides the lack of communication, because they're just trying to keep their servers online. Thanks for the ideas & resources at the end!
Aline F. :nyancat_rainbow:
@matthiasott Welp, I deleted instead of editing 😅 reposting with some links you might enjoy: blog posts by @algernon “You probably shouldn’t block #AI bots from your website” https://chronicles.mad-scientist.club/tales/you-probably-shouldnt-block-ai-bots-from-your-website/ “The cost of poison” https://chronicles.mad-scientist.club/tales/the-cost-of-poison/ plus this excellent list: “Sabot in the ...
Koos
@matthiasott I haven't had issues like that (yet!) but now I know where to look when the day comes. I'm using Matomo Analytics which ignores a good chunk of traffic, for better or for worse. I'm wondering about the usefulness of robots.txt. I've added one with the LLM scrapers too, but OpenAI and Anthropic have publicly admitted that they ignore the file. And the smaller companies always have anyway.
Matthias Ott
My site just was under heavy load from Singapore again LOL. 🤣 Temporarily blocked the whole country via ipset now so that my site is available again …
Joe Crawford
@matthiasott What a moment we are in. Feel free to document at https://indieweb.org/large_language_model_traffic (Also, I added your piece to that page) large language model traffic
Nicolas Hoizey
“Webspace Invaders” by @matthiasott 🔗 https://matthiasott.com/articles/webspace-invaders > We built the web on optimistic assumptions. We assumed good faith. We assumed people would respect robots.txt because we all understood we were building something great together. We created these protocols and conventions because we believed in mutual respect and shared purpose. The Web was supposed to be for ...
Flaki
The arms race chugs along... https://matthiasott.com/articles/webspace-invaders We were actually relying on Cloudflare's bot protections for a while, but the AI scraping got *so bad* lately (with mitigations turned on!) that we have been forced to spend most of our time just trying to keep the site running and not on actual feature work. Eventually, our CTO implemented a POW scheme inspired by Anubis that was ...
Elliot Jay Stocks
This is a fascinating (and terrifying) read by @matthiasott, who identifies the sheer scale of LLM scraper bots hitting personal sites — to the point where web hosts are having to limit traffic, preventing actual humans from accessing them. https://matthiasott.com/articles/webspace-invaders Webspace Invaders · Matthias Ott
Ric Le Poidevin
I had a similar problem on client websites, although admittedly a fair proportion of the traffic was bots scanning for WordPress vulnerabilities etc. I don't use WordPress but it hit my CMS API to see if the URLs exist. AI crawlers have just made this worse by crawling actual URLs.
Cory Dransfeldt :demi:
🔗 Web­space Invaders via @matthiasott #Development #Webdev #Ai #Tech A couple of weeks back, I’m sitting at my desk when a direct message from my frontend friend Kevin Powell pops up. Kevin’s a genuinely kind guy. He makes CSS videos on YouTube and he’s got this way of explaining things that never makes you feel stupid for not knowing them already. He’s one of those folks who still has faith in the ...
Matthias Ott
For now, I only moved my site over to a VPS. The domain points to that server now, but is still at all-inkl. As are other domains, email, and other stuff. I don’t have client sites there, though.
Matthias Ott
Except for the fact that they didn’t tell me that they added a filter and I had to investigate myself and convince them that in fact *they* changed something, I was happy with them and a customer for >16 years now. And even that they did at least *something* can be seen as a positive, I guess …
Matthias Ott
@errorquark That sounds really interesting. I’m just getting my head around all this, so every direction that sounds promising is very helpful. I’m also relying a lot on auto-updating ipset lists and rate limiting. But it’s obviously not enough, had to block Singapore completely yesterday. Good luck evolving your defense strategy further! 🤞
Nils Hörrmann
@matthiasott Thanks for writing this down! We had a similar situation where a client's site was hammered by a bot network. They visited 250 pages then rotated the IP and their individual requests were actually throttled. But there were so many bots that they downloaded 100 GB of font data within two weeks – by just downloading the same (small) webfont over and over again. They are like robot vacuum cleaners ...
@matthiasott.com has written a very good article summarizing the rampant LLM crawling. At FAU, we have the problem a few levels harder still, because we maintain more than 1,500 websites. Consider it required reading. matthiasott.com/articles/web...
st3phvee
I have the exact same issue on my website. Every day, I receive literal THOUSANDS of hits for “wp-admin” pages and PHP extensions that don’t actually exist on my website. I had to set up a WAF specifically for WordPress/PHP exploit attempts.
st3phvee
Great post! This crap is happening on my website, too. More than half of the traffic I receive is from bots… I worry about my RSS feed as well. I’d like to believe that everyone subscribed to my feed is a human with good intentions, but I’m not that naive.
barning
@matthiasott That happened to me, too, except that in my case podcast feeds are being crawled regularly and bring everything crashing down. The only solution right now is to slowly untangle my hosting strategy. https://niklasbarning.de/2025/12/02/picknick-an-der-datenautobahn/ Picknick an der Datenautobahn
Webrocker

Webspace Invaders - Matthias Ott

(…) In their hunger for data to train their large language models, companies from all over the world are systematically harvesting every word I’ve ever published, feeding it into their language models to keep them fresh – and the side effect, the collateral damage, is that Kevin in Montreal now can’t read my articles because my hosting provider decided the solution was to block Canada and half the rest of ...
Francesco Schwarz
Web­space Invaders via matthiasott.com So, if you’re running a website, check your logs. Set up some basic protections. Share what you find. Be part of this conversation. Write about the future you want. Because the alternative is watching the independent web getting scraped into oblivion while we all wonder what happened to the Web we loved.
Webrocker

Army of Bots

For some months now I have a simple detection against "bad" bots in place. Bots that scrape *everything* they find and very likely are vacuuming all the contents they get to feed the data grinders that train the LLMs of the world. Bots that not only ignore the "robots.txt" protocol, but actively see entries in the robots.txt file as an invitation to visit the contents that are listed there as ...
Jeremy Keith
Webspace Invaders · Matthias Ott February 24th, 2026 There’s a power imbalance at work here that’s hard to ignore. Large “AI” companies, the ones with billions in venture capital, send their bots to harvest free content. Not only from big publishers or Wikipedia, but from small, independent websites, too. But we, the people running these sites – often as passion projects, as ways to freely share what ...

97 Likes

95 Reposts
