I need clarification on the functionality of the Rate Limiting feature, as the documentation states “the Rate Limiting counts mostly normal page requests but not requests for static assets, and not requests to admin-ajax.php”.
If I understand correctly, Scrapy will not be blocked when bombarding my site with tens of requests per second, like these:
34.31.178.14 - - [22/Apr/2025:12:18:37 +0200] "GET /portfolio-photostream/?tax_category=euphorbiaceae,boraginaceae,grallariidae&unfilter=1 HTTP/1.0" 200 10175 "https://www.ross.no/portfolio-photostream/?tax_category=euphorbiaceae,boraginaceae&unfilter=1" "Scrapy/2.11.2 (+https://scrapy.org)"
34.31.178.14 - - [22/Apr/2025:12:18:38 +0200] "GET /portfolio-photostream/?tax_category=euphorbiaceae,boraginaceae,ploceidae&unfilter=1 HTTP/1.0" 200 8726 "https://www.ross.no/portfolio-photostream/?tax_category=euphorbiaceae,boraginaceae&unfilter=1" "Scrapy/2.11.2 (+https://scrapy.org)"
34.31.178.14 - - [22/Apr/2025:12:18:34 +0200] "GET /portfolio-photostream/?tax_category=euphorbiaceae,ramphastidae,blomsterplanter&unfilter=1 HTTP/1.0" 200 8726 "https://www.ross.no/portfolio-photostream/?tax_category=euphorbiaceae,ramphastidae&unfilter=1" "Scrapy/2.11.2 (+https://scrapy.org)"
34.31.178.14 - - [22/Apr/2025:12:18:33 +0200] "GET /portfolio-photostream/?tax_category=euphorbiaceae,ramphastidae,cervidae&unfilter=1 HTTP/1.0" 200 8726 "https://www.ross.no/portfolio-photostream/?tax_category=euphorbiaceae,ramphastidae&unfilter=1" "Scrapy/2.11.2 (+https://scrapy.org)"
34.31.178.14 - - [22/Apr/2025:12:18:45 +0200] "GET /portfolio-photostream/?tax_category=euphorbiaceae,hypericaceae&date=2023_07&unfilter=1 HTTP/1.0" 200 8726 "https://www.ross.no/portfolio-photostream/?tax_category=euphorbiaceae,hypericaceae&unfilter=1" "Scrapy/2.11.2 (+https://scrapy.org)"
Please advise.
https://wordpress.org/support/topic/rate-limiting-for-query-gets-2/
As the page with query strings resolves to a legitimate filtered list of images rather than a 404 or a suspicious attempt to exploit a plugin’s settings, it may be allowed unless it exceeds your Rate Limiting settings for “everyone” or “crawler”.
You can outright block these requests if you don’t wish to see Scrapy accessing your site by using Wordfence > Blocking > Custom Pattern. In “Browser User Agent” you could type Scrapy/*, or alternatively in “Hostname” you could type *scrapy.org*. Don’t forget to hit the “BLOCK VISITORS MATCHING THIS PATTERN” button afterwards to save. If you experience problems going forward, these blocks can be removed from the table below the form.
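As a side note, the same idea can be sketched in PHP if you ever want a belt-and-braces check alongside Wordfence. This is only my rough illustration, not Wordfence code; the plain “scrapy” substring match is an assumption that simply mirrors the Scrapy/* pattern above:

```php
<?php
/**
 * Rough sketch only, not Wordfence code: a must-use plugin
 * (e.g. wp-content/mu-plugins/block-scrapy.php) that refuses requests
 * whose User-Agent contains "scrapy".
 */
add_action( 'init', function () {
	$ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';

	// Case-insensitive substring match on the advertised user agent.
	if ( false !== stripos( $ua, 'scrapy' ) ) {
		status_header( 403 ); // refuse the request outright
		exit( 'Forbidden' );
	}
} );
```

Of course, any user-agent match only works for as long as the scraper keeps identifying itself honestly.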
If you’d like to set your Rate Limiting Rules to deal with them instead, these are the values I usually start with:
[Screenshot: suggested Rate Limiting settings]
- If anyone’s requests exceed – 240 per minute
- If a crawler’s page views exceed – 120 per minute
- If a crawler’s pages not found (404s) exceed – 60 per minute
- If a human’s page views exceed – 120 per minute
- If a human’s pages not found (404s) exceed – 60 per minute
- How long is an IP address blocked when it breaks a rule – 30 minutes
I also always set the rule to Throttle instead of Block. Throttling is generally better than blocking because a good search engine understands what happened if it is mistakenly throttled, and your site isn’t penalized because of it. Set your Rate Limiting Rules realistically, and set the value for how long an IP address is blocked to 30 minutes or so. This is a general guide, so feel free to loosen the limits if you notice too many unwanted blocks.
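If it helps to picture what a throttle rule amounts to mechanically, below is a rough per-IP, per-minute counter. It is purely my own illustration, not Wordfence’s implementation; the 120-per-minute limit and the 30-minute penalty just reuse the example values above:

```php
<?php
/**
 * Illustrative per-IP fixed-window counter (NOT Wordfence's implementation).
 * Over the limit, we answer 429 "Too Many Requests" for a while instead of
 * hard-blocking, which is the essence of Throttle vs. Block.
 */
const RL_LIMIT_PER_MINUTE = 120;  // example "page views exceed" value above
const RL_PENALTY_SECONDS  = 1800; // example 30-minute penalty above

add_action( 'init', function () {
	$ip      = isset( $_SERVER['REMOTE_ADDR'] ) ? $_SERVER['REMOTE_ADDR'] : '0.0.0.0';
	$penalty = 'rl_penalty_' . md5( $ip );
	$window  = 'rl_count_' . md5( $ip . floor( time() / 60 ) ); // per-minute window

	if ( get_transient( $penalty ) ) {
		status_header( 429 );
		header( 'Retry-After: ' . RL_PENALTY_SECONDS );
		exit( 'Too Many Requests' ); // throttled, not permanently blocked
	}

	$count = (int) get_transient( $window ) + 1;
	set_transient( $window, $count, 2 * MINUTE_IN_SECONDS );

	if ( $count > RL_LIMIT_PER_MINUTE ) {
		set_transient( $penalty, 1, RL_PENALTY_SECONDS );
	}
} );
```

The point of answering 429 with a Retry-After header rather than a hard 403 is that a well-behaved crawler treats it as “slow down and come back later”, which is why throttling is kinder to search engines than blocking.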
Let me know if that helps,
Peter.
My response above was held for moderation @rosmo01, even though I posted it a few minutes after the reply in your duplicate topic. I’m just responding again to flag the message in case you didn’t receive a notification.
Peter.
The challenge is only with crawlers and scrapers, because they dig excessively deep into meaningless query combinations, which means any crawler or scraper can, without warning, drown the server in tens of thousands of requests per hour. I got 58k Scrapy query GETs from just one IP address within 20 minutes, and at any point there could be hundreds of IPs wanting to scrape or crawl. I had 400k query GETs from Google within a short time before I was able to limit them via the hosting provider. From Vietnam Posts and Telecommunications Group I was hit by thousands of IPs running query GETs from more than 600 subnets, which took forever to ban.
My challenge is limiting unknown bots. I’ve never had an issue with Scrapy before, so I did not have any user agent or pattern block in place.
Can the rate limiting be set to match a pattern like ‘&unfilter=1’? Requests containing it from the same IP in short succession are the hallmark of a crawler or scraper, since no human user has the time to filter, read, and then re-filter the content that quickly.
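To make the idea concrete, this hypothetical sketch shows the behaviour I’m after (the unfilter=1 match, the 30-matching-requests-per-minute threshold, and the 30-minute throttle are all just my example values; I’m asking whether Wordfence can do something equivalent):

```php
<?php
/**
 * Hypothetical sketch of pattern-scoped rate limiting; not an existing
 * Wordfence option. Only requests whose query string contains "unfilter=1"
 * are counted, so a real visitor filtering once or twice is never touched.
 */
const PATTERN_LIMIT_PER_MINUTE = 30;   // example threshold for matching GETs
const PATTERN_PENALTY_SECONDS  = 1800; // example 30-minute throttle

add_action( 'init', function () {
	$query = isset( $_SERVER['QUERY_STRING'] ) ? $_SERVER['QUERY_STRING'] : '';

	// Ignore everything that doesn't carry the crawler hallmark.
	if ( false === strpos( $query, 'unfilter=1' ) ) {
		return;
	}

	$ip  = isset( $_SERVER['REMOTE_ADDR'] ) ? $_SERVER['REMOTE_ADDR'] : '0.0.0.0';
	$key = 'pat_count_' . md5( $ip . floor( time() / 60 ) ); // per-minute window

	$count = (int) get_transient( $key ) + 1;
	set_transient( $key, $count, 2 * MINUTE_IN_SECONDS );

	if ( $count > PATTERN_LIMIT_PER_MINUTE ) {
		status_header( 429 );
		header( 'Retry-After: ' . PATTERN_PENALTY_SECONDS );
		exit( 'Too Many Requests' );
	}
} );
```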