Continuing our web scraping adventure, here are a few more tricks I explored over the last weekend.
Since web scraping spiders can degrade a website's performance considerably if they hit it frequently, some websites implement techniques to ward off crawlers.
Websites have a ‘robots.txt’ file at the site root, such as http://www.example.com/robots.txt. Even WordPress blogs have this file; for this blog, it is at https://devopsrecipe.wordpress.com/robots.txt. The file specifies which user agents are disallowed from crawling which pages of the website. Visit http://www.robotstxt.org/robotstxt.html for more information.
‘robots.txt’ merely instructs web scrapers what to scrape and what not to; it does not enforce anything to actually block them. So it is up to your good judgement to exclude those sites, or pages of those sites, from your scraper.
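As a minimal sketch, the Python standard library’s urllib.robotparser can check a URL against a site’s rules before you fetch it. The sample rules, bot name, and URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks everyone from /private/
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # in real use: rp.set_url("http://www.example.com/robots.txt"); rp.read()

# can_fetch(user_agent, url) -> True if the user agent may crawl the URL
print(rp.can_fetch("MyBot", "http://www.example.com/private/page"))  # False
print(rp.can_fetch("MyBot", "http://www.example.com/public/page"))   # True
```

Scrapy can also honour robots.txt for you: set ROBOTSTXT_OBEY = True in settings.py and it will skip disallowed URLs automatically.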
Secondly, if a website identifies a scraper hitting it, it can redirect the request to a different link. To avoid this, you can disable redirection by adding the line below to settings.py of your Scrapy project:
REDIRECT_ENABLED = False