
Web Scraping using Scrapy

So recently I have been playing with web scraping using the Python Scrapy module. It is fun when you can see the data from multiple pages of a website in CSV format with a single keystroke.

This tutorial assumes that you have installed scrapy.

Now let's get straight back to the topic. Scrapy provides various built-in commands, and one of them creates the required files and folder structure. That command is 'startproject', used as follows:

scrapy startproject aus_events

It creates the files and folder structure as below:

aus_events/
├── scrapy.cfg
└── aus_events/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py

Now the first file of importance is items.py, where we declare items. Items are the attributes of the website data which we need to capture.

For example, in the code below, for an event ticketing website, we want to pick up the date of the event, its name, its organizer and the venue.


import scrapy
from scrapy.loader.processors import Join, MapCompose
from w3lib.html import replace_escape_chars

# str.strip replaces the Python 2-only unicode.strip so this runs on Python 3
clean_text = MapCompose(lambda v: v.split(), replace_escape_chars, str.strip)

class ExampleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Date = scrapy.Field(input_processor=clean_text, output_processor=Join())
    Name = scrapy.Field(input_processor=clean_text, output_processor=Join())
    Organizer = scrapy.Field(input_processor=clean_text, output_processor=Join())
    Location = scrapy.Field(input_processor=clean_text, output_processor=Join())


Here we are using input and output processors to manipulate the strings that we scrape from the website; more information about these processors is available in the Scrapy documentation. The input processor takes the raw strings returned by the scraper and transforms them with the functions inside "MapCompose", while the output processor (Join here) combines the resulting values into a single string.
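To see what this processing chain does to a raw scraped string, here is a plain-Python sketch that mimics the behaviour of MapCompose and Join (the map_compose helper and the sample input below are illustrative, not part of Scrapy itself):

```python
# Plain-Python sketch of MapCompose + Join on one scraped value.
# map_compose and the raw strings here are hypothetical stand-ins.

def map_compose(*functions):
    """Apply each function to every value in turn, flattening lists
    (roughly what Scrapy's MapCompose does)."""
    def wrapper(values):
        for f in functions:
            out = []
            for v in values:
                result = f(v)
                # A function may return a list (e.g. str.split) or one value
                out.extend(result if isinstance(result, list) else [result])
            values = out
        return values
    return wrapper

input_processor = map_compose(lambda v: v.split(), str.strip)
output_processor = " ".join  # what Join() does with the default separator

raw = ["  Sat 12 Aug\n ", " 7:00 PM "]  # messy strings from the scraper
clean = output_processor(input_processor(raw))
print(clean)  # "Sat 12 Aug 7:00 PM"
```

The split step breaks each string on whitespace (dropping newlines and escape characters along the way), and Join stitches the pieces back together with single spaces.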

Now comes the main scraper, which we will define inside the spiders directory. Let's create a file called exampleCrawl.py:

touch aus_events/spiders/exampleCrawl.py

The code in exampleCrawl.py is:

import scrapy
from aus_events.items import ExampleItem
from scrapy.loader import ItemLoader
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider

class ExampleSpider(CrawlSpider):
    name = "sample"
    allowed_domains = ["example.com"]
    start_urls = [
        "https://www.example.com/events/?page=1",
    ]

    # Follow every paginated listing URL and hand each page to parse_item
    rules = (
        Rule(LinkExtractor(allow=(r"www\.example\.com/events/\?page=[0-9]+",)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Each event card on the page sits inside this div
        for card in response.xpath('//div[@class="l-block-2"]'):
            l = ItemLoader(item=ExampleItem(), selector=card)
            l.add_xpath('Name', 'a/div[2]/h4/text()')
            l.add_xpath('Date', 'a/div[2]/time/text()')
            l.add_xpath('Organizer', 'a/div[2]/div[1]/text()')
            l.add_xpath('Location', 'a/div[2]/div[2]/text()')
            yield l.load_item()
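The relative XPaths in parse_item can be tried out with Python's standard library alone. The HTML card below is a hypothetical example shaped to match the paths the spider uses (the real site's markup will differ); note that ElementTree uses the .text attribute instead of the text() step:

```python
import xml.etree.ElementTree as ET

# Hypothetical event card shaped like the structure the spider's
# relative XPaths assume.
card_html = """
<div class="l-block-2">
  <a href="/events/123">
    <div class="thumb"></div>
    <div class="details">
      <h4>Food Festival</h4>
      <time>Sat 12 Aug</time>
      <div>City Events Co</div>
      <div>Central Park</div>
    </div>
  </a>
</div>
"""

card = ET.fromstring(card_html)
# The same relative paths passed to l.add_xpath(...) above:
# div[2] is the second div child of the link, i.e. the details block
name = card.find("a/div[2]/h4").text
date = card.find("a/div[2]/time").text
organizer = card.find("a/div[2]/div[1]").text
location = card.find("a/div[2]/div[2]").text
print(name, date, organizer, location)
# Food Festival Sat 12 Aug City Events Co Central Park
```

Because the paths are relative to each card, the loop extracts one item per event rather than one jumbled item per page.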

In this particular website, we need to crawl through every page to gather the data; therefore we import CrawlSpider from scrapy.spiders.

The URLs for the different pages of this website look like:

http://www.example.com/events/?page=1

Here, we need to extract links to the other pages, so we use rules: a Rule with a regular expression describing the URLs we want to follow. For every page URL extracted, the callback parse_item is invoked, which extracts the name, date, organizer and location of each event on that page.
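The allow pattern from the Rule can be checked against sample URLs with the standard re module (the URLs below are made up; the pattern is written as a raw string, with + added so multi-digit page numbers match as one token):

```python
import re

# The allow pattern from the Rule above, as a raw string
pattern = re.compile(r"www\.example\.com/events/\?page=[0-9]+")

urls = [
    "https://www.example.com/events/?page=2",       # paginated listing
    "https://www.example.com/events/detail/123",    # event detail page
]
matched = [u for u in urls if pattern.search(u)]
print(matched)  # only the paginated listing URL survives
```

LinkExtractor works the same way: any link whose URL contains a match for the pattern is followed, everything else is ignored.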

There are two ways to export the scraped data to CSV: from the command line while running the crawler, or by configuring it in settings.py.

From command line:

scrapy crawl sample -o data.csv

Or add the following lines to settings.py:


FEED_FORMAT = 'csv'
FEED_URI = "data.csv"
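Either way, the feed export writes one header row followed by one row per yielded item, much like the csv module sketch below (the two items here are hypothetical examples of what the spider would yield):

```python
import csv

# Hypothetical items, as the spider would yield them after processing
items = [
    {"Name": "Food Festival", "Date": "Sat 12 Aug",
     "Organizer": "City Events Co", "Location": "Central Park"},
    {"Name": "Jazz Night", "Date": "Fri 18 Aug",
     "Organizer": "Blue Note", "Location": "Town Hall"},
]

# Roughly what the CSV feed export produces at data.csv
with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["Date", "Name", "Organizer", "Location"])
    writer.writeheader()
    writer.writerows(items)
```

After the crawl finishes, data.csv holds every event from every page the spider visited, ready to open in a spreadsheet.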