Scrapy

Language: Python

Web

Scrapy was created by Pablo Hoffman and released in 2008. It was designed to provide a framework for web scraping that is fast, extensible, and reliable. Scrapy has become popular in both industry and research for automated data extraction, web crawling, and building data pipelines.

Scrapy is a fast, open-source web crawling and web scraping framework for Python. It provides tools to extract data from websites, process it, and store it in formats like JSON, CSV, or databases.

Installation

pip: pip install scrapy
conda: conda install -c conda-forge scrapy

Usage

Scrapy lets you define spiders that navigate websites, extract information with CSS or XPath selectors, and export the collected data. It ships with built-in support for request scheduling, middlewares, item pipelines, and asynchronous networking for high performance.
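
A minimal illustration of selector-based extraction (the HTML snippet below is made up for demonstration; both queries return the same value):

from scrapy.selector import Selector

html = '<div class="quote"><span class="text">To be, or not to be.</span></div>'
sel = Selector(text=html)

# Equivalent CSS and XPath queries against the same element
sel.css('span.text::text').get()                  # 'To be, or not to be.'
sel.xpath('//span[@class="text"]/text()').get()   # 'To be, or not to be.'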

Creating a Scrapy project

# In terminal:
scrapy startproject myproject

Initializes a new Scrapy project called 'myproject', creating the necessary directory structure and configuration files.
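
The generated layout looks roughly like this (details can vary slightly between Scrapy versions):

myproject/
    scrapy.cfg            # deploy/configuration file
    myproject/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider modules go here
            __init__.py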

Defining a simple spider

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'                              # name used by `scrapy crawl`
    start_urls = ['http://quotes.toscrape.com']  # crawling starts here

    def parse(self, response):
        # called with the downloaded response for each start URL
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

Defines a basic spider that scrapes quotes and authors from a sample website.
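
Selectors can be prototyped interactively in the Scrapy shell before committing them to a spider:

# Terminal command:
scrapy shell 'http://quotes.toscrape.com'

# Inside the shell, `response` holds the downloaded page:
>>> response.css('span.text::text').get()
>>> response.css('small.author::text').getall()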

Running a spider and exporting data

# Terminal command:
scrapy crawl quotes -o quotes.json

Runs the 'quotes' spider and exports the scraped data into a JSON file.
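
Other feed formats work the same way; the format is inferred from the file extension, and in recent Scrapy versions (2.0+) the -O flag overwrites an existing file instead of appending to it:

# Terminal commands:
scrapy crawl quotes -o quotes.csv    # append results as CSV
scrapy crawl quotes -O quotes.jl     # overwrite, JSON Lines format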

Using pipelines

class QuotesPipeline:
    def process_item(self, item, spider):
        # called once for every item the spider yields
        item['text'] = item['text'].strip()
        return item

Defines a pipeline to process items, e.g., cleaning text before saving it.
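
A pipeline only runs once it is enabled in the project settings. Assuming the class above lives in myproject/pipelines.py, it can be activated like this (the integer sets execution order; lower values run first):

# In settings.py:
ITEM_PIPELINES = {
    'myproject.pipelines.QuotesPipeline': 300,
}

Pipelines can also discard invalid items by raising scrapy.exceptions.DropItem from process_item.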

Handling pagination

# Inside the spider class:
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('small.author::text').get(),
        }
    # follow the 'next' link if present; response.follow resolves relative URLs
    next_page = response.css('li.next a::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)

Demonstrates navigating through paginated pages by following the 'next' link.
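
In Scrapy 2.0 and later, the last three lines can be collapsed into a single call that follows every link matched by a selector:

# Equivalent, Scrapy 2.0+:
yield from response.follow_all(css='li.next a', callback=self.parse)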

Using XPath selectors

response.xpath('//div[@class="quote"]/span[@class="text"]/text()').getall()

Extracts quote texts using XPath selectors instead of CSS selectors.
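
XPath also supports relative queries inside a previously selected node, which keeps related fields grouped per quote (a sketch equivalent to the CSS version above):

def parse(self, response):
    for quote in response.xpath('//div[@class="quote"]'):
        yield {
            'text': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('.//small[@class="author"]/text()').get(),
        }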

Error Handling

twisted.internet.error.DNSLookupError ("DNS lookup failed"): check your network connection and that the target domain name is valid.
HttpError (non-2xx responses): handle with an errback, the HTTPERROR_ALLOWED_CODES setting, or the built-in retry middleware.
ValueError / json.JSONDecodeError when decoding a response body: confirm the response actually contains JSON (for example, check the Content-Type header) before parsing it.
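
One concrete way to catch such failures is to attach an errback to the request; the sketch below follows the errback pattern from the Scrapy documentation (the spider and method names are illustrative):

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError

class RobustSpider(scrapy.Spider):
    name = 'robust'

    def start_requests(self):
        yield scrapy.Request('http://quotes.toscrape.com',
                             callback=self.parse,
                             errback=self.on_error)

    def parse(self, response):
        self.logger.info('Fetched %s', response.url)

    def on_error(self, failure):
        # failure.check() returns the matching exception class, or None
        if failure.check(HttpError):
            self.logger.error('Non-2xx response: %s', failure.value.response.url)
        elif failure.check(DNSLookupError):
            self.logger.error('DNS lookup failed: %s', failure.request.url)
        elif failure.check(TimeoutError):
            self.logger.error('Request timed out: %s', failure.request.url)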

Best Practices

Use pipelines for cleaning, validating, and storing scraped data.

Leverage middlewares for handling retries, user agents, and proxies.

Rely on Scrapy's built-in asynchronous request handling and tune concurrency settings when crawling large sites.

Respect `robots.txt` and website terms of service (see the settings sketch after this list).

Organize multiple spiders logically within a Scrapy project.
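
A minimal settings.py sketch reflecting several of these practices (the values are illustrative starting points, not defaults to copy blindly):

# In settings.py:
ROBOTSTXT_OBEY = True                  # honour robots.txt
USER_AGENT = 'myproject (+https://example.com/contact)'  # identify your crawler
CONCURRENT_REQUESTS = 16               # overall concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # politeness limit per site
DOWNLOAD_DELAY = 0.5                   # seconds between requests to the same site
AUTOTHROTTLE_ENABLED = True            # adapt the crawl rate to server load
RETRY_ENABLED = True                   # retry transient failures
RETRY_TIMES = 2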