Language: Python
Web
Scrapy was created by Pablo Hoffman and released in 2008. It was designed to provide a framework for web scraping that is fast, extensible, and reliable. Scrapy has become popular in both industry and research for automated data extraction, web crawling, and building data pipelines.
Scrapy is a fast, open-source web crawling and web scraping framework for Python. It provides tools to extract data from websites, process it, and store it in formats like JSON, CSV, or databases.
pip install scrapy
conda install -c conda-forge scrapy
Scrapy lets you define spiders that navigate through websites, extract information using XPath or CSS selectors, and export the collected data. It includes support for requests, middleware, pipelines, and asynchronous networking for high performance.
# In terminal:
scrapy startproject myproject
Initializes a new Scrapy project called 'myproject', creating the necessary directory structure and configuration files.
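The generated layout follows Scrapy's default project template, roughly:

myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py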
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
Defines a basic spider that scrapes quotes and authors from a sample website.
# Terminal command:
scrapy crawl quotes -o quotes.json
Runs the 'quotes' spider and exports the scraped data into a JSON file.
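A spider can also be launched from a plain Python script instead of the `scrapy` CLI. A minimal sketch using Scrapy's CrawlerProcess, assuming the QuotesSpider class defined above is importable:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    # equivalent to `-o quotes.json` on the command line
    'FEEDS': {'quotes.json': {'format': 'json'}},
})
process.crawl(QuotesSpider)  # the spider class defined earlier
process.start()              # blocks until the crawl finishes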
class QuotesPipeline:
    def process_item(self, item, spider):
        item['text'] = item['text'].strip()
        return item
Defines a pipeline to process items, e.g., cleaning text before saving it.
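A pipeline only runs after it is enabled in the project's settings.py. A short sketch, assuming the class above lives in myproject/pipelines.py (the module path follows the 'myproject' name used earlier):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.QuotesPipeline': 300,  # lower numbers run earlier (0-1000)
}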
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {...}
    next_page = response.css('li.next a::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
Demonstrates navigating through paginated pages by following the 'next' link.
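response.follow can also hand each link to a different callback, which is the usual way to scrape detail pages. A hedged sketch that follows each author link to a hypothetical parse_author method (the selectors are assumptions based on the sample site's markup):

def parse(self, response):
    for href in response.css('div.quote span a::attr(href)').getall():
        # follow the "(about)" link next to each author
        yield response.follow(href, callback=self.parse_author)

def parse_author(self, response):
    yield {
        'name': response.css('h3.author-title::text').get(),
        'birthdate': response.css('span.author-born-date::text').get(),
    }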
response.xpath('//div[@class="quote"]/span[@class="text"]/text()').getall()
Extracts quote texts using XPath selectors instead of CSS selectors.
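The same extraction written as a CSS selector, which can be tried interactively in the Scrapy shell (scrapy shell 'http://quotes.toscrape.com'):

response.css('div.quote span.text::text').getall()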
Use pipelines for cleaning, validating, and storing scraped data.
Leverage middlewares for handling retries, user agents, and proxies (several of these map to the settings sketch after this list).
Use asynchronous requests to speed up crawling large sites.
Respect `robots.txt` and website terms of service.
Organize multiple spiders logically within a Scrapy project.
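Several of these tips map directly onto project settings. A sketch of a settings.py fragment; the values are illustrative, not recommendations:

# settings.py
ROBOTSTXT_OBEY = True        # respect robots.txt rules
USER_AGENT = 'myproject (+https://example.com)'  # identify your crawler (URL is a placeholder)
CONCURRENT_REQUESTS = 16     # parallel requests handled by the async engine
DOWNLOAD_DELAY = 0.5         # seconds to wait between requests to the same site
RETRY_ENABLED = True
RETRY_TIMES = 2              # handled by the built-in RetryMiddleware
AUTOTHROTTLE_ENABLED = True  # adapt crawl speed to server responsiveness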