Requests-HTML

Language: Python

Web

Requests-HTML was created by Kenneth Reitz as an extension to the popular Requests library. Its goal was to simplify web scraping and HTML parsing in Python by providing a high-level, Pythonic API. It supports JavaScript rendering using Pyppeteer, making it suitable for modern websites with dynamic content.

Requests-HTML is a Python library that combines the simplicity of the Requests library with powerful HTML parsing, web scraping, and asynchronous capabilities. It allows developers to easily fetch, parse, and manipulate HTML content.

Installation

pip: pip install requests-html
conda: conda install -c conda-forge requests-html

Usage

Requests-HTML allows sending HTTP requests, parsing HTML content, executing JavaScript, and extracting data using CSS selectors or XPath. It provides synchronous and asynchronous APIs for web scraping and automation tasks.

Fetching a webpage

from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://example.com')
print(response.html.find('title', first=True).text)

Creates an HTML session, fetches a webpage, and prints the page title.

Finding elements with CSS selectors

links = response.html.find('a')
for link in links:
    print(link.text, link.attrs.get('href'))

Uses CSS selectors to find all anchor tags and prints their text and href attributes.

Rendering JavaScript

r = session.get('https://example.com')
r.html.render()
print(r.html.html)

Renders the page's JavaScript (downloading Chromium automatically on first use), allowing scraping of dynamically generated content.

Extracting data with XPath

titles = response.html.xpath('//h1/text()')
print(titles)

Uses XPath expressions to extract the text of all `<h1>` elements.

Using asynchronous requests

from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()

async def get_page():
    r = await asession.get('https://example.com')
    print(r.html.find('title', first=True).text)

asession.run(get_page)

Demonstrates asynchronous fetching of a webpage using AsyncHTMLSession.

Handling forms

from urllib.parse import urljoin

form = response.html.find('form', first=True)
data = {inp.attrs['name']: 'value' for inp in form.find('input') if 'name' in inp.attrs}
# The form action may be relative, so resolve it against the page URL.
response = session.post(urljoin(response.html.url, form.attrs.get('action', '')), data=data)

Extracts a form from the page, builds the data from its named inputs, resolves the action URL, and submits it via POST.

Error Handling

requests.exceptions.ConnectionError: Check network connectivity and the validity of the target URL.
requests.exceptions.HTTPError: Raised by response.raise_for_status() for 4xx/5xx responses; inspect the status code.
Rendering errors: .render() drives Chromium through Pyppeteer; ensure Pyppeteer is installed and the environment supports headless browser execution (failures surface as Pyppeteer launch or timeout errors).

Best Practices

Use `.render()` only when necessary since JavaScript rendering is resource-intensive.

Leverage CSS selectors for simple element extraction and XPath for complex queries.

Use sessions to persist cookies and headers across multiple requests.

Combine Requests-HTML with other libraries like BeautifulSoup or Pandas for processing and analysis.

Handle exceptions for network issues or rendering errors gracefully.