Language: Python
Web
Requests-HTML was created by Kenneth Reitz as an extension to the popular Requests library. Its goal was to simplify web scraping and HTML parsing in Python by providing a high-level, Pythonic API. It supports JavaScript rendering using Pyppeteer, making it suitable for modern websites with dynamic content.
Requests-HTML is a Python library that combines the simplicity of the Requests library with powerful HTML parsing, web scraping, and asynchronous capabilities. It allows developers to easily fetch, parse, and manipulate HTML content.
pip install requests-html
conda install -c conda-forge requests-html
Requests-HTML allows sending HTTP requests, parsing HTML content, executing JavaScript, and extracting data using CSS selectors or XPath. It provides synchronous and asynchronous APIs for web scraping and automation tasks.
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://example.com')
print(response.html.find('title', first=True).text)
Creates an HTML session, fetches a webpage, and prints the page title by selecting the `<title>` element.
links = response.html.find('a')
for link in links:
    print(link.text, link.attrs.get('href'))
Uses CSS selectors to find all anchor tags and prints their text and href attributes.
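For plain link harvesting, the parsed page also exposes ready-made sets of URLs through `links` and `absolute_links`; a minimal sketch reusing the session above:
r = session.get('https://example.com')
print(r.html.links)           # URLs exactly as written in the page (possibly relative)
print(r.html.absolute_links)  # the same URLs resolved against the page's base URL
Both attributes return Python sets, so duplicate links are removed automatically.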
r = session.get('https://example.com')
r.html.render()
print(r.html.html)
Renders JavaScript on the page, allowing scraping of dynamically generated content.
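`render()` also accepts tuning arguments such as `sleep`, `scrolldown`, and `timeout`, which help with slow or lazy-loading pages; the values below are illustrative only:
r = session.get('https://example.com')
# Scroll the page three times, pausing one second between scrolls, with a 20-second overall limit
r.html.render(scrolldown=3, sleep=1, timeout=20)
print(r.html.html)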
titles = response.html.xpath('//h1/text()')
print(titles)
Uses XPath expressions to extract the text of all `<h1>` elements.
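XPath can also select attribute values directly; text and attribute selections come back as plain strings rather than element objects (a short sketch):
hrefs = response.html.xpath('//a/@href')
print(hrefs)
Extracts the href attribute of every anchor tag as a list of strings.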
from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
async def get_page():
    r = await asession.get('https://example.com')
    print(r.html.find('title', first=True).text)
asession.run(get_page)
Demonstrates asynchronous fetching of a webpage using AsyncHTMLSession.
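`AsyncHTMLSession.run` also accepts several coroutine functions at once and runs them concurrently, which is where the asynchronous API pays off (the URLs below are placeholders):
async def get_python():
    return await asession.get('https://www.python.org')

async def get_example():
    return await asession.get('https://example.com')

results = asession.run(get_python, get_example)
for r in results:
    print(r.url, r.status_code)
Each coroutine's response is collected and returned once all requests have finished.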
form = response.html.find('form', first=True)
data = {field.attrs['name']: 'value' for field in form.find('input') if 'name' in field.attrs}
response = session.post(form.attrs['action'], data=data)
Extracts the first form on the page, builds a payload from its named input fields, and submits it via POST.
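Form `action` attributes are frequently relative paths, so it is safer to resolve them against the page URL before posting; a self-contained sketch using the standard library's `urljoin` (the URL is a placeholder):
from urllib.parse import urljoin

page = session.get('https://example.com')
form = page.html.find('form', first=True)
data = {field.attrs['name']: 'value' for field in form.find('input') if 'name' in field.attrs}
action_url = urljoin(page.url, form.attrs.get('action', ''))
result = session.post(action_url, data=data)
This handles both absolute and relative action URLs without further special-casing.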
Use `.render()` only when necessary since JavaScript rendering is resource-intensive.
Leverage CSS selectors for simple element extraction and XPath for complex queries.
Use sessions to persist cookies and headers across multiple requests (see the sketch after this list).
Combine Requests-HTML with other libraries like BeautifulSoup or Pandas for processing and analysis.
Handle exceptions for network issues or rendering errors gracefully, as sketched below.
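A small sketch combining the last two tips: headers set once on the session are reused by every request, and network or rendering failures are caught instead of crashing the scraper (the user-agent string and URL are placeholders):
from requests_html import HTMLSession
from requests.exceptions import RequestException

session = HTMLSession()
# Headers set here are sent with every request made through this session
session.headers.update({'User-Agent': 'my-scraper/0.1'})

try:
    r = session.get('https://example.com', timeout=10)
    r.raise_for_status()          # turn HTTP error status codes into exceptions
    r.html.render(timeout=20)     # rendering can fail independently of the request
    print(r.html.find('title', first=True).text)
except RequestException as exc:
    print(f'Request failed: {exc}')
except Exception as exc:
    # render() errors come from the headless browser (e.g. timeouts), so catch them separately
    print(f'Rendering failed: {exc}')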