Beautiful Soup

Language: Python

Web

Beautiful Soup was created by Leonard Richardson in 2004. It was designed to handle malformed HTML and XML gracefully, which makes it well suited to web scraping, where real-world markup is often inconsistent. It has become a widely used tool for data extraction and web scraping.

Beautiful Soup is a Python library for parsing HTML and XML documents. It builds a parse tree from a document, which you can then navigate and search to extract data, making web scraping easier and more reliable.

Installation

pip: pip install beautifulsoup4
conda: conda install -c anaconda beautifulsoup4

Usage

Beautiful Soup provides Pythonic methods and attributes to navigate, search, and modify a parse tree. You can easily extract tags, attributes, text, or nested elements from HTML/XML documents.

Parsing HTML

from bs4 import BeautifulSoup
html = '<html><head><title>Test</title></head><body><h1>Hello</h1></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)

Parses an HTML string and extracts the text content of the `<title>` tag.

Finding all links

from bs4 import BeautifulSoup
html = '<a href="https://example.com">Example</a>'
soup = BeautifulSoup(html, 'html.parser')
# href=True skips <a> tags without an href, which would otherwise raise KeyError
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)

Extracts all URLs from `<a>` tags in the HTML.

Navigating the parse tree

from bs4 import BeautifulSoup
html = '<div><p>Paragraph 1</p><p>Paragraph 2</p></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div
for p in div.find_all('p'):
    print(p.text)

Demonstrates navigating the parse tree to extract text from nested `<p>` tags.

Selecting elements using CSS selectors

from bs4 import BeautifulSoup
html = '<ul><li>One</li><li>Two</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('ul li')
for item in items:
    print(item.text)

Uses CSS selectors to find all `<li>` elements inside `<ul>`.

Modifying the tree

from bs4 import BeautifulSoup
html = '<p>Old Text</p>'
soup = BeautifulSoup(html, 'html.parser')
soup.p.string = 'New Text'
print(soup.p)

Shows how to modify the text content of an element.
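The same approach extends to attributes and new elements. The following is a minimal sketch, assuming an illustrative `id` value and link URL that are not part of the example above; it sets an attribute with dictionary-style assignment and appends a tag created with `new_tag()`.

```python
from bs4 import BeautifulSoup

html = '<p>Old Text</p>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.p
# Attributes behave like dictionary keys on a tag.
p['id'] = 'intro'  # 'intro' is an illustrative value

# Create a new tag, give it text, and append it inside the paragraph.
link = soup.new_tag('a', href='https://example.com')
link.string = 'Example'
p.append(link)

print(soup)
```

Changes are made in place on the tree, so printing `soup` afterwards shows the modified markup.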

Error Handling

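AttributeError: 'NoneType' object has no attribute 'text': `.find()` and attribute-style access such as `soup.title` return `None` when nothing matches. Check for `None` before accessing the element's attributes or text.
FeatureNotFound: Couldn't find a tree builder with the features you requested: The requested parser is not installed. Install it (`pip install lxml` or `pip install html5lib`) or use the built-in `'html.parser'`.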
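Both errors can be guarded against in a few lines. The following is a minimal sketch, assuming you prefer `lxml` when it is available and accept `'html.parser'` as a fallback; the sample HTML deliberately has no `<title>` tag.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = '<html><body><h1>Hello</h1></body></html>'

# Fall back to the built-in parser if lxml is not installed.
try:
    soup = BeautifulSoup(html, 'lxml')
except FeatureNotFound:
    soup = BeautifulSoup(html, 'html.parser')

# find() returns None when nothing matches, so test before using the result.
title = soup.find('title')
title_text = title.get_text() if title is not None else ''

heading = soup.find('h1')
heading_text = heading.get_text() if heading is not None else ''

print(repr(title_text), repr(heading_text))
```

Returning an empty string for a missing element is only one possible choice; depending on the task, raising a descriptive error or skipping the record may be more appropriate.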

Best Practices

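Always specify a parser: 'html.parser', 'lxml', or 'html5lib'. If you omit it, Beautiful Soup picks the best parser installed, so results can differ between environments.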

Use `.find()` for the first match and `.find_all()` for every match. `.find()` returns `None` and `.find_all()` returns an empty list when nothing matches.

Use `.select()` for CSS selector queries when appropriate.

Check for `None` (or catch `AttributeError`) when an element might not exist.

Beautiful Soup does not fetch pages itself. Combine it with an HTTP library such as `requests` to download them.
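Several of these practices can be combined in one place. The following is a minimal sketch, not a definitive implementation: `extract_links` and `fetch_links` are hypothetical helper names, the 10-second timeout is an arbitrary assumption, and `requests` must be installed separately. Parsing is kept apart from fetching so it can be tested without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, 'html.parser')  # parser named explicitly
    return [urljoin(base_url, a['href'])
            for a in soup.find_all('a', href=True)]


def fetch_links(url, timeout=10):
    """Download a page with requests and return the links it contains."""
    import requests  # third-party; imported here so extract_links works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return extract_links(response.text, response.url)


# Offline check with a literal HTML snippet; the second tag has no href.
sample = '<a href="/docs">Docs</a><a name="top">No href</a>'
print(extract_links(sample, 'https://example.com'))
```

Calling `fetch_links('https://example.com')` would perform the actual download; relative links are resolved against the final URL after any redirects.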