Language: Python
Web
Beautiful Soup was created by Leonard Richardson in 2004. It was designed to handle poorly-formed HTML and XML documents gracefully, making it ideal for web scraping tasks where the markup is inconsistent. It has become a widely used tool in data extraction, web scraping, and automated web interactions.
Beautiful Soup is a Python library for parsing HTML and XML documents. It builds a parse tree from each document, which you can navigate and search to extract data, making web scraping easier and more reliable.
pip install beautifulsoup4
conda install -c anaconda beautifulsoup4
Beautiful Soup provides Pythonic methods and attributes to navigate, search, and modify a parse tree. You can easily extract tags, attributes, text, or nested elements from HTML/XML documents.
from bs4 import BeautifulSoup
html = '<html><head><title>Test</title></head><body><h1>Hello</h1></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)
Parses an HTML string and extracts the text content of the `<title>` tag.
from bs4 import BeautifulSoup
html = '<a href="https://example.com">Example</a>'
soup = BeautifulSoup(html, 'html.parser')
links = [a['href'] for a in soup.find_all('a')]
print(links)
Extracts all URLs from `<a>` tags in the HTML.
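The list comprehension above assumes every `<a>` tag has an `href` attribute; indexing with `a['href']` raises `KeyError` when it is missing. The following is a minimal sketch of two ways to guard against that. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the first anchor has no href attribute.
html = '<a id="top">Anchor only</a><a href="https://example.com">Example</a>'
soup = BeautifulSoup(html, 'html.parser')

# href=True limits the search to tags that actually carry an href attribute.
links_strict = [a['href'] for a in soup.find_all('a', href=True)]

# Tag.get() returns None instead of raising KeyError for a missing attribute.
links_lenient = [a.get('href') for a in soup.find_all('a')]

print(links_strict)
print(links_lenient)
```

Filtering with `href=True` is usually the simpler choice when only real links are wanted; `.get()` is useful when the missing cases need to be counted or logged.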
from bs4 import BeautifulSoup
html = '<div><p>Paragraph 1</p><p>Paragraph 2</p></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div
for p in div.find_all('p'):
    print(p.text)
Demonstrates navigating the parse tree to extract text from nested `<p>` tags.
from bs4 import BeautifulSoup
html = '<ul><li>One</li><li>Two</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('ul li')
for item in items:
    print(item.text)
Uses CSS selectors to find all `<li>` elements inside `<ul>`.
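CSS selectors can also match on ids, classes, and direct-child relationships. The sketch below uses `.select()` and `.select_one()` on made-up markup; the `menu` id and `active` class are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Illustrative markup with an id and a class to select on.
html = '<ul id="menu"><li class="active">One</li><li>Two</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# select() returns a list of every match; '>' matches direct children only.
all_items = [li.text for li in soup.select('#menu > li')]

# select_one() returns the first match, or None when nothing matches.
active = soup.select_one('#menu li.active')
missing = soup.select_one('li.missing')

print(all_items)
print(active.text)
print(missing)
```

Because `.select_one()` returns `None` on a miss, check the result before reading `.text` when the element is not guaranteed to exist.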
from bs4 import BeautifulSoup
html = '<p>Old Text</p>'
soup = BeautifulSoup(html, 'html.parser')
soup.p.string = 'New Text'
print(soup.p)
Shows how to modify the text content of an element.
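Modification is not limited to text. As a rough sketch on made-up markup, the example below also sets an attribute and removes an element from the tree; the `intro` id is a placeholder.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one paragraph to edit and one span to remove.
html = '<div><p>Old Text</p><span>remove me</span></div>'
soup = BeautifulSoup(html, 'html.parser')

soup.p.string = 'New Text'   # replace the paragraph's text
soup.p['id'] = 'intro'       # attributes are set like dictionary keys
soup.span.decompose()        # remove the <span> and its contents from the tree

result = str(soup.div)
print(result)
```

Changes apply to the in-memory parse tree only; convert the tree back to a string with `str()` to save the modified markup.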
Always specify a parser explicitly: 'html.parser' (included with Python), or 'lxml' or 'html5lib' (each installed separately).
Use `.find()` or `.find_all()` for reliable element searches.
Use `.select()` for CSS selector queries when appropriate.
Handle missing elements: `.find()` returns `None` when nothing matches, so check the result before reading its text or attributes.
Combine with `requests` for fetching web pages efficiently.
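Several of these practices can be combined in one place. The sketch below assumes the `requests` package is installed; `extract_title` and `fetch_title` are hypothetical helper names, not part of either library. Parsing is kept separate from fetching so it can be checked without network access.

```python
import requests
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the page title text, or None if there is no <title> tag."""
    soup = BeautifulSoup(html, 'html.parser')  # parser named explicitly
    title = soup.find('title')                 # None when nothing matches
    return title.get_text(strip=True) if title is not None else None


def fetch_title(url, timeout=10):
    """Hypothetical helper: download a page and return its title."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return extract_title(response.text)


# Offline check of the parsing half; no network request is made here.
sample = '<html><head><title> Test </title></head><body></body></html>'
print(extract_title(sample))
print(extract_title('<p>No title here</p>'))
```

Calling `fetch_title('https://example.com')` would perform the actual request; the URL is a placeholder.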