Jsoup

Language: Java

Web / HTML Parsing

Jsoup was developed to simplify HTML parsing in Java. It lets developers query and manipulate messy or malformed HTML in a style similar to jQuery in JavaScript, and it is widely used for web scraping, data extraction, content sanitization, and automated web interactions in Java applications.

Jsoup is a Java library for working with real-world HTML. It provides a convenient API for fetching URLs, parsing HTML, extracting and manipulating data, and cleaning user-submitted content to prevent XSS attacks.

Installation

Maven:

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.16.1</version>
</dependency>

Gradle:

implementation 'org.jsoup:jsoup:1.16.1'

Usage

Jsoup allows connecting to URLs, parsing HTML into a DOM-like structure, querying elements using CSS selectors, and manipulating content. It can also sanitize untrusted HTML and extract data for processing or storage.

Fetching and parsing HTML from a URL

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("https://example.com").get();
System.out.println(doc.title());

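Connects to a web page, parses the HTML, and prints the page title. `get()` throws a checked `IOException`, so the call must sit in a method that handles or declares it.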
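Fetches against real sites usually set a user agent and a timeout before calling `get()`. The following is a minimal sketch, not a recommended configuration: the class name `PageFetcher`, the user-agent string, and the 10-second timeout are placeholders chosen for illustration.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class PageFetcher {
    // Builds a configured request; nothing is sent until get() or execute() is called.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("example-bot/1.0") // hypothetical user-agent string
                .timeout(10_000);              // overall request timeout in milliseconds
    }

    // Sends the GET request and returns the parsed page title.
    static String fetchTitle(String url) throws IOException {
        Document doc = prepare(url).get();
        return doc.title();
    }
}
```

Separating `prepare` from `fetchTitle` keeps the request settings in one place, so every fetch in an application shares the same timeout and user agent.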

Parsing HTML from a string

String html = "<html><body><p>Hello, Jsoup!</p></body></html>";
Document doc = Jsoup.parse(html);
System.out.println(doc.select("p").text());

Parses an HTML string and extracts the text content of the `<p>` element.
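The parser also copes with markup that is not well formed. As a small illustration (the class name `LenientParseDemo` and the sample markup are made up for this sketch), two unclosed paragraphs with no `<html>` or `<body>` wrapper still come out as two separate `<p>` elements:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class LenientParseDemo {
    // Counts the <p> elements Jsoup recovers from the given markup.
    static int paragraphCount(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").size();
    }

    public static void main(String[] args) {
        // Neither paragraph is closed, and there is no <html> or <body> wrapper.
        String messy = "<p>First<p>Second";
        System.out.println(paragraphCount(messy));
    }
}
```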

Selecting elements with CSS selectors

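import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Select every <a> element that carries an href attribute
Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("href") + " -> " + link.text());
}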

Selects all anchor elements with an href attribute and prints their URLs and text.
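Scraped links are often relative. When the document has a base URI, `absUrl("href")` resolves them to absolute URLs. A minimal sketch, assuming the helper name `LinkExtractor` and a base URI supplied by the caller (a document fetched with `Jsoup.connect(...).get()` already carries its own):

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class LinkExtractor {
    // Returns every link in the markup as an absolute URL, in document order.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            String url = link.absUrl("href"); // empty string if it cannot be resolved
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }
}
```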

Extracting and modifying elements

Element paragraph = doc.selectFirst("p");
if (paragraph != null) {  // selectFirst returns null when nothing matches
    paragraph.text("Updated text!");
}

Finds the first paragraph and updates its text content.
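Setter methods on `Element` return the element, so several changes can be chained. The sketch below is illustrative only: the helper name `ParagraphEditor`, the `edited` class, and the `data-source` attribute are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class ParagraphEditor {
    // Rewrites the first <p> in the document; returns false when there is none.
    static boolean updateFirstParagraph(Document doc, String newText) {
        Element p = doc.selectFirst("p");
        if (p == null) {
            return false;
        }
        p.text(newText)                      // replaces the content with escaped text
         .addClass("edited")                 // hypothetical marker class
         .attr("data-source", "jsoup");      // hypothetical attribute
        return true;
    }
}
```

Because `text(...)` escapes its argument, any markup in `newText` ends up as literal text rather than as new elements.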

Sanitizing HTML

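import org.jsoup.safety.Safelist;
String safeHtml = Jsoup.clean(unsafeHtml, Safelist.basic());  // unsafeHtml holds the untrusted input

Removes every tag and attribute that the safelist does not permit, which helps prevent XSS attacks. `Safelist.basic()` allows simple text formatting and links but no images.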
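To make the effect concrete, here is a small sketch (the class name `CommentSanitizer` and the sample input are made up) that cleans a comment and also reports whether the input already conformed to the safelist:

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

final class CommentSanitizer {
    // Returns the input with everything outside Safelist.basic() removed.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    // True only if the input contains nothing the safelist would strip.
    static boolean isAlreadySafe(String untrusted) {
        return Jsoup.isValid(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        String input = "<p>Hi</p><script>alert(1)</script>"
                + "<a href=\"https://example.com\" onclick=\"steal()\">link</a>";
        // The <script> element and the onclick attribute are dropped;
        // the link is kept and gains rel="nofollow".
        System.out.println(sanitize(input));
    }
}
```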

Form submission

Document responsePage = Jsoup.connect("https://example.com/login")
    .data("username", "user")
    .data("password", "pass")
    .post();

Sends the form fields as a POST request and parses the page the server returns.
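A login usually matters only because of the session cookies it sets. One way to carry them into later requests is to call `execute()` instead of `post()` and reuse `Response.cookies()`. This is a sketch under assumptions: the class name `LoginSession` is invented, the field names follow the example above, and a real form may require extra fields (for instance a hidden token) that are not shown here.

```java
import java.io.IOException;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class LoginSession {
    // Builds the POST request; nothing is sent until execute() is called.
    static Connection loginRequest(String url, String user, String password) {
        return Jsoup.connect(url)
                .data("username", user)
                .data("password", password)
                .method(Connection.Method.POST);
    }

    // Logs in, then fetches another page with the cookies the login returned.
    static Document fetchAfterLogin(String loginUrl, String pageUrl,
                                    String user, String password) throws IOException {
        Connection.Response login = loginRequest(loginUrl, user, password).execute();
        Map<String, String> cookies = login.cookies();
        return Jsoup.connect(pageUrl).cookies(cookies).get();
    }
}
```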

Error Handling

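IOException: Thrown when the connection fails or the URL cannot be reached. Its subclasses give more detail: `HttpStatusException` for an error status such as 404 or 500, `UnsupportedMimeTypeException` for a content type Jsoup will not parse, and `SocketTimeoutException` when the request times out.

IllegalArgumentException: Thrown when an argument is invalid, for example an empty selector string or a malformed URL passed to `Jsoup.connect()`. Malformed HTML does not raise an exception, because the parser repairs it. A selector with invalid syntax raises `Selector.SelectorParseException` instead.

NullPointerException: `selectFirst()` returns null when nothing matches, so calling a method on the result without a null check fails. `select()` never returns null; it returns an empty `Elements` collection.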
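The cases above can be handled along these lines. This is a minimal sketch, assuming a helper class named `SafeScrape` and simple logging to standard error; a real application would likely retry or report failures differently.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.Optional;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class SafeScrape {
    // Wraps selectFirst so a missing element becomes an empty Optional, not a null.
    static Optional<String> firstText(Document doc, String cssQuery) {
        Element match = doc.selectFirst(cssQuery);
        return match == null ? Optional.empty() : Optional.of(match.text());
    }

    // Fetches a page, catching the more specific IOException subclasses first.
    static Optional<Document> tryFetch(String url) {
        try {
            return Optional.of(Jsoup.connect(url).get());
        } catch (HttpStatusException e) {
            System.err.println("HTTP " + e.getStatusCode() + " for " + e.getUrl());
        } catch (SocketTimeoutException e) {
            System.err.println("Timed out: " + url);
        } catch (IOException e) {
            System.err.println("Fetch failed: " + e.getMessage());
        }
        return Optional.empty();
    }
}
```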

Best Practices

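`get()` and `post()` read the whole response and release the connection themselves, so no explicit close is needed. If you read the body as a stream via `execute().bodyStream()`, close that stream when you are done.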

Use CSS selectors for efficient element extraction.

Sanitize any user-submitted HTML before storing or displaying it.

Handle network exceptions when connecting to remote pages.

Use caching or throttling to avoid overloading target websites during scraping.
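Throttling can be as simple as enforcing a minimum gap between requests. The sketch below is one possible approach rather than a Jsoup feature: the class `PoliteFetcher` and its methods are hypothetical, and the delay is whatever the caller considers polite for the target site.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class PoliteFetcher {
    private final long minDelayMillis;
    private boolean hasRequested = false;
    private long lastRequestAt = 0L;

    PoliteFetcher(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Records the time at which a request was started.
    synchronized void markRequest(long nowMillis) {
        hasRequested = true;
        lastRequestAt = nowMillis;
    }

    // Milliseconds still to wait before the next request may start.
    synchronized long waitMillis(long nowMillis) {
        if (!hasRequested) {
            return 0L;
        }
        return Math.max(0L, minDelayMillis - (nowMillis - lastRequestAt));
    }

    // Sleeps if the previous request was too recent, then fetches the page.
    synchronized Document fetch(String url) throws IOException, InterruptedException {
        long wait = waitMillis(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(System.currentTimeMillis());
        return Jsoup.connect(url).get();
    }
}
```

Passing the clock value into `waitMillis` keeps the timing logic checkable without any network access.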
