Language: Java
Web / HTML Parsing
Developed to simplify HTML parsing in Java, Jsoup lets developers work with messy or malformed HTML through a jQuery-like API for selecting and manipulating elements. It is widely used for web scraping, data extraction, content sanitization, and automated web interactions in Java applications.
Jsoup is a Java library for working with real-world HTML. It provides a convenient API for fetching URLs, parsing HTML, extracting and manipulating data, and cleaning user-submitted content to prevent XSS attacks.
Maven:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version>
</dependency>

Gradle:

implementation 'org.jsoup:jsoup:1.16.1'

Jsoup allows connecting to URLs, parsing HTML into a DOM-like structure, querying elements using CSS selectors, and manipulating content. It can also sanitize untrusted HTML and extract data for processing or storage.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("https://example.com").get();
System.out.println(doc.title());

Connects to a web page, parses the HTML, and prints the page title.
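In practice you will often want to identify your client and bound how long a request may take. A minimal sketch using Jsoup's connection settings (the user-agent string is illustrative):

Document doc = Jsoup.connect("https://example.com")
        .userAgent("Mozilla/5.0 (compatible; MyBot/1.0)") // identify the client; some sites block the default agent
        .timeout(10_000)                                  // fail after 10 seconds instead of hanging
        .get();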
String html = "<html><body><p>Hello, Jsoup!</p></body></html>";
Document doc = Jsoup.parse(html);
System.out.println(doc.select("p").text());

Parses an HTML string and extracts the text content of the `<p>` element.
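select() accepts most CSS selector syntax, not just bare tag names. A brief sketch of common patterns (the ids and class names here are illustrative, not part of the page above):

Elements intro = doc.select("div#content p.intro"); // descendant, id, and class selectors
Elements secure = doc.select("a[href^=https]");     // attribute prefix match
Element firstItem = doc.selectFirst("ul > li");     // direct-child combinator, first match only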
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("href") + " -> " + link.text());
}

Selects all anchor elements with an href attribute and prints their URLs and text.
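Pages often use relative hrefs. When the document was fetched with connect() (or parsed with a base URI via Jsoup.parse(html, baseUri)), relative URLs can be resolved to absolute ones with absUrl; a small sketch, assuming doc came from the earlier connect() call:

for (Element link : doc.select("a[href]")) {
    System.out.println(link.absUrl("href")); // resolves relative hrefs against the page's base URI
}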
Element paragraph = doc.selectFirst("p");
paragraph.text("Updated text!");

Finds the first paragraph and updates its text content.
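Beyond replacing text, elements can be mutated in place. A short sketch of common modifications (the attribute and class names are illustrative):

paragraph.attr("data-edited", "true");             // set an attribute
paragraph.addClass("highlight");                   // add a CSS class
paragraph.after("<p>A new sibling paragraph</p>"); // insert HTML after the element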
import org.jsoup.safety.Safelist;

String safeHtml = Jsoup.clean(unsafeHtml, Safelist.basic());

Cleans untrusted HTML to allow only safe tags and attributes, preventing XSS attacks.
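A concrete sketch of how cleaning behaves with different safelists (the input string is illustrative):

String unsafeHtml = "<p>Hi <a href='https://example.com/' onclick='steal()'>link</a><script>alert(1)</script></p>";
System.out.println(Jsoup.clean(unsafeHtml, Safelist.none()));  // strips all tags, keeping only the text
System.out.println(Jsoup.clean(unsafeHtml, Safelist.basic())); // keeps simple tags and safe links; drops the script and the onclick attribute

Safelist.relaxed() permits a broader tag set (including images and tables) when more formatting must survive cleaning.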
Document loginForm = Jsoup.connect("https://example.com/login")
        .data("username", "user")
        .data("password", "pass")
        .post();

Submits a form programmatically via POST using Jsoup.
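To stay logged in across subsequent requests, the session cookies from the login response can be carried forward. A sketch, assuming the same login endpoint and a hypothetical /dashboard page:

import org.jsoup.Connection;
import java.util.Map;

Connection.Response res = Jsoup.connect("https://example.com/login")
        .data("username", "user")
        .data("password", "pass")
        .method(Connection.Method.POST)
        .execute();                       // execute() returns the raw response, including cookies
Map<String, String> cookies = res.cookies();
Document dashboard = Jsoup.connect("https://example.com/dashboard")
        .cookies(cookies)                 // send the session cookies with the next request
        .get();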
Close any response streams you open yourself (e.g., from Connection.Response.bodyStream()); plain connect(...).get() calls manage their connections internally.
Use CSS selectors for efficient element extraction.
Sanitize any user-submitted HTML before storing or displaying it.
Handle network exceptions (e.g., IOException, HttpStatusException) when connecting to remote pages; see the sketch after this list.
Use caching or throttling to avoid overloading target websites during scraping.
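A minimal sketch of the last two points combined, assuming a hypothetical list of page URLs and a fixed one-second delay between requests:

import org.jsoup.HttpStatusException;
import java.io.IOException;
import java.util.List;

List<String> urls = List.of("https://example.com/a", "https://example.com/b"); // illustrative URLs
for (String url : urls) {
    try {
        Document page = Jsoup.connect(url).get();
        System.out.println(page.title());
    } catch (HttpStatusException e) {
        System.err.println("HTTP " + e.getStatusCode() + " for " + url); // non-2xx response
    } catch (IOException e) {
        System.err.println("Network error for " + url + ": " + e.getMessage());
    }
    try {
        Thread.sleep(1000); // simple throttle: pause between requests
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
}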