jsoup

  1. How it works
  2. Make request
  3. Proxy

Jsoup is a Java library used for parsing HTML documents and manipulating them using the Document Object Model (DOM). It provides a simple and intuitive API for extracting and manipulating data from HTML pages.

With Jsoup, you can parse HTML documents from a URL, file, or string, and then extract specific elements, such as headings, links, images, and tables. You can also modify the HTML content by adding, removing, or modifying elements, attributes, and text.

Jsoup is particularly useful for web scraping, which is the process of automatically extracting data from websites. It is often used in web crawlers, data mining applications, and other projects that require automated data extraction from HTML pages.

Jsoup is open source and available under the MIT License. It was developed by Jonathan Hedley and is actively maintained and updated by a community of developers.

How it works

Jsoup works by parsing HTML documents and creating a DOM tree structure that represents the HTML elements and their relationships. This allows developers to easily manipulate and extract data from the HTML document using Java code.
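The tree structure described above can be explored directly from Java. The following is a minimal sketch, assuming a small hard-coded HTML string (the markup, the class name DomTreeSketch, and the helper methods are made up for illustration); it parses the string and then walks from one element to its parent and its children.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomTreeSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<html><head><title>Sample</title></head>"
        + "<body><div id=\"menu\"><a href=\"/a\">A</a><a href=\"/b\">B</a></div></body></html>";

    /** Returns the tag name of the parent of the element with the given id. */
    static String parentTagOf(String id) {
        Document doc = Jsoup.parse(HTML);
        Element element = doc.getElementById(id);
        return element.parent().tagName();
    }

    /** Returns the number of direct child elements of the element with the given id. */
    static int childCountOf(String id) {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementById(id).children().size();
    }

    public static void main(String[] args) {
        System.out.println(parentTagOf("menu"));  // prints "body"
        System.out.println(childCountOf("menu")); // prints 2
    }
}
```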

Here are the general steps for using Jsoup:
  • Load an HTML document: You can load an HTML document from a URL using Jsoup.connect(), or from a file or string using Jsoup.parse(). To parse a fragment of HTML rather than a full page, use Jsoup.parseBodyFragment().

  • Extract data: Once you have loaded the HTML document, you can use Jsoup's API to extract data from specific elements in the document. For example, you can use the getElementById(), getElementsByTag(), or getElementsByClass() methods to select specific elements in the HTML document.

  • Modify the HTML content: Jsoup allows you to modify the HTML content by adding, removing, or modifying elements, attributes, and text. For example, you can use the append(), prepend(), remove(), and attr() methods to manipulate the HTML content.

  • Save the modified HTML document: Once you have made changes to the HTML document, you can get the result as a string using the toString() or outerHtml() methods, and then print it to the console or write it to a file with standard Java I/O.

Jsoup's API is designed to be simple and intuitive, making it easy for developers to extract and manipulate data from HTML documents. It also provides support for advanced features such as handling invalid HTML, cleaning and sanitizing HTML, and working with XML documents.
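The four steps above might look roughly like this in code. This is only a sketch under stated assumptions: the HTML string, the class name JsoupStepsSketch, the helper method names, and the out.html file name are all invented for the example, and the document is parsed from a string so that no network access is needed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupStepsSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<html><head><title>Demo</title></head><body>"
        + "<h1 id=\"main\">Hello</h1>"
        + "<p class=\"note\">First</p><p class=\"note\">Second</p>"
        + "<a href=\"/old\">link</a>"
        + "</body></html>";

    /** Step 1: load a document, here from a string. */
    static Document load() {
        return Jsoup.parse(HTML);
    }

    /** Step 2: extract data by id and by class name. */
    static String headingText(Document doc) {
        return doc.getElementById("main").text();
    }

    static int noteCount(Document doc) {
        return doc.getElementsByClass("note").size();
    }

    /** Step 3: change an attribute, append an element, and remove one. */
    static void modify(Document doc) {
        doc.getElementsByTag("a").first().attr("href", "/new");
        doc.body().append("<p>Added</p>");
        doc.getElementsByClass("note").last().remove();
    }

    /** Step 4: serialize with outerHtml() and write the string to a file. */
    static void save(Document doc, Path target) throws IOException {
        Files.write(target, doc.outerHtml().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Document doc = load();
        System.out.println(headingText(doc));
        modify(doc);
        System.out.println(doc.outerHtml());
        save(doc, Paths.get("out.html"));
    }
}
```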

Make request

The Jsoup.connect() method is used to connect to a URL and retrieve the HTML document at that URL. Here's how it works:
  • First, you create a connection to the URL by calling the Jsoup.connect() method and passing in the URL as a string parameter. This returns a Connection object.

  • You can then set various properties of the connection, such as the user agent, timeout, and request headers, by calling methods on the Connection object.

  • Once you have configured the connection, you can retrieve the HTML document by calling the get() method on the Connection object. This sends a GET request to the server and returns the HTML document as a Document object.
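As a rough illustration of these three steps, the sketch below builds a configured Connection and then fetches a page. The URL, the user agent string, the header value, the 10-second timeout, and the names FetchSketch and configure are placeholder assumptions, not values the library requires.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchSketch {
    /** Creates a Connection for the URL and sets a few common properties. */
    static Connection configure(String url) {
        return Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)") // placeholder user agent
            .timeout(10_000)                                       // timeout in milliseconds
            .header("Accept-Language", "en");                      // extra request header
    }

    public static void main(String[] args) throws IOException {
        // get() sends the GET request and parses the response into a Document.
        Document doc = configure("https://example.com/").get();
        System.out.println(doc.title());
    }
}
```

Note that get() throws an IOException if the request fails, so callers need to handle or declare it.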

Proxy

Jsoup supports the use of proxies when making HTTP requests. You can use the proxy() method on the Connection object to set a proxy for the connection.

After configuring the connection, you can call the get() method to retrieve the HTML document from the URL; the HTTP request is then sent through the proxy.
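A minimal sketch of setting a proxy is shown below, assuming a proxy at proxy.example.com:8080 (a placeholder, as are the class and method names). The proxy() method accepts either a host and port or a java.net.Proxy object; both variants are shown.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ProxySketch {
    // Placeholder proxy address; replace with a real proxy host and port.
    static final String PROXY_HOST = "proxy.example.com";
    static final int PROXY_PORT = 8080;

    /** Sets the proxy by host name and port. */
    static Connection viaHostAndPort(String url) {
        return Jsoup.connect(url).proxy(PROXY_HOST, PROXY_PORT);
    }

    /** Sets the proxy with an explicit java.net.Proxy object. */
    static Connection viaProxyObject(String url) {
        Proxy proxy = new Proxy(
            Proxy.Type.HTTP,
            InetSocketAddress.createUnresolved(PROXY_HOST, PROXY_PORT));
        return Jsoup.connect(url).proxy(proxy);
    }

    public static void main(String[] args) throws IOException {
        // The request is routed through the proxy when get() is called.
        Document doc = viaHostAndPort("https://example.com/").get();
        System.out.println(doc.title());
    }
}
```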

Note that some proxy servers require a username and password. The Proxy object itself does not hold credentials; instead, you can create a java.net.Authenticator and register it with Authenticator.setDefault(), which supplies the credentials when the proxy server asks for them.
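One way to supply proxy credentials is the standard java.net.Authenticator mechanism, sketched below. The user name, password, proxy address, and the names ProxyAuthSketch and installProxyCredentials are placeholders for illustration. The authenticator is installed JVM-wide, so this sketch only answers requests that come from a proxy.

```java
import java.io.IOException;
import java.net.Authenticator;
import java.net.PasswordAuthentication;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ProxyAuthSketch {
    /** Registers a JVM-wide authenticator that answers proxy challenges only. */
    static void installProxyCredentials(final String user, final String password) {
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(user, password.toCharArray());
                }
                return null; // no credentials for non-proxy requests
            }
        });
    }

    public static void main(String[] args) throws IOException {
        installProxyCredentials("user", "secret"); // placeholder credentials
        Document doc = Jsoup.connect("https://example.com/")
            .proxy("proxy.example.com", 8080)      // placeholder proxy address
            .get();
        System.out.println(doc.title());
    }
}
```

Depending on the JDK version, Basic authentication for HTTPS tunneling through a proxy may be disabled by default (see the jdk.http.auth.tunneling.disabledSchemes system property), so this approach may need extra configuration for HTTPS URLs.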
