Scraping
jsoup
Jsoup is a Java library used for parsing and manipulating HTML documents. It provides an easy-to-use API for extracting and modifying data from HTML.
Norconex
Norconex Web Crawler is an open-source tool for web crawling and data extraction, used to collect data from websites for business analysis and insights.
Guzzle
Guzzle is a PHP HTTP client library that simplifies sending requests to websites and APIs, providing easy-to-use features for web scraping tasks.
Apache Nutch
Apache Nutch is an open-source web crawler and search engine software that enables the crawling and indexing of large volumes of web content.
Ayakashi.io
Ayakashi.io is a web scraping framework built on Node.js, featuring headless browser automation and support for scraping dynamic web pages and single-page applications.
Crawlee
Crawlee lets you scrape and crawl websites with built-in concurrency, rate limiting, retries, proxy support, and custom headers, and extracts data from any website using CSS selectors or regular expressions.
Playwright
Playwright is a Node.js library for automating web browsers such as Chromium, Firefox, and WebKit, providing a high-level API to control web apps.
Node Crawler
Node Crawler is a powerful web crawling and scraping library for Node.js that supports features like proxying, rate limiting, and jQuery integration.
Ruia
Ruia is an asynchronous Python web scraping micro-framework with a simple API, extensibility, and support for various types of web content.
AutoScraper
AutoScraper is a Python library that automates web scraping: given a page and sample values you want, it learns extraction rules and applies them to similar pages efficiently and accurately.
StormCrawler
StormCrawler is an open-source web crawling framework built on Apache Storm for scalable, customizable, and distributed web crawling and data extraction.
MechanicalSoup
MechanicalSoup is a Python library for automating web browsing and form filling, built on top of Requests and Beautiful Soup libraries.
Beautiful Soup
Beautiful Soup is a Python library for parsing HTML and XML documents, providing an easy-to-use interface for navigating parse trees.
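A minimal sketch of navigating a parse tree with Beautiful Soup, assuming the bs4 package is installed; the HTML snippet is made up and inlined so the example runs without network access.

```python
from bs4 import BeautifulSoup

# A small inline document so the example needs no network access.
html = """
<html><body>
  <h1>Tools</h1>
  <ul>
    <li class="tool">jsoup</li>
    <li class="tool">Scrapy</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate the parse tree: grab the heading and every matching list item.
title = soup.h1.get_text()
tools = [li.get_text() for li in soup.select("li.tool")]

print(title)  # Tools
print(tools)  # ['jsoup', 'Scrapy']
```

The same calls work on real pages: fetch the HTML with any HTTP client, then hand the response body to `BeautifulSoup`.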
Kimurai
Kimurai is a flexible and lightweight web scraping framework for Ruby, with a simple syntax, great performance, and built-in support for multiple storage formats.
Jaunt
Jaunt is a Java-based web scraping and automation framework that provides a simple, flexible, and robust API for extracting data from web pages.
Cheerio
Cheerio is a great choice for web scraping in JavaScript because of its simplicity, speed, and flexibility, as well as its compatibility with Node.js and active community.
Go Colly
Go Colly is a Go-based web scraping framework that provides a simple and efficient way to extract data from websites, with support for parallel requests, caching, and extensive customization.
Puppeteer
Puppeteer is a Node.js library developed by Google, which provides a high-level API for controlling a headless Chrome browser. It enables developers to automate tasks, scrape data, and test web applications.
Apify
Apify is a powerful web scraping and automation platform that provides users with a comprehensive suite of tools for extracting data from websites, automating workflows, and deploying web crawlers in the cloud.
Scrapy
Scrapy is a Python framework for web crawling and scraping, used for data mining, information processing, and automated testing.
What is a Proxy
Why you need a proxy and how it works, and how a proxy can help you stay more anonymous and access blocked sites.
Proxy for Web Scraping
A good proxy solution for Web Scraping should provide a proxy rotation mechanism and should be easily pluggable into scraping frameworks.
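A minimal sketch of the rotation mechanism described above: a round-robin pool in plain Python. The proxy addresses are hypothetical placeholders; the returned mapping uses the shape Python's `requests` library expects for its `proxies` argument.

```python
from itertools import cycle

# Hypothetical proxy endpoints -- substitute your provider's addresses.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

proxy_pool = cycle(PROXIES)  # round-robin rotation, repeats forever

def next_proxy() -> dict:
    """Return the next proxy as a mapping usable by HTTP clients."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}

# Each call hands back the next proxy in the pool, wrapping around:
for _ in range(4):
    print(next_proxy()["http"])
```

Plugging this into a scraping framework usually means calling `next_proxy()` once per request, so successive requests leave from different IP addresses.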