How to set up a proxy for Scrapy
Scrapy is a powerful web crawling and scraping framework that enables developers to easily extract data from websites. However, websites sometimes block or throttle requests coming from a single IP address, making it difficult to collect the data you need. To get around this, Scrapy lets you set up proxies to rotate IP addresses and bypass these restrictions. In this article, we'll explore how to set up a proxy for Scrapy, both globally and for specific requests, so you can collect data from websites more reliably.
The easiest way to set up a proxy for Scrapy is to use the scrapy-proxyport package.
Or, if you want to manually set the proxy, keep reading.
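As an aside, if every request should go through a single proxy, Scrapy's built-in HttpProxyMiddleware honours the standard http_proxy/https_proxy environment variables. A minimal sketch (the proxy address is a placeholder, and this assumes settings.py is loaded before the middlewares are created, which is Scrapy's normal startup order):

```python
# settings.py (excerpt) -- HttpProxyMiddleware reads the standard proxy
# environment variables at startup, so setting them here routes every
# request through one proxy
import os

os.environ['http_proxy'] = 'http://proxy.example.com:8080'
os.environ['https_proxy'] = 'http://proxy.example.com:8080'
```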
Proxy for a specific request
To set a proxy for a specific request in Scrapy, you can pass the proxy in the meta parameter when creating the Request object. Here's an example code snippet:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://www.example.com']

    def start_requests(self):
        # Set a proxy for the first request
        yield scrapy.Request(
            url=self.start_urls[0],
            meta={'proxy': 'http://proxy.example.com:8080'},
        )

    def parse(self, response):
        # Make a new request through a different proxy
        yield scrapy.Request(
            url='http://www.example.com/page2',
            meta={'proxy': 'http://proxy2.example.com:8080'},
            callback=self.parse_page,
        )

    def parse_page(self, response):
        # Do something with the response
        pass
```
In the example above, the first request goes through the http://proxy.example.com:8080 proxy, while the second request goes through the http://proxy2.example.com:8080 proxy. You can also combine the proxy with other Request parameters, such as headers or cookies, which are passed as their own arguments alongside meta, as shown below.
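For instance, here's a minimal sketch combining a proxy with custom headers and cookies (the spider name, URL, header, and cookie values are placeholders):

```python
import scrapy


class HeadersSpider(scrapy.Spider):
    # Hypothetical spider, purely for illustration
    name = 'headers_example'

    def start_requests(self):
        yield scrapy.Request(
            url='http://www.example.com/page3',
            # headers and cookies are regular Request arguments...
            headers={'User-Agent': 'my-custom-agent'},
            cookies={'session': 'abc123'},
            # ...while the proxy still travels in meta
            meta={'proxy': 'http://proxy.example.com:8080'},
        )

    def parse(self, response):
        pass
```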
Proxy as middleware
Our scrapy-proxyport package will do all of this work for you. If you want to do it manually, keep reading. To set up a rotating proxy as a downloader middleware, follow these steps:
- Create a new Scrapy project or open an existing one.
- Open the settings.py file in your project directory.
- Locate the DOWNLOADER_MIDDLEWARES setting and register the two middlewares shown in the first snippet below.
- Create a new file called middlewares.py in your project directory and add the ProxyMiddleware class shown in the second snippet below.
- Update the proxy_list variable with your own list of proxies.
In settings.py:

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
    'your_project_name.middlewares.ProxyMiddleware': 100,
}
```
In middlewares.py:

```python
import random


class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # List of proxies to use
        proxy_list = [
            'http://proxy1.example.com:8080',
            'http://proxy2.example.com:8080',
            'http://proxy3.example.com:8080',
        ]

        # Choose a random proxy from the list
        proxy = random.choice(proxy_list)

        # Set the proxy for the request
        request.meta['proxy'] = proxy
```
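If your proxies require authentication, Scrapy's built-in HttpProxyMiddleware reads credentials embedded in the proxy URL, so entries like the following (placeholder credentials and hosts) work without extra code:

```python
# Proxies with embedded credentials; HttpProxyMiddleware turns the
# user:password part into a Proxy-Authorization header for you
proxy_list = [
    'http://user:password@proxy1.example.com:8080',
    'http://user:password@proxy2.example.com:8080',
]
```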
Scrapy will now pick a random proxy from the proxy_list variable for each request it makes.
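To confirm the rotation works, you can run a quick check against a service that echoes the caller's IP address (https://httpbin.org/ip is used here purely as an illustration):

```python
import scrapy


class IpCheckSpider(scrapy.Spider):
    # A throwaway spider for verifying the proxy setup
    name = 'ip_check'

    def start_requests(self):
        # httpbin.org/ip responds with the IP the request came from
        for _ in range(3):
            yield scrapy.Request('https://httpbin.org/ip', dont_filter=True)

    def parse(self, response):
        # With the middleware enabled, this should log proxy IPs, not yours
        self.logger.info('Exit IP: %s', response.text)
```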
See also:
- scrapy-proxyport - Proxy Port middleware for Scrapy