While Scrapy makes it easy for you to develop scalable web scrapers and crawlers, without proxies they are useless in many situations. Let me show you how to set up proxies for Scrapy and the best proxies to use.
As a developer interested in web scraping, you should keep at the back of your mind that even though web scraping is ethical and legal, websites do not like their pages and data to be scraped, and they will do anything at their disposal to stop you. Fortunately for us, only a few effective techniques are available to them, with IP tracking as the most important one, since IP addresses are the universal means of identification in computer networks such as the Internet. While IP tracking seems effective to them, we, as web scrapers and automation developers, are fortunate enough to have proxies that make IP tracking and blocking far less effective.
Scrapy is a popular web scraping framework you can use to develop scalable scrapers and crawlers. As a web scraping tool, Scrapy has support for proxies, and you will most likely make use of proxies in your scraping project. If you do not know how to set up proxies in Scrapy, keep reading, and I will show you how to get it done in this article.
Scrapy – An Overview
Unlike the likes of Requests and BeautifulSoup, Scrapy is a complete web scraping and crawling framework: you can use it not just for sending HTTP requests but also for parsing HTML documents and carrying out other tasks. In fact, Scrapy alone is like the combination of Requests, BeautifulSoup, and other scraping libraries. One thing you will come to like about this tool is that it is extensible, giving you the option to add custom functionality. With Scrapy, you can not only build a web scraper or crawler, but you can also easily deploy it to the cloud.
This scraping framework was developed by Scrapinghub (now Zyte), a popular data service provider with an interest also in the development of data extraction tools. Scrapy was first released in 2008. The tool was written in Python, for Python spider development. It is arguably the fastest Python scraping framework, and also the most popular and quite powerful. The major problems associated with Scrapy are its learning curve and the fact that it is helpless when faced with a JavaScript-rich website.
Scrapy Proxy Settings
Compared to the combo of Requests and BeautifulSoup, Scrapy has a steeper learning curve. However, there is no doubt that it is more scalable and better suited for complex development. When it comes to setting proxies, it might interest you to know that the procedures are actually very easy. There are two methods you can follow to set up proxies in Scrapy. These are discussed below.
Method 1: Setting Proxies by Passing Them as a Request Parameter
The easiest method of setting proxies in Scrapy is by passing the proxy as a request parameter. This method is perfect if you want to make use of a specific proxy. There is a middleware in Scrapy called HttpProxyMiddleware, which takes the proxy value from the request meta and sets it up properly. Below is sample code showing how to set up a proxy in Scrapy via the request meta parameter.
```python
from scrapy import Request

# Inside your spider class:
def start_requests(self):
    for url in self.start_urls:
        # Use yield (not return) so that every start URL gets scheduled
        yield Request(
            url=url,
            callback=self.parse,
            headers={"User-Agent": "scrape web"},
            meta={"proxy": "http://154.112.82.262:8050"},  # sample address; replace with your own proxy
        )
```
Method 2: Create Custom Proxy Middleware
For a more modular approach, I would advise you to create a custom middleware. I will show you how to create a custom proxy middleware and add it to your list of middlewares. A middleware is basically a piece of code that Scrapy runs when processing requests. Below is a custom middleware template you can use.
```python
from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = "http://192.168.1.1:8050"
        request.headers["Proxy-Authorization"] = basic_auth_header(
            "<proxy_user>", "<proxy_pass>"
        )
```
After writing the above, you can then enable it in your settings and place it before the HttpProxyMiddleware, as shown below.
```python
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
```
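Building on the custom-middleware approach, you can also rotate through a list of proxies instead of using a fixed one. Below is a minimal sketch of such a middleware; the class name and proxy addresses are my own placeholders, not part of Scrapy itself.

```python
import random

# Hypothetical proxy list - replace these placeholder addresses
# with proxies from your provider.
PROXY_LIST = [
    "http://192.168.1.1:8050",
    "http://192.168.1.2:8050",
    "http://192.168.1.3:8050",
]

class RotatingProxyMiddleware(object):
    def process_request(self, request, spider):
        # Scrapy calls process_request for every outgoing request;
        # attaching a random proxy here rotates IPs per request.
        request.meta["proxy"] = random.choice(PROXY_LIST)
```

You would enable it in DOWNLOADER_MIDDLEWARES the same way as the CustomProxyMiddleware above, with a priority below 400 so it runs before HttpProxyMiddleware.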
How to Verify Scrapy Proxies
If you followed either of the procedures above and filled in the correct proxy details, then your proxies have been set up from the Scrapy end. But are they working? You will need to test them. To do that, send a request to an IP lookup tool such as Whatismyip.com: if it shows your real IP address, then something is wrong, either with the setup or with the proxies you are trying to use.
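If you prefer a quick standalone check outside Scrapy, the sketch below uses Python's built-in urllib to route a request through the proxy and read back the IP that the remote service reports. The helper names are my own, and I am assuming the httpbin.org/ip echo service; the proxy address is a placeholder.

```python
import json
import urllib.request

def build_proxy_opener(proxy_url):
    """Build a urllib opener that sends all traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

def reported_ip(body):
    """Extract the origin IP from an httpbin.org/ip JSON response body."""
    return json.loads(body)["origin"]

# Usage (requires a live proxy):
# opener = build_proxy_opener("http://<proxy_host>:8050")
# body = opener.open("https://httpbin.org/ip", timeout=10).read()
# print(reported_ip(body))  # should NOT be your real IP
```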
Residential Proxy Pool for Scrapy Projects
Scrapinghub (now Zyte) has its own proxy service, formerly known as Crawlera, which you can use together with Scrapy; it is priced per request. If you are looking to use a proxy pool with Scrapy, then you can make a choice from one of the proxy providers below. They are all residential proxy providers, we have tested their proxies, and they can be trusted to sell only working proxies. You will want rotating proxies to avoid the hurdle of rotating IPs yourself and managing a list of proxies that could go bad quickly.
Brightdata (Luminati)
- IP Pool Size: Over 72 million
- Locations: All countries in the world
- Concurrency Allowed: Unlimited
- Bandwidth Allowed: Starts at 40GB
- Cost: Starts at $500 monthly for 40GB
Brightdata, formerly known as Luminati, is arguably the best proxy provider in the market. They sell residential proxies, mobile proxies, and datacenter proxies. One thing you will come to like about this provider is that they offer a 7-day free trial. They have the largest residential proxy pool for Scrapy, with over 72 million residential IPs from all countries and major cities in each supported country.
They have one of the fastest speeds as far as scraping is concerned, and they have proven to work with all the popular web services. Their proxies are rotating proxies and change IP address after every request or after a defined period of time.
Smartproxy
- IP Pool Size: Over 40 million
- Locations: 195 locations across the globe
- Concurrency Allowed: Unlimited
- Bandwidth Allowed: Starts at 5GB
- Cost: Starts at $75 monthly for 5GB
While Luminati proxies are secure, reliable, and fast, the minimum monetary requirement will scare most small-scale scrapers away, as you require a minimum of $500. For a premium provider with a smaller minimum monetary requirement, you can go for Smartproxy. Smartproxy has a pool with over 40 million residential IPs.
They have support for both high rotating proxies and session-based proxies that will help you maintain sessions for up to 10 minutes. Smartproxy has support for about 195 countries and 8 major cities around the world.
Shifter
- IP Pool Size: Over 31 million
- Locations: 130 countries
- Concurrency Allowed: Unlimited
- Cost: Starts at $249.99 monthly for 10 ports
Shifter has support for major cities and many countries around the globe. From the supported locations, it has a total of over 31 million residential backconnect IPs, making it one of the largest proxy networks in the market. Shifter proxies are priced based on ports and not bandwidth, unlike the others discussed above.
With Shifter, you will enjoy unlimited bandwidth usage, which makes it perfect for scraping tasks that consume much bandwidth. Shifter backconnect proxies change IP after every 5 minutes. You can use their proxies for a good number of scraping tasks as they are fast, secure, and also reliable.
Final Thoughts (Proxies for Scrapy)
With or without Scrapy, proxies are a must if you intend to scrape websites. Scrapy has support for using proxies, and setting them up is incredibly easy. With the methods discussed above, you should be able to set up proxies on Scrapy in no time.
I also made recommendations on the best rotating proxies to use with Scrapy if you do not want to use Scraping APIs such as ProxyCrawl, ScraperAPI, or Apify Proxy.
- How to Scrape a Website and Never Get Blacklisted & Blocked
- Web Scraping API to Help Scrape & Extract Data
- Selenium Proxy 101: How to Setup Proxies on Selenium
- Building a Web Crawler Using Selenium and Proxies
Scrapy proxy FAQs
What is a Scrapy proxy?
A Scrapy proxy is a proxy server that is used with the Scrapy framework to enable web scraping. These proxies allow Scrapy to access websites and online services from a different IP address, providing benefits such as improved performance and the ability to bypass geolocation restrictions.
How do I use a Scrapy proxy?
To use a Scrapy proxy, you will first need to find a proxy provider that offers proxies compatible with the Scrapy framework. Once you have found a provider, you can sign up for a subscription or purchase a package of proxies. Once you have your proxies, you can configure Scrapy to use them by setting the HTTP_PROXY and HTTPS_PROXY environment variables or by specifying the proxy in your Scrapy spider.
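As a minimal sketch of the environment-variable route: Scrapy's HttpProxyMiddleware picks up the standard proxy environment variables when it initialises, so they must be set before the crawl starts. The proxy address below is a placeholder, not a real endpoint.

```python
import os

# Set the standard proxy environment variables before starting the crawl;
# HttpProxyMiddleware reads them when it is initialised.
# "<proxy_host>" is a placeholder - substitute your provider's address.
os.environ["http_proxy"] = "http://<proxy_host>:8050"
os.environ["https_proxy"] = "http://<proxy_host>:8050"
```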
Are Scrapy proxies free?
Scrapy proxies are not typically free to use. Most proxy providers will charge a fee for access to their proxies, either on a subscription basis or as a one-time purchase. It is important to carefully compare the pricing and features of different proxy providers to find the best option for your needs and budget.
Are Scrapy proxies legal?
The use of Scrapy proxies is generally considered legal, as long as the data being accessed through the proxy is publicly available and the terms of use for the website or online service allow for proxy access. However, it is always important to respect the rights of website owners and not access their data without their permission. If in doubt, it is best to consult with a legal professional before using a Scrapy proxy.