Proxy IPs are a common and important tool in web crawling and data collection: a proxy IP helps hide your real IP address. In this article, we will explain in detail how to use a proxy IP in Python.
What is a proxy IP?
A proxy IP is an intermediary server that hides the user's real IP address by forwarding requests and responses on the user's behalf. Using a proxy IP can effectively reduce the chance of being blocked by the target website and improve the success rate of data collection.
Why should I use a proxy IP?
When crawling the web, frequent requests may attract the attention of the target website and lead to your IP being blocked. Using proxy IPs spreads requests across different addresses and reduces the risk of blocking. In addition, a proxy IP can help you access websites that apply geographical restrictions by IP.
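To illustrate the idea of spreading requests across addresses, here is a minimal sketch of proxy rotation. The pool entries are placeholder addresses, not real proxies:

import random

# Hypothetical pool of proxy addresses (placeholders, not working proxies)
PROXY_POOL = [
    'http://123.123.123.123:8080',
    'http://124.124.124.124:8080',
    'http://125.125.125.125:8080',
]

def next_proxy():
    """Pick a random proxy from the pool so requests are spread across IPs."""
    addr = random.choice(PROXY_POOL)
    return {'http': addr, 'https': addr}

Each call to next_proxy() returns a proxies dict in the shape that the requests library expects, so successive requests can go out through different IPs.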
How to use proxy IP in Python?
Next, we will explain in detail how to use proxy IPs in Python in a few steps.
1. Using the requests library to set up the proxy
Python's requests library is a very popular HTTP request library and it is very easy to set up proxies using it. Here is a simple example:
import requests
proxy = {
    'http': 'http://123.123.123.123:8080',
    'https': 'http://123.123.123.123:8080',
}
url = 'http://httpbin.org/ip'
response = requests.get(url, proxies=proxy)
print(response.text)
In this example, we specify the proxy IP address for HTTP and HTTPS requests by setting the proxies parameter. You can replace 'http://123.123.123.123:8080' with the proxy IP you are actually using.
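Many proxy providers require a username and password. requests supports basic-auth credentials embedded in the proxy URL; here is a small sketch of building such a proxies dict (the host, port, and credentials are placeholders):

def make_proxies(host, port, user=None, password=None):
    """Build a proxies dict for requests, optionally embedding basic-auth credentials."""
    if user and password:
        addr = f'http://{user}:{password}@{host}:{port}'
    else:
        addr = f'http://{host}:{port}'
    return {'http': addr, 'https': addr}

# proxies = make_proxies('123.123.123.123', 8080, 'alice', 'secret')
# response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10)

The helper returns the same dict shape used above, so it plugs straight into the proxies parameter of requests.get.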
2. Use the urllib library to set up the proxy
In addition to the requests library, Python's built-in library urllib can also be used to set up proxies. Here is an example:
import urllib.request
proxy = urllib.request.ProxyHandler({
    'http': 'http://123.123.123.123:8080',
    'https': 'http://123.123.123.123:8080',
})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)
url = 'http://httpbin.org/ip'
response = urllib.request.urlopen(url)
print(response.read().decode('utf-8'))
In this example, we set the proxy IP via ProxyHandler and apply it globally using the build_opener and install_opener methods.
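If you do not want the proxy applied globally, you can skip install_opener and call the opener directly, so only requests made through that opener use the proxy (the proxy address below is a placeholder):

import urllib.request

# Build an opener with the proxy but do not install it globally;
# only requests made through this opener go via the proxy.
proxy = urllib.request.ProxyHandler({
    'http': 'http://123.123.123.123:8080',
    'https': 'http://123.123.123.123:8080',
})
opener = urllib.request.build_opener(proxy)
# response = opener.open('http://httpbin.org/ip', timeout=10)  # uses the proxy
# urllib.request.urlopen(...) elsewhere remains unaffected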
3. Use of third-party libraries - requests_html
requests_html is a requests-based library that provides additional functionality, including JavaScript rendering. Below is an example of using a proxy IP:
from requests_html import HTMLSession
session = HTMLSession()
proxy = {
    'http': 'http://123.123.123.123:8080',
    'https': 'http://123.123.123.123:8080',
}
url = 'http://httpbin.org/ip'
response = session.get(url, proxies=proxy)
print(response.text)
This example is similar to the usage of the requests library, but requests_html provides more functionality for scenarios where complex web pages need to be processed.
4. Using third-party libraries - Scrapy
Scrapy is a powerful crawler framework with built-in support for proxy IPs. Here is a simple example:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# my_spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        # Attach the proxy to the request via the 'proxy' meta key
        yield scrapy.Request(
            'http://httpbin.org/ip',
            meta={'proxy': 'http://123.123.123.123:8080'},
        )

    def parse(self, response):
        self.logger.info('IP: %s', response.text)
In this example, Scrapy's HttpProxyMiddleware (enabled by default) reads the proxy address from each request's meta['proxy'] key, so we attach the proxy when building the request and Scrapy routes it through the configured proxy automatically.
How do I get a proxy IP?
There are several ways to get a proxy IP, including free proxy sites, paid proxy services and self-built proxy servers. Here are some common ways to get a proxy IP:
1. Free proxy websites
Many websites offer free proxy IPs. You can collect proxies from these sites, but keep in mind that free proxy IPs are usually unstable and may stop working at any time.
2. Paid proxy services
If you need stable and efficient proxy IPs, it is recommended to use paid proxy services. For example, IPIPGO provides high-quality proxy IPs suitable for large-scale data collection.
3. Self-built proxy servers
If you have the technical skills and resources, you can consider building your own proxy server. By purchasing a VPS server and configuring proxy software, you can get a dedicated proxy IP.
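Whichever source you use, it is worth validating a proxy before relying on it. Here is a minimal sketch using only the standard library; httpbin.org/ip is used as the test endpoint, and check_proxy is a helper name introduced for illustration:

import urllib.request

def check_proxy(proxy_url, timeout=5):
    """Return True if the proxy can fetch http://httpbin.org/ip within the timeout."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        opener.open('http://httpbin.org/ip', timeout=timeout)
        return True
    except Exception:
        return False

Running this over a list of candidate proxies lets you filter out dead entries before starting a crawl.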
Summary
By the end of this article, I believe you have mastered the basics of how to use proxy IPs in Python. Whether you are using the requests library, the urllib library, or third-party libraries such as requests_html and Scrapy, setting up a proxy IP is very simple. I hope this has been helpful and wish you well on your data collection journey!