When crawling data, using proxy IPs is a common and effective way to avoid being blocked or rate-limited by the target website. A proxy IP hides the crawler's real IP address, making requests appear to come from different users and thereby improving crawling efficiency. Below is a detailed explanation of how to use proxy IPs in a crawler.
Preparation
Before you begin, you'll need to prepare the following tools and resources:
- Python programming language
- Some available proxy IP addresses
- Python's requests library.
Step 1: Install the necessary libraries
First, make sure you have Python installed. If not, you can download and install it from the Python website. Next, install the requests library:
pip install requests
Step 2: Get Proxy IP
You can find some proxy IP service providers online, for example: ipipgo
Get some proxy IPs from the ipipgo website and record their IP addresses and port numbers.
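For illustration, here is a minimal sketch of how the recorded IP addresses and ports might be turned into the dictionary format that requests expects. The addresses below are placeholders, not real proxies:
# Hypothetical "host:port" entries recorded from a provider (placeholders only)
raw_proxies = [
    "203.0.113.10:8080",
    "203.0.113.11:8080",
]
# Convert each entry into the proxies dict format used by requests
proxies_list = [
    {"http": f"http://{addr}", "https": f"http://{addr}"}
    for addr in raw_proxies
]
print(proxies_list)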
Step 3: Write the crawler code
Next, we'll write a simple Python crawler that uses proxy IPs to make network requests.
import requests
# Proxy list (replace proxy1:port etc. with your real proxy addresses)
proxies_list = [
    {"http": "http://proxy1:port", "https": "https://proxy1:port"},
    {"http": "http://proxy2:port", "https": "https://proxy2:port"},
    {"http": "http://proxy3:port", "https": "https://proxy3:port"},
    # Add more proxy IPs here
]
# Target URL
target_url = "http://example.com"
# Request function
def fetch_url(proxy):
    try:
        # Send the request through the given proxy, with a 5-second timeout
        response = requests.get(target_url, proxies=proxy, timeout=5)
        print(f"Using proxy {proxy} request successful, status code: {response.status_code}")
        # Process the response content
        print(response.text[:100])  # Print the first 100 characters
    except requests.RequestException as e:
        print(f"Using proxy {proxy} request failed: {e}")
# Make requests using the proxy IPs in turn
for proxy in proxies_list:
    fetch_url(proxy)
In this script, we define a `fetch_url` function that requests the target URL through the specified proxy IP. We then make requests with each proxy IP in turn and print the result of each request.
Step 4: Run the script
Save the above code as a Python file, e.g. `proxy_scraper.py`. Run the script in a terminal:
python proxy_scraper.py
The script will request the target URL using different proxy IPs in turn and output the result of each request.
Advanced Usage: Random Proxy IP Selection
In practice, you may want to randomly select proxy IPs to avoid being detected by the target website. Below is an improved script that uses a randomly selected proxy IP for requests:
import requests
import random
# Proxy list (replace proxy1:port etc. with your real proxy addresses)
proxies_list = [
    {"http": "http://proxy1:port", "https": "https://proxy1:port"},
    {"http": "http://proxy2:port", "https": "https://proxy2:port"},
    {"http": "http://proxy3:port", "https": "https://proxy3:port"},
    # Add more proxy IPs here
]
# Target URL
target_url = "http://example.com"
# Request function
def fetch_url(proxy):
    try:
        # Send the request through the given proxy, with a 5-second timeout
        response = requests.get(target_url, proxies=proxy, timeout=5)
        print(f"Using proxy {proxy} request successful, status code: {response.status_code}")
        # Process the response content
        print(response.text[:100])  # Print the first 100 characters
    except requests.RequestException as e:
        print(f"Using proxy {proxy} request failed: {e}")
# Randomly select a proxy IP for the request
for _ in range(10): # number of requests
proxy = random.choice(proxies_list)
fetch_url(proxy)
In this script, we use Python's `random.choice` function to pick a proxy IP at random from the proxy list for each request. This makes it harder for the target site to detect a fixed request pattern and helps keep crawling running smoothly.
Caveats
There are a few things to keep in mind when using proxy IPs for crawling:
- Proxy IP quality: Make sure the proxy IPs you use are reliable, otherwise requests may fail.
- Request frequency: Set a reasonable request frequency; requests that are too frequent may get your IP blocked by the target website.
- Exception handling: In practice you may encounter various exceptions, such as network timeouts or failed proxy IPs, so appropriate exception handling needs to be added, as sketched below.
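As one possible way to address the last two points, here is a minimal sketch that adds a delay between attempts and retries with a different proxy on failure. The delay value, retry count, and proxy addresses are illustrative assumptions, not recommendations for any particular site:
import random
import time
import requests

target_url = "http://example.com"  # example target, as above
proxies_list = [
    {"http": "http://proxy1:port", "https": "https://proxy1:port"},
    {"http": "http://proxy2:port", "https": "https://proxy2:port"},
]

def fetch_with_retry(url, max_retries=3, delay=2):
    # Try up to max_retries proxies, waiting `delay` seconds between attempts
    for attempt in range(max_retries):
        proxy = random.choice(proxies_list)
        try:
            response = requests.get(url, proxies=proxy, timeout=5)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} with proxy {proxy} failed: {e}")
            time.sleep(delay)  # throttle requests so the target site is not hammered
    return None

result = fetch_with_retry(target_url)
if result is not None:
    print(result.status_code)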
Summary
With the above steps, you can use proxy IPs in your crawler to improve crawling efficiency and reduce the risk of being blocked by the target website. Whether for privacy protection or for better crawling efficiency, proxy IPs are a technique worth trying.
I hope this article helps you better understand and use crawler proxy IPs. Wishing you smooth and efficient data crawling!