In this article I'm going to show you how to add proxy IPs to a Java crawler. A crawler imitates the way a human browses the web, using a program to fetch information from web pages automatically. During crawling, using a proxy IP is very important: it helps you avoid being banned by a website for sending requests too frequently.
First, the role and use of proxy IP
On the network, an IP address identifies and locates a specific device, much like every person having a unique ID number. A proxy IP gives our crawler a way to "disguise its identity", so that its requests look more like normal user browsing and the risk of being banned drops significantly.
Now let me walk you through how to use a proxy IP for crawling in Java!
Second, get the proxy IP
To use a proxy IP, you first need some available proxy IP addresses. I recommend getting them from proxy IP listing websites, as in the example below.
public List<String> getProxyIpList() {
    List<String> proxyIpList = new ArrayList<>();
    // Create a default HttpClient (requires Apache HttpClient 4.x).
    CloseableHttpClient httpClient = HttpClients.createDefault();
    // Use HttpClient to send a request and fetch the page content.
    HttpGet httpGet = new HttpGet("http://www.proxywebsite.com");
    CloseableHttpResponse response = null;
    try {
        response = httpClient.execute(httpGet);
        HttpEntity entity = response.getEntity();
        String html = EntityUtils.toString(entity);
        // Extract "ip:port" proxy addresses using a regular expression.
        Pattern pattern = Pattern.compile("\\d+\\.\\d+\\.\\d+\\.\\d+:\\d+");
        Matcher matcher = pattern.matcher(html);
        // Save every extracted address to the list.
        while (matcher.find()) {
            String proxyIp = matcher.group();
            proxyIpList.add(proxyIp);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (response != null) {
                response.close();
            }
            httpClient.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return proxyIpList;
}
With the above code, we can get some available proxy IP addresses from a proxy IP website and save them in a list.
Third, set the proxy IP
Next, we need to set the proxy IP in the crawler program so that it sends its requests through the proxy. Below is sample code for setting the proxy IP:
public CloseableHttpClient setProxy(String proxyHost, int proxyPort) {
    HttpClientBuilder builder = HttpClientBuilder.create();
    // Route all requests from this client through the given proxy host and port.
    HttpHost proxy = new HttpHost(proxyHost, proxyPort, "http");
    builder.setProxy(proxy);
    // Return the proxied client so the caller can send requests with it.
    return builder.build();
}
In the above code, we use the HttpClientBuilder functionality provided by HttpClient to set the proxy. By specifying the proxy's host address and port number, we get back a client whose requests are routed through that proxy during crawling.
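As a quick illustration, here is a minimal usage sketch (my own, not from the original article) of fetching one page through the client built this way; the proxy address 127.0.0.1:8080 and the target URL http://example.com are only placeholders:

// Minimal usage sketch: fetch one page through the proxied client.
// The proxy address and target URL below are placeholders for illustration.
CloseableHttpClient httpClient = setProxy("127.0.0.1", 8080);
HttpGet request = new HttpGet("http://example.com");
try (CloseableHttpResponse response = httpClient.execute(request)) {
    // Read the response body as a string and print the beginning of it.
    String body = EntityUtils.toString(response.getEntity());
    System.out.println(body.substring(0, Math.min(200, body.length())));
} catch (IOException e) {
    e.printStackTrace();
}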
Fourth, the use of proxy IP for crawling
Once we have obtained and configured the proxy IPs, we can follow the normal crawling process to fetch data. Here is a simple example:
public void crawlWithProxy() {
    List<String> proxyIpList = getProxyIpList();
    for (String proxyIp : proxyIpList) {
        // Each entry has the form "ip:port"; split it into host and port.
        String[] ipAndPort = proxyIp.split(":");
        String ip = ipAndPort[0];
        int port = Integer.parseInt(ipAndPort[1]);
        CloseableHttpClient httpClient = setProxy(ip, port);
        // Use httpClient to send requests and crawl the data...
    }
}
With the above code, we can traverse the list of proxy IPs and use each proxy IP in turn for data crawling.
Fifth, summary
After this introduction, I believe you have a better understanding of how to add proxy IPs to a Java crawler. Using proxy IPs is a good way to protect a crawler program from being blocked by the target site. In practice, we can refine the proxy strategy further, for example by regularly refreshing the proxy IP list and checking whether each proxy is still available, as in the sketch below.
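As one example, here is a minimal sketch of such an availability check, assuming Apache HttpClient 4.x; the test URL http://example.com is just a placeholder, and the idea is simply to treat a proxy as usable if a request routed through it returns a 2xx status within a short timeout:

// Minimal sketch of a proxy availability check (illustrative only).
// A proxy is considered "alive" if a test request through it returns a 2xx status.
public boolean isProxyAlive(String proxyHost, int proxyPort) {
    RequestConfig config = RequestConfig.custom()
            .setConnectTimeout(3000)   // fail fast if the proxy does not answer
            .setSocketTimeout(3000)
            .build();
    HttpHost proxy = new HttpHost(proxyHost, proxyPort, "http");
    try (CloseableHttpClient client = HttpClientBuilder.create()
            .setProxy(proxy)
            .setDefaultRequestConfig(config)
            .build();
         CloseableHttpResponse response = client.execute(new HttpGet("http://example.com"))) {
        int status = response.getStatusLine().getStatusCode();
        return status >= 200 && status < 300;
    } catch (IOException e) {
        return false; // connection failed or timed out, treat the proxy as dead
    }
}

You could run this check over the list returned by getProxyIpList and keep only the proxies that pass before starting to crawl.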
I hope today's sharing helps you run your crawler more efficiently and reliably! Finally, a reminder: when using a crawler, comply with network ethics, laws, and regulations, do not abuse crawling technology, and protect data security and privacy.