Note: the code below is based on httpclient 4.5.2.
Fetching a web page with a GET request using Java's HttpClient is fairly easy:
```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public static String get(String url) {
    String result = "";
    // A new client is created on every call and closed when the call ends.
    // The body is decoded as UTF-8 (an assumption about the target page).
    try (CloseableHttpClient httpclient = HttpClients.createDefault();
         CloseableHttpResponse response = httpclient.execute(new HttpGet(url));
         BufferedReader in = new BufferedReader(new InputStreamReader(
                 response.getEntity().getContent(), StandardCharsets.UTF_8))) {
        StringBuilder sb = new StringBuilder();
        String nl = System.getProperty("line.separator");
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append(nl);
        }
        result = sb.toString();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return result;
}
```
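The method above also works when GET requests are issued from multiple threads. However, every call to get creates a new HttpClient instance, and each instance is used once and then discarded. That is clearly not the best implementation.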
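HttpClient offers its own solution for multi-threaded requests; see the "Pooling connection manager" section of the official documentation. It is built on an internal connection pool, and the key class is PoolingHttpClientConnectionManager, which manages that pool. The class provides two key methods: setMaxTotal and setDefaultMaxPerRoute. setMaxTotal sets the maximum number of connections in the pool, and setDefaultMaxPerRoute sets the default maximum number of connections per route. There is also setMaxPerRoute, which sets the maximum number of connections for one particular site, like this: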
```java
// cm is the PoolingHttpClientConnectionManager instance
HttpHost host = new HttpHost("localhost", 80);
cm.setMaxPerRoute(new HttpRoute(host), 50);
```
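To make the relationship between the three settings concrete, here is a toy model written with plain JDK semaphores. It is only a sketch of the idea, not HttpClient's actual implementation: the class name PoolLimitSketch and the methods tryLease and release are made up for illustration, routes are simplified to strings, and the model returns false when a limit is reached, whereas the real connection manager makes the caller wait for a free connection.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Toy model of a two-level connection limit (hypothetical, not HttpClient code). */
class PoolLimitSketch {
    private final Semaphore total;            // plays the role of maxTotal
    private final int defaultMaxPerRoute;     // plays the role of defaultMaxPerRoute
    private final Map<String, Integer> maxPerRoute = new ConcurrentHashMap<>();
    private final Map<String, Semaphore> routes = new ConcurrentHashMap<>();

    PoolLimitSketch(int maxTotal, int defaultMaxPerRoute) {
        this.total = new Semaphore(maxTotal);
        this.defaultMaxPerRoute = defaultMaxPerRoute;
    }

    /** Overrides the limit for one route; call it before that route is first used. */
    void setMaxPerRoute(String route, int max) {
        maxPerRoute.put(route, max);
    }

    private Semaphore routePermits(String route) {
        return routes.computeIfAbsent(route,
                r -> new Semaphore(maxPerRoute.getOrDefault(r, defaultMaxPerRoute)));
    }

    /** Tries to lease a connection; returns false if either limit is exhausted. */
    boolean tryLease(String route) {
        Semaphore perRoute = routePermits(route);
        if (!perRoute.tryAcquire()) {
            return false;                     // this route is at its own limit
        }
        if (!total.tryAcquire()) {
            perRoute.release();               // roll back the per-route permit
            return false;                     // the whole pool is at its limit
        }
        return true;
    }

    /** Returns a previously leased connection to the pool. */
    void release(String route) {
        total.release();
        routePermits(route).release();
    }

    public static void main(String[] args) {
        PoolLimitSketch pool = new PoolLimitSketch(3, 1);
        pool.setMaxPerRoute("localhost:80", 2);
        System.out.println(pool.tryLease("localhost:80"));   // true
        System.out.println(pool.tryLease("localhost:80"));   // true
        System.out.println(pool.tryLease("localhost:80"));   // false: route limit of 2
        System.out.println(pool.tryLease("example.org:80")); // true
        System.out.println(pool.tryLease("other.org:80"));   // false: total limit of 3
    }
}
```

The point of the model is that a request needs a permit at both levels: a route can be refused while the pool still has room, and a route with spare capacity can be refused because the pool as a whole is full.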
Following the documentation, we adjust our GET implementation slightly:
```java
package com.zhyea.robin;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HttpUtil {

    // One client, backed by a connection pool, shared by all threads.
    private static final CloseableHttpClient httpClient;

    static {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(200);            // upper bound for the whole pool
        cm.setDefaultMaxPerRoute(50);   // default upper bound per route
        httpClient = HttpClients.custom().setConnectionManager(cm).build();
    }

    public static String get(String url) {
        String result = "";
        HttpGet httpGet = new HttpGet(url);
        // Closing the response releases the connection back to the pool.
        // The body is decoded as UTF-8 (an assumption about the target page).
        try (CloseableHttpResponse response = httpClient.execute(httpGet);
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     response.getEntity().getContent(), StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            String nl = System.getProperty("line.separator");
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append(nl);
            }
            result = sb.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(get("https://www.baidu.com/"));
    }
}
```
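The listing above only calls get once from main. As a rough sketch of how several threads might share it, the following uses a fixed JDK thread pool. ConcurrentGetSketch and fetchAll are hypothetical names, and the fetcher is passed in as a Function so that the example runs offline with a stand-in; in the setup described here you would pass HttpUtil::get instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ConcurrentGetSketch {

    /** Applies fetcher to every URL on a fixed pool of worker threads; results keep input order. */
    static List<String> fetchAll(List<String> urls, Function<String, String> fetcher, int threads) {
        ExecutorService workers = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per URL; all tasks share the same fetcher.
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                pending.add(workers.submit(task));
            }
            // Collect the results in submission order.
            List<String> bodies = new ArrayList<>();
            for (Future<String> f : pending) {
                try {
                    bodies.add(f.get()); // blocks until that request has finished
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("request failed", e.getCause());
                }
            }
            return bodies;
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in fetcher so the sketch needs no network access.
        Function<String, String> fake = url -> "body of " + url;
        List<String> urls = Arrays.asList("http://a.example/", "http://b.example/");
        System.out.println(fetchAll(urls, fake, 4));
    }
}
```

With a pooled client behind the fetcher, the number of worker threads and the pool limits are worth choosing together: workers beyond the per-route limit would simply wait for a connection to the same host.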
That is about it. Personally, though, I prefer httpclient's fluent API. The HTTP GET request we just implemented can be written as simply as this:
```java
package com.zhyea.robin;

import org.apache.http.client.fluent.Request;

import java.io.IOException;

public class HttpUtil {

    public static String get(String url) {
        String result = "";
        try {
            result = Request.Get(url)
                    .connectTimeout(1000)   // milliseconds
                    .socketTimeout(1000)    // milliseconds
                    .execute().returnContent().asString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(get("https://www.baidu.com/"));
    }
}
```
All we need to do is replace the earlier httpclient dependency with the fluent-hc dependency:
```xml
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>fluent-hc</artifactId>
    <version>4.5.2</version>
</dependency>
```
Moreover, the fluent API uses PoolingHttpClientConnectionManager out of the box. It sets maxTotal and defaultMaxPerRoute to 200 and 100 respectively:
```java
// Excerpt from org.apache.http.client.fluent.Executor
CONNMGR = new PoolingHttpClientConnectionManager(sfr);
CONNMGR.setDefaultMaxPerRoute(100);
CONNMGR.setMaxTotal(200);
```
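The one annoyance is that Executor offers no setter for these two values. The defaults are perfectly adequate for most uses. If they are not, Executor.newInstance(HttpClient) accepts a client built with your own connection manager, or you can write your own Executor, and then run the GET request through the Executor directly: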
```java
Executor.newInstance().execute(Request.Get(url))
        .returnContent().asString();
```
That's it!