I have been using HttpClient 4 to fetch pages by URL. So how is HttpClient actually used? Enough talk, here is the code:
```java
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HTTPUtils {

    private static CloseableHttpClient httpClient;

    // Without timeouts, a stalled connection can block a batch job indefinitely.
    private static RequestConfig requestConfig = RequestConfig.custom()
            .setSocketTimeout(5000)
            .setConnectTimeout(5000)
            .build();

    /**
     * Fetches the HTML body of the given URL.
     *
     * @param url the page to fetch
     * @return the page body decoded as GB18030
     * @throws IOException if the request fails or times out
     */
    public static String getHTML(String url) throws IOException {
        httpClient = HttpClients.createDefault();
        HttpGet request = new HttpGet(url);
        request.setConfig(requestConfig);
        HttpResponse response = httpClient.execute(request);
        HttpEntity entity = response.getEntity();
        // Decode with an explicit charset; otherwise the default is ISO-8859-1.
        String html = EntityUtils.toString(entity, "GB18030");
        httpClient.close();
        return html;
    }
}
```
The key part of this code is the `requestConfig` definition. Without a timeout, batch-processing a large number of pages can leave the program hanging as if frozen, which is a serious problem and wastes a lot of manual effort, so the timeout settings are added to keep it under control. Also, when fetching the HTML page you need to set the page encoding explicitly; otherwise the default ISO-8859-1 charset is used.
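Hardcoding GB18030 only works for sites that actually serve that encoding. A minimal sketch of a more general approach (not from the original post; `CharsetPicker` and `pickCharset` are hypothetical names) is to parse the charset out of the response's `Content-Type` header and fall back to GB18030 only when the server does not declare one:

```java
public class CharsetPicker {

    /**
     * Extracts the charset parameter from a Content-Type header value,
     * e.g. "text/html; charset=UTF-8" -> "UTF-8".
     * Returns the fallback when no charset is declared.
     */
    static String pickCharset(String contentType, String fallback) {
        if (contentType != null) {
            // Content-Type parameters are separated by semicolons.
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase().startsWith("charset=")) {
                    return p.substring("charset=".length()).trim();
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(pickCharset("text/html; charset=UTF-8", "GB18030")); // UTF-8
        System.out.println(pickCharset("text/html", "GB18030"));                // GB18030
    }
}
```

The resolved charset would then be passed to `EntityUtils.toString(entity, charset)` in place of the hardcoded `"GB18030"`. (HttpClient itself also offers `ContentType.get(entity)`, the call commented out in the snippet above, which serves a similar purpose.)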