Building a simple crawler with HttpClient, HtmlUnit, and Selenium to scrape page data

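In the course of a project you will always run into some odd or special requirement that means scraping your own pages, or someone else's, to get the data you need.

There are many ways to scrape a page (that is, to write a simple crawler). The common ones are: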

 

1. HttpClient

// Imports for Apache HttpClient 4.x and JUnit 4
// (for JUnit 5, import org.junit.jupiter.api.Test instead)
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.http.HttpEntity;
import org.apache.http.ParseException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.junit.Test;

@Test
public void crawSignHtmlTest() {
    // try-with-resources closes the client and releases its connections
    try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
        // Create the GET request
        HttpGet httpget = new HttpGet("http://127.0.0.1:8080/index.html?companyName=testCompany");

        httpget.setHeader("Accept", "text/html, */*; q=0.01");
        // Only advertise encodings the default client can decode (sdch removed)
        httpget.setHeader("Accept-Encoding", "gzip, deflate");
        httpget.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
        httpget.setHeader("Connection", "keep-alive");
        httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36");

        //System.out.println("executing request " + httpget.getURI());
        // Execute the GET request; the response is closed automatically
        try (CloseableHttpResponse response = httpclient.execute(httpget)) {
            // Response status
            System.out.println(response.getStatusLine());
            // Response entity
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                //System.out.println("response length: " + entity.getContentLength());
                // Response body; UTF-8 is used when the server declares no charset
                System.out.println("response content: ");
                System.out.println(EntityUtils.toString(entity, StandardCharsets.UTF_8));
            }
        }
    } catch (IOException | ParseException e) {
        e.printStackTrace();
    }
}

 

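With HttpClient, what you get back is the static source of index.html. If the page contains JavaScript that needs to run, that JavaScript has not been executed in the source you fetched.

If you want the HTML as it looks after JavaScript rendering, you can fetch it with HtmlUnit.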
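To make the static-source point concrete, here is a minimal sketch that fetches a page without running any scripts and then applies a rough check for whether the page ships JavaScript at all. It uses the JDK 11 java.net.http.HttpClient rather than Apache HttpClient so that it needs no extra jars. The class name StaticFetchCheck, both method names, and the "contains a script tag" heuristic are illustrative assumptions, not part of the original code.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class StaticFetchCheck {
    // Matches an opening <script> tag, case-insensitively.
    private static final Pattern SCRIPT_TAG =
            Pattern.compile("<script\\b", Pattern.CASE_INSENSITIVE);

    /** Fetches the raw response body. No JavaScript is executed. */
    static String fetchRaw(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /**
     * Rough heuristic (an assumption of this sketch): if the raw source
     * contains script tags, the rendered DOM may differ from what was fetched.
     */
    static boolean mayNeedJsRendering(String html) {
        return SCRIPT_TAG.matcher(html).find();
    }

    public static void main(String[] args) throws Exception {
        // Same local test URL as the HttpClient example above
        String html = fetchRaw("http://127.0.0.1:8080/index.html?companyName=testCompany");
        System.out.println("may need JS rendering: " + mayNeedJsRendering(html));
    }
}
```

A script tag does not prove that the data you want is produced by JavaScript, so treat a positive result only as a hint to compare the raw source against what the browser shows.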

 

2. HtmlUnit

Add the HtmlUnit jar and call it to get the page source after JavaScript has run.

// Imports for HtmlUnit 2.x (3.x moved to the org.htmlunit package) and JUnit 4
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Test;

@Test
public void htmlUnitSignTest() throws Exception {
    // try-with-resources closes the WebClient even if getPage throws
    try (WebClient wc = new WebClient(BrowserVersion.CHROME)) {
        wc.setJavaScriptTimeout(5000);
        wc.getOptions().setUseInsecureSSL(true); // accept any host, whether or not its certificate is valid
        wc.getOptions().setJavaScriptEnabled(true); // enable JavaScript
        wc.getOptions().setCssEnabled(false); // disable CSS support
        wc.getOptions().setThrowExceptionOnScriptError(false); // do not throw on JavaScript errors
        wc.getOptions().setTimeout(100000); // connection timeout in milliseconds
        wc.getOptions().setDoNotTrackEnabled(false);
        wc.getOptions().setActiveXNative(true);

        wc.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3");
        wc.addRequestHeader("Accept-Encoding", "gzip, deflate, br");
        wc.addRequestHeader("Accept-Language", "zh-CN,zh;q=0.9");
        wc.addRequestHeader("Connection", "keep-alive");
        wc.addRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36");

        //HtmlPage htmlpage = wc.getPage("http://127.0.0.1:8081/demo.html?companyName=testCompany");
        HtmlPage htmlpage = wc.getPage("http://127.0.0.1:8081/sign.html?companyName=testCompany&p=1");
        String res = htmlpage.asXml();
        // Process the source
        System.out.println(res);

        // HtmlForm form = htmlpage.getFormByName("f");
        // HtmlButton button = form.getButtonByName("btnDomName"); // get the submit button
        // HtmlPage nextPage = button.click();
        // System.out.println("waiting 2 seconds");
        // Thread.sleep(2000);
        // System.out.println(nextPage.asText());
    }
}

 

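HtmlUnit builds a simulated browser through new WebClient(), runs the page's JavaScript against the HTML it fetched, and hands back the HTML as it stands after the scripts have executed.

However, in some special scenarios, such as scraping base64 data drawn by a canvas, I found that the data was wrong and did not match what the browser produces when the page runs there directly (a huge pitfall; I wasted a lot of time on it).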

 

3. Selenium

Add the Selenium jar and also download ChromeDriver.exe. Calling it likewise gives you the page source after JavaScript has run.

 

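import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public static void main(String[] args) throws IOException {

    // Path to the chromedriver executable
    System.setProperty("webdriver.chrome.driver", "/srv/chromedriver.exe");
    ChromeOptions options = new ChromeOptions();
    options.addArguments("--headless");
    // To run headless, pass the options instead:
    //WebDriver driver = new ChromeDriver(options);

    // Create a WebDriver backed by the Chrome driver
    WebDriver driver = new ChromeDriver();
    try {
        String url = "http://127.0.0.1:8080/index.html?companyName=testCompany";
        driver.get(url); // open the target page

        // Print information about the current browser page
        System.out.println("Title:" + driver.getTitle());
        System.out.println("currentUrl:" + driver.getCurrentUrl());

        // findElement(By.id(...)) replaces findElementById, which newer Selenium versions removed
        WebElement imgDom = driver.findElement(By.id("imgDom"));
        System.out.println(imgDom.getText());

        //String imgBase64 = URLDecoder.decode(imgDom.getText(), "UTF-8");
        //imgBase64 = imgBase64.substring(imgBase64.indexOf(",") + 1);
        // Base64Util is a project helper that is not shown in this post
        byte[] fromBASE64ToByte = Base64Util.getFromBASE64ToByte(imgDom.getText());
        FileUtils.writeByteArrayToFile(new File("/srv/charter44.png"), fromBASE64ToByte);
    } finally {
        driver.quit(); // close the browser and stop the chromedriver process
    }
}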

 

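Selenium also gives you a browser to work with, here through new ChromeDriver() (WebDriver itself is an interface). Rather than simulating a browser, it drives a real Chrome instance, so it runs the page's JavaScript and returns the HTML as it stands after execution. The poor canvas support I hit with HtmlUnit was also solved perfectly here. Thumbs up for Selenium!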
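The Base64Util helper used in the Selenium example is not shown in this post. As a rough sketch of what that decoding step might look like with only the JDK, the class below strips an optional data-URL prefix (as the commented-out substring line suggests) and decodes the rest with java.util.Base64. It also checks the PNG file signature, which can flag output that is not a PNG at all, though it cannot tell you whether the pixels are right. The names CanvasDataUrl, decode, and isPng are made up for this example.

```java
import java.util.Base64;

// Hypothetical stand-in for the Base64Util helper, for illustration only.
class CanvasDataUrl {
    // The first 8 bytes of every PNG file.
    private static final byte[] PNG_SIGNATURE =
            {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    /** Decodes base64 text, dropping a "data:image/png;base64," prefix if present. */
    static byte[] decode(String dataUrl) {
        int comma = dataUrl.indexOf(',');
        String payload = comma >= 0 ? dataUrl.substring(comma + 1) : dataUrl;
        // The MIME decoder tolerates line breaks and stray whitespace in the payload
        return Base64.getMimeDecoder().decode(payload.trim());
    }

    /** Returns true when the bytes start with the PNG signature. */
    static boolean isPng(byte[] data) {
        if (data.length < PNG_SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < PNG_SIGNATURE.length; i++) {
            if (data[i] != PNG_SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "iVBORw0KGgo=" is the base64 form of the 8-byte PNG signature
        byte[] bytes = decode("data:image/png;base64,iVBORw0KGgo=");
        System.out.println("looks like PNG: " + isPng(bytes));
    }
}
```

In the Selenium example, the result of decode(imgDom.getText()) could be passed to FileUtils.writeByteArrayToFile in place of the Base64Util call, assuming the element's text really is base64 or a data URL.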
