A project required scraping item detail pages from Taobao and Tmall. After some trial and error I got it working, so here it is for reference.
Maven dependencies:
HtmlUnit
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.23</version>
    <exclusions>
        <exclusion>
            <artifactId>httpclient</artifactId>
            <groupId>org.apache.httpcomponents</groupId>
        </exclusion>
    </exclusions>
</dependency>
Preparation:
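import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;

public static BrowserVersion getBrowserVersion() {
    BrowserVersion bv = BrowserVersion.BEST_SUPPORTED.clone();
    // Set the language explicitly; otherwise the encoding of the response is unpredictable.
    bv.setUserLanguage("zh_cn");
    bv.setSystemLanguage("zh_cn");
    bv.setBrowserLanguage("zh_cn");
    // The HtmlUnit source hard-codes Win32. I am not sure whether that changes
    // in production (Linux), so set it explicitly to be safe.
    bv.setPlatform("Win32");
    return bv;
}

public static WebClient newWebClient() {
    WebClient wc = new WebClient(getBrowserVersion());
    // Allow insecure SSL connections; without this, HTTPS sites with expired certificates cannot be accessed.
    wc.getOptions().setUseInsecureSSL(true);
    wc.getOptions().setJavaScriptEnabled(true); // enable the JS interpreter
    wc.getOptions().setCssEnabled(false); // disable CSS support
    // Suppress exceptions for script errors and failing status codes.
    wc.getOptions().setThrowExceptionOnScriptError(false);
    wc.getOptions().setThrowExceptionOnFailingStatusCode(false);
    wc.getOptions().setDoNotTrackEnabled(false); // do not send the DoNotTrack header
    wc.setJavaScriptTimeout(1000); // JS timeout, 1 s here
    wc.getOptions().setTimeout(5000); // connection timeout, 5 s here; 0 means wait indefinitely
    wc.setAjaxController(new NicelyResynchronizingAjaxController()); // set the AJAX controller
    return wc;
}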
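Scraping Taobao item details:
Looking at Taobao's page, the item description is loaded asynchronously from a CDN. We only need to find that CDN URL in the page and request it directly to get the response.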
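import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.UrlUtils;
import org.apache.commons.lang3.StringUtils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// searchRequestHeader is a Map<String, String> of request headers defined elsewhere (not shown in this post).
public String getTaobaoDetail(String url) {
    WebClient wc = newWebClient();
    String detail = "";
    try {
        WebRequest request = new WebRequest(UrlUtils.toUrlUnsafe(url));
        request.setAdditionalHeaders(searchRequestHeader);
        Page page = wc.getPage(request);
        if (page.isHtmlPage()) {
            HtmlPage htmlPage = (HtmlPage) page;
            DomNodeList<HtmlElement> script = htmlPage.getHead().getElementsByTagName("script");
            String detailUrl = "";
            for (HtmlElement elm : script) {
                String textContent = elm.getTextContent();
                // The CDN URL sits on the descUrl line of the inline g_config object.
                if (textContent.contains("var g_config = {")) {
                    for (String line : textContent.split("\n")) {
                        if (line.trim().startsWith("descUrl")) {
                            String match = RegexUtil.getFirstMatch(line,
                                    "'//dsc.taobaocdn.com/i[0-9]+/[0-9]+/[0-9]+/[0-9]+/.+[0-9]+'\\s+:");
                            if (match != null) {
                                // Drop the trailing " :" and the quotes, then add a scheme.
                                detailUrl = "http:" + match.replaceAll("\\s+:", "").replace("'", "");
                            }
                            break;
                        }
                    }
                    break;
                }
            }
            if (StringUtils.isNotBlank(detailUrl)) {
                // The response is a JS assignment (var desc='...';); strip the wrapper to get the HTML.
                detail = wc.getPage(detailUrl).getWebResponse().getContentAsString()
                        .replace("var desc='", "").replace("';", "");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        wc.close();
    }
    return detail;
}

// In RegexUtil: returns the first match of regex in str, or null if there is none.
public static String getFirstMatch(String str, String regex) {
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(str);
    String ret = null;
    if (matcher.find()) {
        ret = matcher.group();
    }
    return ret;
}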
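To make the two string-handling steps easier to try in isolation, here is a minimal sketch that runs without HtmlUnit. The class name `DescUrlSketch`, its method names, the looser regex, and the sample `g_config` text are all assumptions made for illustration; the sample only mimics the shape implied by the regex above (a quoted protocol-relative URL inside a ternary), and the real page may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post or of HtmlUnit.
class DescUrlSketch {
    // First quoted protocol-relative URL that follows "descUrl :" on a line.
    private static final Pattern DESC_URL =
            Pattern.compile("descUrl\\s*:.*?['\"](//[^'\"]+)['\"]");

    // Returns the description URL with an http: scheme, or null if none is found.
    static String extractDescUrl(String script) {
        for (String line : script.split("\n")) {
            Matcher m = DESC_URL.matcher(line);
            if (m.find()) {
                return "http:" + m.group(1);
            }
        }
        return null;
    }

    // Strips the "var desc='...';" wrapper if present; otherwise returns the trimmed body.
    static String unwrapDesc(String body) {
        String s = body.trim();
        String prefix = "var desc='";
        String suffix = "';";
        if (s.startsWith(prefix) && s.endsWith(suffix)) {
            return s.substring(prefix.length(), s.length() - suffix.length());
        }
        return s;
    }

    public static void main(String[] args) {
        // Made-up sample; the path segments and the second host are placeholders.
        String script = "var g_config = {\n"
                + "    descUrl : location.protocol==='http:' ? "
                + "'//dsc.taobaocdn.com/i1/520/111/222/T1abc123' : "
                + "'//desc.example.com/i1/520/111/222/T1abc123',\n"
                + "    other : 1\n"
                + "};";
        System.out.println(extractDescUrl(script));
        System.out.println(unwrapDesc("var desc='<p>hello</p>';"));
    }
}
```

Stripping the wrapper only at the start and end, rather than replacing every `';` in the body, avoids damaging description HTML that happens to contain that character sequence.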
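Scraping Tmall item details:
Taobao and Tmall are built in completely different styles. I could not find a CDN address like the one on Taobao's detail page, so the content has to be scraped from the rendered page itself.
Use JS to simulate scrolling, then wait for the scripts to finish executing. How long that takes really comes down to luck...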
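import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.UrlUtils;

public String getTmallDetail(String url) {
    WebClient wc = newWebClient();
    String detail = "";
    try {
        WebRequest request = new WebRequest(UrlUtils.toUrlUnsafe(url));
        request.setAdditionalHeaders(searchRequestHeader);
        // Make the window as tall as possible so the whole page counts as visible.
        wc.getCurrentWindow().getTopWindow().setOuterHeight(Integer.MAX_VALUE);
        wc.getCurrentWindow().getTopWindow().setInnerHeight(Integer.MAX_VALUE);
        Page page = wc.getPage(request);
        page.getEnclosingWindow().setOuterHeight(Integer.MAX_VALUE);
        page.getEnclosingWindow().setInnerHeight(Integer.MAX_VALUE);
        if (page.isHtmlPage()) {
            HtmlPage htmlPage = (HtmlPage) page;
            // Simulate scrolling to the bottom to trigger the rendering-related JS.
            ScriptResult sr = htmlPage.executeJavaScript(
                    String.format("javascript:window.scrollBy(0,%d);", Integer.MAX_VALUE));
            // Wait for the background JS jobs. About 6-7 of them run for a very long time,
            // so stop once only those remain. The attempt cap (roughly 10 s) is an arbitrary
            // safeguard against waiting forever.
            int left;
            int attempts = 0;
            do {
                left = wc.waitForBackgroundJavaScript(10);
                attempts++;
            } while (left > 7 && attempts < 1000);
            htmlPage = (HtmlPage) sr.getNewPage();
            DomElement description = htmlPage.getElementById("description");
            if (description != null) {
                // Remove lazy loading: replace the placeholder src with the real image URL.
                detail = description.asXml()
                        .replaceAll("src=\"//.{0,100}\\.png\" data-ks-lazyload=", "src=");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        wc.close();
    }
    return detail;
}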
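The `replaceAll` call at the end of `getTmallDetail` is the least obvious step, so here is a small standalone sketch of what it does to an image tag. The class name `LazyLoadSketch`, the method name, and the sample markup and hosts are invented for illustration. The sketch also assumes, as the original regex does, that the placeholder `src` attribute is serialized immediately before `data-ks-lazyload`.

```java
// Hypothetical helper, not part of the original post or of HtmlUnit.
class LazyLoadSketch {
    // Drops a placeholder src="//....png" that is directly followed by
    // data-ks-lazyload=, so the lazy-load value becomes the src attribute.
    static String promoteLazyImages(String html) {
        return html.replaceAll(
                "src=\"//[^\"]{0,100}\\.png\"\\s+data-ks-lazyload=", "src=");
    }

    public static void main(String[] args) {
        // Made-up markup in the shape the regex expects.
        String lazy = "<img src=\"//assets.example.com/spaceball.png\" "
                + "data-ks-lazyload=\"//img.example.com/real.jpg\"/>";
        System.out.println(promoteLazyImages(lazy));
        // prints: <img src="//img.example.com/real.jpg"/>
    }
}
```

Using `[^\"]` instead of `.` keeps the match inside a single attribute value, so a placeholder and a lazy-load attribute from two different tags cannot be joined by accident.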