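Over the past few days I have been working on a program to scrape data from a website (I won't say which one), to cut down on manual work. The stack: Java, Selenium, ChromeDriver, running on Ubuntu 16.04.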
    <% for(var i=0; i < loop_times; i++) { %>
    <% var items = rider_list.slice(i * num_per_line, (i+1) * num_per_line); %>
    <tr>
    <% for (var j=0; j < items.length; j++) { %>
    <%
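As the template above shows, fetching the HTML directly does not yield the data: what the page displays is the result of rendering in the browser. So the fetched source has to be loaded in a browser that runs its JavaScript and renders the page, and only then can the data be extracted with XPath.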
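To make the difference concrete, here is a minimal sketch that uses only the JDK's built-in XPath support. It assumes the rendered markup is well-formed XML, which real pages often are not (one reason to let Selenium do the lookups instead). The class name `RenderedPageDemo`, both helper methods, and the sample markup are hypothetical and not part of the original scraper.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RenderedPageDemo {

    // Heuristic (an assumption): unrendered client-side templates still contain "<%" markers.
    static boolean looksUnrendered(String html) {
        return html.contains("<%");
    }

    // Evaluate an XPath expression against well-formed markup
    // and return the trimmed text of every matching node.
    static List<String> texts(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample inputs: raw template source vs. markup after rendering.
        String raw = "<table><% for (var j=0; j < items.length; j++) { %></table>";
        String rendered = "<table><tr><td>Alice</td><td>Bob</td></tr></table>";
        System.out.println(looksUnrendered(raw));
        System.out.println(texts(rendered, "//td"));
    }
}
```

The raw source only tells you that a template is present; the cell values exist only in the rendered markup, which is what the browser step provides.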
WebDriver supports several browsers; this post uses Chrome through ChromeDriver.
ChromeDriver download page: https://sites.google.com/a/chromium.org/chromedriver/downloads . Check which Chrome versions each release supports. I used the latest release, 2.37.
Latest Release: ChromeDriver 2.37 Supports Chrome v64-66
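The version range above can be checked before starting a scrape. The following is a minimal sketch, assuming the installed browser reports its version in a form like "Google Chrome 65.0.3325.181" (for example from `google-chrome --version`). The class and method names are hypothetical, and the version strings in `main` are sample inputs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChromeVersionCheck {

    // First "<digits>.<digits>" pair in the text; the first group is the major version.
    private static final Pattern MAJOR = Pattern.compile("(\\d+)\\.\\d+");

    // Extract the major version from text such as "Google Chrome 65.0.3325.181".
    static int majorVersion(String versionOutput) {
        Matcher m = MAJOR.matcher(versionOutput);
        if (!m.find()) {
            throw new IllegalArgumentException("no version number in: " + versionOutput);
        }
        return Integer.parseInt(m.group(1));
    }

    // True when the major version lies in the inclusive range supported by the driver.
    static boolean isSupported(String versionOutput, int minMajor, int maxMajor) {
        int major = majorVersion(versionOutput);
        return major >= minMajor && major <= maxMajor;
    }

    public static void main(String[] args) {
        // ChromeDriver 2.37 supports Chrome v64-66, per the release note above.
        System.out.println(isSupported("Google Chrome 65.0.3325.181", 64, 66));
        System.out.println(isSupported("Google Chrome 70.0.3538.77", 64, 66));
    }
}
```

A check like this fails fast with a clear message, instead of a less obvious session error when the driver and browser do not match.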
Install Chrome on Ubuntu:

    $ wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
    $ sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list'
    $ sudo apt-get update
    $ sudo apt-get install google-chrome-stable
    root@iZj6c1imv6wpn7tfmm7nusZ:/work/fantasy# ./chromedriver
    Starting ChromeDriver 2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7) on port 9515
    Only local connections are allowed.
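When ChromeDriver is started by hand like this, it can be handy to confirm that something is actually listening on port 9515 before pointing a client at it. Below is a minimal sketch using plain JDK sockets. The class name `PortProbe` and the 500 ms timeout are assumptions, and an open port only shows that some process accepted the connection, not that it is ChromeDriver.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortProbe {

    // Try to open a TCP connection; true if something accepts it within the timeout.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing usable is listening there.
            return false;
        }
    }

    public static void main(String[] args) {
        // 9515 is the port shown in the ChromeDriver startup output above.
        System.out.println(isListening("127.0.0.1", 9515, 500));
    }
}
```

The probe uses 127.0.0.1 because the startup output states that only local connections are allowed.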
A few points need attention here:
    ChromeOptions options = new ChromeOptions();
    options.addArguments("--headless");
    options.addArguments("--disable-gpu");
    options.addArguments("--no-sandbox");
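These options need to be set. On a server without a display Chrome has to run with --headless, and when running as root (as in the prompt above) it generally refuses to start without --no-sandbox.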
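One way to keep these flags in a single place is a small helper that builds the argument list. This is only a sketch: the class `ChromeArgs` and the method `headlessArgs` are hypothetical. Adding `--no-sandbox` only when running as root is my own design choice, not something the original code does (it always passes the flag).

```java
import java.util.ArrayList;
import java.util.List;

class ChromeArgs {

    // Build the Chrome command-line flags used in this post.
    static List<String> headlessArgs(boolean runningAsRoot) {
        List<String> args = new ArrayList<>();
        args.add("--headless");     // no visible browser window
        args.add("--disable-gpu");  // passed alongside --headless in the original code
        if (runningAsRoot) {
            // Assumption: the sandbox only needs to be disabled when running as root.
            args.add("--no-sandbox");
        }
        return args;
    }

    public static void main(String[] args) {
        boolean root = "root".equals(System.getProperty("user.name"));
        System.out.println(headlessArgs(root));
    }
}
```

The resulting list can be handed to `ChromeOptions.addArguments`, which also accepts a `List<String>`. Keeping the sandbox enabled for non-root users avoids switching off a security feature where it is not required.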
Java scraping and parsing code:
    import java.math.BigDecimal;
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;

    private WebDriver webDriver;

    public XXXSpider() {
        String driver = System.getProperty("webdriver.chrome.driver");
        if (driver == null) {
            logger.info("webdriver.chrome.driver is not set");
            // Fall back to a local chromedriver binary.
            System.setProperty("webdriver.chrome.driver", "/Users/chengpanwang/Downloads/chromedriver");
        } else {
            logger.info("driver: {}", driver);
        }
    }

    public BigDecimal pageDetail(String url) {
        logger.info("detail page: {}", url);
        ........
        try {
            // Headless Chrome with the options discussed above.
            ChromeOptions options = new ChromeOptions();
            options.addArguments("--headless");
            options.addArguments("--disable-gpu");
            options.addArguments("--no-sandbox");
            webDriver = new ChromeDriver(options);
            webDriver.get(url);

            WebElement webElement = webDriver.findElement(By.xpath("/html"));
            WebElement roleSkill = webElement.findElement(By.id("role_skill"));
            logger.info(roleSkill.getText());
            logger.info("selecting the skill tab");
            roleSkill.click();

            // Each table cell holds a level (<p>) and a title (<h5>).
            WebElement skillTb = webElement.findElement(By.className("skillTb"));
            for (WebElement item : skillTb.findElements(By.tagName("td"))) {
                String level = item.findElement(By.tagName("p")).getText();
                String h5 = item.findElement(By.tagName("h5")).getText();
                .... business-specific code
            }
        } catch (Exception e) {
            logger.error("", e);
        } finally {
            // Release the browser even when scraping fails; quit() also ends the driver session.
            if (webDriver != null) {
                webDriver.quit();
            }
        }
        return price;
    }