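【Preface】
The US-China trade war has been a hot topic lately, so I tried scraping some of the comments about it on Zhihu.
Difficulties: Zhihu's cookies have recently become much more complicated, so I log in directly with an account and password instead. Zhihu's front end has also moved to a React stack, which makes selecting page objects considerably harder.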
【Result】
Log in with account and password -- simulate scrolling to load more content -- fetch the answer elements and print them
【Code】
    import java.util.List;

    import org.openqa.selenium.By;
    import org.openqa.selenium.Keys;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.firefox.FirefoxDriver;
    import org.openqa.selenium.interactions.Actions;

    public class TradeWar {
        public static void main(String[] args) throws InterruptedException {
            System.setProperty("webdriver.gecko.driver", "C:\\code\\selenium\\geckodriver.exe");
            WebDriver driver = new FirefoxDriver();
            Actions action = new Actions(driver);

            // Open the Zhihu sign-in page
            driver.get("https://www.zhihu.com/#signin");
            driverWait(driver, 2000);

            // Switch to the login form, then enter the account and password
            // ("ACCOUNT" and "PASSWORD" are placeholders for real credentials)
            driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[2]/span")).click();
            driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/div[1]/div[2]/div[1]/input")).sendKeys("ACCOUNT");
            driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/div[2]/div/div[1]/input")).sendKeys("PASSWORD");
            driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/button")).click();

            // Go to the top answers of the trade-war topic
            driver.get("https://www.zhihu.com/topic/20177825/top-answers");

            // Scroll down to load enough content; the count can be raised to 10000+ if needed
            for (int i = 0; i < 100; i++) {
                Thread.sleep(100);
                action.sendKeys(Keys.ARROW_DOWN).perform();
            }

            // Fetch the content and print it
            System.out.println("Start printing");
            List<WebElement> answers = driver.findElements(By.cssSelector("a[target='_blank']"));
            for (WebElement element : answers) {
                String answer = element.getText();
                System.out.println("[Answer] " + answer + "\n");
            }
        }

        // Pause the current thread for the given number of milliseconds.
        // Thread.sleep replaces the original driver.wait(time), which only worked
        // as a sleep because nothing ever called notify() on the driver object.
        public static void driverWait(WebDriver driver, long time) {
            try {
                System.out.println("begin wait() ThreadName=" + Thread.currentThread().getName());
                Thread.sleep(time);
                System.out.println("  end wait() ThreadName=" + Thread.currentThread().getName());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                e.printStackTrace();
            }
        }
    }
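The fixed pauses in the code above (the 2000 ms wait after opening the sign-in page, the 100 ms sleep between key presses) either wait too long or not long enough, depending on how fast the page loads. A possible alternative is to poll a condition until it holds or a timeout expires. The sketch below uses only the Java standard library; `PollingWait` and `waitUntil` are hypothetical names, not part of Selenium or of the original code. In a real Selenium script, `WebDriverWait` together with `ExpectedConditions` plays this role.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper (not part of Selenium): polls a condition instead of sleeping a fixed time.
final class PollingWait {
    private PollingWait() {}

    /**
     * Checks the condition every pollMillis until it returns true or
     * timeoutMillis has elapsed. Returns true if the condition was met,
     * false on timeout or if the thread was interrupted.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 300 ms after start.
        // In the scraper this could be "the login form is present" or
        // "the number of loaded answers has grown".
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 2000, 50);
        System.out.println("ready = " + ready);
    }
}
```

With such a helper, the scroll loop could stop as soon as enough elements have loaded instead of always sending a fixed number of key presses.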
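【Comparison with before】
1. The cookies I used to get carried no expiry time. Now they look like the line below, cookie-based login no longer works, and Zhihu is still changing its front end.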
_zap,469c025b-7e65-4f9f-a00c-75f4cdf7e2ee,.zhihu.com,/,Mon Apr 13 19:28:59 CST 2020
2. The class names below used to locate the page elements. Since the front end moved to React, none of them match anymore.
    // Get the questions and answers
    List<WebElement> questions = driver.findElements(By.className("question_link"));
    List<WebElement> answers = driver.findElements(By.className("zm-item-rich-text"));
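To make the cookie line in item 1 easier to read, here is a small sketch that splits it into fields. The field order (name, value, domain, path, expiry) is my assumption based on how the line looks; the last field appears to be in the format of Java's `Date.toString()`. `CookieLine` is a hypothetical helper class, not a Selenium type, and it assumes no field contains a comma.

```java
// Hypothetical helper for inspecting a dumped cookie line of the assumed form
// name,value,domain,path[,expiry]
final class CookieLine {
    final String name;
    final String value;
    final String domain;
    final String path;
    final String expiry; // empty string when the line carries no expiry field

    private CookieLine(String name, String value, String domain, String path, String expiry) {
        this.name = name;
        this.value = value;
        this.domain = domain;
        this.path = path;
        this.expiry = expiry;
    }

    /** Splits a comma-separated cookie line into its four or five fields. */
    static CookieLine parse(String line) {
        // Limit 5 keeps everything after the fourth comma in the expiry field.
        String[] parts = line.split(",", 5);
        if (parts.length < 4) {
            throw new IllegalArgumentException("expected at least 4 fields: " + line);
        }
        String expiry = parts.length == 5 ? parts[4].trim() : "";
        return new CookieLine(parts[0].trim(), parts[1].trim(), parts[2].trim(), parts[3].trim(), expiry);
    }

    /** True when the line includes a non-empty expiry field. */
    boolean hasExpiry() {
        return !expiry.isEmpty();
    }

    public static void main(String[] args) {
        CookieLine c = parse("_zap,469c025b-7e65-4f9f-a00c-75f4cdf7e2ee,.zhihu.com,/,Mon Apr 13 19:28:59 CST 2020");
        System.out.println(c.name + " on " + c.domain + c.path + ", has expiry: " + c.hasExpiry());
    }
}
```

A check like `hasExpiry()` separates the older cookies without a time from the newer ones that carry one, which is the difference described in item 1.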