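Preface: I recently bought a Kindle and found that the books on Amazon are of good quality, and every now and then a well-regarded title goes on sale at a low price. Amazon has no good deal-alert feature, though, and checking the site myself every day is tiring. So I wrote a simple Java-based crawler that monitors Amazon books: whenever a book goes on special offer, it automatically sends an email to a specified address.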
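Here is a brief walk-through of how it works. This post only explains the approach; if you need the complete project, skip to the end of the post.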
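The URL class and the URLConnection object it returns are used to access the site and fetch the data. (A small tip: when requesting Amazon, if the request headers do not include Accept-Encoding: gzip, deflate, br, access gets refused with a 503 after only a few requests. Once the header is added, the returned data is GZIP-compressed and has to be read through a GZIPInputStream, otherwise what you read is garbage.) Only part of the code is excerpted here, so some pieces are missing; it is enough to follow the idea.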
// Needs java.net.URL, java.net.URLConnection, java.io.*, java.util.zip.GZIPInputStream
this.url = new URL("https://www.amazon.cn/gp/bestsellers/digital-text");
// Open a connection
URLConnection connection = this.url.openConnection();
// Set the request headers to avoid a 503
connection.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
connection.setRequestProperty("Accept-Encoding", "gzip, deflate, br");
connection.setRequestProperty("Accept-Language", "zh-CN,zh;q=0.9");
connection.setRequestProperty("Host", "www.amazon.cn");
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36");
// Connect
connection.connect();
// Fetch the data; the server sends it GZIP-compressed, so read it with the matching stream
BufferedInputStream bis = new BufferedInputStream(new GZIPInputStream(connection.getInputStream()));
ByteArrayOutputStream baos = new ByteArrayOutputStream();
// Buffer and byte count (declarations missing from the excerpt; the buffer size is assumed)
byte[] tmp = new byte[1024];
int len;
// Read the data
while ((len = bis.read(tmp)) != -1) {
    baos.write(tmp, 0, len);
}
this.rawData = new String(baos.toByteArray(), "UTF-8");
bis.close();
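The decompression step can also be tried without touching the network. The following is a minimal sketch, not code from the project: `ResponseDecoder`, `decode`, and `gzip` are hypothetical names, and the sketch assumes the body should be unwrapped only when the Content-Encoding header actually says gzip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not part of the original project.
class ResponseDecoder {

    // Reads the whole body as UTF-8 text, unwrapping gzip only when the
    // Content-Encoding value says the body is gzip-compressed.
    static String decode(InputStream body, String contentEncoding) {
        try (InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(body) : body) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int len;
            while ((len = in.read(buf)) != -1) {
                out.write(buf, 0, len);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: gzip-compresses a string in memory to imitate a server response.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<li>9.99</li>";
        // Compressed body with a gzip header value, and a plain body with none.
        String fromGzip = decode(new ByteArrayInputStream(gzip(html)), "gzip");
        String fromPlain = decode(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)), null);
        System.out.println(fromGzip.equals(html) && fromPlain.equals(html));
    }
}
```

With a real connection the call would presumably look like `ResponseDecoder.decode(connection.getInputStream(), connection.getContentEncoding())`. Because the request also advertises `br`, a server could in principle answer with Brotli, which this sketch does not decode; it would pass those bytes through unchanged.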
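The book names and prices are then pulled out of the raw page with regular expressions:
// First use a regular expression to grab each individual li tag
Pattern p1 = Pattern.compile("<li class=\"zg-item-immersion\"[\\s\\S]+?</li>");
// Pattern for the name and price inside a single li tag
Pattern p2 = Pattern.compile("alt=\"([\\u4E00-\\u9FA5:—,0-9a-zA-Z]+)[\\s\\S]+?¥(\\d{1,2}\\.\\d{2})");
Matcher m1 = p1.matcher(this.rawData == null ? "" : this.rawData);
while (m1.find()) {
    // Extract the name and price from this li tag
    Matcher m2 = p2.matcher(m1.group());
    while (m2.find()) {
        // Take the name first
        String name = m2.group(1);
        // Then the price
        double price = Double.parseDouble(m2.group(2));
        // If a book with the same name is already recorded, keep only the lower price
        if (this.destData.containsKey(name)) {
            double oldPrice = this.destData.get(name).getPrice();
            price = Math.min(oldPrice, price);
        }
        // Put the data into the Map (Price is a class from the full project)
        this.destData.put(name, new Price(price, new Date()));
    }
}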
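To see the extraction step run on its own, here is a small self-contained sketch. It is a simplified stand-in rather than the project's code: `BestsellerParser` and `parse` are hypothetical names, the sample HTML is made up, and the result is a plain map of name to price instead of the project's `Price` objects. The title pattern is also loosened to any non-quote characters, and either yen-sign code point is accepted. Both are assumptions of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for the parsing step; not the project's code.
class BestsellerParser {

    // One <li class="zg-item-immersion"> ... </li> block per book.
    private static final Pattern ITEM =
            Pattern.compile("<li class=\"zg-item-immersion\"[\\s\\S]+?</li>");

    // Title from the alt attribute, then the first price after a yen sign.
    // Assumption: any non-quote characters form the title, and both U+00A5
    // and U+FFE5 count as the yen sign.
    private static final Pattern NAME_AND_PRICE =
            Pattern.compile("alt=\"([^\"]+)\"[\\s\\S]+?[\u00A5\uFFE5](\\d{1,2}\\.\\d{2})");

    // Returns name -> lowest price seen for that name.
    static Map<String, Double> parse(String html) {
        Map<String, Double> prices = new LinkedHashMap<>();
        Matcher item = ITEM.matcher(html == null ? "" : html);
        while (item.find()) {
            Matcher m = NAME_AND_PRICE.matcher(item.group());
            if (m.find()) {
                String name = m.group(1);
                double price = Double.parseDouble(m.group(2));
                // Keep the lower price when the same name appears again.
                prices.merge(name, price, Math::min);
            }
        }
        return prices;
    }

    public static void main(String[] args) {
        // Made-up sample markup; the real page structure may differ.
        String html =
                "<li class=\"zg-item-immersion\"><img alt=\"Book A\"/><span>\uFFE512.99</span></li>"
              + "<li class=\"zg-item-immersion\"><img alt=\"Book B\"/><span>\uFFE53.50</span></li>"
              + "<li class=\"zg-item-immersion\"><img alt=\"Book A\"/><span>\uFFE59.99</span></li>";
        System.out.println(parse(html));
    }
}
```

Compiling the two patterns once as constants avoids recompiling them for every list item, and `Map.merge` with `Math::min` expresses the "keep the lower price" rule in one call.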
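I have put the complete project on my GitHub, along with more details (how to configure it and how to use it). If you are interested, feel free to drop by!
Repository: https://github.com/horvey/Amazon-BookMonitor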