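Preface: At the end of 2015 I wrote an article about using basic Java network programming to scrape the download links of every 2015 movie on a video site. Looking back, parts of that code were more convoluted than they needed to be, so ever since I came across a better tool (the webmagic framework) I have wanted to refactor it. This article is the result.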
Note: the earlier article is here: [https://www.zifangsky.cn/244.html](https://www.zifangsky.cn/244.html)
Let's walk through how to build this crawler.
First, the crawler's start page is: http://www.80s.tw/movie/list/-2016---
On this movie list page, the URL of each movie detail page can be selected with the XPath expression: //ul[@class='me1 clearfix']/li/a/@href
Note: for an introduction to XPath syntax, see: www.w3school.com.cn/xpath/xpath…
Likewise, the links to the other list pages (the pager) can be selected with: //div[@class='pager']/a/@href
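To see what these two expressions select, here is a minimal sketch that evaluates them with the JDK's built-in `javax.xml.xpath` API. The markup in `SAMPLE` is a hand-written, well-formed stand-in (an assumption, not the site's real HTML), and the class and method names are made up for illustration. Real pages are rarely well-formed XML, so this is only a way to experiment with the expressions, not a replacement for an HTML-tolerant parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ListPageXPathDemo {

    // Hypothetical list page fragment; the hrefs are invented for the demo.
    static final String SAMPLE =
            "<html><body>"
            + "<ul class='me1 clearfix'>"
            + "<li><a href='/movie/17807'>Movie A</a></li>"
            + "<li><a href='/movie/17808'>Movie B</a></li>"
            + "</ul>"
            + "<div class='pager'><a href='/movie/list/-2016----p2'>2</a></div>"
            + "</body></html>";

    /** Evaluates an XPath expression against SAMPLE and returns the selected node values. */
    static List<String> select(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or input: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Links to movie detail pages
        System.out.println(select("//ul[@class='me1 clearfix']/li/a/@href"));
        // Links to the other list pages
        System.out.println(select("//div[@class='pager']/a/@href"));
    }
}
```

Note that `@class='me1 clearfix'` compares the whole attribute string, so the expression only matches when the class attribute is written exactly that way.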
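So far we have the XPath expressions for finding, on a list page, both the movie detail pages and the other list pages. On a movie detail page (for example http://www.80s.tw/movie/17807), how do we get the movie's name and download link? Let's analyze that next.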
The movie's name: //div[@class='info']/h1/text()
Next, let's see how to get the movie's download link:
//li[@class='clearfix dlurlelement backcolor1']/span[@class='dlname nm']/input/@value
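The detail page expressions can be tried out the same way. The sketch below uses the JDK's `javax.xml.xpath` API against an invented, well-formed fragment; the second row with class `backcolor2` and the `thunder://` values are assumptions added only to show how exact class matching behaves. It also shows a looser `contains(@class, ...)` variant, which is my own alternative, not the expression used in this article.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DetailPageXPathDemo {

    // Hypothetical detail page fragment; names, classes on the second row and links are invented.
    static final String SAMPLE =
            "<html><body>"
            + "<div class='info'><h1>Sample Movie</h1></div>"
            + "<ul>"
            + "<li class='clearfix dlurlelement backcolor1'>"
            + "<span class='dlname nm'><input value='thunder://link-one'/></span></li>"
            + "<li class='clearfix dlurlelement backcolor2'>"
            + "<span class='dlname nm'><input value='thunder://link-two'/></span></li>"
            + "</ul>"
            + "</body></html>";

    /** Evaluates an XPath expression against SAMPLE and returns the selected node values. */
    static List<String> select(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or input: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Movie name
        System.out.println(select("//div[@class='info']/h1/text()"));
        // Exact class match: only rows whose class is exactly this string
        System.out.println(select(
                "//li[@class='clearfix dlurlelement backcolor1']/span[@class='dlname nm']/input/@value"));
        // Looser match (my variant): any row whose class contains 'dlurlelement'
        System.out.println(select(
                "//li[contains(@class,'dlurlelement')]/span[@class='dlname nm']/input/@value"));
    }
}
```

With this sample, the exact-match expression selects only the first row, while the `contains` variant selects both. Whether the real page has rows with other class values is something to check in the browser before choosing between them.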
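We now have XPath expressions for every piece of information the code needs, so we can start writing the implementation.
For the implementation I decided to use the webmagic framework, because it is much simpler than using basic Java network programming.
Note: for the basics of the webmagic framework, see this earlier article of mine: www.zifangsky.cn/853.html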
package cn.zifangsky.webmagic.movie;

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class MovieSpider implements PageProcessor {

    private Site site = Site.me().setTimeOut(30000).setRetryTimes(3)
            .setSleepTime(1000).setCharset("UTF-8");

    @Override
    public Site getSite() {
        // Only treat HTTP 200 responses as successful
        Set<Integer> acceptStatCode = new HashSet<>();
        acceptStatCode.add(200);
        site = site.setAcceptStatCode(acceptStatCode).addHeader("Accept-Encoding", "/")
                .setUserAgent(UserAgentUtils.radomUserAgent());
        return site;
    }

    @Override
    public void process(Page page) {
        String url = page.getUrl().toString();
        // List pages (dots escaped so they match only a literal '.')
        Pattern pattern1 = Pattern.compile("http://www\\.80s\\.tw/movie/list/-2016---(-p\\d*)?");
        Matcher matcher1 = pattern1.matcher(url);
        // Movie detail pages
        Pattern pattern2 = Pattern.compile("/movie/\\d+");
        Matcher matcher2 = pattern2.matcher(url);

        if (matcher1.find()) { // list page
            // URLs of the movie detail pages
            List<String> moviePageUrls = page.getHtml().xpath("//ul[@class='me1 clearfix']/li/a/@href").all();
            if (moviePageUrls != null && moviePageUrls.size() > 0) {
                // Queue every movie page linked from the current list page
                page.addTargetRequests(moviePageUrls);
            }
            // Links to the other list pages found on the current list page
            List<String> listUrls = page.getHtml().xpath("//div[@class='pager']/a/@href").all();
            if (listUrls != null && listUrls.size() > 0) {
                page.addTargetRequests(listUrls);
            }
        } else if (matcher2.find()) { // movie detail page
            // Get the movie's name
            String movieName = page.getHtml().xpath("//div[@class='info']/h1/text()").toString();
            // Get the movie's download link
            String movieUrl = page.getHtml().xpath("//li[@class='clearfix dlurlelement backcolor1']/span[@class='dlname nm']/input/@value").toString();
            Movie movie = new Movie(movieName, movieUrl);
            page.putField("movie", movie); // persisted later by a pipeline
        }
    }
}
The XPath expressions in the code were all covered above, and the rest can be understood from the comments, so I won't explain it further here.
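The one piece of logic worth isolating is how a URL is classified as a list page or a detail page. The following is a minimal standalone sketch of that step using only `java.util.regex`; the `UrlClassifierDemo` class, the `PageType` enum and the page-2 URL are hypothetical names and values introduced for illustration. Compiling the patterns once as constants, rather than on every page, is a small design choice of this sketch.

```java
import java.util.regex.Pattern;

class UrlClassifierDemo {

    enum PageType { LIST, DETAIL, OTHER }

    // Compiled once; dots are escaped so "." matches only a literal dot.
    private static final Pattern LIST_PAGE =
            Pattern.compile("http://www\\.80s\\.tw/movie/list/-2016---(-p\\d*)?");
    private static final Pattern DETAIL_PAGE = Pattern.compile("/movie/\\d+");

    /** Checks the list pattern first, then the detail pattern, using find() as the spider does. */
    static PageType classify(String url) {
        if (LIST_PAGE.matcher(url).find()) {
            return PageType.LIST;
        }
        if (DETAIL_PAGE.matcher(url).find()) {
            return PageType.DETAIL;
        }
        return PageType.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("http://www.80s.tw/movie/list/-2016---"));
        // Assumed shape of a paginated list URL
        System.out.println(classify("http://www.80s.tw/movie/list/-2016----p2"));
        System.out.println(classify("http://www.80s.tw/movie/17807"));
    }
}
```

A list URL never reaches the detail check, and it would not match it anyway, because `/movie/` is followed by `list` rather than digits.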
The Movie entity class:
package cn.zifangsky.webmagic.movie;

public class Movie {

    private String movieName;
    private String movieLink;

    public Movie() {
    }

    public Movie(String movieName, String movieLink) {
        this.movieName = movieName;
        this.movieLink = movieLink;
    }

    public String getMovieName() {
        return movieName;
    }

    public void setMovieName(String movieName) {
        this.movieName = movieName;
    }

    public String getMovieLink() {
        return movieLink;
    }

    public void setMovieLink(String movieLink) {
        this.movieLink = movieLink;
    }

    @Override
    public String toString() {
        return "Movie [movieName=" + movieName + ", movieLink=" + movieLink + "]";
    }
}
UserAgentUtils.java:
package cn.zifangsky.webmagic.movie;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class UserAgentUtils {

    /**
     * Returns one User-Agent picked at random from a predefined list.
     * @return a User-Agent string
     */
    public static String radomUserAgent() {
        List<String> list = new ArrayList<>();
        list.add("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36");
        list.add("Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.04");
        list.add("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36");
        list.add("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0");
        list.add("Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/4.0; InfoPath.2; SV1; .NET CLR 2.0.50727; WOW64)");
        list.add("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36");
        list.add("Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko");
        list.add("Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)");
        list.add("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Firefox/24.0");
        list.add("Mozilla/5.0 (X11; Linux i686; rv:40.0) Gecko/20100101 Firefox/40.0");
        list.add("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36");
        list.add("Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)");
        list.add("Opera/9.80 (X11; Linux i686; U; ru) Presto/2.8.131 Version/11.11");
        list.add("Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25");
        list.add("Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.4; InfoPath.2; SV1; .NET CLR 3.3.69573; WOW64; en-US)");
        list.add("Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1");

        Random random = new Random();
        return list.get(random.nextInt(list.size()));
    }
}
Note: setting the User-Agent is optional and can be left out.
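If you do keep the random User-Agent, one possible variation is to build the list once instead of on every call. The sketch below is an alternative I am suggesting, not the code used in this article; `UserAgentPool` is a hypothetical class name, and only two of the strings from the list above are copied in to keep it short. It uses `ThreadLocalRandom`, which avoids creating a new `Random` per call.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class UserAgentPool {

    // Built once at class load; two entries copied from the full list for brevity.
    private static final List<String> USER_AGENTS = Collections.unmodifiableList(Arrays.asList(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0",
            "Mozilla/5.0 (X11; Linux i686; rv:40.0) Gecko/20100101 Firefox/40.0"));

    /** Returns the read-only list of candidate User-Agent strings. */
    static List<String> all() {
        return USER_AGENTS;
    }

    /** Returns one entry chosen uniformly at random. */
    static String random() {
        int index = ThreadLocalRandom.current().nextInt(USER_AGENTS.size());
        return USER_AGENTS.get(index);
    }

    public static void main(String[] args) {
        System.out.println(random());
    }
}
```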
package cn.zifangsky.webmagic.movie;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public class SaveDataPipeline implements Pipeline {

    /**
     * Persists the scraped data.
     */
    @Override
    public void process(ResultItems resultItems, Task task) {
        Movie movie = resultItems.get("movie");
        if (movie != null) {
            // try-with-resources closes the statement and connection even if the insert fails
            try (Connection connection = JDBCConnection.getConnection();
                 PreparedStatement pStatement = connection
                         .prepareStatement("insert into movie(MovieName,MovieLink) values(?,?)")) {
                pStatement.setString(1, movie.getMovieName());
                pStatement.setString(2, movie.getMovieLink());
                pStatement.executeUpdate();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }
}
This pipeline simply uses plain JDBC to save the data to the database. The code that obtains the JDBC connection is:
package cn.zifangsky.webmagic.movie;

import java.sql.Connection;
import java.sql.DriverManager;

public class JDBCConnection {

    private static final String driver = "com.mysql.jdbc.Driver";
    private static final String url = "jdbc:mysql://127.0.0.1:3306/movie?useUnicode=true&characterEncoding=utf-8";
    private static final String username = "root";
    private static final String password = "root";

    public static Connection getConnection() {
        try {
            Class.forName(driver);
            return DriverManager.getConnection(url, username, password);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}
Likewise, the SQL that creates the corresponding table is:
SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for movie
-- ----------------------------
DROP TABLE IF EXISTS `movie`;
CREATE TABLE `movie` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `MovieName` varchar(500) DEFAULT NULL,
  `MovieLink` varchar(500) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
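Both text columns are `varchar(500)`, and an XPath selection that matches nothing may come back as null or empty, so it can be worth checking a record before inserting it. The helper below is a minimal sketch of such a check; `MovieRecordCheck` and `isStorable` are hypothetical names that do not appear in the original code, and the length test is only approximate because `String.length()` counts UTF-16 units rather than database characters.

```java
class MovieRecordCheck {

    // Mirrors varchar(500) in the table definition above
    static final int MAX_COLUMN_LENGTH = 500;

    /** Returns true when both values are non-blank and short enough for their columns. */
    static boolean isStorable(String movieName, String movieLink) {
        return fits(movieName) && fits(movieLink);
    }

    private static boolean fits(String value) {
        return value != null
                && !value.trim().isEmpty()
                && value.length() <= MAX_COLUMN_LENGTH;
    }

    public static void main(String[] args) {
        System.out.println(isStorable("Sample Movie", "thunder://example-link"));
        System.out.println(isStorable(null, "thunder://example-link"));
    }
}
```

A pipeline could call such a check before building the insert statement and skip records that fail it, so that pages where extraction failed do not produce rows with empty columns.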
package cn.zifangsky.webmagic.movie;

import org.junit.Test;

import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;

public class TestSpider {

    @Test
    public void saveMovie() {
        OOSpider.create(new MovieSpider())
                .addUrl("http://www.80s.tw/movie/list/-2016---")
                .addPipeline(new ConsolePipeline())
                .addPipeline(new SaveDataPipeline())
                .thread(5)
                .run();
    }
}
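Run this unit test, wait a while, and then check the database: the movies' download links should all have been saved.
Finally, I have uploaded the scraped results to a network drive. If you are interested, you can download them here: pan.baidu.com/s/1pLwbXSf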