This article shows how to use SeimiCrawler to extract information from pages as structured data and store it in a database, which is one of the most common use cases. The example crawls blog posts from cnblogs (博客园).
For the demo, to keep things simple, all we need is a table that stores the two key pieces of information: the blog title and its content. The table definition:
```sql
CREATE TABLE `blog` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(300) DEFAULT NULL,
  `content` text,
  `update_time` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```
Also create a matching Bean class:
```java
package cn.wanghaomiao.model;

import cn.wanghaomiao.seimi.annotation.Xpath;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.builder.ToStringBuilder;

/**
 * XPath syntax reference: http://jsoupxpath.wanghaomiao.cn/
 */
public class BlogContent {

    @Xpath("//h1[@class='postTitle']/a/text()|//a[@id='cb_post_title_url']/text()")
    private String title;

    // Equivalent alternative: @Xpath("//div[@id='cnblogs_post_body']//text()")
    @Xpath("//div[@id='cnblogs_post_body']/allText()")
    private String content;

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    @Override
    public String toString() {
        if (StringUtils.isNotBlank(content) && content.length() > 100) {
            // truncate so the log output stays readable
            this.content = StringUtils.substring(content, 0, 100) + "...";
        }
        return ToStringBuilder.reflectionToString(this);
    }
}
```
The @Xpath annotation deserves special attention: it declares the XPath extraction rule for the field it annotates. As described below, SeimiCrawler calls Response.render(Class&lt;T&gt; bean) to parse the page and populate those fields automatically. For developers, this is essentially all the work required to extract structured data; a short sketch follows, and the rest of the article shows how everything is wired together.
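A minimal sketch (not part of the original demo) of what the render call does, placed next to a roughly equivalent manual JsoupXpath query for the title field using the same XPath expression declared on the bean. The helper class and method names here are illustrative only:

```java
package cn.wanghaomiao.model;

import cn.wanghaomiao.seimi.struct.Response;
import cn.wanghaomiao.xpath.model.JXDocument;

import java.util.List;

// Illustrative helper, not part of the demo project.
public class RenderSketch {

    public static void sketch(Response response) {
        try {
            // Automatic mapping: every @Xpath rule on BlogContent is evaluated
            // against the page and the matching field is populated.
            BlogContent blog = response.render(BlogContent.class);

            // Roughly equivalent manual extraction for the title field, using
            // the same XPath expression declared on the bean.
            JXDocument doc = response.document();
            List<Object> titles = doc.sel(
                    "//h1[@class='postTitle']/a/text()|//a[@id='cb_post_title_url']/text()");
            String title = titles.isEmpty() ? null : titles.get(0).toString();
            System.out.println("render: " + blog.getTitle() + " / manual: " + title);
        } catch (Exception e) {
            // render(...) and sel(...) may throw; ignored here as in the demo callbacks
        }
    }
}
```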
For persistence this article uses paoding-jade, an ORM framework open-sourced by Renren in its early days. Because SeimiCrawler's object pool and dependency management are built on Spring, SeimiCrawler naturally works with any ORM framework that integrates with Spring. To enable Jade, add the following POM dependencies:
```xml
<dependency>
    <groupId>net.paoding</groupId>
    <artifactId>paoding-rose-jade</artifactId>
    <version>2.0.u01</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-dbcp2</artifactId>
    <version>2.1.1</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.37</version>
</dependency>
```
Then add a seimi-jade.xml configuration file under resources:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">
    <bean id="dataSource" class="org.apache.commons.dbcp2.BasicDataSource">
        <property name="driverClassName" value="com.mysql.jdbc.Driver" />
        <property name="url" value="jdbc:mysql://127.0.0.1:3306/xiaohuo?useUnicode=true&amp;characterEncoding=UTF8&amp;autoReconnect=true&amp;autoReconnectForPools=true&amp;zeroDateTimeBehavior=convertToNull" />
        <property name="username" value="xx" />
        <property name="password" value="xx" />
    </bean>
    <!-- Enable Jade -->
    <bean class="net.paoding.rose.jade.context.spring.JadeBeanFactoryPostProcessor" />
</beans>
```
Next, write the DAO:
```java
package cn.wanghaomiao.dao;

import cn.wanghaomiao.model.BlogContent;
import net.paoding.rose.jade.annotation.DAO;
import net.paoding.rose.jade.annotation.ReturnGeneratedKeys;
import net.paoding.rose.jade.annotation.SQL;

@DAO
public interface StoreToDbDAO {

    @ReturnGeneratedKeys
    @SQL("insert into blog (title,content,update_time) values (:1.title,:1.content,now())")
    public int save(BlogContent blog);
}
```
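In the @SQL statement, `:1.title` and `:1.content` read the corresponding getters of the first method argument, and @ReturnGeneratedKeys makes save(...) return the generated auto-increment id. As a purely hypothetical extension (not in the demo), a read method could bind named scalar parameters via Jade's @SQLParam annotation; treat the annotation and column-to-property mapping shown here as assumptions to verify against the Jade documentation:

```java
// Hypothetical additional DAO (not part of the demo); @SQLParam and the
// column-to-property mapping are assumptions about paoding-jade's API.
package cn.wanghaomiao.dao;

import cn.wanghaomiao.model.BlogContent;
import net.paoding.rose.jade.annotation.DAO;
import net.paoding.rose.jade.annotation.SQL;
import net.paoding.rose.jade.annotation.SQLParam;

@DAO
public interface BlogQueryDAO {

    // Selected columns are mapped onto BlogContent properties by name.
    @SQL("select title, content from blog where id = :id")
    BlogContent findById(@SQLParam("id") int id);
}
```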
With storage taken care of, the next piece is our crawler rule class. Here it is in full:
```java
package cn.wanghaomiao.crawlers;

import cn.wanghaomiao.dao.StoreToDbDAO;
import cn.wanghaomiao.model.BlogContent;
import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.struct.Request;
import cn.wanghaomiao.seimi.struct.Response;
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.xpath.model.JXDocument;
import org.springframework.beans.factory.annotation.Autowired;

import java.util.List;

/**
 * Store the extracted data directly into the database.
 */
@Crawler(name = "storedb")
public class DatabaseStoreDemo extends BaseSeimiCrawler {

    @Autowired
    private StoreToDbDAO storeToDbDAO;

    @Override
    public String[] startUrls() {
        return new String[]{"http://www.cnblogs.com/"};
    }

    @Override
    public void start(Response response) {
        JXDocument doc = response.document();
        try {
            // Collect the post links from the cnblogs front page and queue
            // each one for the renderBean callback.
            List<Object> urls = doc.sel("//a[@class='titlelnk']/@href");
            logger.info("{}", urls.size());
            for (Object s : urls) {
                push(Request.build(s.toString(), "renderBean"));
            }
        } catch (Exception e) {
            // ignore
        }
    }

    public void renderBean(Response response) {
        try {
            // Populate BlogContent from the @Xpath rules declared on the bean.
            BlogContent blog = response.render(BlogContent.class);
            logger.info("bean resolve res={},url={}", blog, response.getUrl());
            // Persist to the DB via paoding-jade.
            int blogId = storeToDbDAO.save(blog);
            logger.info("store sus,blogId = {}", blogId);
        } catch (Exception e) {
            // ignore
        }
    }
}
```
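Finally, the crawler needs to be launched. The complete demo linked below ships an entry point; a minimal sketch of the usual SeimiCrawler bootstrap is shown here, with the Seimi class and its package assumed from the project's README rather than taken from this article:

```java
package cn.wanghaomiao.main;  // illustrative package name

// Assumed bootstrap class from the SeimiCrawler README; verify the exact
// package (typically cn.wanghaomiao.seimi.core.Seimi) against the demo project.
import cn.wanghaomiao.seimi.core.Seimi;

public class Boot {
    public static void main(String[] args) {
        Seimi s = new Seimi();
        // Start the crawler registered as @Crawler(name = "storedb").
        s.start("storedb");
    }
}
```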
A complete demo is also available on GitHub; feel free to download it and try it for yourself.