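Quite a few people have recently been asking about integrating SeimiCrawler with MyBatis, so this post is offered as a starting point. If you are not yet familiar with SeimiCrawler, it also serves as a brief introduction.
SeimiCrawler is an agile, independently deployable Java crawler framework with distributed support. It aims to lower, as far as possible, the barrier for newcomers to build a crawler system that is highly usable and performs reasonably well, and to make crawler development more efficient. In the world of SeimiCrawler, most people only need to care about writing the business logic for crawling; Seimi takes care of the rest. In terms of design, SeimiCrawler is inspired by Scrapy, the Python crawler framework, while blending in the characteristics of the Java language and the features of Spring. It also hopes to make parsing HTML with the more efficient XPath more convenient and more widespread in China, so SeimiCrawler's default HTML parser is JsoupXpath (an independent extension project, not bundled with jsoup), and by default all HTML data extraction is done with XPath (of course, you are free to choose another parser for data processing). Combined with SeimiAgent, it also thoroughly solves the problem of crawling complex, dynamically rendered pages.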
The project is hosted on GitHub.
Now for the MyBatis integration itself. MySQL is used as the example database.
    <dependency>
        <groupId>cn.wanghaomiao</groupId>
        <artifactId>SeimiCrawler</artifactId>
        <version>1.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-dbcp2</artifactId>
        <version>2.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-pool2</artifactId>
        <version>2.4.2</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.37</version>
    </dependency>
    <dependency>
        <groupId>org.mybatis</groupId>
        <artifactId>mybatis-spring</artifactId>
        <version>1.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.mybatis</groupId>
        <artifactId>mybatis</artifactId>
        <version>3.4.1</version>
    </dependency>
Assume a database named xiaohuo already exists and contains the following table:
    CREATE TABLE `blog` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `title` varchar(300) DEFAULT NULL,
      `content` text,
      `update_time` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
    package cn.wanghaomiao.model;

    import cn.wanghaomiao.seimi.annotation.Xpath;
    import org.apache.commons.lang3.StringUtils;
    import org.apache.commons.lang3.builder.ToStringBuilder;

    /**
     * For XPath syntax, see http://jsoupxpath.wanghaomiao.cn/
     * @since 2015/10/27.
     */
    public class BlogContent {
        private Integer id;

        @Xpath("//h1[@class='postTitle']/a/text()|//a[@id='cb_post_title_url']/text()")
        private String title;

        // This could also be written as @Xpath("//div[@id='cnblogs_post_body']//text()")
        @Xpath("//div[@id='cnblogs_post_body']/allText()")
        private String content;

        public Integer getId() {
            return id;
        }

        public void setId(Integer id) {
            this.id = id;
        }

        public String getTitle() {
            return title;
        }

        public void setTitle(String title) {
            this.title = title;
        }

        public String getContent() {
            return content;
        }

        public void setContent(String content) {
            this.content = content;
        }

        @Override
        public String toString() {
            if (StringUtils.isNotBlank(content) && content.length() > 100) {
                // Truncate for easier viewing in logs (note: this overwrites the content field)
                this.content = StringUtils.substring(content, 0, 100) + "...";
            }
            return ToStringBuilder.reflectionToString(this);
        }
    }
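One thing worth noticing in the toString() above is that it shortens the content field itself before printing. If the bean is logged before it is saved, the shortened text may be what ends up stored. As a minimal sketch of a non-mutating alternative, the hypothetical helper below (the name Preview is made up for illustration and is not part of SeimiCrawler or commons-lang3) returns a shortened copy and leaves the original string alone.

```java
// Hypothetical helper for illustration only; not part of SeimiCrawler or commons-lang3.
final class Preview {
    private Preview() {
    }

    /** Returns at most maxLen characters of text, followed by "..." if it was cut. */
    static String of(String text, int maxLen) {
        if (text == null || text.length() <= maxLen) {
            return text;
        }
        return text.substring(0, maxLen) + "...";
    }

    public static void main(String[] args) {
        // Build a 150-character string without relying on newer JDK helpers.
        String content = new String(new char[150]).replace('\0', 'x');
        String shown = Preview.of(content, 100);
        System.out.println(shown.length());   // 103: 100 characters plus "..."
        System.out.println(content.length()); // 150: the original is untouched
    }
}
```

Inside toString() one could then pass Preview.of(content, 100) to the string builder instead of assigning to this.content, so that logging never changes what gets persisted.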
The mybatis-config.xml file holds some basic global settings:
    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE configuration PUBLIC "-//mybatis.org//DTD Config 3.0//EN" "http://mybatis.org/dtd/mybatis-3-config.dtd">
    <configuration>
        <settings>
            <setting name="mapUnderscoreToCamelCase" value="true"/>
        </settings>
    </configuration>
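The mapUnderscoreToCamelCase setting tells MyBatis to map classic column names such as update_time onto camel-case Java properties such as updateTime when result rows are mapped back to beans. The sketch below only illustrates that naming rule; it is not MyBatis's own code, and the class name UnderscoreToCamel is an assumption made up for this example.

```java
// Illustrative sketch of the naming rule only; this is not how MyBatis implements it.
final class UnderscoreToCamel {
    private UnderscoreToCamel() {
    }

    /** Converts a column name like "update_time" or "A_COLUMN" to "updateTime" / "aColumn". */
    static String convert(String column) {
        StringBuilder sb = new StringBuilder(column.length());
        boolean upperNext = false;
        for (char c : column.toCharArray()) {
            if (c == '_') {
                // Drop the underscore and capitalize the character that follows it.
                upperNext = true;
            } else if (upperNext) {
                sb.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("update_time")); // updateTime
        System.out.println(convert("A_COLUMN"));    // aColumn
        System.out.println(convert("id"));          // id
    }
}
```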
The seimi-mybatis.xml file:

    <?xml version="1.0" encoding="UTF-8"?>
    <beans xmlns="http://www.springframework.org/schema/beans"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xmlns:context="http://www.springframework.org/schema/context"
           xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/context
           http://www.springframework.org/schema/context/spring-context.xsd">
        <context:annotation-config />
        <bean id="mybatisDataSource" class="org.apache.commons.dbcp2.BasicDataSource">
            <property name="driverClassName" value="${database.driverClassName}"/>
            <property name="url" value="${database.url}"/>
            <property name="username" value="${database.username}"/>
            <property name="password" value="${database.password}"/>
        </bean>
        <bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean" abstract="true">
            <property name="configLocation" value="classpath:mybatis-config.xml"/>
        </bean>
        <bean id="seimiSqlSessionFactory" parent="sqlSessionFactory">
            <property name="dataSource" ref="mybatisDataSource"/>
        </bean>
        <bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
            <property name="basePackage" value="cn.wanghaomiao.dao.mybatis"/>
            <property name="sqlSessionFactoryBeanName" value="seimiSqlSessionFactory"/>
        </bean>
    </beans>
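Placeholders such as ${database.driverClassName} appear in this configuration file because the SeimiCrawler demo project also has dynamic configuration set up. You can just as well hard-code the values here, with no need to read them from another configuration source.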
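To make the placeholder idea concrete, here is a rough sketch of what resolving ${...} against a set of properties amounts to. It is a simplified stand-in written with the JDK only, not Spring's actual placeholder mechanism. The class name PlaceholderSketch, the connection URL, and the credentials are all assumptions for illustration; com.mysql.jdbc.Driver is the driver class shipped with the 5.1.x connector listed in the dependencies.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-in for placeholder resolution; not Spring's implementation.
final class PlaceholderSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    private PlaceholderSketch() {
    }

    /** Replaces every ${key} in template with the matching property value. */
    static String resolve(String template, Properties props) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            String value = props.getProperty(key);
            if (value == null) {
                throw new IllegalArgumentException("No value for placeholder: " + key);
            }
            // quoteReplacement keeps '$' and '\' in the value from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // All values below are assumed examples; substitute your own.
        props.setProperty("database.driverClassName", "com.mysql.jdbc.Driver");
        props.setProperty("database.url", "jdbc:mysql://127.0.0.1:3306/xiaohuo");
        props.setProperty("database.username", "demo_user");
        props.setProperty("database.password", "demo_password");

        System.out.println(resolve("${database.driverClassName}", props));
        System.out.println(resolve("${database.url}", props));
    }
}
```

Hard-coding, as mentioned above, simply means writing the resolved values directly into the value attributes of the mybatisDataSource bean.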
Add the DAO under the cn.wanghaomiao.dao.mybatis package:

    package cn.wanghaomiao.dao.mybatis;

    import cn.wanghaomiao.model.BlogContent;
    import org.apache.ibatis.annotations.Insert;
    import org.apache.ibatis.annotations.Options;
    import org.apache.ibatis.annotations.Param;

    /**
     * @since 2016/7/27.
     */
    public interface MybatisStoreDAO {

        // Inserts one blog row; the generated primary key is written back to blog.id
        @Insert("insert into blog (title,content,update_time) values (#{blog.title},#{blog.content},now())")
        @Options(useGeneratedKeys = true, keyProperty = "blog.id")
        int save(@Param("blog") BlogContent blog);
    }
At this point, the MyBatis side is ready.
    package cn.wanghaomiao.crawlers;

    import cn.wanghaomiao.dao.mybatis.MybatisStoreDAO;
    import cn.wanghaomiao.model.BlogContent;
    import cn.wanghaomiao.seimi.annotation.Crawler;
    import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
    import cn.wanghaomiao.seimi.struct.Request;
    import cn.wanghaomiao.seimi.struct.Response;
    import cn.wanghaomiao.xpath.model.JXDocument;
    import org.springframework.beans.factory.annotation.Autowired;

    import java.util.List;

    /**
     * Stores the parsed data directly in the database, implemented by integrating MyBatis.
     *
     * @author 汪浩淼 [et.tw@163.com]
     * @since 2016/07/27.
     */
    @Crawler(name = "mybatis")
    public class DatabaseMybatisDemo extends BaseSeimiCrawler {
        @Autowired
        private MybatisStoreDAO storeToDbDAO;

        @Override
        public String[] startUrls() {
            return new String[]{"http://www.cnblogs.com/"};
        }

        @Override
        public void start(Response response) {
            JXDocument doc = response.document();
            try {
                // Collect the post links on the start page and queue each one for renderBean
                List<Object> urls = doc.sel("//a[@class='titlelnk']/@href");
                logger.info("{}", urls.size());
                for (Object s : urls) {
                    push(Request.build(s.toString(), "renderBean"));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public void renderBean(Response response) {
            try {
                // Fill a BlogContent bean using its @Xpath annotations
                BlogContent blog = response.render(BlogContent.class);
                logger.info("bean resolve res={},url={}", blog, response.getUrl());
                // Store to the DB through the MyBatis DAO
                int changeNum = storeToDbDAO.save(blog);
                int blogId = blog.getId();
                logger.info("store success,blogId = {},changeNum={}", blogId, changeNum);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
Next, start it up:
    import cn.wanghaomiao.seimi.core.Seimi;

    public class Boot {
        public static void main(String[] args) {
            Seimi s = new Seimi();
            // Start the crawler registered under the name "mybatis"
            s.start("mybatis");
        }
    }
You should see log output like the following:
    00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 257,changeNum=1
    00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - bean resolve res=cn.wanghaomiao.model.BlogContent@3edc08c3[id=<null>,title=CoordinatorLayout自定义Bahavior特效及其源码分析CoordinatorLayout自定义Bahavior特效及其源码分析,content=@[CoordinatorLayout, Bahavior] CoordinatorLayout是android support design包中能够算是最重要的一个东西,运用它能够作出一些不错的特效...],url=http://www.cnblogs.com/soaringEveryday/p/5711545.html
    00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 258,changeNum=1
    00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 259,changeNum=1
    00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 260,changeNum=1
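Integration complete!
For packaging, deploying, and starting a project in a production environment, the maven-seimicrawler-plugin packaging plugin is recommended. For details, see maven-seimicrawler-plugin or "Seimi Basics Series 1: Using the SeimiCrawler Packaging and Deployment Tool".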