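# Using the Java crawler gecco to scrape all JD product information (Part 1)

## The gecco crawler

If you are not yet familiar with gecco, have a look at gecco's GitHub home page. gecco is simple and easy to use: scraping all of JD's product information takes only 9 classes.

## Analyzing the JD site

To scrape all product information from JD, we first need to analyze the site. The JD site can be roughly divided into three levels: the home page links through categories to product list pages, and each product on a list page has its own detail page. So once we have found all the categories, we can scrape product information category by category.

## Entry URL

http://www.jd.com/allSort.aspx is the category list for all JD products. We use this page as the start page for scraping all of JD's product information.

### Create the HtmlBean class AllSort for the start page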

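import java.util.List;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Request;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HtmlBean;

@Gecco(matchUrl="http://www.jd.com/allSort.aspx", pipelines={"consolePipeline", "allSortPipeline"})
public class AllSort implements HtmlBean {

	private static final long serialVersionUID = 665662335318691818L;
	
	@Request
	private HttpRequest request;

	// Mobile phones
	@HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl")
	private List<Category> mobile;
	
	// Household appliances
	@HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(3) > div.mc > div.items > dl")
	private List<Category> domestic;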

	public List<Category> getMobile() {
		return mobile;
	}

	public void setMobile(List<Category> mobile) {
		this.mobile = mobile;
	}

	public List<Category> getDomestic() {
		return domestic;
	}

	public void setDomestic(List<Category> domestic) {
		this.domestic = domestic;
	}

	public HttpRequest getRequest() {
		return request;
	}

	public void setRequest(HttpRequest request) {
		this.request = request;
	}
}

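As you can see, this example scrapes two top-level categories, mobile phones and household appliances. Each top-level category contains several subcategories, represented as List<Category>. gecco supports nested beans, which express the structure of an HTML page well. Category holds the subcategory information, and HrefBean is a shared link bean.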

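import java.util.List;

import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HrefBean;
import com.geccocrawler.gecco.spider.HtmlBean;

public class Category implements HtmlBean {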

	private static final long serialVersionUID = 3018760488621382659L;

	@Text
	@HtmlField(cssPath="dt a")
	private String parentName;
	
	@HtmlField(cssPath="dd a")
	private List<HrefBean> categorys;

	public String getParentName() {
		return parentName;
	}

	public void setParentName(String parentName) {
		this.parentName = parentName;
	}

	public List<HrefBean> getCategorys() {
		return categorys;
	}

	public void setCategorys(List<HrefBean> categorys) {
		this.categorys = categorys;
	}
	
}
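
To make the nesting concrete, here is a rough plain-Java sketch of the shape the parsed data takes: each top-level category is a list of groups, and each group holds a parent name plus a list of links. `Link` and `Group` are hypothetical stand-ins for gecco's `HrefBean` and the `Category` bean above; they are not gecco classes, and the sample names and URLs are invented for the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins used only to illustrate the nested shape; not gecco classes.
final class NestedBeanSketch {

    // Plays the role of HrefBean: a link title and its URL.
    static final class Link {
        final String title;
        final String url;
        Link(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    // Plays the role of Category: one <dl> with a parent name and its child links.
    static final class Group {
        final String parentName;
        final List<Link> links;
        Group(String parentName, List<Link> links) {
            this.parentName = parentName;
            this.links = links;
        }
    }

    // Flatten one top-level category into "parent / child -> url" lines.
    static List<String> describe(List<Group> groups) {
        List<String> lines = new ArrayList<>();
        for (Group group : groups) {
            for (Link link : group.links) {
                lines.add(group.parentName + " / " + link.title + " -> " + link.url);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Invented sample data standing in for the "mobile" field of AllSort.
        List<Group> mobile = Arrays.asList(
                new Group("Phones", Arrays.asList(
                        new Link("Smartphones", "http://list.example.com/a"),
                        new Link("Feature phones", "http://list.example.com/b"))),
                new Group("Accessories", Arrays.asList(
                        new Link("Cases", "http://list.example.com/c"))));
        for (String line : describe(mobile)) {
            System.out.println(line);
        }
    }
}
```

The two nested loops in `describe` mirror how a pipeline would walk the injected bean: first over the groups, then over each group's links.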

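## A tip for getting an element's cssPath

The hard part of the two classes above is working out the cssPath, so here are a few tips. Open the page you want to scrape in Chrome and press F12 to enter developer mode. Select the element you want to extract, as shown in the figure: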

With the element selected in the panel on the right side of the browser, right-click and choose Copy > Copy selector to get the element's cssPath:

body > div:nth-child(5) > div.main-classify > div.list > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2) > div.mc > div.items

If you are familiar with jQuery selectors, you will see this can be shortened. Since we also only want the dl elements, it simplifies to:

.category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl

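## Writing the business-logic class for AllSort

Once AllSort has been populated, we need to process it. Here we skip persisting the category information and only extract the category links, so that the product list pages can be scraped next. Here is the code: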

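import java.util.List;

import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.scheduler.SchedulerContext;
import com.geccocrawler.gecco.spider.HrefBean;

@PipelineName("allSortPipeline")
public class AllSortPipeline implements Pipeline<AllSort> {

	@Override
	public void process(AllSort allSort) {
		// Only the mobile phone category is followed in this example
		List<Category> categorys = allSort.getMobile();
		for(Category category : categorys) {
			List<HrefBean> hrefs = category.getCategorys();
			for(HrefBean href : hrefs) {
				// Restrict the list page to JD self-operated, in-stock products
				String url = href.getUrl()+"&delivery=1&page=1&JL=4_10_0&go=0";
				HttpRequest currRequest = allSort.getRequest();
				// Queue the list page as a sub-request of the current request
				SchedulerContext.into(currRequest.subRequest(url));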
			}
		}
	}

}

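@PipelineName defines the name of the pipeline, which is referenced in AllSort's @Gecco annotation. That way, after gecco has fetched the page and populated the bean, it calls the pipelines listed in @Gecco one by one. The string "&delivery=1&page=1&JL=4_10_0&go=0" is appended to each sub-link so that only JD self-operated, in-stock products are scraped. SchedulerContext.into() puts the link into the queue to await further scraping.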
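
The pipeline appends the filter string with a leading `&`, which assumes every category link already carries a query string. As a minimal sketch in plain Java (no gecco dependency), the helper below picks `?` or `&` depending on the URL and then places the results on a FIFO queue, which is roughly the role the crawl queue plays. The class `ListUrlSketch`, its methods `withFilters` and `enqueueAll`, and the sample URLs are assumptions made for illustration; they are not part of gecco or of JD's site.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical helper, not part of gecco: builds filtered list-page URLs.
final class ListUrlSketch {

    // Filter parameters taken from the article (self-operated, in-stock, page 1).
    static final String FILTERS = "delivery=1&page=1&JL=4_10_0&go=0";

    // Use '?' when the URL has no query string yet, '&' otherwise.
    static String withFilters(String url) {
        char separator = url.indexOf('?') >= 0 ? '&' : '?';
        return url + separator + FILTERS;
    }

    // Stand-in for the crawl queue: URLs come out in the order they went in.
    static Queue<String> enqueueAll(List<String> categoryUrls) {
        Queue<String> queue = new ArrayDeque<>();
        for (String url : categoryUrls) {
            queue.add(withFilters(url));
        }
        return queue;
    }

    public static void main(String[] args) {
        // Invented sample URLs, one with a query string and one without.
        List<String> urls = Arrays.asList(
                "http://list.example.com/list.html?cat=1,2,3",
                "http://list.example.com/all");
        for (String url : enqueueAll(urls)) {
            System.out.println(url);
        }
    }
}
```

The sketch ignores URL fragments and trailing `?` characters; a real crawler would want to handle those as well.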