The basic usage covered in the previous post offers great flexibility, but it also means writing relatively more code. Most crawlers in my line of work are topic-specific: they only need to scrape designated pages and structure the data. To speed up development, I implemented a way to build crawlers through entity configuration.
Add the package via NuGet:

DotnetSpider2.Extension
Define a plain data entity class:
```csharp
public class Product : SpiderEntity
{
}
```
You can see that every product sits in a DIV with class gl-i-wrap j-sku-item, so add an EntitySelector attribute to the Product class. (This XPath is not the only possible way to write it; if you are unfamiliar with XPath, W3Schools is a good place to learn. The framework also supports CSS selectors, and even regular expressions, for picking out the right HTML fragment.)
```csharp
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
public class Product : SpiderEntity
{
}
```
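For reference, the JD list markup that this XPath targets looks roughly like the following. This is a simplified, hypothetical fragment reconstructed from the selectors used in this post; the live page contains far more attributes and nesting:

```html
<ul class="gl-warp">
  <li class="gl-item">
    <!-- The EntitySelector matches this DIV; all property XPaths are relative to it -->
    <div class="gl-i-wrap j-sku-item" data-sku="1234567" venderid="1000001234">
      <div class="p-img"><a href="//item.jd.com/1234567.html">...</a></div>
      <div class="p-name"><a><em>Sample phone title</em></a></div>
    </div>
  </li>
</ul>
```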
Add the database and index information:
```csharp
[EntityTable("test", "sku", EntityTable.Monday, Indexs = new[] { "Category" }, Uniques = new[] { "Category,Sku", "Sku" })]
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
public class Product : SpiderEntity
{
}
```
Suppose you need to collect the SKU. Inspect the HTML structure and work out the relative XPath. Why a relative XPath? Because EntitySelector has already cut the HTML into fragments, and every inner element query runs relative to the element that EntitySelector selected. Then add the database column information:
```csharp
[EntityTable("test", "sku", EntityTable.Monday, Indexs = new[] { "Category" }, Uniques = new[] { "Category,Sku", "Sku" })]
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
public class Product : SpiderEntity
{
    [PropertyDefine(Expression = "./@data-sku")]
    public string Sku { get; set; }
}
```
Internally, the crawler stores link information in Request objects. When constructing a Request you can attach extra property values, and data entities are then allowed to query these extras on the Request:
```csharp
[EntityTable("test", "sku", EntityTable.Monday, Indexs = new[] { "Category" }, Uniques = new[] { "Category,Sku", "Sku" })]
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
public class Product : SpiderEntity
{
    [PropertyDefine(Expression = "./@data-sku")]
    public string Sku { get; set; }

    [PropertyDefine(Expression = "name", Type = SelectorType.Enviroment)]
    public string Category { get; set; }
}
```
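Given these attributes, the MySQL pipeline would create a table along roughly these lines. This is an illustrative sketch only: the actual DDL, column type mappings, any bookkeeping columns, and the exact table-name suffix (EntityTable.Monday stamps the name with the date of the current week's Monday) are all generated by the framework:

```sql
-- Illustrative sketch, not the framework's actual generated DDL.
-- Table name suffix and column types are placeholders.
CREATE TABLE IF NOT EXISTS `test`.`sku_2018_01_01` (
    `sku` VARCHAR(255),
    `category` VARCHAR(255),
    KEY `index_category` (`category`),
    UNIQUE KEY `unique_category_sku` (`category`, `sku`),
    UNIQUE KEY `unique_sku` (`sku`)
);
```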
```csharp
public class JdSkuSampleSpider : EntitySpider
{
    public JdSkuSampleSpider() : base("JdSkuSample", new Site
    {
        //HttpProxyPool = new HttpProxyPool(new KuaidailiProxySupplier("快代理API"))
    })
    {
    }

    protected override void MyInit(params string[] arguments)
    {
        Identity = Identity ?? "JD SKU SAMPLE";
        ThreadNum = 1;
        // Download html via HttpClient
        Downloader = new HttpClientDownloader();
        // Store data to MySQL. The default pipeline is the MySQL entity pipeline,
        // so you can comment this line out. Don't forget SslMode.
        AddPipeline(new MySqlEntityPipeline("Database='mysql';Data Source=localhost;User ID=root;Password=;Port=3306;SslMode=None;"));
        AddStartUrl("http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main",
            new Dictionary<string, object> { { "name", "手机" }, { "cat3", "655" } });
        AddEntityType<Product>();
    }
}
```
The second argument to AddStartUrl, a Dictionary&lt;string, object&gt;, is exactly the data queried by SelectorType.Enviroment.
TargetUrlsSelector configures both the validation of candidate links and where target URLs are discovered. The following says that target URLs are extracted from the region selected by the XPath, and must match the regular expression &amp;page=[0-9]+&amp;:
```csharp
[EntityTable("test", "jd_sku", EntityTable.Monday, Indexs = new[] { "Category" }, Uniques = new[] { "Category,Sku", "Sku" })]
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
[TargetUrlsSelector(XPaths = new[] { "//span[@class=\"p-num\"]" }, Patterns = new[] { @"&page=[0-9]+&" })]
public class Product : SpiderEntity
{
    [PropertyDefine(Expression = "./@data-sku")]
    public string Sku { get; set; }

    [PropertyDefine(Expression = "name", Type = SelectorType.Enviroment)]
    public string Category { get; set; }
}
```
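The pattern itself is a plain .NET regular expression, so you can sanity-check which URLs it accepts in a standalone program. The paging URLs below are made up for illustration, in the same shape as the start URL:

```csharp
using System;
using System.Text.RegularExpressions;

class PatternCheck
{
    static void Main()
    {
        var pattern = new Regex(@"&page=[0-9]+&");
        // A paging link contains "&page=N&", so it matches.
        Console.WriteLine(pattern.IsMatch("http://list.jd.com/list.html?cat=9987,653,655&page=3&JL=6_0_0")); // True
        // A link without the paging parameter does not match and is discarded.
        Console.WriteLine(pattern.IsMatch("http://list.jd.com/list.html?cat=9987,653,655")); // False
    }
}
```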
To add a MySQL data pipeline, you only need to configure the connection string:
```csharp
context.AddPipeline(new MySqlEntityPipeline("Database='test';Data Source=localhost;User ID=root;Password=1qazZAQ!;Port=3306"));
```
```csharp
public class JdSkuSampleSpider : EntitySpider
{
    public JdSkuSampleSpider() : base("JdSkuSample", new Site
    {
        //HttpProxyPool = new HttpProxyPool(new KuaidailiProxySupplier("快代理API"))
    })
    {
    }

    protected override void MyInit(params string[] arguments)
    {
        Identity = Identity ?? "JD SKU SAMPLE";
        ThreadNum = 1;
        // Download html via HttpClient
        Downloader = new HttpClientDownloader();
        // Store data to MySQL. The default pipeline is the MySQL entity pipeline,
        // so you can comment this line out. Don't forget SslMode.
        AddPipeline(new MySqlEntityPipeline("Database='mysql';Data Source=localhost;User ID=root;Password=;Port=3306;SslMode=None;"));
        AddStartUrl("http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main",
            new Dictionary<string, object> { { "name", "手机" }, { "cat3", "655" } });
        AddEntityType<Product>();
    }
}

[EntityTable("test", "jd_sku", EntityTable.Monday, Indexs = new[] { "Category" }, Uniques = new[] { "Category,Sku", "Sku" })]
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
[TargetUrlsSelector(XPaths = new[] { "//span[@class=\"p-num\"]" }, Patterns = new[] { @"&page=[0-9]+&" })]
public class Product : SpiderEntity
{
    [PropertyDefine(Expression = "./@data-sku", Length = 100)]
    public string Sku { get; set; }

    [PropertyDefine(Expression = "name", Type = SelectorType.Enviroment, Length = 100)]
    public string Category { get; set; }

    [PropertyDefine(Expression = "cat3", Type = SelectorType.Enviroment)]
    public int CategoryId { get; set; }

    [PropertyDefine(Expression = "./div[1]/a/@href")]
    public string Url { get; set; }

    [PropertyDefine(Expression = "./div[5]/strong/a")]
    public long CommentsCount { get; set; }

    [PropertyDefine(Expression = ".//div[@class='p-shop']/@data-shop_name", Length = 100)]
    public string ShopName { get; set; }

    [PropertyDefine(Expression = ".//div[@class='p-name']/a/em", Length = 100)]
    public string Name { get; set; }

    [PropertyDefine(Expression = "./@venderid", Length = 100)]
    public string VenderId { get; set; }

    [PropertyDefine(Expression = "./@jdzy_shop_id", Length = 100)]
    public string JdzyShopId { get; set; }

    [PropertyDefine(Expression = "Monday", Type = SelectorType.Enviroment)]
    public DateTime RunId { get; set; }
}
```
```csharp
public class Program
{
    public static void Main(string[] args)
    {
        JdSkuSampleSpider spider = new JdSkuSampleSpider();
        spider.Run();
    }
}
```
A complete crawler in fewer than 57 lines of code. Remarkably simple, isn't it?
Stars are much appreciated: https://github.com/zlzforever/DotnetSpider
This post was written quite a while ago, and the framework sometimes changes faster than the post can be updated. Please refer to the sample crawlers in the DotnetSpider.Sample project.
QQ group: 477731655