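Recently a project required scraping information from an old site and importing it into a new system. Nobody maintains the old system any more and its data is scattered, whereas the data we need is presented far more uniformly on its web pages, so the plan is to extract it by issuing HTTP requests and parsing the pages. Two years ago, at about this time of year, I seem to have done the very same thing. Funny how fate works.
In scraping, the most troublesome part is usually breaking down the different pages and extracting the data, because page design and structure vary enormously. Some pages can only be reached in a roundabout way (ajax, iframe, and so on), which makes data extraction the most time-consuming and painful step, since you have to write a lot of logic to string the whole flow together. I vaguely remember thinking about this problem in July 2015, two years ago. At the time I introduced a type, CommonExtractor, to solve it. The overall definition looked like this: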
public class CommonExtractor
{
    public CommonExtractor(PageProcessConfig config)
    {
        PageProcessConfig = config;
    }

    protected PageProcessConfig PageProcessConfig;

    public virtual void Extract(CrawledHtmlDocument document)
    {
        // Skip documents whose URL matches none of the configured patterns.
        if (!PageProcessConfig.IncludedUrlPattern.Any(i => Regex.IsMatch(document.FromUrl.ToString(), i)))
            return;
        var node = new WebHtmlNode { Node = document.Contnet.DocumentNode, FromUrl = document.FromUrl };
        ExtractData(node, PageProcessConfig);
    }

    protected Dictionary<string, ExtractionResult> ExtractData(WebHtmlNode node, PageProcessConfig blockConfig)
    {
        var data = new Dictionary<string, ExtractionResult>();
        foreach (var config in blockConfig.DataExtractionConfigs)
        {
            if (node == null) continue;
            /* The leading '.' makes the current node the XPath context. */
            var selectedNodes = node.Node.SelectNodes("." + config.XPath);
            var result = new ExtractionResult(config, node.FromUrl);
            if (selectedNodes != null && selectedNodes.Any())
            {
                foreach (var sNode in selectedNodes)
                {
                    if (config.Attribute != null)
                        result.Fill(sNode.Attributes[config.Attribute].Value);
                    else
                        result.Fill(sNode.InnerText);
                }
                data[config.Key] = result;
            }
            else
            {
                data[config.Key] = null;
            }
        }
        if (DataExtracted != null)
        {
            var args = new DataExtractedEventArgs(data, node.FromUrl);
            DataExtracted(this, args);
        }
        return data;
    }

    public EventHandler<DataExtractedEventArgs> DataExtracted;
}
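The code is a bit messy (I was crawling with Abot at the time), but the intent is clear enough: extract useful information from an HTML document, with a configuration that specifies how to extract it. The main problem with this approach is that it cannot cope with complex structures. Handling a particular structure means introducing a new configuration and a new flow, and that new flow is not very reusable.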
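To cope with real-world complexity, the most basic processing must be designed to be simple. Taking inspiration from the earlier code, what we actually want from data extraction is:
From this follows the most basic interface definition: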
public interface IContentProcessor
{
    /// <summary>
    /// Processes the content.
    /// </summary>
    /// <param name="source"></param>
    /// <returns></returns>
    object Process(object source);
}
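In the interface above, an implementation of IContentProcessor, if made large enough, could handle data extraction for any HTML page. But that means its reusability keeps dropping and maintenance keeps getting harder. So we want each implementation to be small. The smaller it is, though, the less it does, so to meet complex real-world needs these processors must be composable. The interface therefore gains a new element: sub-processors.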
public interface IContentProcessor
{
    /// <summary>
    /// Processes the content.
    /// </summary>
    /// <param name="source"></param>
    /// <returns></returns>
    object Process(object source);

    /// <summary>
    /// Order of this processor; lower values run first.
    /// </summary>
    int Order { get; }

    /// <summary>
    /// Sub-processors.
    /// </summary>
    IList<IContentProcessor> SubProcessors { get; }
}
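With this, Processors can cooperate. Their nesting, together with the Order property, determines the execution order. The whole flow also behaves like a pipeline: the result of one Processor can serve as the source for the next.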
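The base class that drives this pipeline is not shown in the post, so the following is only a minimal sketch of how the ordering and piping could work. PipelineProcessor and its delegate-based constructor are hypothetical names introduced for illustration. It assumes sub-processors run first, in ascending Order, each receiving the previous result, and that the processor's own step runs last.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface IContentProcessor
{
    object Process(object source);
    int Order { get; }
    IList<IContentProcessor> SubProcessors { get; }
}

// Hypothetical stand-in for a base processor (not the author's class).
public class PipelineProcessor : IContentProcessor
{
    private readonly Func<object, object> _step;

    public PipelineProcessor(int order, Func<object, object> step = null)
    {
        Order = order;
        _step = step ?? (x => x); // default step passes the value through
    }

    public int Order { get; }

    public IList<IContentProcessor> SubProcessors { get; } = new List<IContentProcessor>();

    public object Process(object source)
    {
        var current = source;
        // Lower Order runs first; each result becomes the next source.
        foreach (var sub in SubProcessors.OrderBy(p => p.Order))
            current = sub.Process(current);
        // Assumption: the processor's own step runs after its children.
        return _step(current);
    }
}
```

Adding two children with Order 2 and Order 1 in that sequence still runs the Order 1 child first, which is the behaviour the nesting-plus-Order description implies.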
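The processing flow is now composable, but the results are not yet, because they cannot represent complex structures. To solve this, IContentCollector is introduced. It inherits from IContentProcessor but adds one requirement: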
public interface IContentCollector : IContentProcessor
{
    /// <summary>
    /// Key under which the value gathered by this collector is stored.
    /// </summary>
    string Key { get; }
}
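The interface requires a Key that identifies the result. That lets us manage a complex structure with a Dictionary<string,object>, because the value of a dictionary entry can itself be a Dictionary<string,object>. With JSON as the serialization format, it is then very easy to deserialize the result into a complex class.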
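As a rough illustration of that round trip, the sketch below serializes a nested dictionary and reads it back into a typed class. It uses System.Text.Json rather than the Json.NET (JsonConvert) calls that appear later in the post, purely to stay free of external packages. NewsItem, NewsStats and the sample values are made up for this example, and the counts are stored as integers so the default deserializer accepts them.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical target types for the collected result.
public class NewsStats
{
    public int CommentCount { get; set; }
    public int LikeCount { get; set; }
}

public class NewsItem
{
    public string Title { get; set; }
    public NewsStats Stats { get; set; }
}

public static class NestedResultDemo
{
    public static NewsItem RoundTrip()
    {
        // A collector result: keys map to values or to nested dictionaries.
        var collected = new Dictionary<string, object>
        {
            ["Title"] = "Sample title",
            ["Stats"] = new Dictionary<string, object>
            {
                ["CommentCount"] = 1,
                ["LikeCount"] = 3
            }
        };

        // Dictionary keys become JSON property names, nested dictionaries
        // become nested JSON objects.
        string json = JsonSerializer.Serialize(collected);

        // Property names match the keys exactly, so default options suffice.
        return JsonSerializer.Deserialize<NewsItem>(json);
    }
}
```

The dictionary keys play the role of the collectors' Key values, so the shape of the target class is effectively decided by how the collectors are nested.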
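As for why this interface inherits from IContentProcessor: it keeps the node types uniform, which makes it easy to build the whole processing flow from configuration.
As the design above shows, the whole processing flow is really a tree with a very regular structure. That makes configuration feasible. A ContentProcessorOptions type describes each Processor node's type and the initialization information it needs. It is defined as follows: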
public class ContentProcessorOptions
{
    /// <summary>
    /// Property values used to construct the Processor.
    /// </summary>
    public Dictionary<string, object> Properties { get; set; } = new Dictionary<string, object>();

    /// <summary>
    /// Type name of the Processor.
    /// </summary>
    public string ProcessorType { get; set; }

    /// <summary>
    /// A single sub-Processor type, used to initialize Children quickly and reduce nesting.
    /// </summary>
    public string SubProcessorType { get; set; }

    /// <summary>
    /// Child configurations.
    /// </summary>
    public List<ContentProcessorOptions> Children { get; set; } = new List<ContentProcessorOptions>();
}
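The options include a SubProcessorType property for quickly initializing a ContentCollector that has only one child processing node. This reduces nesting in the configuration and keeps the file clearer. The following method shows how to initialize a Processor from a ContentProcessorOptions. It uses reflection, but since initialization is infrequent this is not much of a problem.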
public static IContentProcessor BuildContentProcessor(ContentProcessorOptions contentProcessorOptions)
{
    Type instanceType = null;
    try
    {
        instanceType = Type.GetType(contentProcessorOptions.ProcessorType, true);
    }
    catch
    {
        // Fall back to searching the exported types of every loaded assembly.
        foreach (var assembly in AppDomain.CurrentDomain.GetAssemblies())
        {
            if (assembly.IsDynamic) continue;
            instanceType = assembly.GetExportedTypes()
                .FirstOrDefault(i => i.FullName == contentProcessorOptions.ProcessorType);
            if (instanceType != null) break;
        }
    }
    if (instanceType == null) return null;

    var instance = Activator.CreateInstance(instanceType);

    // Copy configured values onto matching properties; unknown keys are skipped.
    foreach (var property in contentProcessorOptions.Properties)
    {
        var instanceProperty = instance.GetType().GetProperty(property.Key);
        if (instanceProperty == null) continue;
        var propertyType = instanceProperty.PropertyType;
        var sourceValue = property.Value.ToString();
        var dValue = sourceValue.Convert(propertyType);
        instanceProperty.SetValue(instance, dValue);
    }

    var processorInstance = (IContentProcessor) instance;

    // Shortcut: build a single sub-processor that shares this node's properties.
    if (!contentProcessorOptions.SubProcessorType.IsNullOrWhiteSpace())
    {
        var quickOptions = new ContentProcessorOptions
        {
            ProcessorType = contentProcessorOptions.SubProcessorType,
            Properties = contentProcessorOptions.Properties
        };
        var quickProcessor = BuildContentProcessor(quickOptions);
        processorInstance.SubProcessors.Add(quickProcessor);
    }

    foreach (var processorOption in contentProcessorOptions.Children)
    {
        var processor = BuildContentProcessor(processorOption);
        processorInstance.SubProcessors.Add(processor);
    }
    return processorInstance;
}
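An example illustrates the problem. Suppose n p tags are extracted from an HTML document, yielding a string[], which is passed as the source to the next processing node. That node will process each string correctly, but if it also returns a string[] for each string, that string[] should be joined by a Connector. Otherwise the result becomes a 2-dimensional, 3-dimensional, or even higher-dimensional array, and every node's logic becomes complicated and unpredictable. So collections need to converge to a single dimension.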
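The convergence rule can be sketched roughly as follows. MapAndFlatten, its parameters and the default connector are hypothetical and only illustrate the idea: when processing one element yields a collection, it is joined back into a single string, so the overall result never grows beyond one dimension.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class CollectionConvergence
{
    // Applies `step` to every element of `source`. A step that returns a
    // collection has its items joined with `connector`, so the output is
    // always a flat list of strings. Null results are dropped.
    public static List<string> MapAndFlatten(
        IEnumerable<string> source,
        Func<string, object> step,
        string connector = " ")
    {
        var output = new List<string>();
        foreach (var element in source)
        {
            var result = step(element);
            if (result is string text)
            {
                output.Add(text);
            }
            else if (result is IEnumerable items)
            {
                var parts = items.Cast<object>().Select(o => o?.ToString());
                output.Add(string.Join(connector, parts));
            }
            else if (result != null)
            {
                output.Add(result.ToString());
            }
        }
        return output;
    }
}
```

Joining is lossy by design: the individual items can no longer be told apart afterwards, which is the price of keeping every node's input shape predictable.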
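Because of the .NET Core configuration system currently in use, a child item in a Dictionary<string,object> cannot be set to a collection.
This processor downloads a piece of HTML text from the network and passes it as the source to the next processor. The request URL can be specified directly, or the source passed in from the previous node can be used as the URL. The implementation is as follows: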
public class HttpRequestContentProcessor : BaseContentProcessor
{
    public bool UseUrlWhenSourceIsNull { get; set; } = true;

    public string Url { get; set; }

    public bool IgnoreBadUri { get; set; }

    protected override object ProcessElement(object element)
    {
        if (element == null) return null;
        // Reject the element only when it is NOT a well-formed absolute URI.
        if (!Uri.IsWellFormedUriString(element.ToString(), UriKind.Absolute))
        {
            if (IgnoreBadUri) return null;
            throw new FormatException($"The requested address {element} is not well-formed");
        }
        return DownloadHtml(element.ToString());
    }

    public override object Process(object source)
    {
        // With no incoming source, fall back to the configured Url.
        if (source == null && UseUrlWhenSourceIsNull && !Url.IsNullOrWhiteSpace())
            return DownloadHtml(Url);
        return base.Process(source);
    }

    private static async Task<string> DownloadHtmlAsync(string url)
    {
        using (var client = new HttpClient())
        {
            var result = await client.GetAsync(url);
            var html = await result.Content.ReadAsStringAsync();
            return html;
        }
    }

    private string DownloadHtml(string url)
    {
        return AsyncHelper.Synchronize(() => DownloadHtmlAsync(url));
    }
}
A test:
[TestMethod]
public void HttpRequestContentProcessorTest()
{
    var processor = new HttpRequestContentProcessor {Url = "https://www.baidu.com"};
    var result = processor.Process(null);
    Assert.IsTrue(result.ToString().Contains("baidu"));
}
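This processor takes an XPath expression and retrieves the matching information. ValueProviderType and ValueProviderKey specify how to get data from a node. The implementation is as follows: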
public class XpathContentProcessor : BaseContentProcessor
{
    /// <summary>
    /// XPath of the elements to select.
    /// </summary>
    public string Xpath { get; set; }

    /// <summary>
    /// Key passed to the value provider (for example an attribute name).
    /// </summary>
    public string ValueProviderKey { get; set; }

    /// <summary>
    /// Type of the value provider.
    /// </summary>
    public XpathNodeValueProviderType ValueProviderType { get; set; }

    /// <summary>
    /// Index of the node.
    /// </summary>
    public int? NodeIndex { get; set; }

    /// <summary>
    /// Connector used to join the values of multiple nodes into one string.
    /// </summary>
    public string ResultConnector { get; set; } = Constants.DefaultResultConnector;

    public override object Process(object source)
    {
        var result = base.Process(source);
        return DeterminAndReturn(result);
    }

    protected override object ProcessElement(object element)
    {
        var result = base.ProcessElement(element);
        if (result == null) return null;
        var str = result.ToString();
        return ProcessWithXpath(str, Xpath, false);
    }

    protected object ProcessWithXpath(string documentText, string xpath, bool returnArray)
    {
        if (documentText == null) return null;
        var document = new HtmlDocument();
        document.LoadHtml(documentText);
        var nodes = document.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return null;

        if (returnArray && nodes.Count > 1)
        {
            // Return every non-empty node value as a separate item.
            var result = new List<string>();
            foreach (var node in nodes)
            {
                var nodeResult = Helper.GetValueFromHtmlNode(node, ValueProviderType, ValueProviderKey);
                if (!nodeResult.IsNullOrWhiteSpace())
                {
                    result.Add(nodeResult);
                }
            }
            return result;
        }
        else
        {
            // Join the non-empty node values into a single string.
            var result = string.Empty;
            foreach (var node in nodes)
            {
                var nodeResult = Helper.GetValueFromHtmlNode(node, ValueProviderType, ValueProviderKey);
                if (!nodeResult.IsNullOrWhiteSpace())
                {
                    if (result.IsNullOrWhiteSpace())
                        result = nodeResult;
                    else
                        result = $"{result}{ResultConnector}{nodeResult}";
                }
            }
            return result;
        }
    }
}
Combining this Processor with the previous one, let's grab the title of the Baidu home page:
[TestMethod]
public void XpathContentProcessorTest()
{
    var xpathProcessor = new XpathContentProcessor
    {
        Xpath = "//title",
        ValueProviderType = XpathNodeValueProviderType.InnerText
    };
    var processor = new HttpRequestContentProcessor { Url = "https://www.baidu.com" };
    xpathProcessor.SubProcessors.Add(processor);
    var result = xpathProcessor.Process(null);
    Assert.AreEqual("百度一下,你就知道", result.ToString());
}
The main job of a Collector is to solve the problem of complex output models. A Collector for a complex data structure is implemented as follows:
public class ComplexContentCollector : BaseContentCollector
{
    /// <summary>
    /// The Complex Content Collector needs its child extractors to provide a Key,
    /// so plain Processors among the children are ignored.
    /// </summary>
    /// <param name="source"></param>
    /// <returns></returns>
    protected override object ProcessElement(object source)
    {
        var result = new Dictionary<string, object>();
        foreach (var contentCollector in SubProcessors.OfType<IContentCollector>())
        {
            result[contentCollector.Key] = contentCollector.Process(source);
        }
        return result;
    }
}
The corresponding test:
[TestMethod]
public void ComplexContentCollectorTest2()
{
    var xpathProcessor = new XpathContentProcessor
    {
        Xpath = "//title",
        ValueProviderType = XpathNodeValueProviderType.InnerText
    };
    var xpathProcessor2 = new XpathContentProcessor
    {
        Xpath = "//p[@id=\"cp\"]",
        ValueProviderType = XpathNodeValueProviderType.InnerText,
        Order = 1
    };
    var processor = new HttpRequestContentProcessor {Url = "https://www.baidu.com", Order = -1};
    var complexCollector = new ComplexContentCollector();
    var baseCollector = new BaseContentCollector();
    baseCollector.SubProcessors.Add(processor);
    baseCollector.SubProcessors.Add(complexCollector);

    var titleCollector = new BaseContentCollector{Key = "Title"};
    titleCollector.SubProcessors.Add(xpathProcessor);

    var footerCollector = new BaseContentCollector {Key = "Footer"};
    footerCollector.SubProcessors.Add(xpathProcessor2);
    footerCollector.SubProcessors.Add(new HtmlCleanupContentProcessor{Order = 3});

    complexCollector.SubProcessors.Add(titleCollector);
    complexCollector.SubProcessors.Add(footerCollector);

    var result = (Dictionary<string,object>)baseCollector.Process(null);
    Assert.AreEqual("百度一下,你就知道", result["Title"]);
    Assert.AreEqual("©2014 Baidu 使用百度前必读 京ICP证030173号", result["Footer"]);
}
Now test with the following code:
public void RunConfig(string section)
{
    var builder = new ConfigurationBuilder()
        .SetBasePath(AppDomain.CurrentDomain.BaseDirectory)
        .AddJsonFile("appsettings1.json");
    var configurationRoot = builder.Build();
    var options = configurationRoot.GetSection(section).Get<ContentProcessorOptions>();
    var processor = Helper.BuildContentProcessor(options);
    var result = processor.Process(null);
    var json = JsonConvert.SerializeObject(result);
    System.Console.WriteLine(json);
}
The configuration used:
"newsListOptions": {
  "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
  "Properties": {},
  "Children": [
    {
      "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
      "Properties": {
        "Url": "https://www.cnblogs.com/news/",
        "Order": "0"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
      "Properties": {
        "Xpath": "//div[@class=\"post_item\"]",
        "Order": "1",
        "ValueProviderType": "OuterHtml",
        "OutputToArray": true
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentCollector.ComplexContentCollector",
      "Properties": {
        "Order": "2"
      },
      "Children": [
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//a[@class=\"titlelnk\"]",
            "Key": "Url",
            "ValueProviderType": "Attribute",
            "ValueProviderKey": "href"
          }
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//span[@class=\"article_comment\"]",
            "Key": "CommentCount",
            "ValueProviderType": "InnerText",
            "Order": "0"
          },
          "Children": [
            {
              "ProcessorType": "IC.Robot.ContentProcessor.RegexMatchContentProcessor",
              "Properties": {
                "RegexPartten": "[0-9]+",
                "Order": "1"
              }
            }
          ]
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//*[@class=\"digg\"]//span",
            "Key": "LikeCount",
            "ValueProviderType": "InnerText"
          }
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//a[@class=\"titlelnk\"]",
            "Key": "Title",
            "ValueProviderType": "InnerText"
          }
        }
      ]
    }
  ]
},
The result:
[
  {
    "Url": "//news.cnblogs.com/n/574269/",
    "CommentCount": "1",
    "LikeCount": "3",
    "Title": "刘强东:京东13年了,真正懂咱们的人仍是不多"
  },
  {
    "Url": "//news.cnblogs.com/n/574267/",
    "CommentCount": "0",
    "LikeCount": "0",
    "Title": "联想也开始大谈人工智能,不过它最迫切的目标是卖更多PC"
  },
  {
    "Url": "//news.cnblogs.com/n/574266/",
    "CommentCount": "0",
    "LikeCount": "0",
    "Title": "除了小米1几乎都支持 - 小米MIUI9升级机型一览"
  },
  ...
]
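This involves computation and collection operations, and the collection's elements are dictionaries, so two new Processors are needed: one for filtering and one for mapping.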
public class ListItemPickContentProcessor : BaseContentProcessor
{
    public string Key { get; set; }

    /// <summary>
    /// Full name of the type the compared values are converted to.
    /// </summary>
    public string OperatorTypeFullName { get; set; }

    /// <summary>
    /// Value to compare against.
    /// </summary>
    public string OperatorValue { get; set; }

    /// <summary>
    /// Index.
    /// </summary>
    public int Index { get; set; }

    /// <summary>
    /// Pick mode.
    /// </summary>
    public ListItemPickMode PickMode { get; set; }

    /// <summary>
    /// Pick operator.
    /// </summary>
    public ListItemPickOperator PickOperator { get; set; }

    public override object Process(object source)
    {
        var preResult = base.Process(source);
        if (!Helper.IsEnumerableExceptString(preResult))
        {
            // Test and cast the same object (the original tested source but cast preResult).
            if (preResult is Dictionary<string, object> dictionary)
                return dictionary[Key];
            return preResult;
        }
        // preResult is the value that was just verified to be enumerable.
        return Pick((IEnumerable) preResult);
    }

    private object Pick(IEnumerable source)
    {
        var objCollection = source.Cast<object>().ToList();
        if (objCollection.Count == 0) return objCollection;

        // Build a map from each item to its comparable key.
        var item = objCollection[0];
        var compareDictionary = new Dictionary<object, IComparable>();
        if (item is IDictionary)
        {
            foreach (Dictionary<string, object> dic in objCollection)
            {
                var key = (IComparable) dic[Key].ToString().Convert(ResolveType(OperatorTypeFullName));
                compareDictionary.Add(dic, key);
            }
        }
        else
        {
            foreach (var objItem in objCollection)
            {
                var key = (IComparable) objItem.ToString().Convert(ResolveType(OperatorTypeFullName));
                compareDictionary.Add(objItem, key);
            }
        }

        IEnumerable<object> result;
        switch (PickOperator)
        {
            case ListItemPickOperator.OrderDesc:
                result = compareDictionary.OrderByDescending(i => i.Value).Select(i => i.Key);
                break;
            default:
                throw new NotSupportedException();
        }

        switch (PickMode)
        {
            case ListItemPickMode.First:
                return result.FirstOrDefault();
            case ListItemPickMode.Last:
                return result.LastOrDefault();
            case ListItemPickMode.Index:
                return result.Skip(Index - 1).Take(1).FirstOrDefault();
            default:
                throw new NotImplementedException();
        }
    }

    private Type ResolveType(string typeName)
    {
        if (typeName == typeof(Int32).FullName) return typeof(Int32);
        throw new NotSupportedException();
    }

    public enum ListItemPickMode
    {
        First,
        Last,
        Index
    }

    public enum ListItemPickOperator
    {
        LittleThan,
        GreaterThan,
        Order,
        OrderDesc
    }
}
This uses quite a bit of reflection, but performance is not a concern for now.
public class DictionaryPickContentProcessor : BaseContentProcessor
{
    public string Key { get; set; }

    protected override object ProcessElement(object element)
    {
        if (element is IDictionary)
        {
            return (element as IDictionary)[Key];
        }
        return element;
    }
}
This Processor extracts a single entry from a dictionary.
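Stripped of the processor plumbing, the two steps amount to "sort by a converted key, take the first item, then read one field from it". The sketch below shows the same idea with plain LINQ. PickDemo and UrlOfMostCommented are hypothetical names, and the method assumes every item carries a numeric CommentCount.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PickDemo
{
    // Filter step: order by CommentCount (converted to int), descending,
    // and keep the first item. Map step: read that item's "Url" entry.
    public static object UrlOfMostCommented(IEnumerable<Dictionary<string, object>> items)
    {
        var top = items
            .OrderByDescending(d => int.Parse(d["CommentCount"].ToString()))
            .FirstOrDefault();
        return top == null ? null : top["Url"];
    }
}
```

Keeping the two steps as separate processors, as the post does, lets the picked dictionary be reused by other mapping nodes, whereas this helper fuses them for brevity.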
The configuration used:
"mostCommentsOptions": {
  "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
  "Properties": {},
  "Children": [
    {
      "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
      "Properties": {
        "Url": "https://www.cnblogs.com/news/",
        "Order": "0"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
      "Properties": {
        "Xpath": "//div[@class=\"post_item\"]",
        "Order": "1",
        "ValueProviderType": "OuterHtml",
        "OutputToArray": true
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentCollector.ComplexContentCollector",
      "Properties": {
        "Order": "2"
      },
      "Children": [
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//a[@class=\"titlelnk\"]",
            "Key": "Url",
            "ValueProviderType": "Attribute",
            "ValueProviderKey": "href"
          }
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//span[@class=\"article_comment\"]",
            "Key": "CommentCount",
            "ValueProviderType": "InnerText",
            "Order": "0"
          },
          "Children": [
            {
              "ProcessorType": "IC.Robot.ContentProcessor.RegexMatchContentProcessor",
              "Properties": {
                "RegexPartten": "[0-9]+",
                "Order": "1"
              }
            }
          ]
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//*[@class=\"digg\"]//span",
            "Key": "LikeCount",
            "ValueProviderType": "InnerText"
          }
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//a[@class=\"titlelnk\"]",
            "Key": "Title",
            "ValueProviderType": "InnerText"
          }
        }
      ]
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.ListItemPickContentProcessor",
      "Properties": {
        "OperatorTypeFullName": "System.Int32",
        "Key": "CommentCount",
        "PickMode": "First",
        "PickOperator": "OrderDesc",
        "Order": "4"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.DictionaryPickContentProcessor",
      "Properties": {
        "Order": "5",
        "Key": "Url"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.FormatterContentProcessor",
      "Properties": {
        "Formatter": "https:{0}",
        "Order": "6"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
      "Properties": {
        "Order": "7"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
      "Properties": {
        "Xpath": "//div[@id=\"news_content\"]//p[2]",
        "Order": "8",
        "ValueProviderType": "InnerHtml",
        "OutputToArray": false
      }
    }
  ]
}
The result:
昨日,京东忽然通知平台商户,将关闭每天快递服务接口。这意味着京东平台上的商户之后不能再用每天快递发货了。
The scheduling of Processors (depth-first, breadth-first, and so on).
Writing code is still quite fun, isn't it?