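Product scraping is, I believe, a much-needed feature for e-commerce sites: it takes a lot of the tedium out of editing and listing products. Scraping is useful outside e-commerce too, for news portals for example. This series looks at how to scrape AliExpress products. One caveat up front: special cases such as Ajax-loaded content and CAPTCHAs are out of scope. As long as we do not hit AliExpress too frequently, we can treat it as a benign scraping environment. Without further ado, let's analyze how to scrape the product information, including variants (a product with Color and Size options whose prices differ is scraped as multiple products).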
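I. Techniques and tools to have ready before scraping
1. C# regular expressions. (I won't explain here what regular expressions are for.)
2. A regular expression tool: RegexMatchTracer. I have used it for a long time; it matches quickly and accurately, and the expressions written in it are highly compatible with .NET.
3. A request library: RestSharp. You could wrap the HTTP request code yourself with plain .NET, but I'm lazy and don't want to keep shuffling around code that has been in use for 800 years. RestSharp code is simple and easy to follow. Sample code: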
var client = new RestClient("http://example.com");
// client.Authenticator = new HttpBasicAuthenticator(username, password);

var request = new RestRequest("resource/{id}", Method.POST);
request.AddParameter("name", "value"); // adds to POST or URL querystring based on Method
request.AddUrlSegment("id", "123"); // replaces matching token in request.Resource

// add parameters for all properties on an object
request.AddObject(object);

// or just whitelisted properties
request.AddObject(object, "PersonId", "Name", ...);

// easily add HTTP Headers
request.AddHeader("header", "value");

// add files to upload (works with compatible verbs)
request.AddFile("file", path);

// execute the request
IRestResponse response = client.Execute(request);
var content = response.Content; // raw content as string

// or automatically deserialize result
// return content type is sniffed but can be explicitly set via RestClient.AddHandler();
IRestResponse<Person> response2 = client.Execute<Person>(request);
var name = response2.Data.Name;

// or download and save file to disk
client.DownloadData(request).SaveAs(path);

// easy async support
client.ExecuteAsync(request, response => {
    Console.WriteLine(response.Content);
});

// async with deserialization
var asyncHandle = client.ExecuteAsync<Person>(request, response => {
    Console.WriteLine(response.Data.Name);
});

// abort the request on demand
asyncHandle.Abort();
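4. We also need Json.NET, used here mainly to parse JSON. You may wonder why scraping needs JSON parsing at all; I'll keep that a surprise for now. Sample code for the parsing we rely on: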
// parse a JSON object
string json = @"{
  CPU: 'Intel',
  Drives: [
    'DVD read/writer',
    '500 gigabyte hard drive'
  ]
}";

JObject o = JObject.Parse(json);

// parse a JSON array (a separate variable, since json is already declared above)
string jsonArray = @"[
  'Small',
  'Medium',
  'Large'
]";
JArray a = JArray.Parse(jsonArray);
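5. An online JSON formatter. I use the one from OSChina's online tools (http://tool.oschina.net/): the JSON formatter at http://tool.oschina.net/codeformat/json
6. The ability to read and analyze HTML.
II. Analysis before scraping AliExpress
1. The information usually scraped for a product is: product name, product images, unit price, currency and, if there are variants, color, size (S, M, L, etc.), list price, discounted price, product description, and so on.
2. Scraping usually means requesting the HTTP URL to get the HTML, then matching the fields one by one with regular expressions until all the information has been collected.
3. An example AliExpress product page serves as the running example for the rest of this analysis.
4. Apart from the product variants and product images, everything else is easy to match with regular expressions and raises no real questions, so let's look at how to grab the variants and the images.
5. Product variants, as mentioned earlier, depend on Color and Size, so the number of variants of a product is the number of colors × the number of sizes. The example has 6 colors and 1 size, so it has 6 × 1 = 6 variants, and scraping it produces 6 product variants.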
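The variant count can be sketched as a Cartesian product of the two option lists. This is only an illustration in C# top-level statements; the color and size values are the ones visible on the example page, and the variable names are made up for the sketch.

```csharp
using System;
using System.Linq;

// Option values taken from the example page: six colors and a single size.
string[] colors = { "Black", "Blue", "Gray", "Orange", "Pink", "Purple" };
string[] sizes = { "M" };

// Every (color, size) pair is one variant, so the count is colors × sizes.
var variants = (from color in colors
                from size in sizes
                select (Color: color, Size: size)).ToList();

Console.WriteLine(variants.Count); // 6 × 1 = 6
foreach (var v in variants)
    Console.WriteLine($"{v.Color} / {v.Size}");
```

A product with three sizes would yield 18 entries from the same query, which is why each pair is treated as its own product during scraping.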
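So how do we scrape the variants? Following the usual regular-expression approach, we have to find regularities in the HTML. Let's look at the HTML for the product colors first: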
<ul id="j-sku-list-1" class="sku-attr-list util-clearfix" data-sku-prop-id="14">
<li class="item-sku-image"><a data-role="sku" data-sku-id="193" id="sku-1-193" title="Black" href="javascript:;" data-spm-anchor-id="2114.12010108.1000016.1"><img src="http://g02.a.alicdn.com/kf/HTB1whl0LVXXXXbiaXXXq6xXFXXXg/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg_50x50.jpg" title="Black" bigpic="http://g02.a.alicdn.com/kf/HTB1whl0LVXXXXbiaXXXq6xXFXXXg/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg"></a></li>
<li class="item-sku-image"><a data-role="sku" data-sku-id="173" id="sku-1-173" title="Blue" href="javascript:;" data-spm-anchor-id="2114.12010108.1000016.2"><img src="http://g02.a.alicdn.com/kf/HTB1SCp4LVXXXXb5XVXXq6xXFXXXF/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg_50x50.jpg" title="Blue" bigpic="http://g02.a.alicdn.com/kf/HTB1SCp4LVXXXXb5XVXXq6xXFXXXF/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg"></a></li>
<li class="item-sku-image"><a data-role="sku" data-sku-id="691" id="sku-1-691" title="Gray" href="javascript:;" data-spm-anchor-id="2114.12010108.1000016.3"><img src="http://g03.a.alicdn.com/kf/HTB1fxapLVXXXXXGXXXXq6xXFXXXf/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg_50x50.jpg" title="Gray" bigpic="http://g03.a.alicdn.com/kf/HTB1fxapLVXXXXXGXXXXq6xXFXXXf/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg"></a></li>
<li class="item-sku-image"><a data-role="sku" data-sku-id="350852" id="sku-1-350852" title="Orange" href="javascript:;" data-spm-anchor-id="2114.12010108.1000016.4"><img src="http://g01.a.alicdn.com/kf/HTB1YXVZLVXXXXawaXXXq6xXFXXXR/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg_50x50.jpg" title="Orange" bigpic="http://g01.a.alicdn.com/kf/HTB1YXVZLVXXXXawaXXXq6xXFXXXR/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg"></a></li>
<li class="item-sku-image"><a data-role="sku" data-sku-id="1052" id="sku-1-1052" title="Pink" href="javascript:;" data-spm-anchor-id="2114.12010108.1000016.5"><img src="http://g03.a.alicdn.com/kf/HTB10_ymLVXXXXb3XXXXq6xXFXXX5/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg_50x50.jpg" title="Pink" bigpic="http://g03.a.alicdn.com/kf/HTB10_ymLVXXXXb3XXXXq6xXFXXX5/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg"></a></li>
<li class="item-sku-image"><a data-role="sku" data-sku-id="496" id="sku-1-496" title="Purple" href="javascript:;" data-spm-anchor-id="2114.12010108.1000016.6"><img src="http://g01.a.alicdn.com/kf/HTB11wt1LVXXXXbaaXXXq6xXFXXXb/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg_50x50.jpg" title="Purple" bigpic="http://g01.a.alicdn.com/kf/HTB11wt1LVXXXXbaaXXXq6xXFXXXb/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg"></a></li>
</ul>
The ul->li->a tags hold the information for each color, for example black:
<a data-role="sku" data-sku-id="193" id="sku-1-193" title="Black" href="javascript:;" data-spm-anchor-id="2114.12010108.1000016.1"><img src="http://g02.a.alicdn.com/kf/HTB1whl0LVXXXXbiaXXXq6xXFXXXg/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg_50x50.jpg" title="Black" bigpic="http://g02.a.alicdn.com/kf/HTB1whl0LVXXXXbiaXXXq6xXFXXXg/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg"></a>
Comparing with the other colors, it is easy to see that the data-sku-id attribute of the a tag differs from color to color, is numeric, and varies in length; a first guess is that it is a code for the color. The title attribute holds the color's actual English name, such as Black or Blue.
The HTML for the product size is as follows:
<ul id="j-sku-list-2" class="sku-attr-list util-clearfix" data-sku-prop-id="5" data-sku-show-type="none" data-widget-cid="widget-17">
<li><a data-role="sku" data-sku-id="361386" id="sku-2-361386" href="javascript:void(0)" data-spm-anchor-id="2114.12010108.1000016.7"><span>M</span></a></li>
</ul>
Again it is the ul->li->a tags that hold the information for each size, for example M:
<a data-role="sku" data-sku-id="361386" id="sku-2-361386" href="javascript:void(0)" data-spm-anchor-id="2114.12010108.1000016.7"><span>M</span></a>
The data-sku-id attribute of the a tag is numeric here too. There is only one size, but I think the same rule applies as for colors: it is a code for each product size. You may have noticed the pattern in the id attribute: “sku-2-” + code is the id of a size entry, and colors likewise have the id “sku-1-” + code. So ids starting with sku-1 are colors and ids starting with sku-2 are sizes. Unlike the color name, the size abbreviation is not stored in title; it is the content of the a->span tag.
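To summarize the regularities found above:
1>. the element is an a tag;
2>. the data-sku-id attribute is numeric and varies in length;
3>. title is the English name of the color;
4>. the content of the a->span tag is the size abbreviation;
5>. the tags returned by the actual request are not necessarily identical to what the browser developer tools (F12) display.
With all that, it is hard to write a regular expression that is both efficient and free of errors.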
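For illustration, rules 1 to 4 could be turned into two naive patterns like the ones below. This is a sketch, not the approach the article ends up using: the HTML is a trimmed copy of the example markup, the pattern names are invented, and rule 5 means the real response may differ enough to break them.

```csharp
using System;
using System.Text.RegularExpressions;

// Trimmed copy of two color entries and the size entry from the example page.
string html =
    "<a data-role=\"sku\" data-sku-id=\"193\" id=\"sku-1-193\" title=\"Black\" href=\"javascript:;\"></a>" +
    "<a data-role=\"sku\" data-sku-id=\"173\" id=\"sku-1-173\" title=\"Blue\" href=\"javascript:;\"></a>" +
    "<a data-role=\"sku\" data-sku-id=\"361386\" id=\"sku-2-361386\" href=\"javascript:void(0)\"><span>M</span></a>";

// Colors: the id starts with "sku-1-" and the name lives in the title attribute.
var colorPattern = new Regex("id=\"sku-1-(?<id>\\d+)\"\\s+title=\"(?<name>[^\"]*)\"");
// Sizes: the id starts with "sku-2-" and the name is the text of the inner <span>.
var sizePattern = new Regex("id=\"sku-2-(?<id>\\d+)\"[^>]*>\\s*<span>(?<name>[^<]*)</span>");

foreach (Match m in colorPattern.Matches(html))
    Console.WriteLine($"color {m.Groups["id"].Value} = {m.Groups["name"].Value}");
foreach (Match m in sizePattern.Matches(html))
    Console.WriteLine($"size {m.Groups["id"].Value} = {m.Groups["name"].Value}");
```

Both patterns depend on attribute order and spacing, which is exactly the fragility the list above points at.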
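Is there an easier way? Just as I was about to give up, I had an idea: the colors on the AliExpress product page are probably rendered by JavaScript, so searching for a color name might turn something up. I searched for “Black” in F12.
It found four occurrences, but looking through them I found no JavaScript related to rendering the colors. That couldn't be right, I thought, and then I remembered the color code. I quickly searched for “193”, the code for Black, and there was the pleasant surprise, as shown in the screenshot.
The part in the red box is what the search for “193” found. Next I searched for the size code “361386” and was amazed again; the result is shown in the screenshot.
Both searches point to the same block of code. At this point I was sure that this block is what AliExpress uses to render the colors and sizes in JavaScript; more precisely, it is data in JSON format.
Time to study that block of code.
6. JSON analysis. First run the JSON through the online formatter, otherwise it is too ugly to read. Copy out all the data of the var skuProducts array; formatted, it looks like this: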
[
  {
    "skuAttr": "14:193;5:361386",
    "skuPropIds": "193,361386",
    "skuVal": {
      "actSkuBulkCalPrice": "6.84",
      "actSkuBulkPrice": "6.84",
      "actSkuCalPrice": "7.20",
      "actSkuDisplayBulkPrice": "US $6.84",
      "actSkuMultiCurrencyBulkPrice": "6.84",
      "actSkuMultiCurrencyCalPrice": "7.2",
      "actSkuMultiCurrencyDisplayPrice": "7.20",
      "actSkuMultiCurrencyPrice": "US $7.20",
      "actSkuPrice": "7.20",
      "availQuantity": 7,
      "bulkOrder": 5,
      "inventory": 10,
      "isActivity": true,
      "skuBulkCalPrice": "15.2",
      "skuBulkPrice": "15.20",
      "skuCalPrice": "16.00",
      "skuDisplayBulkPrice": "US $15.20",
      "skuMultiCurrencyBulkPrice": "15.2",
      "skuMultiCurrencyCalPrice": "16.0",
      "skuMultiCurrencyDisplayPrice": "16.00",
      "skuMultiCurrencyPrice": "US $16.00",
      "skuPrice": "16.00"
    }
  },
  {
    "skuAttr": "14:173;5:361386",
    "skuPropIds": "173,361386",
    "skuVal": {
      "actSkuBulkCalPrice": "6.84",
      "actSkuBulkPrice": "6.84",
      "actSkuCalPrice": "7.20",
      "actSkuDisplayBulkPrice": "US $6.84",
      "actSkuMultiCurrencyBulkPrice": "6.84",
      "actSkuMultiCurrencyCalPrice": "7.2",
      "actSkuMultiCurrencyDisplayPrice": "7.20",
      "actSkuMultiCurrencyPrice": "US $7.20",
      "actSkuPrice": "7.20",
      "availQuantity": 9,
      "bulkOrder": 5,
      "inventory": 10,
      "isActivity": true,
      "skuBulkCalPrice": "15.2",
      "skuBulkPrice": "15.20",
      "skuCalPrice": "16.00",
      "skuDisplayBulkPrice": "US $15.20",
      "skuMultiCurrencyBulkPrice": "15.2",
      "skuMultiCurrencyCalPrice": "16.0",
      "skuMultiCurrencyDisplayPrice": "16.00",
      "skuMultiCurrencyPrice": "US $16.00",
      "skuPrice": "16.00"
    }
  },
  {
    "skuAttr": "14:691;5:361386",
    "skuPropIds": "691,361386",
    "skuVal": {
      "actSkuBulkCalPrice": "6.84",
      "actSkuBulkPrice": "6.84",
      "actSkuCalPrice": "7.20",
      "actSkuDisplayBulkPrice": "US $6.84",
      "actSkuMultiCurrencyBulkPrice": "6.84",
      "actSkuMultiCurrencyCalPrice": "7.2",
      "actSkuMultiCurrencyDisplayPrice": "7.20",
      "actSkuMultiCurrencyPrice": "US $7.20",
      "actSkuPrice": "7.20",
      "availQuantity": 10,
      "bulkOrder": 5,
      "inventory": 10,
      "isActivity": true,
      "skuBulkCalPrice": "15.2",
      "skuBulkPrice": "15.20",
      "skuCalPrice": "16.00",
      "skuDisplayBulkPrice": "US $15.20",
      "skuMultiCurrencyBulkPrice": "15.2",
      "skuMultiCurrencyCalPrice": "16.0",
      "skuMultiCurrencyDisplayPrice": "16.00",
      "skuMultiCurrencyPrice": "US $16.00",
      "skuPrice": "16.00"
    }
  },
  {
    "skuAttr": "14:350852;5:361386",
    "skuPropIds": "350852,361386",
    "skuVal": {
      "actSkuBulkCalPrice": "6.84",
      "actSkuBulkPrice": "6.84",
      "actSkuCalPrice": "7.20",
      "actSkuDisplayBulkPrice": "US $6.84",
      "actSkuMultiCurrencyBulkPrice": "6.84",
      "actSkuMultiCurrencyCalPrice": "7.2",
      "actSkuMultiCurrencyDisplayPrice": "7.20",
      "actSkuMultiCurrencyPrice": "US $7.20",
      "actSkuPrice": "7.20",
      "availQuantity": 10,
      "bulkOrder": 5,
      "inventory": 10,
      "isActivity": true,
      "skuBulkCalPrice": "15.2",
      "skuBulkPrice": "15.20",
      "skuCalPrice": "16.00",
      "skuDisplayBulkPrice": "US $15.20",
      "skuMultiCurrencyBulkPrice": "15.2",
      "skuMultiCurrencyCalPrice": "16.0",
      "skuMultiCurrencyDisplayPrice": "16.00",
      "skuMultiCurrencyPrice": "US $16.00",
      "skuPrice": "16.00"
    }
  },
  {
    "skuAttr": "14:1052;5:361386",
    "skuPropIds": "1052,361386",
    "skuVal": {
      "actSkuBulkCalPrice": "6.84",
      "actSkuBulkPrice": "6.84",
      "actSkuCalPrice": "7.20",
      "actSkuDisplayBulkPrice": "US $6.84",
      "actSkuMultiCurrencyBulkPrice": "6.84",
      "actSkuMultiCurrencyCalPrice": "7.2",
      "actSkuMultiCurrencyDisplayPrice": "7.20",
      "actSkuMultiCurrencyPrice": "US $7.20",
      "actSkuPrice": "7.20",
      "availQuantity": 10,
      "bulkOrder": 5,
      "inventory": 10,
      "isActivity": true,
      "skuBulkCalPrice": "15.2",
      "skuBulkPrice": "15.20",
      "skuCalPrice": "16.00",
      "skuDisplayBulkPrice": "US $15.20",
      "skuMultiCurrencyBulkPrice": "15.2",
      "skuMultiCurrencyCalPrice": "16.0",
      "skuMultiCurrencyDisplayPrice": "16.00",
      "skuMultiCurrencyPrice": "US $16.00",
      "skuPrice": "16.00"
    }
  },
  {
    "skuAttr": "14:496;5:361386",
    "skuPropIds": "496,361386",
    "skuVal": {
      "actSkuBulkCalPrice": "6.84",
      "actSkuBulkPrice": "6.84",
      "actSkuCalPrice": "7.20",
      "actSkuDisplayBulkPrice": "US $6.84",
      "actSkuMultiCurrencyBulkPrice": "6.84",
      "actSkuMultiCurrencyCalPrice": "7.2",
      "actSkuMultiCurrencyDisplayPrice": "7.20",
      "actSkuMultiCurrencyPrice": "US $7.20",
      "actSkuPrice": "7.20",
      "availQuantity": 9,
      "bulkOrder": 5,
      "inventory": 10,
      "isActivity": true,
      "skuBulkCalPrice": "15.2",
      "skuBulkPrice": "15.20",
      "skuCalPrice": "16.00",
      "skuDisplayBulkPrice": "US $15.20",
      "skuMultiCurrencyBulkPrice": "15.2",
      "skuMultiCurrencyCalPrice": "16.0",
      "skuMultiCurrencyDisplayPrice": "16.00",
      "skuMultiCurrencyPrice": "US $16.00",
      "skuPrice": "16.00"
    }
  }
]
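Once formatted, it is obvious at a glance that this is a JSON array, and its length is exactly 6, matching the number of product variants. So let's work out which product information can be taken from this array.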
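A minimal sketch of loading such an array with Json.NET and checking its length might look like this. It assumes the Newtonsoft.Json package is referenced; the JSON literal is cut down to two entries and three fields to keep the example short, and the variable names are illustrative.

```csharp
using System;
using Newtonsoft.Json.Linq;

// Two of the six entries, trimmed to a few fields for brevity.
string skuProductsJson = @"[
  { ""skuAttr"": ""14:193;5:361386"", ""skuPropIds"": ""193,361386"",
    ""skuVal"": { ""actSkuPrice"": ""7.20"", ""skuPrice"": ""16.00"", ""availQuantity"": 7 } },
  { ""skuAttr"": ""14:173;5:361386"", ""skuPropIds"": ""173,361386"",
    ""skuVal"": { ""actSkuPrice"": ""7.20"", ""skuPrice"": ""16.00"", ""availQuantity"": 9 } }
]";

// One array element corresponds to one product variant.
JArray skuProducts = JArray.Parse(skuProductsJson);
Console.WriteLine($"variants: {skuProducts.Count}");

foreach (JToken sku in skuProducts)
{
    string propIds = (string)sku["skuPropIds"];
    int available = (int)sku["skuVal"]["availQuantity"];
    Console.WriteLine($"{propIds}: {available} in stock");
}
```

With the full array from the page, the count printed would be the 6 variants discussed above.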
Take one JSON object from the array for analysis; the others follow the same pattern:
{
  "skuAttr": "14:193;5:361386",
  "skuPropIds": "193,361386",
  "skuVal": {
    "actSkuBulkCalPrice": "6.84",
    "actSkuBulkPrice": "6.84",
    "actSkuCalPrice": "7.20",
    "actSkuDisplayBulkPrice": "US $6.84",
    "actSkuMultiCurrencyBulkPrice": "6.84",
    "actSkuMultiCurrencyCalPrice": "7.2",
    "actSkuMultiCurrencyDisplayPrice": "7.20",
    "actSkuMultiCurrencyPrice": "US $7.20",
    "actSkuPrice": "7.20",
    "availQuantity": 7,
    "bulkOrder": 5,
    "inventory": 10,
    "isActivity": true,
    "skuBulkCalPrice": "15.2",
    "skuBulkPrice": "15.20",
    "skuCalPrice": "16.00",
    "skuDisplayBulkPrice": "US $15.20",
    "skuMultiCurrencyBulkPrice": "15.2",
    "skuMultiCurrencyCalPrice": "16.0",
    "skuMultiCurrencyDisplayPrice": "16.00",
    "skuMultiCurrencyPrice": "US $16.00",
    "skuPrice": "16.00"
  }
}
We found this JavaScript by searching for the color code and the size code, so obviously both the color and the size can be found in it. But where exactly?
The Black code 193 and the size code 361386 show up right away in the first and second properties: "skuAttr": "14:193;5:361386" and "skuPropIds": "193,361386". (The 14 and 5 in skuAttr are the data-sku-prop-id values of the color list and the size list in the HTML above.)
Taking the first property would require splitting twice to get the values, so the second property is the better fit: a single split yields them. Of course that still only gives us codes. To get the English names of the color and the size we have to go back to the color and size HTML tags above, write two simple regular expressions with a placeholder in them, and build the final expression with string.Format. That gives a fairly reliable and efficient way to get the color and size. All the regular expressions are listed at the end of the article.
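The split-then-format idea can be sketched as follows. The two templates here are simplified stand-ins written for this example rather than the patterns listed at the end of the article, the HTML is a trimmed copy of the example markup, and the sketch assumes the color ID comes first in skuPropIds, as it does in the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Trimmed HTML for one color and one size option from the example page.
string html =
    "<a data-role=\"sku\" data-sku-id=\"193\" id=\"sku-1-193\" title=\"Black\" href=\"javascript:;\"></a>" +
    "<a data-role=\"sku\" data-sku-id=\"361386\" id=\"sku-2-361386\" href=\"javascript:void(0)\"><span>M</span></a>";

// Patterns with a {0} placeholder that string.Format fills with the option ID.
// (.NET allows variable-length lookbehind, which the size template relies on.)
const string colorTemplate = "(?<=id=\"sku-1-{0}\" title=\")[^\"]*";
const string sizeTemplate = "(?<=id=\"sku-2-{0}\"[^>]*><span>)[^<]*";

// One split on ',' yields the color ID first and the size ID second.
string[] ids = "193,361386".Split(',');
string color = Regex.Match(html, string.Format(colorTemplate, ids[0])).Value;
string size = Regex.Match(html, string.Format(sizeTemplate, ids[1])).Value;

Console.WriteLine($"{color} / {size}");
```

Because {0} is the only brace pair in each template, string.Format does not collide with any regex quantifier; a template that used something like {3} would need its braces doubled.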
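Looking further through the JSON object, we see words such as US $ and price, so these are clearly the unit price and currency information. There is a list price and a discounted price, as shown in the screenshot.
In the screenshot the list price is US $16.00 and the discounted price is US $7.20. The JSON contains many fields with the same values: some hold only the price, others hold the price together with the currency. Which to use is a matter of taste; I take the ones with only the price (the currency can be obtained elsewhere, more on that shortly). I don't like constantly splitting or slicing strings, because it is error-prone and tedious. There are still several price-only fields, so how to choose? My rule is to pick the property whose name is simple and easy to understand: "skuPrice": "16.00" for the list price and "actSkuPrice": "7.20" for the discounted price. That settles the product prices.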
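Reading the two chosen fields might look like the sketch below, which assumes the Newtonsoft.Json package. The fallback to the list price when actSkuPrice is missing is an assumption made for this example; the article does not say how products without a promotion are represented.

```csharp
using System;
using System.Globalization;
using Newtonsoft.Json.Linq;

// One variant, trimmed to the two price fields chosen above.
string skuJson = @"{
  ""skuPropIds"": ""193,361386"",
  ""skuVal"": { ""actSkuPrice"": ""7.20"", ""skuPrice"": ""16.00"" }
}";

JObject sku = JObject.Parse(skuJson);
JToken skuVal = sku["skuVal"];

// Prices arrive as strings; parse with the invariant culture so "16.00"
// is read the same way regardless of the machine's regional settings.
decimal listPrice = decimal.Parse((string)skuVal["skuPrice"], CultureInfo.InvariantCulture);

// Assumption: actSkuPrice may be absent when no promotion is running,
// in which case the list price is used as the sale price.
string actRaw = (string)skuVal["actSkuPrice"];
decimal salePrice = actRaw == null
    ? listPrice
    : decimal.Parse(actRaw, CultureInfo.InvariantCulture);

Console.WriteLine($"list {listPrice}, sale {salePrice}");
```

Using decimal rather than double avoids rounding surprises when the prices are later stored or compared.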
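What about the currency? As I said, I don't take it from this JSON: it would need tedious splitting, and US $ is not the international standard currency code (USD) anyway. So what to do? Search the source again, this time for USD, as shown in the screenshot.
There are 4 hits, but the easiest one to recognize is the one marked in the screenshot: window.runParams.currencyCode="USD"; Judging by the property name alone, this is the best place to take the currency from, and it sits right next to the variant JSON. A simple regular expression is enough to extract it. Currency done.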
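As a rough sketch, the currency code could be pulled out like this. The pageSource string is a made-up fragment shaped like the assignment quoted above, and the pattern is a capture-group variant written for this example; the three-capital-letter restriction is an assumption based on the usual form of currency codes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical fragment of the page source containing the assignment.
string pageSource = "window.runParams.currencyCode=\"USD\";\nvar skuProducts=[];";

// Dots are escaped so they match literally; the code is captured as three capitals.
Match m = Regex.Match(pageSource, "window\\.runParams\\.currencyCode=\"(?<code>[A-Z]{3})\"");
string currency = m.Success ? m.Groups["code"].Value : "";

Console.WriteLine(currency);
```

Checking m.Success keeps a page without the assignment from silently producing a wrong value.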
7. Product images means all of the product's original-size images; thumbnails are not needed. No matter how many variants a product has, they all share the same set of images, so the images only need to be scraped once.
The usual approach is to find the src attribute of the img tags and write a regular expression, but I am not very fond of regular expressions, so I looked elsewhere for a JSON block similar to the one for the variants.
In F12 again, pick any large image, copy its img src value, and search the HTML source for it. The search string is:
http://g02.a.alicdn.com/kf/HTB11XybLVXXXXbDXFXXq6xXFXXXM/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg
The most promising hit is shown in the screenshot.
Another JSON array, what a delight. Extract the code to confirm that it really is the product images:
[
  "http://g02.a.alicdn.com/kf/HTB11XybLVXXXXbDXFXXq6xXFXXXM/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg",
  "http://g03.a.alicdn.com/kf/HTB1G45fLVXXXXbEXpXXq6xXFXXXi/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg",
  "http://g02.a.alicdn.com/kf/HTB1dhCfLVXXXXb8XpXXq6xXFXXXX/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg",
  "http://g03.a.alicdn.com/kf/HTB1Kd04LVXXXXcEXVXXq6xXFXXXL/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg",
  "http://g02.a.alicdn.com/kf/HTB1ks1oLVXXXXbtXXXXq6xXFXXXH/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg",
  "http://g01.a.alicdn.com/kf/HTB1jsGfLVXXXXXXXFXXq6xXFXXXm/Yoga-Tank-Tops-Shirts-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirts-Sports-Woman-Gym.jpg"
]
First, the array length is 6, meaning six images, which matches the six in the red box in the screenshot.
That alone does not prove they are the product images, so take the first image link from the array and open it in the browser; the result is shown in the screenshot.
It is the same foreign model in the black top, so we can be sure this array is the product images we were looking for.
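Pulling the image array out of the page and parsing it could be sketched like this, assuming the Newtonsoft.Json package. The pageSource fragment and its example.com URLs are placeholders invented for the sketch; only the variable name window.runParams.imageBigViewURL comes from the article's own pattern list.

```csharp
using System;
using System.Text.RegularExpressions;
using Newtonsoft.Json.Linq;

// Hypothetical fragment shaped like the page source; the URLs are placeholders.
string pageSource =
    "window.runParams.imageBigViewURL=[\"http://example.com/a.jpg\",\"http://example.com/b.jpg\"];";

// Grab the bracketed array that follows the assignment and ends right before ';'.
Match m = Regex.Match(
    pageSource,
    @"(?<=window\.runParams\.imageBigViewURL=)\[.*?\](?=;)",
    RegexOptions.Singleline);

// Fall back to an empty array when the assignment is not found.
JArray images = m.Success ? JArray.Parse(m.Value) : new JArray();

foreach (JToken url in images)
    Console.WriteLine((string)url);
```

Since all variants share one image set, this extraction only needs to run once per product page.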
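That completes the analysis of the parts that are harder to scrape. The product name and description can be obtained with regular expressions, and the JSON blocks themselves also have to be pulled out of the page with regular expressions. All the regular expressions are listed below, which essentially wraps up this article.
Because this post is already long, I will write up the hands-on part as soon as possible in a separate post, “速卖通产品采集系列之产品采集实践” (AliExpress Product Scraping Series: Scraping in Practice), and release the source code with it. Stay tuned.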
III. The regular expressions (I am no regex expert and these are not very general, so please bear with me):
1. Product name: (?<=<h1 class=\"product-name\" itemprop=\"name\">).*?(?=</h1>)
2. Product images: (?<=window.runParams.imageBigViewURL=).*?(?=;)
3. Variant JSON: (?<=var skuProducts=).*?(?=;\s*var skuAttrIds=)
4. Product color: (?<=<a data-role=\"sku\" data-sku-id=\"{0}\" id=\"sku-1-{0}\" title=\").*?(?=\") (“{0}” is not part of the regular expression; it is a string.Format placeholder)
5. Product size: (?<=<a data-role=\"sku\" data-sku-id=\"{0}\" id=\"sku-2-{0}\" href=\"javascript:void\(0\)\"\s+><span>).*?(?=</) (“{0}” is not part of the regular expression; it is a string.Format placeholder)
6. Currency: (?<=window.runParams.currencyCode=\").*?(?=\";)
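As a usage sketch, pattern 3 can be applied with .NET's Regex class as shown below. The pageSource fragment, including the value assigned to skuAttrIds, is invented for the example. RegexOptions.Singleline is added as a precaution so that .*? can span line breaks in case the JSON is not on a single line, which is an assumption about the real page.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical fragment of the page source around the variant JSON.
string pageSource =
    "var skuProducts=[{\"skuPropIds\":\"193,361386\"}];\n" +
    "var skuAttrIds=[[193],[361386]];";

// Pattern 3 from the list: everything between the two variable declarations.
var variantJsonPattern = new Regex(
    @"(?<=var skuProducts=).*?(?=;\s*var skuAttrIds=)",
    RegexOptions.Singleline);

Match m = variantJsonPattern.Match(pageSource);
Console.WriteLine(m.Success ? m.Value : "(not found)");
```

The matched text is the raw JSON array, ready to hand to JArray.Parse.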