Environment: Windows 7 + Python 3.6 + PyCharm 2017
Goal: scrape shop information for the Shenzhen area from the Meituan mobile food channel, including shop name, category, address, phone number, average price per person, opening hours, rating, review count, and longitude/latitude. In the end about 21,000 records were collected; the program runs for roughly one hour. Tools: requests, selenium, Chrome.
--- All posts: JD spider, Lianjia spider, Meituan spider, WeChat Official Account spider, font anti-scraping, Django notes, Aliyun deployment, vi/vim basics ---
1. The Meituan desktop site and its _token
Open the Shenzhen Meituan site https://sz.meituan.com/, click 美食 (Food), and press F12 to open the browser developer tools. Select Network, then XHR, and click any district, for example 香蜜湖. You can then capture a request named getPoiList?cityName=XXXXXXXX. Its URL contains a parameter called _token, which is presumably computed by some algorithm; to simulate the browser's requests we would first need to know how it is generated. The token appears to be produced by JavaScript, and with JS-based protection you generally either reverse the algorithm and reimplement it in your own code, or call the site's JS directly. The parameter also seems to have been added only in the last few months: searching online turned up no solution, and reading the JS files myself got me nowhere, so I gave up on the desktop site. If anyone knows how to handle this token, please let me know, thanks! If you really need a token, selenium + Chrome should also work; each token seems to stay valid for a limited period.
2. The Meituan mobile site
Since the desktop site was a dead end, another route was needed. Many sites nowadays have a desktop version, a mobile version, and an app, and the mobile version usually has weaker anti-scraping. Open the Meituan mobile site https://i.meituan.com/, press F12 to open the developer tools, and click the two icons marked 1 in the screenshot below to emulate a mobile browser.
而后点击美食,进入下图界面,看到右边的两个请求。第一个请求是页面的基本框架信息,好比上面各类分类信息,后面会用到。第二个请求list,是一个动态请求,用以得到商家信息。点击发现是一个post请求,请求的参数以下图红框中所示,多点击几家店铺就能看出参数的含义。变化的就四个参数areaId--地区分类、cataId--美食分类、offset--翻页参数、uuid--网站分发的id。api
We can simulate this POST request directly and page through the results by changing offset: each page holds 15 records, so offset increases by 15 per page. In practice, paging directly on the food page stops at 67 pages (1,005 records); beyond that either a captcha appears or no data comes back. So we have to scrape the shops category by category.
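The paging arithmetic just described (15 records per page, a hard stop around page 67) can be captured in a small helper. This is a sketch of my own; `page_offsets` is not a name from the article's code:

```python
import math

PAGE_SIZE = 15        # records per list page
MAX_OFFSET = 67 * 15  # the list endpoint stops returning data past 67 pages

def page_offsets(total_count):
    """Yield the offset parameter for every reachable page of one category."""
    pages = math.ceil(total_count / PAGE_SIZE)
    for page in range(pages):
        offset = page * PAGE_SIZE
        if offset >= MAX_OFFSET:
            break  # beyond this the site returns a captcha or nothing
        yield offset
```

For a district with 40 shops this yields offsets 0, 15, 30; for a huge count it caps out at 67 pages, which is exactly why the crawl below splits by district.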
咱们须要的信息在店铺的详情页面,通常详情页面的url都是几个关键参数的拼凑,而这几个关键参数是能够在上面的列表页面抓取到的。咱们点开一家店铺,观察url,发现主要是两个参数,一个是店铺的id:6268902,还有一个就是ct_poi参数,这两个参数均可以在上面的post请求返回数据中找到。服务器
Also, the browser captures a lot of requests when we open a detail page, so we need to confirm which one returns the fields we want (shop name, category, address, phone, average price, opening hours, rating, review count, longitude/latitude). It turns out to be the very first request, the URL shown above.
Open the HTML returned by that first request and Ctrl+F for the shop's phone number to locate the data. It sits inside a <script crossorigin='anonymous'> tag. There are several such tags, so they must be told apart: when extracting the tag contents with XPath, take the first 16 characters of each tag's text and check whether they equal window._appState. After that it is just JSON processing.
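The tag-selection trick described above can be sketched as follows. The article's code uses lxml; to keep this sketch self-contained it uses the standard library's ElementTree on a tiny synthetic, well-formed page, so the payload below is made up, not real Meituan data:

```python
import json
import xml.etree.ElementTree as ET

def extract_app_state(page_html):
    """Return the JSON payload of the <script crossorigin='anonymous'> tag
    whose text starts with window._appState, or None if absent."""
    root = ET.fromstring(page_html)
    for script in root.iter('script'):
        if script.get('crossorigin') != 'anonymous':
            continue
        text = script.text or ''
        if text[:16] == 'window._appState':
            # strip the 'window._appState=' prefix and the trailing ';'
            return json.loads(text[17:].rstrip(';'))
    return None

# a tiny synthetic page standing in for the real detail page
html = ('<html><body>'
        '<script crossorigin="anonymous">var other=1;</script>'
        '<script crossorigin="anonymous">'
        'window._appState={"poiInfo":{"name":"demo","phone":"123"}};'
        '</script>'
        '</body></html>')
state = extract_app_state(html)
print(state['poiInfo']['name'])
```

On the real page the exact prefix length may differ slightly (the article's code slices with `[19:-1]`), so check the raw text once before fixing the offsets.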
3. Overall approach
With that, the scraping plan is clear: first scrape each shop's id and ct_poi from the list pages, build the detail-page URLs, then visit each detail page and extract the data. Because paging stops at 67 pages, we must scrape by category; here I split by district, which should keep every district under 67 pages (1,005 shops). The site shows 46,655 shops city-wide, but the per-district counts add up to about 24,000, and the 所有 (All) category also shows about 24,000 in total, so I believe the real total is around 24,000. The remaining problem is therefore to collect the areaId of every district.
4. Scraping the district IDs
Open the food page https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1
Looking at its HTML, the district IDs also sit inside a <script crossorigin='anonymous'> tag. The data shown in the browser is truncated, so download the HTML and open it locally in an editor. Again it is just JSON processing. The only special case is 南澳新区, which has no sub-districts; I simply merged it into 坪山区.
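Flattening that JSON into a plain list of sub-districts (skipping each district's leading 所有 aggregate entry, as the main program does later) might look like this; the toy dict below is a heavily truncated stand-in for the real data:

```python
def flatten_areas(area_obj):
    """Collect every sub-district, skipping the first entry of each district
    (the "所有" whole-district aggregate)."""
    areas = []
    for district in area_obj.values():
        areas.extend(district[1:])
    return areas

# toy data in the same shape as the page's areaObj
area_obj = {
    "28": [{"id": 28, "name": "所有", "count": 4022},
           {"id": 1056, "name": "香蜜湖", "count": 105},
           {"id": 744, "name": "梅林", "count": 421}],
    "29": [{"id": 29, "name": "所有", "count": 2191},
           {"id": 6976, "name": "国贸", "count": 232}],
}
print([a['id'] for a in flatten_areas(area_obj)])
```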
5. Scraping the shop id and ct_poi
With the district IDs in hand, we can build the POST requests for shop data directly. The request must carry a cookie, and a single cookie lasts the whole crawl. The response is JSON containing 15 shop records, from which we extract the shop id and ct_poi and append them to a local CSV file. After the crawl, deduplicate the records, treating rows with the same shop id as duplicates. The code also saves each shop's category (cateName), since the detail page does not seem to carry it. The code is below; it should run after you substitute your own cookie. After deduplication I ended up with 21,872 records.
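The dedup pass mentioned above is not shown in the article's code; here is one possible sketch that keeps the first row per shop id (column 2 in the rows saved below: name, cateName, poiid, ctPoi):

```python
import csv

def dedupe_by_shop_id(src, dst, id_col=2):
    """Copy csv src to dst, keeping only the first row seen for each shop id."""
    seen = set()
    with open(src, 'r', encoding='gb18030', newline='') as fin, \
         open(dst, 'w', encoding='gb18030', newline='') as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            if row and row[id_col] not in seen:
                seen.add(row[id_col])
                writer.writerow(row)
```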
# coding=utf-8
import csv
import time
import json
import requests

# Scrape shop name / cateName / poiid / ctPoi for one district; areaid is the district id
def crow_id(areaid):
    url = 'https://meishi.meituan.com/i/api/channel/deal/list'
    head = {
        'Host': 'meishi.meituan.com',
        'Accept': 'application/json',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Referer': 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36',
        'Cookie': 'XXXXXXXXXXXXXX'  # substitute your own cookie
    }
    p = {'https': 'https://27.157.76.75:4275'}  # proxy; substitute your own
    data = {"uuid": "09dbb48e-4aed-4683-9ce5-c14b16ae7539", "version": "8.3.3", "platform": 3, "app": "", "partner": 126, "riskLevel": 1, "optimusCode": 10, "originUrl": "http://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1", "offset": 0, "limit": 15, "cateId": 1, "lineId": 0, "stationId": 0, "areaId": areaid, "sort": "default", "deal_attr_23": "", "deal_attr_24": "", "deal_attr_25": "", "poi_attr_20043": "", "poi_attr_20033": ""}
    r = requests.post(url, headers=head, data=data, proxies=p)
    result = json.loads(r.text)
    totalcount = result['data']['poiList']['totalCount']  # total shops in this district; determines how many pages to turn
    datas = result['data']['poiList']['poiInfos']
    print(len(datas), totalcount)
    id_list = [[d['name'], d['cateName'], d['poiid'], d['ctPoi']] for d in datas]
    print('Page:1')
    # append this page to the local csv
    with open('meituan_id.csv', 'a', newline='', encoding='gb18030') as f:
        write = csv.writer(f)
        for i in id_list:
            write.writerow(i)
    # pages 2 .. last: turning the page just means raising offset by 15
    offset = 0
    if totalcount > 15:
        totalcount -= 15
        while offset < totalcount:
            offset += 15
            print('Page:%d' % (offset // 15 + 1))
            data['offset'] = offset  # same post body, new offset
            try:
                r = requests.post(url, headers=head, data=data, proxies=p)
                result = json.loads(r.text)
                datas = result['data']['poiList']['poiInfos']
                print(len(datas))
                id_list = [[d['name'], d['cateName'], d['poiid'], d['ctPoi']] for d in datas]
                with open('meituan_id.csv', 'a', newline='', encoding='gb18030') as f:
                    write = csv.writer(f)
                    for i in id_list:
                        write.writerow(i)
            except Exception as e:
                print(e)

if __name__ == '__main__':
    # district info copied straight out of the page HTML; 南澳新区 has no sub-districts, so it was merged into 坪山区
    a = {"areaObj": {
        "28": [{"id": 28, "name": "所有", "regionName": "福田区", "count": 4022}, {"id": 1056, "name": "香蜜湖", "regionName": "香蜜湖", "count": 105}, {"id": 744, "name": "梅林", "regionName": "梅林", "count": 421}, {"id": 1055, "name": "上沙/下沙", "regionName": "上沙/下沙", "count": 291}, {"id": 2008, "name": "华强南", "regionName": "华强南", "count": 263}, {"id": 742, "name": "八卦岭/园岭", "regionName": "八卦岭/园岭", "count": 217}, {"id": 741, "name": "华强北", "regionName": "华强北", "count": 572}, {"id": 743, "name": "皇岗/水围", "regionName": "皇岗/水围", "count": 136}, {"id": 756, "name": "新城市广场", "regionName": "新城市广场", "count": 140}, {"id": 6595, "name": "车公庙", "regionName": "车公庙", "count": 305}, {"id": 6596, "name": "景田", "regionName": "景田", "count": 144}, {"id": 6597, "name": "新洲/石厦", "regionName": "新洲/石厦", "count": 374}, {"id": 6974, "name": "竹子林", "regionName": "竹子林", "count": 107}, {"id": 6975, "name": "市民中心", "regionName": "市民中心", "count": 39}, {"id": 7993, "name": "会展中心", "regionName": "会展中心", "count": 461}, {"id": 7994, "name": "岗厦", "regionName": "岗厦", "count": 110}, {"id": 7996, "name": "福田保税区", "regionName": "福田保税区", "count": 29}],
        "29": [{"id": 29, "name": "所有", "regionName": "罗湖区", "count": 2191}, {"id": 6976, "name": "国贸", "regionName": "国贸", "count": 232}, {"id": 758, "name": "莲塘", "regionName": "莲塘", "count": 125}, {"id": 2009, "name": "笋岗", "regionName": "笋岗", "count": 159}, {"id": 748, "name": "翠竹路沿线", "regionName": "翠竹路沿线", "count": 42}, {"id": 745, "name": "东门", "regionName": "东门", "count": 484}, {"id": 746, "name": "宝安南路沿线", "regionName": "宝安南路沿线", "count": 67}, {"id": 757, "name": "火车站", "regionName": "火车站", "count": 96}, {"id": 6598, "name": "万象城", "regionName": "万象城", "count": 127}, {"id": 6599, "name": "喜荟城/水库", "regionName": "喜荟城/水库", "count": 99}, {"id": 7659, "name": "地王大厦", "regionName": "地王大厦", "count": 85}, {"id": 8469, "name": "黄贝岭", "regionName": "黄贝岭", "count": 136}, {"id": 8470, "name": "春风万佳/文锦渡", "regionName": "春风万佳/文锦渡", "count": 19}, {"id": 8471, "name": "布心/太白路", "regionName": "布心/太白路", "count": 154}, {"id": 8790, "name": "田贝/水贝", "regionName": "田贝/水贝", "count": 85}, {"id": 8794, "name": "银湖/泥岗", "regionName": "银湖/泥岗", "count": 37}, {"id": 8795, "name": "新秀/罗芳", "regionName": "新秀/罗芳", "count": 33}, {"id": 13080, "name": "梧桐山", "regionName": "梧桐山", "count": 34}, {"id": 14095, "name": "KK mall", "regionName": "KK mall", "count": 74}],
        "30": [{"id": 30, "name": "所有", "regionName": "南山区", "count": 3905}, {"id": 751, "name": "南头", "regionName": "南头", "count": 325}, {"id": 750, "name": "华侨城", "regionName": "华侨城", "count": 126}, {"id": 749, "name": "蛇口", "regionName": "蛇口", "count": 9}, {"id": 1057, "name": "南油", "regionName": "南油", "count": 218}, {"id": 1058, "name": "科技园", "regionName": "科技园", "count": 460}, {"id": 1059, "name": "西丽", "regionName": "西丽", "count": 586}, {"id": 4811, "name": "南山中心区", "regionName": "南山中心区", "count": 635}, {"id": 6591, "name": "海岸城/保利", "regionName": "海岸城/保利", "count": 158}, {"id": 6592, "name": "前海", "regionName": "前海", "count": 32}, {"id": 6593, "name": "白石洲", "regionName": "白石洲", "count": 190}, {"id": 6594, "name": "欢乐海岸", "regionName": "欢乐海岸", "count": 22}, {"id": 7597, "name": "太古城", "regionName": "太古城", "count": 57}, {"id": 7599, "name": "花园城", "regionName": "花园城", "count": 42}, {"id": 13109, "name": "海上世界", "regionName": "海上世界", "count": 225}, {"id": 23117, "name": "世界之窗", "regionName": "世界之窗", "count": 97}, {"id": 25152, "name": "南山京基百纳", "regionName": "南山京基百纳", "count": 22}, {"id": 36635, "name": "深圳湾", "regionName": "深圳湾", "count": 17}],
        "31": [{"id": 31, "name": "所有", "regionName": "盐田区", "count": 407}, {"id": 754, "name": "大小梅沙", "regionName": "大小梅沙", "count": 36}, {"id": 755, "name": "沙头角", "regionName": "沙头角", "count": 118}, {"id": 8789, "name": "东部华侨城", "regionName": "东部华侨城", "count": 11}, {"id": 8796, "name": "盐田海鲜食街", "regionName": "盐田海鲜食街", "count": 22}, {"id": 15349, "name": "壹海城", "regionName": "壹海城", "count": 51}, {"id": 38055, "name": "溪涌", "regionName": "溪涌", "count": ""}],
        "32": [{"id": 32, "name": "所有", "regionName": "宝安区", "count": 6071}, {"id": 6587, "name": "西乡", "regionName": "西乡", "count": 15}, {"id": 6586, "name": "新安", "regionName": "新安", "count": 413}, {"id": 6585, "name": "石岩", "regionName": "石岩", "count": 466}, {"id": 752, "name": "宝安中心区", "regionName": "宝安中心区", "count": 458}, {"id": 4653, "name": "港隆城", "regionName": "港隆城", "count": 137}, {"id": 6588, "name": "沙井", "regionName": "沙井", "count": 824}, {"id": 6589, "name": "福永", "regionName": "福永", "count": 631}, {"id": 7684, "name": "松岗", "regionName": "松岗", "count": 435}, {"id": 7685, "name": "公明", "regionName": "公明", "count": 433}, {"id": 7719, "name": "海雅缤纷城", "regionName": "海雅缤纷城", "count": 125}, {"id": 7735, "name": "固戍", "regionName": "固戍", "count": 237}, {"id": 8006, "name": "桃源居", "regionName": "桃源居", "count": 25}, {"id": 14404, "name": "时代城", "regionName": "时代城", "count": 2}, {"id": 17088, "name": "罗田/燕川", "regionName": "罗田/燕川", "count": 45}, {"id": 17089, "name": "西田", "regionName": "西田", "count": 29}, {"id": 17091, "name": "圳美", "regionName": "圳美", "count": 32}, {"id": 17092, "name": "田寮/长圳", "regionName": "田寮/长圳", "count": 3}, {"id": 23524, "name": "沙井京基百纳", "regionName": "沙井京基百纳", "count": 98}, {"id": 27275, "name": "宝立方", "regionName": "宝立方", "count": 125}, {"id": 36634, "name": "宝安机场", "regionName": "宝安机场", "count": 244}, {"id": 37084, "name": "光明新区", "regionName": "光明新区", "count": 1}],
        "33": [{"id": 33, "name": "所有", "regionName": "龙岗区", "count": 5193}, {"id": 753, "name": "罗岗/求水山", "regionName": "罗岗/求水山", "count": 145}, {"id": 6600, "name": "五和/民营市场", "regionName": "五和/民营市场", "count": 250}, {"id": 6601, "name": "平湖", "regionName": "平湖", "count": 356}, {"id": 7656, "name": "横岗", "regionName": "横岗", "count": 568}, {"id": 7658, "name": "南澳", "regionName": "南澳", "count": 32}, {"id": 7663, "name": "南联", "regionName": "南联", "count": 311}, {"id": 7664, "name": "坪地", "regionName": "坪地", "count": 131}, {"id": 8472, "name": "大运", "regionName": "大运", "count": 186}, {"id": 9013, "name": "李朗聚星商城", "regionName": "李朗聚星商城", "count": 63}, {"id": 13335, "name": "较场尾/大鹏所城", "regionName": "较场尾/大鹏所城", "count": 152}, {"id": 13358, "name": "水头", "regionName": "水头", "count": 20}, {"id": 13359, "name": "东涌", "regionName": "东涌", "count": 2}, {"id": 13361, "name": "万科广场/世贸", "regionName": "万科广场/世贸", "count": 107}, {"id": 13412, "name": "华南城/奥特莱斯", "regionName": "华南城/奥特莱斯", "count": 191}, {"id": 18069, "name": "大芬/南岭", "regionName": "大芬/南岭", "count": 359}, {"id": 18228, "name": "双龙", "regionName": "双龙", "count": 316}, {"id": 19456, "name": "慢城/三联", "regionName": "慢城/三联", "count": 111}, {"id": 19457, "name": "布吉街/东站/天虹", "regionName": "布吉街/东站/天虹", "count": 404}, {"id": 26297, "name": "天虹/坂田/杨美", "regionName": "天虹/坂田/杨美", "count": 344}, {"id": 26298, "name": "岗头/万科/雪象", "regionName": "岗头/万科/雪象", "count": 199}, {"id": 35919, "name": "华为坂田基地", "regionName": "华为坂田基地", "count": 9}, {"id": 36519, "name": "杨梅坑/桔钓沙", "regionName": "杨梅坑/桔钓沙", "count": 39}, {"id": 36520, "name": "葵涌", "regionName": "葵涌", "count": 37}, {"id": 36530, "name": "官湖", "regionName": "官湖", "count": 9}, {"id": 36531, "name": "西涌", "regionName": "西涌", "count": 49}, {"id": 36636, "name": "坪山高铁站", "regionName": "坪山高铁站", "count": 41}, {"id": 37501, "name": "龙岗中心城", "regionName": "龙岗中心城", "count": 365}],
        "9553": [{"id": 9553, "name": "所有", "regionName": "龙华区", "count": 3080}, {"id": 1061, "name": "龙华", "regionName": "龙华", "count": 958}, {"id": 6584, "name": "民治", "regionName": "民治", "count": 164}, {"id": 7721, "name": "观澜", "regionName": "观澜", "count": 433}, {"id": 7722, "name": "大浪", "regionName": "大浪", "count": 398}, {"id": 9326, "name": "梅林关", "regionName": "梅林关", "count": 125}, {"id": 9327, "name": "锦绣江南", "regionName": "锦绣江南", "count": 33}, {"id": 36633, "name": "深圳北站", "regionName": "深圳北站", "count": 190}, {"id": 37723, "name": "龙华新区", "regionName": "龙华新区", "count": 14}],
        "23420": [{"id": 23420, "name": "所有", "regionName": "坪山区", "count": 393}, {"id": 6602, "name": "坪山", "regionName": "坪山", "count": 232}, {"id": 23429, "name": "坑梓/竹坑", "regionName": "坑梓/竹坑", "count": 128}, {"id": 9535, "name": "南澳大鹏新区", "regionName": "南澳大鹏新区", "count": 91}]
    }}
    area_list = []
    for data in a['areaObj'].values():
        for d in data[1:]:  # skip each district's first entry ("所有" = the whole district)
            area_list.append(d)
    l = 0
    old = time.time()
    for i in area_list:
        l += 1
        print('Scraping district %d:' % l, i['regionName'], 'total shops:', i['count'])
        try:
            crow_id(i['id'])
            now = time.time() - old
            print(i['name'], 'done!', 'elapsed: %d s' % now)
        except Exception as e:
            print(e)
6. Scraping the shop detail pages
The detail-page URLs can now be constructed, so all that remains is to request them. It is a plain GET request, but it must carry a complete cookie; with a bad cookie the captcha appears very quickly. One cookie is usually good for about 1,000 requests before a captcha shows up, though sometimes only a few hundred. requests' Session did not seem able to obtain a complete cookie, so this article uses selenium + Chrome: visit Meituan through a proxy IP, collect the cookies, then hand the cookie and IP back to requests. In practice, once the captcha appears you can keep scraping by changing only the IP while keeping the same cookie.
The code has two parts: the main program, and a get_cookie module that handles cookie/IP acquisition and also contains the detail-page parser. The cookie/IP function first fetches an IP (from a paid proxy service), then visits the Meituan Shenzhen homepage and sleeps a few seconds. This is crucial: the page must load completely or some cookies will be missing. It then visits the food page. Proxy IP quality varies widely, so test each IP before use: here, if loading the food page takes more than 3 seconds the IP is rejected and a new one is fetched; under 3 seconds it is accepted. Next the cookies are collected and checked for completeness, mainly because the __utma, __utmc, and __utmz entries sometimes go missing, and without them the captcha appears almost immediately; a complete cookie normally has 18 entries. The page-parsing function is simple as well: it returns a flag mark, used to tell whether the scrape succeeded, and the shop information info.
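The completeness check and cookie-to-header joining described above can be isolated into a small helper. This is a sketch of my own: which entries count as required here is my selection for illustration, and it joins all cookies generically rather than in the fixed order used by the module below:

```python
REQUIRED = {'__utma', '__utmc', '__utmz', 'uuid'}  # entries that must not be missing (my choice)

def cookies_to_header(cookies):
    """Turn selenium's get_cookies() list into a Cookie header string,
    or return None if key entries are missing."""
    names = {c['name'] for c in cookies}
    if not REQUIRED <= names:
        return None
    return '; '.join(c['name'] + '=' + c['value'] for c in cookies)
```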
The main function uses multiple threads and is fairly simple: get an IP and cookie, then start scraping. What matters is handling exceptions during the crawl. There are two main kinds. A timeout: sleep one second and retry once; if it still fails, mark this record as failed, and after three consecutive failures fetch a new IP and cookie. The other is the outright error "由于目标计算机积极拒绝,无法连接" (the target machine actively refused the connection): the requests were too frequent and the server has flagged us, so a new IP and cookie are needed immediately.
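The retry policy just described can be separated from the scraping itself. In this sketch, fetch and refresh are stand-ins for the article's parse() and get_cookie(); the mark codes match the convention used in the modules below (0 = ok, 1 = connection refused, 2 = timeout):

```python
import time

def crawl(urls, fetch, refresh):
    """Apply the article's retry policy: retry once after a timeout,
    refresh credentials on refusal or after three consecutive failures."""
    results, fails = [], 0
    refresh()  # obtain the initial ip/cookie
    for url in urls:
        mark, info = fetch(url)
        if mark == 2:            # timeout: wait a second and retry once
            time.sleep(1)
            mark, info = fetch(url)
        if mark == 1:            # actively refused: new ip/cookie, then retry
            refresh()
            mark, info = fetch(url)
        if mark == 0:
            results.append(info)
            fails = 0
        else:
            fails += 1
        if fails >= 3:           # three consecutive failures
            refresh()
            fails = 0
    return results
```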
The get_cookie module:
from selenium import webdriver
import requests
import time
import json
from lxml import etree

# Return a proxy ip and its matching cookie (as a header string); the ip is speed-tested first
def get_cookie():
    mark = 0
    while mark == 0:
        # API endpoint of the purchased proxy service
        p_url = 'XXXXXXXXXXXXX'
        r = requests.get(p_url)
        html = json.loads(r.text)
        a = html['data'][0]['ip']
        b = html['data'][0]['port']
        val = '--proxy-server=http://' + str(a) + ':' + str(b)
        val2 = 'https://' + str(a) + ':' + str(b)
        p = {'https': val2}
        print('Got IP:', p)
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument(val)
        driver = webdriver.Chrome(executable_path='C:\\Program Files (x86)\\Google\\Chrome\\Application\\chromedriver.exe', chrome_options=chrome_options)
        driver.set_page_load_timeout(8)  # page-load timeout
        driver.set_script_timeout(8)
        url = 'https://i.meituan.com/shenzhen/'  # Meituan Shenzhen homepage
        url2 = 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1'  # food page
        try:
            driver.get(url)
            time.sleep(2.5)  # let the page load fully, or cookies go missing
            c1 = driver.get_cookies()
            now = time.time()
            driver.get(url2)
            tt = time.time() - now
            print(tt)
            time.sleep(0.5)
            # speed test: reject the ip if the food page takes more than 3 s to load
            if tt < 3:
                c = driver.get_cookies()
                driver.quit()
                print('*******************')
                print(len(c1), len(c))
                # completeness check: a normal cookie has 18 entries
                if len(c) > 17:
                    mark = 1
                    x = {}
                    for line in c:
                        x[line['name']] = line['value']
                    # join the cookies into one header string (split in two because it is long)
                    co1 = '__mta=' + x['__mta'] + '; client-id=' + x['client-id'] + '; IJSESSIONID=' + x['IJSESSIONID'] + '; iuuid=' + x['iuuid'] + '; ci=30; cityname=%E6%B7%B1%E5%9C%B3; latlng=; webp=1; _lxsdk_cuid=' + x['_lxsdk_cuid'] + '; _lxsdk=' + x['_lxsdk']
                    co2 = '; __utma=' + x['__utma'] + '; __utmc=' + x['__utmc'] + '; __utmz=' + x['__utmz'] + '; __utmb=' + x['__utmb'] + '; i_extend=' + x['i_extend'] + '; uuid=' + x['uuid'] + '; _hc.v=' + x['_hc.v'] + '; _lxsdk_s=' + x['_lxsdk_s']
                    co = co1 + co2
                    print(co)
                    return (p, co)
                else:
                    print('Cookie incomplete, length:', len(c))
            else:
                print('Too slow, rejecting IP')
                driver.quit()
                time.sleep(3)
        except:
            driver.quit()

# Parse a shop detail page; return a flag mark and the shop info.
# u holds the url and shop category; pc holds the cookie and ip;
# m is the count scraped so far, n the thread number, ll the number left, ttt this thread's running time
def parse(u, pc, m, n, ll, ttt):
    mesg = 'Thread:' + str(n) + ' No:' + str(m) + ' Time:' + str(ttt) + ' left:' + str(ll)  # progress info for this thread
    url = u[0]
    cate = u[1]
    p = pc[0]
    cookie = pc[1]
    mark = 0  # 0 = ok; 1 and 2 mark the two kinds of failure
    head = {
        'Host': 'meishi.meituan.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Referer': 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36',
        'Cookie': cookie
    }
    info = []  # the shop information
    try:
        r = requests.get(url, headers=head, timeout=3, proxies=p)
        r.encoding = 'utf-8'
        html = etree.HTML(r.text)
        datas = html.xpath('body/script[@crossorigin="anonymous"]')
        for data in datas:
            try:
                # the data sits in the script tag whose text starts with window._appState
                if data.text[:16] == 'window._appState':
                    result = json.loads(data.text[19:-1])
                    poi = result['poiInfo']
                    opentime = poi['openInfo'].replace('\n', ' ')
                    info = [poi['name'], cate, poi['addr'], poi['phone'], poi['avgPrice'], opentime,
                            poi['avgScore'], poi['MarkNumbers'], poi['lng'], poi['lat']]
                    print(url)
                    print(mesg, info)
            except:
                pass
    except Exception as e:
        print('Error Thread:', n)  # report which thread failed
        print(e)
        if '目标计算机积极拒绝' in str(e):  # connection actively refused
            print('Connection refused by the server, thread', n)
            mark = 1  # type-1 error: a new ip is needed
        else:
            mark = 2  # type-2 error: retry once
    return (mark, info)  # flag plus shop info
The main module:
import csv
import time
import threading
from get_cookie import get_cookie
from get_cookie import parse

lock = threading.Lock()  # shared by all threads so csv writes never interleave

def crow(n, l):  # n: thread number; l: shared list of [url, category] pairs
    sym = 0  # counts consecutive failed scrapes
    pc = get_cookie()  # get an ip and cookie
    m = 0  # number scraped by this thread
    now = time.time()
    while True:
        if len(l) > 0:
            u = l.pop(0)
            ll = len(l)
            m += 1
            ttt = time.time() - now
            result = parse(u, pc, m, n, ll, ttt)
            mark = result[0]
            info = result[1]
            if mark == 2:  # timeout: wait, then retry once
                time.sleep(1.5)
                result = parse(u, pc, m, n, ll, ttt)
                mark = result[0]
                info = result[1]
                if mark != 0:
                    sym += 1
            if mark == 1:  # connection refused: new ip/cookie, then retry
                pc = get_cookie()
                result = parse(u, pc, m, n, ll, ttt)
                mark = result[0]
                info = result[1]
                if mark != 0:
                    sym += 1
            if mark == 0:  # success: write the record
                sym = 0
                lock.acquire()
                with open('meituan.csv', 'a', newline='', encoding='gb18030') as f:
                    write = csv.writer(f)
                    write.writerow(info)
                lock.release()
            if sym > 2:  # three consecutive failures: fetch a new ip and cookie
                sym = 0
                pc = get_cookie()
        else:
            print('&&&& thread %d finished' % n)
            break

if __name__ == '__main__':
    url_list = []
    with open('mt_id.csv', 'r', encoding='gb18030') as f:
        read = csv.reader(f)
        for line in read:
            url = 'https://meishi.meituan.com/i/poi/' + str(line[2]) + '?ct_poi=' + str(line[3])
            url_list.append([url, line[1]])
    th_list = []
    for i in range(1, 6):
        t = threading.Thread(target=crow, args=(i, url_list,))
        print('***** thread %d starting...' % i)
        t.start()
        th_list.append(t)
        time.sleep(30)  # stagger the threads so they do not all fetch cookies at once
    for t in th_list:
        t.join()
7. Results
With 5 threads the whole crawl finishes in about an hour. In the end 21,828 records were collected, with fewer than 50 lost.
My skills are limited, so corrections are welcome. And if anyone has a solution for the desktop site, please share it, thanks.