Goal: read names from a spreadsheet, crawl the personal information for each name, and fill it back into the spreadsheet.
Code: https://github.com/Candlend/prabook_crawler
The spreadsheet format:
At first I planned to use Python's csv library, but then I noticed two problems:
In the end I chose to read and write the Excel files directly. Three third-party modules are relevant here:
xlrd can only read .xls and .xlsx files. xlwt can only create and write .xls files; it cannot modify an existing Excel file. xlutils, working together with xlrd, can modify an Excel file (saving overwrites the file of the same name, and unmodified content is left untouched), but its drawback is that it cannot change cell styles. In the end I used xlrd and xlutils and did not use xlwt.
Meanwhile, the command-line arguments passed in have the following meanings:
```python
import xlrd
from xlutils.copy import copy
import sys
import crawler

# argv[1]: input workbook   argv[2]: save path   argv[3]/argv[4]: start/end row
# argv[5]: strict flag      argv[6]: lower limit of reliability
rdata = xlrd.open_workbook(sys.argv[1])
wdata = copy(rdata)
rtable = rdata.sheets()[0]
wtable = wdata.get_sheet(0)
nrows = rtable.nrows
start = int(sys.argv[3])
end = int(sys.argv[4]) + 1
if sys.argv[5] == "1":   # any non-empty string is truthy, so compare explicitly
    print("use strict")
    strict = 1
else:
    strict = 0
path = sys.argv[2]
limit = float(sys.argv[6])
print("range: %d~%d" % (start, end - 1))
print("save position: " + path)
print("the lower limit of reliability: ", limit)

for i in range(start, end):
    name = rtable.row_values(i)[2]
    celebrity = crawler.crawl(name, i, strict, limit)  # see the next section
```
For fetching pages I did not consider urllib. At first I meant to learn the Scrapy framework, but Scrapy turned out to be slower to pick up than requests, and after a first look at requests I already saw an approach that would reach my goal quickly, so I switched to the requests library.
Still, a problem appeared as soon as I made the request. The URL I needed to crawl was https://www.prabook.com/web/search.html#general=name, but what the crawler could actually reach was https://www.prabook.com/web/search.html: the part after the # was never sent.
In fact, the page's JavaScript reads that trailing part and then issues a separate JSON request to the backend; the actual search results come from: https://prabook.com/web/search.json?_dc=0&start=0&rows=10&general=name
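The reason the trailing part never reaches the server is that everything after the # is a URL fragment, which clients keep to themselves; the standard library makes the split easy to see:

```python
from urllib.parse import urlsplit

parts = urlsplit("https://www.prabook.com/web/search.html#general=name")
print(parts.path)      # /web/search.html -- the only part the server sees
print(parts.fragment)  # general=name -- kept client-side, read by the page's JS
```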
The query string after the question mark carries the advanced-search parameters, and these four are required. _dc is a millisecond timestamp (at first I mistook it for an anti-crawling random string); the meaning of the other parameters is self-evident.
```python
url = "https://prabook.com/web/search.json"
p = {'general': name, '_dc': int(time.time()), 'start': 0, 'rows': 5}
r1 = requests.get(url, params=p)
```
From this JSON response, my goal was to pull out part of the person's information directly, along with the URL of the main page holding the rest of it.
Along the way I also filtered the results with the Levenshtein module. The Levenshtein distance, i.e. edit distance, is the minimum number of edit operations needed to turn one string into the other. Install it with: pip install python-Levenshtein
I use Levenshtein.ratio(a, b) to score the similarity between the name in the spreadsheet and each search result, and treat that score as the result's reliability (this could be improved).
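As a sanity check on what ratio returns, here is a pure-Python sketch of its documented formula, (len(a) + len(b) - dist) / (len(a) + len(b)) with substitutions costing 2 (the crawler itself uses the C implementation from python-Levenshtein; the sample names here are made up):

```python
def ratio(a, b):
    # Weighted edit distance: insert/delete cost 1, substitute cost 2.
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            cost = 0 if a[i - 1] == b[j - 1] else 2
            d[j] = min(d[j] + 1, d[j - 1] + 1, prev + cost)
            prev = cur
    return (m + n - d[n]) / (m + n)

# With the crawler's normalization (lowercase, periods stripped):
print(ratio("john f kennedy", "john kennedy"))  # ≈ 0.923 -- likely the same person
print(ratio("john f kennedy", "mark twain"))    # ≈ 0.167 -- clearly not
```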
```python
for eachresult in r1.json()['result']:
    eachname = eachresult["fullName"].replace("<mark>", "").replace("</mark>", "")
    eachreliability = Levenshtein.ratio(eachname.lower().replace('.', ''),
                                        name.lower().replace('.', ''))
    if eachreliability > celebrity["reliability"]:
        celebrity["reliability"] = eachreliability
        j = eachresult
        celebrity['name'] = eachname
print('[%d] reliability: %.2f%%' % (number, celebrity["reliability"] * 100))
if celebrity["reliability"] < limit:
    print("[%d] The reliability is too low!" % number)
    if strict == 1:
        return celebrity
```
I worked with Chrome's developer tools; although the Elements tab is all I really know, it was enough for crawling this site.
As for selectors, I first picked up the basics from Scrapy's official tutorial, and I chose XPath over CSS selectors (simply because I never tried the latter).
On the main profile pages the personal information is in fact not formatted consistently, so extracting it is extremely tedious.
Here I used regular expressions, specifically the third-party regex module rather than Python's built-in re module, because re only accepts fixed-width patterns inside a lookbehind assertion. As for the various problems I hit while analyzing the page structure, please refer directly to the program code and the page source.
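To illustrate the limitation with the Married pattern used by the crawler below: the stdlib re rejects a lookbehind whose width is not fixed, while the drop-in regex module compiles it without complaint.

```python
import re

pattern = r'(?<=Married[^\d:;]+?, )[^.]*'  # variable-width lookbehind

try:
    re.compile(pattern)
    compiled_by_re = True
except re.error as err:
    compiled_by_re = False
    print("stdlib re refuses it:", err)  # look-behind requires fixed-width pattern

# The third-party module (pip install regex) accepts the same pattern:
#   import regex
#   regex.compile(pattern)  # no error
```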
But I ran into two encoding problems (even though I believe everything involved is UTF-8):
First, the em dash — would come through as the garbled character â.
Second, when previewing the information with print(json.dumps(celebrity, sort_keys=True, indent=4, separators=(',', ': '))), the original — would show up as the Unicode escape \u2014.
I am not yet very familiar with encodings, so I used a crude workaround that treats the symptom rather than the cause: Python's built-in replace method.
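If I understand the causes correctly, both symptoms have cleaner fixes than chains of replace calls: the \u2014 escape is just json.dumps defaulting to ASCII-safe output, and the â garbage is mojibake from decoding UTF-8 bytes with a one-byte codec, which a round-trip re-decode undoes (sample data below is made up):

```python
import json

celebrity = {"Personality": "witty \u2014 determined"}  # hypothetical sample

# Symptom 2: json.dumps escapes non-ASCII by default,
# but ensure_ascii=False keeps the em dash itself:
print(json.dumps(celebrity))                      # ..."witty \u2014 determined"...
print(json.dumps(celebrity, ensure_ascii=False))  # ..."witty — determined"...

# Symptom 1: UTF-8 bytes read back with Latin-1 produce 'â\x80\x94';
# reversing the wrong decode recovers the original character:
garbled = "\u2014".encode("utf-8").decode("latin-1")
print(garbled.encode("latin-1").decode("utf-8") == "\u2014")  # True
```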
In this crawler I made no particular effort to hide its identity, such as delaying requests, rotating proxies, or spoofing a browser, for the following reasons:
If the site notices it in real use and the server blocks it, I will add countermeasures then.
```python
from lxml import etree
import requests
import time
import json
import regex as re   # regex instead of re: variable-width lookbehind support
import Levenshtein


def crawl(name, number, strict, limit):
    print("[%d] Crawling..." % number)
    celebrity = {'Education': '', 'Personality': '', 'Background': '',
                 'Birthday': '', 'Birthplace': '', 'Foreign': '', 'Career': '',
                 'Connections-Married': '', 'father': '', 'mother': '',
                 'spouse': '', 'how many children': '', 'children': '',
                 'name': '', 'reliability': 0}
    url = "https://prabook.com/web/search.json"
    p = {'general': name, '_dc': int(time.time()), 'start': 0, 'rows': 5}
    r1 = requests.get(url, params=p)
    j = None
    for eachresult in r1.json()['result']:
        eachname = eachresult["fullName"].replace("<mark>", "").replace("</mark>", "")
        eachreliability = Levenshtein.ratio(eachname.lower().replace('.', ''),
                                            name.lower().replace('.', ''))
        if eachreliability > celebrity["reliability"]:
            celebrity["reliability"] = eachreliability
            j = eachresult
            celebrity['name'] = eachname
    print('[%d] reliability: %.2f%%' % (number, celebrity["reliability"] * 100))
    if celebrity["reliability"] < limit:
        print("[%d] The reliability is too low!" % number)
        if strict == 1:
            return celebrity
    if j is None:        # no search results at all
        return celebrity
    celebrity['Background'] = j["staticBackground"]
    celebrity['Birthday'] = "%d/%d/%d" % (j["birthYear"], j["birthMonth"], j["birthDay"])
    celebrity['Birthplace'] = j["birthPlace"]
    try:
        if j["nationalities"][0] == "American":
            celebrity['Foreign'] = 'N'
        else:
            celebrity['Foreign'] = 'Y'
    except IndexError:
        pass
    path = "https://prabook.com/web" + j["seoUrl"]
    r2 = requests.get(path)
    html = r2.text.encode("utf-8")
    tree = etree.HTML(html)
    links = tree.xpath('//article[@class="article__item"]')
    if len(links) == 0:
        return celebrity
    try:
        interests = tree.xpath('//p[@class="interest-list__element"]/text()')[0]
        celebrity["Personality"] = interests.replace("\r", "").replace("\n", "") \
                                            .replace("\t", "").replace("â", "—")
    except IndexError:
        pass  # no interest list on this page
    for eachlink in links:
        title1 = eachlink.xpath('h3[@class="article__title"]/text()')[0] \
                         .replace("\r", "").replace("\n", "").replace("\t", "")
        if title1 in ("Education", "Career", "Background"):
            text = eachlink.xpath('p[@class="article__text"]/text()')[0] \
                           .replace("\r", "").replace("\n", "").replace("\t", "") \
                           .replace("â", "—")
            if celebrity[title1] == '':
                celebrity[title1] = text
        elif title1 == "Connections":
            try:
                text = eachlink.xpath('p[@class="article__text"]/text()')[0] \
                               .replace("\r", "").replace("\n", "").replace("\t", "") \
                               .replace("â", "—")
                married = re.findall(r'(?<=Married[^\d:;]+?, )[^.]*(?=.)', text)
                if len(married) == 1:
                    celebrity["Connections-Married"] = married[0]
            except IndexError:
                pass  # no "Married ..." sentence on this page
            links2 = eachlink.xpath('dl[@class="def-list"]')
            for eachlink2 in links2:
                try:
                    title2 = eachlink2.xpath('dt[@class="def-list__title"]/text()')[0] \
                                      .replace("\r", "").replace("\n", "").replace("\t", "")
                except IndexError:
                    continue  # definition list without a title
                if title2 in ("father:", "mother:", "spouse:", "children:", "spouses:"):
                    text = eachlink2.xpath('dd[@class="def-list__text"]/text()')[0] \
                                    .replace("\r", "").replace("\n", "").replace("\t", "") \
                                    .replace("â", "—").strip()
                    key = "spouse" if title2 == "spouses:" else title2[:-1]
                    if celebrity[key] == "":
                        celebrity[key] = text
                    else:
                        celebrity[key] += "; " + text
                    if title2 == "children:":
                        if celebrity["how many children"] == "":
                            celebrity["how many children"] = 1
                        else:
                            celebrity["how many children"] += 1
        else:
            continue
    print("[%d] Get!" % number)
    return celebrity


if __name__ == '__main__':
    name = input('name: ')
    strict = int(input('strict: '))     # 1 for strict mode, 0 otherwise
    limit = float(input('limit: '))     # lower limit of reliability, e.g. 0.8
    celebrity = crawl(name, 0, strict, limit)
    print(json.dumps(celebrity, sort_keys=True, indent=4, separators=(',', ': ')))
```
There is not much to explain for this step; the code is as follows:
```python
wtable.write(i, 6, celebrity['Background'])
wtable.write(i, 7, celebrity['Birthday'])
wtable.write(i, 8, celebrity['Birthplace'])
wtable.write(i, 9, celebrity['Foreign'])
wtable.write(i, 10, celebrity['Education'])
wtable.write(i, 11, celebrity['Career'])
wtable.write(i, 12, celebrity['Personality'])
wtable.write(i, 13, celebrity['Connections-Married'])
wtable.write(i, 14, celebrity['father'])
wtable.write(i, 15, celebrity['mother'])
wtable.write(i, 16, celebrity['spouse'])
wtable.write(i, 17, celebrity['how many children'])
wtable.write(i, 18, celebrity['children'])
wtable.write(i, 35, "%.2f%%" % (celebrity["reliability"] * 100))
wtable.write(i, 36, name)
wtable.write(i, 37, celebrity['name'])
wdata.save(path)
```
Since web crawlers are mostly I/O-bound code, and the site I am crawling is hosted overseas and extremely slow to reach, I used the thread pool from multiprocessing.dummy rather than the process pool that suits CPU-bound code (despite the common refrain that Python's threads are of little use because of the GIL; for I/O-bound work the GIL is released while waiting on the network, so threads still overlap well).
Yet many crawler examples online still use multiprocessing's process pool; I have not looked into why.
```python
from multiprocessing.dummy import Pool

def process(i):
    pass  # crawl row i and write the results back

pool = Pool(processes=4)
pool.map(process, range(start, end))
```
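To see why the thread pool helps here, a self-contained toy (fake_io is a stand-in I made up for a slow network request): eight 0.2-second waits finish in roughly two batches across four threads, because a sleeping thread releases the GIL.

```python
import time
from multiprocessing.dummy import Pool  # same Pool API, backed by threads

def fake_io(i):
    time.sleep(0.2)  # stands in for a slow network round-trip
    return i * i

t0 = time.time()
with Pool(processes=4) as pool:
    results = pool.map(fake_io, range(8))
print(results)                       # [0, 1, 4, 9, 16, 25, 36, 49]
print("%.1fs" % (time.time() - t0))  # roughly 0.4s, not the serial 1.6s
```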
At first I tried to use PyInstaller on Linux to package a Windows .exe, which of course failed: an executable has to be built on its target platform. It seems Wine can work around this, but I did not try it.
So I moved the code to Windows and installed the various dependencies, running into plenty of problems, chiefly: Microsoft Visual C++ 14.0 is required. Even after installing Microsoft Visual C++ 14.0 as prompted, problems remained (I took no screenshot, so it is hard to describe); Googling revealed that a certain file had to be moved, which finally solved it.
```
Error loading Python DLL 'C:\Users\Candlend\Desktop\prabook_crawler\build\test\python36.dll'
LoadLibrary: The specified module could not be found
```
After moving C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\python36.dll to that location, the packaged test program hello_world.exe ran normally, yet the prabook_crawler executable still failed, now with No module named '_socket'. My Python installation is not missing that module, and I do not know the cause. In the end, building a single-file executable with pyinstaller -F produced no problems at all.
This was my first attempt at writing a web crawler. In terms of results the goal was achieved, but the program still has many shortcomings, and this post surely contains some inaccurate statements. For now I have taken a "good enough" attitude toward many things without studying them in depth. I am recording the problems I encountered partly in the hope of getting pointers that answer my questions, and partly so that later on I can study them systematically alongside new problems, and then polish this post.
The execution result is shown below: