Python Web Crawler: Hands-on Learning Notes

Target site: https://www.prabook.com

Goal: read names from a spreadsheet, crawl the biographical information corresponding to each name, and write it back into the spreadsheet.

Code: https://github.com/Candlend/prabook_crawler

Reading the spreadsheet to get the names

The spreadsheet could come in one of these formats:

  1. A tab-separated TXT file
  2. A comma-separated CSV file
  3. An Excel workbook

At first I planned to use Python's built-in csv module, but then I noticed two problems:

  1. The spreadsheet I need to read is already an xlsx file; converting it to CSV would work but is pointless.
  2. The biographical information itself contains commas, which I worried might cause trouble in a comma-delimited CSV file (not verified).

In the end I chose to read and write the Excel file directly. There are three third-party modules for this:

  1. xlrd
  2. xlwt
  3. xlutils

xlrd can only read xls and xlsx files. xlwt can only create and write new xls files and cannot modify an existing workbook. xlutils, used together with xlrd, can modify an existing workbook (saving overwrites the file under the same name, leaving untouched cells unchanged), but its drawback is that it cannot modify styles. In the end I used xlrd and xlutils and did not use xlwt.

The command-line arguments are, in order (an example invocation follows the list):

  1. the Excel file to read
  2. the Excel file to write (it may overwrite the original, but it must be in the old xls format, otherwise an error occurs)
  3. the starting row number of the names to read
  4. the ending row number of the names to read
  5. 1/0, whether strict mode is on (in strict mode, a person whose confidence falls below the threshold is not written to the spreadsheet)
  6. the confidence threshold
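
For reference, a hypothetical invocation might look like this (the script name main.py and the file names are my assumptions, not taken from the repository):

python main.py input.xlsx output.xls 1 100 1 0.6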

 


import xlrd
from xlutils.copy import copy
import sys
import crawler

# open the source workbook read-only and make a writable copy of it
rdata = xlrd.open_workbook(sys.argv[1])
wdata = copy(rdata)
rtable = rdata.sheets()[0]
wtable = wdata.get_sheet(0)
nrows = rtable.nrows

start = int(sys.argv[3])
end = int(sys.argv[4]) + 1
if sys.argv[5] == '1':      # argv values are strings and '0' is truthy, so compare explicitly
    print("use strict")
    strict = 1
else:
    strict = 0
path = sys.argv[2]
limit = float(sys.argv[6])
print("range: %d~%d" % (start, end - 1))
print('save position: ' + path)
print("the lower limit of reliability: ", limit)

for i in range(start, end):
    name = rtable.row_values(i)[2]                     # the name sits in column 2
    celebrity = crawler.crawl(name, i, strict, limit)  # see the next section

Given a name, return the person's information

Making the request

For this crawl I did not consider urllib. At first I planned to learn the scrapy framework, but scrapy turned out to be slower to pick up than requests, and after a first look at requests I already had a workable approach that would get me to my goal quickly, so I switched to the requests library.

However, a problem appeared when making the request. The URL I wanted to crawl is https://www.prabook.com/web/search.html#general=name, but the crawler can only reach https://www.prabook.com/web/search.html; everything after the # is a URL fragment, which is never sent to the server.

In fact, the part after the # is read by the page's JavaScript, which then sends a separate JSON request to the backend. The actual search results come from: https://prabook.com/web/search.json?_dc=0&start=0&rows=10&general=name

The query string carries the advanced-search parameters, and these four are required. _dc is a millisecond timestamp (at first I mistook it for an anti-crawling random string); the meaning of the other parameters is self-evident.

 


url= "https://prabook.com/web/search.json"

p={'general':name,'_dc':int(time.time()),'start':0,'rows':5}

r1=requests.get(url,params=p)

From this JSON response, my aim is to extract part of the person's information directly, along with the URL of the main page that holds the rest.

Along the way I also used the Levenshtein module for filtering. The Levenshtein distance, also known as edit distance, is the minimum number of edit operations needed to turn one string into the other. Install it with: pip install python-Levenshtein

I use Levenshtein.ratio(a, b) to measure the similarity between the name in the spreadsheet and each search result, and treat that similarity as the confidence of the result (room for improvement).
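
A minimal sketch of what that ratio looks like (the sample names below are made up):

import Levenshtein

# normalize the two names the same way the crawler does, then compare
a = "Franklin D. Roosevelt".lower().replace('.', '')
b = "Franklin Delano Roosevelt".lower().replace('.', '')
print(Levenshtein.ratio(a, b))   # a value between 0 and 1; closer to 1 means more similar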

 


# excerpt from crawl() (full listing below): keep the search result whose name best matches
for eachresult in r1.json()['result']:
    eachname = eachresult["fullName"].replace("<mark>", "").replace("</mark>", "")
    eachreliability = Levenshtein.ratio(eachname.lower().replace('.', ''),
                                        name.lower().replace('.', ''))
    if eachreliability > celebrity["reliability"]:
        celebrity["reliability"] = eachreliability
        j = eachresult
        celebrity['name'] = eachname
print('[%d] reliability: %.2f%%' % (number, celebrity["reliability"] * 100))
if celebrity["reliability"] < limit:
    print("[%d] The reliability is too low!" % number)
    if strict == 1:
        return celebrity

Analyzing the page structure

I used Chrome's developer tools. Although the Elements tab is the only part I know, it was enough for this site.

As for selectors, I first picked up the basics from the official scrapy tutorial. I chose XPath rather than CSS selectors (simply because I never tried the latter).

On the main page that holds the information, the biographical details are in fact not laid out in a consistent format, so extracting them is quite troublesome.

Here I used regular expressions (the third-party regex module rather than Python's built-in re module, because re only supports fixed-width lookbehind assertions). As for the various problems I ran into while analyzing the page structure, please look at the program code and the page source directly.
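
A quick illustration of that difference, using the same pattern as in the code below (the sample sentence is made up):

import re
import regex

text = "Married Anna Maria, daughter of a merchant. Father of three children."
pattern = r'(?<=Married[^\d:;]+?, )[^.]*(?=.)'

try:
    re.findall(pattern, text)
except re.error as err:
    print("re refuses the pattern:", err)   # look-behind requires fixed-width pattern

print(regex.findall(pattern, text))         # regex accepts the variable-length lookbehind and finds the match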

But I ran into two encoding problems (even though I believe both sides are UTF-8):

  1. When pulling text from the site, the original em dash (—) came back as the garbled character â
  2. When I previewed the data with print(json.dumps(celebrity, sort_keys=True, indent=4, separators=(',', ': '))), the em dash came out as the escape sequence \u2014

I am not yet very familiar with encodings, so I fell back on a crude stopgap that treats the symptom rather than the cause: Python's built-in replace method. A possibly cleaner approach is sketched below.
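
A sketch of what I suspect the proper fix is, assuming the pages really are UTF-8 (untested against this site): requests guesses the response encoding from the headers and can fall back to ISO-8859-1, which turns the UTF-8 bytes of the em dash into mojibake such as â, and json.dumps escapes all non-ASCII characters unless ensure_ascii=False is passed.

import json
import requests

# force the decoding instead of letting requests guess the charset
r = requests.get("https://www.prabook.com/web/search.html")
r.encoding = 'utf-8'
text = r.text                      # em dashes now survive as real characters

# keep non-ASCII readable instead of \u2014 escapes when previewing
sample = {'Personality': 'Interests — reading'}
print(json.dumps(sample, sort_keys=True, indent=4,
                 separators=(',', ': '), ensure_ascii=False))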

Hiding the crawler

For this crawler I made no particular effort to hide its identity, such as adding delays between requests, rotating proxies, or faking a browser, for the following reasons:

  1. Access is slow enough as it is
  2. Proxy servers are not necessarily of high quality
  3. The amount of data to crawl is small

If the site notices the crawler and the server starts blocking it in real use, I will add countermeasures then (a sketch of the usual ones follows).
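
If that ever becomes necessary, here is a minimal sketch of two such countermeasures (a delay and a faked browser User-Agent) on top of requests; the User-Agent string, the one-second delay, and the helper name polite_get are arbitrary choices of mine, not part of the existing code:

import time
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # pretend to be a browser

def polite_get(url, params=None):
    time.sleep(1)                                  # crude rate limiting between requests
    return requests.get(url, params=params, headers=headers)

r = polite_get("https://prabook.com/web/search.json",
               params={'general': 'name', '_dc': int(time.time()), 'start': 0, 'rows': 5})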

Main crawler code

 


from lxml import etree
import requests
import time
import json
import regex as re          # the third-party regex module, needed for variable-length lookbehind
import Levenshtein


def crawl(name, number, strict, limit):
    print("[%d] Crawling..." % number)
    celebrity = {'Education': '', 'Personality': '', 'Background': '', 'Birthday': '',
                 'Birthplace': '', 'Foreign': '', 'Career': '', 'Connections-Married': '',
                 'father': '', 'mother': '', 'spouse': '', 'how many children': '',
                 'children': '', 'name': '', 'reliability': 0}

    # query the search endpoint and keep the result whose name best matches the input
    url = "https://prabook.com/web/search.json"
    p = {'general': name, '_dc': int(time.time()), 'start': 0, 'rows': 5}
    r1 = requests.get(url, params=p)
    j = None
    for eachresult in r1.json()['result']:
        eachname = eachresult["fullName"].replace("<mark>", "").replace("</mark>", "")
        eachreliability = Levenshtein.ratio(eachname.lower().replace('.', ''),
                                            name.lower().replace('.', ''))
        if eachreliability > celebrity["reliability"]:
            celebrity["reliability"] = eachreliability
            j = eachresult
            celebrity['name'] = eachname
    print('[%d] reliability: %.2f%%' % (number, celebrity["reliability"] * 100))
    if celebrity["reliability"] < limit:
        print("[%d] The reliability is too low!" % number)
        if strict == 1:
            return celebrity
    if j is None:                       # no search result at all
        return celebrity

    # basic fields come straight from the search result
    celebrity['Background'] = j["staticBackground"]
    celebrity['Birthday'] = "%d/%d/%d" % (j["birthYear"], j["birthMonth"], j["birthDay"])
    celebrity['Birthplace'] = j["birthPlace"]
    try:
        if j["nationalities"][0] == "American":
            celebrity['Foreign'] = 'N'
        else:
            celebrity['Foreign'] = 'Y'
    except IndexError:
        pass

    # the rest is scraped from the person's detail page
    path = "https://prabook.com/web" + j["seoUrl"]
    r2 = requests.get(path)
    html = r2.text.encode("utf-8")
    tree = etree.HTML(html)
    links = tree.xpath('//article[@class="article__item"]')
    if len(links) == 0:
        return celebrity                # no article blocks on this page
    try:
        interests = tree.xpath('//p[@class="interest-list__element"]/text()')[0]
        celebrity["Personality"] = interests.replace("\r", "").replace("\n", "").replace("\t", "").replace("â", "—")
    except IndexError:
        pass                            # no interest list
    for eachlink in links:
        title1 = eachlink.xpath('h3[@class="article__title"]/text()')[0].replace("\r", "").replace("\n", "").replace("\t", "")
        if title1 in ("Education", "Career", "Background"):
            text = eachlink.xpath('p[@class="article__text"]/text()')[0].replace("\r", "").replace("\n", "").replace("\t", "").replace("â", "—")
            if celebrity[title1] == '':
                celebrity[title1] = text
        elif title1 == "Connections":
            try:
                text = eachlink.xpath('p[@class="article__text"]/text()')[0].replace("\r", "").replace("\n", "").replace("\t", "").replace("â", "—")
                married = re.findall(r'(?<=Married[^\d:;]+?, )[^.]*(?=.)', text)
                if len(married) == 1:
                    celebrity["Connections-Married"] = married[0]
            except IndexError:
                pass                    # no marriage paragraph on this page
            links2 = eachlink.xpath('dl[@class="def-list"]')
            for eachlink2 in links2:
                try:
                    title2 = eachlink2.xpath('dt[@class="def-list__title"]/text()')[0].replace("\r", "").replace("\n", "").replace("\t", "")
                except IndexError:
                    continue            # definition list without a title
                if title2 in ("father:", "mother:", "spouse:", "children:", "spouses:"):
                    text = eachlink2.xpath('dd[@class="def-list__text"]/text()')[0].replace("\r", "").replace("\n", "").replace("\t", "").replace("â", "—").strip()
                    if title2 == "spouses:":
                        if celebrity["spouse"] == "":
                            celebrity["spouse"] = text
                        else:
                            celebrity["spouse"] += "; " + text
                    else:
                        if celebrity[title2[:-1]] == "":
                            celebrity[title2[:-1]] = text
                        else:
                            celebrity[title2[:-1]] += "; " + text
                        if title2 == "children:":
                            if celebrity["how many children"] == "":
                                celebrity["how many children"] = 1
                            else:
                                celebrity["how many children"] += 1

    print("[%d] Get!" % number)
    return celebrity


if __name__ == '__main__':
    name = input('name: ')
    strict = int(input('strict: '))     # input() returns strings, so convert explicitly
    limit = float(input('limit: '))
    celebrity = crawl(name, 0, strict, limit)
    print(json.dumps(celebrity, sort_keys=True, indent=4, separators=(',', ': ')))

Writing the information back into the spreadsheet

There is not much to explain about this step; the code is as follows:

 


# write each field of the result into its column of row i
wtable.write(i, 6, celebrity['Background'])
wtable.write(i, 7, celebrity['Birthday'])
wtable.write(i, 8, celebrity['Birthplace'])
wtable.write(i, 9, celebrity['Foreign'])
wtable.write(i, 10, celebrity['Education'])
wtable.write(i, 11, celebrity['Career'])
wtable.write(i, 12, celebrity['Personality'])
wtable.write(i, 13, celebrity['Connections-Married'])
wtable.write(i, 14, celebrity['father'])
wtable.write(i, 15, celebrity['mother'])
wtable.write(i, 16, celebrity['spouse'])
wtable.write(i, 17, celebrity['how many children'])
wtable.write(i, 18, celebrity['children'])
wtable.write(i, 35, "%.2f%%" % (celebrity["reliability"] * 100))
wtable.write(i, 36, name)
wtable.write(i, 37, celebrity['name'])
wdata.save(path)

Speeding up the crawl with multithreading

Since web crawling is mostly I/O-bound code, and the site I am crawling is a foreign one that responds very slowly, I chose the thread pool from multiprocessing.dummy rather than the process pool that suits CPU-bound code (even though Python's multithreading is often said to be of limited use).

That said, many crawler examples online still use multiprocessing's process pool; I have not looked into why.

 


from multiprocessing.dummy import Pool   # a thread pool with the multiprocessing Pool API

def process(i):
    pass        # per-row work goes here: read the name in row i, crawl it, write the result back

pool = Pool(processes=4)
pool.map(process, range(start, end))

Packaging the program into an executable

At first I tried to use pyinstaller on Linux to build a Windows exe. That of course did not work; the executable has to be built on the target platform. It seems this can be worked around with wine, but I have not tried it.

So I moved the code to Windows and installed the various dependencies, running into quite a few problems, mainly:

  1. Installing the Levenshtein library failed with "Microsoft Visual C++ 14.0 is required"; after installing it as instructed, the problem persisted (no screenshot, hard to describe). Googling revealed that a certain file had to be moved, which finally solved it.
  2. Packaging directly with pyinstaller produced an error:

 


Error loading Python DLL 'C:\Users\Candlend\Desktop\prabook_crawler\build\test\python36.dll'

LoadLibrary: The specified module could not be found

After moving C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\python36.dll to that location, the packaged hello_world.exe test ran fine, but the prabook_crawler executable still failed, this time with No module named '_socket'. My Python installation is not missing that module, and I do not know the cause. In the end I built a single-file executable with pyinstaller -F, and that worked without problems.
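
For reference, the single-file build command is simply the following (main.py stands in for the actual entry script, which I am not certain of):

pyinstaller -F main.py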

Summary

This was my first attempt at writing a web crawler. In terms of results, the goal was achieved, but the program still has many shortcomings, and this post certainly contains some inaccurate statements as well. For now I have taken a rather superficial approach to many things and have not dug deeper. I am recording the problems I ran into partly in the hope of getting pointers and answers to my questions, and partly so that later on I can study them systematically together with new problems and then improve this post.

A sample run looks like this:

(screenshot: result)