Python Web Scraping: Crawling the Douban Books Top 250

Date: 2019-11-08


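Douban is fairly friendly to beginner scrapers: as long as you keep the crawl rate low, you need not worry about having your IP banned. Even so, do not crawl too frequently.

Topics covered: requests, HTML, XPath, csv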

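1. Preparation

Install the requests and lxml libraries (for example, pip install requests lxml). The csv module ships with Python's standard library and needs no installation.

Target: https://book.douban.com/top250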

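2. Analyzing the page source

Open the URL, press F12, locate the book title in the Elements panel, right-click it, and choose Copy ==> Copy XPath from the context menu.

Taking the title "追风筝的人" (The Kite Runner) as an example, the copied XPath is: //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a

Note that an XPath copied from the browser is only a starting point: browsers often insert extra tbody tags into the tables they display, and those tags are not in the HTML that requests downloads. We need to delete that step by hand, which gives //*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a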
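To see why the tbody step matters, here is a minimal sketch. It uses a hand-written, well-formed snippet (an assumption, not Douban's real markup) and the standard library's xml.etree.ElementTree as a stand-in for lxml, so it runs without extra installs. ElementTree supports only a subset of XPath, but that subset covers this path.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the downloaded markup: <table> has no <tbody> child.
SNIPPET = """
<div id="content">
  <table>
    <tr>
      <td>cover</td>
      <td><div><a>追风筝的人</a></div></td>
    </tr>
  </table>
</div>
"""

root = ET.fromstring(SNIPPET)

# Path as copied from the browser, including the <tbody> step.
copied = root.findall("./table[1]/tbody/tr/td[2]/div[1]/a")
# Same path with the <tbody> step removed.
cleaned = root.findall("./table[1]/tr/td[2]/div[1]/a")

print(len(copied))                # no match, because the markup has no <tbody>
print([a.text for a in cleaned])  # the title link is found
```

The same reasoning applies to the lxml calls in this post: a step that names an element missing from the downloaded HTML makes the whole expression return an empty list.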

In the same way, get the XPaths for the book's rating, number of reviewers, and brief info. The results are:

//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[2]/span[2]

//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[2]/span[3]

//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/p[1]

Initial code

import requests
from lxml import etree

html = requests.get('https://book.douban.com/top250').text
res = etree.HTML(html)
# To get the title text, append /text() to the XPath expression.
# res.xpath returns a list that holds a single element here, so index it with [0].
name = res.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a/text()')[0].strip()
score = res.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[2]/span[2]/text()')[0].strip()
comment = res.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[2]/span[3]/text()')[0].strip()
info = res.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/p[1]/text()')[0].strip()
print(name,score,comment,info)

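Running it prints the title, rating, number of reviewers, and brief info of the first book.

That covers only the first book, so let's look at the second and third as well.

Copying their XPaths in the same way and plugging them in gives: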

import requests
from lxml import etree

html = requests.get('https://book.douban.com/top250').text
res = etree.HTML(html)
name1 = res.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a/text()')[0].strip()
name2 = res.xpath('//*[@id="content"]/div/div[1]/div/table[2]/tr/td[2]/div[1]/a/text()')[0].strip()
name3 = res.xpath('//*[@id="content"]/div/div[1]/div/table[3]/tr/td[2]/div[1]/a/text()')[0].strip()
print(name1,name2,name3)

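Running it prints the titles of the first three books.

Comparing the three XPaths, only the table index differs, so we can drop the index and get one XPath that matches every title on the page: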

import requests
from lxml import etree

html = requests.get('https://book.douban.com/top250').text
res = etree.HTML(html)
names = res.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[1]/a/text()')
for name in names:
    print(name.strip())

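Running it prints every title on the page, one per line. The output is too long to reproduce in full.

The rating, number of reviewers, and brief info can be collected the same way.

Putting these patterns together, here is code that retrieves the book information on the first page: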

import requests
from lxml import etree

html = requests.get('https://book.douban.com/top250').text
res = etree.HTML(html)
trs = res.xpath('//*[@id="content"]/div/div[1]/div/table/tr')
for tr in trs:
    name = tr.xpath('./td[2]/div[1]/a/text()')[0].strip()
    score = tr.xpath('./td[2]/div[2]/span[2]/text()')[0].strip()
    comment = tr.xpath('./td[2]/div[2]/span[3]/text()')[0].strip()
    info = tr.xpath('./td[2]/p[1]/text()')[0].strip()
    print(name,score,comment,info)

Running it prints one line per book. The output is long, so it is not reproduced in full.
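The [0] index in the loop above raises IndexError if any row lacks one of the elements. Whether that ever happens on this page is not stated here, but as a precaution a small helper can return a default instead. The name first_text and the sample lists below are hypothetical.

```python
def first_text(nodes, default=""):
    """Return the first XPath text result, stripped, or `default` if the list is empty."""
    return nodes[0].strip() if nodes else default


# Hypothetical results of tr.xpath(...) calls for one row.
print(first_text(["\n    追风筝的人\n  "]))  # surrounding whitespace is removed
print(first_text([], default="N/A"))       # empty result falls back to the default
```

In the loop it would be used as name = first_text(tr.xpath('./td[2]/div[1]/a/text()')).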

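The code above only retrieves the first page. Next, we collect the links for all pages and loop over them.

3. Getting all page URLs

Inspecting the pagination links in the page source shows that each page holds 25 books and that its URL carries a start parameter that is a multiple of 25 (0, 25, ..., 225).

In code: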

for i in range(10):
    url = 'https://book.douban.com/top250?start={}'.format(i * 25)
    print(url)

Running it prints the ten page URLs, from start=0 to start=225, as expected.
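Since the advice at the top of this post is to keep the crawl rate low, the page loop can pause between requests. The sketch below is one way to do that, under stated assumptions: the two-second delay, the ten-second timeout, and the browser-like User-Agent string are illustrative choices, and the names page_urls, crawl, and fetch_with_requests are hypothetical. Some sites reject the default User-Agent that requests sends; this post does not say whether Douban does.

```python
import time

# Assumption: a browser-like User-Agent; the exact string is illustrative only.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def fetch_with_requests(url):
    """Download one page with requests (imported lazily, only when actually used)."""
    import requests
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()  # fail loudly on an error status instead of parsing an error page
    return resp.text


def page_urls(pages=10, page_size=25):
    """Build the start=0, 25, ..., 225 URLs produced by the loop above."""
    base = "https://book.douban.com/top250?start={}"
    return [base.format(i * page_size) for i in range(pages)]


def crawl(urls, fetch=fetch_with_requests, delay=2.0):
    """Fetch each URL in turn, sleeping `delay` seconds between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the first request
            time.sleep(delay)
        pages.append(fetch(url))
    return pages
```

Passing fetch as a parameter keeps the pacing logic separate from the download, so it can be tried with a fake function before hitting the real site, for example crawl(page_urls()[:2], fetch=len, delay=0).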

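With the analysis done, we now have every page URL and know how to extract the data from each page. The last step is to tidy up and optimize the code as a whole.

Complete code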

#-*- coding:utf-8 -*-
"""
-------------------------------------------------
   File Name：     DoubanBookTop250
   Author :        zww
   Date：          2019/5/13
   Change Activity:2019/5/13
-------------------------------------------------
"""
import requests
from lxml import etree

# Build the URL of each page
def getUrl():
    for i in range(10):
        url = 'https://book.douban.com/top250?start={}'.format(i*25)
        urlData(url)
# Extract the data from one page
def urlData(url):
    html = requests.get(url).text
    res = etree.HTML(html)
    trs = res.xpath('//*[@id="content"]/div/div[1]/div/table/tr')
    for tr in trs:
        name = tr.xpath('./td[2]/div/a/text()')[0].strip()
        score = tr.xpath('./td[2]/div/span[2]/text()')[0].strip()
        comment = tr.xpath('./td[2]/div/span[3]/text()')[0].replace('(','').replace(')','').strip()
        info = tr.xpath('./td[2]/p[1]/text()')[0].strip()
        print("《{}》--{} points--{}--{}".format(name,score,comment,info))

if __name__ == '__main__':
    getUrl()

Result: all 250 books are retrieved, with none missing. The output is long, so it is not reproduced in full.
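The comment field above is cleaned by stripping the parentheses, which leaves a string that still mixes digits and text. If a plain integer is wanted for sorting or arithmetic, a regular expression can pull out the digits. This is a sketch: parse_count is a hypothetical helper, and the sample string is an assumed shape for the raw value, not copied from the page.

```python
import re


def parse_count(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Assumed shape of the raw span text, including the surrounding whitespace.
sample = "(\n                    123456人评价\n                )"
print(parse_count(sample))  # 123456
```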

Saving the scraped data to a CSV file

import csv

def write_to_file(content):
    # 'a' opens the file in append mode; the 'utf_8_sig' encoding keeps the exported CSV from being garbled
    with open('DoubanBookTop250.csv','a',encoding='utf_8_sig',newline='') as f:
        fieldnames = ['name','score','comment','info']
        # Use csv.DictWriter to write dict-shaped rows to the CSV file
        w = csv.DictWriter(f,fieldnames=fieldnames)
        w.writerow(content)

Complete code

#-*- coding:utf-8 -*-
"""
-------------------------------------------------
   File Name：     DoubanBookTop250
   Author :        zww
   Date：          2019/5/13
   Change Activity:2019/5/13
-------------------------------------------------
"""
import csv
import requests
from lxml import etree

# Build the URL of each page and save its rows
def getUrl():
    for i in range(10):
        url = 'https://book.douban.com/top250?start={}'.format(i*25)
        for item in urlData(url):
            write_to_file(item)
        print('Saved page {} of the Douban Books Top 250!'.format(i+1))

# Save one row to the CSV file
def write_to_file(content):
    # 'a' opens the file in append mode; the 'utf_8_sig' encoding keeps the exported CSV from being garbled
    with open('DoubanBookTop250.csv','a',encoding='utf_8_sig',newline='') as f:
        fieldnames = ['name','score','comment','info']
        # Use csv.DictWriter to write dict-shaped rows to the CSV file
        w = csv.DictWriter(f,fieldnames=fieldnames)
        w.writerow(content)

# Extract the data from one page, yielding one dict per book
def urlData(url):
    html = requests.get(url).text
    res = etree.HTML(html)
    trs = res.xpath('//*[@id="content"]/div/div[1]/div/table/tr')
    for tr in trs:
        yield {
            'name': tr.xpath('./td[2]/div/a/text()')[0].strip(),
            'score': tr.xpath('./td[2]/div/span[2]/text()')[0].strip(),
            'comment': tr.xpath('./td[2]/div/span[3]/text()')[0].replace('(','').replace(')','').strip(),
            'info': tr.xpath('./td[2]/p[1]/text()')[0].strip(),
        }

if __name__ == '__main__':
    getUrl()

The resulting CSV file is too long to show in full.
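The write_to_file function above reopens the file for every row, never writes a header line, and, because it appends, adds duplicate rows if the script is run twice. One possible alternative, sketched below, writes the header once and all rows in a single pass. Note the change of technique: it opens the file in 'w' mode, which overwrites any previous file, instead of the original 'a' mode. The name write_rows is hypothetical.

```python
import csv

FIELDNAMES = ["name", "score", "comment", "info"]


def write_rows(path, rows):
    """Write a header line once, then every row, keeping the file open throughout."""
    # 'w' overwrites an existing file, so re-running the script does not duplicate rows.
    with open(path, "w", encoding="utf_8_sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()     # column names as the first line
        writer.writerows(rows)   # accepts any iterable of dicts, including a generator
```

In getUrl, the rows from all pages could be collected into a list and passed to write_rows once at the end.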