Saving scraped Douban Books TOP250 data with the csv library

Date: 2019-11-09

Tags: csv, saving, Douban Books, top250, data

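(1) The content to scrape is the information on the Douban Books TOP250 list.

(2) The TOP250 list spans 10 pages. Browsing them by hand shows the pattern in the URLs:

    https://book.douban.com/top250?start=0

    https://book.douban.com/top250?start=25

    ......

(3) The fields to scrape are: book title, book URL, author, publisher and publication date, price, rating, and comment.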
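The URL pattern in (2) can be sketched on its own before any request is sent. This is only a small illustration: the names `BASE_URL` and `page_urls` are made up here, and the step of 25 is taken from the two example URLs above.

```python
# Build the 10 page URLs: start=0, 25, 50, ..., 225
BASE_URL = 'https://book.douban.com/top250?start={}'
page_urls = [BASE_URL.format(start) for start in range(0, 250, 25)]

print(len(page_urls))   # 10
print(page_urls[0])     # https://book.douban.com/top250?start=0
print(page_urls[-1])    # https://book.douban.com/top250?start=225
```

Each page lists 25 books, so 10 pages cover all 250 entries.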

The complete code is as follows:

import csv
from lxml import etree
import requests


# Create the CSV file (overwriting any old one); newline='' prevents blank rows on Windows
fp = open('C:\\Users\\weigengqiu\\Desktop\\booktop250.csv', 'w+', newline='', encoding='utf-8')
writer = csv.writer(fp)
# Write the header row
writer.writerow(('bookname', 'link', 'author', 'publisher', 'date', 'price', 'rate', 'comment'))
# List comprehension: one URL per page (start=0, 25, ..., 225)
urls = ['http://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
for url in urls:
    # requests.get() takes a single URL, not a list, so fetch the pages one at a time
    html = requests.get(url, headers=headers)
    # etree.HTML parses the HTML text into an Element object
    detail = etree.HTML(html.text)
    # Each book sits in its own <tr class="item">; loop over these rows
    infos = detail.xpath('//tr[@class="item"]')
    for info in infos:
        # Node paths were found with the browser's inspector
        booknames = info.xpath('td[2]/div[1]/a/@title')
        # The <a> node carries several values; read them with @attribute
        links = info.xpath('td[2]/div[1]/a/@href')
        # <p> is a sibling of <div>; its text holds several fields plus many \n, so take the first text node
        book_infos = info.xpath('td[2]/p/text()')[0]
        # The fields are separated by "/"; the first one is the author
        authors = book_infos.split('/')[0]
        # Index [2] fails for books without a translator (it returns the date), so count from the end
        publishers = book_infos.split('/')[-3]
        # Publication date, counted from the end
        dates = book_infos.split('/')[-2]
        # Price, counted from the end
        prices = book_infos.split('/')[-1]
        # div[2] contains several <span> nodes; span[2] holds the rating text
        rates = info.xpath('td[2]/div[2]/span[2]/text()')
        # The second <p> under the same parent holds the comment text
        comments = info.xpath('td[2]/p[2]/span/text()')
        # xpath() returns lists, so booknames, links, rates and comments are written as lists
        writer.writerow((booknames, links, authors, publishers, dates, prices, rates, comments))

# Close the CSV file
fp.close()
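The reason for indexing the publisher from the end is easier to see on two sample strings. This is a rough sketch: `parse_book_info` is a hypothetical helper, and both strings are made-up placeholders in the "author / translator / publisher / date / price" layout described in the comments, not real scraped data.

```python
def parse_book_info(book_info):
    """Split an 'author / [translator /] publisher / date / price' string."""
    parts = [part.strip() for part in book_info.split('/')]
    return {
        'author': parts[0],       # always the first field
        'publisher': parts[-3],   # third from the end, translator or not
        'date': parts[-2],
        'price': parts[-1],
    }


# Made-up sample strings (assumptions, not scraped data)
with_translator = 'Author A / Translator B / Publisher C / 2006-5 / 29.00'
without_translator = 'Author D / Publisher E / 2012-8 / 20.00'

print(parse_book_info(with_translator)['publisher'])     # Publisher C
print(parse_book_info(without_translator)['publisher'])  # Publisher E
# Indexing from the front picks up the date when there is no translator:
print(without_translator.split('/')[2].strip())          # 2012-8
```

A string with fewer than three fields would still raise IndexError at `parts[-3]`, so this only covers the two layouts shown.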

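Notes:

1. Scraping a node may raise: IndexError: list index out of range

This error usually has one of two causes. First, the index in list[index] is beyond the end of the list. Second, the list is empty, with no elements at all, so even list[0] raises the error.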
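One possible way to guard against the empty-list case is to check the list before indexing it. The helper below is a minimal sketch; `first` and its `default` parameter are names invented for this example, and it assumes an empty string is an acceptable placeholder for a missing field.

```python
def first(values, default=''):
    """Return the first item of a list, or `default` when the list is empty."""
    return values[0] if values else default


print(first(['9.1']))             # 9.1
print(first([], default='N/A'))   # N/A

# Without the check, an empty list raises the error from the note above
try:
    [][0]
except IndexError as err:
    print(err)                    # list index out of range
```

Since xpath() returns an empty list when no node matches, a result could be wrapped as `first(info.xpath(...))` instead of indexing it with `[0]` directly.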