Just getting started with Scrapy.
Working through the exercises from the book.
The task: scrape the book information from the site linked in the book.
import scrapy

class FuzxSpider(scrapy.Spider):
    name = "Fuzx"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # each book on the listing page sits in an <article class="product_pod">
        for book in response.css('article.product_pod'):
            name = book.xpath('./h3/a/@title').extract_first()
            price = book.css('p.price_color::text').extract_first()
            yield {
                'name': name,
                'price': price,
            }
        # follow the "next" pagination link, if there is one
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
Command: scrapy crawl Fuzx -o fuzx.csv
This saves the output to fuzx.csv.
Looks like the crawl succeeded.
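As an aside, -o picks the export format from the file extension (csv, json, jsonl, xml). The same export can instead be configured once in the project's settings.py via the FEEDS setting; a minimal sketch, assuming Scrapy 2.1 or newer:

# settings.py
FEEDS = {
    'fuzx.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
    },
}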
To take things a step further: people online say Douban is an easy site to practice crawlers on.
I opened the fiction page of Douban Read: https://read.douban.com/kind/100
The goal is to scrape the book titles and authors.
Then I changed the CSS selectors to match the new page.
I got a 403 error.
So I added a User-Agent header.
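Instead of attaching headers to every Request, the user agent can also be set project-wide in settings.py; a minimal sketch (any reasonably modern browser UA string should do):

# settings.py
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36')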
The second place I hit a problem was the space in the class attribute here:
for book in response.css('li.item.store-item'):
Following someone else's article, the chained form above does the trick.
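It works because class="item store-item" is two separate classes on the same element, and a CSS selector joins them with dots and no space. For comparison, a rough XPath equivalent (a sketch, assuming the page keeps this markup):

# selects <li> elements carrying both classes
books = response.css('li.item.store-item')
# rough XPath version; contains() does substring matching, so it is less strict
books = response.xpath('//li[contains(@class, "store-item")]')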
Full code:
import scrapy

class DoubSpider(scrapy.Spider):
    name = "doub"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0',
    }

    def start_requests(self):
        # send the User-Agent explicitly to avoid the 403
        url = 'https://read.douban.com/kind/100'
        yield scrapy.Request(url, headers=self.headers)

    def parse(self, response):
        # "item store-item" is two classes, hence the chained li.item.store-item
        for book in response.css('li.item.store-item'):
            name = book.css('div.title a::text').extract_first()
            author = book.css('p span.labeled-text a.author-item::text').extract_first()
            yield {
                'name': name,
                'author': author,
            }
        next_url = response.css('div.pagination li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse)
Then: scrapy crawl doub -o db.csv
After the crawl finished, I checked db.csv.
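A quick way to sanity-check the export from Python; a minimal sketch, assuming db.csv sits in the working directory with the name and author columns the spider yields:

import csv

with open('db.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        print(row['name'], row['author'])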