[Study Notes] Scrapy

I've just started learning Scrapy.

These are practice exercises following the book.

The task: crawl the book information from the site the book links to.

  

import scrapy


class FuzxSpider(scrapy.Spider):
    name = "Fuzx"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Each book on the listing page sits inside an <article class="product_pod">
        for book in response.css('article.product_pod'):
            name = book.xpath('./h3/a/@title').extract_first()
            price = book.css('p.price_color::text').extract_first()
            yield {
                'name': name,
                'price': price,
            }
        # Follow the "next" link until the last page is reached
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
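Before running the spider, I find it handy to try the selectors in scrapy shell first. A rough session sketch is below; the values in the comments are just what I'd expect from the first listing page, not verbatim shell output.

# Launch an interactive shell against the target page:
#   scrapy shell 'http://books.toscrape.com/'
# Inside the shell, `response` is already populated, so the spider's
# selectors can be tried one by one:
response.css('article.product_pod')                 # list of book blocks
book = response.css('article.product_pod')[0]
book.xpath('./h3/a/@title').extract_first()          # e.g. 'A Light in the Attic'
book.css('p.price_color::text').extract_first()      # e.g. '£51.77'
response.css('ul.pager li.next a::attr(href)').extract_first()  # relative link to page 2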

 

Command: scrapy crawl Fuzx -o fuzx.csv

This saves the scraped items to fuzx.csv.
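The export format is picked from the file extension, so swapping .csv for .json or .jl gives JSON or JSON-Lines output with the same command. If you'd rather bake the feed into the spider instead of typing -o each time, something like the sketch below should work (the FEEDS setting assumes a reasonably recent Scrapy, 2.1 or later):

class FuzxSpider(scrapy.Spider):
    name = "Fuzx"
    # Equivalent to passing `-o fuzx.csv` on the command line:
    # every yielded item is written to this feed when the spider runs.
    custom_settings = {
        'FEEDS': {
            'fuzx.csv': {'format': 'csv'},
        },
    }
    # ... start_urls and parse() as above ...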

  

Looks like the crawl succeeded.

To take it a bit further: people online say Douban is an easy site to practise crawling on.

I opened the fiction section of Douban Read: https://read.douban.com/kind/100

The goal is to crawl the book titles and authors.

  

 

Then I adapted the CSS selectors for the new page.

That came back with a 403 error.

So I added a User-Agent header.
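Attaching the headers to every Request by hand (as in the full code below) works, but the same User-Agent can also be set once for the whole project in settings.py; a minimal sketch, with the UA string simply copied from a real browser:

# settings.py
# Sent with every request the project makes, so individual Requests
# no longer need an explicit headers= argument just for the UA.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36')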

The second snag was that the class attribute here contains a space:

  

for book in response.css('li.item.store-item'):

Following other people's write-ups, the selector can be written as above, chaining both classes with dots.
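The point is that class="item store-item" means the element carries two classes, not one class with a space in it, so li.item.store-item matches an li that has both. A tiny self-contained check (the HTML snippet here is made up, not Douban's real markup):

from scrapy.selector import Selector

html = '''
<ul>
  <li class="item store-item">both classes</li>
  <li class="item">only one class</li>
</ul>
'''
sel = Selector(text=html)
# 'li.item.store-item' requires both classes on the same <li>
print(sel.css('li.item.store-item::text').extract())   # ['both classes']
# 'li.item' alone matches both entries
print(sel.css('li.item::text').extract())               # ['both classes', 'only one class']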

Full code:

import scrapy


class DoubSpider(scrapy.Spider):
    name = "doub"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0',
    }

    def start_requests(self):
        # Send the User-Agent explicitly, otherwise Douban answers with 403
        url = 'https://read.douban.com/kind/100'
        yield scrapy.Request(url, headers=self.headers)

    def parse(self, response):
        # Each book card is an <li class="item store-item">
        for book in response.css('li.item.store-item'):
            name = book.css('div.title a::text').extract_first()
            author = book.css('p span.labeled-text a.author-item::text').extract_first()
            yield {
                'name': name,
                'author': author,
            }
        # Keep the custom headers when following the pagination links
        next_url = response.css('div.pagination li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse)

Then run scrapy crawl doub -o db.csv.

When the crawl finishes, open db.csv to check the results.
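To spot-check the result without opening a spreadsheet, a few lines of plain Python will do (the name/author column names come from the dict keys the spider yields):

import csv

# Print the first few exported rows to confirm both columns were filled in.
with open('db.csv', newline='', encoding='utf-8') as f:
    for i, row in enumerate(csv.DictReader(f)):
        print(row['name'], '-', row['author'])
        if i >= 4:
            break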
