Environment: Windows 7 x64, Python 3.7.1, PyCharm
1.1 On Linux, install with: pip install scrapy
1.2 On Windows:
1. Create a new project, choosing Python. I named the project demo here; once created, it is an empty project.
2. Click the Terminal tab at the bottom of PyCharm, as shown below:
In the terminal, run scrapy startproject demo to create the Scrapy project. On success you will see the following directory structure:
The purpose of each file is roughly as follows:
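For reference, the layout produced by scrapy startproject demo typically looks like this (the standard Scrapy project template):

```
demo/
    scrapy.cfg            # deployment configuration
    demo/                 # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spiders live
            __init__.py
```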
3.1 In the terminal, run: cd demo (I type demo here because my project is named demo)
3.2 In the terminal, run: scrapy genspider books books.toscrape.com (the form is: scrapy genspider <spider-name> <start-url>)
5.1 Analyze the page http://books.toscrape.com/.
From the figure above we can see that every book sits in an li tag under div/ol. Here we only print the book titles, so we can extract the data as shown below.
5.2 Part of the code in books.py is as follows:
```python
def parse(self, response):
    '''
    Parse the response and extract the data.
    :param response: the crawled response object
    :return:
    '''
    book_list = response.xpath('/html/body/div/div/div/div/section/div[2]/ol/li')
    for book in book_list:
        print(book.xpath('./article/div[1]/a/img/@alt').extract())
```
5.3 Configure settings.py as follows:
```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0'  # UA header
ROBOTSTXT_OBEY = False  # If True, Scrapy obeys robots.txt and most of this data cannot be crawled, so set it to False
LOG_LEVEL = 'ERROR'  # log level
```
5.4 Run the crawl command in the terminal: scrapy crawl books
```
# The printed output is as follows
['A Light in the Attic']
['Tipping the Velvet']
['Soumission']
['Sharp Objects']
['Sapiens: A Brief History of Humankind']
['The Requiem Red']
['The Dirty Little Secrets of Getting Your Dream Job']
['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull']
['The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics']
['The Black Maria']
['Starving Hearts (Triangular Trade Trilogy, #1)']
["Shakespeare's Sonnets"]
['Set Me Free']
["Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"]
['Rip it Up and Start Again']
['Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991']
['Olio']
['Mesaerion: The Best Science Fiction Stories 1800-1849']
['Libertarianism for Beginners']
["It's Only the Himalayas"]
```
As you can see, only the first page was crawled. Next, let's crawl the titles of all the books.
The final books.py looks like this:
```python
# -*- coding: utf-8 -*-
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'  # unique identifier of the spider
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']  # crawl start points; there can be more than one
    url = 'http://books.toscrape.com/catalogue/page-%d.html'  # URL template used to build new URLs
    page_num = 2

    def parse(self, response):
        '''
        Parse the response and extract the data.
        :param response: the crawled response object
        :return:
        '''
        print(f'Current page: {self.page_num}')  # print the current page number
        book_list = response.xpath('/html/body/div/div/div/div/section/div[2]/ol/li')
        for book in book_list:
            print(book.xpath('./article/div[1]/a/img/@alt').extract())
        if self.page_num <= 50:  # 50 pages in total (<= so that page 50 is also fetched)
            new_url = self.url % self.page_num  # build the next URL from the template
            self.page_num += 1  # advance the page counter
            yield scrapy.Request(url=new_url, callback=self.parse)  # send the request manually
```
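As a quick standalone check, the url template above expands to the page URLs like this:

```python
url = 'http://books.toscrape.com/catalogue/page-%d.html'  # same template as in the spider
urls = [url % n for n in range(2, 51)]  # pages 2 through 50 (page 1 comes from start_urls)
print(urls[0])   # http://books.toscrape.com/catalogue/page-2.html
print(urls[-1])  # http://books.toscrape.com/catalogue/page-50.html
```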
Run the command in the terminal to get the book titles: scrapy crawl books
If all goes well, the tail of the printed output will look like this:
1. Extract data from the page (using XPath or CSS selectors).
2. Extract links from the page and generate download requests for the linked pages.