爬虫系列(十三) 用selenium爬取京东商品

时间 2019-11-18

标签爬虫系列十三 selenium 京东商品栏目网络爬虫繁體版

原文原文链接

这篇文章，咱们将经过 selenium 模拟用户使用浏览器的行为，爬取京东商品信息，仍是先放上最终的效果图：html

一、网页分析

（1）初步分析

本来博主打算写一个可以爬取全部商品信息的爬虫，但是在分析过程当中发现，不一样商品的网页结构居然是不同的python

因此，后来就放弃了这个想法，转为只爬取笔记本类型商品的信息web

若是须要爬取其它类型的商品信息，只需把提取数据的规则改变一下就好，有兴趣的朋友能够本身试试看呀chrome

好了，下面咱们正式开始！json

首先，用 Chrome 浏览器打开笔记本商品首页，咱们很容易发现该网页是一个 动态加载 的网页浏览器

由于刚打开网页时只会显示 30 个商品的信息，但是当咱们向下拖动网页时，它会再次加载剩下 30 个商品的信息网络

这时候咱们能够经过 selenium 模拟浏览器下拉网页的过程，获取网站所有商品的信息less

>>> browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")

（2）模拟翻页

另外，咱们发现该网站一共有 100 个网页ide

咱们能够经过构造 URL 来获取每个网页的内容，可是这里咱们仍是选择使用 selenium 模拟浏览器的翻页行为学习

下拉网页至底部能够发现有一个 下一页 的按钮，咱们只需获取并点击该元素便可实现翻页

>>> browser.find_element_by_xpath('//a[@class="pn-next" and @onclick]').click()

（3）获取数据

接下来，咱们须要解析每个网页来获取咱们须要的数据，具体包括（可使用 selenium 选择元素）：

商品 ID：browser.find_elements_by_xpath('//li[@data-sku]') ，用于构造连接地址
商品价格：browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[2]/strong/i')
商品名称：browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[3]/a/em')
评论人数：browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[4]/strong')

二、编码实现

好了，分析过程很简单，基本思路是使用 selenium 模拟浏览器的行为，下面是代码实现

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import selenium.common.exceptions
import json
import csv
import time

class JdSpider():
    def open_file(self):
        self.fm = input('请输入文件保存格式（txt、json、csv）：')
        while self.fm!='txt' and self.fm!='json' and self.fm!='csv':
            self.fm = input('输入错误，请从新输入文件保存格式（txt、json、csv）：')
        if self.fm=='txt' :
            self.fd = open('Jd.txt','w',encoding='utf-8')
        elif self.fm=='json' :
            self.fd = open('Jd.json','w',encoding='utf-8')
        elif self.fm=='csv' :
            self.fd = open('Jd.csv','w',encoding='utf-8',newline='')

    def open_browser(self):
        self.browser = webdriver.Chrome()
        self.browser.implicitly_wait(10)
        self.wait = WebDriverWait(self.browser,10)

    def init_variable(self):
        self.data = zip()
        self.isLast = False

    def parse_page(self):
        try:
            skus = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//li[@class="gl-item"]')))
            skus = [item.get_attribute('data-sku') for item in skus]
            links = ['https://item.jd.com/{sku}.html'.format(sku=item) for item in skus]
            prices = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="gl-i-wrap"]/div[2]/strong/i')))
            prices = [item.text for item in prices]
            names = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="gl-i-wrap"]/div[3]/a/em')))
            names = [item.text for item in names]
            comments = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="gl-i-wrap"]/div[4]/strong')))
            comments = [item.text for item in comments]
            self.data = zip(links,prices,names,comments)
        except selenium.common.exceptions.TimeoutException:
            print('parse_page: TimeoutException')
            self.parse_page()
        except selenium.common.exceptions.StaleElementReferenceException:
            print('parse_page: StaleElementReferenceException')
            self.browser.refresh()

    def turn_page(self):
        try:
            self.wait.until(EC.element_to_be_clickable((By.XPATH,'//a[@class="pn-next"]'))).click()
            time.sleep(1)
            self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
            time.sleep(2)
        except selenium.common.exceptions.NoSuchElementException:
            self.isLast = True
        except selenium.common.exceptions.TimeoutException:
            print('turn_page: TimeoutException')
            self.turn_page()
        except selenium.common.exceptions.StaleElementReferenceException:
            print('turn_page: StaleElementReferenceException')
            self.browser.refresh()

    def write_to_file(self):
        if self.fm == 'txt':
            for item in self.data:
                self.fd.write('----------------------------------------\n')
                self.fd.write('link：' + str(item[0]) + '\n')
                self.fd.write('price：' + str(item[1]) + '\n')
                self.fd.write('name：' + str(item[2]) + '\n')
                self.fd.write('comment：' + str(item[3]) + '\n')
        if self.fm == 'json':
            temp = ('link','price','name','comment')
            for item in self.data:
                json.dump(dict(zip(temp,item)),self.fd,ensure_ascii=False)
        if self.fm == 'csv':
            writer = csv.writer(self.fd)
            for item in self.data:
                writer.writerow(item)

    def close_file(self):
        self.fd.close()

    def close_browser(self):
        self.browser.quit()

    def crawl(self):
        self.open_file()
        self.open_browser()
        self.init_variable()
        print('开始爬取')
        self.browser.get('https://search.jd.com/Search?keyword=%E7%AC%94%E8%AE%B0%E6%9C%AC&enc=utf-8')
        time.sleep(1)
        self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        time.sleep(2)
        count = 0
        while not self.isLast:
            count += 1
            print('正在爬取第 ' + str(count) + ' 页......')
            self.parse_page()
            self.write_to_file()
            self.turn_page()
        self.close_file()
        self.close_browser()
        print('结束爬取')

if __name__ == '__main__':
    spider = JdSpider()
    spider.crawl()

代码中有几个须要注意的地方，如今记录下来便于之后学习：

一、self.fd = open('Jd.csv','w',encoding='utf-8',newline='')

在打开 csv 文件时，最好加上参数 newline='' ，不然咱们写入的文件会出现空行，不利于后续的数据处理

二、self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")

在模拟浏览器向下拖动网页时，因为数据更新不及时，因此常常出现 StaleElementReferenceException 异常

通常来讲有两种处理方法：

在完成操做后使用 time.sleep() 给浏览器充足的加载时间
捕获该异常进行相应的处理

三、skus = [item.get_attribute('data-sku') for item in skus]

在 selenium 中使用 xpath 语法选取元素时，没法直接获取节点的属性值，而须要使用 get_attribute() 方法

四、无头启动浏览器能够加快爬取速度，只需在启动浏览器时设置无头参数便可

opt = webdriver.chrome.options.Options()
opt.set_headless()
browser = webdriver.Chrome(chrome_options=opt)

【爬虫系列相关文章】