Selenium + PhantomJS

Date: 2019-11-09



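This post should have come before the one on the Scrapy framework; it is added here as a supplement.

Problem: how do you scrape page data that is loaded dynamically?

There are two options:

1. Selenium

2. PhantomJS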

Selenium

Overview: a third-party library that drives a browser so that it performs operations automatically.

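Environment setup

Install: pip install selenium
Get the browser driver. Download: http://chromedriver.storage.googleapis.com/index.html

Note: for a table mapping browser versions to driver versions, see http://www.javashuo.com/article/p-wlhinbws-eo.html

If your Chrome version is recent, choose a correspondingly recent driver version; in most cases that works without problems.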

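Locating elements

# Use the methods below to find the target element(s), then operate on them.
# find_element_* returns a single node; find_elements_* returns a list.
find_element_by_id            find a node by id
find_elements_by_name         find by name
find_elements_by_xpath        find by XPath
find_elements_by_tag_name     find by tag name
find_elements_by_class_name   find by class name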
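
The find_element_by_* helpers listed above belong to the Selenium 3 API. They were deprecated in Selenium 4.0 and removed in later 4.x releases, where the generic find_element(by, value) and find_elements(by, value) calls take their place. The sketch below shows that form. The helper name find_by and its short kind keys are hypothetical, chosen for illustration; the strategy strings are the values behind Selenium's By.ID, By.NAME and so on, written as literals so the snippet does not need Selenium at import time.

```python
# Strategy strings as used by Selenium's By class (By.ID == "id", etc.).
# The short keys on the left are hypothetical, for illustration only.
STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "tag": "tag name",
    "class": "class name",
}


def find_by(driver, kind, value, many=False):
    """Locate element(s) through the generic find_element(s) calls.

    `driver` is any WebDriver-like object; `kind` is one of the keys
    in STRATEGIES.
    """
    strategy = STRATEGIES[kind]
    if many:
        # find_elements returns a (possibly empty) list
        return driver.find_elements(strategy, value)
    # find_element returns one element, or raises if nothing matches
    return driver.find_element(strategy, value)
```

With a live driver, find_by(bro, 'id', 'kw') plays the role of bro.find_element_by_id('kw').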

Example

# Workflow:
from selenium import webdriver
from time import sleep

# Create a browser object; executable_path is the path to the driver binary
bro = webdriver.Chrome(executable_path='./chromedriver')
# get() makes the browser request the given URL
bro.get('https://www.baidu.com')
sleep(1)
# Have Baidu search for a given term
text = bro.find_element_by_id('kw')  # locate the search text box
text.send_keys('人民币')  # send_keys types the given text into the box
sleep(1)
button = bro.find_element_by_id('su')  # locate the search button
button.click()  # click performs a mouse click
sleep(3)
bro.quit()  # close the browser
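
The fixed sleep() calls in the example either waste time or turn out too short when the network is slow. One alternative is to poll for a condition with a timeout. The helper below is a minimal sketch of that idea; wait_until and its parameters are hypothetical names and not part of Selenium (Selenium ships its own WebDriverWait class for the same purpose).

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)
```

For instance, wait_until(lambda: bro.find_elements_by_id('kw')) would return as soon as the search box exists, because find_elements_* gives back an empty list while nothing matches.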

Executing JavaScript

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
# Scroll to the bottom of the page
browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
# Pop up an alert dialog after the scroll has been issued
browser.execute_script('alert("To Bottom")')
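
execute_script can also hand a value back to Python when the script uses return. That makes it possible to keep scrolling until the page stops growing, rather than scrolling a fixed number of times. The following is a minimal sketch, assuming the page appends content whenever the bottom is reached; scroll_to_end, pause and max_rounds are hypothetical names.

```python
import time


def scroll_to_end(driver, pause=1.0, max_rounds=20):
    """Scroll down until document.body.scrollHeight stops increasing.

    Returns the number of scrolls after which the page had grown.
    """
    get_height = "return document.body.scrollHeight"
    scroll = "window.scrollTo(0, document.body.scrollHeight)"
    last = driver.execute_script(get_height)
    grown = 0
    for _ in range(max_rounds):
        driver.execute_script(scroll)
        time.sleep(pause)  # give the page time to load more data
        height = driver.execute_script(get_height)
        if height == last:
            break  # nothing new was appended
        last = height
        grown += 1
    return grown
```

The max_rounds cap keeps the loop from running forever on pages that never stop loading.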

PhantomJS

PhantomJS is a headless browser (it has no visible window); its workflow is the same as for Chrome above.

from selenium import webdriver

bro = webdriver.PhantomJS(executable_path='/Users/bobo/Desktop/路飞爬虫授课/动态数据加载爬取/phantomjs-2.1.1-macosx/bin/phantomjs')

# Load the page in the browser
bro.get('https://www.baidu.com')

# Take a screenshot
bro.save_screenshot('./1.png')

text = bro.find_element_by_id('kw')  # locate the search text box
text.send_keys('人民币')  # send_keys types the given text into the box

# Take a second screenshot showing the typed text
bro.save_screenshot('./2.png')

bro.quit()
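
PhantomJS development has been suspended since 2018, and Selenium deprecated its PhantomJS driver and later removed it in Selenium 4. Where PhantomJS is not available, Chrome's own headless mode is a common substitute. The sketch below assumes a chromedriver matching the installed Chrome can be found (for example on PATH); make_headless_chrome and HEADLESS_ARGS are hypothetical names.

```python
# Command-line switches passed to Chrome; --headless runs it without a window.
HEADLESS_ARGS = ["--headless", "--window-size=1280,1024"]


def make_headless_chrome(args=HEADLESS_ARGS):
    """Create a Chrome driver that runs without a visible window.

    Selenium is imported lazily so that importing this snippet does not
    require Selenium or a browser to be installed.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in args:
        options.add_argument(arg)
    # Assumes a matching chromedriver can be located (e.g. on PATH)
    return webdriver.Chrome(options=options)
```

The returned object is used like the PhantomJS driver above: bro = make_headless_chrome(), then bro.get(...), bro.save_screenshot(...) and bro.quit().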

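Combined example

Use Selenium + PhantomJS to scrape data that a page loads dynamically.

Requirement: fetch the details of the additional movies that Douban Movies loads dynamically as the page scrolls.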

from selenium import webdriver
from time import sleep

bro = webdriver.PhantomJS(executable_path='/Users/bobo/Desktop/路飞爬虫授课/动态数据加载爬取/phantomjs-2.1.1-macosx/bin/phantomjs')
url = 'https://movie.douban.com/typerank?type_name=%E5%96%9C%E5%89%A7&type=24&interval_id=100:90&action='
bro.get(url)
sleep(1)
# Take a screenshot
bro.save_screenshot('./1.png')
# JS snippet that scrolls the page down to the bottom
js = 'window.scrollTo(0,document.body.scrollHeight)'
# Have the browser object execute the JS
bro.execute_script(js)
sleep(1)
# Take a screenshot
bro.save_screenshot('./2.png')

# Scroll once more and wait for the next batch to load
bro.execute_script(js)
sleep(1)
bro.save_screenshot('./3.png')
# page_source holds the browser's current page data, including what was loaded after scrolling
page_text = bro.page_source

print(page_text)
bro.quit()  # shut down the PhantomJS process
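
Printing page_text dumps the entire HTML. To pull out specific fields, the string can be fed to a parser. The sketch below uses the standard library's html.parser to collect the text inside elements that carry a given class. TextByClass and extract_texts are hypothetical names, and the class name in the usage note is an assumption about the page markup that should be checked against the real page.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextByClass(HTMLParser):
    """Collect the text found inside elements carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.class_name in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())


def extract_texts(page_text, class_name):
    """Return the stripped text chunks inside elements with `class_name`."""
    parser = TextByClass(class_name)
    parser.feed(page_text)
    return parser.texts
```

Assuming the movie titles sit inside elements with a class such as movie-name-text, extract_texts(page_text, 'movie-name-text') would return them as a list of strings.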