The First Love of Python Crawlers: selenium

Date: 2019-11-16

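selenium is a web application testing tool that can genuinely simulate a person operating a browser.
Scraping data with her is intuitive and flexible. Unlike a traditional crawler,
she really does open a browser, fill in forms, click buttons, simulate logins, and collect data; she can do it all. You don't have to reverse-engineer asynchronous requests, because the browser runs the page's JavaScript itself: what you see is what you get.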

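On the language side, selenium supports java and python; on the browser side, it supports the major mainstream browsers such as Chrome, Firefox, and IE. I chose the python3.6 + chrome combination.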

chrome

Before writing the python crawler, you need to prepare two things:

1. chrome (browser): https://www.google.cn/chrome/
2. chromedriver (browser driver): http://chromedriver.storage.googleapis.com/index.html

The pairing between browser and driver versions is fairly strict: different browser versions need different driver versions. My version info:

chrome info: chrome=66.0.3359.139
 Driver info: chromedriver=2.37.544315

Other version pairings:

chromedriver version	Chrome version
v2.37	v64-66
v2.36	v63-65
v2.34	v61-63
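The pairing rule can be checked in code before launching anything. This is a minimal sketch that only encodes the three rows of the table above; the names SUPPORTED_CHROME and is_compatible are made up for illustration, and any driver version not listed is reported as unknown rather than guessed.

```python
# chromedriver version -> (lowest, highest) supported Chrome major version,
# copied from the three rows of the table above and nothing else.
SUPPORTED_CHROME = {
    "2.37": (64, 66),
    "2.36": (63, 65),
    "2.34": (61, 63),
}


def is_compatible(driver_version, chrome_version):
    """Return True/False for a listed driver, or None if the driver is not in the table."""
    # "2.37.544315" -> "2.37"
    driver_key = ".".join(driver_version.split(".")[:2])
    if driver_key not in SUPPORTED_CHROME:
        return None
    # "66.0.3359.139" -> 66
    chrome_major = int(chrome_version.split(".")[0])
    lowest, highest = SUPPORTED_CHROME[driver_key]
    return lowest <= chrome_major <= highest


# The versions quoted in this article
print(is_compatible("2.37.544315", "66.0.3359.139"))
```

Returning None for unlisted drivers keeps the helper honest: it never claims compatibility that the table does not state.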

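The chrome browser
One thing to watch out for: if you need to switch to a matching Chrome version, simply upgrade in place when you need a higher version. When you need a lower version, uninstall thoroughly! Thoroughly! Thoroughly! That includes the Google updater, registry entries, leftover files, and so on. Only then reinstall. Otherwise the crawler will not be able to launch the browser.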

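The chromedriver browser driver
Where you put chromedriver also matters. The most convenient approach is to put chromedriver next to the .py file we are about to write. You can put it elsewhere, but then you have to configure the chromedriver path, which I won't cover here.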
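For readers who want to check where the driver will be picked up from, here is a small sketch using only the standard library. The helper name find_chromedriver is hypothetical; it looks next to a given directory first (the placement recommended above) and then falls back to the system PATH.

```python
import shutil
import sys
from pathlib import Path

# The executable carries an .exe suffix on Windows.
DRIVER_NAME = "chromedriver.exe" if sys.platform.startswith("win") else "chromedriver"


def find_chromedriver(script_dir):
    """Return the path to chromedriver, or None if it cannot be found."""
    local_copy = Path(script_dir) / DRIVER_NAME
    if local_copy.is_file():
        # Preferred: the driver sits beside the script.
        return str(local_copy)
    # Fallback: search the directories listed in PATH.
    return shutil.which(DRIVER_NAME)


print(find_chromedriver("."))
```

If the driver lives somewhere else entirely, the returned path is what you would hand to selenium: Selenium 3 accepts it as webdriver.Chrome(executable_path=...), while Selenium 4 wraps it in a Service object.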

Firefox driver download: https://github.com/mozilla/ge...

python

Finally, time to write some code.

Opening a website

from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://segmentfault.com/")

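Three lines of code are enough to launch Chrome, enter the url, and press Enter, all automatically.
The window now shows the notice "Chrome is being controlled by automated test software" below the address bar.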

Submitting a form

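Next, let's drive the browser to type a keyword and search for this very article.
First open the segmentfault site and press F12 to inspect the search box element: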

<input id="searchBox" name="q" type="text" placeholder="搜索问题或关键字" class="form-control" value="">

It turns out to be an input tag whose id is searchBox. OK:

from selenium import webdriver
browser = webdriver.Chrome()   # open the browser
browser.get("https://segmentfault.com/")   # enter the url

searchBox = browser.find_element_by_id("searchBox")  # get the form element by id
searchBox.send_keys("python爬虫之初恋 selenium")   # type the search text (this article's original title)
searchBox.submit()    # submit

find_element_by_id(): gets an element by its id.
There are other similar methods, for example:

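find_element_by_xpath()	select an element by XPath
find_element_by_tag_name()	get an element by tag name
find_element_by_css_selector()	select an element by CSS selector
find_element_by_class_name()	get an element by class
find_elements_by_class_name()	get all elements with the class; every find_elements variant (with the s) returns a list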
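A note for readers on newer selenium releases: the find_element_by_* helpers were deprecated in Selenium 4 and dropped in later 4.x versions, in favor of the generic find_element(by, value) and find_elements(by, value), which also exist in Selenium 3. The sketch below maps the legacy names from the table onto that generic call. The helper name find and the STRATEGIES table are made up for illustration; the strategy strings are the values behind selenium's By.ID, By.XPATH, and so on.

```python
# Legacy method suffix -> locator strategy string understood by
# find_element(by, value); these match selenium.webdriver.common.by.By.
STRATEGIES = {
    "id": "id",
    "xpath": "xpath",
    "tag_name": "tag name",
    "css_selector": "css selector",
    "class_name": "class name",
}


def find(browser, legacy_name, value):
    """Call the generic finder that corresponds to a legacy find_element(s)_by_* name."""
    plural = legacy_name.startswith("find_elements_by_")
    prefix = "find_elements_by_" if plural else "find_element_by_"
    if not legacy_name.startswith(prefix):
        raise ValueError("not a find_element(s)_by_* name: " + legacy_name)
    strategy = STRATEGIES[legacy_name[len(prefix):]]
    finder = browser.find_elements if plural else browser.find_element
    return finder(strategy, value)
```

With a real driver, find(browser, "find_element_by_id", "searchBox") performs the same lookup as browser.find_element_by_id("searchBox") does in this article.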

A few examples:

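1.find_elements_by_css_selector("tr[bgcolor='#F2F2F2']>td")
  gets the td children of every tr whose bgcolor attribute is '#F2F2F2'

2.find_element_by_xpath("/html/body/div[4]/div/div/div[2]/div[3]/div[1]/div[2]/div/h4/a")
  gets the a element at this path.
  For find_element_by_xpath, press F12 in Chrome, right-click the element, and choose Copy -> Copy XPath to get the exact location in seconds. Very handy; try it and you'll see.
  
3.find_element_by_xpath("..") called on an element, gets its parent element
4.find_element_by_xpath("following-sibling::*[1]") called on an element, gets the sibling right after it
5.find_element_by_xpath("preceding-sibling::*[1]") called on an element, gets the sibling right before it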
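To try these selectors without launching a browser, the sketch below runs two of them against a made-up table using Python's standard xml.etree.ElementTree instead of selenium. ElementTree understands only a subset of XPath (attribute predicates and "..", but not the sibling axes), so it is a stand-in for experimenting with expressions, not a replacement for the browser.

```python
import xml.etree.ElementTree as ET

# Made-up markup for illustration; a real page would come from the browser.
SAMPLE = """
<table>
  <tr bgcolor="#F2F2F2"><td>a</td><td>b</td></tr>
  <tr bgcolor="#FFFFFF"><td>c</td></tr>
  <tr bgcolor="#F2F2F2"><td>d</td></tr>
</table>
""".strip()

root = ET.fromstring(SAMPLE)

# XPath counterpart of the CSS selector tr[bgcolor='#F2F2F2']>td
shaded_cells = [td.text for td in root.findall(".//tr[@bgcolor='#F2F2F2']/td")]

# ".." steps up to the parent: here, every tr that contains a td
rows_with_cells = root.findall(".//td/..")

print(shaded_cells)
print([row.get("bgcolor") for row in rows_with_cells])
```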

Scraping data

Once you have an element, its .text attribute gives you the element's text content.
Let's try grabbing the article's summary:

from selenium import webdriver
browser = webdriver.Chrome()   # open the browser

browser.get("https://segmentfault.com/")   # enter the url
searchBox = browser.find_element_by_id("searchBox")  # get the form element by id
searchBox.send_keys("python爬虫之初恋 selenium")   # type the search text (this article's original title)
searchBox.submit()                                # submit

text = browser.find_element_by_xpath("//*[@id='searchPage']/div[2]/div/div[1]/section/p[1]").text
print(text)

Besides locating elements, there are other useful methods and attributes:

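refresh()	refresh the page
close()	close the current tab (closes the browser if it is the only tab)
quit()	quit the browser
title	title of the current page
window_handles	list of handles for all windows/tabs
current_window_handle	handle of the current window/tab
switch_to.window(handle)	switch to the tab with the given handle
switch_to.frame("iframeId")	switch into an iframe
execute_script('window.open("https://segmentfault.com/")')	run a js script (here: open a new tab)
maximize_window()	maximize the window
get_screenshot_as_file()	take a screenshot (argument: save path + file name + extension)
set_page_load_timeout(30)	set the page load timeout, in seconds
ActionChains(driver).move_to_element(ele).perform()	hover the mouse over the element ele
value_of_css_property()	get the value of one of an element's CSS properties (whether set inline or in a stylesheet)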

Adding options before launch

from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument("--proxy-server=http://101.236.23.202:8866")  # proxy
chromeOptions.add_argument("--headless")   # headless mode: no visible browser window
browser = webdriver.Chrome(chrome_options=chromeOptions)
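When the same script should run with or without a proxy, or with or without a window, it can help to build the switch list separately from launching the browser. This is a sketch under that assumption; build_chrome_arguments and open_browser are hypothetical names, and open_browser imports selenium lazily so the first helper can be tried without a driver installed.

```python
def build_chrome_arguments(proxy=None, headless=False):
    """Return the command-line switches to pass to ChromeOptions.add_argument()."""
    arguments = []
    if proxy:
        arguments.append("--proxy-server=" + proxy)
    if headless:
        arguments.append("--headless")
    return arguments


def open_browser(arguments):
    """Launch Chrome with the given switches (needs selenium and chromedriver)."""
    from selenium import webdriver  # imported here so the helper above runs without selenium

    options = webdriver.ChromeOptions()
    for argument in arguments:
        options.add_argument(argument)
    # Recent selenium releases take options=; chrome_options= is the older spelling.
    return webdriver.Chrome(options=options)


# Same settings as the snippet above
print(build_chrome_arguments(proxy="http://101.236.23.202:8866", headless=True))
```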

Launching without loading images

def openDriver_no_img():
    options = webdriver.ChromeOptions()
    prefs = {
        'profile.default_content_setting_values': {
            'images': 2
        }
    }
    options.add_experimental_option('prefs', prefs)
    browser = webdriver.Chrome(chrome_options=options)
    return browser

Dealing with anti-crawler measures

QR code recognition: https://segmentfault.com/a/11...
IP proxies: https://segmentfault.com/n/13...