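Logging in to a website automatically with Selenium, taking a screenshot, and fetching the post-login page content with Requests. Let's walk through it together.
The Selenium approach amounts to simulating a user who opens the browser and logs in by hand.
Compared with logging in through direct HTTP requests, it has several advantages:
It sidesteps the complexities of the login form (iframe, ajax, and so on), so you don't have to analyze the details.
It avoids the details of completing the login over HTTP, such as simulating headers and tracking cookies.
Also, a visible automated login looks rather impressive to non-specialists.
For fetching specific content after login, rather than crawling the whole site, Requests is sufficient and easy to use.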
Base environment: Python 3.7.4 (anaconda3-2019.10)
Install Selenium with pip:
    pip install selenium
Check the Selenium version:
    $ python
    Python 3.7.4 (default, Aug 13 2019, 15:17:50)
    [Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import selenium
    >>> print('Selenium version is {}'.format(selenium.__version__))
    Selenium version is 3.141.0
Download and install the Google Chrome browser:
https://www.google.com/chrome/
Download the Chromium/Chrome WebDriver:
https://chromedriver.storage....
Then add the WebDriver path to PATH, for example:
    # macOS, Linux: append the export line to ~/.profile so new shells pick it up
    echo 'export PATH=$PATH:/opt/WebDriver/bin' >> ~/.profile
    # Windows
    setx /m path "%path%;C:\WebDriver\bin\"
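Before launching the browser, it can help to confirm that the driver is actually visible on PATH. The following is a minimal sketch using only the standard library; the helper name `find_driver` is hypothetical, and `chromedriver` is assumed to be the executable name of the downloaded driver.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver not found on PATH")
else:
    print("chromedriver found at {}".format(path))
```

If the helper returns None, Selenium would most likely fail to start Chrome as well, so checking first gives a clearer error.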
Login credentials are private, so we read them from a JSON config:
    # load config
    import json
    from types import SimpleNamespace as Namespace

    secret_file = 'secrets/douban.json'

    # {
    #   "url": {
    #     "login": "https://www.douban.com/",
    #     "target": "https://www.douban.com/mine/"
    #   },
    #   "account": {
    #     "username": "username",
    #     "password": "password"
    #   }
    # }
    with open(secret_file, 'r', encoding='utf-8') as f:
        # object_hook turns every JSON object into a namespace with attribute access
        config = json.load(f, object_hook=lambda d: Namespace(**d))

    login_url = config.url.login
    target_url = config.url.target
    username = config.account.username
    password = config.account.password
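To see what `object_hook` does without touching a secrets file, here is a small sketch that parses an in-memory string instead. The JSON text and the `load_config` helper are made up for illustration; only `json.loads` and `SimpleNamespace` come from the standard library.

```python
import json
from types import SimpleNamespace

# Hypothetical stand-in for the contents of secrets/douban.json.
raw = '{"url": {"login": "https://www.douban.com/"}, "account": {"username": "alice"}}'


def load_config(text):
    """Parse JSON so that nested objects become attribute-accessible namespaces."""
    return json.loads(text, object_hook=lambda d: SimpleNamespace(**d))


config = load_config(raw)
print(config.url.login)
print(config.account.username)
```

The hook is called for every JSON object, innermost first, which is why nested keys such as `config.url.login` work as attributes.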
We use the Chrome WebDriver, and the test site for the login is Douban.
Open the login page, enter the username and password automatically, and log in:
    # automated testing
    import sys  # needed for sys.exit() below

    from selenium import webdriver

    # Chrome Start
    opt = webdriver.ChromeOptions()
    driver = webdriver.Chrome(options=opt)
    # Chrome opens with “Data;” with selenium
    #   https://stackoverflow.com/questions/37159684/chrome-opens-with-data-with-selenium
    # Chrome End

    # driver.implicitly_wait(5)

    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    wait = WebDriverWait(driver, 5)

    print('open login page ...')
    driver.get(login_url)
    # the login form lives in the first iframe of the page
    driver.switch_to.frame(driver.find_elements_by_tag_name("iframe")[0])

    # find_element_by_* is the Selenium 3 API; Selenium 4 uses driver.find_element(By.NAME, ...)
    driver.find_element_by_css_selector('li.account-tab-account').click()
    driver.find_element_by_name('username').send_keys(username)
    driver.find_element_by_name('password').send_keys(password)
    driver.find_element_by_css_selector('.account-form .btn').click()

    try:
        wait.until(EC.presence_of_element_located((By.ID, "content")))
    except TimeoutException:
        driver.quit()
        sys.exit('open login page timeout')
If you use the IE browser instead, it looks like this:
    # Ie Start
    # Selenium Click is not working with IE11 in Windows 10
    #   https://github.com/SeleniumHQ/selenium/issues/4292
    opt = webdriver.IeOptions()
    opt.ensure_clean_session = True
    opt.ignore_protected_mode_settings = True
    opt.ignore_zoom_level = True
    opt.initial_browser_url = login_url
    opt.native_events = False
    opt.persistent_hover = True
    opt.require_window_focus = True
    driver = webdriver.Ie(options=opt)
    # Ie End
To set additional capabilities, you can do this:
    cap = opt.to_capabilities()
    cap['acceptInsecureCerts'] = True
    cap['javascriptEnabled'] = True
    print('open target page ...')
    driver.get(target_url)

    try:
        wait.until(EC.presence_of_element_located((By.ID, "board")))
    except TimeoutException:
        driver.quit()
        sys.exit('open target page timeout')

    # save screenshot
    driver.save_screenshot('target.png')
    print('saved to target.png')
    # save html
    import requests

    requests_session = requests.Session()

    # reuse the browser's user agent so the requests look like the same client
    selenium_user_agent = driver.execute_script("return navigator.userAgent;")
    requests_session.headers.update({"user-agent": selenium_user_agent})
    # copy the logged-in cookies from the browser into the Requests session
    for cookie in driver.get_cookies():
        requests_session.cookies.set(cookie['name'], cookie['value'], domain=cookie['domain'])

    # driver.delete_all_cookies()
    driver.quit()

    resp = requests_session.get(target_url)
    resp.encoding = resp.apparent_encoding
    # resp.encoding = 'utf-8'
    print('status_code = {0}'.format(resp.status_code))
    # write with an explicit encoding so the output does not depend on the platform default
    with open('target.html', 'w+', encoding='utf-8') as fout:
        fout.write(resp.text)

    print('saved to target.html')
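The loop above forwards only the name, value, and domain of each cookie. If the target site turns out to depend on the other attributes, one possible refinement is to forward them too. The sketch below maps the keys of a Selenium cookie dict to the keyword arguments that `requests_session.cookies.set()` accepts; the helper name `cookie_kwargs` and the sample cookie are hypothetical, and the helper itself needs neither a browser nor Requests.

```python
def cookie_kwargs(cookie):
    """Map a Selenium cookie dict to keyword arguments for requests' cookies.set()."""
    # Selenium key -> requests keyword; note that Selenium says 'expiry', requests says 'expires'.
    mapping = {"domain": "domain", "path": "path", "secure": "secure", "expiry": "expires"}
    return {dst: cookie[src] for src, dst in mapping.items() if src in cookie}


# Hypothetical cookie in the shape returned by driver.get_cookies().
sample = {
    "name": "sid",
    "value": "abc",
    "domain": ".douban.com",
    "path": "/",
    "secure": True,
    "httpOnly": True,
    "expiry": 1600000000,
}
print(cookie_kwargs(sample))

# Possible use inside the loop of the script above:
# requests_session.cookies.set(cookie['name'], cookie['value'], **cookie_kwargs(cookie))
```

Keys that a cookie does not carry (session cookies have no `expiry`, for example) are simply skipped.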
You can also add the WebDriver path to PATH temporarily:
    # macOS, Linux
    export PATH=$(pwd)/drivers:$PATH
    # Windows
    set PATH=%cd%\drivers;%PATH%
Run the Python script; the output looks like this:
    $ python douban.py
    Selenium version is 3.141.0
    --------------------------------------------------------------------------------
    open login page ...
    open target page ...
    saved to target.png
    status_code = 200
    saved to target.html
The run produces the screenshot target.png and the HTML content target.html.
What if the login process runs into a verification challenge?
Slider verification can be simulated with Selenium.
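As a rough illustration of what simulating a slider could look like, the sketch below splits a drag distance into small moves that start fast and slow down, then replays them with Selenium's `ActionChains`. Everything here is an assumption rather than something from the original script: the helper names `build_track` and `drag_slider`, the ease-out curve, and the idea that the gap distance and the slider element have already been found by other means. Real verification widgets may still reject automated drags.

```python
def build_track(distance, steps=20):
    """Split `distance` pixels into `steps` integer moves that ease out (fast start, slow finish)."""
    # Cumulative position after step i follows the ease-out curve 1 - (1 - t)^2.
    positions = [round(distance * (1 - (1 - i / steps) ** 2)) for i in range(steps + 1)]
    return [b - a for a, b in zip(positions, positions[1:])]


def drag_slider(driver, knob, distance):
    """Drag `knob` (a WebElement) to the right by `distance` pixels in small moves."""
    # Imported lazily so build_track() can be tried without Selenium installed.
    from selenium.webdriver.common.action_chains import ActionChains

    chain = ActionChains(driver).click_and_hold(knob)
    for dx in build_track(distance):
        chain = chain.move_by_offset(dx, 0)
    chain.release().perform()


track = build_track(200)
print(len(track), sum(track))
```

Because the positions start at 0 and end exactly at `distance`, the individual moves always add up to the requested distance, whatever rounding happens in between.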
The Gist with the code for this article:
https://gist.github.com/ikuok...