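I have recently been working on a Selenium project. When it comes to choosing a browser, there are usually three options: PhantomJS, Chrome, and Firefox.
There are plenty of tutorials online about PhantomJS, but on March 4, 2018, ariya announced on the project's open-source Git repository that development was suspended until further notice. As of March 8, 2019, there has been no further news...
There are plenty of Chrome tutorials too, but after using it for the past few days I did not find the experience very good: its support for Selenium timeouts is lacking, and that cost me a lot of time.
So here I strongly recommend Firefox.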
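The strength of Selenium is that it opens a browser the way a person would, which sidesteps JavaScript obfuscation, dynamic loading, and similar obstacles: if you can see it, you can scrape it. However, a Selenium-controlled Chrome exposes many parameters that can be used to detect Selenium, and quite a few sites now have anti-scraping measures aimed specifically at it. I have heard that pyppeteer (the Python port of puppeteer) can be used instead; that is something to learn later.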
Today I discovered a handy method, driver.set_page_load_timeout(self.timeout), which made me realize I need to study the different timeout settings properly.
Official docstring for implicitly_wait:
Sets a sticky timeout to implicitly wait for an element to be found,
or a command to complete. This method only needs to be called one
time per session. To set the timeout for calls to
execute_async_script, see set_script_timeout.
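This timeout applies whenever you call a find_element_by... method. It is global to the driver and only needs to be set once, so if you want a lookup to give up right away when the element is not there, you have to set it to a fairly small value.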
Official docstring for set_page_load_timeout:
Set the amount of time to wait for a page load to complete
before throwing an error.
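When you call driver.get() to open a site, the page may already be almost fully loaded, but if, say, one image or one asynchronous request never finishes and the spinner keeps turning, get() will not return. set_page_load_timeout sets the limit for exactly this case. Example: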
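    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    driver = webdriver.Chrome()
    driver.set_page_load_timeout(15)  # page load time limit, in seconds
    try:
        driver.get('https://www.toutiao.com')
    except TimeoutException:
        driver.execute_script('window.stop()')  # stop loading
    print(driver.page_source)
    driver.quit()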
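However, in my experience Chrome does not handle this very important setting properly: once a TimeoutException has been raised, every subsequent driver operation raises TimeoutException as well, and the session cannot continue. That is why I recommend Firefox.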
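The recovery pattern described above (stop loading on timeout, keep whatever has rendered, and give up cleanly if the driver no longer responds) could be wrapped in a small helper. This is only a sketch: load_with_deadline and its timeout_exc parameter are hypothetical names, not part of Selenium, and the exception class is passed in so the helper itself does not depend on a Selenium import.

```python
def load_with_deadline(driver, url, timeout_exc):
    """Open url; if the page-load timeout fires, stop loading and keep going.

    driver      -- any object with get(), execute_script() and page_source
    timeout_exc -- the exception class the driver raises on timeout
                   (hypothetical parameter, introduced for this sketch)

    Returns the page source, or None when the driver also times out on the
    follow-up commands, which is the failure mode described above.
    """
    try:
        driver.get(url)
    except timeout_exc:
        try:
            # Abort the pending load so the partially loaded page can be read.
            driver.execute_script("window.stop()")
            return driver.page_source
        except timeout_exc:
            # The session no longer answers; let the caller recreate it.
            return None
    return driver.page_source
```

With a real browser you would call driver.set_page_load_timeout(...) first and pass selenium.common.exceptions.TimeoutException as timeout_exc. A None result signals that the session should be discarded and a new driver created.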
Official docstring for set_script_timeout:
Set the amount of time that the script should wait during an
execute_async_script call before throwing an error.
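This controls how long an asynchronous script is allowed to run; when the limit is exceeded, a TimeoutException is raised.
Example of an explicit wait with WebDriverWait: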
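    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait

    # driver is an already-created WebDriver instance
    wait = WebDriverWait(driver, 15, 0.5)
    # 'express' stands for your own XPath expression
    wait.until(EC.presence_of_element_located((By.XPATH, 'express')))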
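This is the wait class I like to use for finding elements: with a 15-second timeout, it checks the condition every 0.5 seconds. until() returns as soon as the element is found, and a TimeoutException is raised if it is still missing after 15 seconds.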
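The polling behaviour described above can be modelled in a few lines of plain Python. This is a simplified sketch of the idea, not Selenium's actual implementation: wait_until is a hypothetical name, and it raises the built-in TimeoutError rather than Selenium's TimeoutException.

```python
import time


def wait_until(condition, timeout=15.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises the built-in TimeoutError once `timeout`
    seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # found: stop waiting immediately
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s s" % timeout)
        time.sleep(poll)
```

The condition is always checked at least once, so a page that is already ready costs no waiting at all, while a missing element costs at most roughly the timeout plus one polling interval.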
    import time

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    t0 = time.time()
    print("cur time is: %s" % (time.time() - t0))
    driver = webdriver.Chrome()
    driver.set_page_load_timeout(5)  # page load time limit, in seconds
    driver.maximize_window()
    try:
        print("cur time is: %s" % (time.time() - t0))
        driver.get('http://www.autohome.com.cn/')
    except TimeoutException:
        print("cur time is: %s" % (time.time() - t0))
        try:
            # Once loading exceeds the limit, stop it via JavaScript
            # so that the following steps can run.
            driver.execute_script('window.stop()')
        except Exception:
            pass
    print("cur time is: %s" % (time.time() - t0))
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('disable-infobars')  # hide the info bar
    # Note: there must be no spaces around "=", i.e. NOT "--proxy-server = http://202.20.16.82:10152"
    # chrome_options.add_argument("--proxy-server=http://192.168.60.15:808")  # set a proxy
    # chrome_options.add_argument('start-fullscreen')  # start in full screen (F11 style)
    # chrome_options.add_argument('-lang=zh-CN')  # Chinese; does not seem to work
    # Language: set to Chinese
    # prefs = {'intl.accept_languages': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3'}
    # chrome_options.add_experimental_option('prefs', prefs)
    # chrome_options.add_argument('blink-settings=imagesEnabled=false')  # skip images for speed
    # chrome_options.add_argument('--headless')  # no visible window; on Linux without a display, startup fails without this
    # chrome_options.add_argument('--window-size=1920,3000')  # browser resolution
    # chrome_options.add_argument('--disable-gpu')  # Google's docs mention adding this to work around a bug
    # chrome_options.add_argument('--hide-scrollbars')  # hide scrollbars, for some special pages
    # chrome_options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"  # browser binary to use

    TIMEOUT = 15


    class Display(object):
        def __init__(self):
            # executable_path can be omitted once the driver is on PATH
            self.driver = webdriver.Chrome(options=chrome_options)
            # self.driver.set_page_load_timeout(TIMEOUT)
            self.wait = WebDriverWait(self.driver, TIMEOUT, 0.5)

        def __del__(self):
            driver = getattr(self, 'driver', None)  # unset if __init__ failed
            if driver:
                driver.quit()  # end the session and the driver process

        def fetch(self, url):
            self.driver.maximize_window()  # maximize the window
            self.driver.get(url)  # send the request
            # self.driver.execute_script('window.location.reload();')  # refresh
            self.wait.until(EC.presence_of_element_located((By.ID, 'kw')))
            self.driver.find_element(By.ID, 'kw').send_keys('selenium')
            self.driver.find_element(By.ID, 'su').click()
            self.wait.until(EC.presence_of_element_located((By.ID, '1')))
            return self.driver.page_source


    d = Display()
    print(d.fetch('https://www.baidu.com'))
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    TIMEOUT = 15


    class Display(object):
        def __init__(self):
            options = webdriver.FirefoxOptions()
            options.add_argument('-headless')  # headless mode
            # disable images
            options.set_preference('permissions.default.image', 2)
            self.driver = webdriver.Firefox(options=options)
            self.driver.set_page_load_timeout(TIMEOUT)
            self.wait = WebDriverWait(self.driver, TIMEOUT, 0.5)

        def __del__(self):
            driver = getattr(self, 'driver', None)  # unset if __init__ failed
            if driver:
                driver.quit()  # end the session and the driver process

        def fetch(self, url):
            self.driver.maximize_window()  # maximize the window
            self.driver.get(url)  # send the request
            # self.driver.execute_script('window.location.reload();')  # refresh
            self.wait.until(EC.presence_of_element_located((By.ID, 'kw')))
            self.driver.find_element(By.ID, 'kw').send_keys('selenium')
            self.driver.find_element(By.ID, 'su').click()
            self.wait.until(EC.presence_of_element_located((By.ID, '1')))
            return self.driver.page_source