[Web Scraping] Python + Selenium + tesseract

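Introduction

A few scraping tips picked up at work recently: automated screenshots with Python + Selenium, and automatic CAPTCHA recognition with tesseract (to be honest, tesseract's accuracy is not great).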

Preparation

1. Install a Python environment; a quick web search will cover this.

2. Install selenium from the command line: pip install selenium

3. Install pytesseract the same way: pip install pytesseract

4. Install chromedriver.exe. Installation guide: https://blog.csdn.net/wwwq2386466490/article/details/81513888

5. Install tesseract.exe. Guide: https://www.cnblogs.com/VseYoung/p/code.html Configuring pytesseract: https://blog.csdn.net/u010134642/article/details/78747630
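Before moving on, it can help to confirm that everything from the list above is actually visible to Python. The following is only a minimal sketch: the helper name check_prerequisites is made up for illustration, and it assumes chromedriver and tesseract are on PATH, which may not match your installation.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report which of the tools from the steps above can be found."""
    return {
        # Python packages installed with pip
        "selenium": importlib.util.find_spec("selenium") is not None,
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        # Executables expected on PATH (an assumption; they may live elsewhere)
        "chromedriver": shutil.which("chromedriver") is not None,
        "tesseract": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If tesseract is installed but not on PATH, pytesseract also lets you point at it directly by setting pytesseract.pytesseract.tesseract_cmd to the full path of the executable.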

That is a lot of setup... Now for the actual work.

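Basic Python + Selenium operations

The code below performs these steps:

Python + Selenium launches the browser, opens Baidu Maps at https://map.baidu.com/ and maximizes the window. It then types the keyword "广州塔" (Canton Tower) into the search box, clicks the search button, and finally saves a screenshot to the chosen path. (At this point I was reminded of "贪玩蓝月"...)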

# -*- coding:utf-8 -*-
# Note: find_element_by_id and executable_path are Selenium 3 style calls;
# Selenium 4 replaces them with find_element(By.ID, ...) and Service.
from selenium import webdriver
from time import sleep
import time

# Path to the chromedriver.exe installed in the previous step
chrome_driver = 'C:/Users/zero/AppData/Local/Google/Chrome/Application/chromedriver.exe'


# Format the current time as a timestamp string
def time_format():
    current_time = time.strftime('%Y%m%d%H%M%S', time.localtime(time.time()))
    return current_time


driver = webdriver.Chrome(executable_path=chrome_driver)
driver.get('https://map.baidu.com/')
driver.maximize_window()
elem = driver.find_element_by_id("sole-input")  # find the search box by its id
elem.send_keys("广州塔")
elem = driver.find_element_by_id("search-button")  # find the search button by its id
elem.click()
sleep(3)
# Take a screenshot of the whole browser window
driver.get_screenshot_as_file("E:/crawl/" + time_format() + ".png")
sleep(2)
driver.quit()
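The script above uses the Selenium 3 style of locating elements. For readers on Selenium 4, here is a rough sketch of the same flow, not a drop-in replacement. The function names screenshot_path and capture_search and the 10-second timeout are my own choices for illustration; the element ids are the ones from the script above and may change whenever the page does.

```python
import time
from pathlib import Path


def screenshot_path(out_dir, now=None):
    """Build a timestamped PNG path such as <out_dir>/20190801123000.png."""
    stamp = time.strftime("%Y%m%d%H%M%S", time.localtime(now))
    return str(Path(out_dir) / f"{stamp}.png")


def capture_search(query, out_dir, driver_path=None):
    """Search Baidu Maps for `query` and save a screenshot of the window."""
    # Imported lazily so the path helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    service = Service(executable_path=driver_path) if driver_path else Service()
    driver = webdriver.Chrome(service=service)
    try:
        driver.get("https://map.baidu.com/")
        driver.maximize_window()
        wait = WebDriverWait(driver, 10)  # timeout in seconds (arbitrary choice)
        # Wait for the search box instead of assuming it is already there.
        box = wait.until(EC.presence_of_element_located((By.ID, "sole-input")))
        box.send_keys(query)
        wait.until(EC.element_to_be_clickable((By.ID, "search-button"))).click()
        time.sleep(3)  # crude pause so the results have time to render
        target = screenshot_path(out_dir)
        driver.save_screenshot(target)
        return target
    finally:
        driver.quit()  # always close the browser, even after an error
```

Calling capture_search("广州塔", "E:/crawl") would mirror the original script. The try/finally is there so a failed lookup does not leave a stray browser window behind.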

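Python + tesseract operations

tesseract's CAPTCHA recognition is fairly inaccurate, but since I have used it, I might as well introduce it.

Overall flow:

1. Request Baidu's password-recovery page.
2. Locate the img node that holds the CAPTCHA and crop the CAPTCHA out of a screenshot.
3. Preprocess the image (grayscale conversion, contrast enhancement and so on) and have tesseract return the recognized CAPTCHA text.
4. Use webdriver to find the relevant page elements, fill in the input boxes, and click the button.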
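Step 2 comes down to turning the element's position and size into a crop rectangle. Here is a small sketch of that calculation on its own. The scale parameter is my addition, not part of the original approach: on displays with OS-level scaling, screenshot pixels may not match the page coordinates Selenium reports, and multiplying by the device pixel ratio is one common workaround.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, upper, right, lower) for PIL's Image.crop.

    `location` and `size` are the dicts Selenium returns for an element.
    `scale` is a hypothetical device-pixel-ratio factor (1.0 = no scaling).
    """
    left = int(location["x"] * scale)
    upper = int(location["y"] * scale)
    right = int((location["x"] + size["width"]) * scale)
    lower = int((location["y"] + size["height"]) * scale)
    return (left, upper, right, lower)


# Example: a 100x40 CAPTCHA whose top-left corner is at (520, 310)
print(crop_box({"x": 520, "y": 310}, {"width": 100, "height": 40}))
# (520, 310, 620, 350)
print(crop_box({"x": 520, "y": 310}, {"width": 100, "height": 40}, scale=2))
# (1040, 620, 1240, 700)
```

Note that a page screenshot only covers the visible viewport, so the element has to be on screen for the crop to line up. Selenium's WebElement.screenshot(filename) is an alternative that saves just the element and avoids the cropping step.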

# coding:utf-8
from selenium import webdriver
from time import sleep
from PIL import Image
from PIL import ImageEnhance
import pytesseract

chrome_driver = 'C:/Users/zero/AppData/Local/Google/Chrome/Application/chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver)
url = "https://passport.baidu.com/?getpassindex"
driver.get(url)
driver.maximize_window()
driver.save_screenshot(r"E:\crawl\aa.png")  # screenshot the current page, which contains the CAPTCHA we need
imgelement = driver.find_element_by_xpath(".//*[@id='forgotsel']/div/div[3]/img")
# imgelement = driver.find_element_by_id("code")  # locate the CAPTCHA
location = imgelement.location  # x, y coordinates of the CAPTCHA
size = imgelement.size  # width and height of the CAPTCHA
# The rectangle we need to crop: (left, upper, right, lower)
coderange = (int(location['x']), int(location['y']),
             int(location['x'] + size['width']), int(location['y'] + size['height']))
i = Image.open(r"E:\crawl\aa.png")  # open the screenshot
frame4 = i.crop(coderange)  # use Image.crop to cut the CAPTCHA region out of the screenshot
frame4.save(r"E:\crawl\frame4.png")
i2 = Image.open(r"E:\crawl\frame4.png")
# Convert to grayscale. PIL has nine modes: 1, L, P, RGB, RGBA, CMYK, YCbCr, I, F; L is grayscale
imgry = i2.convert('L')
sharpness = ImageEnhance.Contrast(imgry)  # contrast enhancement
i3 = sharpness.enhance(3.0)  # 3.0 is the contrast enhancement factor
i3.save(r"E:\crawl\image_code.png")
i4 = Image.open(r"E:\crawl\image_code.png")
text = pytesseract.image_to_string(i4).strip()  # run image_to_string on the enhanced image
print(text)
elem = driver.find_element_by_id("account")
elem.send_keys("13652878889")
elem = driver.find_element_by_id("veritycode")
elem.send_keys(text)
sleep(2)
elem = driver.find_element_by_id("submit")
elem.click()
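Since tesseract often returns stray spaces or punctuation along with the characters, one cheap safeguard is to filter its output before typing it into the form. This is only a sketch under assumptions: the helper name clean_captcha is made up, and the idea that the CAPTCHA is four ASCII letters or digits is my guess, not something the page guarantees.

```python
import re


def clean_captcha(raw, expected_length=4):
    """Keep only ASCII letters and digits; return None if the length is off."""
    text = re.sub(r"[^0-9A-Za-z]", "", raw)
    return text if len(text) == expected_length else None


print(clean_captcha(" a7 Kq\n"))   # a7Kq
print(clean_captcha("a7K"))        # None
print(clean_captcha("a7-Kq|"))     # a7Kq
```

When the helper returns None, it is probably better to refresh the CAPTCHA and try again than to submit a guess that is known to be the wrong length.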

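Summary

1. Life is short, I use Python.

2. Python with Chrome's mobile mode can free up your hands in the same way.

3. When you finish writing a page that has lots of input boxes, you can fill them in once and never do it by hand again. Perhaps that is what automated testing is...

4. If you like games, grinding monsters and so on, this is worth a look too.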

This article was originally shared from the WeChat public account 爱编码 (ilovecode).