My WeChat official account: 螺旋编程极客. Follows are welcome.
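My company recently came up with a new requirement. The rough flow: open the Tianjin market entity credit information publicity system (天津市市场主体信用信息公示系统), look up each company's shareholder information using the company name or tax ID listed in an Excel sheet, take the tax IDs of those shareholders, query each shareholder's own shareholders in turn, and finally write the results back to Excel. In short: read Excel -> query the company's shareholders -> get the shareholders' tax IDs -> query each shareholder by tax ID -> write the results to Excel. It sounds tedious, but in one sentence the job is to record information about the shareholders of shareholders in Excel. When I first heard the requirement it seemed feasible in principle. The only real difficulty was the slider captcha; once that is solved, the rest is just extracting data from web pages.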
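Enough talk, let's write the crawler. Because of the slider captcha, a browser-driven crawler is the only practical choice. It is slower, but it can scrape almost anything, and it spares me the painful work of capturing and analyzing the request traffic. Here I use the 艺赛旗RPA设计器 (the 艺赛旗 RPA designer) to help. It is genuinely handy: its Python library is powerful, and once a flow is designed it generates the Python code automatically, so I only have to care about the core algorithm and the business logic. First, a look at the captcha image: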
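This is the familiar GeeTest (极验) captcha, which many websites use. The government site is clearly a step behind: GeeTest 3.0 is already out, while this site is still on 2.0. The difference is that 2.0 first shows the complete image and only shows the image with the gap after you click the slider button, whereas 3.0 shows the image with the gap from the start. Both can be solved.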
I use 2.0 as the example here and also include the core code for 3.0. First, the steps for solving 2.0:
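click Search --> capture the captcha image --> click the slider button --> capture the image with the gap --> compare pixels to compute the offset --> drag the slider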
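Because we are using the RPA designer, we don't have to write the code for mouse clicks, screenshots and the like ourselves: pick the corresponding component, click the element on the page, and it generates the Python code for us. The generated code is highly encapsulated, but the source can be inspected at any time, and underneath it is the usual machinery. The only parts we have to write by hand are the offset calculation and the mouse movement. The designer has its own mouse-drag component, but it drags in too straight a line and gets detected, producing the "eaten by the monster" (被怪物吃掉) message. So I modified its source slightly and wrapped it in a method of my own. First, the flow chart of the captcha recognition, followed by the generated code: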
# coding=utf-8
# Build date: 2019-08-14 10:09:34
import time
import pdb
from ubpa.ilog import ILog
from ubpa.base_img import *
import getopt
from sys import argv
import sys
from ubpa.itools import rpa_import
GlobalFun = rpa_import.import_global_fun(__file__)
import ubpa.ibox as ibox
import ubpa.iexcel as iexcel
import ubpa.ifile as ifile
import ubpa.iie as iie
import ubpa.iimg as iimg
import ubpa.ikeyboard as ikeyboard

class getTjInfo:

    def __init__(self,**kwargs):
        self.__logger = ILog(__file__)
        self.path = set_img_res_path(__file__)
        self.robot_no = ''
        self.proc_no = ''
        self.job_no = ''
        if('robot_no' in kwargs.keys()):
            self.robot_no = kwargs['robot_no']
        if('proc_no' in kwargs.keys()):
            self.proc_no = kwargs['proc_no']
        if('job_no' in kwargs.keys()):
            self.job_no = kwargs['job_no']

    # Captcha recognition. Returns True when the attempt FAILED (failure image found), False otherwise.
    def checkCode(self):
        existFlg=None
        distance=None
        xy=None
        imageTwo=None
        imageOne=None
        # Screenshot
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530184,Note:')
        imageOne = iimg.capture_image(win_title=r'天津市市场主体',win_text=r'',in_img_path=r'C:/Users/Administrator/Desktop/',left_indent=823,top_indent=521,width=266,height=121,waitfor=30)
        # Mouse click
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530183,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
        time.sleep(4)
        # Screenshot
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530186,Note:')
        imageTwo = iimg.capture_image(win_title=r'天津市市场主体',win_text=r'',in_img_path=r'C:/Users/Administrator/Desktop/',left_indent=823,top_indent=521,width=266,height=121,waitfor=30)
        # Custom function
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530182,Note:')
        distance = GlobalFun.get_distance(imageOne,imageTwo)
        # Get element position
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530181,Note:')
        xy = iie.get_element_rect(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt*',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2)',curson=r'center',waitfor=10)
        # Code block
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530180,Note:')
        print(xy)
        lastxy=(xy[0]+distance,xy[1],xy[2],xy[3])
        print(lastxy)
        # Hand-tuned corrections for two specific target positions
        if(xy==(847.0, 682.0, 44, 44) and lastxy==(900.0, 682.0, 44, 44)):
            print('修正')
            lastxy=(895.0, 682.0, 44, 44)
        if(xy==(847.0, 682.0, 44, 44) and lastxy==(976.0, 682.0, 44, 44)):
            print('修正')
            lastxy=(868.0, 682.0, 44, 44)
        # Custom function
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530185,Note:')
        GlobalFun.myDo_drag_to(win_title=r'天津市市场主体', srcpos=xy,distpos=lastxy)
        # Delete file
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530187,Note:')
        ifile.del_file(file=imageOne)
        # Delete file
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530188,Note:')
        ifile.del_file(file=imageTwo)
        # Image detection
        self.__logger.debug('Flow:checkCode,StepNodeTag:13140204311151,Note:')
        time.sleep(3.5)
        existFlg = iimg.img_exists(win_title=r'天津市市场主体',img_res_path=self.path,image=r'snapshot_20190813135330024.png',fuzzy=True,confidence=0.85,waitfor=3)
        # IF branch
        self.__logger.debug('Flow:checkCode,StepNodeTag:13140549531176,Note:')
        if existFlg:
            # Message box
            self.__logger.debug('Flow:checkCode,StepNodeTag:13143951406201,Note:')
            ibox.msg_box(msg='验证失败,重试!',timeout=1.5)
            time.sleep(1)
            # Mouse click
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140738964184,Note:')
            iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt*',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(1) > DIV:nth-of-type(3) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=100,scroll_view='no')
            time.sleep(1.5)
            # Return
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140556594179,Note:')
            return True
        else:
            # Return
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140620186183,Note:')
            return False
        # Code block (unreachable: both branches above return)
        self.__logger.debug('Flow:checkCode,StepNodeTag:13141700326199,Note:')
        print(existFlg)

    # Process table data
    def dealTableData(self,tableData=None):
        currentCom=None
        currentTableData=None
        currentComName=None
        # Code block
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161341316275,Note:')
        columns=tableData.columns
        realDataList=tableData.values.tolist()
        # Hotkey input
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161638010281,Note:')
        ikeyboard.key_send_cs(win_title=r'天津市市场主体',text='^{F4}',waitfor=10)
        # Hotkey input
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13164935964357,Note:')
        ikeyboard.key_send_cs(text='^{F4}',waitfor=10)
        # Message box
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161731539284,Note:')
        ibox.msg_box(msg='开始处理二级公司数据',timeout=2)
        time.sleep(0.002)
        # For loop
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161910442289,Note:')
        for i in range(len(realDataList)):
            # Code block
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13162051185291,Note:')
            currentList=realDataList[i]
            if(columns[0]=='有限责任公司本年度是否有股权转让 '):
                currentCom=currentList[0]
                currentComName=currentList[0]
            if(columns[0]=='企业是否有股权信息或购买其它公司股权'):
                currentCom=currentList[0]
                currentComName=currentList[1]
            # Only companies whose name contains "天津" are looked up
            if("天津" not in currentCom):
                continue
            time.sleep(1)
            # Sub-flow: finishCheckCode
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13161503452279,Note:')
            self.finishCheckCode(comName=currentComName)
            # Sub-flow: goToDetail
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13163507635351,Note:')
            currentTableData=self.goToDetail()
            # Code block
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13170553111368,Note:')
            # DataFrame.drop returns a new DataFrame, so the result has to be kept
            table0=currentTableData[0].drop(['变动后股权比例','股权变动日期'], axis=1)
            table1=currentTableData[1].drop(['投资设立企业后购买股权企业名称',r'统一社会信用代码/注册号'], axis=1)
            lastTableData0=table0.values.tolist()
            lastTableData1=table1.values.tolist()
            # Insert row
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13171433395371,Note:')
            iexcel.ins_row(path='C:/Users/Administrator/Desktop/testData.xlsx',data=lastTableData1)
            # Hotkey input
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13165826952364,Note:')
            ikeyboard.key_send_cs(win_title=r'天津市市场主体',text='^{F4}',waitfor=10)
            # Hotkey input
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13165850702366,Note:')
            ikeyboard.key_send_cs(text='^{F4}',waitfor=10)

    # Complete the verification
    def finishCheckCode(self,comName='911200006630613577'):
        # Open website
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082917,Note:')
        iie.open_url(url=r'http://credit.scjg.tj.gov.cn/gsxt/')
        # Mouse click
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082912,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/#',selector=r'http-equiv="x-ua-compatible":nth-of-type(1) > DIV:nth-of-type(3) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(2) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=300,scroll_view='no')
        time.sleep(0.5)
        # Mouse click
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082913,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/#',selector=r'http-equiv="x-ua-compatible":nth-of-type(1) > DIV:nth-of-type(3) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(2) > A:nth-of-type(1)',button=r'left',curson=r'center',offsetY=45,times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
        time.sleep(2)
        # Set text
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:130845108293,Note:')
        iie.set_text(url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#searchName',text=comName,waitfor=10)
        # Mouse click
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:130845108292,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#entSearchLink',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
        time.sleep(2.5)
        # While loop: repeat until checkCode no longer reports a failure
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135310584131,Note:')
        while True:
            # Sub-flow: checkCode
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13140310308161,Note:')
            tvar13140310308161=self.checkCode()
            # IF branch
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135336032135,Note:')
            if tvar13140310308161:
                pass
            else:
                # Break
                self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135345176138,Note:')
                break

    # Get the shareholder companies' information
    def getChildCom(self):
        tableData2=None
        tableData1=None
        table2Columns=None
        table1Columns=None
        tableDatas=None
        # Sub-flow: goToDetail
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13162921190325,Note:')
        tableDatas=self.goToDetail()
        # IF branch
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13145633622210,Note:')
        if tableDatas[0].columns[1]=='否':
            pass
        else:
            # Code block
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163242189338,Note:')
            tableData1=tableDatas[0]
            # Sub-flow: dealTableData
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163127828333,Note:')
            self.dealTableData(tableData=tableData1)
        # IF branch
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13145822725214,Note:')
        if tableDatas[1].columns[1]=='否':
            pass
        else:
            # Code block
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163302199339,Note:')
            tableData2=tableDatas[1]
            # Sub-flow: dealTableData
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163131389335,Note:')
            self.dealTableData(tableData=tableData2)
        # Message box
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13151905124229,Note:')
        ibox.msg_box(msg='当前企业数据处理完毕,下一个。。',timeout=1.5)
        time.sleep(1.5)

    # Go to the detail page
    def goToDetail(self):
        table2Data=None
        table1Data=None
        # Mouse click
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929301,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#center_content > DIV:nth-of-type(1) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(1) > H1:nth-of-type(1) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=20,scroll_view='no')
        time.sleep(1.5)
        # Mouse click
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929300,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#tabs > DIV:nth-of-type(1) > DIV:nth-of-type(3) > SPAN:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=30,scroll_view='no')
        # Mouse click
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929299,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#tableInfoDiv > DIV:nth-of-type(2) > TABLE:nth-of-type(1) > TBODY:nth-of-type(1) > TR:nth-of-type(3) > TD:nth-of-type(4) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=30,scroll_view='no')
        # Custom function (equity transfers)
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929298,Note:股权转让')
        table1Data = GlobalFun.getTableData('年报详情','#show_alter')
        # Custom function (equity purchases)
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929297,Note:是否有狗买')
        table2Data = GlobalFun.getTableData('年报详情','#show_invest')
        # Return
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162803390322,Note:')
        return table1Data,table2Data

    def Main(self):
        # Sub-flow: finishCheckCode
        self.__logger.debug('Flow:Main,StepNodeTag:13165330947360,Note:')
        self.finishCheckCode(comName='911200006630613577')
        # Sub-flow: getChildCom
        self.__logger.debug('Flow:Main,StepNodeTag:13151828292226,Note:')
        self.getChildCom()
        # Message box
        self.__logger.debug('Flow:Main,StepNodeTag:13152147770235,Note:')
        ibox.msg_box(msg='所有数据处理完毕!')

if __name__ == '__main__':
    robot_no = ''
    proc_no = ''
    job_no = ''
    try:
        argv = sys.argv[1:]
        # Long option names must end with '=' (no spaces) to accept a value
        opts, args = getopt.getopt(argv,"hr:p:j:",["robot=","proc=","job="])
    except getopt.GetoptError:
        print ('robot.py -r <robot> -p <proc> -j <job>')
        # Exit here; otherwise opts is undefined in the loop below
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print ('robot.py -r <robot> -p <proc> -j <job>')
        elif opt in ("-r", "--robot"):
            robot_no = arg
        elif opt in ("-p", "--proc"):
            proc_no = arg
        elif opt in ("-j", "--job"):
            job_no = arg
    pro = getTjInfo(robot_no=robot_no,proc_no=proc_no,job_no=job_no)
    pro.Main()
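One detail in the listing is easy to misread: checkCode returns True when the failure image is still on screen, so the while loop in finishCheckCode keeps calling it until it returns False, with no upper limit on attempts. A minimal sketch of a bounded variant follows. The names solve_with_retry, attempt_failed and max_attempts are hypothetical and not part of the generated flow, and the stub simply stands in for self.checkCode.

```python
def solve_with_retry(attempt_failed, max_attempts=5):
    """Call attempt_failed() until it reports success or attempts run out.

    attempt_failed mirrors checkCode(): it returns True when the attempt
    FAILED (failure image detected) and False when the captcha passed.
    Returns the number of attempts used, or raises RuntimeError.
    """
    for attempt in range(1, max_attempts + 1):
        if not attempt_failed():
            return attempt
    raise RuntimeError("captcha not solved after %d attempts" % max_attempts)


# Stub standing in for self.checkCode: fails twice, then passes.
outcomes = iter([True, True, False])
attempts_used = solve_with_retry(lambda: next(outcomes))
print(attempts_used)
```

Capping the attempts means a changed page layout or a stale failure screenshot would end the run with an error rather than loop forever.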
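Below is the code for the global functions. PIL is used to read the images and process their pixels (see get_distance); pyautogui is used to operate on the browser page, here mainly to control the mouse drag (see myDo_drag_to); and pandas is used to extract table data from the page (see getTableData):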
# Build date: 2019-08-12 10:47:48
# coding=utf-8
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
import time
from PIL import Image
import ubpa.ics as ics
import pyautogui
from ubpa import iwin
import math
import ubpa.iie as iie
import re
import pandas as pd

def getTableData(titleStr,selectorStr):
    table_string = iie.get_html(title=titleStr,selector=selectorStr,waitfor=30)
    # NOTE: both patterns are empty in the source as published, so these substitutions are no-ops
    tb_start = re.compile('')
    tb_end = re.compile('')
    last_str = tb_end.sub('', tb_start.sub('', table_string))
    # Uses pandas' read_html; note header=0, some tables have their header on a different row
    data = pd.read_html(last_str, flavor="bs4", header=0)[0]
    print(data)
    print(data.columns)
    return data

def get_point_axis(axis_list,distpos,point):
    # Pick `point` intermediate stops along axis_list, then always end at distpos
    pos_val_list = []
    for i in range(1, 10000):
        if i >= point:
            break
        n = len(axis_list) * (i / (point + 1))
        pos_val = axis_list[int(n)]
        pos_val_list.append(pos_val)
    pos_val_list.append(distpos)
    return pos_val_list

def get_axis_list(srcpos=(0, 0), distpos=(0, 0)):
    # Build the list of points on the straight line from srcpos to distpos
    pos_list = []
    x1 = srcpos[0]
    y1 = srcpos[1]
    x2 = distpos[0]
    y2 = distpos[1]
    if x1 == x2:
        if y1 > y2:
            for i in range(math.ceil(y2), int(y1) + 1):
                pos_list.append((x1, i))
            pos_list.reverse()
        elif y1 < y2:
            for i in range(math.ceil(y1), int(y2) + 1):
                pos_list.append((x1, i))
        else:
            pos_list = []
    else:
        if y1 == y2:
            if x1 < x2:
                x1 = math.ceil(x1)
                x2 = int(x2)
                length = x2 - x1
                for i in range(0, length + 1):
                    pos_list.append((x1 + i, y2))
            elif x1 > x2:
                x1 = int(x1)
                x2 = math.ceil(x2)
                length = x1 - x2
                # Moving left: step from x1 down to x2
                for i in range(0, length + 1):
                    pos_list.append((x1 - i, y2))
        else:
            if x1 < x2:
                for i in range(math.ceil(x1), int(x2) + 1):
                    if y1 < y2:
                        h = (i - x1) * (y2 - y1) / (x2 - x1)
                        pos_list.append((i, y1 + h))
                    else:
                        h = (i - x1) * (y1 - y2) / (x2 - x1)
                        pos_list.append((i, y1 - h))
            else:
                for i in range(math.ceil(x2), int(x1) + 1):
                    if y1 < y2:
                        h = (i - x2) * (y2 - y1) / (x1 - x2)
                        pos_list.append((i, y2 - h))
                    else:
                        h = (i - x2) * (y1 - y2) / (x1 - x2)
                        pos_list.append((i, y2 + h))
                pos_list.reverse()
    return pos_list

def myDo_drag_to(win_title=None, srcpos=(0,0), distpos=(0,0), point=0, stimes=1, model=pyautogui.easeInOutQuad, waitfor=10):
    ''' Drag for the captcha.
    srcpos: start position, x and y in the first two items
    distpos: end position, x and y in the first two items
    point: number of intermediate stops, default 0
    stimes: duration of each move, default 1
    model: easing function; easeInQuad starts slow and ends fast, easeOutQuad starts fast and ends slow, easeInOutQuad is slow at both ends and fast in the middle, easeInBounce adds a bounce, easeInElastic an elastic oscillation
    '''
    try:
        if win_title is not None and win_title.strip() != '':
            # Activate the window if it is not active
            if not iwin.do_win_is_active(win_title):
                iwin.do_win_activate(win_title=win_title, waitfor=2)
        pyautogui.moveTo(srcpos[0], srcpos[1], 0.5)
        pyautogui.mouseDown(button='left', _pause=True)
        axis_list = get_axis_list(srcpos, distpos)
        if len(axis_list) > 0:
            pos_val_list = get_point_axis(axis_list, distpos, point)
            # print(pos_val_list)
            for index in pos_val_list:
                # Overshoot by 20 px, come back to 5 px short, then settle on the target
                pyautogui.dragTo(float(index[0]+20), float(index[1]), stimes, model)
                time.sleep(0.5)
                pyautogui.dragTo(float(index[0]-5), float(index[1]), stimes, model)
                time.sleep(0.5)
                pyautogui.dragTo(float(index[0]), float(index[1]), stimes, model)
                time.sleep(0.5)
        pyautogui.mouseUp(button='left', _pause=True)
    except Exception as e:
        raise e
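
# Offset calculation for 2.0
def get_distance(imageOne,imageTwo):
    ''' Get the distance the slider has to move.
    :param imageOne: path of the image without the gap
    :param imageTwo: path of the image with the gap
    :return: distance to move
    '''
    threshold=150
    left=60
    image1 = Image.open(imageOne)
    image2 = Image.open(imageTwo)
    # Load the pixel access objects once instead of once per pixel
    pixels1 = image1.load()
    pixels2 = image2.load()
    for i in range(left,image1.size[0]):
        for j in range(image1.size[1]):
            rgb1=pixels1[i,j]
            rgb2=pixels2[i,j]
            res1=abs(rgb1[0]-rgb2[0])
            res2=abs(rgb1[1]-rgb2[1])
            res3=abs(rgb1[2]-rgb2[2])
            if not (res1 < threshold and res2 < threshold and res3 < threshold):
                print(i-7)
                return i-7  # in testing the result was off by about 7
    # No differing pixel found: fall back to the last column scanned
    print(i-7)
    return i-7  # in testing the result was off by about 7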
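The column scan in get_distance can be tried without a browser or PIL. The following is a dependency-free sketch of the same idea on synthetic data, not the author's function: find_gap_offset is a hypothetical name, the "images" are plain nested lists of RGB tuples, and the empirical correction of 7 pixels is deliberately left out.

```python
def find_gap_offset(full, gapped, threshold=150, left=60):
    """Return the first column x >= left where any pixel differs strongly.

    full and gapped are row-major grids (lists of rows) of (r, g, b) tuples
    with identical dimensions. Returns None if no column differs.
    """
    height = len(full)
    width = len(full[0])
    for x in range(left, width):
        for y in range(height):
            p1, p2 = full[y][x], gapped[y][x]
            # A pixel counts as different when any channel differs by >= threshold
            if any(abs(a - b) >= threshold for a, b in zip(p1, p2)):
                return x
    return None


# Synthetic 100x20 "images": light background, dark gap at columns 70-79.
WIDTH, HEIGHT = 100, 20
full = [[(200, 200, 200)] * WIDTH for _ in range(HEIGHT)]
gapped = [row[:] for row in full]
for y in range(5, 15):
    for x in range(70, 80):
        gapped[y][x] = (20, 20, 20)

offset = find_gap_offset(full, gapped)
print(offset)
```

Starting the scan at column 60 skips the region where the slider piece itself sits, so the piece is not mistaken for the gap.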
The code above covers the whole flow; I have posted all of it here. For the 3.0 captcha, the offset is obtained as follows:
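# Solving GeeTest 3.0
def get_gap(image):
    """ Get the offset of the gap
    :param image: image with the gap
    :return: x offset of the gap
    """
    # left_list holds every x coordinate that meets the condition
    left_list = []
    # Only the x coordinate of the notch is needed, so there is no need to scan every y; a few evenly spaced rows are enough
    # (integer division: range() does not accept floats in Python 3)
    for i in [10 * k for k in range(1, image.size[1] // 11)]:
        # Start scanning x at image.size[0]/5.16, because the notch never lies within the first 50 pixels or so
        for j in range(int(image.size[0]/5.16), image.size[0] - int(image.size[0]/8.6)):
            if is_pixel_equal(image, j, i, left_list):
                break
    # In each (x, z), x is the left edge of the notch and z is the count of consecutive pixels to the right of x whose R, G and B are all below 150. Several rows are scanned, so several values result; the one whose z is closest to 40 fits best
    left_list = sorted(left_list, key=lambda x: abs(x[1]-40))
    # Take the x of the first element. Subtract 7 or 14 from the result; 7 usually works
    return left_list[0][0] - 7

def is_pixel_equal(image, x, y, left_list):
    """ Check whether a dark run of plausible notch width starts at (x, y)
    :param image: image
    :param x: x position
    :param y: y position
    :param left_list: list that collects (x, count) candidates
    :return: True if a candidate was recorded
    """
    # Pixel access object of the image
    pixels = image.load()
    pixel1 = pixels[x, y]
    threshold = 150
    # count records how many consecutive pixels to the right have R, G and B all below 150
    count = 0
    # If R, G and B of this pixel are all below 150, walk right and count the pixels that are also below 150
    if pixel1[0] < threshold and pixel1[1] < threshold and pixel1[2] < threshold:
        for i in range(x + 1, image.size[0]):
            pixel = pixels[i, y]
            if pixel[0] < threshold and pixel[1] < threshold and pixel[2] < threshold:
                count += 1
            else:
                break
    if int(image.size[0]/8.6) < count < int(image.size[0]/4.3):
        left_list.append((x, count))
        return True
    else:
        return False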
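The 3.0 approach needs only the image with the gap: it looks for a run of dark pixels whose length is close to the expected notch width. The sketch below illustrates that selection step on a single synthetic row. dark_runs and pick_gap are hypothetical helper names, the row is a plain list of RGB tuples, and unlike is_pixel_equal the run length here includes the first dark pixel.

```python
def dark_runs(row, threshold=150, min_len=3):
    """Return (start_x, length) for each run of dark pixels in one row."""
    runs = []
    start = None
    for x, (r, g, b) in enumerate(row):
        dark = r < threshold and g < threshold and b < threshold
        if dark and start is None:
            start = x
        elif not dark and start is not None:
            if x - start >= min_len:
                runs.append((start, x - start))
            start = None
    # Close a run that reaches the right edge of the row
    if start is not None and len(row) - start >= min_len:
        runs.append((start, len(row) - start))
    return runs


def pick_gap(runs, expected_width=40):
    """Pick the run whose length is closest to the expected gap width."""
    return min(runs, key=lambda run: abs(run[1] - expected_width))


light, dark = (220, 220, 220), (30, 30, 30)
# A short dark smudge at x=50 and a notch-sized dark run at x=100
row = [light] * 50 + [dark] * 5 + [light] * 45 + [dark] * 38 + [light] * 62
runs = dark_runs(row)
print(runs)
print(pick_gap(runs))
```

Ranking candidates by closeness to the expected width is what lets the method ignore dark details in the background picture that happen to be shorter or longer than the notch.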
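The code is commented throughout and is easy to follow if you read it calmly. There is also a handy function for processing page tables. It already appears in the code above and is repeated here: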
def getTableData(titleStr,selectorStr):
    table_string = iie.get_html(title=titleStr,selector=selectorStr,waitfor=30)
    # NOTE: both patterns are empty in the source as published, so these substitutions are no-ops
    tb_start = re.compile('')
    tb_end = re.compile('')
    last_str = tb_end.sub('', tb_start.sub('', table_string))
    # Uses pandas' read_html; note header=0, some tables have their header on a different row
    data = pd.read_html(last_str, flavor="bs4", header=0)[0]
    print(data)
    print(data.columns)
    return data
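titleStr is the browser window title; the page is matched as long as its title contains the argument passed in. selectorStr is a CSS selector string. CSS selectors are supported natively by the designer and matter a great deal in scraping generally, so look them up if they are unfamiliar. iie is a component of the vendor's own Python library that can read information directly from a page that is already open. Pass this function the location of a table on the page and it converts the table into a DataFrame. pandas really is handy!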
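To make concrete what "table to rows, with row 0 as the header" means, here is a dependency-free sketch using only the standard library. It is not how pd.read_html is implemented and handles only simple tables; TableRows, table_to_records and the sample HTML are hypothetical and purely illustrative.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in the HTML fed to the parser."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Treat the first row as the header and return one dict per data row."""
    parser = TableRows()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


sample = """
<table>
  <tr><th>name</th><th>code</th></tr>
  <tr><td>Company A</td><td>001</td></tr>
  <tr><td>Company B</td><td>002</td></tr>
</table>
"""
records = table_to_records(sample)
print(records)
```

With pandas available, read_html is the better choice, since it also copes with merged cells, nested markup and type inference.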
The captcha solver in action; when an attempt fails it retries on its own: