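OCR technology:
(1) While writing a crawler you will inevitably run into all kinds of CAPTCHAs. Most of them are still image CAPTCHAs, and these can be recognized directly with OCR.
(2) OCR, short for Optical Character Recognition, is the process of scanning characters and translating them into electronic text based on their shapes.
(3) tesserocr is an OCR library for Python, but it is really a Python API wrapper around tesseract, so its core is tesseract. tesseract therefore has to be installed before tesserocr.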
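Since tesserocr depends on tesseract being installed, a quick check like the one below can confirm the binary is reachable before installing the wrapper. This is only a minimal sketch: the helper name tesseract_available is hypothetical, and it assumes the executable is named "tesseract" and sits on the PATH, which may not hold for every installation.

```python
import shutil


def tesseract_available(binary_name="tesseract"):
    """Return True if an executable with this name is found on PATH.

    The default name "tesseract" is an assumption; adjust it if your
    installation uses a different executable name or location.
    """
    return shutil.which(binary_name) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("tesseract found on PATH")
    else:
        print("tesseract not found; install it before installing tesserocr")
```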
Installing tesserocr on Windows:
1. Install tesseract first. Download: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.01.exe
2. Then install tesserocr with pip3: pip3 install tesserocr pillow
Installing tesserocr on Linux:
yum install -y tesseract
git clone https://github.com/tesseract-ocr/tessdata.git
sudo mv tessdata/* /usr/share/tesseract/tessdata
pip3 install tesserocr pillow
Recognizing an image CAPTCHA with Python:
import tesserocr
from PIL import Image

image = Image.open('1.png')  # Opens and identifies the given image file
result = tesserocr.image_to_text(image)  # Recognize OCR text from an image object
print(result)
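The recognized string often carries stray whitespace such as a trailing newline, so it may be worth normalizing it before submitting it as a CAPTCHA answer. The sketch below is one possible cleanup step; the helper name clean_captcha_text is hypothetical, and it assumes the CAPTCHA contains only ASCII letters and digits.

```python
import re


def clean_captcha_text(raw_text):
    """Drop every character that is not an ASCII letter or digit.

    Assumption: the CAPTCHA alphabet is limited to A-Z, a-z and 0-9.
    """
    return re.sub(r"[^A-Za-z0-9]", "", raw_text)


if __name__ == "__main__":
    # A made-up OCR result with spaces and a trailing newline
    print(clean_captcha_text(" 5 9 4 A\n"))  # 594A
```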
Recognizing an image CAPTCHA that contains interference (noise) with Python:
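import tesserocr
from PIL import Image

image = Image.open('2.png')
image = image.convert('L')  # Convert to grayscale

# Lookup table for binarization: gray levels below the threshold map to 0, the rest to 1
threshold = 127
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)

image = image.point(table, '1')  # Apply the table and convert to a 1-bit image
result = tesserocr.image_to_text(image)
print(result)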
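To see what the threshold table does without needing an image file, the same mapping can be applied to a plain list of gray values. This is an illustrative sketch only: build_threshold_table and binarize are hypothetical helper names, and the sample pixel values are made up.

```python
def build_threshold_table(threshold=127):
    """Build a 256-entry lookup table: 0 below the threshold, 1 otherwise."""
    return [0 if value < threshold else 1 for value in range(256)]


def binarize(pixels, threshold=127):
    """Map each gray value (0-255) through the lookup table."""
    table = build_threshold_table(threshold)
    return [table[p] for p in pixels]


if __name__ == "__main__":
    sample = [0, 50, 126, 127, 200, 255]  # made-up gray values
    print(binarize(sample))  # [0, 0, 0, 1, 1, 1]
```

If the interference is lighter or darker than the characters, trying a few different threshold values is usually the first thing to adjust.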