How to train tesseract-ocr

Date: 2019-11-21

Tags: tesseract, ocr, training

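tesseract-ocr comes in two major versions, 2 and 3, and the training procedure differs slightly between them.

The official training tutorial for version 3 is here: TrainingTesseract3

The official training tutorial for version 2 is here: TrainingTesseract

I am using the latest release, 3.01. What you need before training: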

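1. Download and install tesseract 3.01. Installation is not actually required: I downloaded the zip archive, which only needs to be unpacked. Here I unpacked it to E:\Tesseract-ocr.

2. Download and install jTessBoxEditor. This is a box file editor used to edit the training files; the direct download link is here. It is written in Java and needs a JRE to run, which fortunately is much easier to install than .NET. See its readme file for how to run it.

3. A TIFF image to train on.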

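Without any training, I used tesseract to recognize the contents of an order number. As the figure shows, the error rate was high, so I hoped training would improve the accuracy.

Training procedure: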

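1. Merge 10 images like the one above into a single TIFF image. How? Use Merge TIFF in jTessBoxEditor. Its small drawback is that it can only merge TIFF images, so if your images are JPG you need to convert them first. The resulting TIFF image is named orderNo.tif.

2. Make Box Files. Open a command prompt in the directory that contains orderNo.tif and enter: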

E:\Tesseract-ocr\tesseract.exe orderNo.tif orderNo batch.nochop makebox

This generates a box file, which records every character tesseract recognized together with its position coordinates.
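To make the contents of the box file from step 2 concrete, here is a minimal Python sketch that parses it. It assumes the 3.x box format, where each line holds a symbol followed by left, bottom, right, top and a page number; the names `Box`, `parse_box_line` and `parse_box_text`, as well as the sample coordinates in the usage note, are made up for illustration and are not part of tesseract.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One line of a box file: a symbol and its bounding box on a page."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so that the five numeric fields are taken first
    # and whatever remains on the left is the recognized symbol.
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError("not a 6-field box line: %r" % line)
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def parse_box_text(text: str) -> List[Box]:
    # Blank lines are skipped; every other line must be a valid box line.
    return [parse_box_line(l) for l in text.splitlines() if l.strip()]
```

For example, `parse_box_text("1 10 20 30 60 0\nA 35 20 55 60 0\n")` yields two `Box` entries, the first with `char == "1"`. A quick pass like this can help spot suspicious boxes before correcting them in jTessBoxEditor.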

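3. Open orderNo.tif in jTessBoxEditor. Remember that the orderNo.box file generated in step 2 must be in the same directory as orderNo.tif. Correct the characters one by one, then save.

4. Run Tesseract for Training. Enter the command: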

E:\Tesseract-ocr\tesseract.exe orderNo.tif orderNo nobatch box.train

5. Compute the Character Set. Enter the command:

E:\Tesseract-ocr\unicharset_extractor.exe orderNo.box

6. Create the file "font_properties". For version 3.01 you need to create a file named "font_properties" in the directory and put this text in it:

orderNo 0 0 0 0 0

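Roughly, this says that the font of the orderNo language is a plain font: the five flags are italic, bold, fixed, serif and fraktur, all set to 0.

Then run the command: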

E:\Tesseract-ocr\mftraining.exe -F font_properties -U unicharset orderNo.tr
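The font_properties line from step 6 can also be generated rather than typed by hand. The sketch below assumes the line layout is the font name followed by the italic, bold, fixed, serif and fraktur flags as 0 or 1; the helper name `font_properties_line` is hypothetical.

```python
def font_properties_line(font_name: str,
                         italic: bool = False,
                         bold: bool = False,
                         fixed: bool = False,
                         serif: bool = False,
                         fraktur: bool = False) -> str:
    """Build one font_properties line: name plus five 0/1 style flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font_name] + [str(int(flag)) for flag in flags])


if __name__ == "__main__":
    # Prints the same line used in this post for a plain font.
    print(font_properties_line("orderNo"))
```

Writing the returned string (plus a newline) to a file named font_properties in the training directory should give the same file as the manual step.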

7. Clustering. Enter the command:

E:\Tesseract-ocr\cntraining.exe orderNo.tr

8. By now several files should have been generated in the directory. Add the prefix "orderNo." to these four files: unicharset, inttemp, normproto and pffmtable. Then enter the command:

E:\Tesseract-ocr\combine_tessdata.exe orderNo.

It prints output like this:

Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 1660
Offset for type 4 is 327545
Offset for type 5 is 327781
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1

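Make sure that the values on the 2nd, 4th, 5th and 6th "Offset" lines (types 1, 3, 4 and 5) are not -1. If so, the new dictionary has been generated.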
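The check on the combine_tessdata output can be automated. This is a small sketch, assuming the output has the "Offset for type N is M" shape shown above; the function names and the `REQUIRED_TYPES` constant are invented for this example, and the list of required types simply mirrors the four lines named in this post.

```python
import re
from typing import Dict, List, Sequence

# Types that this post says must not be -1 (2nd, 4th, 5th and 6th lines).
REQUIRED_TYPES = (1, 3, 4, 5)

OFFSET_RE = re.compile(r"Offset for type\s+(\d+)\s+is\s+(-?\d+)")


def parse_offsets(output: str) -> Dict[int, int]:
    # Copy-pasted logs sometimes turn "-" into an en dash or a minus sign,
    # so normalize those characters before matching.
    cleaned = output.replace("\u2013", "-").replace("\u2212", "-")
    return {int(t): int(off) for t, off in OFFSET_RE.findall(cleaned)}


def missing_components(output: str,
                       required: Sequence[int] = REQUIRED_TYPES) -> List[int]:
    """Return the required types whose offset is -1 or absent."""
    offsets = parse_offsets(output)
    return [t for t in required if offsets.get(t, -1) == -1]
```

An empty list from `missing_components` means all four required entries were written; any number in the list points at the file that was missing or misnamed when combine_tessdata ran.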

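Now copy the file "orderNo.traineddata" from this directory into the "tessdata" directory under the tesseract program directory.

After that you can use the new dictionary for recognition, for example: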

tesseract.exe test.jpg result -l orderNo

With the newly trained language, the recognition rate improved a lot.
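As a recap, the command-line steps above can be collected in one place. The following Python sketch only builds the argument lists and the step 8 rename pairs; it does not run anything. The function names are hypothetical, and the install directory and language name are the ones used in this post.

```python
from pathlib import PureWindowsPath
from typing import List, Tuple


def training_commands(tess_dir: str, lang: str) -> List[Tuple[str, List[str]]]:
    """Return (step label, argv) pairs in the order used in this post."""
    def exe(name: str) -> str:
        # PureWindowsPath joins with backslashes on any operating system.
        return str(PureWindowsPath(tess_dir) / name)

    return [
        ("make box file",
         [exe("tesseract.exe"), lang + ".tif", lang, "batch.nochop", "makebox"]),
        # The box file must be corrected by hand before the next command.
        ("run tesseract for training",
         [exe("tesseract.exe"), lang + ".tif", lang, "nobatch", "box.train"]),
        ("compute the character set",
         [exe("unicharset_extractor.exe"), lang + ".box"]),
        ("mftraining",
         [exe("mftraining.exe"), "-F", "font_properties",
          "-U", "unicharset", lang + ".tr"]),
        ("clustering",
         [exe("cntraining.exe"), lang + ".tr"]),
        # The four output files must be renamed before combining.
        ("combine",
         [exe("combine_tessdata.exe"), lang + "."]),
    ]


def rename_pairs(lang: str) -> List[Tuple[str, str]]:
    """Old and new names for the four files that get the language prefix."""
    names = ["unicharset", "inttemp", "normproto", "pffmtable"]
    return [(name, lang + "." + name) for name in names]
```

Each argv list could be passed to `subprocess.run` one at a time from the training directory, pausing after the first command to fix the box file in jTessBoxEditor and applying `rename_pairs` before the last one.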

Posted by lixin at 6:46 PM. Tagged with: ocr

28 Responses to "How to train tesseract-ocr"

luacloud says:

May 28, 2012 at 2:19 PM

Does it have the ability to learn?

by says:

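August 16, 2012 at 9:32 PM

Hello, my steps are the same as yours, but I cannot get past the mftraining step. Windows keeps popping up a message saying that mftraining.exe has stopped working. How can I fix this?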

E:\Tesseract-ocr3.01\build>..\mftraining -F font_properties -U unicharset cnlp.lpft.exp0.tr