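Tesseract download page:
https://code.google.com/p/tesseract-ocr/
The latest version at the time of writing is 3.02.
For the Windows build, download and unzip the archive, then open a command prompt, change into the extracted directory, and run the commands from there.
Command format: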
Usage:tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode] e...]

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.

-l lang and/or -psm pagesegmode must occur before anyconfigfile.

Single options:
-v --version: version info
--list-langs: list available languages for tesseract engine
Example command:
F:\Tesseract-OCR>tesseract.exe 2013-09-05_154628.jpg eng -l eng -psm 6
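When the same call has to be repeated for many images, it can help to build the command from a script. The following is only a minimal Python sketch, not part of Tesseract itself: the helper names build_ocr_command and run_ocr are hypothetical, and it assumes tesseract.exe can be found from the current directory or PATH.

```python
import subprocess

# Page segmentation modes accepted by -psm, per the usage text (0 through 10).
VALID_PSM = range(0, 11)


def build_ocr_command(image, outputbase, lang="eng", psm=3, exe="tesseract.exe"):
    """Return the argument list for one OCR run (hypothetical helper)."""
    if psm not in VALID_PSM:
        raise ValueError(f"psm must be between 0 and 10, got {psm}")
    # -l and -psm follow imagename and outputbase and precede any config file.
    return [exe, image, outputbase, "-l", lang, "-psm", str(psm)]


def run_ocr(image, outputbase, **options):
    """Run tesseract; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_ocr_command(image, outputbase, **options), check=True)
```

Calling build_ocr_command("2013-09-05_154628.jpg", "eng", lang="eng", psm=6) produces the same arguments as the example command above.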
Related commands:
Function | Command
- | ambiguous_words.exe
- | classifier_tester.exe
- | cntraining.exe
Combine training files | combine_tessdata.exe
- | dawg2wordlist.exe
- | mftraining.exe
- | shapeclustering.exe
Recognition program | tesseract.exe
- | unicharset_extractor.exe
- | wordlist2dawg.exe
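For the language data files that are required, see the source code:
tesseract-ocr\ccutil\tessdatamanager.h
Format requirements for the configuration files related to the language data: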
ASCII or UTF-8 encoding without BOM
Unix end-of-line marker ('\n')
The last character must be an end of line marker ('\n'). Some text editors will show this as an empty line at the end of file. If you omit this you will get an error message containing "last_char == '\n':Error:Assert failed..."
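These three requirements are easy to check before training starts. The function below is a minimal sketch under the assumption that the file has been read in binary mode; check_config_bytes is a hypothetical name, not a Tesseract tool.

```python
def check_config_bytes(data):
    """Return a list of format problems found in a config file's raw bytes."""
    problems = []
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
    try:
        data.decode("utf-8")  # ASCII is a subset of UTF-8
    except UnicodeDecodeError:
        problems.append("file is not valid ASCII/UTF-8")
    if b"\r" in data:
        problems.append("file contains carriage returns; use Unix '\\n' line endings")
    if not data.endswith(b"\n"):
        problems.append("last character is not '\\n'")
    return problems
```

A file could be checked with check_config_bytes(open(path, "rb").read()); an empty list means none of the three problems was found.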
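Steps:
1. Generate the training images
Some guidelines:
Make sure each character appears often enough: typically 10 times, 20 times for common characters, and 5 times for rare ones;
Do not put all the special characters together in one place; use combinations that are closer to real-world text;
Very important: keep some spacing between characters and between lines, otherwise training may fail. (This may have been fixed in versions after 3.0.)
Training data must be grouped by font: text in the same font goes into the same tiff file (multi-page tiffs are supported);
Unless the font is very small (less than 15px high), there is no need to train on different sizes;
Never mix multiple fonts in the same image file.
(See the sample boxtiff files on the download page.)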
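The frequency guideline can be checked on the training text before it is rendered to an image. This is only a rough sketch: undersampled_chars is a hypothetical helper, the default threshold of 10 comes from the guideline above, and characters that never appear in the text at all are not detected.

```python
from collections import Counter


def undersampled_chars(text, minimum=10):
    """Map each non-whitespace character seen fewer than `minimum` times to its count."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if n < minimum}
```

Running it once with minimum=20 on the common characters and once with minimum=5 on the rare ones would cover the other two thresholds.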
Next print and scan (or use some electronic rendering method) to create an image of your training page. Up to 32 training files can be used (of multiple pages). It is best to create a mix of fonts and styles (but in separate files), including italic and bold.
Generate the tiff file
2. Create the box file
Command to generate the box file:
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
Example:
tesseract eng.timesitalic.exp0.tif eng.timesitalic.exp0 batch.nochop makebox
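With several fonts and pages, the file names are easy to get wrong by hand. The sketch below derives the command from the [lang].[fontname].exp[num] naming pattern shown above; makebox_command is a hypothetical helper and assumes the tesseract executable is on PATH.

```python
def makebox_command(lang, fontname, num, exe="tesseract"):
    """Return the argument list that generates a box file for one training tiff."""
    base = f"{lang}.{fontname}.exp{num}"
    # The input is <base>.tif; the output base name is <base>.
    return [exe, f"{base}.tif", base, "batch.nochop", "makebox"]
```

The returned list can be passed directly to subprocess.run to execute the command.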
3. Obtain a new character set
Reference documentation:
The doc directory in the extracted archive contains the API documentation.
--end--