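Last time we looked at how to use the Tesseract OCR engine. It is a veteran OCR engine that is now open source, version 3.0 added Chinese OCR, and Google updates and maintains it, so it has a lot of potential and is worth watching. The previous test results also showed that Tesseract's output is not yet very good, especially for mixed Chinese and English text, where its recognition rate is limited. This time we turn to OneNote in Office 2010 and call its API to test its OCR feature.
PS: My manager at work keeps recommending MyBase for recording problems encountered on the job, work logs and so on, but I have stuck with OneNote :)
Test code download
Tested with Visual Studio 2010 Ultimate + OneNote 2010 x64.
Please credit the source when reposting: http://www.cnblogs.com/brooks-dotnet/archive/2010/10/07/1845313.html
1. New features in OneNote 2010: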
New features in 2010:
Gather, organize, and search
Sharing and universal access
- Access from anywhere:
  - Share on the Web
  - View and edit in a browser
  - Sync notes to OneNote Mobile
- Share notes:
  - Unread changes are highlighted
  - See author initials
  - Version history
  - Find recent edits
  - Find edits by author
  - Faster sync with SharePoint
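Examples:
![](http://static.javashuo.com/static/loading.gif)
Organize topics using subpages. Drag tabs to indent and organize pages within a section.
![](http://static.javashuo.com/static/loading.gif)
Keep notes visible during other tasks. OneNote will link notes to documents and Web pages you view.
View > ![](http://static.javashuo.com/static/loading.gif)
![](http://static.javashuo.com/static/loading.gif)
What's new in a shared notebook? Unread changes are shown automatically.
![](http://static.javashuo.com/static/loading.gif)
What notes are teammates working on?
Share > ![](http://static.javashuo.com/static/loading.gif) ![](http://static.javashuo.com/static/loading.gif)
![](http://static.javashuo.com/static/loading.gif)
Select location when sending to OneNote, when sending from Outlook or Internet Explorer.
![](http://static.javashuo.com/static/loading.gif)
Link to information for yourself and others.
Insert > ![](http://static.javashuo.com/static/loading.gif) or type [[page name]]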
More Resources Online
Videos, templates, training, help, and discussion groups.
Microsoft® OneNote® 2010 Guide Notebook
Copyright © 2009 Microsoft Corporation. All rights reserved.
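2. The OCR feature in OneNote
A fellow cnblogs blogger, 斯克迪亚, wrote a post long ago that explained in detail how to use the OCR feature from the GUI. I read it at the time and wanted to drive OneNote's OCR from code. I then got busy with other things and never followed up. During the National Day holiday I happened to be looking for OCR tools and remembered the idea, so after a lot of searching and testing I finally have a modest result today that I would like to share. The program still has plenty of problems, and criticism is welcome.
2.1 One detail about OneNote's OCR needs mentioning first: an image copied into OneNote from the web cannot be OCRed, whereas an image inserted from the local disk can:
The local image is on the left and the web image on the right. For the web image, the [Copy Text from Picture] item in the context menu is greyed out and cannot be clicked.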


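2.2 The UI of the small program I wrote in WPF is shown below. It is identical to last time's TesseractGUI: the shell is the same and only the engine underneath has changed.
On the left you select an image, with preview, zoom and pan for both local and web images. On the right you select the output directory and see the OCR result: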

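2.3 An article in MSDN Magazine that introduces the OneNote 2010 object model gave me a lot of ideas, and it is worth a careful read if you are interested. There is also a managed OneNote object model project on CodePlex, ONOM, which wraps the OneNote PIA more conveniently. Two things to watch out for when you create the WPF project and add the reference:
First, because of a mismatch with the OneNote interop assembly that ships with Visual Studio 2010, you should not reference the Microsoft.Office.Interop.OneNote component directly on the ".NET" tab of the "Add Reference" dialog. Reference the Microsoft OneNote 14.0 Type Library component on the "COM" tab instead. This still adds a OneNote interop assembly to the project's references.
Second, the OneNote 14.0 type library is not compatible with the Visual Studio 2010 "NOPIA" feature, under which primary interop assemblies are not embedded in the application by default. So make sure to set the "Embed Interop Types" property of the OneNote interop assembly reference to False.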

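2.4 OneNote does not describe its content in the OpenXML format. It uses plain XML instead. The XML describing a complete page looks like this:
2.5 OneNote's OCR mechanism works as follows: when an image is inserted, OneNote OCRs it automatically and writes the result into the page structure as XML. Our approach is therefore to insert an image into OneNote from code and then extract the OCR result,
that is, the content of the [OCRText] tag: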
<one:OCRText>
<![CDATA[ew features in One Note 14
Gather, organize, and search
. Improved organization inside sections
O Multi-level subpages
o Collapsing subpages
O Drag-dropto make a subpage
o In-place New Page button
. Updated Search that's faster than navigating
. Improved hyperlinkingof notes, wiki links
. Outlook and lE integration improvements
O Section picker when sendingtoOneNote
o Notes on Outlook tasks
. QuIck Styles for making headings
. Linked note-taking on Web pages and documents
. Math support
. Dock to Desktop mode
Sharing and universal access
. Access from everywhere
O Share on the Web
O Browser access
O OneNote Mobile- syncs with the Web
. Sharing enhancements:
O Unread highlighting
O Author marks
O Recent Edits
O Find by Author
o VersioningandRecycleBin
O Faster sync with SharePoint
. Improved OneNoteMobile for mobile devices
O Sync overtheair
O Sync selected notebooks or sections]]>
</one:OCRText>
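2.6 Now let's work through the steps one by one. I will not go over the XAML for the UI again (download the source code if you are interested) and will focus on the business logic.
Images are stored in OneNote XML as Base64-encoded data, so the first step is to Base64-encode the image to be inserted:
// Get the Base64 encoding of the image
FileInfo file = new FileInfo(v_strImgPath);
using (MemoryStream ms = new MemoryStream())
{
    Bitmap bp = new Bitmap(v_strImgPath);
    switch (file.Extension.ToLower())
    {
        case ".jpg":
        case ".jpeg":
            bp.Save(ms, ImageFormat.Jpeg);
            break;
        case ".gif":
            bp.Save(ms, ImageFormat.Gif);
            break;
        case ".bmp":
            bp.Save(ms, ImageFormat.Bmp);
            break;
        case ".tiff":
            bp.Save(ms, ImageFormat.Tiff);
            break;
        case ".png":
            bp.Save(ms, ImageFormat.Png);
            break;
        case ".emf":
            bp.Save(ms, ImageFormat.Emf);
            break;
        default:
            this.labMsg.Content = "Unsupported image format.";
            return;
    }
    // ToArray() returns only the bytes actually written; GetBuffer() would also
    // include the unused tail of the stream's internal buffer.
    byte[] buffer = ms.ToArray();
    string _Base64 = Convert.ToBase64String(buffer);
    // The using block stays open here: the following steps still use bp and _Base64.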
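The encoding step can also be isolated into a small helper. The following is a minimal sketch, not the code from the download; the class name `ImageBase64` and both method names are made up for illustration. `FromStream` encodes exactly the bytes written to a `MemoryStream`, and `FromFile` encodes a file as it is on disk, which is only appropriate under the assumption that the file is already in a format OneNote accepts.

```csharp
using System;
using System.IO;

// Hypothetical helper, named for illustration only.
public static class ImageBase64
{
    // Encodes exactly the bytes that were written to the stream.
    // ToArray() copies only the used part of the stream, whereas
    // GetBuffer() exposes the whole internal buffer, which can be larger.
    public static string FromStream(MemoryStream ms)
    {
        return Convert.ToBase64String(ms.ToArray());
    }

    // Encodes a file as-is, without decoding and re-encoding the image.
    // Assumption: the file is already in a format the consumer accepts.
    public static string FromFile(string path)
    {
        return Convert.ToBase64String(File.ReadAllBytes(path));
    }
}
```

In the walkthrough above, `ImageBase64.FromStream(ms)` would stand in for the two lines that produce `_Base64`.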
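2.7 Start building the OneNote XML for the page that will contain the image, by looking up an existing page to write to:
var onenoteApp = new Microsoft.Office.Interop.OneNote.Application();
string notebookXml;
// Fetch the notebook hierarchy down to page level
onenoteApp.GetHierarchy(null, Microsoft.Office.Interop.OneNote.HierarchyScope.hsPages, out notebookXml);
var doc = XDocument.Parse(notebookXml);
var ns = doc.Root.Name.Namespace;
// Use the first page found as the target
var pageNode = doc.Descendants(ns + "Page").FirstOrDefault();
if (pageNode == null)
{
    // FirstOrDefault() returns null when no page exists
    this.labMsg.Content = "No OneNote page found.";
    return;
}
var existingPageId = pageNode.Attribute("ID").Value;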
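2.8 One small detail here: OneNote XML supports only the following image formats: auto|png|emf|jpg, so the image format has to be mapped accordingly:
// Extension without the leading dot, e.g. "jpg"
string ImgExtension = file.Extension.ToLower().Substring(1);
switch (ImgExtension)
{
    case "jpg":
    case "png":
    case "emf":
        // Already one of the supported values; keep it as is
        break;
    default:
        // Everything else (gif, bmp, tiff, jpeg, ...) falls back to "auto"
        ImgExtension = "auto";
        break;
}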
2.9 This is the key piece of code. It uses LINQ to XML to construct the OneNote XML for the page with the image inserted:
var page = new XDocument(new XElement(ns + "Page",
    new XElement(ns + "Outline",
        new XElement(ns + "OEChildren",
            new XElement(ns + "OE",
                new XElement(ns + "Image",
                    new XAttribute("format", ImgExtension), new XAttribute("originalPageNumber", "0"),
                    new XElement(ns + "Position",
                        new XAttribute("x", "0"), new XAttribute("y", "0"), new XAttribute("z", "0")),
                    new XElement(ns + "Size",
                        new XAttribute("width", bp.Width.ToString()), new XAttribute("height", bp.Height.ToString())),
                    new XElement(ns + "Data", _Base64)))))));
// Target the existing page, then push the new content to OneNote
page.Root.SetAttributeValue("ID", existingPageId);
onenoteApp.UpdatePageContent(page.ToString(), DateTime.MinValue);
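To make the shape of that XML easier to see and to test without OneNote running, the construction can be factored into a pure function. This is only a sketch under assumptions: the class name `OneNotePageXml` and the method `BuildImagePage` are hypothetical, and the element and attribute names simply mirror the snippet above rather than a verified reading of the schema.

```csharp
using System.Xml.Linq;

// Hypothetical helper, named for illustration only.
public static class OneNotePageXml
{
    // Builds Page > Outline > OEChildren > OE > Image with the same
    // attributes and child elements as the walkthrough's snippet.
    public static XDocument BuildImagePage(
        XNamespace ns, string pageId, string format,
        int width, int height, string base64Data)
    {
        var image = new XElement(ns + "Image",
            new XAttribute("format", format),
            new XAttribute("originalPageNumber", "0"),
            new XElement(ns + "Position",
                new XAttribute("x", "0"),
                new XAttribute("y", "0"),
                new XAttribute("z", "0")),
            new XElement(ns + "Size",
                new XAttribute("width", width),
                new XAttribute("height", height)),
            new XElement(ns + "Data", base64Data));

        var page = new XElement(ns + "Page",
            new XAttribute("ID", pageId),
            new XElement(ns + "Outline",
                new XElement(ns + "OEChildren",
                    new XElement(ns + "OE", image))));

        return new XDocument(page);
    }
}
```

The result of `BuildImagePage(ns, existingPageId, ImgExtension, bp.Width, bp.Height, _Base64).ToString()` could then be passed to `UpdatePageContent` exactly as above.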
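2.10 Put the thread to sleep for a few seconds to wait for OCR to finish. How long OneNote's OCR takes depends on the size of the image:
// Sleep time in milliseconds. For a very large image, increase it so that OneNote has finished OCR.
System.Threading.Thread.Sleep(Int32.Parse(System.Configuration.ConfigurationManager.AppSettings["WaitTIme"]));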
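A fixed sleep has to be tuned for the largest image. One possible alternative, which is my suggestion and not something the original program does, is to poll the page XML until an `OCRText` element shows up or a timeout expires. The sketch below takes the fetch operation as a delegate so that it does not depend on OneNote; `OcrPolling` and `WaitForOcrText` are hypothetical names. Note the limitation that it returns the first `OCRText` it finds on the page, so on a page that already holds OCRed images it could return an older result.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Xml.Linq;

// Hypothetical helper, named for illustration only.
public static class OcrPolling
{
    // Calls fetchPageXml repeatedly until the returned XML contains an
    // OCRText element (in the root element's namespace) or the timeout
    // elapses. Returns the OCR text, or null if none appeared in time.
    public static string WaitForOcrText(
        Func<string> fetchPageXml, TimeSpan timeout, TimeSpan interval)
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        while (true)
        {
            XElement page = XElement.Parse(fetchPageXml());
            XElement ocr = page
                .Descendants(page.Name.Namespace + "OCRText")
                .FirstOrDefault();
            if (ocr != null)
            {
                return ocr.Value;
            }
            if (DateTime.UtcNow >= deadline)
            {
                return null;
            }
            Thread.Sleep(interval);
        }
    }
}
```

In the program above, the delegate would wrap the `GetPageContent` call used in the next step, for example `() => { string xml; onenoteApp.GetPageContent(existingPageId, out xml, Microsoft.Office.Interop.OneNote.PageInfo.piAll); return xml; }`.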
2.11 To make the OCR result easier to extract, fetch the page XML from OneNote and write it to a temporary XML file:
string pageXml;
onenoteApp.GetPageContent(existingPageId, out pageXml, Microsoft.Office.Interop.OneNote.PageInfo.piAll);
// Save the page content, which now includes the OCR result
FileStream tmpXml = new FileStream(System.Configuration.ConfigurationManager.AppSettings["tmpPath"] + @"\tmp.xml", FileMode.Create, FileAccess.ReadWrite);
StreamWriter sw = new StreamWriter(tmpXml);
sw.Write(pageXml);
sw.Flush();
sw.Close();
tmpXml.Close();
2.12 Use LINQ to XML and XPath expressions to extract the OCR result:
FileStream tmpOnenote = new FileStream(System.Configuration.ConfigurationManager.AppSettings["tmpPath"] + @"\tmp.xml", FileMode.Open, FileAccess.ReadWrite);
XmlReader reader = XmlReader.Create(tmpOnenote);
XElement rdlc = XElement.Load(reader);
// Register the "one" prefix so it can be used in XPath expressions
XmlNameTable nameTable = reader.NameTable;
XmlNamespaceManager mgr = new XmlNamespaceManager(nameTable);
mgr.AddNamespace("one", ns.ToString());
StringReader sr = new StringReader(pageXml);
XElement onenote = XElement.Load(sr);
// ".//" searches below the current Image only; a leading "//" would search the
// whole page and return the same first OCRText for every image.
var xml = from o in onenote.XPathSelectElements("//one:Image", mgr)
          let ocr = o.XPathSelectElement(".//one:OCRText", mgr)
          where ocr != null
          select ocr.Value;
this.txtOCRed.Text = xml.First();
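The same extraction can be written with LINQ to XML alone, working directly on the XML string returned by `GetPageContent`, so that neither the temporary file nor the XPath namespace manager is needed. This is a minimal sketch and not the code from the download; `OcrTextExtractor` and `ExtractPerImage` are hypothetical names, and it assumes the `OCRText` element lives somewhere below its `Image` element in the same namespace as the page root.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, named for illustration only.
public static class OcrTextExtractor
{
    // Returns the OCR text of every Image element that has one, in
    // document order. Images without OCR output are skipped.
    public static List<string> ExtractPerImage(string pageXml)
    {
        XElement page = XElement.Parse(pageXml);
        XNamespace ns = page.Name.Namespace;

        return page.Descendants(ns + "Image")
            // Look only below this particular image.
            .Select(img => img.Descendants(ns + "OCRText").FirstOrDefault())
            .Where(ocr => ocr != null)
            .Select(ocr => ocr.Value)
            .ToList();
    }
}
```

Returning one entry per image keeps the results apart when a page holds several pictures. Picking the last entry would be one way to target the most recently added image, but that is an assumption about where OneNote places new content.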
2.13 Release the resources that were in use:
sr.Close();
reader.Close();
tmpOnenote.Close();
2.14 Finally, write the OCR result to the output file:
FileStream fs = new FileStream(this.__OutputFileName, FileMode.Create, FileAccess.ReadWrite);
StreamWriter sw = new StreamWriter(fs);
sw.Write(this.txtOCRed.Text);
sw.Flush();
sw.Close();
fs.Close();
this.labMsg.Content = "OCR succeeded.";
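I have the English x64 edition of OneNote 2010 installed and could not find a Chinese language pack, so for now only English OCR is tested.
2.15 Test result for a local image: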

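2.16 Test result for a web image:
A web image is first downloaded to the local disk. The remaining steps are the same as for a local image.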

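Summary
The strengths of this approach are that it is efficient and easy to extend: changing the configuration file and the LINQ to XML code is enough to support a lot of additional work.
The weaknesses are that OneNote must be installed on the client and at least one page must be open, and that during OCR there is no way to tell which image is the one being processed, so running it several times in a row produces muddled results.
In addition, I did not find a way to create a OneNote document from code, and I do not yet know enough about the OneNote XML schema to generate some elements programmatically, such as ObjectID.
All in all, OneNote 2010's OCR is very good: compared with Tesseract, both accuracy and speed are more than a notch higher. However, the OneNote 2010 API is very bare-bones and far less convenient than those of Word or Excel, and the official documentation on the OneNote 2010 XML schema is still not very detailed and is short on examples. Hopefully Office 15 and OneNote 2014 will improve on this. That concludes this introduction to OCR. Anyone interested is welcome to continue the discussion.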