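I won't go into much detail about POI here; the official site is https://poi.apache.org/ if you want to browse.
First, work out how a doc/docx document gets converted to html/htm. According to the POI documentation, the POI API for the doc format is HWPF, and for the docx format it is XWPF. This article is a good reference and explains the format conversion clearly: http://www.open-open.com/lib/view/open1389594797523.html
So the overall approach is: depending on the document type, convert doc files with HWPF objects and docx files with XWPF objects.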
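As a minimal sketch of that dispatch step, the helper below maps a file name to the POI component that would handle it. The names `WordFormatDetector`, `WordFormat` and `detect` are hypothetical and not part of POI, and the check looks only at the file extension, which is an assumption (it never inspects the file content).

```java
import java.util.Locale;

/** Hypothetical helper (not part of POI): picks a converter family from the file extension. */
final class WordFormatDetector {

    /** Which POI component would handle the file. */
    enum WordFormat { DOC_HWPF, DOCX_XWPF, UNSUPPORTED }

    static WordFormat detect(String fileName) {
        if (fileName == null) {
            return WordFormat.UNSUPPORTED;
        }
        // Use the LAST dot so names such as "report.v2.docx" work
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return WordFormat.UNSUPPORTED;
        }
        String suffix = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (suffix) {
            case "doc":
                return WordFormat.DOC_HWPF;   // would be handled with HWPFDocument
            case "docx":
                return WordFormat.DOCX_XWPF;  // would be handled with XWPFDocument
            default:
                return WordFormat.UNSUPPORTED;
        }
    }

    public static void main(String[] args) {
        System.out.println(detect("report.v2.DOCX")); // prints DOCX_XWPF
    }
}
```

Checking the extension is cheap but easy to fool; a renamed file would still fail later when POI tries to open it.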
For reference, here is the online API documentation: http://poi.apache.org/apidocs/index.html. I have to say it is quite tedious to read, especially the XWPF part.
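1. Handling doc.
This one is relatively simple and a search turns up plenty of examples; my code is also based on what I found online, with my own optimizations and logic added.
POI has supported doc processing for a long time, so there is a lot of material.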
The idea is: load the file stream into an HWPFDocument object -> let a WordToHtmlConverter object process the HWPFDocument and pre-process the page's resources (mainly pictures) -> take the org.w3c.dom.Document that WordToHtmlConverter builds as the DOM result -> write it out to a file.
The documentation for the converter says:
Converts Word files (95-2007) into HTML files. This implementation doesn't create images or links to them. This can be changed by overriding AbstractWordConverter.processImage(Element, boolean, Picture) method.
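One benefit here is that the result is a Document object, so the encoding, output format and so on can be set when it is serialized, which takes care of those problems.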
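To illustrate why the DOM step helps, here is a minimal sketch that uses only JDK classes to serialize a Document as HTML in an explicit encoding. The class `DomToHtmlSketch` and its methods are hypothetical names, and the tiny hand-built document merely stands in for what the converter would produce; the text node holds the font name "微软雅黑" written as Unicode escapes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical sketch: DOM -> HTML bytes with an explicit encoding. */
final class DomToHtmlSketch {

    /** Serializes a DOM tree as HTML bytes in the given encoding. */
    static byte[] toHtml(Document doc, String encoding) throws Exception {
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, encoding); // byte encoding of the output
        serializer.setOutputProperty(OutputKeys.METHOD, "html");     // HTML rather than XML rules
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    /** Builds a tiny stand-in document; in the real flow the converter supplies it. */
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent("\u5fae\u8f6f\u96c5\u9ed1"); // non-ASCII sample text
        body.appendChild(p);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = toHtml(sampleDocument(), "UTF-8");
        // Decode with the same charset that was used for serialization
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The point of the design is that the bytes and the declared encoding come from the same setting, as long as whoever reads the bytes back also decodes them with that charset.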
Because the process is simple, here is a simple demo; the comments explain it:
    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.List;

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;

    import org.apache.commons.io.FileUtils;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.converter.PicturesManager;
    import org.apache.poi.hwpf.converter.WordToHtmlConverter;
    import org.apache.poi.hwpf.usermodel.Picture;
    import org.apache.poi.hwpf.usermodel.PictureType;
    import org.w3c.dom.Document;

    public class POIForeViewUtil {

        public void parseDocx2Html() throws Throwable {
            final String path = "F:\\";
            final String file = "xxxxxxx.doc";
            // Take the file extension (the text after the last dot)
            String suffix = file.substring(file.lastIndexOf(".") + 1);

            // Create the WordToHtmlConverter and prepare it for pictures and other resources
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
            wordToHtmlConverter.setPicturesManager(new PicturesManager() {
                public String savePicture(byte[] content, PictureType pictureType,
                        String suggestedName, float widthInches, float heightInches) {
                    // The returned value is used as the picture reference in the HTML
                    return suggestedName;
                }
            });

            if ("doc".equals(suffix.toLowerCase())) { // doc
                try (InputStream input = new FileInputStream(path + file)) {
                    HWPFDocument wordDocument = new HWPFDocument(input);
                    wordToHtmlConverter.processDocument(wordDocument);
                    // Handle the pictures: this code writes them into the same directory as the HTML file
                    List<Picture> pics = wordDocument.getPicturesTable().getAllPictures();
                    if (pics != null) {
                        for (Picture pic : pics) {
                            try (OutputStream picOut = new FileOutputStream(path + pic.suggestFullFileName())) {
                                pic.writeImageContent(picOut);
                            } catch (FileNotFoundException e) {
                                e.printStackTrace();
                            }
                        }
                    }
                }
            }

            // Convert the DOM into HTML
            Document htmlDocument = wordToHtmlConverter.getDocument();
            ByteArrayOutputStream outStream = new ByteArrayOutputStream();
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(outStream);
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer serializer = tf.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8"); // output encoding
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");     // indent with whitespace
            serializer.setOutputProperty(OutputKeys.METHOD, "html");    // output type
            serializer.transform(domSource, streamResult);
            outStream.close();
            // Decode with the same charset used above rather than the platform default
            String content = new String(outStream.toByteArray(), "utf-8");
            FileUtils.writeStringToFile(new File(path, "interface.html"), content, "utf-8");
        }

        public static void main(String[] args) throws Throwable {
            new POIForeViewUtil().parseDocx2Html();
        }
    }
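Now for the second case.
2. Handling docx.
docx is the Word 2007 format and is much harder to handle. POI's handling of docx does not seem to be as convenient as for doc, and styles and the like all have problems. The two most obvious problems I ran into were the encoding of font names and the display of table borders.
The idea: load the file stream into an XWPFDocument -> use XHTMLOptions to handle the page resources (mainly pictures) -> write the file directly through an OutputStream.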
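The code is quite simple, but the simpler it is, the further the result falls short of expectations. The font names in the output file appear to be encoded as GBK by default; for example, my "微软雅黑" font turned into "寰蒋闆呴粦", and the generated nodes do not display as well as with the doc conversion either.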
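One way to investigate garbling like this is to reproduce it in isolation. The sketch below is only an illustration under assumptions: it guesses that the text was written as UTF-8 and read back as GBK, which the demo itself does not confirm. The class `MojibakeSketch` and the method `misread` are hypothetical names, and the font name is written with Unicode escapes so the source stays ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: shows what a charset mismatch does to a font name. */
final class MojibakeSketch {

    /** Encodes text with one charset and decodes the bytes with another. */
    static String misread(String text, Charset written, Charset readAs) {
        return new String(text.getBytes(written), readAs);
    }

    public static void main(String[] args) {
        String fontName = "\u5fae\u8f6f\u96c5\u9ed1"; // the font name from the text above
        Charset gbk = Charset.forName("GBK");         // assumes this JVM provides the GBK charset

        // Mismatch: UTF-8 bytes decoded as GBK
        System.out.println(misread(fontName, StandardCharsets.UTF_8, gbk));
        // Match: the same charset on both sides gives the original text back
        System.out.println(misread(fontName, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
    }
}
```

If the first printed line resembles the garbled font name, the file was probably written as UTF-8 and decoded as GBK somewhere along the way, so the thing to check would be how the result is read or displayed rather than how it is written.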
Here is the demo code again:
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Note: the org.apache.poi.xwpf.converter.* classes come from the separate
    // xdocreport project, not from the POI distribution itself.
    import org.apache.poi.xwpf.converter.core.FileImageExtractor;
    import org.apache.poi.xwpf.converter.core.FileURIResolver;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    public class Word07ToHtml {

        public static void parseToHtml() throws IOException {
            File f = new File("F:/xxxxx.docx");
            if (!f.exists()) {
                System.out.println("Sorry File does not Exists!");
            } else if (f.getName().toLowerCase().endsWith(".docx")) {
                // 3) The output stream is opened here and the XHTML is written to it below
                try (InputStream in = new FileInputStream(f);
                     OutputStream out = new FileOutputStream(new File("F:/result.html"))) {
                    // 1) Load the file into an XWPFDocument
                    XWPFDocument document = new XWPFDocument(in);

                    // 2) Set up the XHTML options (pictures and similar files are
                    //    placed in a generated "word/media" directory)
                    File imageFolderFile = new File("f:/opt");
                    XHTMLOptions options = XHTMLOptions.create().URIResolver(
                            new FileURIResolver(imageFolderFile));
                    options.setExtractor(new FileImageExtractor(imageFolderFile));
                    //options.setIgnoreStylesIfUnused(false);
                    //options.setFragment(true);

                    // 3) Convert the XWPFDocument to XHTML and write the file.
                    //    The options must be passed in; the original demo passed null,
                    //    so the picture settings above were never used.
                    XHTMLConverter.getInstance().convert(document, out, options);
                }
            } else {
                System.out.println("Enter only MS Office 2007+ files");
            }
        }

        public static void main(String args[]) {
            try {
                //String string = new String("寰蒋闆呴粦".getBytes("GBK"), "UTF-8");
                //System.out.println(string);
                parseToHtml();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
Both demos have since been moved out of the project, so there are no screenshots.
Download link for the POI jars:
https://archive.apache.org/dist/poi/release/bin/poi-bin-3.9-20121203.zip