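Problem description:
When converting a txt file to PDF with OpenOffice, the Chinese text came out garbled.
Solution approach and process:
1. Find the cause of the garbled text
Reading the jodconverter source showed that only UTF-8 encoded text is converted without garbling the Chinese characters.
2. How to convert a non-UTF-8 file to UTF-8
Before converting, first determine the txt file's own encoding. Checking showed that a txt file starts with a header that identifies its encoding.
The detection method is as follows: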
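import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

/**
 * Returns the encoding of a file, guessed from its first two bytes.
 *
 * @param filePath path of the file to inspect
 * @return the charset name; GB2312 when no known header is found
 * @throws IOException if the file cannot be read
 */
public static String getCharset(String filePath) throws IOException {
    // try-with-resources closes the stream even if read() throws
    try (BufferedInputStream bin = new BufferedInputStream(new FileInputStream(filePath))) {
        // Combine the first two bytes into one value for the switch
        int p = (bin.read() << 8) + bin.read();
        String code;
        switch (p) {
            case 0xefbb: // first two bytes of the UTF-8 header EF BB BF
                code = "UTF-8";
                break;
            case 0xfffe: // FF FE
                code = "Unicode";
                break;
            case 0xfeff: // FE FF
                code = "UTF-16";
                break;
            default:     // no header: assume GB2312 (a UTF-8 file without a header also lands here)
                code = "GB2312";
        }
        System.out.println(code);
        return code;
    }
}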
The conversion code is as follows:
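import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

/**
 * Writes a text file in the given encoding, overwriting it if it exists.
 *
 * @param file          the file to write
 * @param toCharsetName the target encoding
 * @param content       the file content
 * @throws Exception if the encoding is unsupported or writing fails
 */
public static void saveFile2Charset(File file, String toCharsetName, String content) throws Exception {
    if (!Charset.isSupported(toCharsetName)) {
        throw new UnsupportedCharsetException(toCharsetName);
    }
    // try-with-resources flushes and closes the writer and the underlying stream
    try (OutputStreamWriter outWrite =
             new OutputStreamWriter(new FileOutputStream(file), toCharsetName)) {
        outWrite.write(content);
    }
}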
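Testing showed that the converted file was still detected as GBK from its header, because no header had been written. The BOM has to be written to the start of the file by hand.
The code is as follows: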
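import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

/**
 * Writes a text file in the given encoding, overwriting it if it exists.
 *
 * @param file          the file to write
 * @param toCharsetName the target encoding
 * @param content       the file content
 * @throws Exception if the encoding is unsupported or writing fails
 */
public static void saveFile2Charset(File file, String toCharsetName, String content) throws Exception {
    if (!Charset.isSupported(toCharsetName)) {
        throw new UnsupportedCharsetException(toCharsetName);
    }
    try (OutputStream outputStream = new FileOutputStream(file)) {
        // Add the file header. EF BB BF is only valid for UTF-8,
        // so it is skipped for any other target encoding.
        if (StandardCharsets.UTF_8.equals(Charset.forName(toCharsetName))) {
            outputStream.write(new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
        }
        OutputStreamWriter outWrite = new OutputStreamWriter(outputStream, toCharsetName);
        outWrite.write(content);
        outWrite.flush(); // push buffered characters out before the stream closes
    }
}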
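The post does not show the step that reads the source file before saveFile2Charset is called. The sketch below is one way to tie the steps together, under stated assumptions: the class TxtToUtf8 and its convert method are hypothetical names introduced here, the source charset is whatever the detection step returned, and the whole file is assumed to fit in memory.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not part of the original post or of jodconverter. */
final class TxtToUtf8 {

    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /**
     * Decodes src with sourceCharset and rewrites it to dst as UTF-8,
     * preceded by the EF BB BF header.
     */
    static void convert(Path src, String sourceCharset, Path dst) throws IOException {
        // Decode the raw bytes with the detected source encoding.
        String content = new String(Files.readAllBytes(src), Charset.forName(sourceCharset));
        // Java's UTF-8 decoder keeps a leading BOM as U+FEFF; drop it so the
        // output does not end up with two headers.
        if (content.startsWith("\uFEFF")) {
            content = content.substring(1);
        }
        try (OutputStream out = Files.newOutputStream(dst)) {
            out.write(UTF8_BOM);
            out.write(content.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

A call might look like TxtToUtf8.convert(Paths.get("in.txt"), getCharset("in.txt"), Paths.get("out.txt")), with the output file then passed to the PDF conversion.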
Tested with source files in the following encodings:
GB2312
Unicode
UTF-16
UTF-8
All converted successfully.
Notes on txt encodings and file headers
Mapping between Java encoding names and txt encoding names:

java    | txt
unicode | unicode big endian
utf-8   | utf-8
utf-16  | unicode
gb2312  | ANSI
What is a BOM
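A BOM (byte-order mark) is a special marker inserted at the start of a Unicode file encoded as UTF-8, UTF-16, or UTF-32. It identifies the encoding of the file. For UTF-8 the BOM is optional: its purpose is to mark the encoding and the byte order (big-endian or little-endian) of multi-byte encodings, and UTF-8 has no byte-order ambiguity.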
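Because the BOM is optional for UTF-8, Java's UTF-8 encoder does not write one, which is consistent with the converted file carrying no header until it is added by hand. The small sketch below prints the bytes that three standard Java charsets produce for the letter "A". BomDemo and hex are illustrative names, not from the original post.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative demo class; the name is an assumption, not from the post. */
final class BomDemo {

    /** Formats bytes as space-separated upper-case hex, e.g. "FE FF 00 41". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF-8: no BOM is written
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));    // 41
        // UTF-16: the encoder writes a big-endian BOM first
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // UTF-16LE: explicit byte order, so no BOM
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```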
BOM file headers:
00 00 FE FF = UTF-32, big-endian
FF FE 00 00 = UTF-32, little-endian
EF BB BF = UTF-8
FE FF = UTF-16, big-endian
FF FE = UTF-16, little-endian
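The header list above can be turned into a lookup that checks up to four leading bytes instead of two. This is a minimal sketch, not the post's method: BomSniffer and detect are hypothetical names, and the returned strings are Java charset names chosen here for illustration. The UTF-32 little-endian header is tested before the UTF-16 little-endian one because FF FE 00 00 also begins with FF FE.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical BOM lookup based on the header list; not from the original post. */
final class BomSniffer {

    /** Returns the encoding named by the BOM at the start of head, or null if there is none. */
    static String detect(byte[] head) {
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE"; // before UTF-16LE
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: the caller must fall back to a default such as GB2312
    }

    /** Reads at most the first four bytes of the file and inspects them. */
    static String detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[4];
            int n = 0;
            int r;
            while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
                n += r;
            }
            return detect(Arrays.copyOf(head, n));
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) { // mask to compare as unsigned
                return false;
            }
        }
        return true;
    }
}
```

One known limitation of this ordering: a UTF-16 little-endian file whose first character is U+0000 would be reported as UTF-32LE, which is rare in ordinary text files.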
Note: jodconverter 2.2.1 does not support converting docx, xlsx, or ppt files to PDF.