How to guess a file's encoding when reading it without being told the encoding

Date: 2019-12-28


Introduction

When reading text from a file stream, we have to specify the encoding; otherwise what we read may come out garbled. Suppose our code looks like this:

private static void m1(){
    try(FileInputStream fileInputStream = new FileInputStream("D:\\每日摘录.txt")) {
        byte[] bytes = FileCopyUtils.copyToByteArray(fileInputStream);
        System.out.println(new String(bytes));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

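Running the code above sometimes "luckily" produces the right result. new String(byte[]) decodes with the platform default charset, so if that default is UTF-8 and the file happens to be UTF-8 as well, the code works without a problem. But if the file is encoded in GBK, what we get back is a pile of garbage.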

The fix is simple: specify the character encoding explicitly.

new String(bytes, "GBK");
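As a minimal sketch of that fix using only the JDK (the class name ExplicitCharsetRead and the helper decode are made up for illustration, and the JVM is assumed to ship the GBK charset), the same bytes are decoded once with the right charset and once with the wrong one:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetRead {

    // Hypothetical helper: decode raw bytes with a named charset instead of the platform default.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) throws IOException {
        // "\u4e2d\u6587" is a two-character Chinese string; write it to a temp file as GBK.
        Path tmp = Files.createTempFile("gbk-demo", ".txt");
        Files.write(tmp, "\u4e2d\u6587".getBytes(Charset.forName("GBK")));

        byte[] bytes = Files.readAllBytes(tmp);
        System.out.println(decode(bytes, "GBK"));   // decoded with the charset the file was written in
        System.out.println(decode(bytes, "UTF-8")); // wrong charset: malformed bytes become U+FFFD
        Files.delete(tmp);
    }
}
```

Passing a Charset object rather than a charset name also avoids the checked UnsupportedEncodingException that new String(bytes, "GBK") declares.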

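When we are told the file's encoding, the problem above is easy to solve. But what if nobody tells us the encoding? How do we read the file correctly then? One workable approach is to guess the encoding.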

Ways to guess a file's encoding

There are several ways to "guess" a usable encoding for a file. Note, however, that none of them guarantees a correct answer; some are more accurate than others. The main approaches are:

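Checking the first three bytes of the file: some encodings leave a marker at the start of the file. For example, a UTF-8 text file saved with a byte order mark begins with the bytes -17, -69, -65 (0xEF 0xBB 0xBF read as signed Java bytes). This approach is clearly limited, since many files carry no such marker, and its accuracy is low, so it is not recommended.
Checking for special characters: files in certain encodings contain characteristic byte values, so the encoding can be guessed by looking for those values. This method is not very accurate either and is not recommended.
Using the cpdetector library: cpdetector is an open-source tool for detecting document encodings, and it can detect the encoding of XML and HTML documents. It guesses on a statistical basis, but it does not guarantee a correct result either.
Using the ICU4J library: ICU's guessing logic draws on character set data that IBM has collected over the past decades, and it is in principle statistical as well. Its results are fairly accurate, so this is the recommended approach.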
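The first approach in the list can be sketched with the JDK alone. The class name BomSniffer and the method detectByBom are illustrative, not part of any library, and only the UTF-8 and UTF-16 byte order marks are checked (a UTF-32LE mark also starts with 0xFF 0xFE, which this sketch does not distinguish):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {

    // Returns the charset implied by a leading byte order mark, or null if none is found.
    static Charset detectByBom(byte[] head) {
        if (head.length >= 3
                && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF) {
            return StandardCharsets.UTF_8;    // the signed values -17, -69, -65
        }
        if (head.length >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (head.length >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no marker: the file may still be UTF-8, GBK, or anything else
    }

    public static void main(String[] args) {
        byte[] withBom = {-17, -69, -65, 'h', 'i'};
        byte[] withoutBom = "hi".getBytes(StandardCharsets.UTF_8);
        System.out.println(detectByBom(withBom));
        System.out.println(detectByBom(withoutBom));
    }
}
```

The null result for the second array is the weakness mentioned above: UTF-8 text written without a marker looks the same to this check as text in any other encoding.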

The sections below show how to use cpdetector and ICU4J to guess a file's encoding.

cpdetector

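The class below uses the cpdetector jar and offers two ways to detect a file's encoding; which one to choose depends on your needs, and the code is commented. It depends on three jars: antlr-2.7.4.jar, chardet-1.0.jar and jargs-1.0.jar. They can be downloaded from the official site: http://cpdetector.sourceforge.net/.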

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.ByteOrderMarkDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.ParsingDetector;
import info.monitorenter.cpdetector.io.UnicodeDetector;

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.apache.log4j.Logger;

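/**
 * <p>
 *  Detects the encoding of a stream; the result is not guaranteed to be correct. The strategy is
 *  chosen by isFast: true selects fast detection, false selects normal detection.
 *  If the InputStream supports mark, reset is called after detection so the caller can reuse it.
 *  The InputStream is not closed.
 * </p>
 * 
 * <p>
 *  Fast detection scans at most 8 bytes and tries, in order, {@link UnicodeDetector}, {@link ByteOrderMarkDetector},
 *  {@link JChardetFacade} and {@link ASCIIDetector}. It suits standard Unicode encodings or callers sensitive to latency.
 * </p>
 * 
 * <p>
 *  Normal detection reads the given number of bytes (all available bytes if none is given) and tries, in order,
 *  {@link ByteOrderMarkDetector}, {@link ParsingDetector}, {@link JChardetFacade} and {@link ASCIIDetector}.
 *  The more bytes are read, the longer detection takes and the more accurate it tends to be.
 * </p>
 * @author WuKong
 *
 */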
public class CpdetectorEncoding {
    
    private static final Logger logger = Logger.getLogger(CpdetectorEncoding.class);
    
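    /**
     * <p>
     * Detects the encoding of a stream; the result is not guaranteed to be correct. The strategy is
     * chosen by isFast: true selects fast detection, false selects normal detection.
     * If the InputStream supports mark, reset is called after detection so the caller can reuse it.
     * The InputStream is not closed.
     * </p>
     * 
     * <p>
     * Fast detection scans at most 8 bytes and tries, in order, {@link UnicodeDetector}, {@link ByteOrderMarkDetector},
     * {@link JChardetFacade} and {@link ASCIIDetector}. It suits standard Unicode encodings or callers sensitive to latency.
     * </p>
     * 
     * <p>
     *  Normal detection reads the given number of bytes (all available bytes if none is given) and tries, in order,
     *  {@link ByteOrderMarkDetector}, {@link ParsingDetector}, {@link JChardetFacade} and {@link ASCIIDetector}.
     *  The more bytes are read, the longer detection takes and the more accurate it tends to be.
     * </p>
     *
     * @param buffIn the input stream
     * @param isFast whether to use the fast detection strategy
     * @return the detected Charset (hopefully correct), or null if no encoding was detected
     * @throws IOException 
     */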
    public Charset getEncoding(InputStream buffIn,boolean isFast) throws IOException{
        
        return getEncoding(buffIn,buffIn.available(),isFast);
    }
    
    public Charset getFastEncoding(InputStream buffIn) throws IOException{
        return getEncoding(buffIn,MAX_READBYTE_FAST,DEFALUT_DETECT_STRATEGY);
    }
    
    
    
    public Charset getEncoding(InputStream in, int size, boolean isFast) throws IOException {
        
        try {
            
            java.nio.charset.Charset charset = null;
            
            int tmpSize = in.available();
            size = size >tmpSize?tmpSize:size;
            //if in support mark method, 
            if(in.markSupported()){
                
                if(isFast){
                    
                    size = size>MAX_READBYTE_FAST?MAX_READBYTE_FAST:size;
                    in.mark(size++);
                    charset = getFastDetector().detectCodepage(in, size);
                }else{
                    
                    in.mark(size++);
                    charset = getDetector().detectCodepage(in, size);
                }
                in.reset();
                
            }else{
                
                if(isFast){
                    
                    size = size>MAX_READBYTE_FAST?MAX_READBYTE_FAST:size;
                    charset = getFastDetector().detectCodepage(in, size);
                }else{
                    charset = getDetector().detectCodepage(in, size);
                }
            }
            
            
            return charset;
        }catch(IllegalArgumentException e){
            
            logger.error(e.getMessage(),e);
            throw e;
        } catch (IOException e) {
            
            logger.error(e.getMessage(),e);
            throw e;
        }
        
    }
    
    
    public Charset getEncoding(byte[] byteArr,boolean isFast) throws IOException{
        
        return getEncoding(byteArr, byteArr.length, isFast);
    }
    
    
    public Charset getFastEncoding(byte[] byteArr) throws IOException{
        
        return getEncoding(byteArr, MAX_READBYTE_FAST, DEFALUT_DETECT_STRATEGY);
    }
    
    
    public Charset getEncoding(byte[] byteArr, int size,boolean isFast) throws IOException {
        
        size = byteArr.length>size?size:byteArr.length;
        if(isFast){
            size = size>MAX_READBYTE_FAST?MAX_READBYTE_FAST:size;
        }
        
        ByteArrayInputStream byteArrIn = new ByteArrayInputStream(byteArr,0,size);
        BufferedInputStream in = new BufferedInputStream(byteArrIn);
        
        try {
            
            Charset charset = null;
            if(isFast){
                
                charset = getFastDetector().detectCodepage(in, size);
            }else{
                
                charset = getDetector().detectCodepage(in, size);
            }
            
            return charset;
        } catch (IllegalArgumentException e) {
            
            logger.error(e.getMessage(),e);
            throw e;
        } catch (IOException e) {
            
            logger.error(e.getMessage(),e);
            throw e;
        }
       
    }
    
    private static CodepageDetectorProxy detector =null;
    private static CodepageDetectorProxy fastDtector =null;
    private static ParsingDetector parsingDetector =  new ParsingDetector(false);
    private static ByteOrderMarkDetector byteOrderMarkDetector = new ByteOrderMarkDetector();
    
    //default strategy use fastDtector
    private static final boolean DEFALUT_DETECT_STRATEGY = true;
    
    private static final int MAX_READBYTE_FAST = 8; 
    
    private static CodepageDetectorProxy getDetector(){
        
        if(detector==null){
            
            detector = CodepageDetectorProxy.getInstance();
             // Add the implementations of info.monitorenter.cpdetector.io.ICodepageDetector: 
            // This one is quick if we deal with unicode codepages:
            detector.add(byteOrderMarkDetector);
            // The first instance delegated to tries to detect the meta charset attribut in html pages.
            detector.add(parsingDetector);
            // This one does the tricks of exclusion and frequency detection, if first implementation is 
            // unsuccessful:
            detector.add(JChardetFacade.getInstance());
            detector.add(ASCIIDetector.getInstance());
        }
        
        return detector;
    }
    
    
    private static CodepageDetectorProxy getFastDetector(){
        
        if(fastDtector==null){
            
            fastDtector = CodepageDetectorProxy.getInstance();
            fastDtector.add(UnicodeDetector.getInstance());
            fastDtector.add(byteOrderMarkDetector); 
            fastDtector.add(JChardetFacade.getInstance());
            fastDtector.add(ASCIIDetector.getInstance());
        }
        
        return fastDtector;
    }
    
}
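The Javadoc above says that a stream supporting mark is reset after detection so the caller can keep using it. That pattern can be sketched without cpdetector; here the hypothetical peek method stands in for the detector call and simply returns the bytes it looked at:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MarkResetPeek {

    // Hypothetical stand-in for a detector: reads up to `limit` bytes, then rewinds the stream.
    static byte[] peek(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        try {
            in.mark(limit);                  // remember the current position
            byte[] buf = new byte[limit];
            int total = 0;
            while (total < limit) {          // read() may return fewer bytes than requested
                int n = in.read(buf, total, limit - total);
                if (n < 0) {
                    break;                   // end of stream
                }
                total += n;
            }
            in.reset();                      // rewind so the caller sees the stream from the mark
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello, encoding".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] head = peek(in, 8);
        System.out.println(new String(head, StandardCharsets.US_ASCII));
        System.out.println((char) in.read()); // still the first byte of the stream
    }
}
```

Streams that do not support mark, such as a bare FileInputStream, can be wrapped in a BufferedInputStream first; otherwise the bytes consumed during detection are gone.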

ICU4J

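ICU (International Components for Unicode) is a mature, widely used set of C/C++ and Java libraries that provides Unicode and globalization support for software applications, giving consistent results across all platforms for both C/C++ and Java software.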

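ICU was originally developed by Taligent. After Taligent was merged into IBM as the Unicode group of its globalization center, IBM continued developing ICU together with the open-source community. At first ICU existed only for Java; those classes were later absorbed into JDK 1.1, developed by Sun, and kept evolving in subsequent JDK releases. The C++ and C versions were ported from the Java version; the port is called ICU4C and supports internationalized applications on C and C++. ICU4J and ICU4C differ little. Both are open source and track the Unicode standard closely, whereas the internationalization classes bundled with the JDK are tied to JDK releases and therefore pick up Unicode changes more slowly.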

ICU's main features are:

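Code page conversion: converts text data between Unicode and nearly any other character set or encoding. ICU's conversion tables are based on character set data IBM has collected over the past decades and are among the most complete available.
Collation: compares strings according to the conventions and standards of a particular language, region or country. ICU's collation is based on the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository (CLDR).
Formatting: formats numbers, currencies, times, dates and rates according to the conventions of the chosen locale. This includes translating month and day names into the chosen language, choosing appropriate abbreviations, ordering fields correctly, and so on. This data also comes from the Common Locale Data Repository.
Time calculations: provides multiple calendar systems beyond the traditional Gregorian calendar, along with a full set of time zone calculation APIs.
Unicode support: ICU tracks the Unicode standard closely and gives easy access to the many character properties the standard defines, as well as Unicode normalization, case conversion and other basic operations.
Regular expressions: ICU's regular expressions fully support Unicode and offer very competitive performance.
Bidi: handles text that mixes scripts with different writing directions (for example English, written left to right, together with Arabic or Hebrew, written right to left).
Text boundaries: locates words, sentences or paragraphs within a text, or identifies the best positions for line wrapping when displaying it.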

Code example:

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class FileEncodingDetector {

    public static void main(String[] args) {
        File file = new File("D:\\xx1.log");
        System.out.println(getFileCharsetByICU4J(file));
    }

    /** Returns the name of the most likely charset, or null if none could be determined. */
    public static String getFileCharsetByICU4J(File file) {
        String encoding = null;

        try {
            byte[] data = Files.readAllBytes(file.toPath());
            CharsetDetector detector = new CharsetDetector();
            detector.setText(data);
            // detect() returns the single best match
            CharsetMatch match = detector.detect();
            // detectAll() returns every plausible match, best match first (unused here, shown for reference)
            CharsetMatch[] charsetMatches = detector.detectAll();
            if (match == null) {
                return encoding;
            }
            encoding = match.getName();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return encoding;
    }
}
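Once a detector has produced a charset name, the bytes still have to be decoded with it, and the name may be null or unknown to the running JVM. A minimal JDK-only sketch of that step follows; the class DetectedCharsetDecode, its methods, and the choice of UTF-8 as the fallback are assumptions made for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DetectedCharsetDecode {

    // Resolve a detector's charset name; use the fallback when it is null, illegal or unsupported.
    static Charset resolve(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback;
        }
        try {
            return Charset.isSupported(detectedName) ? Charset.forName(detectedName) : fallback;
        } catch (IllegalArgumentException e) {
            return fallback; // the name is not a legal charset name
        }
    }

    // Decode bytes with the detected charset, falling back to UTF-8 (an assumption of this sketch).
    static String decode(byte[] data, String detectedName) {
        return new String(data, resolve(detectedName, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] data = "plain text".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(data, "UTF-8"));
        System.out.println(decode(data, null));              // detector found nothing
        System.out.println(decode(data, "no-such-charset")); // name unknown to this JVM
    }
}
```

In the example above, the string returned by getFileCharsetByICU4J would be the detectedName argument.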

Caveats

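Neither ICU4J nor cpdetector can guarantee that the guessed encoding is 100% correct; they are only correct with high probability.
The encoding guessed by ICU4J or cpdetector is not necessarily the file's original encoding. For example, suppose I have a text file containing only simple English characters and I save it as GBK. Either tool may then report the encoding as ASCII. The file still opens correctly as ASCII, because GBK is compatible with ASCII. So both tools aim to find an encoding that decodes the file correctly, not necessarily the one it was originally saved in.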
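The ASCII case is easy to check with the JDK. The sketch below (class and method names are illustrative, and the JVM is assumed to ship the GBK charset) encodes the same English text as GBK and as US-ASCII and compares the bytes; when they are identical, no detector can tell the two encodings apart:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiCompatibility {

    // True if every byte is in the 7-bit ASCII range (non-negative as a signed Java byte).
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        byte[] asGbk = "hello".getBytes(gbk);
        byte[] asAscii = "hello".getBytes(StandardCharsets.US_ASCII);

        System.out.println(Arrays.equals(asGbk, asAscii));                // same bytes either way
        System.out.println(isPureAscii(asGbk));
        System.out.println(new String(asGbk, StandardCharsets.US_ASCII)); // decodes cleanly as ASCII
    }
}
```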
