JVM default charset

Date: 2019-12-10

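Recently, while reading files uploaded by a third party with the default charset, I found that a few Chinese characters came out garbled. Tracking down the garbled text showed that the culprit was a rare character: 碶.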

My tests ran on Windows and showed no problem, but when the same code ran on the Linux server, the text it read was garbled.

The file-reading code, simplified, looks like this:

 Path path = Paths.get("d:", "1.txt");
 String ss = null;
 // No charset is passed to InputStreamReader, so the JVM default charset is used
 try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path.toString())))) {
     ss = br.readLine();
     System.out.println(ss);
 }

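The problem is that the InputStreamReader wrapping new FileInputStream(path.toString()) is created without a charset, so the bytes are decoded with the JVM's default charset.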

The JVM's default charset for reading files is not the same on Windows and Linux. The test code is as follows:

        Path path = Paths.get("/szc", "1.txt");
        InputStreamReader isr;
        try {
            isr = new InputStreamReader(new FileInputStream(path.toFile()));
            System.out.println("FileInputStream encoding: "+isr.getEncoding()); 
            System.out.println("File Encoding: "+System.getProperty("file.encoding"));
        } catch (FileNotFoundException e) {
            // The file does not exist or cannot be opened
            e.printStackTrace();
        }

On Windows, the code above prints:

FileInputStream encoding: GBK
File Encoding: GBK

On Linux, the result is:

FileInputStream encoding: EUC_CN
File Encoding: GB2312

EUC_CN is simply another name for GB2312; it is the historical name that getEncoding() reports for that charset.

GBK, in turn, is an extension of GB2312. GB2312 cannot represent traditional Chinese characters or rare characters.
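One way to check this difference directly is to ask each charset's encoder whether it can represent the character. The following is only a minimal sketch: the class name CanEncodeCheck and the helper canEncode are made up for illustration, and it assumes that both the GB2312 and GBK charsets are available in the running JDK.

```java
import java.nio.charset.Charset;

public class CanEncodeCheck {

    // Hypothetical helper: returns true if the named charset has a
    // byte representation for the given character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rare = '\u78B6'; // the character 碶, written as an escape to avoid source-encoding issues
        System.out.println("GB2312 can encode: " + canEncode("GB2312", rare));
        System.out.println("GBK can encode:    " + canEncode("GBK", rare));
    }
}
```

If the two lines disagree, the character lies in the part of GBK that extends GB2312, which matches the behaviour described here.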

That is why reading the character 碶 with the default charset produced garbled text on Linux, yet worked fine on Windows.

P.S. The JVM's default charset for reading files may also vary with the operating system's locale settings and version; I have not tested this.
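To see what a particular machine actually uses, the default can be printed at startup. This is a small sketch rather than a full diagnostic; the class name DefaultCharsetReport and the method describeDefaults are hypothetical names chosen for this example.

```java
import java.nio.charset.Charset;

public class DefaultCharsetReport {

    // Hypothetical helper: summarizes the charset the JVM falls back to
    // when no charset is given, plus the raw file.encoding property.
    static String describeDefaults() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describeDefaults());
    }
}
```

Running this on each target environment gives a quick way to compare them before relying on any default.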

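In short, specify the charset explicitly whenever you read a file, so that differences between operating systems do not cause problems.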
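As a rough illustration of that advice, the reader can be given the charset explicitly. This sketch assumes the uploaded file is GBK-encoded, which the original post does not state outright; the class name ExplicitCharsetRead and the helpers readFirstLine and roundTrip are invented for the example, and a temporary file stands in for the real upload.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetRead {

    // Reads the first line of a file, decoding with the given charset
    // instead of the JVM default.
    static String readFirstLine(Path path, Charset charset) {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(path.toFile()), charset))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: writes text to a temporary file in the given charset,
    // then reads it back with the same charset.
    static String roundTrip(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(charset));
                return readFirstLine(tmp, charset);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption for this sketch: the file's encoding is GBK.
        Charset gbk = Charset.forName("GBK");
        String original = "\u78B6"; // the character 碶
        System.out.println(original.equals(roundTrip(original, gbk)));
    }
}
```

Because the charset is passed in, the result no longer depends on whether the code runs on Windows or Linux, as long as the charset named matches how the file was actually written.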