Chewing Through the String Source Code

Date: 2020-10-05


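Preface

I have recently started reading parts of the JDK source code, beginning with the class we use most often, String (JDK 1.8). This article analyzes the source of the following methods: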

equals
hashCode
equalsIgnoreCase
indexOf
startsWith
concat
substring
split
trim
compareTo

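If anything here is wrong, corrections are welcome. On to the main text.

Source Code Analysis

First, let's look at which interfaces the String class implements: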

public final class String
     implements java.io.Serializable, Comparable<String>, CharSequence {

java.io.Serializable

    This serialization interface declares no methods or fields; it only marks the class as serializable.

Comparable<String>

    This interface has a single method, compareTo(T o), used to order two instances.

CharSequence

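    This interface represents a read-only sequence of characters. It declares length(), charAt(int index), and subSequence(int start, int end), among others. Notably, StringBuffer and StringBuilder also implement it.

Now look at the two main fields: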

/** The value is used for character storage. */
private final char value[];
/** Cache the hash code for the string */
private int hash; // Default to 0

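As you can see, value[] stores the content of the String. When you write String str = "abc";, the characters "abc" are ultimately stored in a char array.

hash caches the hash code of the String instance. Strings are frequently hashed and compared, for example as keys in a HashMap. Recomputing the hash code every time would be wasteful, so caching it is a worthwhile optimization.

Note: the value array is declared final, which only means the field cannot be pointed at a different array; the contents of an array can still be modified. String is immutable because the array is private, no API lets us modify its contents, and the methods that appear to modify a string (such as replace) return a new String object instead.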
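
As a small illustration of the point above, the following sketch calls replace and returns both the original and the result. The class and method names (ImmutabilityDemo, replaceDemo) are made up for this example.

```java
// Hypothetical demo class: shows that replace() leaves the original untouched.
final class ImmutabilityDemo {
    static String[] replaceDemo(String original) {
        // replace() builds and returns a new String; it does not modify `original`.
        String replaced = original.replace('u', 'o');
        return new String[] { original, replaced };
    }

    public static void main(String[] args) {
        String[] pair = replaceDemo("wugui");
        System.out.println(pair[0]); // wugui
        System.out.println(pair[1]); // wogoi
    }
}
```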

equals

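The equals() method checks whether this String has the same content as the argument.

The String class overrides equals from its parent class Object. The implementation works as follows:

First, it checks whether the two references point to the same object; if so, it returns true immediately.
Next, it uses instanceof to check whether the argument is a String (String is final, so it has no subclasses); if not, it returns false.
Then it compares the lengths of the two char arrays; if they differ, it returns false.
Finally, it loops over the two char arrays and compares them element by element.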
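
The four steps above can be sketched roughly as follows. This is an illustrative sketch on a hypothetical wrapper class (CharBox), not the verbatim JDK source; String itself compares its private value arrays in the same order.

```java
// Hypothetical class that stores characters the way String (JDK 1.8) does.
final class CharBox {
    private final char[] value;

    CharBox(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public boolean equals(Object other) {
        // Step 1: same reference means same object.
        if (this == other) {
            return true;
        }
        // Step 2: the argument must have the right type (this also rejects null).
        if (!(other instanceof CharBox)) {
            return false;
        }
        CharBox that = (CharBox) other;
        // Step 3: arrays of different length can never be equal.
        if (value.length != that.value.length) {
            return false;
        }
        // Step 4: compare the arrays element by element.
        for (int i = 0; i < value.length; i++) {
            if (value[i] != that.value[i]) {
                return false;
            }
        }
        return true;
    }

    @Override
    public int hashCode() {
        // Kept consistent with equals, as the hashCode contract requires.
        return java.util.Arrays.hashCode(value);
    }
}
```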

hashCode

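The hashCode() method returns the hash code of the string.

A hash function maps input of arbitrary length to a value of fixed length. In Java, every object has an int hashCode() method that returns its hash code.

According to the official documentation, Object.hashCode() returns an int hash code for the object; the default implementation is typically derived from the object's identity rather than its content. By contract, if two objects are equal according to equals(), their hash codes must also be equal. So a class that overrides equals() must also override hashCode(), otherwise equal objects could end up with different hash codes.

The source computes h = 31 * h + val[i] for each character in turn. For example: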

String msg = "abcd"; 
System.out.println(msg.hashCode());

Here value = {'a','b','c','d'}, so the for loop runs 4 times:

Iteration 1: h = 31*0 + 'a' = 97
Iteration 2: h = 31*97 + 'b' = 3105
Iteration 3: h = 31*3105 + 'c' = 96354
Iteration 4: h = 31*96354 + 'd' = 2987074

So the hashCode of msg is 2987074.
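
The walkthrough above can be reproduced with a few lines of code. This is a minimal sketch written against the public API (the class name HashSketch is made up); it recomputes the value on every call and leaves out the caching in the hash field.

```java
// Hypothetical helper that repeats the loop h = 31 * h + c for each character.
final class HashSketch {
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("abcd"));       // 2987074
        System.out.println("abcd".hashCode());  // 2987074
    }
}
```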

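The Javadoc comment on hashCode also gives the computation as a polynomial:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Notice that the prime 31 is used as the multiplier, which helps spread the resulting hash values more evenly.

Effective Java explains the choice:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, because multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.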
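
The identity quoted above is easy to check. The sketch below (with the made-up name ShiftIdentity) evaluates both sides for a given int; because int arithmetic wraps around, the two sides agree even when the multiplication overflows.

```java
// Hypothetical helper: checks 31 * i == (i << 5) - i for one value.
final class ShiftIdentity {
    static boolean holdsFor(int i) {
        // i << 5 is 32 * i, so subtracting i leaves 31 * i (modulo 2^32).
        return 31 * i == (i << 5) - i;
    }
}
```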

/** Cache the hash code for the string */
private int hash; // Default to 0

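As shown above, once the value has been computed it is stored in the hash field, so later calls do not need to recompute it. This caching is safe precisely because String is immutable: the content never changes, so the cached value never goes stale.

equalsIgnoreCase

The equalsIgnoreCase() method compares this string to another string, ignoring case.

Let's look at the implementation:

And the core method:

The core method is regionMatches with ignoreCase set to true: it compares the two strings character by character, and when two characters differ it compares their uppercase and then their lowercase forms before declaring a mismatch. The annotations in the figure above should make the rest clear, so I won't go into more detail here.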
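
For readers who want something concrete, here is a minimal sketch of that comparison written against the public API. The class name IgnoreCaseSketch is made up, self is assumed to be non-null, and this is not the verbatim JDK code.

```java
// Hypothetical helper mirroring the case-insensitive comparison described above.
final class IgnoreCaseSketch {
    static boolean equalsIgnoreCase(String self, String other) {
        if (self == other) {
            return true;                    // same reference
        }
        if (other == null || self.length() != other.length()) {
            return false;                   // null or different length
        }
        for (int i = 0; i < self.length(); i++) {
            char c1 = self.charAt(i);
            char c2 = other.charAt(i);
            if (c1 == c2) {
                continue;                   // exact match
            }
            char u1 = Character.toUpperCase(c1);
            char u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;                   // match after uppercasing
            }
            // Uppercasing alone is not reliable for every alphabet,
            // so the lowercase forms are compared as a last check.
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
            return false;
        }
        return true;
    }
}
```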

indexOf

Finds the index of the first occurrence of a given character or string within this string; returns -1 if it is not found.

String str = "wugui";
System.out.println(str.indexOf("g"));

Output: 2

public int indexOf(String str) {
   return indexOf(str, 0);
}

public int indexOf(String str, int fromIndex) {
   return indexOf(value, 0, value.length,str.value, 0, str.value.length, fromIndex);
}

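Next is the core method. First, the parameters:

/*
 * @param   source       the characters being searched
 * @param   sourceOffset offset into the source array
 * @param   sourceCount  number of source characters
 * @param   target       the characters being searched for
 * @param   targetOffset offset into the target array
 * @param   targetCount  number of target characters
 * @param   fromIndex    index to start searching from
*/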
static int indexOf(char[] source, int sourceOffset, int sourceCount,
        char[] target, int targetOffset, int targetCount,
        int fromIndex) {
   ......
}

The logic proceeds in the following steps.

I think the boundary conditions in the indexOf source are handled particularly well.

Suppose we have:

String str = "wugui";
str.indexOf("ug");

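In step 2 of the figure above, max is computed and used as the bound of the following loop:

//Find the index of the first matching character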
if (source[i] != first) {
   while (++i <= max && source[i] != first);
}

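Here max = 5 - 2 = 3, so when scanning for the first character we only need to go as far as index 3. Index 4 holds the last character, 'i', and even if it matched the first target character the match would be useless, because the target is two characters long and would not fit. The end bound computed in step 5 follows the same reasoning.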
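
To make the role of max and end concrete, here is a simplified sketch of the search. It assumes both offsets are 0 and the whole arrays are searched, so it takes fewer parameters than the JDK method; the class name IndexOfSketch is made up.

```java
// Hypothetical, simplified version of the search: offsets fixed at 0.
final class IndexOfSketch {
    static int indexOf(char[] source, char[] target, int fromIndex) {
        if (fromIndex >= source.length) {
            return target.length == 0 ? source.length : -1;
        }
        if (fromIndex < 0) {
            fromIndex = 0;
        }
        if (target.length == 0) {
            return fromIndex;
        }

        char first = target[0];
        // Last index at which the target could still fit inside the source.
        int max = source.length - target.length;

        for (int i = fromIndex; i <= max; i++) {
            // Skip ahead to the next occurrence of the first target character.
            if (source[i] != first) {
                while (++i <= max && source[i] != first) { }
            }
            if (i <= max) {
                // First character matched; compare the rest of the target.
                int j = i + 1;
                int end = j + target.length - 1;
                for (int k = 1; j < end && source[j] == target[k]; j++, k++) { }
                if (j == end) {
                    return i;   // whole target matched, starting at i
                }
            }
        }
        return -1;
    }
}
```

For "wugui" and "ug", max is 3 and the method returns 1, which matches str.indexOf("ug").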

With indexOf in place, other methods can be built on top of it. For example, contains:

public boolean contains(CharSequence s) {
   return indexOf(s.toString()) > -1;
}

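It simply checks the return value of indexOf to decide whether the target string is present.

startsWith

The startsWith() method checks whether the string begins with the given prefix, returning true if so and false otherwise.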

String str = "wugui";
System.out.println(str.startsWith("wu"));

Output: true

public boolean startsWith(String prefix) {
    return startsWith(prefix, 0);
}

public boolean startsWith(String prefix, int toffset) {
    ......
}

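With startsWith available, endsWith is easy to implement:

It just passes a different argument, setting the offset to the length of the string minus the length of the suffix.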
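
The relationship between the two methods can be sketched as follows. This is an illustration written against the public API with made-up names (EndsWithSketch); the real methods work directly on the private value arrays.

```java
// Hypothetical helper showing endsWith expressed through startsWith with an offset.
final class EndsWithSketch {
    static boolean startsWith(String s, String prefix, int toffset) {
        // Reject offsets where the prefix would not fit inside s.
        if (toffset < 0 || toffset > s.length() - prefix.length()) {
            return false;
        }
        for (int i = 0; i < prefix.length(); i++) {
            if (s.charAt(toffset + i) != prefix.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    static boolean endsWith(String s, String suffix) {
        // Same check, started at the position where the suffix would begin.
        return startsWith(s, suffix, s.length() - suffix.length());
    }
}
```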

concat

Appends the given string argument to the end of this string.

String str1 = "wu";
String str2 = "gui";
System.out.println(str1.concat(str2));

Output: wugui

The implementation uses Arrays.copyOf to create the new array:

char buf[] = Arrays.copyOf(value, len + otherLen);

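Let's look at its implementation:

It mainly relies on System.arraycopy. Stepping into that:

System.arraycopy is a native method, so there is no Java source to read. An example makes its behavior clear.

Suppose we have the following arrays:

byte[] srcBytes = new byte[]{2,4,0,0,0,0,0,10,15,50}; // source array
byte[] destBytes = new byte[5]; // destination array

We copy with System.arraycopy:

System.arraycopy(srcBytes, 0, destBytes, 0, 5);

This call copies 5 elements from srcBytes, starting at index 0 (that is, indexes 0 through 4), into destBytes, starting at index 0. The destination array must already exist; here it has length 5.
After the call, destBytes contains 2,4,0,0,0.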

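After Arrays.copyOf returns the new array, str.getChars(buf, len) is called to append the other string's characters. Its implementation:

It too is implemented with System.arraycopy, so I won't repeat the details.

The last step wraps the new array in a new String: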

return new String(buf, true);
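
Putting the pieces together, concat can be approximated with public APIs as below. This is only a sketch under a made-up name (ConcatSketch): it uses the public four-argument getChars and the copying String(char[]) constructor, whereas the JDK code uses a package-private getChars and a constructor that shares the array.

```java
// Hypothetical helper approximating concat with public APIs only.
final class ConcatSketch {
    static String concat(String a, String b) {
        if (b.isEmpty()) {
            return a;   // nothing to append, so the original is returned as is
        }
        char[] left = a.toCharArray();
        // Copy a's characters into a new array large enough for both strings.
        char[] buf = java.util.Arrays.copyOf(left, left.length + b.length());
        // Copy b's characters in right after a's.
        b.getChars(0, b.length(), buf, left.length);
        return new String(buf);
    }

    public static void main(String[] args) {
        System.out.println(concat("wu", "gui")); // wugui
    }
}
```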

substring

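Extracts the characters between two given indexes.

String str = "wugui";
System.out.println(str.substring(1, 3)); // includes index 1, excludes index 3

Output: ug

Let's look at the implementation of new String(value, beginIndex, subLen):

And how Arrays.copyOfRange is implemented:

Once again it uses System.arraycopy, which was covered above, so I won't repeat it.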
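
The copy-based behavior can be sketched with public APIs as follows. The class name SubstringSketch is made up, and the bounds check is a simplification of the one in the JDK.

```java
// Hypothetical helper: substring as "check bounds, then copy a range of the array".
final class SubstringSketch {
    static String substring(String s, int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > s.length() || beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException(
                    "begin " + beginIndex + ", end " + endIndex + ", length " + s.length());
        }
        // copyOfRange copies [beginIndex, endIndex) into a fresh array.
        char[] copy = java.util.Arrays.copyOfRange(s.toCharArray(), beginIndex, endIndex);
        return new String(copy);
    }

    public static void main(String[] args) {
        System.out.println(substring("wugui", 1, 3)); // ug
    }
}
```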

split

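Splits the string around matches of the given regular expression.

First, the usage:

public String[] split(String regex, int limit)

The first parameter, regex, is the regular expression; the second, limit, caps the number of substrings in the result.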

String str = "a:b:c:d";
String[] split = str.split(":");

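When no limit is passed, split(regex, 0) is called by default.

The output above is: [a, b, c, d]

With limit set to 2, the result becomes [a, b:c:d], so a positive limit is the maximum number of substrings after splitting.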
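
The effect of limit is easiest to see side by side. The following demo (the class name SplitLimitDemo is made up) also shows two cases not covered above: a limit of 0 drops trailing empty strings, while a negative limit keeps them.

```java
// Hypothetical demo class comparing different limit values.
final class SplitLimitDemo {
    static String show(String s, String regex, int limit) {
        return java.util.Arrays.toString(s.split(regex, limit));
    }

    public static void main(String[] args) {
        System.out.println(show("a:b:c:d", ":", 0));  // [a, b, c, d]
        System.out.println(show("a:b:c:d", ":", 2));  // [a, b:c:d]
        System.out.println(show("a:b::", ":", 0));    // [a, b]
        System.out.println(show("a:b::", ":", -1));   // [a, b, , ]
    }
}
```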

Here is the full source:

public String[] split(String regex, int limit) {
        /* fastpath if the regex is a
         (1)one-char String and this character is not one of the
            RegEx's meta characters ".$|()[{^?*+\\", or
         (2)two-char String and the first char is the backslash and
            the second is not the ascii digit or ascii letter.
         */
        char ch = 0;
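        //Case 1: regex is a single character that is not one of the listed metacharacters
        //Case 2: regex is two characters, the first is a backslash and the second is not an ASCII digit or letter
        //In both cases ch must also not be a UTF-16 surrogate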
        if (((regex.value.length == 1 &&
             ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {
            int off = 0;
            int next = 0;
            boolean limited = limit > 0;
            ArrayList<String> list = new ArrayList<>();
            while ((next = indexOf(ch, off)) != -1) {
                if (!limited || list.size() < limit - 1) {
                    list.add(substring(off, next));
                    off = next + 1;
                } else {    // last one
                    //assert (list.size() == limit - 1);
                    list.add(substring(off, value.length));
                    off = value.length;
                    break;
                }
            }
            // If no match was found, return this
            if (off == 0)
                return new String[]{this};

            // Add remaining segment
            if (!limited || list.size() < limit)
                list.add(substring(off, value.length));

            // Construct result
            int resultSize = list.size();
            if (limit == 0) {
                while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                    resultSize--;
                }
            }
            String[] result = new String[resultSize];
            return list.subList(0, resultSize).toArray(result);
        }
        return Pattern.compile(regex).split(this, limit);
    }

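Let's go through it step by step.

The condition has three parts:

regex is a single character that is not one of the listed metacharacters
regex is two characters, the first is the backslash escape and the second is not a digit or letter
The character ch is not a UTF-16 surrogate (an encoding concern)

The fast path is taken only when part 1 or part 2 holds and part 3 also holds; otherwise the method falls back to Pattern.compile(regex).split(this, limit).

During splitting, off marks the start of the current segment and next is the index of the next delimiter. After each segment is added, off is updated to next + 1. Once the list holds limit - 1 entries, the rest of the string is added as the final segment.

Finally, the substrings are post-processed: when limit is 0, trailing empty strings are dropped from the result.

Personally I find this part of the source fairly difficult; interested readers can dig into it further.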

trim

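Removes leading and trailing whitespace from the string (more precisely, any leading and trailing characters with a code less than or equal to that of the space character).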

String str = "  wugui         ";
System.out.println(str.trim());

Output: wugui

This one is fairly simple, so I won't go into detail.
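
For completeness, the idea can be sketched in a few lines: move a start index forward and an end index backward past every character that is less than or equal to ' ', then take the substring in between. The class name TrimSketch is made up and the code is written against the public API.

```java
// Hypothetical helper mirroring the two-pointer idea behind trim.
final class TrimSketch {
    static String trim(String s) {
        int start = 0;
        int end = s.length();
        // Skip leading characters that are <= ' ' (space and control characters).
        while (start < end && s.charAt(start) <= ' ') {
            start++;
        }
        // Skip trailing characters that are <= ' '.
        while (start < end && s.charAt(end - 1) <= ' ') {
            end--;
        }
        // Return the same object when there was nothing to remove.
        return (start > 0 || end < s.length()) ? s.substring(start, end) : s;
    }
}
```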

compareTo

Compares two strings lexicographically.

String a = "a";
String b = "b";
System.out.println(a.compareTo(b));

Output: -1

Here is the source:
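
A minimal sketch of the comparison logic, written against the public API under a made-up name (CompareToSketch) rather than copied from the JDK: compare characters up to the shorter length, return the difference at the first mismatch, and otherwise return the difference in length.

```java
// Hypothetical helper mirroring the lexicographic comparison.
final class CompareToSketch {
    static int compare(String a, String b) {
        int lim = Math.min(a.length(), b.length());
        for (int k = 0; k < lim; k++) {
            char c1 = a.charAt(k);
            char c2 = b.charAt(k);
            if (c1 != c2) {
                return c1 - c2;   // first differing character decides
            }
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return a.length() - b.length();
    }

    public static void main(String[] args) {
        System.out.println(compare("a", "b"));       // -1
        System.out.println(compare("wugui", "wu"));  // 3
    }
}
```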

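Summary

That's it for the String source for now; readers who are interested in the rest can explore it on their own. Next I may write a few articles on the wrapper classes in Java, so stay tuned!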