Hash tables: Introduction to Algorithms (13)

Date: 2019-11-13


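1. Introduction

Many applications need a dynamic set that supports at least the dictionary operations Insert, Search, and Delete. A hash table is an effective data structure for implementing these dictionary operations.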

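2. Direct-address tables

Before introducing hash tables, we first look at direct-address tables.

Direct addressing is a simple and effective technique when the universe U of keys (the range of possible key values) is reasonably small. Suppose an application needs a dynamic set in which every element has a key drawn from the universe U = {0, 1, …, m-1}, where m is not too large. Assume further that no two elements have the same key.

To represent the dynamic set, we use an array, called a direct-address table and denoted T[0..m-1], in which each position (slot) corresponds to one key in U. The rule is that slot k points to the element of the set whose key is k; if the set contains no element with key k, then T[k] = NIL.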

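The dictionary operations are trivial to implement: Search(T, k) returns T[k], Insert(T, x) sets T[x.key] = x, and Delete(T, x) sets T[x.key] = NIL.

Each of these operations takes O(1) time.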
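The three operations can be sketched in Java as follows. This is only a minimal illustration, not code from the original post: the class name `DirectAddressTable`, the nested `Element` type, and the use of `null` for NIL are all assumptions made for the example.

```java
/** Minimal direct-address table for keys in {0, ..., m-1}. All names are illustrative. */
class DirectAddressTable {
    /** An element with an integer key and some satellite data. */
    static final class Element {
        final int key;
        final String data;

        Element(int key, String data) {
            this.key = key;
            this.data = data;
        }
    }

    private final Element[] table; // T[0..m-1]; null plays the role of NIL

    DirectAddressTable(int m) {
        table = new Element[m];
    }

    /** Search: return T[k], or null if no element has key k. */
    Element search(int k) {
        return table[k];
    }

    /** Insert: T[x.key] = x. */
    void insert(Element x) {
        table[x.key] = x;
    }

    /** Delete: T[x.key] = NIL. */
    void delete(Element x) {
        table[x.key] = null;
    }
}
```

Each method is a single array access, which is where the O(1) bound comes from. The price is that the array always has m entries, however few elements are stored.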

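In some applications, the elements themselves can be stored directly in the slots of the table, rather than storing a pointer to each object as described above, which saves space.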

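3. Hash tables

(1) Drawbacks of direct addressing

Direct addressing has some obvious drawbacks. If the universe U is large, the table T needs a very long block of memory, and the allocation may well fail. And if the universe is large but the keys actually stored are sparse, this representation wastes most of the space it allocates.

(2) Hash functions

To overcome these drawbacks while keeping the fast dictionary operations, we can use a hash function

h: U → {0, 1, 2, …, m-1}

to compute the slot for a key k. Put simply, the hash function h maps keys from a large range into a much smaller set. We say that an element with key k hashes to slot h(k), or that h(k) is the hash value of key k.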

(Figure: the hash function h maps keys from the universe U to slots of the table T.)

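This raises a problem: two keys may map to the same slot, which we call a collision. No matter how well h is designed, collisions cannot be avoided entirely, because |U| > m.

We therefore face two questions: how to resolve collisions when they occur, and how to find a hash function h that keeps collisions to a minimum.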

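(3) Resolving collisions by chaining

Let us tackle the first question first.

The solution is to link all the elements that hash to the same slot into a linked list, and to store in the slot a pointer to the head of that list.

(Figure: slot T[j] points to a linked list of all elements whose keys hash to j.)

With this scheme, the dictionary operations work as follows: Insert(T, x) puts x at the head of list T[h(x.key)], Search(T, k) looks for an element with key k in list T[h(k)], and Delete(T, x) removes x from list T[h(x.key)].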
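A minimal Java sketch of chaining is shown below. It is an illustration under stated assumptions rather than the original post's code: the class `ChainedHashTable`, the nested `Element` type, and the placeholder hash function h(k) = k mod m are all chosen for the example.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/** Minimal hash table with chaining for integer keys. All names are illustrative. */
class ChainedHashTable {
    static final class Element {
        final int key;
        final String data;

        Element(int key, String data) {
            this.key = key;
            this.data = data;
        }
    }

    private final List<LinkedList<Element>> table = new ArrayList<>(); // one chain per slot

    ChainedHashTable(int m) {
        for (int j = 0; j < m; j++) {
            table.add(new LinkedList<>());
        }
    }

    /** Placeholder hash function (an assumption): h(k) = k mod m, kept non-negative. */
    private int h(int k) {
        return Math.floorMod(k, table.size());
    }

    /** Insert x at the head of chain T[h(x.key)]. */
    void insert(Element x) {
        table.get(h(x.key)).addFirst(x);
    }

    /** Walk chain T[h(k)] and return the first element with key k, or null. */
    Element search(int k) {
        for (Element x : table.get(h(k))) {
            if (x.key == k) {
                return x;
            }
        }
        return null;
    }

    /** Remove x from chain T[h(x.key)]. */
    void delete(Element x) {
        table.get(h(x.key)).remove(x);
    }
}
```

Note that `java.util.LinkedList.remove(Object)` scans the chain to find x, so deletion in this sketch costs as much as a search. Getting O(1) deletion requires holding a reference to the list node itself, which this standard class does not expose.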

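Next we analyze the performance of each of these operations.

Insertion clearly takes O(1) time.

Deletion takes as long as removing an element from a linked list. If list T[h(k)] is doubly linked, this takes O(1) time given a pointer to the element; if it is singly linked, deletion has the same asymptotic running time as searching.

The running time of searching deserves a closer look: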

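First, we assume that any given element is equally likely to hash into any slot of T, independently of where any other element has hashed to. We call this assumption simple uniform hashing.

Without loss of generality, suppose the m slots of hash table T hold n elements, so that each slot holds α = n/m elements on average; we call α the load factor of T. Let T[j] denote the list in slot j (j = 0, 1, …, m-1) and let n_j denote the length of list T[j], so that

n = n_0 + n_1 + … + n_{m-1},

and E[n_j] = α = n / m.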

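We now consider unsuccessful and successful searches separately.

① Unsuccessful search

An unsuccessful search must traverse every element of list T[j], whose expected length is α, so the traversal takes O(α) time on average. Adding the O(1) time needed to compute the hash value and index into T[j], the total expected time is Θ(1 + α).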

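② Successful search

In a successful search we cannot know in advance where in list T[j] the traversal will stop, so we can only analyze the average case.

Let x_i be the i-th element inserted into T (that is, number the n elements 1 to n in insertion order), and let k_i = x_i.key for i = 1, 2, …, n. Define the indicator random variable X_ij = I{h(k_i) = h(k_j)}, that is, X_ij = 1 if h(k_i) = h(k_j) and X_ij = 0 otherwise.

Under simple uniform hashing, for i ≠ j we have

P{h(k_i) = h(k_j)} = 1 / m,

E[X_ij] = 1 / m.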

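New elements are inserted at the head of a list, so finding x_i means examining x_i itself plus every element inserted after x_i that landed in the same list. Averaging over the n elements, the expected number of elements examined is:

E[(1/n) Σ_{i=1..n} (1 + Σ_{j=i+1..n} X_ij)] = 1 + (1/(nm)) Σ_{i=1..n} (n - i) = 1 + (n - 1)/(2m) = 1 + α/2 - α/(2n).

Adding the time to compute the hash value, a successful search therefore takes Θ(2 + α/2 - α/(2n)) = Θ(1 + α) time.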

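Putting the analysis together: if the number of slots is at least proportional to the number of elements, so that n = O(m) and α = O(1), then all dictionary operations take O(1) time on average (for deletion, assuming doubly linked lists).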
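As a sanity check on the successful-search analysis, the following sketch simulates simple uniform hashing and compares the measured average cost with the closed form 1 + α/2 - α/(2n). It is an illustrative experiment, not part of the original post; the class name, the parameters n = 10,000 and m = 1,000, and the use of `java.util.Random` as a stand-in for a uniform hash function are all assumptions.

```java
import java.util.Random;

/** Simulates chaining under simple uniform hashing. All names are illustrative. */
class SuccessfulSearchCost {
    /**
     * Average number of elements examined by a successful search, taken over all n
     * stored elements, when every element lands in a uniformly random slot.
     */
    static double simulate(int n, int m, Random rng) {
        int[] chainLength = new int[m];
        for (int i = 0; i < n; i++) {
            chainLength[rng.nextInt(m)]++;
        }
        // In a chain of length c the search costs are 1, 2, ..., c, totalling c(c+1)/2.
        long total = 0;
        for (int c : chainLength) {
            total += (long) c * (c + 1) / 2;
        }
        return (double) total / n;
    }

    /** Closed form from the analysis: 1 + alpha/2 - alpha/(2n). */
    static double predicted(int n, int m) {
        double alpha = (double) n / m;
        return 1 + alpha / 2 - alpha / (2.0 * n);
    }

    public static void main(String[] args) {
        int n = 10_000, m = 1_000, trials = 20;
        Random rng = new Random(42);
        double sum = 0;
        for (int t = 0; t < trials; t++) {
            sum += simulate(n, m, rng);
        }
        System.out.printf("simulated: %.4f%n", sum / trials);
        System.out.printf("predicted: %.4f%n", predicted(n, m));
    }
}
```

With α = 10 the closed form gives 5.9995, and the simulated average should come out close to that value.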

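4. Hash functions

We now turn to the second question: how to construct a good hash function.

A good hash function should (approximately) satisfy simple uniform hashing: every key is equally likely to hash to any of the slots, independently of where any other key hashes to. Unfortunately, we usually have no way to check whether this condition holds.

In practice, heuristics can often be used to build a hash function that performs well, and useful information about the distribution of keys can be exploited in the design. A good approach derives hash values that are, to some degree, independent of any patterns that may exist in the data.

Two basic methods for constructing hash functions follow: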

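(1) The division method

The division method is simple: divide the key k by m and take the remainder, which maps k to one of the m slots. The hash function is

h(k) = k mod m,

Because it needs only a single division, the method is very fast. However, some values of m should be avoided. For example, m should not be a power of 2: if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless we know that all patterns of the low-order p bits are equally likely, m should be chosen with care. A prime not too close to an exact power of 2 is often a good choice.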
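The effect of choosing m as a power of 2 is easy to see in a small example. The sketch below is illustrative only; the class name `DivisionHash` and the sample keys are assumptions. The keys are all multiples of 16, so they differ only above their low 4 bits.

```java
/** Division-method hashing, h(k) = k mod m, for non-negative keys. Names are illustrative. */
class DivisionHash {
    static int hash(int k, int m) {
        return k % m;
    }

    public static void main(String[] args) {
        int[] keys = {16, 32, 48, 64, 80, 96}; // identical low 4 bits
        for (int k : keys) {
            // With m = 16 every key lands in slot 0; with the prime m = 13 they spread out.
            System.out.println(k + " -> m=16: " + hash(k, 16) + ", m=13: " + hash(k, 13));
        }
    }
}
```

With m = 16 all six keys collide in slot 0, while with m = 13 they land in slots 3, 6, 9, 12, 2, and 5.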

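(2) The multiplication method

This method works in two steps. First, multiply the key k by a constant A (0 < A < 1) and extract the fractional part of kA. Second, multiply this value by m and take the floor. The hash function is

h(k) = ⌊m (kA mod 1)⌋,

where "kA mod 1" means the fractional part of kA, that is, kA - ⌊kA⌋.

An advantage of the multiplication method is that the choice of m is not critical; it is typically chosen to be a power of 2. Although the method works with any value of A, Knuth suggests that A ≈ (√5 - 1) / 2 = 0.618033988… is likely to work reasonably well.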
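A direct transcription of the formula into Java might look like the sketch below. It is an illustration under assumptions: the class name `MultiplicationHash` is made up, and it uses double-precision floating point, which loses accuracy for very large keys. Practical implementations usually use fixed-point integer arithmetic instead.

```java
/** Multiplication-method hashing for non-negative keys. Names are illustrative. */
class MultiplicationHash {
    /** Knuth's suggested constant, (sqrt(5) - 1) / 2. */
    static final double A = (Math.sqrt(5.0) - 1.0) / 2.0;

    /** h(k) = floor(m * (k*A mod 1)). */
    static int hash(long k, int m) {
        double product = k * A;
        double fractional = product - Math.floor(product); // kA mod 1
        return (int) Math.floor(m * fractional);
    }

    public static void main(String[] args) {
        // kA = 76300.0041151..., and 0.0041151... * 16384 = 67.42..., so this prints 67.
        System.out.println(hash(123456, 1 << 14));
    }
}
```

Because the fractional part lies in [0, 1), the result always falls in {0, …, m-1}, whatever value of m is used.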

5. Bloom filters

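A Bloom filter is a data structure commonly used to test whether an element is a member of a set. (As this suggests, the set only needs to store a "fingerprint" of each element, not the whole element.) It consists of a long bit vector and a series of random mapping functions. Compared with other approaches, it is space-efficient and fast to query.

Before describing how a Bloom filter works, consider the following scenario. A company wants to protect its users from nuisance calls. It plans to build a blacklist of nuisance phone numbers by storing all of them in a hash table. When a user receives a call from an unknown number, the server immediately checks that number against the blacklist and blocks the call if it matches.

They would not store the raw numbers in the hash table, of course. Instead, each number is compressed by some algorithm into an 8-byte fingerprint, and the fingerprint is stored in the table. Even so, a problem remains: because the space utilization of a hash table is only about 50%, storing one number effectively costs 16 bytes. At that rate, 100 million numbers take about 1.6 GB, and billions of numbers could take upwards of a hundred GB. Is there a better way?

This is where the Bloom filter comes in. Suppose we need to record 100 million nuisance numbers. We first allocate a vector of 200 million bytes (1.6 billion bits, numbered 1 to 1.6 billion) and set every bit to 0. To insert a phone number x, we use some algorithm to map x to 8 of the 1.6 billion bit positions and set those 8 bits to 1. (The algorithm is assumed to hit every bit with equal probability, with each mapping independent of the others.) To look up a number, we map it to 8 bits in the same way: if any of them is 0, the number is definitely not in the blacklist; if all 8 are 1, the number is reported as blacklisted.

This approach is similar in spirit to how a hash function maps keys into a hash table, so a Bloom filter also suffers from collisions. A collision can cause a "good" number to be misjudged as a nuisance number (a false positive), but a nuisance number is never misjudged as a good one. The calculation below shows that in most situations this false-positive rate is tolerable.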

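Suppose a Bloom filter has m slots (bits) and we add n numbers to it, each mapped to k slots. Adding the n numbers then performs kn mappings, and every one of the m slots is equally likely to be hit by any given mapping. Therefore,

in a single mapping, the probability that a particular slot is hit (and so set to 1) is

1 / m,

and the probability that it stays 0 is

1 - 1 / m.

After kn mappings, the probability that a particular slot is still 0 is

(1 - 1 / m)^(kn),

and the probability that it is 1 is

1 - (1 - 1 / m)^(kn).

So the probability of a false positive (all k slots are 1) is

p = (1 - (1 - 1 / m)^(kn))^k.

Using (1 - 1 / m)^m ≈ e^(-1) for large m, this becomes:

p ≈ (1 - e^(-kn/m))^k = (1 - e^(-kα))^k, where α = n / m.

Note that when k = 1 this reduces to the hash table case, p ≈ 1 - e^(-α).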

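Depending on which variable we treat as independent, we consider two cases:

① Treat the false-positive rate p as a function of the load factor α (with k constant). From the graph of the function p(α) = (1 - e^(-kα))^k we can draw the following conclusions:

As the load factor α (α = n / m) grows, the false-positive rate (or the probability of a collision) also grows; for k = 1 the rate of growth gradually slows.

For k = 1, keeping the false-positive rate below 0.5 requires a load factor below ln 2 ≈ 0.7. To some extent this also helps explain why the default load factor of the JDK HashMap is 0.75.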

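② Treat the false-positive rate p as a function of k (with α constant). Differentiating p with respect to k shows that p is minimized when k = ln 2 / α. At that point e^(-kα) = 1/2, so p = 2^(-k) (equivalently, k = -ln p / ln 2). This result lets us compute the most suitable k from the false-positive rate we are willing to tolerate.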
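To make these formulas concrete, the sketch below evaluates them for the phone-number scenario described earlier (n = 100 million numbers, m = 1.6 billion bits, k = 8). It is an illustrative calculation only; the class and method names are assumptions, and the formulas rely on the same independence assumptions as the derivation.

```java
/** False-positive formulas for a Bloom filter. All names are illustrative. */
class BloomMath {
    /** Form before the limit is applied: (1 - (1 - 1/m)^(kn))^k. */
    static double exact(long m, long n, int k) {
        return Math.pow(1.0 - Math.pow(1.0 - 1.0 / m, (double) k * n), k);
    }

    /** Approximation: (1 - e^(-k*alpha))^k with alpha = n/m. */
    static double approx(double alpha, int k) {
        return Math.pow(1.0 - Math.exp(-k * alpha), k);
    }

    /** Real-valued k that minimizes the approximation: k = ln 2 / alpha. */
    static double optimalK(double alpha) {
        return Math.log(2.0) / alpha;
    }

    public static void main(String[] args) {
        long n = 100_000_000L;   // stored phone numbers
        long m = 1_600_000_000L; // bits in the vector
        int k = 8;               // bits set per number
        double alpha = (double) n / m; // 0.0625
        System.out.printf("exact p   = %.6e%n", exact(m, n, k));
        System.out.printf("approx p  = %.6e%n", approx(alpha, k));
        System.out.printf("optimal k = %.2f%n", optimalK(alpha));
    }
}
```

For these parameters the false-positive rate comes out at roughly 5.7 × 10^-4, and the minimizing k is about 11, so the choice k = 8 in the example is near the optimum but not exactly at it.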

Below is a Java implementation of a Bloom filter (from https://github.com/MagnusS/Java-BloomFilter, with the variable and method names changed to match the notation used above):

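import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Collection;

public class BloomFilter<E> implements Serializable {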
    private static final long serialVersionUID = -9077350041930475408L;

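    private BitSet bitset; // the bit vector
    private int slotSize; // total number of bits (slots) in the bit vector (m in the text)
    private double loadFactor; // load factor (α in the text)
    private int capacity; // expected number of elements (n in the text)
    private int size; // number of elements actually added so far
    private int k; // number of bits each element maps to (k in the text)

    static final Charset charset = Charset.forName("UTF-8");
    static final String hashName = "MD5"; // MD5 by default; can be changed to SHA1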
    static final MessageDigest digestFunction;

    static {
        MessageDigest tmp;
        try {
            tmp = java.security.MessageDigest.getInstance(hashName);
        } catch (NoSuchAlgorithmException e) {
            tmp = null;
        }
        digestFunction = tmp;
    }

    public BloomFilter(int slotSize, int capacity) {
        this(capacity / (double) slotSize, // loadFactor = n / m
                capacity,
                (int) Math.round((slotSize / (double) capacity) * Math.log(2.0))); // k = (m / n) * ln 2
    }

    public BloomFilter(double falsePositiveProbability, int capacity) {
        this(Math.log(2) / (Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)))), // loadFactor = ln 2 / k
                capacity,
                (int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)))); // k = -ln p / ln 2
    }

    public BloomFilter(int slotSize, int capacity, int size, BitSet filterData) {
        this(slotSize, capacity);
        this.bitset = filterData;
        this.size = size;
    }

    public BloomFilter(double loadFactor, int capacity, int k) {
        size = 0;
        this.loadFactor = loadFactor;
        this.capacity = capacity;
        this.k = k;
        // loadFactor = n / m, so the number of bits is m = n / loadFactor
        this.slotSize = (int) Math.ceil(capacity / loadFactor);
        this.bitset = new BitSet(slotSize);
    }

    public static int createHash(String val, Charset charset) {
        return createHash(val.getBytes(charset));
    }

    public static int createHash(String val) {
        return createHash(val, charset);
    }

    public static int createHash(byte[] data) {
        return createHashes(data, 1)[0];
    }

    public static int[] createHashes(byte[] data, int hashes) {
        int[] result = new int[hashes];

        int k = 0;
        byte salt = 0;
        while (k < hashes) {
            byte[] digest;
            synchronized (digestFunction) {
                digestFunction.update(salt);
                salt++;
                digest = digestFunction.digest(data);
            }

            for (int i = 0; i < digest.length / 4 && k < hashes; i++) {
                int h = 0;
                for (int j = (i * 4); j < (i * 4) + 4; j++) {
                    h <<= 8;
                    h |= ((int) digest[j]) & 0xFF;
                }
                result[k] = h;
                k++;
            }
        }
        return result;
    }

    /**
     * Compares the contents of two instances to see if they are equal.
     *
     * @param obj
     *            is the object to compare to.
     * @return True if the contents of the objects are equal.
     */
    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final BloomFilter<E> other = (BloomFilter<E>) obj;
        if (this.capacity != other.capacity) {
            return false;
        }
        if (this.k != other.k) {
            return false;
        }
        if (this.slotSize != other.slotSize) {
            return false;
        }
        if (this.bitset != other.bitset && (this.bitset == null || !this.bitset.equals(other.bitset))) {
            return false;
        }
        return true;
    }

    /**
     * Calculates a hash code for this class.
     * 
     * @return hash code representing the contents of an instance of this class.
     */
    @Override
    public int hashCode() {
        int hash = 7;
        hash = 61 * hash + (this.bitset != null ? this.bitset.hashCode() : 0);
        hash = 61 * hash + this.capacity;
        hash = 61 * hash + this.slotSize;
        hash = 61 * hash + this.k;
        return hash;
    }

    /**
     * Calculates the expected probability of false positives based on the
     * number of expected filter elements and the size of the Bloom filter.
     * <br />
     * <br />
     * The value returned by this method is the <i>expected</i> rate of false
     * positives, assuming the number of inserted elements equals the number of
     * expected elements. If the number of elements in the Bloom filter is less
     * than the expected value, the true probability of false positives will be
     * lower.
     *
     * @return expected probability of false positives.
     */
    public double expectedFalsePositiveProbability() {
        return getFalsePositiveProbability(capacity);
    }

    /**
     * Calculate the probability of a false positive given the specified number
     * of inserted elements.
     *
     * @param numberOfElements
     *            number of inserted elements.
     * @return probability of a false positive.
     */
    public double getFalsePositiveProbability(double numberOfElements) {
        // (1 - e^(-k * n / m)) ^ k
        return Math.pow((1 - Math.exp(-k * (double) numberOfElements / (double) slotSize)), k);

    }

    /**
     * Get the current probability of a false positive. The probability is
     * calculated from the size of the Bloom filter and the current number of
     * elements added to it.
     *
     * @return probability of false positives.
     */
    public double getFalsePositiveProbability() {
        return getFalsePositiveProbability(size);
    }

    /**
     * Returns the value chosen for K.<br />
     * <br />
     * K is the optimal number of hash functions based on the size of the Bloom
     * filter and the expected number of inserted elements.
     *
     * @return optimal k.
     */
    public int getK() {
        return k;
    }

    /**
     * Sets all bits to false in the Bloom filter.
     */
    public void clear() {
        bitset.clear();
        size = 0;
    }

    /**
     * Adds an object to the Bloom filter. The output from the object's
     * toString() method is used as input to the hash functions.
     *
     * @param element
     *            is an element to register in the Bloom filter.
     */
    public void add(E element) {
        add(element.toString().getBytes(charset));
    }

    /**
     * Adds an array of bytes to the Bloom filter.
     *
     * @param bytes
     *            array of bytes to add to the Bloom filter.
     */
    public void add(byte[] bytes) {
        int[] hashes = createHashes(bytes, k);
        for (int hash : hashes)
            bitset.set(Math.abs(hash % slotSize));
        size++;
    }

    /**
     * Adds all elements from a Collection to the Bloom filter.
     * 
     * @param c
     *            Collection of elements.
     */
    public void addAll(Collection<? extends E> c) {
        for (E element : c)
            add(element);
    }

    /**
     * Returns true if the element could have been inserted into the Bloom
     * filter. Use getFalsePositiveProbability() to calculate the probability of
     * this being correct.
     *
     * @param element
     *            element to check.
     * @return true if the element could have been inserted into the Bloom
     *         filter.
     */
    public boolean contains(E element) {
        return contains(element.toString().getBytes(charset));
    }

    /**
     * Returns true if the array of bytes could have been inserted into the
     * Bloom filter. Use getFalsePositiveProbability() to calculate the
     * probability of this being correct.
     *
     * @param bytes
     *            array of bytes to check.
     * @return true if the array could have been inserted into the Bloom filter.
     */
    public boolean contains(byte[] bytes) {
        int[] hashes = createHashes(bytes, k);
        for (int hash : hashes) {
            if (!bitset.get(Math.abs(hash % slotSize))) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns true if all the elements of a Collection could have been inserted
     * into the Bloom filter. Use getFalsePositiveProbability() to calculate the
     * probability of this being correct.
     * 
     * @param c
     *            elements to check.
     * @return true if all the elements in c could have been inserted into the
     *         Bloom filter.
     */
    public boolean containsAll(Collection<? extends E> c) {
        for (E element : c)
            if (!contains(element))
                return false;
        return true;
    }

    /**
     * Read a single bit from the Bloom filter.
     * 
     * @param bit
     *            the bit to read.
     * @return true if the bit is set, false if it is not.
     */
    public boolean getBit(int bit) {
        return bitset.get(bit);
    }

    /**
     * Set a single bit in the Bloom filter.
     * 
     * @param bit
     *            is the bit to set.
     * @param value
     *            If true, the bit is set. If false, the bit is cleared.
     */
    public void setBit(int bit, boolean value) {
        bitset.set(bit, value);
    }

    /**
     * Return the bit set used to store the Bloom filter.
     * 
     * @return bit set representing the Bloom filter.
     */
    public BitSet getBitSet() {
        return bitset;
    }

    /**
     * Returns the number of bits in the Bloom filter. Use count() to retrieve
     * the number of inserted elements.
     *
     * @return the size of the bitset used by the Bloom filter.
     */
    public int slotSize() {
        return slotSize;
    }

    /**
     * Returns the number of elements added to the Bloom filter after it was
     * constructed or after clear() was called.
     *
     * @return number of elements added to the Bloom filter.
     */
    public int size() {
        return size;
    }

    /**
     * Returns the expected number of elements to be inserted into the filter.
     * This value is the same value as the one passed to the constructor.
     *
     * @return expected number of elements.
     */
    public int capacity() {
        return capacity;
    }

    /**
     * Get the load factor this Bloom filter was configured with (the
     * loadFactor field, α in the text). This value is set by the constructor
     * when the Bloom filter is created.
     *
     * @return the configured load factor.
     */
    public double getLoadFactor() {
        return this.loadFactor;
    }
}