Guava中的布隆过滤器

示例:

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class GuavaBloomFilter {
    public static void main(String[] args) {
        BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), 100000, 0.01);

        bloomFilter.put("shenzhen");

        System.out.println(bloomFilter.mightContain("guangzhou"));
        System.out.println(bloomFilter.mightContain("shenzhen"));
    }
}

结果:

false
true

采用Guava 27.0.1版本的源码,BF的具体逻辑位于com.google.common.hash.BloomFilter类中。开始读代码吧。

BloomFilter类的成员属性

不多,只有4个。

  /** The bit set of the BloomFilter (not necessarily power of 2!) */
  private final LockFreeBitArray bits;

  /** Number of hashes per element */
  private final int numHashFunctions;

  /** The funnel to translate Ts to bytes */
  private final Funnel<? super T> funnel;

  /** The strategy we employ to map an element T to {@code numHashFunctions} bit indexes. */
  private final Strategy strategy;
  • bits即上文讲到的长度为m的位数组,采用LockFreeBitArray类型做了封装。
  • numHashFunctions即哈希函数的个数k。
  • funnel是Funnel接口实现类的实例,它用于将任意类型T的输入数据转化为Java基本类型的数据(byte、int、char等等)。这里是会转化为byte。
  • strategy是布隆过滤器的哈希策略,即数据如何映射到位数组,其具体方法在BloomFilterStrategies枚举中。

BloomFilter的构造

这个类的构造方法是私有的。要创建它的实例,应该通过公有的create()方法。它一共有5种重载方法,但最终都是调用了如下的逻辑。

  @VisibleForTesting
  static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
    checkNotNull(funnel);
    checkArgument(
        expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
    checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
    checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
    checkNotNull(strategy);

    if (expectedInsertions == 0) {
      expectedInsertions = 1;
    }
    /*
     * TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
     * is proportional to -log(p), but there is not much of a point after all, e.g.
     * optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
     */
    long numBits = optimalNumOfBits(expectedInsertions, fpp);
    int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
    try {
      return new BloomFilter<T>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
    }
  }

该方法接受4个参数:funnel是插入数据的Funnel,expectedInsertions是期望插入的元素总个数n,fpp即期望假阳性率p,strategy即哈希策略。

由上可知,位数组的长度m和哈希函数的个数k分别通过optimalNumOfBits()方法和optimalNumOfHashFunctions()方法来估计。

估计最优m值和k值

  @VisibleForTesting
  static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
      p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  @VisibleForTesting
  static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

要看懂这两个方法,我们得接着上一节的推导继续做下去。

 

 

由假阳性率的近似计算方法可知,如果要使假阳性率尽量小,在m和n给定的情况下,k值应为:

 
guava之BloomFilter
 

这就是optimalNumOfHashFunctions()方法的逻辑。那么m该如何估计呢?

将k代入上一节的式子并化简,我们可以整理出期望假阳性率p与m、n的关系:

 
guava之BloomFilter
 

 

 

亦即:

 
guava之BloomFilter
 

这就是optimalNumOfBits()方法的逻辑。

从上也可以得出:

  • 如果指定期望假阳性率p,那么最优的m值与期望元素数n呈线性关系。
  •  

     

    最优的k值实际上只与p有关,与m和n都无关,即:
     
    guava之BloomFilter
     

所以,在创建BloomFilter时,确定合适的p和n值很重要。

哈希策略

在BloomFilterStrategies枚举中定义了两种哈希策略,都基于著名的MurmurHash算法,分别是MURMUR128_MITZ_32和MURMUR128_MITZ_64。前者是一个简化版,所以我们来看看后者的实现方法。

  MURMUR128_MITZ_64() {
    @Override
    public <T> boolean put(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      boolean bitsChanged = false;
      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
        combinedHash += hash2;
      }
      return bitsChanged;
    }

    @Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
          return false;
        }
        combinedHash += hash2;
      }
      return true;
    }

    private /* static */ long lowerEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1]

相关文章: