A Quick Guide to Using faiss

Date: 2019-12-09


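Introduction

faiss is a library for efficient similarity search and clustering of dense vectors, developed by Facebook AI Research. Its main features are:

1. It offers multiple search methods.
2. It is fast.
3. Indexes can be kept in memory or stored on disk.
4. It is implemented in C++ and provides Python wrappers.
5. Most algorithms have GPU implementations.

Some quick links for finding more information: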

GitHub
Official documentation
 C++ class reference
 Troubleshooting
Official installation guide

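Installation

The documentation covers building from source, conda, and other installation methods. Because building from source on our company servers requires permissions we do not have, we usually install the Python module with conda.

# Update conda
conda update conda
# Install mkl first
conda install mkl
# faiss ships CPU and GPU builds; install the one that matches your server
conda install faiss-cpu -c pytorch # cpu
conda install faiss-gpu -c pytorch # gpu
# Verify that the installation succeeded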
python -c "import faiss"

Quick Start

We start with the official demo to get a feel for how faiss is used.

First, build the training data and the query data:

import numpy as np
d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

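The code above builds the training data xb with shape [100000, 64] and the query data xq with shape [10000, 64].
Next, create an index. A faiss index preprocesses the vectors to make queries more efficient.
faiss offers many index types; here we choose the simplest one, IndexFlatL2, which performs brute-force search using L2 distance.
The vector dimension d must be specified when an index is created. Most indexes require a training step; IndexFlatL2 skips it.
Once the index has been created and trained (if required), we can call add and search. add usually inserts the same samples that were used for training, and search finds the vectors most similar to the queries.
Some indexes can store integer IDs: each vector can be assigned an ID, and a query returns the IDs of the similar vectors together with their similarity (or distance). If no IDs are given, vectors are numbered sequentially from 0 in the order they were added. IndexFlatL2 does not support custom IDs.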

import faiss                   # make faiss available
index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)
index.add(xb)                  # add vectors to the index
print(index.ntotal)

Once the index contains vectors, we can pass in query vectors to find similar ones.

k = 4                          # we want to see 4 nearest neighbors
D, I = index.search(xq, k)     # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(D[-5:])                  # distances for the 5 last queries

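In the code above, we ask for the 4 nearest neighbors of each query vector. search returns two numpy arrays, D and I, each of shape [nq, k]. D holds the distances to the similar vectors, and I holds the IDs of those vectors.

The output below has the form produced by a sanity check that searches the first five database vectors themselves, index.search(xb[:5], k): each vector's nearest neighbor is itself, at distance 0.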

[[  0 393 363  78]
 [  1 555 277 364]
 [  2 304 101  13]
 [  3 173  18 182]
 [  4 288 370 531]]

[[ 0.          7.17517328  7.2076292   7.25116253]
 [ 0.          6.32356453  6.6845808   6.79994535]
 [ 0.          5.79640865  6.39173603  7.28151226]
 [ 0.          7.27790546  7.52798653  7.66284657]
 [ 0.          6.76380348  7.29512024  7.36881447]]
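
To make the brute-force idea concrete, here is a minimal pure-Python sketch of what an exhaustive L2 search computes. It is an illustration only, not how faiss is implemented; the names flat_search, toy_xb and toy_xq and the toy data are made up for this example. It uses squared L2 distances, which is also what faiss documents IndexFlatL2 as reporting.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_search(database, queries, k):
    """Compare every query with every database vector (no index structure).

    Returns (D, I): for each query, the k smallest squared distances and the
    positions of the matching vectors, numbered from 0 in insertion order.
    """
    D, I = [], []
    for q in queries:
        scored = [(squared_l2(q, v), i) for i, v in enumerate(database)]
        best = heapq.nsmallest(k, scored)
        D.append([dist for dist, _ in best])
        I.append([idx for _, idx in best])
    return D, I

# Hypothetical toy data: four 2-D database vectors and two queries.
toy_xb = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_xq = [[0.9, 0.1], [0.0, 0.0]]

D, I = flat_search(toy_xb, toy_xq, k=2)
print(I)  # [[1, 0], [0, 1]]
print(D)
```

The second query is identical to database vector 0, so its first distance is exactly 0, mirroring the zero column in the output above.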

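Faster Search

When there are too many vectors to store, brute-force search with IndexFlatL2 becomes slow. This section introduces IndexIVFFlat, an index that speeds up search. IVF stands for inverted file: k-means is used to build cluster centroids, a query first finds its nearest centroids, and it is then compared against all vectors in those clusters to find the similar ones.

Creating an IndexIVFFlat requires another index to act as the quantizer, which computes the distance or similarity to the cluster centroids.

Unlike IndexFlatL2, this index must be trained before add is called.

The parameters used in the example are described briefly below.

faiss.METRIC_L2: faiss defines two metrics for measuring similarity, faiss.METRIC_L2 and faiss.METRIC_INNER_PRODUCT. The first is Euclidean distance and the second is the inner product of the vectors.

nlist: the number of cluster centroids.

k: the number of most similar vectors to return.

index.nprobe: the number of clusters to search at query time; the default is 1.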
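
As a small illustration of the two metrics, the sketch below computes both by hand for made-up vectors (query, near and far are hypothetical names). With L2, a smaller value means a closer vector; with inner product, a larger value means a more similar vector, so the two metrics can rank the same candidates differently.

```python
def squared_l2(a, b):
    """Squared Euclidean distance: smaller means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product(a, b):
    """Inner product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 2.0]
near = [1.0, 3.0]
far = [4.0, 6.0]

print(squared_l2(query, near), squared_l2(query, far))        # 1.0 25.0
print(inner_product(query, near), inner_product(query, far))  # 7.0 16.0
```

Here L2 prefers near, while the inner product prefers far because of its larger norm.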

Example code for IndexIVFFlat:

nlist = 100                       # number of cluster centroids
k = 4
quantizer = faiss.IndexFlatL2(d)  # the other index
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
       # METRIC_L2 is passed explicitly; the default metric has differed between faiss versions
assert not index.is_trained
index.train(xb)
assert index.is_trained

index.add(xb)                  # add may be a bit slower as well
D, I = index.search(xq, k)     # actual search
print(I[-5:])                  # neighbors of the 5 last queries
index.nprobe = 10              # default nprobe is 1, try a few more
D, I = index.search(xq, k)
print(I[-5:])                  # neighbors of the 5 last queries
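
The inverted-file idea can be sketched in pure Python as follows. This is a toy model under simplifying assumptions (a few plain Lloyd iterations seeded with the first nlist vectors, Python lists as inverted lists), not faiss's implementation; every function and variable name here is made up for the illustration.

```python
import random

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, v, n=1):
    """Indices of the n centroids closest to v."""
    order = sorted(range(len(centroids)), key=lambda c: squared_l2(centroids[c], v))
    return order[:n]

def train_kmeans(vectors, nlist, iterations=10):
    """A few Lloyd iterations; the first nlist vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:nlist]]
    for _ in range(iterations):
        buckets = [[] for _ in range(nlist)]
        for v in vectors:
            buckets[nearest(centroids, v)[0]].append(v)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_inverted_lists(vectors, centroids):
    """Store each vector's position in the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[nearest(centroids, v)[0]].append(i)
    return lists

def ivf_search(vectors, centroids, lists, q, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to q."""
    candidates = [i for c in nearest(centroids, q, nprobe) for i in lists[c]]
    candidates.sort(key=lambda i: squared_l2(vectors[i], q))
    return candidates[:k]

rng = random.Random(1234)
toy_xb = [[rng.random() for _ in range(4)] for _ in range(200)]
toy_q = [rng.random() for _ in range(4)]

toy_nlist = 8
centroids = train_kmeans(toy_xb, toy_nlist)
lists = build_inverted_lists(toy_xb, centroids)

exact = sorted(range(len(toy_xb)), key=lambda i: squared_l2(toy_xb[i], toy_q))[:4]
fast = ivf_search(toy_xb, centroids, lists, toy_q, k=4, nprobe=1)
full = ivf_search(toy_xb, centroids, lists, toy_q, k=4, nprobe=toy_nlist)
print(exact, fast, full)
```

With nprobe equal to the number of clusters, every vector is scanned and the result matches the exhaustive search; smaller nprobe values scan fewer vectors and may miss some true neighbors, which is the speed/accuracy trade-off that index.nprobe controls.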

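Reducing Memory Usage

Versions released after 2018-02-22 can store inverted indexes on disk; see the demo for usage.

The two indexes seen so far, IndexFlatL2 and IndexIVFFlat, both store every vector in full in memory. To handle large datasets, faiss provides a compression scheme based on the product quantizer (PQ) that encodes each vector into a specified number of bytes. The stored vectors are then compressed, and the distances returned by a query are approximate. Details of product quantization are easy to find elsewhere.

The demo is given below. It is similar to IndexIVFFlat, but uses IndexIVFPQ.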

nlist = 100
m = 8                             # number of bytes per vector
k = 4
quantizer = faiss.IndexFlatL2(d)  # this remains the same
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
                                    # 8 specifies that each sub-vector is encoded as 8 bits
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
index.nprobe = 10              # make comparable with experiment above
D, I = index.search(xq, k)     # search
print(I[-5:])

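Earlier we set the dimension to d = 64 with float32 vectors, so each vector occupies 64 * 32 / 8 = 256 bytes. Here each vector is compressed to 8 bytes, so the compression ratio is (64*32/8) / 8 = 32.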
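
The same arithmetic as a short Python check. It counts only the vector data itself and ignores any per-vector bookkeeping an index may add; raw_bytes and code_bytes are names chosen for this illustration.

```python
d = 64               # vector dimension used throughout this article
bytes_per_float = 4  # float32
m = 8                # number of sub-quantizers passed to IndexIVFPQ
nbits = 8            # bits per sub-quantizer code

raw_bytes = d * bytes_per_float   # size of one uncompressed vector
code_bytes = m * nbits // 8       # size of one PQ code
print(raw_bytes, code_bytes, raw_bytes / code_bytes)  # 256 8 32.0
```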

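The sanity-check results are shown below. The distance from the first vector to itself is 1.40704751 rather than 0 because, as noted above, the returned distances are approximate. The nearest neighbor of each vector is still found correctly (it is the vector itself), although the remaining neighbors differ from the exact-search results shown earlier.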

[[   0  608  220  228]
 [   1 1063  277  617]
 [   2   46  114  304]
 [   3  791  527  316]
 [   4  159  288  393]]

[[ 1.40704751  6.19361687  6.34912491  6.35771513]
 [ 1.49901485  5.66632462  5.94188499  6.29570007]
 [ 1.63260388  6.04126883  6.18447495  6.26815748]
 [ 1.5356375   6.33165455  6.64519501  6.86594009]
 [ 1.46203303  6.5022912   6.62621975  6.63154221]]
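
Why a vector's distance to itself is no longer 0 can be seen in a toy product quantizer. The sketch below uses hand-written, hypothetical codebooks (2 sub-spaces with 2 centroids each) instead of trained ones, and it leaves out the refinements faiss applies, so treat it purely as an illustration of the encode/decode step.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical codebooks: m = 2 sub-spaces, 2 centroids per sub-space.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # centroids for dimensions 0-1
    [[0.0, 1.0], [1.0, 0.0]],   # centroids for dimensions 2-3
]

def split(v, m):
    """Cut v into m consecutive sub-vectors of equal length."""
    step = len(v) // m
    return [v[j * step:(j + 1) * step] for j in range(m)]

def encode(v):
    """Replace each sub-vector by the index of its nearest centroid."""
    return [min(range(len(book)), key=lambda c: squared_l2(book[c], sub))
            for book, sub in zip(codebooks, split(v, len(codebooks)))]

def decode(code):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    return [x for book, c in zip(codebooks, code) for x in book[c]]

v = [0.75, 1.0, 0.25, 1.0]
code = encode(v)
approx = decode(code)
print(code, approx, squared_l2(v, approx))  # [1, 0] [1.0, 1.0, 0.0, 1.0] 0.125
```

Only the short code is stored, so a distance computed against the stored vector is really a distance to its reconstruction, which is 0.125 away from the original here.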

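Simplifying Index Construction

As IndexIVFFlat and IndexIVFPQ show, constructing them requires supplying another index first. faiss also provides methods such as PCA and LSH, and these are sometimes combined. Building such combinations by hand is tedious, so faiss lets you construct an index from a string description.
For example, the expression below creates the same IndexIVFPQ as above.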

index = faiss.index_factory(d, "IVF100,PQ8")

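One point the documentation does not mention: the C++ code shows that index_factory takes a third parameter, the metric described above. It accepts the two values listed earlier.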

Index *index_factory (int d, const char *description_in, MetricType metric)

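More combinations can be found in the demo.

The shorthand for each index type is listed in Basic indexes.

Using GPUs

Note that some indexes do not support GPUs; Basic indexes lists which ones do and which do not.

faiss.get_num_gpus() reports how many GPUs are available.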

ngpus = faiss.get_num_gpus()
print("number of GPUs:", ngpus)

Complete examples of using GPUs:

1. Using a single GPU

# build a flat (CPU) index
index_flat = faiss.IndexFlatL2(d)
# allocate GPU resources, then move the index to GPU 0
res = faiss.StandardGpuResources()
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)

2. Using all GPUs

cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index) # build the index

gpu_index.add(xb)              # add vectors to the index
print(gpu_index.ntotal)

k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index.search(xq, k) # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries