Annoy Nearest-Neighbor Algorithm

Date 2019-12-13

Tags annoy, nearest-neighbor algorithm


Annoy

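Pick two points at random, use them as the initial centroids, and run k-means with two clusters; the process converges to two cluster centroids. The hyperplane equidistant from these two centroids splits the points into two subsets, and the same procedure is applied recursively to each subset to build a binary tree.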
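The split step can be sketched in a few lines of pure Python. This is only an illustration under simplifying assumptions, not Annoy's actual C++ code: the function names (`two_means_split`, `hyperplane_from_centroids`, `side`), the fixed iteration count, and the seed are all made up for this example.

```python
import random

def two_means_split(points, iterations=10, seed=0):
    """Pick two random points as initial centroids, then refine them with 2-means."""
    rng = random.Random(seed)
    c1, c2 = (list(p) for p in rng.sample(points, 2))
    for _ in range(iterations):
        left, right = [], []
        for p in points:
            d1 = sum((a - b) ** 2 for a, b in zip(p, c1))
            d2 = sum((a - b) ** 2 for a, b in zip(p, c2))
            (left if d1 <= d2 else right).append(p)
        if not left or not right:
            break  # degenerate assignment, keep the current centroids
        # Move each centroid to the mean of the points assigned to it.
        c1 = [sum(col) / len(left) for col in zip(*left)]
        c2 = [sum(col) / len(right) for col in zip(*right)]
    return c1, c2

def hyperplane_from_centroids(c1, c2):
    """Hyperplane equidistant from c1 and c2, written as normal . x + offset = 0."""
    normal = [a - b for a, b in zip(c1, c2)]
    midpoint = [(a + b) / 2 for a, b in zip(c1, c2)]
    offset = -sum(n * m for n, m in zip(normal, midpoint))
    return normal, offset

def side(normal, offset, point):
    """Signed margin: positive on c1's side of the hyperplane, negative on c2's."""
    return sum(n * x for n, x in zip(normal, point)) + offset

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
c1, c2 = two_means_split(points)
normal, offset = hyperplane_from_centroids(c1, c2)
for p in points:
    print(p, "left" if side(normal, offset, p) > 0 else "right")
```

The sign of the margin decides which child a point goes to, which is the same test a query later uses to walk down the tree.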

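The leaves at the bottom of the binary tree store the original data points; every other (internal) node stores the information of its splitting hyperplane.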
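A minimal sketch of such a tree follows. The `Node` layout, the `leaf_size` parameter, and the helper names are assumptions made for illustration and do not mirror Annoy's on-disk node format. To keep the example short, the split uses the hyperplane halfway between two randomly chosen points, without any 2-means refinement.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    # Internal node: splitting hyperplane (normal . x + offset = 0) and two children.
    normal: Optional[List[float]] = None
    offset: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    # Leaf node: ids of the original data points that ended up here.
    items: Optional[List[int]] = None

def margin(node, v):
    """Signed margin of v relative to the node's hyperplane."""
    return sum(n * x for n, x in zip(node.normal, v)) + node.offset

def build_tree(data, ids, leaf_size, rng):
    """Recursively split ids until each leaf holds at most leaf_size points."""
    if len(ids) <= leaf_size:
        return Node(items=list(ids))
    a, b = (data[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = -sum(n * (x + y) / 2 for n, x, y in zip(normal, a, b))
    node = Node(normal=normal, offset=offset)
    left = [i for i in ids if margin(node, data[i]) > 0]
    right = [i for i in ids if margin(node, data[i]) <= 0]
    if not left or not right:
        # Degenerate split (e.g. duplicate points): stop and keep a larger leaf.
        return Node(items=list(ids))
    node.left = build_tree(data, left, leaf_size, rng)
    node.right = build_tree(data, right, leaf_size, rng)
    return node

rng = random.Random(0)
data = [[rng.random() for _ in range(4)] for _ in range(50)]
root = build_tree(data, list(range(len(data))), leaf_size=5, rng=rng)
print("root is internal:", root.items is None)
```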

However, the description above leaves two problems:

(1) What if the leaf that the query finally lands in holds fewer data points than the Top N similar neighbors we need?

(2) What if two nearby data points are split into different branches of the binary tree?

These problems can be addressed as follows:

(1) If both sides of a splitting hyperplane look similarly close to the query, traverse both sides.

(2) Build multiple binary trees to form a forest.

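(3) Insert the neighbor candidates returned by all trees into a priority queue, take their union and remove duplicates, then compute each candidate's distance to the query point, sort by distance from nearest to farthest, and return the Top N nearest neighbors.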
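The three steps above can be sketched together as follows. This is one possible reading, not Annoy's implementation: the tuple-based tree layout and the names `build_forest` and `query` are invented for this example, and the `search_k` candidate budget is borrowed in spirit from Annoy's parameter of the same name. A single priority queue is shared by all trees; a branch's priority is the smallest margin seen on the path to it, so the "wrong" side of a hyperplane the query lies close to is visited early. A set deduplicates candidates before the final distance sort.

```python
import heapq
import math
import random

def margin(normal, offset, v):
    return sum(n * x for n, x in zip(normal, v)) + offset

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_tree(data, ids, leaf_size, rng):
    """Return ('leaf', ids) or ('split', normal, offset, left, right)."""
    if len(ids) <= leaf_size:
        return ("leaf", ids)
    a, b = (data[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    norm = math.sqrt(sum(n * n for n in normal))
    if norm == 0.0:
        return ("leaf", ids)  # duplicate points, cannot split
    normal = [n / norm for n in normal]  # unit normal: margins are real distances
    offset = -sum(n * (x + y) / 2 for n, x, y in zip(normal, a, b))
    left = [i for i in ids if margin(normal, offset, data[i]) > 0]
    right = [i for i in ids if margin(normal, offset, data[i]) <= 0]
    if not left or not right:
        return ("leaf", ids)
    return ("split", normal, offset,
            build_tree(data, left, leaf_size, rng),
            build_tree(data, right, leaf_size, rng))

def build_forest(data, n_trees, leaf_size, rng):
    ids = list(range(len(data)))
    return [build_tree(data, ids, leaf_size, rng) for _ in range(n_trees)]

def query(forest, data, q, top_n, search_k):
    # heapq is a min-heap, so priorities are stored negated; the counter breaks ties.
    heap = [(-math.inf, i, tree) for i, tree in enumerate(forest)]
    heapq.heapify(heap)
    counter = len(forest)
    candidates = set()  # union of leaf contents across trees, deduplicated
    while heap and len(candidates) < search_k:
        neg_priority, _, node = heapq.heappop(heap)
        if node[0] == "leaf":
            candidates.update(node[1])
            continue
        _, normal, offset, left, right = node
        m = margin(normal, offset, q)
        priority = -neg_priority
        # Both children are queued; the far side gets a lower (negative) priority.
        heapq.heappush(heap, (-min(priority, m), counter, left))
        heapq.heappush(heap, (-min(priority, -m), counter + 1, right))
        counter += 2
    ranked = sorted(candidates, key=lambda i: euclidean(q, data[i]))
    return ranked[:top_n]

rng = random.Random(0)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
forest = build_forest(data, n_trees=5, leaf_size=10, rng=rng)
print(query(forest, data, data[0], top_n=5, search_k=50))
```

Because the loop keeps popping branches until enough candidates are collected, a leaf with fewer than Top N points no longer ends the search, and a neighbor that one tree separated from the query can still be found through another tree or through the far side of a close split.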

Summary of features

Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
Works better if you don't have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
Small memory usage
Lets you share memory between multiple processes
Index creation is separate from lookup (in particular you can not add more items once the tree has been created)
Native Python support, tested with 2.7, 3.6, and 3.7.
Build index on disk to enable indexing big datasets that won't fit into memory (contributed by Rene Hollander)
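The cosine/Euclidean equivalence mentioned in the feature list follows from the fact that for unit vectors u and v, ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v = 2 - 2*cos(u, v). A small numeric check in plain Python (the helper names and the two sample vectors are arbitrary choices for this example):

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]

# Euclidean distance between the normalized vectors ...
lhs = euclidean(normalize(u), normalize(v))
# ... versus the closed form from the feature list.
rhs = math.sqrt(2 - 2 * cosine_similarity(u, v))

print(lhs, rhs, math.isclose(lhs, rhs))
```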

The total number of nodes is roughly twice the number of training data points: https://github.com/spotify/annoy/issues/338

When the file is written, data is written to disk