Annoy nearest-neighbor algorithm

Annoy

   

   

Pick two points at random, use them as the initial centers, and run k-means with two clusters; after convergence this yields two cluster centers.
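The split step can be sketched in plain Python. This is a minimal illustration, not Annoy's actual implementation: the function names `two_means_split` and `hyperplane`, the fixed iteration cap, and the choice of the perpendicular bisector of the two centers as the splitting hyperplane are assumptions made for the example.

```python
import random


def two_means_split(points, iterations=20, seed=0):
    """Run 2-means on `points` (a list of equal-length float lists)."""
    rng = random.Random(seed)
    # Two distinct randomly chosen points serve as the initial centers.
    c0, c1 = (list(p) for p in rng.sample(points, 2))
    for _ in range(iterations):
        left, right = [], []
        for p in points:
            d0 = sum((a - b) ** 2 for a, b in zip(p, c0))
            d1 = sum((a - b) ** 2 for a, b in zip(p, c1))
            (left if d0 <= d1 else right).append(p)
        if not left or not right:
            break  # degenerate split: keep the previous centers
        new_c0 = [sum(col) / len(left) for col in zip(*left)]
        new_c1 = [sum(col) / len(right) for col in zip(*right)]
        if new_c0 == c0 and new_c1 == c1:
            break  # assignments no longer change: converged
        c0, c1 = new_c0, new_c1
    return c0, c1


def hyperplane(c0, c1):
    """Assumed split rule: the plane w.x + b = 0 equidistant from both centers."""
    w = [a - b for a, b in zip(c0, c1)]
    mid = [(a + b) / 2 for a, b in zip(c0, c1)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b
```

A point `x` then falls on the `c0` side when `w.x + b > 0` and on the `c1` side otherwise.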

At the bottom of the binary tree, the leaf nodes store the original data points; all other (internal) nodes store the splitting hyperplane.
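A sketch of that tree layout is shown below. The names `Node`, `build_tree`, `find_leaf`, and `leaf_size` are hypothetical, and for brevity two random points stand in for the converged 2-means centers; it only illustrates the idea that leaves hold item ids while internal nodes hold a hyperplane.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # Internal nodes: a hyperplane (normal, offset) plus two children.
    normal: Optional[List[float]] = None
    offset: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    # Leaf nodes: ids of the original data points.
    items: List[int] = field(default_factory=list)

    @property
    def is_leaf(self):
        return self.normal is None


def margin(node, v):
    """Signed value of v against the node's hyperplane."""
    return sum(w * x for w, x in zip(node.normal, v)) + node.offset


def build_tree(data, ids, leaf_size, rng):
    if len(ids) <= leaf_size:
        return Node(items=list(ids))
    # Simplification: split by the plane equidistant from two random points.
    a, b = (data[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = -sum(w * (x + y) / 2 for w, x, y in zip(normal, a, b))
    node = Node(normal=normal, offset=offset)
    left = [i for i in ids if margin(node, data[i]) > 0]
    right = [i for i in ids if margin(node, data[i]) <= 0]
    if not left or not right:  # e.g. duplicate points: stop splitting
        return Node(items=list(ids))
    node.left = build_tree(data, left, leaf_size, rng)
    node.right = build_tree(data, right, leaf_size, rng)
    return node


def find_leaf(node, v):
    """Descend from the root to the single leaf on v's side of every split."""
    while not node.is_leaf:
        node = node.left if margin(node, v) > 0 else node.right
    return node
```

Calling `find_leaf(root, query).items` returns the candidate ids from one tree, which is exactly the single-path lookup whose weaknesses are discussed next.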

   

   

   

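However, the procedure described above has two problems:

(1) What if the leaf node the query ends up in holds fewer data points than the Top N nearest neighbors we need?

(2) What if two nearby data points are split into different branches of the binary tree?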

   

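These problems can be addressed in the following ways:

(1) If both sides of a splitting hyperplane are similar to the query (the query lies close to the hyperplane), traverse both sides.

(2) Build multiple binary trees to form a forest.

(3) Insert the neighbors returned by all trees into a priority queue, take their union and remove duplicates, compute each candidate's distance to the query point, sort from nearest to farthest, and return the Top N nearest neighbors.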
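The three remedies above can be combined into one search routine. The following is a simplified sketch under stated assumptions, not Annoy's code: `build_tree`, `query_forest`, `leaf_size`, and the candidate budget `search_k` are illustrative names, trees are split by two random points rather than by 2-means, and a shared priority queue ordered by signed distance to the hyperplane decides when the "other side" of a split is worth visiting.

```python
import heapq
import math
import random


def build_tree(data, ids, leaf_size, rng):
    """Return a leaf (list of ids) or an internal node (normal, offset, left, right)."""
    if len(ids) <= leaf_size:
        return ids
    a, b = (data[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = -sum(w * (x + y) / 2 for w, x, y in zip(normal, a, b))
    side = lambda v: sum(w * x for w, x in zip(normal, v)) + offset
    left = [i for i in ids if side(data[i]) > 0]
    right = [i for i in ids if side(data[i]) <= 0]
    if not left or not right:
        return ids
    return (normal, offset,
            build_tree(data, left, leaf_size, rng),
            build_tree(data, right, leaf_size, rng))


def query_forest(data, trees, q, top_n, search_k):
    # One queue shared by all trees. heapq is a min-heap, so priorities are
    # negated; the counter breaks ties so nodes are never compared directly.
    counter = 0
    heap = []
    for tree in trees:
        heap.append((-math.inf, counter, tree))
        counter += 1
    heapq.heapify(heap)

    candidates = set()  # union of all visited leaves, duplicates removed
    while heap and len(candidates) < search_k:
        neg_prio, _, node = heapq.heappop(heap)
        if isinstance(node, list):  # leaf: collect its item ids
            candidates.update(node)
            continue
        normal, offset, left, right = node
        m = sum(w * x for w, x in zip(normal, q)) + offset
        # Both children are queued. The far side gets priority -|m|, so it is
        # reached early only when the query is close to the hyperplane.
        for child, child_m in ((left, m), (right, -m)):
            counter += 1
            heapq.heappush(heap, (-min(-neg_prio, child_m), counter, child))

    # Rank the merged candidates by true distance, nearest first.
    return heapq.nsmallest(top_n, candidates, key=lambda i: math.dist(q, data[i]))
```

Raising `search_k` or the number of trees makes the search visit more leaves, trading speed for recall; with a budget larger than the dataset the result matches a brute-force scan.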

Summary of features

  • Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
  • Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
  • Works better if you don't have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
  • Small memory usage
  • Lets you share memory between multiple processes
  • Index creation is separate from lookup (in particular you can not add more items once the tree has been created)
  • Native Python support, tested with 2.7, 3.6, and 3.7.
  • Build index on disk to enable indexing big datasets that won't fit into memory (contributed by Rene Hollander)
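The cosine/Euclidean relationship stated in the feature list, sqrt(2-2*cos(u, v)), can be checked numerically. This is a small standalone sketch; the helper names `normalize`, `cosine`, and `angular_distance` are made up for the example, and the clamp to zero is only there to guard against floating-point rounding.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity of u and v."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def angular_distance(u, v):
    """sqrt(2 - 2*cos(u, v)), clamped so rounding never yields sqrt of a negative."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine(u, v)))


u, v = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
# Both values agree up to floating-point rounding.
print(math.dist(normalize(u), normalize(v)))
print(angular_distance(u, v))
```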

How many trees build(-1) creates

With build(-1), the total number of nodes ends up at roughly twice the number of training data items: https://github.com/spotify/annoy/issues/338
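To see how many trees build(-1) actually produces for a given dataset, one can build an index and ask it. The sketch below assumes the `annoy` package is installed; the wrapper `count_trees_auto`, the random Gaussian data, and the default sizes are illustrative choices, not part of the library.

```python
import random


def count_trees_auto(n_items=1000, dim=16, seed=42):
    """Build with build(-1) and report (number of items, number of trees)."""
    # Third-party dependency: pip install annoy
    from annoy import AnnoyIndex

    rng = random.Random(seed)
    index = AnnoyIndex(dim, "euclidean")
    index.set_seed(seed)
    for i in range(n_items):
        index.add_item(i, [rng.gauss(0, 1) for _ in range(dim)])
    index.build(-1)  # -1 lets Annoy decide when to stop adding trees
    return index.get_n_items(), index.get_n_trees()
```

Calling `count_trees_auto()` with different `n_items` and `dim` values shows how the automatically chosen tree count varies with the data.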

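build_on_disk issue

When the index is built on disk (the Annoy method is named on_disk_build), the index file is written to disk during the build.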
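A minimal usage sketch of building directly into a file follows, assuming the `annoy` package is installed. The wrapper name `build_on_disk_demo`, the file path argument, and the random data are hypothetical; the Annoy calls used are `on_disk_build`, `add_item`, `build`, `load`, and `get_nns_by_item`.

```python
import random


def build_on_disk_demo(path, n_items=500, dim=8, n_trees=10, seed=1):
    """Build an index straight into `path`, then query it from a second handle."""
    # Third-party dependency: pip install annoy
    from annoy import AnnoyIndex

    rng = random.Random(seed)
    index = AnnoyIndex(dim, "angular")
    index.on_disk_build(path)  # must be called before any add_item
    for i in range(n_items):
        index.add_item(i, [rng.gauss(0, 1) for _ in range(dim)])
    index.build(n_trees)  # the index now lives in the file; no save() call

    reader = AnnoyIndex(dim, "angular")
    reader.load(path)  # memory-maps the file written above
    return reader.get_nns_by_item(0, 5)
```

Because the data goes to the file as items are added and trees are built, the target location needs enough free disk space for the whole index.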
