redis学习笔记之-(3)-HyperLogLogs(HLL)的使用

时间 2020-08-22

标签 redis 学习笔记 hyperloglogs hll 使用栏目 Redis 繁體版

原文原文链接

上一篇: redis学习笔记之-(2)-bitmap用法之2-上亿用户1周连续活跃用户数统计redis

3.1 HLL简介:Counting unique things

bitmap能够统计活跃用户数 , 甚至能够遍历出是那些用户id; 而 HyperLogLogs(简称HLL)是一个几率数据结构，用于估计集合的基数(只求unique的总数). 本质上, HLL并不是数据结构而是一种基数算法，经过HLL可用极小内存完成海量元素集合基数总量的统计;算法

基数: 不重复元素, 好比{1, 2, 3, 5, 5, 5, 6}, 基数集为{1, 2, 3, 5, 6}, 基数为5;

HLL是一种算法; 只计算基数, 不会存储基数集元素自己; 您并未真正将项目添加到HLL中;
HyperLogLogs 提供了3个命令：pfadd pfcount pfmerge
相似于set集合的scard set
最终获得带有标准偏差的估计量，在Redis实现的状况下，该偏差小于1％shell

近似为0.81％的标准偏差
该算法的神奇之处在于，不需用与所计数项目数成正比的内存量，而是使用恒定数量的内存！segmentfault

在最坏的状况下为12k字节
该偏差小于1％

3.2 关于pfadd命令的前缀pf的来源

redis之父antirez, 在他的博客中有说起: -->看来是为了致敬Philippe Flajolet数据结构

Redis new data structure: the HyperLogLogapp

There is a class of algorithms that use randomization in order to provide an approximation of the number of unique elements in a set using just a constant, and small, amount of memory. The best of such algorithms currently known is called HyperLogLog, and is due to Philippe Flajolet.

3.3 命令

加入一个元素: pfadd key element [element...]
计算已加入的元素的unique数: pfcount key [key...]
合并多个统计的hyperloglog为一个: pfmerge destkey sourcekey [sourcekey...]

3.4 案例使用

每有一个新元素, 使用PFADD加入计数;
想要查看当前加入元素的unique量的近似值, 使用PFCOUNT;
也能够合并多个HyperLogLog的基数到一个中, 使用PFMERGE;

(1). 简单案例: 两个hyperloglog, 一个hll1, 一个hll2, 最后合并两个到 hllres中:dom

127.0.0.1:6379> pfadd hll1 a b c d e
(integer) 1
127.0.0.1:6379> pfadd hll1 a b c e f
(integer) 1
127.0.0.1:6379> pfadd hll1 h
(integer) 1
127.0.0.1:6379> pfcount hll1
(integer) 7
127.0.0.1:6379> pfadd hll2 h i j k l m
(integer) 1
127.0.0.1:6379> pfadd hll2 k l m n o o p
(integer) 1
127.0.0.1:6379> pfcount hll2
(integer) 9
127.0.0.1:6379> pfmerge hllres hll1 hll2
OK
127.0.0.1:6379> pfcount hllres
(integer) 15

(2). 实际生产中咱们能够操做的数据集可能有: 能够是IP、Email、ID等;ide

IP : 统计网站的日UV;学习

email: 统计全部收到邮件的总数;网站

ID: 统计海量日志中记录的访问ID的不重复值的总数;

(3). 实用: 假若有一个需求是:

经过摄像头拍摄记录经海路地铁站的十字路口的交通情况, 计算一天通过的车辆总集合的基数, 用于估算附近车辆总数. ==> 目标很简单, 就是记录通过红绿灯的车牌号, 一成天的unique总量;

来统计经过路口的汽车的总量: 假设咱们上午统计一次(hll_traff_am), 下午统计一次(hll_traff_pm), 最后汇总(hll_traff_allday)

#1. 上午统计入 hyperloglog hll_traff_am
127.0.0.1:6379> pfadd hll_traff_am 京AF1234 京Q12345 陕A7804s 京AS2235 京P82333 陕A7804s 京AS2235 陕A7804s
(integer) 1
#2. 下午统计入 hyperloglog hll_traff_pm
127.0.0.1:6379> pfadd hll_traff_pm 京A00001 京B00001 陕A00001 京AS2235 京P00001 陕A7804s 陕A7804s
(integer) 1
127.0.0.1:6379> pfcount hll_traff_am
(integer) 5
127.0.0.1:6379> pfcount hll_traff_pm
(integer) 6
#3. 合并到总体的统计: hll_traff_allday
127.0.0.1:6379> pfmerge hll_traff_allday hll_traff_am hll_traff_pm
OK
127.0.0.1:6379> pfcount hll_traff_allday
(integer) 9
127.0.0.1:6379>

当车流量更大, 时间更多, 好比一全年, 这个统计更高效;

注意: 这个得是在能容忍有<1%的偏差率的前提下才可用; 好比, 我算DAU, 网站的日访问量, 我能容忍实际的101w次, 实际只获得100w;