hive中标准误差函数stddev()详细讲解

1.标准误差概念函数

标准误差(Std Dev,Standard Deviation) -统计学名词。一种度量数据分布的分散程度之标准,用以衡量数据值偏离算术平均值的程度。标准误差越小,这些值偏离平均值就越少,反之亦然。标准误差的大小可经过标准误差与平均值的倍率关系来衡量。spa

例如,A、B两组各有6位学生参加同一次语文测验,A组的分数为9五、8五、7五、6五、5五、45,B组的分数为7三、7二、7一、6九、6八、67。这两组的平均数都是70,但A组的标准差应该是17.078分,B组的标准差应该是2.160分,说明A组学生之间的差距要比B组学生之间的差距大得多。3d

标准误差又分为整体标准误差与样本标准误差code

整体标准误差:针对整体数据的误差,因此要平均,
 
样本标准误差,也称实验标准误差:针对从整体抽样,利用样原本计算整体误差,为了使算出的值与整体水平更接近,就必须将算出的标准误差的值适度放大,即,
 
 
2.标准误差计算公式:
 
样本标准误差
   
   
表明所采用的样本X1,X2,...,Xn的均值。
整体标准误差
   
   
表明整体X的均值。
例:有一组数字分别是200、50、100、200,求它们的样本标准误差。
 
= (200+50+100+200)/4 = 550/4 = 137.5
 
= [(200-137.5)^2+(50-137.5)^2+(100-137.5)^2+(200-137.5)^2]/(4-1)
样本标准误差 S = Sqrt(S^2)=75, 注:八年级(下册)上海科学技术出版 21.2数据的离散程度中的标准差是整体标准差
 
3.hive中的标准误差函数  stddev_pop(),stddev_samp(),stddev()
stddev_pop()  整体标准方差,stddev_samp() 样本标准方差
 
(1) hive引擎计算标准误差
select col, stddev_pop(num),stddev_samp(num),stddev(num) as stddev_col from ( select 'A' as col, '1' as num union all select 'A' as col, '2' as num union all select 'A' as col, '3' as num union all select 'B' as col, '1' as num union all select 'B' as col, '2' as num ) as a group by col ;

查询结果:orm

 

 (2)spark引擎查询标准误差
select col, stddev_pop(num),stddev_samp(num),stddev(num) as stddev_col from ( select 'A' as col, '1' as num union all select 'A' as col, '2' as num union all select 'A' as col, '3' as num union all select 'B' as col, '1' as num union all select 'B' as col, '2' as num ) as a group by col

查询结果blog

由上可看出,hive中stddev()函数默认计算整体标准误差,spark 中stddev()函数默认计算样本标准误差it

 
4.stddev()也可用于窗口函数
select col, stddev(num) over(partition by col) as stddev_col from ( select 'A' as col, '1' as num union all select 'A' as col, '2' as num union all select 'A' as col, '3' as num union all select 'B' as col, '1' as num union all select 'B' as col, '2' as num ) as a

查询结果:spark

 

5. 当计算的输入数据只有一行时 ,hive和spark计算标准方差的结果
(1)hive
select col, stddev_pop(num),stddev_samp(num),stddev(num) as stddev_col from ( select 'A' as col, '1' as num union all select 'B' as col, '2' as num ) as a group by col ;

  查询结果:io

(2)sparkform

select col, stddev_pop(num),stddev_samp(num),stddev(num) as stddev_col from ( select 'A' as col, '1' as num union all select 'B' as col, '2' as num ) as a group by col ;

 查询结果:

相关文章
相关标签/搜索