Querying the Top 2 Rows of Each Group After Grouping

Date: 2019-12-25

Tags: grouping, query, per group


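Recently a requirement came up at work: group the rows of a table, then take the first N rows of each group under a given ordering. At first glance this looked like an ordinary query with nothing hard about it. Once I started writing, though, I was stuck for quite a while. After much thought and a lot of searching, I found that this is in fact a classic topic in the SQL world. Below I list the methods I collected, as a starting point for discussion.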

Before the main text, a description of the sample table structure.

Table SectionTransactionLog, a log table recording the activities of each department:
    SectionId: department Id
    SectionTransactionType: activity type
    TotalTransactionValue: activity cost
    TransactionDate: activity time

The scenario we set up: select the two most recent activities held by each department (SectionId).

The SectionTransactionLog table used for testing holds more than 3,000,000 rows.
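For readers who want to reproduce the tests, the table might be declared roughly as follows. This is only a sketch: the original does not give the DDL, so every data type, the identity Id primary key (the queries in this article do reference an Id column), and the NOT NULL constraints are assumptions.

```sql
-- Hypothetical DDL: the column names come from the description above,
-- but every data type and constraint here is an assumption.
CREATE TABLE SectionTransactionLog (
    Id                     INT IDENTITY(1,1) PRIMARY KEY, -- row Id used by the queries
    SectionId              INT           NOT NULL,        -- department Id
    SectionTransactionType INT           NOT NULL,        -- activity type
    TotalTransactionValue  DECIMAL(18,2) NOT NULL,        -- activity cost
    TransactionDate        DATETIME      NOT NULL         -- activity time
);
```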

1. Nested subquery approach


    SELECT * FROM SectionTransactionLog mLog
    where
    (select COUNT(*) from SectionTransactionLog subLog
     where subLog.SectionId = mLog.SectionId
       and subLog.TransactionDate >= mLog.TransactionDate) <= 2
    order by SectionId, TransactionDate desc

Run time: 34 seconds

The principle of this first variant is fairly simple: the subquery just determines whether the row is one of the 2 most recent in its Section.

    SELECT * FROM SectionTransactionLog mLog
    where mLog.Id in
    (select top 2 Id
     from SectionTransactionLog subLog
     where subLog.SectionId = mLog.SectionId
     order by TransactionDate desc)
    order by SectionId, TransactionDate desc

Run time: 1 minute 25 seconds

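This second variant sorts by TransactionDate inside the subquery and takes top 2, then uses the in keyword to check whether each row belongs to that subquery's result.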

2. Self-join approach

    select mLog.* from SectionTransactionLog mLog
    inner join
    (SELECT rankLeft.Id, COUNT(*) as rankNum FROM SectionTransactionLog rankLeft
     inner join SectionTransactionLog rankRight
     on rankLeft.SectionId = rankRight.SectionId
        and rankLeft.TransactionDate <= rankRight.TransactionDate
     group by rankLeft.Id
     having COUNT(*) <= 2) subLog on mLog.Id = subLog.Id
    order by mLog.SectionId, mLog.TransactionDate desc

Run time: 56 seconds

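This implementation is rather clever, but also somewhat more complex than the previous ones. The subLog part, built on a self-join of the SectionTransactionLog table, computes for each activity (identified by Id) its rank rankNum within its Section, ordered by TransactionDate.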

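Under the self-join condition rankLeft.SectionId = rankRight.SectionId and rankLeft.TransactionDate <= rankRight.TransactionDate, each activity (identified by Id) is joined only to activities in the same Section that occurred later than it or at the same time (including itself, of course). The figure below illustrates the self-join for the activity with Id=1:

[Figure: self-join rows for the activity with Id=1]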

The figure makes it clear that the count over this result is the rank rankNum of the activity with Id=1 within Section 9022.

Then having COUNT(*) <= 2 keeps the activities ranked within the top 2, and one more join selects the required columns.
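The ranking step can be tried on a tiny data set. The sketch below is only an illustration: the demoLog rows (three activities in Section 9022) are made-up sample data, not the article's test data, and the CTE name is hypothetical.

```sql
-- Hypothetical sample data: three activities in one section.
-- The Ids and dates are invented for illustration only.
WITH demoLog (Id, SectionId, TransactionDate) AS (
    SELECT 1, 9022, CAST('20190101' AS DATETIME) UNION ALL
    SELECT 2, 9022, CAST('20190201' AS DATETIME) UNION ALL
    SELECT 3, 9022, CAST('20190301' AS DATETIME)
)
-- Each row is joined to every row of the same section that is
-- as recent or more recent, so COUNT(*) is its rank by date.
SELECT rankLeft.Id, COUNT(*) AS rankNum
FROM demoLog rankLeft
INNER JOIN demoLog rankRight
    ON rankLeft.SectionId = rankRight.SectionId
   AND rankLeft.TransactionDate <= rankRight.TransactionDate
GROUP BY rankLeft.Id
ORDER BY rankNum;
```

With this sample data the newest row (Id 3) joins only to itself and gets rankNum 1, Id 2 gets 2, and Id 1 gets 3, so having COUNT(*) <= 2 would keep Ids 3 and 2. Note that rows sharing the same TransactionDate count each other, so if three or more rows tie for the latest date in a section, each gets a rank of at least 3 and none of them passes the filter.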

3. Using ROW_NUMBER() (SQL Server 2005 and later)

    select * from
    (
     select *, ROW_NUMBER() over(partition by SectionId order by TransactionDate desc) as rowNum
     from SectionTransactionLog
    ) ranked
    where ranked.rowNum <= 2
    order by ranked.SectionId, ranked.TransactionDate desc

Run time: 20 seconds

This is the most efficient implementation so far. ROW_NUMBER() over(partition by SectionId order by TransactionDate desc) handles the grouping, ordering, and row numbering in a single step.

Efficiency considerations

Below we tabulate the running times of the 4 methods above.

Method	Time (seconds)	Rank
Using ROW_NUMBER()	20	1
Nested subquery 1	34	2
Self-join	56	3
Nested subquery 2	85	4

Of the 4 methods, nested subquery 2 takes the longest. Where does the time go? Is it really because of the in keyword? The figure below shows its execution plan:

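From the plan we can see that the optimizer resolves in into a Left Semi Join, whose cost is very low. Most of the query's cost lies in the subquery's order by (Top N Sort). Indeed, if the order by TransactionDate desc clause is removed from the subquery (the result is then incorrect, of course), the query takes only 8 seconds.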

Adding a suitable index can improve the performance of this query.
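As one possible illustration of such an index, a composite index on the grouping column followed by the ordering column would let the correlated subquery read each Section's rows already sorted by date. The index name is hypothetical, and the original reports no timings with an index in place, so treat this as a sketch rather than a measured fix.

```sql
-- Hypothetical index name; keyed on the grouping column first,
-- then on the sort column in the order the queries request.
CREATE NONCLUSTERED INDEX IX_SectionTransactionLog_SectionId_TransactionDate
    ON SectionTransactionLog (SectionId, TransactionDate DESC);
```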

The most efficient of these methods can be verified and applied as follows:
    select * from
    (
     select *, ROW_NUMBER() over(partition by product_id order by fee desc) as rowNum
     from (select [product_id], [account], sum([debit_share]) fee
           from [products].[dbo].[T_COUNTER_PRODUCT_HOLDER]
           where [debit_share] > 0
           group by [product_id], [account]) t
    ) ranked
    where ranked.rowNum <= 2
    order by ranked.product_id, ranked.fee desc