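Introduction:
With the big-data boom, the story of beer and diapers has become widely known. How can we discover a pattern such as "people who buy beer often also buy diapers"? Apriori, a data mining algorithm for mining frequent itemsets and association rules, can tell us. This article first gives a brief introduction to the Apriori algorithm, then introduces the relevant basic concepts, then describes the algorithm's strategy and steps in detail, and finally presents a Python implementation.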
GitHub code: https://github.com/llhthinker/MachineLearningLab/tree/master/Frequent%20Itemset%20Mining
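Apriori is the classic data mining algorithm for mining frequent itemsets and association rules. In Latin, a priori means "from what comes before". When a problem is defined, prior knowledge or assumptions are usually used, and this is called "a priori". The algorithm's name reflects the fact that it uses prior knowledge of a property of frequent itemsets: every non-empty subset of a frequent itemset must also be frequent.
Apriori uses an iterative approach known as level-wise search, in which k-itemsets are used to explore (k+1)-itemsets. First, the database is scanned to accumulate the count of each item, and the items that satisfy the minimum support are collected to form the set of frequent 1-itemsets, denoted L1. Then L1 is used to find L2, the set of frequent 2-itemsets, L2 is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database. Apriori uses the prior property of frequent itemsets to reduce the search space.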
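The prior property can be checked directly on a small data set. The sketch below is only an illustration: it uses the nine sample transactions from this article, and the helper name `support_count` is hypothetical. It counts how many transactions contain the itemset {l1, l2, l5} and each of its non-empty proper subsets, showing that no subset occurs less often than the full itemset.

```python
from itertools import combinations

# The nine sample transactions used in this article.
transactions = [
    {'l1', 'l2', 'l5'}, {'l2', 'l4'}, {'l2', 'l3'},
    {'l1', 'l2', 'l4'}, {'l1', 'l3'}, {'l2', 'l3'},
    {'l1', 'l3'}, {'l1', 'l2', 'l3', 'l5'}, {'l1', 'l2', 'l3'},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset` (hypothetical helper)."""
    return sum(1 for t in transactions if itemset <= t)


itemset = frozenset(['l1', 'l2', 'l5'])
full_count = support_count(itemset, transactions)

# Count every non-empty proper subset of the itemset.
subset_counts = {}
for size in range(1, len(itemset)):
    for items in combinations(sorted(itemset), size):
        subset_counts[items] = support_count(frozenset(items), transactions)

for items, count in sorted(subset_counts.items()):
    # The last column is always True: a subset is at least as frequent as its superset.
    print(items, count, count >= full_count)
```

Because a subset is contained in every transaction that contains its superset, the counts can only grow as items are removed. This is why an infrequent (k-1)-itemset rules out all of its supersets.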
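Here P(A ∪ B) denotes the probability that a transaction contains the union of sets A and B (that is, it contains every item in A and every item in B). Note the difference from P(A or B), which denotes the probability that a transaction contains A or B.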
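To make the distinction concrete, here is a minimal sketch that estimates these probabilities as relative frequencies over the article's sample transactions. It assumes the usual textbook definitions, support(A => B) = P(A ∪ B) and confidence(A => B) = support(A ∪ B) / support(A). The function names `support` and `confidence` are illustrative and are not taken from the article's code.

```python
# The nine sample transactions used in this article.
transactions = [
    {'l1', 'l2', 'l5'}, {'l2', 'l4'}, {'l2', 'l3'},
    {'l1', 'l2', 'l4'}, {'l1', 'l3'}, {'l2', 'l3'},
    {'l1', 'l3'}, {'l1', 'l2', 'l3', 'l5'}, {'l1', 'l2', 'l3'},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / float(len(transactions))


def confidence(antecedent, consequent, transactions):
    """Assumed definition: support(A ∪ B) / support(A) for the rule A => B."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


A = frozenset(['l1'])
B = frozenset(['l2'])

p_union = support(A | B, transactions)    # transaction contains l1 AND l2
p_either = sum(1 for t in transactions if A <= t or B <= t) / float(len(transactions))
conf = confidence(A, B, transactions)

print(p_union, p_either, conf)
```

On this data P(A ∪ B) is 4/9 while P(A or B) is 1.0, since every transaction contains l1 or l2. The two quantities are clearly different.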
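In general, mining association rules is a two-step process: first find all frequent itemsets, then generate strong association rules from those frequent itemsets.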
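Apriori assumes that the items within an itemset are sorted in lexicographic order. If two elements (itemsets) of Lk-1, itemset1 and itemset2, share the same first (k-2) items, then itemset1 and itemset2 are said to be joinable. Joining itemset1 and itemset2 yields the itemset {itemset1[1], itemset1[2], …, itemset1[k-1], itemset2[k-1]}. The join step is implemented in the create_Ck function in the code below.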
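The join condition can be sketched with sorted tuples. This is a simplified illustration rather than the article's create_Ck. The names `can_join` and `join` are hypothetical, and the extra requirement that the last item of itemset1 be smaller than the last item of itemset2 is an added assumption (the usual textbook device for avoiding duplicate candidates).

```python
def can_join(itemset1, itemset2):
    """Joinable if the first k-2 items match.

    The check on the last items is an added assumption so that each
    candidate is produced only once.
    """
    return itemset1[:-1] == itemset2[:-1] and itemset1[-1] < itemset2[-1]


def join(itemset1, itemset2):
    """Result: itemset1 followed by the last item of itemset2."""
    return itemset1 + (itemset2[-1],)


# Frequent 2-itemsets of the sample data (minimum support 0.2), as sorted tuples.
L2 = [('l1', 'l2'), ('l1', 'l3'), ('l1', 'l5'),
      ('l2', 'l3'), ('l2', 'l4'), ('l2', 'l5')]

# Joining L2 with itself gives the candidate 3-itemsets before pruning.
C3 = [join(a, b) for a in L2 for b in L2 if can_join(a, b)]

for candidate in C3:
    print(candidate)
```

Joining L2 with itself produces six candidate 3-itemsets. Pruning and support counting then decide which of them are actually frequent.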
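Pruning relies on the prior property: no infrequent (k-1)-itemset can be a subset of a frequent k-itemset. Therefore, if any (k-1)-subset of a candidate k-itemset in Ck is not in Lk-1, that candidate cannot be frequent either and can be removed from Ck, which gives a reduced Ck. In the code below, the is_apriori function checks whether a candidate satisfies the prior property, and the create_Ck function includes the pruning step: a candidate that does not satisfy the property is pruned.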
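As an illustration of the pruning test, the sketch below enumerates the (k-1)-subsets of each candidate with itertools.combinations and drops any candidate that has a subset missing from Lk-1. It does the same job as is_apriori but is written independently. The name `has_infrequent_subset` is hypothetical, and the candidate list C3 is assumed to come from joining the sample data's L2 with itself.

```python
from itertools import combinations


def has_infrequent_subset(candidate, L_prev):
    """True if some (k-1)-subset of the k-item candidate is not in L_prev."""
    k = len(candidate)
    return any(frozenset(sub) not in L_prev
               for sub in combinations(candidate, k - 1))


# Frequent 2-itemsets of the sample data (minimum support 0.2).
L2 = {frozenset(pair) for pair in [
    ('l1', 'l2'), ('l1', 'l3'), ('l1', 'l5'),
    ('l2', 'l3'), ('l2', 'l4'), ('l2', 'l5'),
]}

# Candidate 3-itemsets, assumed to come from joining L2 with itself.
C3 = [('l1', 'l2', 'l3'), ('l1', 'l2', 'l5'), ('l1', 'l3', 'l5'),
      ('l2', 'l3', 'l4'), ('l2', 'l3', 'l5'), ('l2', 'l4', 'l5')]

pruned_C3 = [c for c in C3 if not has_infrequent_subset(c, L2)]
print(pruned_C3)
```

Four of the six candidates are discarded without scanning the transactions at all. For example, ('l1', 'l3', 'l5') is dropped because {l3, l5} is not in L2.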
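Based on the reduced Ck, all transactions are scanned and each candidate itemset in Ck is counted. The candidates that do not satisfy the minimum support are then deleted, which yields the frequent k-itemsets Lk. This deletion policy is implemented in the generate_Lk_by_Ck function in the code below.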
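The counting and deletion step might be sketched as follows using collections.Counter. This is a compact variant written for illustration, not the article's generate_Lk_by_Ck. The function name `frequent_itemsets` is hypothetical, and the candidates are simply all pairs of items, which for this data equals the join of L1 with itself because every single item is frequent at a minimum support of 0.2.

```python
from collections import Counter
from itertools import combinations

# The nine sample transactions used in this article.
transactions = [
    {'l1', 'l2', 'l5'}, {'l2', 'l4'}, {'l2', 'l3'},
    {'l1', 'l2', 'l4'}, {'l1', 'l3'}, {'l2', 'l3'},
    {'l1', 'l3'}, {'l1', 'l2', 'l3', 'l5'}, {'l1', 'l2', 'l3'},
]


def frequent_itemsets(transactions, candidates, min_support):
    """Count each candidate in one pass and keep those meeting min_support."""
    counts = Counter()
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    n = float(len(transactions))
    return {c: counts[c] / n for c in candidates if counts[c] / n >= min_support}


items = sorted(set().union(*transactions))
C2 = [frozenset(pair) for pair in combinations(items, 2)]
L2 = frequent_itemsets(transactions, C2, min_support=0.2)

for itemset, sup in sorted(L2.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), round(sup, 3))
```

Of the ten candidate pairs, six reach the minimum support. The remaining four occur in at most one of the nine transactions and are deleted.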
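Once the frequent itemsets have been found, strong association rules can be generated directly from them. The generation steps are as follows: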
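One common way to generate the rules, sketched here under assumptions, is to enumerate every non-empty proper subset s of a frequent itemset l and keep the rule s => (l - s) when support(l) / support(s) reaches the minimum confidence. This follows the usual textbook procedure and differs in structure from the article's generate_big_rules. The function name `rules_from_itemset` is hypothetical, and the support values are counted by hand from the sample data.

```python
from itertools import combinations


def rules_from_itemset(itemset, support_data, min_conf):
    """Return (antecedent, consequent, confidence) for each strong rule."""
    rules = []
    for size in range(1, len(itemset)):
        for items in combinations(sorted(itemset), size):
            antecedent = frozenset(items)
            conf = support_data[itemset] / support_data[antecedent]
            if conf >= min_conf:
                rules.append((antecedent, itemset - antecedent, conf))
    return rules


# Supports of {l1, l2, l5} and its subsets in the nine sample transactions.
n = 9.0
support_data = {
    frozenset(['l1']): 6 / n,
    frozenset(['l2']): 7 / n,
    frozenset(['l5']): 2 / n,
    frozenset(['l1', 'l2']): 4 / n,
    frozenset(['l1', 'l5']): 2 / n,
    frozenset(['l2', 'l5']): 2 / n,
    frozenset(['l1', 'l2', 'l5']): 2 / n,
}

rules = rules_from_itemset(frozenset(['l1', 'l2', 'l5']), support_data, min_conf=0.7)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), "conf:", conf)
```

With a minimum confidence of 0.7, three rules survive for this itemset: {l5} => {l1, l2}, {l1, l5} => {l2} and {l2, l5} => {l1}, each with confidence 1.0.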
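The figure below is the worked example of mining frequent itemsets from Data Mining: Concepts and Techniques (3rd Edition).
This article implements the Apriori algorithm in Python using the data from that example. Two points about the code deserve attention: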
""" # Python 2.7 # Filename: apriori.py # Author: llhthinker # Email: hangliu56[AT]gmail[DOT]com # Blog: http://www.cnblogs.com/llhthinker/p/6719779.html # Date: 2017-04-16 """ def load_data_set(): """ Load a sample data set (From Data Mining: Concepts and Techniques, 3th Edition) Returns: A data set: A list of transactions. Each transaction contains several items. """ data_set = [['l1', 'l2', 'l5'], ['l2', 'l4'], ['l2', 'l3'], ['l1', 'l2', 'l4'], ['l1', 'l3'], ['l2', 'l3'], ['l1', 'l3'], ['l1', 'l2', 'l3', 'l5'], ['l1', 'l2', 'l3']] return data_set def create_C1(data_set): """ Create frequent candidate 1-itemset C1 by scaning data set. Args: data_set: A list of transactions. Each transaction contains several items. Returns: C1: A set which contains all frequent candidate 1-itemsets """ C1 = set() for t in data_set: for item in t: item_set = frozenset([item]) C1.add(item_set) return C1 def is_apriori(Ck_item, Lksub1): """ Judge whether a frequent candidate k-itemset satisfy Apriori property. Args: Ck_item: a frequent candidate k-itemset in Ck which contains all frequent candidate k-itemsets. Lksub1: Lk-1, a set which contains all frequent candidate (k-1)-itemsets. Returns: True: satisfying Apriori property. False: Not satisfying Apriori property. """ for item in Ck_item: sub_Ck = Ck_item - frozenset([item]) if sub_Ck not in Lksub1: return False return True def create_Ck(Lksub1, k): """ Create Ck, a set which contains all all frequent candidate k-itemsets by Lk-1's own connection operation. Args: Lksub1: Lk-1, a set which contains all frequent candidate (k-1)-itemsets. k: the item number of a frequent itemset. Return: Ck: a set which contains all all frequent candidate k-itemsets. 
""" Ck = set() len_Lksub1 = len(Lksub1) list_Lksub1 = list(Lksub1) for i in range(len_Lksub1): for j in range(1, len_Lksub1): l1 = list(list_Lksub1[i]) l2 = list(list_Lksub1[j]) l1.sort() l2.sort() if l1[0:k-2] == l2[0:k-2]: Ck_item = list_Lksub1[i] | list_Lksub1[j] # pruning if is_apriori(Ck_item, Lksub1): Ck.add(Ck_item) return Ck def generate_Lk_by_Ck(data_set, Ck, min_support, support_data): """ Generate Lk by executing a delete policy from Ck. Args: data_set: A list of transactions. Each transaction contains several items. Ck: A set which contains all all frequent candidate k-itemsets. min_support: The minimum support. support_data: A dictionary. The key is frequent itemset and the value is support. Returns: Lk: A set which contains all all frequent k-itemsets. """ Lk = set() item_count = {} for t in data_set: for item in Ck: if item.issubset(t): if item not in item_count: item_count[item] = 1 else: item_count[item] += 1 t_num = float(len(data_set)) for item in item_count: if (item_count[item] / t_num) >= min_support: Lk.add(item) support_data[item] = item_count[item] / t_num return Lk def generate_L(data_set, k, min_support): """ Generate all frequent itemsets. Args: data_set: A list of transactions. Each transaction contains several items. k: Maximum number of items for all frequent itemsets. min_support: The minimum support. Returns: L: The list of Lk. support_data: A dictionary. The key is frequent itemset and the value is support. """ support_data = {} C1 = create_C1(data_set) L1 = generate_Lk_by_Ck(data_set, C1, min_support, support_data) Lksub1 = L1.copy() L = [] L.append(Lksub1) for i in range(2, k+1): Ci = create_Ck(Lksub1, i) Li = generate_Lk_by_Ck(data_set, Ci, min_support, support_data) Lksub1 = Li.copy() L.append(Lksub1) return L, support_data def generate_big_rules(L, support_data, min_conf): """ Generate big rules from frequent itemsets. Args: L: The list of Lk. support_data: A dictionary. The key is frequent itemset and the value is support. 
min_conf: Minimal confidence. Returns: big_rule_list: A list which contains all big rules. Each big rule is represented as a 3-tuple. """ big_rule_list = [] sub_set_list = [] for i in range(0, len(L)): for freq_set in L[i]: for sub_set in sub_set_list: if sub_set.issubset(freq_set): conf = support_data[freq_set] / support_data[freq_set - sub_set] big_rule = (freq_set - sub_set, sub_set, conf) if conf >= min_conf and big_rule not in big_rule_list: # print freq_set-sub_set, " => ", sub_set, "conf: ", conf big_rule_list.append(big_rule) sub_set_list.append(freq_set) return big_rule_list if __name__ == "__main__": """ Test """ data_set = load_data_set() L, support_data = generate_L(data_set, k=3, min_support=0.2) big_rules_list = generate_big_rules(L, support_data, min_conf=0.7) for Lk in L: print "="*50 print "frequent " + str(len(list(Lk)[0])) + "-itemsets\t\tsupport" print "="*50 for freq_set in Lk: print freq_set, support_data[freq_set] print print "Big Rules" for item in big_rules_list: print item[0], "=>", item[1], "conf: ", item[2]
A screenshot of the program output follows:
==============================
References:
Data Mining: Concepts and Techniques (3rd Edition)
Machine Learning in Action