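Association rule-based recommendation builds on association rules: the items a customer has already purchased form the rule head (the antecedent), and the rule body (the consequent) is what gets recommended. Association rule mining uncovers correlations between different items in sales data and has been applied successfully in retail. An association rule measures, within a transaction database, what proportion of the transactions that contain itemset X also contain itemset Y. Intuitively, it captures how strongly buying certain items inclines a customer to buy certain others. For example, many people who buy milk also buy bread.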
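In symbols, the proportion described above is usually called the confidence of the rule, and it is paired with a support measure. The following is a sketch using the standard definitions; D (the transaction database) and t (a single transaction) are notation introduced here, not taken from the original text.

```latex
\mathrm{support}(X \Rightarrow Y)
  = \frac{\left|\{\, t \in D : X \cup Y \subseteq t \,\}\right|}{|D|},
\qquad
\mathrm{confidence}(X \Rightarrow Y)
  = \frac{\left|\{\, t \in D : X \cup Y \subseteq t \,\}\right|}
         {\left|\{\, t \in D : X \subseteq t \,\}\right|}
```

With X = {milk} and Y = {bread}, the confidence is the fraction of milk-buying transactions that also contain bread.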
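The first step of the algorithm, discovering the association rules, is the most critical and the most time-consuming. It is the bottleneck of the algorithm, but it can be run offline. In addition, synonymy among item names (the same product appearing under different names) is another difficulty for association rules.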
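Taking the Apriori algorithm as an example, mining proceeds in two steps:
1. Find all frequent itemsets according to support (frequency)
2. Generate association rules according to confidence (strength)
The main code is as follows: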
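The two steps above can be sketched on a toy data set as follows. This is a minimal illustration rather than the implementation used in this post: the class name `TwoStepAprioriSketch`, its method names, the milk/bread/eggs transactions, and the thresholds are all made up for the example, and it stops at 2-itemsets to stay short. It needs Java 9 or later because of `Set.of` and `List.of`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class TwoStepAprioriSketch {

    // Number of transactions that contain every item of the itemset.
    static int supportCount(List<Set<String>> transactions, Set<String> itemset) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    // Step 1: frequent 1-itemsets and 2-itemsets, filtered by a minimum support count.
    static Map<Set<String>, Integer> frequentItemsets(List<Set<String>> transactions, int minCount) {
        Map<Set<String>, Integer> frequent = new LinkedHashMap<>();
        TreeSet<String> items = new TreeSet<>();
        for (Set<String> t : transactions) {
            items.addAll(t);
        }
        List<String> frequentItems = new ArrayList<>();
        for (String item : items) {
            int c = supportCount(transactions, Set.of(item));
            if (c >= minCount) {
                frequent.put(Set.of(item), c);
                frequentItems.add(item);
            }
        }
        // Pairs are built only from frequent single items (the Apriori property).
        for (int i = 0; i < frequentItems.size(); i++) {
            for (int j = i + 1; j < frequentItems.size(); j++) {
                Set<String> pair = Set.of(frequentItems.get(i), frequentItems.get(j));
                int c = supportCount(transactions, pair);
                if (c >= minCount) {
                    frequent.put(pair, c);
                }
            }
        }
        return frequent;
    }

    // Step 2: rules head->body from frequent pairs, kept when
    // confidence = count(head and body) / count(head) reaches the threshold.
    static Map<String, Double> rules(Map<Set<String>, Integer> frequent, double minConfidence) {
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<Set<String>, Integer> e : frequent.entrySet()) {
            if (e.getKey().size() != 2) {
                continue;
            }
            for (String head : e.getKey()) {
                for (String body : e.getKey()) {
                    if (head.equals(body)) {
                        continue;
                    }
                    double confidence = (double) e.getValue() / frequent.get(Set.of(head));
                    if (confidence >= minConfidence) {
                        result.put(head + "->" + body, confidence);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("milk", "bread"),
                Set.of("milk", "bread", "eggs"),
                Set.of("milk", "eggs"),
                Set.of("bread"));
        Map<Set<String>, Integer> frequent = frequentItemsets(transactions, 2);
        System.out.println(rules(frequent, 0.7)); // prints {eggs->milk=1.0}
    }
}
```

In this toy data, milk->bread has confidence 2/3 and is dropped at the 0.7 threshold, while eggs->milk holds in every transaction that contains eggs.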
package apriori;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import com.jolly.bi.aglorithm.conf.AprioriUnit;

public class AprioriUtils {

    private final static String ITEM_SPLIT = ";"; // separator between items

    // Splits one transaction into its 1-itemsets, each stored as "item;"
    private static HashSet<String> getItem1FC(String trans) {
        HashSet<String> rItem1FcSet = new HashSet<String>(); // 1-itemsets
        String[] items = trans.split(ITEM_SPLIT);
        for (String item : items) {
            rItem1FcSet.add(item + ITEM_SPLIT);
        }
        return rItem1FcSet;
    }

    // Enumerates the itemsets contained in one transaction, level by level.
    // This method does not count support itself.
    public static HashSet<String> getFC(String trans) {
        HashSet<String> frequentCollectionSet = new HashSet<String>(); // all itemsets found
        HashSet<String> tmp_1fc = getItem1FC(trans);
        frequentCollectionSet.addAll(tmp_1fc); // start from the 1-itemsets
        HashSet<String> itemkFcSet = new HashSet<String>();
        itemkFcSet.addAll(tmp_1fc);
        int a = 0;
        // a <= 3 caps the loop at four rounds, i.e. itemsets of at most five items
        while (itemkFcSet != null && itemkFcSet.size() != 0 && a <= 3) {
            HashSet<String> candidateCollection = getCandidateCollection(itemkFcSet);
            itemkFcSet.clear();
            itemkFcSet.addAll(candidateCollection);
            frequentCollectionSet.addAll(itemkFcSet);
            a++;
        }
        return frequentCollectionSet;
    }

    // Joins k-itemsets that share their first k-1 items into (k+1)-item candidates
    public static HashSet<String> getCandidateCollection(HashSet<String> itemkFcSet) {
        HashSet<String> candidateCollection = new HashSet<String>();
        Iterator<String> is1 = itemkFcSet.iterator();
        while (is1.hasNext()) {
            String itemk1 = is1.next();
            Iterator<String> is2 = itemkFcSet.iterator();
            while (is2.hasNext()) {
                String itemk2 = is2.next();
                String[] tmp1 = itemk1.split(ITEM_SPLIT);
                String[] tmp2 = itemk2.split(ITEM_SPLIT);
                String c = "";
                if (tmp1.length == 1) {
                    // 1-itemsets: pair them up in ascending order
                    if (tmp1[0].compareTo(tmp2[0]) < 0) {
                        c = tmp1[0] + ITEM_SPLIT + tmp2[0] + ITEM_SPLIT;
                    }
                } else {
                    // k-itemsets: the first k-1 items must match
                    boolean flag = true;
                    for (int i = 0; i < tmp1.length - 1; i++) {
                        if (!tmp1[i].equals(tmp2[i])) {
                            flag = false;
                            break;
                        }
                    }
                    if (flag && (tmp1[tmp1.length - 1]
                            .compareTo(tmp2[tmp2.length - 1]) < 0)) {
                        c = itemk1 + tmp2[tmp2.length - 1] + ITEM_SPLIT;
                    }
                }
                // Pruning: every k-item subset of the candidate must be in itemkFcSet
                boolean hasInfrequentSubSet = false;
                if (!c.equals("")) {
                    String[] tmpC = c.split(ITEM_SPLIT);
                    for (int i = 0; i < tmpC.length; i++) {
                        String subC = "";
                        for (int j = 0; j < tmpC.length; j++) {
                            if (i != j) {
                                subC = subC + tmpC[j] + ITEM_SPLIT;
                            }
                        }
                        if (!itemkFcSet.contains(subC)) {
                            hasInfrequentSubSet = true;
                            break;
                        }
                    }
                } else {
                    hasInfrequentSubSet = true;
                }
                if (!hasInfrequentSubSet) {
                    candidateCollection.add(c);
                }
            }
        }
        return candidateCollection;
    }

    // Collects every non-empty subset of sourceSet into result
    private static void buildSubSet(List<String> sourceSet,
            List<List<String>> result) {
        // With a single element the recursion ends: the only non-empty subset
        // is the element itself, so add it to result directly
        if (sourceSet.size() == 1) {
            List<String> set = new ArrayList<String>();
            set.add(sourceSet.get(0));
            result.add(set);
        } else if (sourceSet.size() > 1) {
            // With n elements, first collect the subsets of the first n-1 elements into result
            buildSubSet(sourceSet.subList(0, sourceSet.size() - 1), result);
            int size = result.size(); // number of subsets so far, used when appending the n-th element
            // Add the n-th element as a subset of its own
            List<String> single = new ArrayList<String>();
            single.add(sourceSet.get(sourceSet.size() - 1));
            result.add(single);
            // Keep the existing subsets, and add a copy of each of them extended
            // with the n-th element; copying leaves the originals untouched
            List<String> clone;
            for (int i = 0; i < size; i++) {
                clone = new ArrayList<String>();
                for (String str : result.get(i)) {
                    clone.add(str);
                }
                clone.add(sourceSet.get(sourceSet.size() - 1));
                result.add(clone);
            }
        }
    }

    // Generates the rules "reason -> result" for one frequent itemset (key),
    // keeping those whose confidence reaches AprioriUnit._CONFIDENCE
    public static Map<String, Double> getRelationRules(
            String key, Map<String, Integer> frequentCollectionMap) {
        Map<String, Double> relationRules = new HashMap<String, Double>();
        double countAll = frequentCollectionMap.get(key);
        String[] keyItems = key.split(ITEM_SPLIT);
        if (keyItems.length > 1) {
            List<String> source = new ArrayList<String>();
            Collections.addAll(source, keyItems);
            List<List<String>> result = new ArrayList<List<String>>();
            buildSubSet(source, result); // all non-empty subsets of source
            for (List<String> itemList : result) {
                if (itemList.size() < source.size()) { // proper subsets only
                    List<String> otherList = new ArrayList<String>();
                    for (String sourceItem : source) {
                        if (!itemList.contains(sourceItem)) {
                            otherList.add(sourceItem);
                        }
                    }
                    String reasonStr = ""; // antecedent
                    String resultStr = ""; // consequent
                    for (String item : itemList) {
                        reasonStr = reasonStr + item + ITEM_SPLIT;
                    }
                    for (String item : otherList) {
                        resultStr = resultStr + item + ITEM_SPLIT;
                    }
                    Integer countReason = frequentCollectionMap.get(reasonStr);
                    if (countReason == null) {
                        continue; // no count for the antecedent, so no confidence can be computed
                    }
                    double itemConfidence = countAll / countReason; // confidence
                    if (itemConfidence >= AprioriUnit._CONFIDENCE) {
                        String rule = reasonStr + AprioriUnit._CON + resultStr;
                        relationRules.put(rule, itemConfidence);
                    }
                }
            }
        }
        return relationRules;
    }
}
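Example application scenario:
(1) The input is the list of all item ids in each order
125403,185733,196095,117965,201975,212841,181789
149693,210991,13992,64312,54796,194527
(2) Compute all frequent itemsets, filtering by support
(3) Apply the formula to obtain all recommendations that satisfy the confidence threshold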
104138,196705,0.1875
104138,196705,0.1800
Conclusion: customers who bought item 104138 most often also bought item 196705.
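Reading a conclusion like this off the output can be sketched as below. The sketch assumes each output line has the form `purchased item,recommended item,confidence`, which is inferred from the sample above rather than stated anywhere. The class name `TopRecommendationSketch` and the method `bestRecommendation` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopRecommendationSketch {

    // For each purchased item, keeps the recommended item with the highest confidence.
    // Assumed line format: "purchased,recommended,confidence".
    static Map<String, String> bestRecommendation(List<String> ruleLines) {
        Map<String, String> bestBody = new HashMap<>();
        Map<String, Double> bestConfidence = new HashMap<>();
        for (String line : ruleLines) {
            String[] parts = line.split(",");
            if (parts.length != 3) {
                continue; // skip lines that do not match the assumed format
            }
            String head = parts[0].trim();
            String body = parts[1].trim();
            double confidence = Double.parseDouble(parts[2].trim());
            Double current = bestConfidence.get(head);
            if (current == null || confidence > current) {
                bestConfidence.put(head, confidence);
                bestBody.put(head, body);
            }
        }
        return bestBody;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "104138,196705,0.1875",
                "104138,196705,0.1800");
        System.out.println(bestRecommendation(lines)); // prints {104138=196705}
    }
}
```

Keeping only the best consequent per item is one simple choice. A real recommender would more likely keep the top few consequents per item, ordered by confidence.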