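This chapter is about how to make recommendations. The most widely used recommendation algorithms today are Collaborative Filtering, Cluster Models, Search-Based Methods, and Item-to-Item Collaborative Filtering. The first two work by finding similar customers; the last two work by finding similar items. The paper "Amazon.com Recommendations: Item-to-Item Collaborative Filtering" describes all four. This chapter focuses on Collaborative Filtering and Item-to-Item Collaborative Filtering.
Collaborative Filtering: search a large set of customer data to find the small group of people whose tastes resemble yours. The book uses movie reviews as its example: each person rates some movies, and those ratings are used both to find people with tastes similar to yours and to recommend movies you have not seen. The same example is used to demonstrate how recommendations are made.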
Preparing the data (the code in these notes is written in Ruby; for the Python version, see the original book):
- critics={
- 'Lisa Rose' => {'Lady in the Water' => 2.5, 'Snakes on a Plane' => 3.5,
- 'Just My Luck' => 3.0, 'Superman Returns' => 3.5, 'You, Me and Dupree' => 2.5,
- 'The Night Listener' => 3.0},
-
- 'Gene Seymour' => {'Lady in the Water' => 3.0, 'Snakes on a Plane' => 3.5,
- 'Just My Luck' => 1.5, 'Superman Returns' => 5.0, 'The Night Listener'=> 3.0,
- 'You, Me and Dupree' => 3.5},
-
- 'Michael Phillips' => {'Lady in the Water' => 2.5, 'Snakes on a Plane' => 3.0,
- 'Superman Returns' => 3.5, 'The Night Listener' => 4.0},
-
- 'Claudia Puig' => {'Snakes on a Plane' => 3.5, 'Just My Luck' => 3.0,
- 'The Night Listener' => 4.5, 'Superman Returns' => 4.0,
- 'You, Me and Dupree' => 2.5},
-
- 'Mick LaSalle'=> {'Lady in the Water' => 3.0, 'Snakes on a Plane' => 4.0,
- 'Just My Luck' => 2.0, 'Superman Returns' => 3.0, 'The Night Listener' => 3.0,
- 'You, Me and Dupree' => 2.0},
-
- 'Jack Matthews'=> {'Lady in the Water' => 3.0, 'Snakes on a Plane' => 4.0,
- 'The Night Listener'=> 3.0, 'Superman Returns'=> 5.0, 'You, Me and Dupree' => 3.5},
-
- 'Toby' => {'Snakes on a Plane' =>4.5,'You, Me and Dupree' =>1.0,'Superman Returns' => 4.0}
- }
Defining similarity:
Euclidean distance:
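As a reading aid, the distance-based score can be written out to match the sim_distance implementation that follows. The symbols are introduced here for illustration only: S is the set of items both people have rated, and r_{p,i} is the rating person p gave item i.

$$
\mathrm{sim}(p_1, p_2) = \frac{1}{1 + \sum_{i \in S} \left( r_{p_1,i} - r_{p_2,i} \right)^2}
$$

The score is 1 when the two people agree on every shared item and falls toward 0 as their ratings diverge. Note that the code uses the sum of squared differences directly, without taking a square root.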
Implementation:
- def sim_distance(prefs,person1,person2)
- si = {}
- prefs[person1].each_key do |item|
- si[item] = 1 if prefs[person2][item]
- end
-
- return 0 if si.empty?
-
- sum_of_squares = si.keys.inject(0) do |sum,item|
- sum + (prefs[person1][item] - prefs[person2][item]) ** 2
- end
-
- return 1.0 / (1 + sum_of_squares) # 1.0 avoids integer division if the ratings are integers
- end
Pearson Correlation Score:
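For reference, the correlation coefficient can be written in the same form that the sim_pearson code below computes. The notation is an assumption made here for illustration: x_i and y_i are the two people's ratings for shared item i, and n is the number of shared items.

$$
r = \frac{\sum_i x_i y_i - \frac{\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n}}
         {\sqrt{\left(\sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n}\right)\left(\sum_i y_i^2 - \frac{\left(\sum_i y_i\right)^2}{n}\right)}}
$$

The value ranges from -1 to 1. When the denominator is 0 (one person gave every shared item the same rating), the code returns 0 instead.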
Implementation:
- def sim_pearson(prefs,person1,person2)
- si = {}
- prefs[person1].each_key do |item|
- si[item] = 1 if prefs[person2][item]
- end
-
- return 0 if si.empty?
-
- sum1 = si.keys.inject(0){|sum,item| sum + prefs[person1][item]}
- sum2 = si.keys.inject(0){|sum,item| sum + prefs[person2][item]}
-
- sum1Sq = si.keys.inject(0){|sum,item| sum + prefs[person1][item] ** 2}
- sum2Sq = si.keys.inject(0){|sum,item| sum + prefs[person2][item] ** 2}
-
- pSum = si.keys.inject(0){|sum,item| sum + prefs[person1][item] * prefs[person2][item]}
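- n = si.size.to_f # float count avoids integer division if the ratings are integers
- num = pSum - (sum1 * sum2 / n)
- den_sq = (sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n)
- return 0 if den_sq <= 0 # zero variance (or rounding noise): correlation is undefined, report 0
- return num / Math.sqrt(den_sq)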
- end
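To see how the two scores differ, here is a minimal sketch with made-up ratings (alice and bob are hypothetical and not part of the critics data). Bob rates every film exactly one point higher than Alice, so the distance-based score penalizes the offset while the Pearson score treats the two as perfectly correlated.

```ruby
# Hypothetical ratings: bob is consistently 1.0 more generous than alice.
alice = { 'Film A' => 2.0, 'Film B' => 3.0, 'Film C' => 4.0 }
bob   = { 'Film A' => 3.0, 'Film B' => 4.0, 'Film C' => 5.0 }

shared = alice.keys & bob.keys
n = shared.size.to_f

# Distance-based score: 1 / (1 + sum of squared differences)
sum_of_squares = shared.sum { |m| (alice[m] - bob[m])**2 }
distance_score = 1.0 / (1 + sum_of_squares)

# Pearson correlation over the shared films
sum1    = shared.sum { |m| alice[m] }
sum2    = shared.sum { |m| bob[m] }
sum1_sq = shared.sum { |m| alice[m]**2 }
sum2_sq = shared.sum { |m| bob[m]**2 }
p_sum   = shared.sum { |m| alice[m] * bob[m] }

num = p_sum - sum1 * sum2 / n
den = Math.sqrt((sum1_sq - sum1**2 / n) * (sum2_sq - sum2**2 / n))
pearson = den.zero? ? 0.0 : num / den

puts "distance score: #{distance_score}" # 0.25
puts "pearson score:  #{pearson}"        # 1.0
```

This is why Pearson tends to be the better choice when some critics are routinely harsher or more generous than others.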
Using the two similarity functions above, we can now compute the top N people whose movie tastes are closest to yours:
- def top_matches(prefs,person,n=5,similarity="sim_pearson")
- scores = []
-
- prefs.each_key{|other| scores << [send(similarity,prefs,person,other),other] if other != person} # send calls the method by name without eval
-
- return scores.sort.reverse[0...n]
- end
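An alternative to passing the similarity function by name is to pass a callable. The sketch below is one possible variation, not the book's approach: top_matches_with, distance_sim, and the toy prefs hash are all hypothetical names introduced for illustration. It also uses Enumerable#max_by(n) to pick the best n entries instead of sorting the whole list.

```ruby
# Hypothetical toy data, not the critics hash from these notes.
prefs = {
  'Ann'  => { 'X' => 4.0, 'Y' => 2.0 },
  'Ben'  => { 'X' => 4.0, 'Y' => 2.0 },
  'Cara' => { 'X' => 1.0, 'Y' => 5.0 },
  'Dan'  => { 'X' => 3.0, 'Y' => 2.0 }
}

# Distance-based similarity packaged as a lambda (an assumption of this sketch).
distance_sim = lambda do |data, a, b|
  shared = data[a].keys & data[b].keys
  if shared.empty?
    0.0
  else
    1.0 / (1 + shared.sum { |item| (data[a][item] - data[b][item])**2 })
  end
end

# Score everyone except `person`, then keep the n highest scores.
def top_matches_with(prefs, person, n, similarity)
  others = prefs.keys - [person]
  scored = others.map { |other| [similarity.call(prefs, person, other), other] }
  scored.max_by(n) { |score, _name| score }
end

result = top_matches_with(prefs, 'Ann', 2, distance_sim)
p result # [[1.0, "Ben"], [0.5, "Dan"]]
```

Passing a lambda keeps the caller in control of which metric is used and avoids looking methods up by string.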
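Next, let's see how to recommend movies you have not seen. The usual instinct is that
a movie everyone rates highly is worth watching, but your tastes may differ from those
of the people who gave the high ratings. Recommending movies rated highly by people
whose tastes resemble yours works much better. Conversely, even if someone rates a
movie very highly, that rating should contribute little for you if their tastes differ
from yours. One way to combine similarity and rating is to treat the product of
similarity and rating as that critic's contribution to the movie's score. To keep a
movie's total from growing simply because more people reviewed it, normalize it:
(sum over all critics of similarity * rating) / (sum of similarities). This gives the
following code: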
- def get_recommendations(prefs,person,similarity='sim_pearson')
- totals = {}
- simSums = {}
- prefs.each_key do |other|
-
- next if person == other
- sim = send(similarity,prefs,person,other) # call the similarity method by name without eval
-
- next if sim <= 0
-
- prefs[other].each_key do |item|
- if (not prefs[person][item]) or (prefs[person][item] == 0) then
-
- totals[item] = if totals[item] then
- totals[item] + prefs[other][item] * sim
- else
- prefs[other][item] * sim
- end
-
- simSums[item] = if simSums[item] then
- simSums[item] + sim
- else
- sim
- end
- end
- end
- end
-
- rankings = totals.map{|item,total| [total/simSums[item],item]}
- return rankings.sort.reverse
- end
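The weighted-average formula is easiest to follow with numbers. The sketch below uses made-up similarities and ratings for a single unseen film (the critic names and values are hypothetical). Only critics who rated the film contribute to either sum, which mirrors how get_recommendations accumulates totals and simSums.

```ruby
# Hypothetical similarity of each critic to the target user.
similarity = { 'Critic A' => 0.5, 'Critic B' => 0.25, 'Critic C' => 1.0 }
# Hypothetical ratings for one unseen film; Critic C has not rated it.
rating = { 'Critic A' => 3.0, 'Critic B' => 5.0 }

raters = rating.keys

# Each rating is weighted by how similar that critic is to the user.
weighted_total = raters.sum { |c| similarity[c] * rating[c] } # 1.5 + 1.25 = 2.75
sim_sum        = raters.sum { |c| similarity[c] }             # 0.5 + 0.25 = 0.75

predicted     = weighted_total / sim_sum                      # about 3.67
plain_average = raters.sum { |c| rating[c] } / raters.size    # 4.0

puts "weighted prediction: #{predicted.round(2)}"
puts "plain average:       #{plain_average}"
```

The prediction lands closer to Critic A's 3.0 than the plain average does, because Critic A is twice as similar to the user as Critic B.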
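How can we measure the similarity between products from user ratings? One way is to look at who likes
a given product and then at the other products those people like. This is really the same as the earlier
method with people and items swapped, so all we need to do is transpose the earlier dictionary: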
- def transform_prefs(prefs)
- result = {}
- prefs.each_key do |person|
- prefs[person].each_key do |item|
- result[item] = {} if not result[item]
- result[item][person] = prefs[person][item]
- end
- end
- result
- end
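To make the effect of the transpose concrete, here is a small sketch on a toy hash (the names and ratings are hypothetical). It uses a Hash with a default block instead of the explicit nil check, which is just one idiomatic alternative.

```ruby
# Hypothetical person => { item => rating } data.
prefs = {
  'Ann' => { 'X' => 4.0, 'Y' => 2.0 },
  'Ben' => { 'X' => 3.0 }
}

# Inner hashes are created on first access, so no nil check is needed.
by_item = Hash.new { |hash, item| hash[item] = {} }

prefs.each do |person, ratings|
  ratings.each do |item, score|
    by_item[item][person] = score
  end
end

p by_item # {"X"=>{"Ann"=>4.0, "Ben"=>3.0}, "Y"=>{"Ann"=>2.0}}
```

After the swap, each item maps to the people who rated it, so the same similarity functions can compare items instead of people.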
Then we can compute top matches and recommendations just as in the earlier code:
- movies = transform_prefs(critics)
- p top_matches(movies,'Superman Returns')
- p get_recommendations(movies,'Just My Luck')
2. Item-Based Filtering:
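The algorithm described so far is called user-based collaborative filtering. It has to
compute the similarity between customers on every request, so it does not scale well.
A better approach is to compute the similarities between items ahead of time, sort them,
and save the result; each user request then only needs to look up the top N directly.
This relies on the fact that similarities between items change less often than
similarities between users.
Following this idea, we can precompute the similarities between items and save them.
Here is the implementation: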
- def calculate_similar_items(prefs,n=10)
- result = {}
-
- item_prefs = transform_prefs(prefs)
- c = 0
- item_prefs.each_key do |item|
- c += 1
- printf("[%d / %d]",c,item_prefs.size) if c % 100 == 0
-
- scores = top_matches(item_prefs,item,n,'sim_distance')
- result[item] = scores
- end
- return result
- end
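The notes mention saving the precomputed similarities but do not show how. One possible way, sketched here under stated assumptions, is to write the result to a JSON file and load it back when serving requests. The file name item_sim_example.json and the sample item_sim values are hypothetical; the hash has the same shape that calculate_similar_items returns (item => list of [score, other_item] pairs).

```ruby
require 'json'
require 'tmpdir'

# Hypothetical precomputed result: item => [[score, other_item], ...]
item_sim = {
  'X' => [[0.5, 'Y'], [0.25, 'Z']],
  'Y' => [[0.5, 'X']]
}

# Hypothetical location; a real system would pick its own storage.
path = File.join(Dir.tmpdir, 'item_sim_example.json')

# Save once, after the (slow) offline computation.
File.write(path, JSON.generate(item_sim))

# Load at request time instead of recomputing similarities.
loaded = JSON.parse(File.read(path))
p loaded['X'] # [[0.5, "Y"], [0.25, "Z"]]

File.delete(path) # clean up the example file
```

JSON keeps the string keys and nested arrays intact, so the loaded hash can be passed to get_recommended_items in place of a freshly computed one.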
Now you can make recommendations directly from the item similarities saved earlier:
- def get_recommended_items(prefs,item_match,user)
- user_ratings = prefs[user]
- scores = {}
- total_sim = {}
-
- user_ratings.each do |item,rating|
-
- item_match[item].each do |similarity,item2|
-
- next if user_ratings[item2]
-
- scores[item2] = if scores[item2] then
- scores[item2] + similarity * rating
- else
- similarity * rating
- end
-
- total_sim[item2] = if total_sim[item2] then
- total_sim[item2] + similarity
- else
- similarity
- end
- end
- end
- # rank only after every item the user rated has been processed
- rankings = scores.map{|item,score| [score/total_sim[item],item]}
- return rankings.sort.reverse
- end
- item_sim = calculate_similar_items(critics)
- p get_recommended_items(critics,item_sim,'Toby')