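(Original article. Please credit the source when reposting!)
Recommender systems deal with people and items, and aim to predict how much a person will like an item. Different people can share similar tastes (for example, they all like martial-arts novels), and different items can share similar features (for example, they are all martial-arts novels). To predict the rating that user A would give to an item T that A has not yet rated, we can take two views. The first is to find people whose tastes are close to user A's and use their ratings of item T to estimate A's rating. The second is to look at the items user A has already rated, see which of them are close to item T, and use A's ratings of those items to estimate A's likely rating of T. These two ideas lead to two methods for computing ratings in a recommender system: user-based collaborative filtering and item-based collaborative filtering.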
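I. User-Based Collaborative Filtering (UBCF)
1. Finding similar users
Approach 1: compute the similarity between user A and every other user who has rated item T, then use all of these similarities when computing the predicted rating.
Approach 2: compute the similarity between user A and every other user who has rated item T, keep the K users with the highest similarity, and use only these K when computing the predicted rating.
Approach 3: compute the similarity between user A and every other user who has rated item T, set a threshold, keep the users whose similarity exceeds it, and use only those users when computing the predicted rating.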
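The three selection strategies above differ only in which neighbours are kept. A minimal sketch, assuming a made-up similarity vector `sims` between user A and five other users who rated item T (the names `u1` to `u5`, `K`, and `threshold` are illustrative only, not taken from the article's code):

```r
## Hypothetical similarities between user A and five users who rated item T
sims <- c(u1 = 0.91, u2 = 0.35, u3 = 0.78, u4 = 0.10, u5 = 0.66)

## Approach 1: keep every user who rated item T
allNeighbours <- names(sims)

## Approach 2: keep the K most similar users
K <- 3
topK <- head(names(sort(sims, decreasing = TRUE)), K)

## Approach 3: keep users whose similarity exceeds a threshold
threshold <- 0.7
aboveThreshold <- names(sims)[sims > threshold]

print(topK)            # "u1" "u3" "u5"
print(aboveThreshold)  # "u1" "u3"
```

Approach 2 always returns at most K neighbours, while Approach 3 may return none at all if the threshold is set too high, so a fallback may be needed in practice.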
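Similarity can be computed in several ways: the Pearson correlation coefficient, the cosine of the angle between the two rating vectors, Euclidean distance, and so on. (In R, the cor function computes the Pearson correlation coefficient and the dist function computes Euclidean distance; the daisy function from the cluster package can also be used, provided that package is installed.)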
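As a small illustration of the three measures, the sketch below applies them to two made-up rating vectors `x` and `y` (both assumed to be fully rated, so no missing values need handling). Only base R functions are used; the cosine is written out by hand because base R has no dedicated cosine function.

```r
## Two hypothetical users' ratings of the same four items
x <- c(5, 3, 4, 4)
y <- c(3, 1, 2, 3)

## Pearson correlation coefficient
pearson <- cor(x, y, method = "pearson")

## Cosine of the angle between the two vectors
cosine <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

## Euclidean distance (a distance, not a similarity: smaller means closer)
euclid <- as.numeric(dist(rbind(x, y), method = "euclidean"))

print(round(c(pearson = pearson, cosine = cosine, euclid = euclid), 3))
## pearson ~ 0.853, cosine ~ 0.975, euclid ~ 3.606
```

Note that Pearson and cosine grow with similarity, whereas Euclidean distance shrinks, so a distance has to be converted (for example with 1 / (1 + d), one possible choice) before it can be used as a weight.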
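2. Computing the predicted rating of user A for item T
Once the similar users have been found, the mean of their ratings of item T can serve as the prediction of user A's rating of T. Since each of these users is similar to user A to a different degree, the prediction can also be computed as the average of their ratings weighted by similarity.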
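The difference between the two predictions can be seen on a tiny example. This is a sketch with made-up numbers: `neighbourRatings` and `neighbourSims` are hypothetical values for three similar users, not data from the article.

```r
## Hypothetical ratings of item T by three users similar to user A,
## and the similarity of each of them to user A
neighbourRatings <- c(4, 5, 2)
neighbourSims    <- c(0.9, 0.6, 0.3)

## Plain mean: every neighbour counts equally
simpleMean <- mean(neighbourRatings)

## Similarity-weighted mean: closer neighbours count more
weightedMean <- sum(neighbourSims * neighbourRatings) / sum(neighbourSims)

print(simpleMean)    # 3.666667
print(weightedMean)  # 4
```

Here the least similar user gave the lowest rating, so down-weighting that user raises the prediction from about 3.67 to 4. Base R's `weighted.mean(neighbourRatings, neighbourSims)` computes the same weighted value.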
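3. Implementation
The implementation below measures similarity with the cosine of the angle between rating vectors, picks the K most similar users, and uses their ratings to compute the predicted rating. The training data is a matrix: each row holds all ratings received by one item, and each column holds all ratings given by one user. Ratings range from 1 to 5, and a missing rating is NA. The code is as follows: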
    ## Normalize each column of a matrix with the z-score method ( (x-u)/sigma ).
    ## Entries equal to 0 are treated as "not rated" and are left untouched.
    ## Args :
    ##     x - a matrix
    ## Returns :
    ##     a list containing the mean of each column,
    ##     the standard deviation of each column,
    ##     and the normalized x
    zScoreNormalization <- function(x)
    {
        meanOfcol <- numeric(ncol(x))
        sdOfcol <- numeric(ncol(x))
        for (i in seq_len(ncol(x))) {
            colValues <- x[, i]
            idx <- which(colValues != 0)
            ## a column needs at least two ratings that are not all equal,
            ## otherwise sd() is NA or 0 and the z-score is undefined
            if (length(idx) <= 1 || sd(colValues[idx]) == 0) {
                meanOfcol[i] <- NA
                sdOfcol[i] <- NA
                next
            }
            meanOfcol[i] <- mean(colValues[idx])
            sdOfcol[i] <- sd(colValues[idx])
            x[idx, i] <- (colValues[idx] - meanOfcol[i]) / sdOfcol[i]   # z-score
        }

        return ( list(meanOfcol = meanOfcol, sdOfcol = sdOfcol, xNormalized = x) )
    }

    ## Invert the z-score normalization
    ## Args :
    ##     x - a vector of normalized values to convert back
    ##     u - mean of the original x
    ##     sigma - standard deviation of the original x
    ## Returns :
    ##     the vector x on the original scale
    zScoreNormalizationInverse <- function(x, u, sigma)
    {
        return (x*sigma + u)
    }

    ## Calculate the cosine of the angle between two vectors
    ## Args :
    ##     x - a vector
    ##     y - a vector
    ## Returns :
    ##     the cosine of the angle, rescaled from [-1,1] to [0,1]
    cosineSimilarity <- function(x, y) {
        if (length(x) != length(y)) {
            stop("Function cosineSimilarity : length of two parameter vectors is different!")
        }
        xx <- x
        yy <- y
        xx[which(is.na(xx))] <- 0
        yy[which(is.na(yy))] <- 0
        ## if x and y share no non-zero position, return 0 without calculating
        if ( sum(abs(xx*yy)) == 0 ) {
            return (0)
        }

        sim <- sum(xx*yy) / ( sqrt(sum(xx^2)) * sqrt(sum(yy^2)) )   # cosine of vector angle
        return ( 0.5 + 0.5*sim )   # ensure the similarity is in range [0,1]
    }

    ## Find the top n items as the recommendation list with the User-Based Collaborative Filtering algorithm
    ## Args :
    ##     x - a matrix containing all rating results.
    ##         Each column is the ratings by one user, each row is the ratings of one movie.
    ##         If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
    ##     userI - index of the specified user
    ##     k - number of nearest neighbours of user I to use
    ##     n - number of top items to recommend to user I
    ## Returns :
    ##     a list containing the recommendation result
    recommendationUBCF <- function(x, userI, k, n)
    {
        ## record the un-rated items before NA is replaced by 0: after
        ## normalization a rating equal to the user's mean also becomes 0
        unRatedItems <- which( is.na(x[, userI]) )
        x[which(is.na(x))] <- 0
        ## normalize the data
        normlizedResult <- zScoreNormalization(x)
        x <- normlizedResult$xNormalized

        ## find the k most similar users
        userSimilarity <- numeric(ncol(x))
        for (i in seq_len(ncol(x))) {
            if (i == userI) {
                userSimilarity[i] <- -1   # never pick user I as its own neighbour
                next
            }
            userSimilarity[i] <- cosineSimilarity(x[, i], x[, userI])
        }
        KSimilarUserIdx <- head( order(userSimilarity, decreasing = TRUE), k )

        ## predict the rating of un-rated items (rated items stay NA)
        ratingOfUnRatedItems <- rep(NA_real_, nrow(x))
        for (i in unRatedItems) {
            ## the operator must end the line, otherwise R stops parsing the statement here
            ratingOfUnRatedItems[i] <- sum( x[i, KSimilarUserIdx] * userSimilarity[KSimilarUserIdx] ) /
                                       sum( userSimilarity[KSimilarUserIdx] )
        }
        ratingOfUnRatedItems <- zScoreNormalizationInverse( ratingOfUnRatedItems,
                                                            normlizedResult$meanOfcol[userI],
                                                            normlizedResult$sdOfcol[userI] )

        ## find the Top-N items; na.last = NA drops the items user I has already rated
        topnIdx <- head( order(ratingOfUnRatedItems, decreasing = TRUE, na.last = NA), n )
        recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
        return( recommendList )
    }
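The training-data layout described for this implementation (items in rows, users in columns, NA for a missing rating) can be sketched with a small made-up matrix. The item and user names and all rating values below are invented for illustration.

```r
## Hypothetical 4 x 4 rating matrix: rows are items, columns are users,
## ratings are 1-5 and NA marks "not rated"
ratings <- matrix(
    c( 5,  3, NA, 1,
       4, NA, NA, 1,
       1,  1, NA, 5,
      NA,  1,  5, 4),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("item", 1:4), paste0("user", 1:4))
)

ratings["item2", ]   # all ratings received by item2
ratings[, "user1"]   # all ratings given by user1

## items that user1 has not rated yet, i.e. the candidates for recommendation
unratedByUser1 <- rownames(ratings)[is.na(ratings[, "user1"])]
print(unratedByUser1)   # "item4"
```

A matrix of this shape is what the `x` argument of `recommendationUBCF` expects, with `userI` given as a column index.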
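II. Item-Based Collaborative Filtering (IBCF)
1. Algorithm
1) Find all items that the given user has not rated yet.
2) For each unrated item, find the k items most similar to it among the items the user has already rated, and use the user's ratings of those k items, together with their similarity values, to predict the rating of the unrated item.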
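The two steps above can be sketched compactly on a made-up matrix. This is a simplified illustration under stated assumptions, not the article's implementation: it works on raw ratings without normalization, computes the Pearson correlation over the users who rated both items (`use = "pairwise.complete.obs"`), and divides by the sum of absolute similarities, a common guard against negative weights.

```r
## Hypothetical ratings: rows are items, columns are users, NA = not rated
ratings <- matrix(
    c( 5, 4, 1, 2,
       4, 5, 2, 1,
       1, 2, 5, 4,
      NA, 5, 1, 2),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("item", 1:4), paste0("user", 1:4))
)
user <- "user1"
k <- 2

## Step 1: split the items into those the user has and has not rated
unrated <- rownames(ratings)[is.na(ratings[, user])]
rated   <- rownames(ratings)[!is.na(ratings[, user])]

## Step 2: for each unrated item, take the k most similar rated items
## and predict with a similarity-weighted average of the user's ratings
predicted <- sapply(unrated, function(item) {
    sims <- sapply(rated, function(other)
        cor(ratings[item, ], ratings[other, ], use = "pairwise.complete.obs"))
    nearest <- head(names(sort(sims, decreasing = TRUE)), k)
    sum(sims[nearest] * ratings[nearest, user]) / sum(abs(sims[nearest]))
})
print(round(predicted, 2))   # item4: about 4.53
```

In this toy data item4 is rated much like item1 and item2 by the other users, so the prediction for user1 falls between user1's ratings of those two items (5 and 4).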
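2. Implementation
Item similarity is computed with the Pearson correlation coefficient. The training data is the same as for UBCF. The code is as follows: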
    ## Find the top n items as the recommendation list with the Item-Based Collaborative Filtering algorithm
    ## Args :
    ##     x - a matrix containing all rating results.
    ##         Each column is the ratings by one user, each row is the ratings of one movie.
    ##         If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
    ##     userI - index of the specified user
    ##     k - number of nearest rated items to use for each un-rated item
    ##     n - number of top items to recommend to user-I
    ## Returns :
    ##     a list containing the recommendation result
    recommendationIBCF <- function(x, userI, k, n)
    {
        # Pearson correlation coefficient between two vectors of length m :
        # sum((x - u_x)*(y - u_y)) / ((m - 1) * sd_x * sd_y)

        ## record rated / un-rated items before NA is replaced by 0: after
        ## normalization a rating equal to the item's mean also becomes 0
        unRatedIdx <- which( is.na(x[, userI]) )
        ratedIdx <- which( !is.na(x[, userI]) )
        x[which(is.na(x))] <- 0
        ## normalize the data per item (items become columns after t())
        normlizedResult <- zScoreNormalization( t(x) )
        x <- t( normlizedResult$xNormalized )

        ## predict the rating of user-I's un-rated items (rated items stay NA)
        ratingOfUnRatedItems <- rep(NA_real_, nrow(x))
        r <- x[ratedIdx, , drop = FALSE]   # drop = FALSE keeps a matrix even for a single rated item
        for (i in unRatedIdx) {
            # Pearson correlation coefficient between item-i and each rated item
            itemSim <- as.vector( cor( x = x[i, ], y = t(r), use = "everything", method = "pearson" ) )

            # find the k nearest items to item-i
            KSimilarItemIdx <- head( order(itemSim, decreasing = TRUE, na.last = TRUE), k )

            # predict the rating of un-rated item-i; the operator must end the line,
            # otherwise R stops parsing the statement here.
            # Correlations can be negative, so the denominator can get close to 0.
            ratingOfUnRatedItems[i] <- sum( r[KSimilarItemIdx, userI] * itemSim[KSimilarItemIdx] ) /
                                       sum( itemSim[KSimilarItemIdx] )
            if ( is.na(normlizedResult$meanOfcol[i]) || is.na(normlizedResult$sdOfcol[i]) ) {
                next
            }
            ratingOfUnRatedItems[i] <- zScoreNormalizationInverse( ratingOfUnRatedItems[i],
                                                                   normlizedResult$meanOfcol[i],
                                                                   normlizedResult$sdOfcol[i] )
        }

        ## find the Top-N items; na.last = NA drops the items user-I has already rated
        topnIdx <- head( order(ratingOfUnRatedItems, decreasing = TRUE, na.last = NA), n )
        recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
        return( recommendList )
    }
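III. Rating Normalization
Different users have different rating habits: some tend to give low scores across the board, others tend to give high scores. The data therefore needs a normalization step as preprocessing, to remove the effect of these habits. The guiding principle when choosing a normalization method is that it must be invertible, so that normalized values can be mapped back to the original scale. Common choices are mean normalization and Z-score normalization.
The code for mean normalization was given in the article "推荐系统(二)" (Recommender Systems, Part 2); the Z-score normalization code appears earlier in this article.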
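The invertibility requirement can be checked directly. A minimal sketch, assuming a made-up vector `userRatings` for a user who tends to score low; both methods are applied and then undone:

```r
## Hypothetical ratings from a user who tends to give low scores
userRatings <- c(2, 3, 1, 2, 3)

## Mean normalization: subtract the user's mean rating
mu <- mean(userRatings)
centred <- userRatings - mu
restoredFromMean <- centred + mu

## Z-score normalization: subtract the mean, divide by the standard deviation
sigma <- sd(userRatings)
z <- (userRatings - mu) / sigma
restoredFromZ <- z * sigma + mu

## both transformations can be undone as long as mu (and sigma) are kept
print(all.equal(restoredFromMean, userRatings))   # TRUE
print(all.equal(restoredFromZ, userRatings))      # TRUE
```

This is why a normalization routine needs to return the per-user mean and standard deviation alongside the normalized data: without them the predicted values cannot be mapped back to the 1-5 scale. Z-score normalization additionally requires a non-zero standard deviation, so a user whose ratings are all identical needs special handling.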