In a previous post, I walked through how to build a recommender system and gave a brief introduction to collaborative filtering. In this post, I will look at how collaborative filtering is applied in a recommender system. Recommender systems are among the most common and most intuitive applications of big data and machine learning. We run into recommendation scenarios all the time in everyday life: buying goods on an e-commerce site, watching videos in a video app, downloading games on a phone. All of these rely on recommendation technology to personalize the content and items shown to you.
This post proceeds by building collaborative filtering models and using order data to recommend items that users are likely to want. The steps are laid out in the sections that follow.
Everything in this post is implemented with Python and the Turicreate machine learning library. The required Python dependencies are pandas, numpy, turicreate, and scikit-learn (plus the standard-library time module), as shown in the import block below.
The data for this demo consists of two CSV files: customer_id.csv, which holds the customer IDs (a customerId column), and customer_data.csv, which holds the order data (a customerId column plus a products column whose values are '|'-separated product IDs).
First, load the Python dependencies:
import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split
Next, take a look at the datasets:
customers = pd.read_csv('customer_id.csv')
transactions = pd.read_csv('customer_data.csv')
print(customers.head())
print(transactions.head())
Now take the CSV data above, break the '|'-separated list in the products column into one row per item, and count how many times each customer purchased each product.
The implementation is as follows:
# split each '|'-separated product string into a Python list of ints
transactions['products'] = transactions['products'].apply(lambda x: [int(i) for i in x.split('|')])

# unpivot to one row per (customer, product) and count repeat purchases
data = pd.melt(transactions.set_index('customerId')['products'].apply(pd.Series).reset_index(),
               id_vars=['customerId'],
               value_name='products') \
    .dropna().drop(['variable'], axis=1) \
    .groupby(['customerId', 'products']) \
    .agg({'products': 'count'}) \
    .rename(columns={'products': 'purchase_count'}) \
    .reset_index() \
    .rename(columns={'products': 'productId'})
data['productId'] = data['productId'].astype(np.int64)
print(data.shape)
print(data.head())
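As a side note, on pandas 0.25 or newer the same reshape-and-count step can be written more compactly with DataFrame.explode. The following is an equivalent sketch, not the original pipeline:

# equivalent reshaping with DataFrame.explode (pandas >= 0.25);
# 'products' already holds lists of ints at this point
data_alt = (transactions.explode('products')
                        .groupby(['customerId', 'products'])
                        .size()
                        .reset_index(name='purchase_count')
                        .rename(columns={'products': 'productId'}))
data_alt['productId'] = data_alt['productId'].astype(np.int64)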
Next, create a dummy variable that simply marks whether a customer purchased an item (1 if purchased at all):
def create_data_dummy(data):
    data_dummy = data.copy()
    data_dummy['purchase_dummy'] = 1   # 1 = the customer bought this item at least once
    return data_dummy

data_dummy = create_data_dummy(data)
print(data_dummy.head())
Then normalize purchase frequency across users. Start by building a user-item matrix, where rows are customers, columns are products, and values are purchase counts (NaN where a customer never bought the product):
df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
print(df_matrix.head())
Next, min-max normalize the matrix, column by column:
df_matrix_norm = (df_matrix - df_matrix.min()) / (df_matrix.max() - df_matrix.min())
print(df_matrix_norm.head())
Now build a long-format table from the normalized matrix to serve as model input:
d = df_matrix_norm.reset_index()
d.index.names = ['scaled_purchase_freq']
data_norm = pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()
print(data_norm.shape)
print(data_norm.head())
The steps above can be combined into the function defined below:
def normalize_data(data):
    df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
    df_matrix_norm = (df_matrix - df_matrix.min()) / (df_matrix.max() - df_matrix.min())
    d = df_matrix_norm.reset_index()
    d.index.names = ['scaled_purchase_freq']
    return pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()
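With this helper in place, the scaled table from the previous steps can be rebuilt in a single call:

data_norm = normalize_data(data)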
Above, we normalized each user's purchase history per product to the range 0 to 1, where 1 corresponds to the highest purchase count observed for that item and 0 to the lowest.
Next, split each dataset into a training set and a test set. The split function is as follows:
def split_data(data):
    '''
    Splits dataset into training and test set.

    Args:
        data (pandas.DataFrame)

    Returns:
        train_data (tc.SFrame)
        test_data (tc.SFrame)
    '''
    train, test = train_test_split(data, test_size=.2)
    train_data = tc.SFrame(train)
    test_data = tc.SFrame(test)
    return train_data, test_data

Note that train_test_split shuffles rows at random, so results will vary slightly between runs; pass a fixed random_state if you need a reproducible split.
We now have three datasets: raw purchase counts, the purchase dummy, and scaled purchase counts. We split each of them separately for modeling:
train_data, test_data = split_data(data)
train_data_dummy, test_data_dummy = split_data(data_dummy)
train_data_norm, test_data_norm = split_data(data_norm)
print(train_data)
This prints the resulting training SFrame so we can sanity-check the split.
Before running a more sophisticated method such as collaborative filtering, we should run a baseline model to compare and evaluate against. Because a baseline typically uses a very simple approach, any technique that beats it with reasonably better accuracy for its added complexity is worth adopting.
"Baseline model" is a machine-learning term: in short, you predict using the most common outcome. In a coin-guessing game, for example, the simplest strategy is to always pick heads (or always tails), which already gives you 50% accuracy.
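To make this concrete, here is a minimal sketch of such a baseline on our own data: recommend the same globally most-purchased items to every user. It only uses the data frame built earlier; top_n is an illustrative cutoff, and in the actual pipeline Turicreate's popularity_recommender plays this role:

# naive baseline sketch: recommend the globally most-purchased items to everyone
top_n = 10  # illustrative cutoff, not part of the original pipeline
most_popular = (data.groupby('productId')['purchase_count']
                    .sum()
                    .sort_values(ascending=False)
                    .head(top_n)
                    .index.tolist())
print(most_popular)  # every user would receive this same list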
A more sophisticated but very common approach to predicting purchases is collaborative filtering. First, define the variables to be used by the models:
# constant variables to define field names:
user_id = 'customerId'
item_id = 'productId'
users_to_recommend = list(customers[user_id])
n_rec = 10      # number of items to recommend
n_display = 30  # display the first few rows of the output dataset
Turicreate makes it very easy to call a modeling technique, so we define one function that covers all the models:
def model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display):
    if name == 'popularity':
        model = tc.popularity_recommender.create(train_data,
                                                 user_id=user_id,
                                                 item_id=item_id,
                                                 target=target)
    elif name == 'cosine':
        model = tc.item_similarity_recommender.create(train_data,
                                                      user_id=user_id,
                                                      item_id=item_id,
                                                      target=target,
                                                      similarity_type='cosine')
    elif name == 'pearson':
        model = tc.item_similarity_recommender.create(train_data,
                                                      user_id=user_id,
                                                      item_id=item_id,
                                                      target=target,
                                                      similarity_type='pearson')
    recom = model.recommend(users=users_to_recommend, k=n_rec)
    recom.print_rows(n_display)
    return model
The popularity model on purchase counts:
name = 'popularity'
target = 'purchase_count'
popularity = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(popularity)
The popularity model on the purchase dummy:
name = 'popularity'
target = 'purchase_dummy'
pop_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(pop_dummy)
The popularity model on scaled purchase counts:
name = 'popularity'
target = 'scaled_purchase_freq'
pop_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(pop_norm)
The item similarity model recommends similar items based on which items users purchase together. For example, if user 1 and user 2 bought similar items, say user 1 bought X, Y, and Z while user 2 bought X and Y, then we can recommend item Z to user 2.
Here, similarity is the cosine of the angle between the two item vectors:

$$\mathrm{cosine}(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$
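To see what this computes, here is a minimal standalone sketch that derives the item-item cosine similarity matrix directly from the user-item matrix df_matrix built earlier, using scikit-learn and treating missing purchases as 0. It is for illustration only; Turicreate computes this internally:

from sklearn.metrics.pairwise import cosine_similarity

# one row per product: transpose the user-item matrix, fill missing purchases with 0
item_vectors = df_matrix.fillna(0).T
item_sim = cosine_similarity(item_vectors)  # square item-item matrix, 1.0 on the diagonal
print(item_sim.shape)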
Cosine similarity on purchase counts:
name = 'cosine'
target = 'purchase_count'
cos = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(cos)
Cosine similarity on the purchase dummy:
name = 'cosine'
target = 'purchase_dummy'
cos_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(cos_dummy)
Cosine similarity on scaled purchase counts:
name = 'cosine'
target = 'scaled_purchase_freq'
cos_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(cos_norm)
For the Pearson variant, similarity is the Pearson correlation coefficient between the two item vectors.
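For reference, the Pearson correlation between two item vectors $x$ and $y$ is the cosine similarity of the mean-centered vectors:

$$\mathrm{pearson}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$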
Pearson similarity on purchase counts:
name = 'pearson'
target = 'purchase_count'
pear = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(pear)
Pearson similarity on the purchase dummy:
name = 'pearson'
target = 'purchase_dummy'
pear_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(pear_dummy)
Pearson similarity on scaled purchase counts:
name = 'pearson'
target = 'scaled_purchase_freq'
pear_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(pear_norm)
When evaluating a recommendation engine, we can use RMSE together with the notions of precision and recall.
Why are recall and precision so important? For recommendations, precision is the fraction of recommended items the user actually purchased, while recall is the fraction of the user's purchased items that appeared in the recommendations.
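Written out, with $R$ the set of recommended items, $P$ the set of items the user actually purchased, and $y_i$, $\hat{y}_i$ the observed and predicted values:

$$\mathrm{precision} = \frac{|R \cap P|}{|R|}, \qquad \mathrm{recall} = \frac{|R \cap P|}{|P|}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$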
Next, create the initial variables for model evaluation:
models_w_counts = [popularity, cos, pear]
models_w_dummy = [pop_dummy, cos_dummy, pear_dummy]
models_w_norm = [pop_norm, cos_norm, pear_norm]

names_w_counts = ['Popularity Model on Purchase Counts',
                  'Cosine Similarity on Purchase Counts',
                  'Pearson Similarity on Purchase Counts']
names_w_dummy = ['Popularity Model on Purchase Dummy',
                 'Cosine Similarity on Purchase Dummy',
                 'Pearson Similarity on Purchase Dummy']
names_w_norm = ['Popularity Model on Scaled Purchase Counts',
                'Cosine Similarity on Scaled Purchase Counts',
                'Pearson Similarity on Scaled Purchase Counts']
Then compare all the models we built on RMSE and precision-recall:
eval_counts = tc.recommender.util.compare_models(test_data, models_w_counts, model_names=names_w_counts)
eval_dummy = tc.recommender.util.compare_models(test_data_dummy, models_w_dummy, model_names=names_w_dummy)
eval_norm = tc.recommender.util.compare_models(test_data_norm, models_w_norm, model_names=names_w_norm)
Running this prints the evaluation results for each model.
The complete example code is as follows:
import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split

customers = pd.read_csv('customer_id.csv')
transactions = pd.read_csv('customer_data.csv')
# print(customers.head())
# print(transactions.head())

transactions['products'] = transactions['products'].apply(lambda x: [int(i) for i in x.split('|')])

data = pd.melt(transactions.set_index('customerId')['products'].apply(pd.Series).reset_index(),
               id_vars=['customerId'],
               value_name='products') \
    .dropna().drop(['variable'], axis=1) \
    .groupby(['customerId', 'products']) \
    .agg({'products': 'count'}) \
    .rename(columns={'products': 'purchase_count'}) \
    .reset_index() \
    .rename(columns={'products': 'productId'})
data['productId'] = data['productId'].astype(np.int64)
# print(data.shape)
# print(data.head())

def create_data_dummy(data):
    data_dummy = data.copy()
    data_dummy['purchase_dummy'] = 1
    return data_dummy

data_dummy = create_data_dummy(data)
# print(data_dummy.head())

df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
# print(df_matrix.head())

df_matrix_norm = (df_matrix - df_matrix.min()) / (df_matrix.max() - df_matrix.min())
# print(df_matrix_norm.head())

# create a table for input to the modeling
d = df_matrix_norm.reset_index()
d.index.names = ['scaled_purchase_freq']
data_norm = pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()
# print(data_norm.shape)
# print(data_norm.head())

def normalize_data(data):
    df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
    df_matrix_norm = (df_matrix - df_matrix.min()) / (df_matrix.max() - df_matrix.min())
    d = df_matrix_norm.reset_index()
    d.index.names = ['scaled_purchase_freq']
    return pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()

def split_data(data):
    '''
    Splits dataset into training and test set.

    Args:
        data (pandas.DataFrame)

    Returns:
        train_data (tc.SFrame)
        test_data (tc.SFrame)
    '''
    train, test = train_test_split(data, test_size=.2)
    train_data = tc.SFrame(train)
    test_data = tc.SFrame(test)
    return train_data, test_data

train_data, test_data = split_data(data)
train_data_dummy, test_data_dummy = split_data(data_dummy)
train_data_norm, test_data_norm = split_data(data_norm)
# print(train_data)

# constant variables to define field names:
user_id = 'customerId'
item_id = 'productId'
users_to_recommend = list(customers[user_id])
n_rec = 10      # number of items to recommend
n_display = 30  # display the first few rows of the output dataset

def model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display):
    if name == 'popularity':
        model = tc.popularity_recommender.create(train_data, user_id=user_id, item_id=item_id, target=target)
    elif name == 'cosine':
        model = tc.item_similarity_recommender.create(train_data, user_id=user_id, item_id=item_id,
                                                      target=target, similarity_type='cosine')
    elif name == 'pearson':
        model = tc.item_similarity_recommender.create(train_data, user_id=user_id, item_id=item_id,
                                                      target=target, similarity_type='pearson')
    recom = model.recommend(users=users_to_recommend, k=n_rec)
    recom.print_rows(n_display)
    return model

name = 'popularity'
target = 'purchase_count'
popularity = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(popularity)

name = 'popularity'
target = 'purchase_dummy'
pop_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(pop_dummy)

name = 'popularity'
target = 'scaled_purchase_freq'
pop_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(pop_norm)

name = 'cosine'
target = 'purchase_count'
cos = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(cos)

name = 'cosine'
target = 'purchase_dummy'
cos_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(cos_dummy)

name = 'cosine'
target = 'scaled_purchase_freq'
cos_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(cos_norm)

name = 'pearson'
target = 'purchase_count'
pear = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(pear)

name = 'pearson'
target = 'purchase_dummy'
pear_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(pear_dummy)

name = 'pearson'
target = 'scaled_purchase_freq'
pear_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(pear_norm)

models_w_counts = [popularity, cos, pear]
models_w_dummy = [pop_dummy, cos_dummy, pear_dummy]
models_w_norm = [pop_norm, cos_norm, pear_norm]
names_w_counts = ['Popularity Model on Purchase Counts',
                  'Cosine Similarity on Purchase Counts',
                  'Pearson Similarity on Purchase Counts']
names_w_dummy = ['Popularity Model on Purchase Dummy',
                 'Cosine Similarity on Purchase Dummy',
                 'Pearson Similarity on Purchase Dummy']
names_w_norm = ['Popularity Model on Scaled Purchase Counts',
                'Cosine Similarity on Scaled Purchase Counts',
                'Pearson Similarity on Scaled Purchase Counts']

eval_counts = tc.recommender.util.compare_models(test_data, models_w_counts, model_names=names_w_counts)
eval_dummy = tc.recommender.util.compare_models(test_data_dummy, models_w_dummy, model_names=names_w_dummy)
eval_norm = tc.recommender.util.compare_models(test_data_norm, models_w_norm, model_names=names_w_norm)

# Final Output Result
# final_model = tc.item_similarity_recommender.create(tc.SFrame(data_dummy),
#                                                     user_id=user_id,
#                                                     item_id=item_id,
#                                                     target='purchase_dummy',
#                                                     similarity_type='cosine')
# recom = final_model.recommend(users=users_to_recommend, k=n_rec)
# recom.print_rows(n_display)
# df_rec = recom.to_dataframe()
# print(df_rec.shape)
# print(df_rec.head())
That is all for this post. If you run into any problems while studying this material, feel free to join the discussion group or send me an email; I will do my best to answer, and we can learn together!
Also, I have published the books《Kafka并不难学》and《Hadoop大数据挖掘从入门到进阶实战》. Friends and classmates who are interested can purchase them through the links in the announcement board; thank you all for your support. Follow the official account below and, following the prompts, you can get the books' companion videos for free.