We are building a content-based recommendation system for The New York Times (hereafter NYT); you can treat it as a very simple example of recommender development. Based on an article a user has recently read, we will recommend new articles worth reading, and to do that we only need the text of that article: we recommend content similar to it.
Inspecting the data

Below is an excerpt from the first NYT article in the dataset; the text has already been preprocessed.
'TOKYO — State-backed Japan Bank for International Cooperation [JBIC.UL] will lend about 4 billion yen ($39 million) to Russia's Sberbank, which is subject to Western sanctions, in the hope of advancing talks on a territorial dispute, the Nikkei business daily said on Saturday, [...]'
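The post does not show how the articles were loaded. As a minimal sketch, the rest of the code assumes a pandas DataFrame `df` with `body`, `section_name`, and `web_url` columns; the rows below are made up stand-ins for the scraped NYT articles:

```python
import pandas as pd

# hypothetical rows standing in for the scraped NYT articles; the post
# does not show how `df` was built, so this is only a stand-in
df = pd.DataFrame({
    "body": [
        "TOKYO — State-backed Japan Bank for International Cooperation ...",
        "WASHINGTON — Lawmakers debated the budget on Tuesday ...",
    ],
    "section_name": ["World", "U.S."],
    "web_url": ["https://nyti.ms/article-1", "https://nyti.ms/article-2"],
})

# keep only the columns the recommender needs and drop incomplete rows
df = df[["body", "section_name", "web_url"]].dropna()
print(df.shape)  # (2, 3)
```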
The first question to address is how to vectorize this text, and what new features to engineer, such as parts of speech, n-grams, sentiment scores, or named entities.
The NLP rabbit hole is clearly worth exploring in depth, and you could spend a lot of time experimenting with existing approaches. But real science often starts by trying the simplest workable solution, so that later iterations can build on it.
In this post, that simple, workable solution is exactly what we implement.
Splitting the data

We need to prepare the raw data by selecting the relevant features from the database, shuffling them, and then splitting them into training and test sets.
from sklearn.utils import shuffle

# move articles to an array
articles = df.body.values
# move article section names to an array
sections = df.section_name.values
# move article web_urls to an array
web_url = df.web_url.values
# shuffle these three arrays
articles, sections, web_url = shuffle(articles, sections, web_url, random_state=4)
# split the shuffled articles into two arrays
n = 10
# one will have all but the last 10 articles -- think of this as your training set/corpus
X_train = articles[:-n]
X_train_urls = web_url[:-n]
X_train_sections = sections[:-n]
# the other will have those last 10 articles -- think of this as your test set/corpus
X_test = articles[-n:]
X_test_urls = web_url[-n:]
X_test_sections = sections[-n:]
Vectorizing the text

We can choose among several text vectorization schemes, such as Bag-of-Words (BoW), Tf-Idf, and Word2Vec.
One reason we choose Tf-Idf is that, unlike BoW, it weighs a word's importance not only by its term frequency but also by its inverse document frequency.
For example, a word like "Obama" may appear only a handful of times in an article (unlike words such as "a" or "the", which carry little information), and it appears in only a small share of the articles in the corpus, so it should receive a higher weight.
That is because "Obama" is neither a stop word nor everyday filler; its presence signals that the word is closely tied to the article's topic.
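This intuition can be checked with scikit-learn's TfidfVectorizer on a toy corpus (the three documents below are made up for illustration): "the" appears in every document and so gets a low idf, while "obama" appears in only one and gets a high one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# three toy documents; "the" appears in every one, "obama" in only one
docs = [
    "obama spoke to reporters about the economy",
    "the game went to overtime",
    "the markets closed higher for the week",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_

# compare the weights of "obama" and "the" within the first document:
# both occur once there, so the difference comes entirely from idf
w_obama = tfidf[0, vocab["obama"]]
w_the = tfidf[0, vocab["the"]]
print(w_obama > w_the)  # True
```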
Choosing a similarity metric

There are several options for the similarity metric; for example, we can compare Jaccard and cosine.
Jaccard similarity works by comparing two sets and measuring their overlap. Since we have already chosen Tf-Idf as the vectorization scheme, Jaccard similarity is not a meaningful option; it would only make sense if we had chosen a BoW vectorization.
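To make the contrast concrete, here is a minimal Jaccard similarity over token sets (the helper function and example sentences are illustrative). Note that it only sees set membership and overlap, so the Tf-Idf weights we just computed would be thrown away:

```python
def jaccard_similarity(doc_a, doc_b):
    """Jaccard similarity on the *sets* of tokens, ignoring counts and weights."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)

# 4 shared tokens out of 7 distinct tokens overall
score = jaccard_similarity("obama visited the white house",
                           "obama left the white house early")
print(round(score, 3))  # 0.571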
So we go with cosine similarity as our metric.
If tokens such as "Obama" or "White House" carry high weights in article A and also in article B, the product of those weights yields a larger similarity value than it would if the same tokens carried low weights in article B.
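The effect of matching high-weight tokens falls directly out of the dot-product form of cosine similarity. A small numeric sketch, with made-up Tf-Idf weights for a three-token vocabulary:

```python
import numpy as np

def cosine(u, v):
    # cosine similarity = dot product of the L2-normalized vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# hypothetical Tf-Idf weights for the tokens ["obama", "white house", "economy"]
article_a      = np.array([0.8, 0.7, 0.1])  # high weight on "obama", "white house"
article_b_high = np.array([0.9, 0.6, 0.2])  # also high on the same tokens
article_b_low  = np.array([0.1, 0.2, 0.9])  # low weight on those tokens

# matching high weights multiply into a much larger similarity
print(cosine(article_a, article_b_high) > cosine(article_a, article_b_low))  # True
```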
Building the recommender

Using the similarity scores between the article the user has read and every other article in the corpus (i.e., the training data), you can now build a function that returns the top N most similar articles and start recommending.
import numpy as np

def get_top_n_rec_articles(X_train_tfidf, X_train, test_article, X_train_sections, X_train_urls, n=5):
    '''This function calculates similarity scores between a document and a corpus

       INPUT: vectorized document corpus, 2D array
              text document corpus, 1D array
              user article, 1D array
              article section names, 1D array
              article URLs, 1D array
              number of articles to recommend, int

       OUTPUT: top n recommendations, 1D array
               top n corresponding section names, 1D array
               top n corresponding URLs, 1D array
               similarity scores between user article and entire corpus, 1D array
    '''
    # calculate similarity between the corpus (i.e. the training data) and the user's article
    similarity_scores = X_train_tfidf.dot(test_article.toarray().T)
    # get sorted similarity score indices, highest first
    sorted_indices = np.argsort(similarity_scores, axis=0)[::-1].flatten()
    # get sorted similarity scores
    sorted_sim_scores = similarity_scores[sorted_indices]
    # get top n most similar documents
    top_n_recs = X_train[sorted_indices[:n]]
    # get top n corresponding document section names
    rec_sections = X_train_sections[sorted_indices[:n]]
    # get top n corresponding urls
    rec_urls = X_train_urls[sorted_indices[:n]]
    # return recommendations and corresponding article meta-data
    return top_n_recs, rec_sections, rec_urls, sorted_sim_scores
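The pieces can be wired together end to end with scikit-learn's TfidfVectorizer. The toy corpus below stands in for the NYT articles (all data here is made up), and the body mirrors `get_top_n_rec_articles` step by step:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus standing in for the NYT training articles (hypothetical data)
X_train = np.array([
    "Obama spoke at the White House about the economy",
    "The soccer team won the championship final",
    "The president addressed Congress on the budget",
])
X_train_sections = np.array(["U.S.", "Sports", "U.S."])
X_train_urls = np.array(["url0", "url1", "url2"])

# fit Tf-Idf on the training corpus, then transform the user's article with it
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
test_article = vectorizer.transform(["Obama held a press conference at the White House"])

# same pipeline as get_top_n_rec_articles: dot product, sort, slice top n
similarity_scores = X_train_tfidf.dot(test_article.toarray().T)
sorted_indices = np.argsort(similarity_scores, axis=0)[::-1].flatten()
top_n_recs = X_train[sorted_indices[:2]]
rec_sections = X_train_sections[sorted_indices[:2]]
print(rec_sections[0])  # the Obama/White House article ranks first: 'U.S.'
```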
The function works as follows:
1. Compute the similarity between the user's article and the corpus;
2. Sort the similarity scores from highest to lowest;
3. Take the top N most similar articles;
4. Look up the section names and URLs for those top N articles;
5. Return the top N articles, section names, URLs, and similarity scores.
Validating the results

Now that we can recommend articles based on what the user is currently reading, let's check whether the results hold up.
We will compare the user's article and its section name against the recommended articles and their section names.
First, a look at the similarity scores.
# similarity scores
sorted_sim_scores[:5]
# OUTPUT:
# 0.566
# 0.498
# 0.479
# .
# .
Cosine similarity ranges from 0 to 1, so these scores are not high. How could we improve them? We could try a different vectorization scheme such as Doc2Vec, or switch to a different similarity metric. Still, let's look at the results.
# user's article's section name
X_test_sections[k]
# OUTPUT:
'U.S'
# corresponding section names for top n recs
rec_sections
# OUTPUT:
'World'
'U.S'
'World'
'World'
'U.S.'
The recommended section names look like a good match.
# user's article
X_test[k]
# OUTPUT:
'LOS ANGELES — The White House says President Barack Obama has told the Defense Department that it must ensure service members instructed to repay enlistment bonuses are being treated fairly and expeditiously.\nWhite House spokesman Josh Earnest says the president only recently become aware of Pentagon demands that some soldiers repay their enlistment bonuses after audits revealed overpayments by the California National Guard. If soldiers refuse, they could face interest charges, wage garnishments and tax liens.\nEarnest says he did not believe the president was prepared to support a blanket waiver of those repayments, but he said "we're not going to nickel and dime" service members when they get back from serving the country. He says they should not be held responsible for fraud perpetrated by others.'
All five recommended articles are related to the article the user is reading; the recommender works as expected.
A note on validation

The ad-hoc validation process of comparing recommended articles and section names suggests that the recommender behaves as intended.
Manual validation works well enough here, but what we ultimately want is a fully automated system that can be plugged into a model and validate itself.
How to put this recommender into such a model is beyond the scope of this post, which aims to show how to prototype a recommender like this on a real-world dataset.
The original article was written by data scientist Alexander Barriga and translated, with some edits and omissions, by 先荐, a personalized-recommendation platform. Please credit the source when reposting.