Basic Concepts
When we run a query on Google, each result carries a "Similar pages" link; clicking it sends another search request that looks for documents similar to the original result.
Solr's similarity query, MoreLikeThis, implements the same functionality.
The documentation explains it like this:
Generate "more like this" similarity queries.
Based on this mail:
Lucene does let you access the document frequency of terms, with IndexReader.docFreq().
Term frequencies can be computed by re-tokenizing the text, which, for a single document,
is usually fast enough. But looking up the docFreq() of every term in the document is
probably too slow.
You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much,
or at all. Since you're trying to maximize a tf*idf score, you're probably most interested
in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
reduce the number of terms under consideration. Another heuristic is that terms with a
high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the
number of characters, not selecting anything less than, e.g., six or seven characters.
With these sorts of heuristics you can usually find small set of, e.g., ten or fewer terms
that do a pretty good job of characterizing a document.
It all depends on what you're trying to do. If you're trying to eek out that last percent
of precision and recall regardless of computational difficulty so that you can win a TREC
competition, then the techniques I mention above are useless. But if you're trying to
provide a "more like this" button on a search results page that does a decent job and has
good performance, such techniques might be useful.
An efficient, effective "more-like-this" query generator would be a great contribution, if
anyone's interested. I'd imagine that it would take a Reader or a String (the document's
text), analyzer Analyzer, and return a set of representative terms using heuristics like those
above. The frequency and length thresholds could be parameters, etc.
Doug
A few key points (the core idea) from the above:
The MoreLikeThis built into Lucene 5 computes similarity by scoring: terms are put into a priority queue according to their final score, so the highest-scoring terms naturally sit at the top of the queue.
The official documentation explains it like this:
There are three ways to use MoreLikeThis. The first, and most common, is to use it as a request handler. In this case, you would send text to the MoreLikeThis request handler as needed (as in when a user clicked on a "similar documents" link).
The second is to use it as a search component. This is less desirable since it performs the MoreLikeThis analysis on every document returned. This may slow search results.
The final approach is to use it as a request handler but with externally supplied text. This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.
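As a concrete illustration of the first approach, here is a minimal SolrJ sketch, not taken from the article: the class name MltRequestDemo, the core URL, the document id id:123, and the title/body fields are all invented for the example, and it assumes a /mlt request handler is registered in solrconfig.xml (5.x-style SolrJ API):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MltRequestDemo {
    public static void main(String[] args) throws Exception {
        // hypothetical core; the 5.x-style constructor is used here
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
        SolrQuery query = new SolrQuery("id:123"); // the source document
        query.setRequestHandler("/mlt");
        query.set("mlt.fl", "title,body"); // fields to mine for similar terms
        query.set("mlt.mintf", 2);         // minimum term frequency
        query.set("mlt.mindf", 5);         // minimum document frequency
        QueryResponse rsp = client.query(query);
        System.out.println(rsp.getResults()); // documents similar to id:123
        client.close();
    }
}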
The official documentation further explains:
MoreLikeThis constructs a Lucene query based on terms in a document. It does this by pulling terms from the defined list of fields ( see the mlt.fl parameter, below). For best results, the fields should have stored term vectors in schema.xml. For example:
<field name="cat" ... termVectors="true" />
If term vectors are not stored, MoreLikeThis will generate terms from stored fields. A uniqueKey must also be stored in order for MoreLikeThis to work properly.
The next phase filters terms from the original document using thresholds defined with the MoreLikeThis parameters. Finally, a query is run with these terms, and any other query parameters that have been defined (see the mlt.qf parameter, below) and a new document set is returned.
MoreLikeThis constructs a Lucene Query from the terms of an indexed document. Those terms have to come from a defined list of fields, specified via the mlt.fl parameter. For best results, those fields should store term vector information, i.e. the field's termVectors attribute should be set to true.
If a field's term vector information is not stored, MoreLikeThis will generate terms from stored fields (fields with stored="true"). For MoreLikeThis to work properly, the uniqueKey must also be stored.
OK, honestly I still don't know exactly how MoreLikeThis is implemented, but we can take away a few key points from the descriptions above.
Time to look at how the source code implements it. The key member fields of MoreLikeThis are:
| Parameter | Description |
|---|---|
| private Analyzer analyzer | the analyzer used to tokenize text |
| private int minTermFreq | minimum term frequency (default 2) |
| private int minDocFreq | minimum document frequency (default 5) |
| private int maxDocFreq | maximum document frequency (default 2147483647, i.e. Integer.MAX_VALUE) |
| private int maxQueryTerms | maximum number of terms in the generated query (default 25) |
| private TFIDFSimilarity similarity | used to compute relevance scores |
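These parameters are all exposed through setters. Below is a minimal configuration sketch, not the article's code: it assumes a Lucene 5.x classpath, and the class name MltQueryBuilder and the field names "title" and "body" are made up for illustration:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.Query;

public class MltQueryBuilder {
    /** Build a "more like this" query for the document with internal id docNum. */
    public static Query buildMltQuery(IndexReader reader, int docNum) throws IOException {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setAnalyzer(new StandardAnalyzer()); // required when term vectors are not stored
        mlt.setMinTermFreq(2);    // drop terms occurring fewer than 2 times in the doc
        mlt.setMinDocFreq(5);     // drop terms appearing in fewer than 5 documents
        mlt.setMaxQueryTerms(25); // keep at most 25 terms in the final query
        mlt.setFieldNames(new String[] {"title", "body"}); // illustrative field names
        return mlt.like(docNum);
    }
}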
To interoperate smoothly with Lucene, MoreLikeThis extends it with a method that returns a Query, Lucene's own query object: as long as Lucene searches with that object, it retrieves similar results, so MoreLikeThis and Lucene combine seamlessly. Solr provides a good example of this. We will use the public Query like(int docNum) method to explain how the similarity query is implemented:
public Query like(int docNum) throws IOException {
    if (this.fieldNames == null) {
        // no fields configured: fall back to every indexed field
        Collection<String> fields = MultiFields.getIndexedFields(this.ir);
        this.fieldNames = fields.toArray(new String[fields.size()]);
    }
    return this.createQuery(this.retrieveTerms(docNum));
}
The docNum parameter is the id of the hit: the search result from which you want to find other, similar results. fieldNames can be understood as the fields we have chosen; the document's values in these fields are extracted and used to analyze similarity. As the code makes plain, these fields are configurable.
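For clarity, docNum is Lucene's internal document id, the kind returned by an earlier search. A hedged sketch (the helper name likeClickedHit is hypothetical, and it assumes a configured MoreLikeThis instance):

import java.io.IOException;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

static Query likeClickedHit(IndexSearcher searcher, Query userQuery, MoreLikeThis mlt) throws IOException {
    TopDocs hits = searcher.search(userQuery, 10); // the user's original search
    int docNum = hits.scoreDocs[0].doc;            // internal Lucene id of the clicked hit
    return mlt.like(docNum);                       // the "more like this" query for that hit
}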
The logic of the code above is as follows.
A term vector is essentially a mathematical model built from how frequently each term occurs in the document and how frequently documents contain the term; the similarity of two documents can be judged by computing the angle between their term vectors.
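As a toy illustration of that vector-space idea (this is not Lucene's code, just the intuition; the class name CosineDemo is invented), cosine similarity between two term-frequency maps can be computed like this:

import java.util.Map;

public class CosineDemo {
    /** Cosine similarity of two term-frequency vectors: 1.0 means identical direction. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other; // only shared terms add to the dot product
            }
            normA += e.getValue() * (double) e.getValue();
        }
        for (int v : b.values()) {
            normB += v * (double) v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}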
/**
 * Find words for a more-like-this query former.
 *
 * @param docNum the id of the lucene document from which to find terms
 */
private PriorityQueue<ScoreTerm> retrieveTerms(int docNum) throws IOException {
    Map<String, Int> termFreqMap = new HashMap<>();
    for (String fieldName : fieldNames) {
        final Fields vectors = ir.getTermVectors(docNum);
        final Terms vector;
        if (vectors != null) {
            vector = vectors.terms(fieldName);
        } else {
            vector = null;
        }
        // Field does not store term vector info, so recompute: the stored value
        // is tokenized here and term frequencies are counted. Note that by
        // default this uses the StandardAnalyzer!
        if (vector == null) {
            Document d = ir.document(docNum);
            IndexableField[] fields = d.getFields(fieldName);
            for (IndexableField field : fields) {
                final String stringValue = field.stringValue();
                if (stringValue != null) {
                    addTermFrequencies(new StringReader(stringValue), termFreqMap, fieldName);
                }
            }
        } else { // the term vector was stored earlier, so just add it directly
            addTermFrequencies(termFreqMap, vector);
        }
    }
    return createQueue(termFreqMap);
}
Note that in this term-frequency map a term is not tied to a field: whether it comes from the title or the body, occurrences of identical terms are accumulated together. addTermFrequencies does exactly that, storing the accumulated counts in termFreqMap.
private void addTermFrequencies(Reader r, Map<String, MoreLikeThis.Int> termFreqMap, String fieldName) throws IOException {
    if (this.analyzer == null) {
        throw new UnsupportedOperationException("To use MoreLikeThis without term vectors, you must provide an Analyzer");
    }
    TokenStream ts = this.analyzer.tokenStream(fieldName, r);
    try {
        int tokenCount = 0;
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            String word = termAtt.toString();
            if (++tokenCount > this.maxNumTokensParsed) {
                break; // cap how much of the text is analyzed
            }
            if (this.isNoiseWord(word)) {
                continue; // skip terms rejected by the length/stopword filters
            }
            // accumulate the term's frequency across all values of this field
            MoreLikeThis.Int cnt = termFreqMap.get(word);
            if (cnt == null) {
                termFreqMap.put(word, new MoreLikeThis.Int());
            } else {
                ++cnt.x;
            }
        }
        ts.end();
    } finally {
        IOUtils.closeWhileHandlingException(ts);
    }
}
In the step above we also need to filter out noise. The noise filter follows a few rules: terms shorter than minWordLen or longer than maxWordLen are treated as noise, and so are stop words:
/**
 * determines if the passed term is likely to be of interest in "more like" comparisons
 *
 * @param term The word being considered
 * @return true if should be ignored, false if should be used in further analysis
 */
private boolean isNoiseWord(String term) {
int len = term.length();
if (minWordLen > 0 && len < minWordLen) {
return true;
}
if (maxWordLen > 0 && len > maxWordLen) {
return true;
}
return stopWords != null && stopWords.contains(term);
}
The queue here is a priority queue. The previous step produced every <term, frequency> pair; even after denoising there are still too many terms, so we need to pick out the most important top N.
To do that, every term is scored and ranked, essentially by computing tf and idf.
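To make the scoring concrete, here is a small worked example; the numbers are invented, and the idf formula shown is the classic Lucene one, idf = ln(numDocs / (docFreq + 1)) + 1:

public class TfIdfDemo {
    public static void main(String[] args) {
        long numDocs = 1_000_000; // documents in the index
        long docFreq = 100;       // documents containing the term
        int tf = 5;               // occurrences of the term in the source document
        // classic Lucene idf: rarer terms (low df) get a higher weight
        double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        System.out.printf("score = tf * idf = %d * %.2f = %.2f%n", tf, idf, tf * idf);
        // prints approximately: score = tf * idf = 5 * 10.20 = 51.00
    }
}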
/**
 * Create a PriorityQueue from a word->tf map.
 *
 * @param words a map of words keyed on the word(String) with Int objects as the values.
 */
private PriorityQueue<ScoreTerm> createQueue(Map<String, Int> words) throws IOException {
    // have collected all words in doc and their freqs
    int numDocs = ir.numDocs(); // total number of documents in the index
    final int limit = Math.min(maxQueryTerms, words.size());
    FreqQ queue = new FreqQ(limit); // priority queue ordered by term score
    for (String word : words.keySet()) { // for each term
        int tf = words.get(word).x; // the term's frequency in the source document
        if (minTermFreq > 0 && tf < minTermFreq) {
            continue; // like the noise filter: drop terms whose tf is too low
        }
        // scan every field and find the one where this term's df is highest
        String topField = fieldNames[0];
        int docFreq = 0;
        for (String fieldName : fieldNames) {
            int freq = ir.docFreq(new Term(fieldName, word));
            topField = (freq > docFreq) ? fieldName : topField;
            docFreq = (freq > docFreq) ? freq : docFreq;
        }
        if (minDocFreq > 0 && docFreq < minDocFreq) {
            continue; // drop terms whose df is too low
        }
        if (docFreq > maxDocFreq) {
            continue; // drop terms whose df is too high
        }
        if (docFreq == 0) {
            continue; // index update problem? a df of 0 should be impossible, so skip it
        }
        float idf = similarity.idf(docFreq, numDocs);
        float score = tf * idf;
        // store the result in the priority queue
        if (queue.size() < limit) {
            // there is still space in the queue
            queue.add(new ScoreTerm(word, topField, score, idf, docFreq, tf));
        } else {
            ScoreTerm term = queue.top();
            if (term.score < score) { // update the smallest in the queue in place and update the queue.
                term.update(word, topField, score, idf, docFreq, tf);
                queue.updateTop();
            }
        }
    }
    return queue;
}
At this point we have the terms scored and ranked: the higher a term's score, the better it represents the main content of the whole document. (Thinking about it this way, the idea from the start of the article starts to make sense: when looking for articles similar to a given one, naturally the more terms they share the better.)
/**
 * Create the More like query from a PriorityQueue
 */
private Query createQuery(PriorityQueue<ScoreTerm> q) {
    BooleanQuery query = new BooleanQuery();
    ScoreTerm scoreTerm;
    float bestScore = -1;
    while ((scoreTerm = q.pop()) != null) {
        TermQuery tq = new TermQuery(new Term(scoreTerm.topField, scoreTerm.word));
        // optionally boost each TermQuery by its score relative to the best
        // term; boost defaults to false
        if (boost) {
            if (bestScore == -1) {
                bestScore = scoreTerm.score;
            }
            float myScore = scoreTerm.score;
            tq.setBoost(boostFactor * myScore / bestScore);
        }
        // assemble the BooleanQuery, joining the term queries with SHOULD
        try {
            query.add(tq, BooleanClause.Occur.SHOULD);
        } catch (BooleanQuery.TooManyClauses ignore) {
            break; // stop once the clause limit is reached
        }
    }
    return query;
}
And with that, a single document plus a set of chosen fields yields a query. This query, acting as the soul of the document, goes out to find the documents most similar to it.
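To close the loop, here is a hedged end-to-end sketch; the index path, the "title" field, the document id 42, and the class name MoreLikeThisDemo are illustrative, and buildMltQuery is the helper sketched earlier in this article:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class MoreLikeThisDemo {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query mltQuery = MltQueryBuilder.buildMltQuery(reader, 42); // source doc id
            TopDocs similar = searcher.search(mltQuery, 10); // top 10 similar documents
            for (ScoreDoc sd : similar.scoreDocs) {
                System.out.println(searcher.doc(sd.doc).get("title") + "  score=" + sd.score);
            }
        }
    }
}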