[1] Keyword-based search engines are in widespread use today as a popular means for Web-based information retrieval.
[2] Although such systems seem deceptively simple, a considerable amount of skill is required in order to satisfy non-trivial information needs.
[3] This paper presents a new conceptual paradigm for performing search in context that largely automates the search process, providing even non-professional users with highly relevant results.
[4] This paradigm is implemented in practice in the IntelliZap system, where search is initiated from a text query marked by the user in a document she views, and is guided by the text surrounding the marked query in that document (“the context”).
[5] The context-driven information retrieval process involves semantic keyword extraction and clustering to automatically generate new, augmented queries.
[6] The latter are submitted to a host of general and domain-specific search engines.
[7] Search results are then semantically reranked, using context. Experimental results testify that using context to guide search effectively offers even inexperienced users an advanced search tool on the Web.
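To make the flow from marked query to reranked results concrete, here is a toy sketch. It is not IntelliZap's algorithm: plain word frequency stands in for semantic keyword extraction and clustering, and simple word overlap stands in for semantic reranking. All function names, the stopword list, and the sample data are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for", "by", "with"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def context_keywords(query, context, k=2):
    """Pick the k most frequent context words not already in the query.
    Frequency is only a stand-in for semantic keyword extraction."""
    query_words = set(tokenize(query))
    counts = Counter(w for w in tokenize(context) if w not in query_words)
    return [word for word, _ in counts.most_common(k)]

def augment_queries(query, context, k=2):
    """Build one augmented query per extracted context keyword."""
    return [f"{query} {keyword}" for keyword in context_keywords(query, context, k)]

def rerank(results, context):
    """Order result snippets by how many context words they share.
    Word overlap is only a stand-in for semantic reranking."""
    context_words = set(tokenize(context))
    return sorted(results,
                  key=lambda r: len(context_words & set(tokenize(r))),
                  reverse=True)

# Invented example: the user marks "jaguar" in a text about wildlife.
query = "jaguar"
context = ("The jaguar is a big cat. The cat hunts in the rainforest, "
           "and the rainforest shelters the jaguar.")
results = ["Jaguar unveils a new luxury car",
           "Big cat spotted hunting in the rainforest"]

augmented = augment_queries(query, context)
ranked = rerank(results, context)
print(augmented)
print(ranked[0])
```

In this toy run the surrounding text steers both the augmented queries and the ranking toward the animal sense of the marked word, which is the effect the paradigm aims for.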
[1] The core of IntelliZap technology is a semantic network, which provides a metric for measuring distances between pairs of words.
[2] The basic semantic network is implemented using a vector-based approach, where each word is represented as a vector in multi-dimensional space.
[3] To assign each word a vector representation, we first identified 27 knowledge domains (such as computers, business and entertainment) roughly partitioning the whole variety of topics.
[4] We then sampled a large set of documents in these domains on the Internet. Word vectors were obtained by recording the frequencies of each word in each knowledge domain.
[5] Each domain can therefore be viewed as an axis in the multi-dimensional space.
[6] The distance measure between word vectors is computed using a correlation-based metric:
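The formula itself does not appear in this excerpt. As a rough illustration only, the sketch below builds per-domain frequency vectors and compares two words with the Pearson correlation coefficient. Pearson correlation is an assumption here, since the text says only that the metric is correlation-based. The corpus is invented, and three domains stand in for the 27 described above.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus: three domains instead of the 27 in the text.
DOMAIN_DOCS = {
    "computers": ["the keyboard and the mouse connect to the computer",
                  "software runs on the computer"],
    "business": ["the bank approved the loan",
                 "money flows through the bank"],
    "entertainment": ["the movie star signed a film deal",
                      "the film made money"],
}

def build_word_vectors(domain_docs):
    """Map each word to its frequency in every domain (one axis per domain)."""
    domains = sorted(domain_docs)
    counts = {d: Counter(w for doc in domain_docs[d] for w in doc.split())
              for d in domains}
    vocabulary = set().union(*counts.values())
    return {w: [counts[d][w] for d in domains] for w in vocabulary}

def pearson(u, v):
    """Pearson correlation of two equal-length vectors (assumed metric)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a constant vector carries no domain preference
    return cov / (norm_u * norm_v)

vectors = build_word_vectors(DOMAIN_DOCS)
same_domain = pearson(vectors["bank"], vectors["loan"])
cross_domain = pearson(vectors["bank"], vectors["film"])
print(f"bank/loan: {same_domain:.2f}, bank/film: {cross_domain:.2f}")
```

Words concentrated in the same domain correlate strongly, while words from different domains do not. A distance could then be derived from the correlation, for example as one minus its value, though that choice is also an assumption.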
[1] Unfortunately, there are no accepted procedures for evaluating performance of semantic metrics.
[2] Following Resnik [1999], we evaluated different metrics by computing correlation between their scores and human-assigned scores for a list of word pairs.
[3] The intuition behind this approach is that a good metric should approximate human judgments well.
[4] While Resnik used a list of 30 noun pairs from Miller and Charles [1991], we opted for a more comprehensive evaluation.
[5] To this end, we prepared a diverse list of 350 noun pairs representing various degrees of similarity, and employed 16 subjects to estimate the “relatedness” of the words in pairs on a scale from 0 (totally unrelated words) to 10 (very much related or identical words).
[6] The vector-based metric achieved 41% correlation with averaged human scores, and the WordNet-based metric achieved 39% correlation. A linear combination of the two metrics achieved 55% correlation with human scores.
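The evaluation procedure can be sketched as follows. All scores below are invented for illustration and do not reproduce the reported figures. Pearson correlation is assumed, since the text does not name the coefficient, and the equal mixing weight is hypothetical, as the text does not state how the combination was weighted.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length score lists (assumed coefficient)."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0

def combine(scores_a, scores_b, weight=0.5):
    """Linear combination of two metrics; the weight is a hypothetical choice."""
    return [weight * a + (1 - weight) * b for a, b in zip(scores_a, scores_b)]

# Hypothetical ratings: one row per word pair, one column per subject (0 to 10).
human_ratings = [[9, 8, 10], [7, 6, 8], [3, 2, 4], [1, 0, 1], [5, 6, 4]]
human_avg = [mean(row) for row in human_ratings]

# Hypothetical metric outputs for the same word pairs (higher = more related).
vector_scores = [0.9, 0.4, 0.5, 0.1, 0.3]
wordnet_scores = [0.6, 0.8, 0.1, 0.3, 0.6]
combined_scores = combine(vector_scores, wordnet_scores)

correlations = {
    "vector-based": pearson(vector_scores, human_avg),
    "WordNet-based": pearson(wordnet_scores, human_avg),
    "combined": pearson(combined_scores, human_avg),
}
for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```

Averaging the subjects' ratings before correlating matches the "averaged human scores" mentioned in the text. In practice the mixing weight would be chosen on data rather than fixed at 0.5.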
[7] Currently, our semantic network is defined for the English language, though the technology can be adapted for other languages with minimal effort.
[8] This would require training the network using textual data for the desired language, properly partitioned into domains.
[9] Linguistic information can be added, subject to the availability of adequate tools for the target language (e.g., EuroWordNet for European languages [EuroWordNet] or EDR for Japanese [Yokoi 1995]).