AgreementMaker: Efficient Matching for Large Real-World Schemas and Ontologies (Translation)

Date: 2019-12-13

Before the main text

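This paper is a piece of previous work cited by the previous work of the framework-based ontology matching paper I read a few days ago. It is arguably a bit weak, but it is still a useful reference, so I downloaded the source code and plan to read the corresponding paper alongside it. I personally like having the Chinese and English side by side: I skim the unimportant parts in Chinese and read the original at the key points. That is why I have ended up producing so many of these translations.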

Citation: Cruz I F, Antonelli F P, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies[J]. Proceedings of the VLDB Endowment, 2009, 2(2): 1586-1589.

Main text

Abstract

We present the AgreementMaker system for matching real world schemas and ontologies, which may consist of hundreds or even thousands of concepts. The end users of the system are sophisticated domain experts whose needs have driven the design and implementation of the system: they require a responsive, powerful, and extensible framework to perform, evaluate, and compare matching methods. The system comprises a wide range of matching methods addressing different levels of granularity of the components being matched (conceptual vs. structural), the amount of user intervention that they require (manual vs. automatic), their usage (stand-alone vs. composed), and the types of components to consider (schema only or schema and instances). Performance measurements (recall, precision, and runtime) are supported by the system, along with the weighted combination of the results provided by those methods. The AgreementMaker has been used and tested in practical applications and in the Ontology Alignment Evaluation Initiative (OAEI) competition. We report here on some of its most advanced features, including its extensible architecture that facilitates the integration and performance tuning of a variety of matching methods, its capability to evaluate, compare, and combine matching results, and its user interface with a control panel that drives all the matching methods and evaluation strategies.

1. Introduction

The issue of schema matching in databases [11], which has been investigated since the early 80’s, is fundamental to data integration, as is the closely-related issue of ontology alignment or matching [12]. The matching problem consists of defining mappings among schema or ontology elements that are semantically related. Such mappings are typically defined between two schemas or two ontologies at a time, one being called the source and the other being called the target.

We have been developing the AgreementMaker matching system, whose name takes after agreement, the encoding of a mapping. The capabilities of our system have been driven by the real-world problems of end users who are sophisticated domain experts. We have considered a variety of domains and applications, including: geospatial [2], environmental [4], and biomedical [13]. The conceptual information for these applications is stored in the form of ontologies. However, as demonstrated by others, the same approach can be used for schema matching [1, 10]. To validate our approach, we competed against seven other systems in the biomedical track of the 2007 Ontology Alignment Evaluation Initiative (OAEI), to match ontologies describing the mouse adult anatomy of the Mouse Gene Expression Database Project (2744 classes) and the human anatomy of the National Cancer Institute (3304 classes). We came in third in terms of accuracy (F-measure) [5].

The AgreementMaker, which is currently in its third version, has been evolving to accommodate: (1) user requirements, as expressed by domain experts; (2) a wide range of input (ontology) and output (agreement file) formats; (3) a large choice of matching methods depending on the different granularity of the set of components being matched (local vs. global), on different features considered in the comparison (conceptual vs. structural), on the amount of intervention that they require from users (manual vs. automatic), on usage (stand-alone vs. composed), and on the types of components to consider (schema only or schema and instances); (4) improved performance, that is, accuracy (precision, recall, F-measure) and efficiency (execution time) for the automatic methods; (5) an extensible architecture to incorporate new methods easily and to tune their performance; (6) the capability to evaluate, compare, and combine different strategies and matching results; (7) a comprehensive user interface supporting both advanced visualization techniques and a control panel that drives all the matching methods and evaluation strategies.

In this demo paper, we focus on the most recent developments of the system, which has been almost completely redesigned in the last year. In particular, we describe: (1) the user interface with particular emphasis on the control panel and improved visualization and interaction capabilities; (2) the automatic matching methods and execution capabilities; and (3) the evaluation strategies for determining the efficiency of the matching methods and for performing the combination of results.

2. RELATED WORK

There are several notable systems related to ours, including Clio [6], COMA++ [1], Falcon-AO [7], and RiMOM [14] (just to mention a few). Clio stands apart because of its single focus on database-specific constraints and operators (e.g., foreign keys, joins) to infer the mappings whereas constraints in ontologies (as implemented in the other three systems and in AgreementMaker) are of a different nature [12]. This different emphasis also permeates the remaining components of the various systems, as those that also support ontology matching implement a rich tool box of string-similarity and structural-based techniques and focus on performance. Consequently, some of these systems do not focus on user interaction: for example, Falcon-AO and RiMOM provide simple interfaces that offer limited user interaction (e.g., no manual manipulation of the ontologies). However, what separates AgreementMaker from these other systems (including from COMA++, which has a more sophisticated user interface than the other two) is the degree to which it integrates the evaluation of the quality of the obtained mappings with the graphical user interface and therefore with the iterative matching process. This tight integration emerged from our work with domain experts, who required that the evaluation be an integral part of the matching process, not an “add on” capability.

Translator's note: the point about integrating quality evaluation with the graphical interface, and therefore with the iterative matching process, seems to mean that improvements in the evaluation results can be seen directly while matching, though I am not certain.

3. ARCHITECTURE

The AgreementMaker supports a wide variety of methods or matchers. Our architecture (see Figure 1) allows for serial and parallel composition where, respectively, the output of one or more methods can be used as input to another one, or several methods can be used on the same input and then combined. A set of mappings may therefore be the result of a sequence of steps, called layers.

The matching process of a generic matcher (see Figure 2) can be divided into two main modules: (1) similarity computation in which each concept of the source ontology is compared with all the concepts of the target ontology, thus producing two similarity matrices (one for classes and the other one for properties), which contain a value for each pair of concepts; (2) mappings selection in which the matrix is scanned to select only the best mappings according to a given threshold and to the cardinality of the correspondences, for example, 1-1, 1-N, N-1, M-N.

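The two modules can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the function names, the toy `token_overlap` similarity, and the greedy way of enforcing 1-1 cardinality are not taken from AgreementMaker, and only one matrix is built here rather than separate ones for classes and properties.

```python
from typing import Callable, List, Tuple


def similarity_matrix(source: List[str], target: List[str],
                      sim: Callable[[str, str], float]) -> List[List[float]]:
    """Module 1: compare every source concept with every target concept."""
    return [[sim(s, t) for t in target] for s in source]


def select_mappings(matrix: List[List[float]], threshold: float,
                    one_to_one: bool = True) -> List[Tuple[int, int, float]]:
    """Module 2: keep pairs at or above the threshold.

    With one_to_one=True a simple greedy pass enforces 1-1 cardinality
    (hypothetical strategy); otherwise every qualifying pair is kept (M-N).
    """
    candidates = [(score, i, j)
                  for i, row in enumerate(matrix)
                  for j, score in enumerate(row)
                  if score >= threshold]
    # Best scores first; ties broken by position so the result is stable.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    if not one_to_one:
        return [(i, j, score) for score, i, j in candidates]
    used_src, used_tgt, chosen = set(), set(), []
    for score, i, j in candidates:
        if i not in used_src and j not in used_tgt:
            chosen.append((i, j, score))
            used_src.add(i)
            used_tgt.add(j)
    return chosen


def token_overlap(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of the lower-cased word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


source = ["heart", "left lung", "spinal cord"]
target = ["lung left", "cord spinal", "heart valve"]
matrix = similarity_matrix(source, target, token_overlap)
mappings = select_mappings(matrix, threshold=0.4)
print(mappings)  # [(1, 0, 1.0), (2, 1, 1.0), (0, 2, 0.5)]
```

Each tuple is (source index, target index, similarity). Raising the threshold to 0.6 would drop the weaker "heart" / "heart valve" pair.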

To enable extensibility, we adopted the object-oriented template pattern by defining the skeleton of the matching process in a generic matcher, which defers only a few operations to the concrete matcher extensions (see Figure 3). This abstraction minimizes development effort by completely decoupling the structure of a single method from the architecture of the whole system, thus allowing reuse or any possible composition of matching modules.

Translator's note: I do not really understand the object-oriented template pattern mentioned here.
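For readers unfamiliar with the template (method) pattern, the following Python sketch may help. It is only an illustration under assumed names (`GenericMatcher`, `ExactLabelMatcher`, `PrefixMatcher`) and is not the system's actual class hierarchy: the base class fixes the skeleton of the matching process, and each concrete matcher supplies only the similarity function.

```python
from abc import ABC, abstractmethod


class GenericMatcher(ABC):
    """Hypothetical base class that owns the skeleton of the matching process."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold

    def match(self, source, target):
        """Template method: the sequence of steps is fixed here."""
        matrix = [[self.similarity(s, t) for t in target] for s in source]
        return self.select(source, target, matrix)

    @abstractmethod
    def similarity(self, s, t):
        """The one operation deferred to concrete matchers."""

    def select(self, source, target, matrix):
        """Default selection: best target per source, if above the threshold."""
        mappings = []
        for i, row in enumerate(matrix):
            if not row:
                continue
            j = max(range(len(row)), key=row.__getitem__)
            if row[j] >= self.threshold:
                mappings.append((source[i], target[j], row[j]))
        return mappings


class ExactLabelMatcher(GenericMatcher):
    def similarity(self, s, t):
        return 1.0 if s.lower() == t.lower() else 0.0


class PrefixMatcher(GenericMatcher):
    def similarity(self, s, t):
        a, b = s.lower(), t.lower()
        common = 0
        for x, y in zip(a, b):
            if x != y:
                break
            common += 1
        return common / max(len(a), len(b), 1)


exact = ExactLabelMatcher().match(["Heart", "Lung"], ["lung", "liver"])
prefix = PrefixMatcher(threshold=0.5).match(["cardiac muscle"],
                                            ["cardiac tissue", "muscle"])
print(exact)   # [('Lung', 'lung', 1.0)]
print(prefix)
```

Adding a new matching method then means writing one small subclass, which is the decoupling the paragraph above describes.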

A first layer matcher produces the similarity matrices, while the second and third layer matchers extend the first layer matchers. In particular, a second layer matcher improves on the results of a first layer matcher using conceptual or structural information, depending on whether it considers one concept alone or a concept and its neighbors. Finally, a third layer matcher combines the results of two or more matchers from the previous layers, in order to obtain a final matching or alignment, that is, a set of mappings.

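The way a second layer matcher builds on a first layer result can be sketched as a serial composition. The refinement rule below (blending a pair's own score with the score of the two concepts' parents, with an assumed `own_weight`) is invented for illustration and is not one of the system's structural matchers.

```python
def first_layer(source, target, sim):
    """First layer: produce the similarity matrix from concept features."""
    return [[sim(s, t) for t in target] for s in source]


def second_layer(matrix, src_parent, tgt_parent, own_weight=0.75):
    """Second layer (hypothetical rule): refine each cell using structure.

    src_parent[i] / tgt_parent[j] hold the index of the parent concept,
    or None for a root. A pair inherits part of its parents' similarity.
    """
    refined = []
    for i, row in enumerate(matrix):
        new_row = []
        for j, score in enumerate(row):
            pi, pj = src_parent[i], tgt_parent[j]
            if pi is None or pj is None:
                new_row.append(score)
            else:
                new_row.append(own_weight * score
                               + (1 - own_weight) * matrix[pi][pj])
        refined.append(new_row)
    return refined


def exact(a, b):
    return 1.0 if a == b else 0.0


source = ["organ", "heart"]
target = ["organ", "cardiac organ"]
src_parent = [None, 0]   # "heart" is a child of "organ"
tgt_parent = [None, 0]   # "cardiac organ" is a child of "organ"

base = first_layer(source, target, exact)
refined = second_layer(base, src_parent, tgt_parent)
print(base)     # [[1.0, 0.0], [0.0, 0.0]]
print(refined)  # [[1.0, 0.0], [0.0, 0.25]]
```

The labels "heart" and "cardiac organ" share nothing lexically, so the first layer scores them 0.0; the structural pass lifts the pair to 0.25 because their parents match exactly.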

4. USER INTERFACE

The source and target ontologies (in XML, RDFS, OWL, or N3) are visualized side by side using the familiar outline tree paradigm (see Figure 4). Agreements can be exported in different formats (e.g., XML, Excel). Because all the matching operations and their results are managed by this interface, we gave special consideration to its design [4]. We describe next two new features of the interface: the control panel and the visualization of non-hierarchical ontologies (e.g., due to multiple inheritance in OWL). The latter feature allows for specific subtrees to be visually duplicated. Because we adopt the Model-View-Control pattern, this duplication does not affect the underlying data structures.

The control panel (see Figure 5) allows users to run and manage matching methods and their results. Users can select parameters common to all methods (such as threshold and cardinality) and method-specific parameters. When a method has run, a new row is dynamically added to the table that is part of the control panel at the same time that lines depicting the mappings between the concepts are added (see Figure 4). Each row is color coded and allows for its selection so that the corresponding mappings (of the same color) can be compared visually. Each row also displays the performance values for the associated methods, thus allowing for the comparison with those of other rows. In addition, users can modify at runtime the method parameters by changing directly their values in the table or by selecting previously calculated matchings as input to the methods to be applied next. Multiple matchings can also be combined manually or with an automatic combination matcher.

5. MATCHING METHODS

First layer matchers compare concept features (e.g., label, comments, annotations, and instances) and use a variety of methods including syntactic and lexical comparison algorithms as well as the use of a lexicon like WordNet. Of those methods some were proposed by others (e.g., edit distance, Jaro-Winkler) and some devised by us, including a substring-based comparison that favors the length of the common substrings and a concept document-based comparison containing a wide range of features. Those features are represented as TF-IDF vectors and use a cosine similarity metric (see Figure 6).

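The idea of comparing "concept documents" as TF-IDF vectors with cosine similarity can be sketched in plain Python. The whitespace tokenization, the smoothed IDF formula, and the example documents are assumptions chosen for the sketch; the paper does not specify these details.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn each concept document into a sparse {term: weight} vector.

    Assumed weighting: relative term frequency times a smoothed IDF,
    ln((1 + N) / (1 + df)) + 1.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
                  * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Each string stands for a concept's label, comments and annotations
# concatenated into one document (hypothetical data).
docs = [
    "heart cardiac muscle organ",
    "heart muscle pumping blood",
    "lung respiratory organ",
]
vecs = tfidf_vectors(docs)
heart_vs_heart = cosine(vecs[0], vecs[1])
heart_vs_lung = cosine(vecs[0], vecs[2])
print(heart_vs_heart > heart_vs_lung)  # True
```

The two heart descriptions share two terms and score higher than the heart and lung descriptions, which share only "organ".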

Second layer matchers use structural properties of the ontologies. Our own methods include the Descendant’s Similarity Inheritance (DSI) and the Sibling’s Similarity Contribution (SSC) matchers [3].

Finally, third layer matchers combine the results of two or more matchers so as to obtain a unique final matching in two steps. In the first step, a similarity matrix is built for each pair of concepts, using our Linear Weighted Combination (LWC) matcher, which processes the weighted average for the different similarity results (see Figure 7). Weights can be assigned manually or automatically, the latter assignment being determined using our evaluation methods. The second step uses that similarity matrix and takes into account a threshold value and the desired cardinality. When the cardinality is 1-1, we adopt the Shortest Augmenting Path algorithm [9] to find the optimal solution for this optimization problem (namely the assignment problem reduced to the maximum weight matching in a bipartite graph) in polynomial time.

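The two steps can be sketched as follows. The weighted average follows the description above; the 1-1 selection, however, uses an exhaustive search over permutations as a stand-in, not the Shortest Augmenting Path algorithm, so it is only suitable for tiny matrices. Applying the threshold after the assignment is also an assumption, as are all names and numbers.

```python
from itertools import permutations


def linear_weighted_combination(matrices, weights):
    """Step 1: cell-wise weighted average of several similarity matrices."""
    total = sum(weights)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices)) / total
             for j in range(cols)]
            for i in range(rows)]


def best_one_to_one(matrix, threshold):
    """Step 2 (stand-in): maximum-total 1-1 assignment by brute force.

    Exponential in the matrix size; a polynomial-time solver would be
    used in practice. Pairs below the threshold are dropped afterwards.
    """
    rows, cols = len(matrix), len(matrix[0])
    if rows > cols:
        raise ValueError("sketch assumes rows <= columns")
    best_total, best_perm = -1.0, ()
    for perm in permutations(range(cols), rows):
        total = sum(matrix[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return [(i, j, matrix[i][j])
            for i, j in enumerate(best_perm)
            if matrix[i][j] >= threshold]


lexical = [[0.9, 0.8],
           [0.8, 0.1]]
structural = [[0.5, 0.4],
              [0.4, 0.5]]
combined = linear_weighted_combination([lexical, structural], [0.75, 0.25])
alignment = best_one_to_one(combined, threshold=0.5)
print([(i, j) for i, j, _ in alignment])  # [(0, 1), (1, 0)]
```

The combined matrix is roughly [[0.8, 0.7], [0.7, 0.2]]. Picking the single best cell first would pair (0, 0) and leave (1, 1) at 0.2, whereas the optimal assignment pairs (0, 1) and (1, 0) for a higher total. For realistic sizes, SciPy's `scipy.optimize.linear_sum_assignment` (with `maximize=True`) solves the same assignment problem in polynomial time.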

6. EVALUATION

The design of optimal methods to find correct and complete mappings between real-world ontologies is a hard task for several reasons. First of all, an algorithm may be effective for a given scenario, but not for others. Even within the same scenario, the use of different parameters can change significantly the outcome. Moreover, in interviewing domain experts in the geospatial domain, we discovered that they do not trust automatic methods unless quality metrics are associated with the matching results. These observations have motivated a variety of evaluation techniques that determine runtime and accuracy (precision, recall, and F-measure).

The most effective evaluation technique compares the mappings found by the system between the two ontologies with a reference matching or “gold standard,” which is a set of correct and complete mappings as built by domain experts. When a reference matching is available, the AgreementMaker can determine the quality of the found matching analytically or visually. A reference matching can also be used to tune algorithms by using a feedback mechanism provided by a succession of runs.

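Comparing a found matching against a reference matching comes down to the standard definitions of precision, recall, and F-measure. The sketch below uses made-up concept pairs; only the formulas are standard.

```python
def evaluate(found, reference):
    """Return (precision, recall, f_measure) of found vs. reference mappings.

    Both arguments are collections of (source_concept, target_concept) pairs.
    """
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical mappings between two anatomy ontologies.
found = [("heart", "Heart"), ("lung", "Lung"), ("cord", "Nerve")]
reference = [("heart", "Heart"), ("lung", "Lung"),
             ("liver", "Liver"), ("cord", "Spinal_Cord")]

precision, recall, f_measure = evaluate(found, reference)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
# 0.667 0.5 0.571
```

Two of the three found mappings are correct (precision 2/3) and two of the four reference mappings are recovered (recall 1/2), giving an F-measure of 4/7. Rerunning this after each parameter change is one simple way to picture the feedback loop mentioned above.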

When a gold standard is not available, “inherent” quality measures need to be considered. Quality measures can be defined at two levels as associated with the two main modules of a matcher (see Figure 2): similarity or selection level. We can consider local quality as associated with a correspondence at the similarity level (or mapping at the selection level) or global quality as associated with all the correspondences at the similarity level (or with all possible mappings at the selection level). We have incorporated in our system a global-selection quality measure proposed by others [8] and a local-similarity quality measure that we have devised. Experiments have shown that our quality measure is usually effective in defining weights for the LWC matcher.

Translator's note: I could not make sense of the sentence that distinguishes local quality from global quality. What on earth is it saying?

7. DEMONSTRATION

Our demo focuses on the matching methods and evaluation strategies for determining the efficiency of ontology matching methods. Due to the tight integration of the evaluation strategies with the graphical user interface, a unique feature of our system, all the steps will be performed through the interface. Users will start by uploading their own ontologies, load our own, or download ontologies from the web, thus taking advantage of the several standard formats supported. Users can then explore the interface freely or follow a walk-through, consisting of browsing the ontologies, expanding and contracting nodes, and customizing the display. They have access to the information associated with each concept to be aligned, including descriptions, annotations, and (context) relations, and they can use them to visually detect mappings.

After the main text

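This first version was produced by running text recognition in CAJViewer, cleaning the output with Python, translating the file directly with Google Translate, and then merging everything together. It is probably not very reader-friendly, so I will correct it bit by bit once I have finished reading the paper.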