A Survey of Current Link Discovery Frameworks


Undefined 0 (2015) 1–0, IOS Press

Markus Nentwig a, Michael Hartung a, Axel-Cyrille Ngonga Ngomo b,* and Erhard Rahm a

a Database Group, University of Leipzig, Augustusplatz 10, 04109 Leipzig, Germany
E-mail: {nentwig, hartung, rahm}@informatik.uni-leipzig.de
b AKSW, University of Leipzig, Augustusplatz 10, 04109 Leipzig, Germany
E-mail: [email protected]

* Corresponding author. E-mail: [email protected]

Abstract: Links build the backbone of the Linked Data Cloud. With the steady growth in size of datasets comes an increased need for end users to know which frameworks to use for deriving links between datasets. In this survey, we comparatively evaluate current Link Discovery tools and frameworks. For this purpose, we outline general requirements and derive a generic architecture of Link Discovery frameworks. Based on this generic architecture, we study and compare the features of state-of-the-art linking frameworks. We also analyze reported performance evaluations for the different frameworks. Finally, we derive insights pertaining to possible future developments in the domain of Link Discovery.

1. Introduction

Over the last years, the Linked Open Data (LOD) Cloud has been the most well-known incarnation of the Linked Data Principles. The intention behind this set of interlinked datasets is to create the initial seed for the machine-readable extension of the current Web dubbed the Data Web. While partly very large datasets are being added to the LOD Cloud on a regular basis (e.g., Linked TCGA [47]), they are only sparsely linked with other datasets. Recent studies show that 44% of the LOD datasets are not connected to other datasets at all [48]. This problem is of major importance as links are central for manifold applications including federated queries [46] and answering complex questions [49,53]. The main reason for this blatant lack of links in the LOD Cloud lies in the creation of links being a very tedious process when carried out manually. This is especially true when dealing with large knowledge bases which contain a very large number of resources. For example, creating links between DBpedia (http://dbpedia.org; 4.5 million resources) and LinkedGeoData (http://linkedgeodata.org; 1+ million resources) would last several decades if checking whether two resources should be linked lasted 1 ms.

Several software tools and frameworks have already been developed to address the link discovery problem, especially to identify semantically equivalent objects in different data sources. The basic intuition behind most of these approaches is to reduce the link discovery problem to a similarity computation problem: Given two sets of resources S and T, the goal is to automatically find pairs of resources in S × T that should be linked with each other, e.g., according to an owl:sameAs relationship. Two main problems arise when dealing with link discovery in this manner: achieving both a high effectiveness and a high efficiency of the linking process. A high effectiveness requires finding (almost) all links between two given sources without deriving incorrect links. Achieving this goal requires finding a suitable link configuration or specification [20,33] specifying the similarity condition(s) two resources s ∈ S and t ∈ T have to comply with in order to count as being in the input relation.

0000-0000/15/$00.00 © 2015 – IOS Press and the authors. All rights reserved
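The reduction of link discovery to a similarity computation over S × T can be illustrated with a minimal, self-contained sketch. The trigram tokenizer, the Jaccard measure and the 0.9 threshold below are illustrative assumptions, not the approach of any particular framework:

```python
def trigrams(s):
    """Character 3-grams of a lowercased string (illustrative tokenizer)."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)} or {s}

def sim(a, b):
    """Jaccard similarity of trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def naive_link_discovery(S, T, threshold=0.9):
    """Naive LD: evaluate every pair in S x T -- O(|S|*|T|) comparisons."""
    mapping = []
    for s in S:
        for t in T:
            score = sim(s, t)
            if score >= threshold:
                mapping.append((s, t, "owl:sameAs", score))
    return mapping

links = naive_link_discovery(["Berlin"], ["berlin", "Paris"])
# [('Berlin', 'berlin', 'owl:sameAs', 1.0)]
```

The nested loop makes the quadratic complexity discussed in the text explicit; the blocking and filtering techniques surveyed later exist precisely to avoid it.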

Even when given a suitable link specification, we have to address the efficiency problem since a naïve implementation which compares all elements of S with all elements of T would have a complexity of O(|S|·|T|).

Link discovery and the related problems of entity resolution or object matching are being studied extensively. A large number of techniques have already been described in several surveys and books, e.g., [13,55,7]. In contrast to these works, we focus on surveying and comparing the currently available link discovery tools and frameworks. The goal is thus to survey the state-of-the-art in existing solutions which could be applied to solve specific linking tasks. Our comparison is based on numerous criteria derived from major requirements as well as from the steps of a generic link discovery workflow that we will present in the following sections. The workflow takes into account the newest developments in this research area including support for learning-based configurations and human interaction. We will first present a functional comparison of ten current frameworks. We will consider published performance evaluations for the considered tools including the outcome of instance-level benchmarks of the Ontology Alignment Evaluation Initiative (OAEI). We will try to assess the used evaluation criteria and the comparability of the achieved results.

We expect the presented criteria and methodology to be useful to comparatively evaluate additional tools. We plan to continuously extend and update the tool comparison under http://aksw.org/projects/linkinglod.

2. Problem statement and requirements

2.1. Link Discovery Problem

The Link Discovery (LD) problem can be described as follows: Given two sets of resources S and T (for example about movies) and a relation R (e.g., owl:sameAs or dbo:producer), find all pairs (s, t) ∈ S × T such that R(s, t) holds. The result is represented as a set of links called a mapping: M_{S,T} = {(s, t, R) | s ∈ S, t ∈ T, R(s, t)}. Optionally, a similarity score sim ∈ [0, 1] computed by an LD tool can be added to the entries of mappings to express the confidence of a computed link. In this case, links can be represented as quadruples (s, t, R, sim(s, t)).

Solving the LD problem is challenging due to the typically large volume and semantic heterogeneity of datasets, making it difficult to meet major requirements such as high effectiveness and high efficiency. These and further requirements will be discussed in the next subsection. LD has many similarities with the problem of entity resolution (also called deduplication, reference reconciliation or object matching) that has already been extensively addressed [9,25,7]. In particular, similar techniques for evaluating the similarity between objects and for improving the efficiency can be applied. Still, there are significant differences between LD and entity resolution that have led to the development of specific tools for LD. Most entity resolution approaches focus on homogeneous datasets of relatively simple, structured objects described by a set of single-valued attributes. By contrast, the resources for LD can be heterogeneous and highly interrelated within the datasets. In particular, resources usually abide by an ontology which describes the properties that resources of a certain type can have as well as the relations between the classes that the resources instantiate. Thus, the LD process usually involves an ontology and an instance matching part (see general workflow in Figure 1). Furthermore, entity resolution techniques focus on finding semantically equivalent objects while LD aims at identifying diverse relations (including owl:sameAs and domain-specific relations).

2.2. Requirements

As mentioned before, supporting a high effectiveness and efficiency are two main requirements for a LD framework. In the following we pose further requirements and desiderata such as low manual effort for configuration and tuning, support for online LD as well as the provision of a powerful infrastructure.

Effectiveness: A LD tool should generate mappings of high quality w.r.t. common measures such as precision, recall and F-measure. Hence, results should be precise, i.e., the links generated by a given framework should be correct (precision). A LD tool should also generate as many links as possible to ensure completeness. In summary, only links between resources that really belong together should be produced. This aim is usually achieved by a combination of different LD methods. Systems may support rather simple match techniques such as string similarity comparisons for labels (e.g., [40]) but also complex ones, e.g., by considering the semantic neighborhood of a resource or by reusing already available links [21,37].
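The quality measures named above can be computed against a gold-standard mapping; the `evaluate` helper below is a hypothetical minimal sketch (mappings reduced to sets of (s, t) pairs):

```python
def evaluate(found, reference):
    """Precision, recall and F-measure of a computed mapping
    against a gold-standard mapping (both sets of (s, t) pairs)."""
    found, reference = set(found), set(reference)
    tp = len(found & reference)  # correctly found links
    precision = tp / len(found) if found else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = evaluate({("s1", "t1"), ("s2", "t9")}, {("s1", "t1"), ("s2", "t2")})
# p = r = f = 0.5: one of two produced links is correct,
# and one of two reference links was found
```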

Furthermore, a LD tool should support different link types [54]. In our comparison we will evaluate which LD methods are supported by an LD tool and which effectiveness could be demonstrated in benchmark evaluations.

Efficiency: A LD tool should be fast and scalable to large datasets, e.g., with hundreds of thousands or millions of resources. A naive, non-scalable approach evaluates all possible pairs of resources (Cartesian product) resulting in a quadratic complexity. Hence, a main efficiency goal is to reduce the search space so that the evaluation of irrelevant pairs of resources is largely avoided. Another general optimization approach is parallel LD on multiple cores or multiple nodes in a cluster. This includes the utilization of modern hardware and infrastructures such as graphical processing units (GPU) or Hadoop-based clusters [33].

Low Configuration and Tuning Effort: Achieving a high effectiveness generally demands complex link specifications with the combined use of multiple similarity measures and adequate settings for configuration parameters such as similarity thresholds. Manually specifying such configurations is very difficult and time-consuming so that this effort should largely be reduced by automated approaches. This can be achieved by learning-based methods, e.g., by supervised approaches using training data of matching or non-matching pairs of resources. Alternatively, the LD framework can analyze the datasets, e.g., to select suitable similarity measures or properties to evaluate. In order to really reduce the manual configuration effort, the automated approaches should not introduce a significant extra configuration, e.g., for providing training data or specifying new tuning parameters.

Online and Offline LD: In addition to a classical offline execution of LD, applications such as mashups or on-demand query systems demand an online LD to integrate data from several data sources at runtime. Hence, a LD tool should support such a runtime or ad-hoc LD, e.g., by providing an appropriate API. Typically, the number of resources to be linked in this way is small, thereby facilitating a sufficiently fast execution.

Powerful infrastructure: The support for LD discussed in the previous desiderata requires a set of powerful and easy-to-use tools. In particular, a LD tool should come with flexible libraries of different similarity functions, support different performance optimizations, provide different possibilities to access data sources for LD, and offer a graphical user interface to display and configure the workflow. Furthermore, the specified LD workflow should be executable on different platforms, preferably with parallel processing. Mechanisms for collaborative work in groups or crowd-sourcing should also be provided to more easily overcome problems like the labeling of training data or the generation of gold standards. Overall, a tool should be designed domain-independent but it should be possible to flexibly customize it for specific LD tasks, e.g., linking geographical resources or knowledge from the life sciences.

3. LD workflow

Current LD frameworks mostly apply workflows which consist of several steps to perform LD. In most cases, these workflows are instantiations of the generic workflow shown in Figure 1. This workflow is a generalization of the architecture derived by analyzing the LD frameworks compared later on (starting in Section 4). The input of the workflow includes the two datasets (source, target) to be linked, configuration parameters and optional background knowledge resources. The input data may be provided in the form of RDF/OWL dumps or in the form of a SPARQL endpoint for query-based data access. Linking may be restricted to a subset of a data source, e.g., instances of a particular class; for example, a geographic data source contains settlements and there is no need to compare these with actors from a more generic data source such as DBpedia. The configuration input may either be a complete linking specification (e.g., rules for comparing resources) or selected parameters such as similarity thresholds. Training data required for learning-based linking is another kind of configuration input. Optionally, tools can make use of further knowledge resources such as dictionaries or previously determined mappings for reuse. The output of the workflow is the set of found links or correspondences representing a mapping between the source and target datasets.

The generic workflow itself has three main phases: preprocessing, matching (similarity computation) and postprocessing. Preprocessing in turn deals with two important tasks: finalizing the linking specification (configuration) and improving runtime efficiency, e.g., by reducing the search space for similarity computations in the main match phase. Preprocessing may also include preparatory steps to transform and clean the input data, e.g., to remove stop words or resolve abbreviations. While matching is completely automatic, there may be user interaction for preprocessing, e.g., to label training data for learning-based linking, and for postprocessing, e.g., to verify computed links with a lower confidence.
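The three phases of the generic workflow can be sketched as a small pipeline. All names and the toy blocking and matching functions below are illustrative assumptions, not the interface of any surveyed tool:

```python
def link_discovery(source, target, spec):
    """Generic three-phase LD workflow (names are illustrative)."""
    # 1. Preprocessing: apply the specification's search-space reduction
    candidates = spec["blocking"](source, target)
    # 2. Matching: evaluate the similarity condition on the remaining pairs
    mapping = [(s, t, spec["sim"](s, t)) for s, t in candidates
               if spec["sim"](s, t) >= spec["threshold"]]
    # 3. Postprocessing: e.g., refine or repair the computed links
    return spec["postprocess"](mapping)

spec = {
    # toy blocking key: identical first letter
    "blocking": lambda S, T: [(s, t) for s in S for t in T
                              if s[0].lower() == t[0].lower()],
    # toy matcher: case-insensitive equality
    "sim": lambda s, t: 1.0 if s.lower() == t.lower() else 0.0,
    "threshold": 0.5,
    "postprocess": lambda m: m,
}
print(link_discovery(["Berlin"], ["berlin", "Bonn"], spec))
# [('Berlin', 'berlin', 1.0)]
```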

Figure 1. General workflow of LD frameworks (steps with dashed borders are optional): the source and target datasets pass through preprocessing (configuration, runtime optimization), instance matching and optional ontology matching, and postprocessing to produce a mapping (set of links); a human expert and external resources may support individual steps.

In the following we describe the two preprocessing steps, configuration and runtime optimization, as well as the match and postprocessing phases and their implementation alternatives in more detail. In the tool evaluation we will study which of the different options are applied.

3.1. Configuration

LD is typically based on evaluating the similarity of resources according to one or several criteria. Each criterion is based on a specific similarity measure or similarity function and compares either properties or the semantic context of resources. For example, two movies may be linked by an owl:sameAs property based on the similarity of their titles, their release years and the set of actors who starred in them. Specifying a linking configuration thus entails the specification of the elements (properties, context) to evaluate as well as the similarity measures to apply (e.g., a 3-gram string similarity, Jaccard similarity for sets or numerical difference) and a way to derive a combined linking decision from the individual similarity values, e.g., based on similarity thresholds to meet.

According to [25], different similarity values may be combined either numerically or using rule-based or workflow-based approaches. Numerical approaches aggregate different similarity values, e.g., by taking a weighted average, and apply a single similarity threshold to the aggregated value. Rule-based approaches use so-called match rules to derive a match or link decision. Such rules define logical combinations of conditions, e.g., 3-gram similarity for title > 0.9 and equal release year. Workflow-based approaches are less common and assume the iterative calculation of different similarity values during the match phase to determine a link decision. For example, one could first calculate the string similarity for a selected property and then apply a more expensive context-based similarity measure (e.g., for the set of movie actors) only for pairs of resources with a high similarity for the first criterion [52,31].

A manual definition of effective linking specifications such as match rules is difficult to achieve in many cases even for domain experts. Hence, it is desirable to automate at least some of the decisions such as selecting the properties or the similarity measures to evaluate. This is achieved by adaptive LD approaches that analyze characteristics of the input data to achieve a partially automated specification of the linking configuration [36,3,56,29].

Alternatively, learning-based approaches can be applied to semi-automatically or automatically derive a linking specification. The proposed learning approaches for this purpose are mostly supervised, i.e., they depend on suitable training data consisting of pairs of resources which are labeled as matching (linking) or non-matching.
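The numerical and rule-based combination strategies from Section 3.1 can be sketched for the movie example; the property names, weights and thresholds below are illustrative assumptions:

```python
def trigram_sim(a, b):
    """Jaccard similarity of character 3-gram sets."""
    g = lambda s: {s[i:i + 3] for i in range(len(s) - 2)} or {s}
    return len(g(a) & g(b)) / len(g(a) | g(b))

def numerical_combination(sim1, sim2, w1=0.7, w2=0.3, threshold=0.8):
    """Numerical: weighted average of similarities, one global threshold."""
    return w1 * sim1 + w2 * sim2 >= threshold

def match_rule(movie_a, movie_b):
    """Rule-based: 3-gram title similarity > 0.9 AND equal release year."""
    return (trigram_sim(movie_a["title"], movie_b["title"]) > 0.9
            and movie_a["year"] == movie_b["year"])

a = {"title": "The Matrix", "year": 1999}
b = {"title": "The Matrix", "year": 1999}
c = {"title": "The Matrix", "year": 2003}
# match_rule(a, b) holds; match_rule(a, c) fails on the year condition
```

A workflow-based approach would instead evaluate the cheap title similarity first and compute a more expensive context measure only for the surviving pairs.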

The learned classification model may be based on different learning techniques such as decision trees, SVM or genetic algorithms. Labeling training data is often a manual step requiring the interaction of humans. The manual labeling effort may be kept feasible by crowdsourcing. Alternatively, the amount of training data can be limited by active learning, where user feedback is only requested for a smaller number of controversial pairs for which a similarity function cannot find a clear linking decision. Learning-based approaches may also be unsupervised, thereby avoiding the need for training data. However, these approaches may still require the specification of critical parameters such as suitable similarity or distance measures and threshold values [38,29].

3.2. Runtime optimization

The main approach to optimize the runtime for LD during preprocessing is a reduction of the search space to avoid that the Cartesian product S × T of the input datasets needs to be evaluated. This is mainly supported by two complementary approaches called blocking and filtering. Blocking partitions the datasets into multiple partitions or blocks such that links are only determined between resources of the same partition. There are several approaches with disjoint or overlapping partitions for this, e.g., standard blocking (based on a predefined blocking key) or canopy clustering [4,7]. Furthermore, multiple blocking keys may be applied to partition the input according to several criteria so that the likelihood of finding all links is improved. The blocking key is commonly based on attribute or property values, e.g., one could partition movies according to the first three letters of the movie title or according to the last name of the movie director. Resources can also be partitioned based on their associated ontology concepts if both data sources have comparable concepts, e.g., the genre of movies. We will call such an approach concept-based blocking.

Filtering utilizes details of the linking configuration, such as the similarity measure or similarity threshold, to filter out pairs of records that cannot meet the similarity condition. For example, token-based string similarity measures such as the Jaccard or Dice similarity can only exceed a certain threshold if the input strings are of similar length and share a certain number of tokens [5]. Preprocessing can support the efficient execution of such filters in the match phase, e.g., by creating a token index. Blocking and filtering can jointly be applied, e.g., to reduce the number of comparisons for partition-wise linking. Furthermore, both approaches can be utilized in combination with parallel LD [24,33].

3.3. Match approaches

The main phase of the LD workflow applies the linking specification and evaluates the specified similarity measures on the pairs of resources that still need to be considered according to the used blocking or filter methods. An LD tool typically has a library of different match techniques (or matchers) that apply a similarity measure on the resources to link with each other. These matchers have been categorized as either element- or structure-based [45,11] depending on whether they evaluate simple resource elements such as atomic property values (literals) or whether they consider the context of resources (e.g., related instances or the ontological context). Element-level matchers are most common and can be based on similarity measures for strings (n-gram, TF/IDF, edit distance, etc.) [6], numbers or domain-specific data types such as geographical coordinates. They are typically applied for matching comparable properties of resources that have been specified as part of the linking specification (either manually or automatically). Similarity computation may also utilize different kinds of background knowledge such as general-purpose or domain-specific dictionaries and thesauri.

Structure- or context-based matchers are more sophisticated and aim at deriving the similarity of resources from the similarity of their context. There is a large spectrum of possible approaches depending on what context and which similarity computation is applied. For example, some approaches use so-called anchor links between highly similar resources as a seed to iteratively find matching entities in the sets of their related entities [21,18]. The search for matches can also be confined to instances of equivalent or related classes, thereby utilizing the ontological context.

A promising LD approach is to utilize already existing links and mappings to find new links. Based on the transitivity of the equality relation one can compose several owl:sameAs links to derive new owl:sameAs links. Effective strategies for such a composition of mappings and links have been proposed and evaluated in [16]. Public mapping repositories such as BioPortal [43] or LinkLion [27] support the publication of links and thus their reuse for determining new links.
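The blocking and filtering techniques of Section 3.2 can be sketched as follows. The blocking key (first three letters of the title) comes from the example in the text; the length-filter bound follows directly from the definition of the Jaccard similarity, since Jaccard(a, b) ≥ t implies t·|a| ≤ |b| ≤ |a|/t for the token-set sizes. The helper names are hypothetical:

```python
from collections import defaultdict

def standard_blocking(resources, key=lambda r: r[:3].lower()):
    """Group resources by a blocking key, e.g., first three letters;
    comparisons are then restricted to pairs within the same block."""
    blocks = defaultdict(list)
    for r in resources:
        blocks[key(r)].append(r)
    return blocks

def length_filter_ok(tokens_a, tokens_b, threshold):
    """Jaccard(a, b) >= threshold is only possible if the token-set
    sizes satisfy threshold * |a| <= |b| <= |a| / threshold."""
    la, lb = len(tokens_a), len(tokens_b)
    return threshold * la <= lb <= la / threshold

blocks = standard_blocking(["Matrix", "Matrix Reloaded", "Memento"])
# {'mat': ['Matrix', 'Matrix Reloaded'], 'mem': ['Memento']}
```

Only the two "mat" titles would be compared with each other, and the length filter can additionally discard candidate pairs inside a block before any similarity value is computed.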

3.4. Postprocessing

In the final phase the results of the matchers need to be combined and the links need to be selected from the set of candidate links according to the linking specification, e.g., by applying a match rule or a learned classification model. The resulting links may be further refined or repaired to avoid inconsistencies such as the violation of ontological or application-specific constraints. For example, one could request a 1:1 mapping so that each instance is linked with at most one instance of the other input dataset. Hence, postprocessing could enforce this restriction by selecting the best link per instance, e.g., the one with the highest computed confidence value. Human feedback is generally helpful during postprocessing to verify the correctness of computed links.

4. Functional Comparison

In this section, we provide a functional comparison of ten state-of-the-art frameworks for LD based on the requirements and the general LD workflow discussed in the previous sections. The selection of tools was furthermore driven by the following criteria:

– OAEI benchmark instance matching track participation and relatively good performance in the appropriate year, or
– learning-based approach for LD and usage of OAEI instance matching track datasets.

Given this premise, Table 1 lists the considered frameworks with their originating organization, their first LD-related publication and further criteria that allow a rough grouping of the tools. Seven of the tools have participated in the instance matching contest of the OAEI. The remaining three frameworks (Silk, LIMES, KnoFuss) support among others learning-based approaches for determining linking specifications. A further criterion indicates that four of the seven tools of the first group support pure ontology matching in addition to instance matching. In fact, these frameworks (RiMOM, AgreementMaker, LogMap and CODI) mostly started with ontology matching and supported instance matching later. Due to the generality of the LD workflow and the given requirements, the approach of the comparison can easily be used for other tools, too.

For the more detailed comparison of the tools we collect the main features in Tables 2 and 3 for the mentioned two groups of 7+3 systems. The considered criteria belong to the following categories, largely following the steps of the introduced LD workflow:

– Supported input formats
– Configuration approach
– Runtime optimizations
– Match approaches
– Postprocessing
– Support for parallel processing
– User interface (GUI support) and interaction
– General availability

In the following subsections we will discuss these aspects for the different frameworks. Finally we will summarize our observations from the functional comparison.

4.1. Data input

Eight of the ten tools accept the input datasets in RDF file format while two frameworks (AgreementMaker, SERIMI) need to retrieve the data from SPARQL endpoints. While SPARQL endpoints support a flexible and dynamic data access, they can cause availability and performance problems. In addition to RDF, CODI, LogMap and RiMOM additionally support OWL input files. Access to SPARQL endpoints is also supported by the learning-based tools Silk, LIMES and KnoFuss. Dynamic data access with SPARQL typically uses a restriction to certain classes (e.g., books, settlements), thereby limiting the data volume and search space for finding links.

Surprisingly, a large number of the considered frameworks does not seem to rely on external background knowledge such as dictionaries or already known links and mappings. This statement obviously does not hold for the frameworks that rely on any form of supervised machine learning. This is in strong contrast to ontology matching where virtually all current tools utilize dictionaries such as WordNet as background knowledge [44]. The tools RiMOM, AgreementMaker and LogMap also utilize such dictionaries for their ontology matching but apparently not for linking instance data.

4.2. Configuration

Most frameworks can only determine owl:sameAs links or equivalent instances. LIMES and Silk also support additional link types which need to be manually specified by the tool user.
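The 1:1 restriction discussed for the postprocessing phase (Section 3.4) can be enforced with a simple greedy selection over the confidence values. This is a minimal sketch, not the method of any surveyed tool, and a simpler alternative to stable-marriage style selection:

```python
def enforce_one_to_one(mapping):
    """Greedy 1:1 selection: visit links in order of decreasing
    confidence and keep a link only if both its source and target
    instance are still unmatched."""
    used_s, used_t, result = set(), set(), []
    for s, t, conf in sorted(mapping, key=lambda link: -link[2]):
        if s not in used_s and t not in used_t:
            result.append((s, t, conf))
            used_s.add(s)
            used_t.add(t)
    return result

links = [("s1", "t1", 0.9), ("s1", "t2", 0.8),
         ("s2", "t1", 0.7), ("s2", "t2", 0.6)]
print(enforce_one_to_one(links))
# [('s1', 't1', 0.9), ('s2', 't2', 0.6)]
```

Each source and each target instance ends up in at most one link; the lower-confidence alternatives for s1 and t1 are discarded.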

Table 1
Considered LD tools (sorted by year of initial publication). "X" marks OAEI instance matching (IM) participation, a learning-based approach, and support for pure ontology matching, respectively.

System / initial publication   Institution                          OAEI IM   Learning-based   Pure ontology matching
RiMOM [51]                     Univ. of Tsinghua, China             X         -                X
KnoFuss [39]                   Open Univ. Milton Keynes, UK         -         X                -
AgreementMaker [8]             Univ. of Illinois at Chicago, USA    X         -                X
Silk [54]                      Univ. of Mannheim, Germany           -         X                -
CODI [37]                      Univ. of Mannheim, Germany           X         -                X
LIMES [32]                     Univ. of Leipzig, Germany            -         X                -
LogMap [21]                    Univ. of Oxford, UK                  X         -                X
SERIMI [3]                     Delft Univ. of Techn., Netherlands   X         -                -
Zhishi.links [40]              Shanghai Jiao Tong Univ., China      X         -                -
SLINT+ [35]                    Nat. Inst. of Informatics, Japan     X         -                -

Four frameworks rely on a purely manually specified linking configuration (CODI, LogMap, AgreementMaker, Zhishi.links). In the case of several matchers, the resulting similarity values are combined according to a weighted average approach or a match rule. The learning-based tools KnoFuss, Silk, and LIMES also support manually specified match rules. Three tools (RiMOM, SERIMI, SLINT+) already follow a semi-automatic, adaptive linking specification by analyzing the datasets and identifying the most discriminating properties. For example, if publications have to be matched, the title will be more discriminating than the venue of the publication. SERIMI is limited to only a single property to be selected for matching. Further parameters such as similarity thresholds have to be manually specified.

Silk and LIMES both support supervised learning of a linking specification with genetic programming, either with batch learning or active learning [19,28]. Genetic programming starts from a set of random link specifications and uses the evolutionary principles of selection and variation to evolve these specifications until a linking condition meets a predefined optimization criterion (fitness function) or a maximal number of iterations is reached. For supervised learning, manually labeled link candidates are used within the genetic algorithm to find link specifications that come close to the match decisions for the training data. Active learning aims at reducing the labeling effort for training data and applies an interactive labeling of automatically chosen link candidates [19]. Link candidates for active learning are selected to optimize criteria such as entropy or the similarity correlation to unlabeled instances [34].

KnoFuss and LIMES also implement an unsupervised learning of the linking specification [38,29]. These approaches also utilize genetic programming but try to iteratively optimize measures that evaluate indirect quality criteria such as high similarity values and closeness to a 1:1 mapping (assuming duplicate-free data sources) [38,30,29]. In KnoFuss, the candidate linking specifications aggregate the weighted similarity values of several string matchers and require the aggregated similarity value to exceed a certain threshold. The approach thus has to select the matchers, determine their weights, the aggregation function (e.g., average or max) and the similarity threshold.

4.3. Runtime optimization

Silk is the only framework implementing an explicit blocking to reduce the search space. It supports the (manual) specification of multiple blocking keys, i.e., only instances sharing one of the blocking keys must be compared with each other. A multidimensional index is applied to implement this strategy [20]. An implicit blocking is achieved by preselecting in the input specification the classes to be processed, but this does not allow to a-priori reduce the search space for the instances of a class, which may be numerous.
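The evolutionary search underlying the learning approaches described above can be illustrated with a toy example. The sketch below evolves only a weight and a threshold for two fixed similarity values per pair, which is a drastically simplified stand-in for genetic programming over full link specifications (as in Silk, LIMES or KnoFuss); all names and parameters are illustrative:

```python
import random

def f_measure(spec, labeled):
    """Fitness: F-measure of a (weight, threshold) spec on labeled pairs.
    Each labeled example is (sim1, sim2, is_match)."""
    w, t = spec
    tp = fp = fn = 0
    for s1, s2, is_match in labeled:
        predicted = w * s1 + (1 - w) * s2 >= t
        if predicted and is_match: tp += 1
        elif predicted: fp += 1
        elif is_match: fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def evolve(labeled, generations=50, pop_size=20, seed=42):
    """Toy evolutionary search: keep the fittest half of the population,
    refill it with Gaussian mutations, repeat."""
    rng = random.Random(seed)
    pop = [(rng.random(), rng.random()) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda spec: -f_measure(spec, labeled))
        survivors = pop[:pop_size // 2]
        mutants = [(min(1, max(0, w + rng.gauss(0, 0.1))),
                    min(1, max(0, t + rng.gauss(0, 0.1))))
                   for w, t in survivors]
        pop = survivors + mutants
    return max(pop, key=lambda spec: f_measure(spec, labeled))

labeled = [(0.9, 0.8, True), (0.95, 0.9, True),
           (0.3, 0.4, False), (0.2, 0.1, False)]
best = evolve(labeled)
```

Supervised learning uses labeled pairs as above for the fitness function; the unsupervised variants replace it with indirect quality criteria such as closeness to a 1:1 mapping.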

Table 2
Characteristics of proposed LD frameworks ("-" means not existing, "?" unclear from publication, "*" supported in the respective ontology matching framework). All seven tools support only owl:sameAs links.

Tool            Data input   Configuration   Matcher combination    Structure matcher        Runtime optimization   Post-processing        Parallel processing   GUI
RiMOM           RDF, OWL     adaptive        weighted average       ?*                       indexing               -                      -                     -
AgreementMaker  SPARQL       manual          weighted combination   semantic                 indexing               -                      -                     X
CODI            RDF, OWL     manual          weighted average       iterative anchor-based   -                      coherence checks       -                     -
LogMap          RDF, OWL     manual          weighted average       iterative anchor-based   indexing               inconsistency repair   -                     X
SERIMI          SPARQL       adaptive        ?                      -                        -                      -                      -                     -
Zhishi.links    RDF          manual          weighted combination   semantic disparity       indexing               -                      MapReduce             -
SLINT+          RDF          adaptive        weighted average       -                        inverted indexing      -                      -                     -

Table 3
Characteristics of learning-based LD frameworks ("-" means not existing, "*" not in current release)

Criterion                     Silk                       LIMES                        KnoFuss
Data input                    RDF, SPARQL, CSV           RDF, SPARQL                  RDF, SPARQL, CSV
Supported link types          owl:sameAs,                owl:sameAs,                  owl:sameAs
                              user-specified others      user-specified others
Configuration                 manual (match rules),      manual (match rules),        manual (match rules),
                              supervised learning        supervised learning          unsupervised learning
                              (genetic programming,      (genetic programming,        (genetic programming)
                              active learning)           active learning),
                                                         unsupervised
                                                         (genetic programming)
Runtime opt. (blocking)       multi-dimensional          -                            -
                              space tiling
Runtime opt. (filtering)      -                          indexing                     -
String similarity measures    X                          X                            X
Further similarity measures   numeric, date equality     geographical coordinates,    -
                                                         numeric, date equality
Structure matcher             -                          -                            -
External dictionaries         -                          -                            -
Existing mappings             -                          -                            -
Post-processing               -                          one-to-one mapping           -
                                                         (stable marriage,
                                                         hospital-resident)
Parallel processing           MapReduce                  (MapReduce)*                 -
GUI / web interface           X / X                      X / X                        - / -
Download tool / source        X / X                      X / X                        X / -
Open source project           X                          X                            -

The main approach to improve runtime in the considered tools is filtering, especially by utilizing inverted index structures. This optimization focuses mostly on a specific property and similarity measure (matcher). For example, token-based string similarity measures such as Jaccard require matching values to share several tokens. Hence, all pairs without a common token can be excluded from the comparison. An inverted index allows to quickly determine the instances that still must be considered.

4.4. Matching strategies

All tools support element-level matchers on selected properties based on string similarity measures such as edit distance, n-gram, or Jaccard [6]. Only few tools (Zhishi.links, Silk, LIMES) also support built-in numerical similarity measures (e.g., Euclidean distance) or domain-specific measures such as for geographical coordinates. Except for SERIMI, all frameworks can match on more than one property [2]. The similarity values of different matchers are combined according to the linking specification (match rule, weighted average, or a learned linking specification).

the linking specification (match rule, weighted average) or according to a learned linking specification.

In addition to simple matching on property values, four frameworks (CODI, LogMap, AgreementMaker, Zhishi.links) already apply structural matching based on the ontology structure to find links. LogMap and CODI apply an iterative anchor-based matching approach. Within the instances of comparable concepts, so-called anchor links are determined first between almost identical instances. Both LogMap and CODI then use information from the ontology to iteratively extend the existing mapping by evaluating the similarity of related instances, either utilizing object-property assertions [18] or logical reasoning [23]. In LogMap the similarity computation is performed by an algorithm called ISUB [50] that combines three different metrics. CODI simply employs a threshold-based edit distance [41]. The structural matching in AgreementMaker is based on its approach used for ontology matching. Zhishi.links applies a two-step matching approach. Initially it determines property-based similarities. The results are filtered via a threshold and the similarities are then semantically refined based on the similarity of related resources in the ontological context [40].

4.5. Postprocessing

The main task of postprocessing is to select the links according to the linking specification, e.g., by applying a match rule taking into account the computed similarity values. Additional verification steps are applied by LogMap and CODI to avoid that inconsistent mappings are determined. These tools also support pure ontology matching, where such postprocessing steps are quite common. Specifically, LogMap applies logical reasoning [22] and CODI utilizes logical coherence checks to identify links contradicting ontological restrictions [42]. Furthermore, KnoFuss and LIMES employ postprocessing strategies to ensure that every instance in the source can have at most one corresponding instance in the target dataset [38,28].

4.6. Support for parallel LD

For high efficiency and scalability, support for parallel LD is beneficial. In addition to utilizing multiple processors of a single node, parallel LD may also use several nodes in a cluster, e.g., running Apache Hadoop with MapReduce. Three of the ten frameworks already support a MapReduce implementation: LIMES [17], Zhishi.links [40] and Silk³. Another promising option is to use parallel processing on massively parallel graphics processors (GPUs), as already explored in [33]. While these optimizations have already been studied in the context of the mentioned tools, they are not an integral part of the available tool versions as they require a specific infrastructure (Hadoop cluster or GPU).

4.7. User interface and interaction

User interfaces for the ten frameworks range from simple command line interfaces (with diverging sets of options) over stand-alone installations to web applications. Only four tools (LogMap, AgreementMaker, LIMES, Silk) support a GUI for convenient interactive use (Tables 2 and 3).

4.8. Availability for other researchers

As seen in Tables 2 and 3, all tools (except AgreementMaker) are publicly available; five tools even follow an Open Source strategy.

4.9. Observations

The considered tools provide a very good general availability, offering a rich choice for interested users and researchers. Most tools already support advanced methods for semi-automatic configuration of linking specifications, in three cases based on learning approaches such as genetic programming. Efficiency is mainly addressed by filtering techniques for specific matchers rather than by more general blocking approaches to reduce the search space. Parallel processing has been investigated by some of the tools but is not generally available in the offered tool versions. Most tools only support rather simple property-based matchers; the more advanced structural match techniques are available in four tools. The high potential of utilizing already existing links and mappings as well as other data sources or dictionaries as background knowledge has not yet been exploited. GUI support is also limited to few tools. The three learning-based tools KnoFuss, Silk and LIMES provide the most options for linking configuration and runtime optimization; Silk and LIMES also provide a GUI and are not limited to finding owl:sameAs links only.

³ https://www.assembla.com/spaces/silk/wiki/Silk_MapReduce
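The token-based filtering with an inverted index described in Section 4.4 can be sketched as follows; the whitespace tokenization and the sample data are illustrative assumptions, not taken from any particular framework:

```python
from collections import defaultdict

def tokenize(value):
    """Illustrative whitespace tokenization; real tools also offer n-grams."""
    return set(value.lower().split())

def candidate_pairs(sources, targets):
    """Return only source/target pairs sharing at least one token.
    All other pairs cannot reach a non-zero Jaccard similarity and are
    excluded without any pairwise comparison."""
    index = defaultdict(set)              # token -> ids of target instances
    for tid, value in targets.items():
        for tok in tokenize(value):
            index[tok].add(tid)
    pairs = set()
    for sid, value in sources.items():
        for tok in tokenize(value):
            pairs.update((sid, tid) for tid in index[tok])
    return pairs

sources = {"s1": "New York City", "s2": "Leipzig"}
targets = {"t1": "York", "t2": "Berlin"}
pairs = candidate_pairs(sources, targets)   # only ("s1", "t1") remains
```

Instead of |S| x |T| comparisons, only the pairs returned by the index lookup need to be scored by the actual similarity measure.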

5. Comparison of evaluation results

In this section, we analyze the published evaluation results for the considered frameworks. Special emphasis is given to results for the Ontology Alignment Evaluation Initiative (OAEI)⁴ instance matching track, which aims at an evaluation of different systems under the same conditions.

Similarly to previous evaluation studies on entity resolution [7,25], we consider the following criteria:

– Format of input data (RDF, OWL, etc.).
– Determined link types.
– Real vs. artificial (synthetic) datasets: artificial datasets are typically created by systematically changing real instances to create similar (matching) instances to be identified by the evaluated approaches. This supports the generation of large datasets for scalability experiments.
– Considered data sources and domains.
– Effectiveness: achieved linking quality in terms of precision, recall and F-measure w.r.t. a perfect linking result (gold standard).
– Efficiency: runtime results and scalability to large data volumes.

In the following we first describe the results for OAEI instance matching benchmarks, which provide the best possible comparability for the different tools so far. Afterwards we briefly discuss observations from additional evaluations and summarize the main findings.

5.1. OAEI benchmark tests

The Ontology Alignment Evaluation Initiative (OAEI) [12] has performed yearly contests since 2005 to comparatively evaluate current tools for ontology and instance matching. The original focus has been on ontology matching, but since 2009 instance matching has also been a regular evaluation track. As already discussed in the previous section, seven of the ten tools have already participated in this track. Even the three other learning-based frameworks used some of the OAEI test cases for their evaluations. Despite this situation, the analysis of the results for the OAEI benchmark is complicated because the tasks and the participating systems change every year.

Table 4 gives an overview of the OAEI instance matching tasks in five contests from 2010 until 2014. Most tasks have only been used in one year while others like IIMB have been changed in different years. Most tests are based on artificially changed datasets where values and the structural context of instances have been modified in a controlled way. The tests cover different domains (life sciences, people, geography, etc.) and LOD data sources (DBpedia, Freebase, GeoNames, NYTimes, etc.). Frequently the benchmarks consist of several match (linking) tasks to cover a certain spectrum of complexity. The number of instances is rather small in all tests, with a maximal size of a data source of 9,958 or fewer instances. The evaluation focus has been solely on effectiveness (e.g., F-measure) while runtime efficiency has not been measured. Almost all tasks focus on identifying equivalent instances (owl:sameAs links).

We briefly characterize the different OAEI tasks as follows.

IIMB and Sandbox (SB): The IIMB benchmark has been part of the 2010, 2011 and 2012 contests and consists of 80 test cases using synthetically modified datasets derived from instances of 29 Freebase concepts. The tests and number of instances vary from year to year, but the tests are generally of a very small size (e.g., at most 375 instances in 2012). The Sandbox (SB) benchmark from 2012 is very similar to IIMB but limited to 10 different test cases [1].

PR (Persons/Restaurant): This benchmark is based on real person and restaurant instance data which are artificially modified by adding duplicates and variations of property values. The dataset is relatively small with about 500-600 instances in the restaurant data source and even less in the person data source.

DI-NYT (Data Interlinking - NYT): This 2011 benchmark includes seven tasks to link about 10,000 instances from the NYT data source to DBpedia, Freebase and GeoNames instances. The perfect match result contains about 31,000 owl:sameAs links to be identified [10].

RDFT: This 2013 benchmark is also of small size (430 instances) and uses several tests with differently modified DBpedia data. For the first time in the OAEI instance matching track, no reference mapping is provided for the actual evaluation task. Instead, training data with an appropriate reference mapping is given for each test case, thereby supporting frameworks relying on supervised learning [15].

⁴ http://www.ontologymatching.org
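The effectiveness criterion above (precision, recall and F-measure w.r.t. a gold standard) can be made concrete with a small sketch; the link sets below are made up for illustration:

```python
def precision_recall_f1(found_links, gold_links):
    """Compare computed links against a gold standard (perfect result)."""
    true_positives = len(found_links & gold_links)
    precision = true_positives / len(found_links) if found_links else 0.0
    recall = true_positives / len(gold_links) if gold_links else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1

gold = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")}
found = {("s1", "t1"), ("s2", "t2"), ("s3", "t9")}  # 2 correct, 1 incorrect
p, r, f1 = precision_recall_f1(found, gold)
# p = 2/3, r = 1/2, f1 = 4/7 (about 0.571)
```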

| Year | Name | Input Format | Type of problem | Domains | LOD Sources | Link Type | Max. # Resources | Tasks |
|---|---|---|---|---|---|---|---|---|
| 2010 | DI | RDF | real | life sciences | diseasome, drugbank, dailymed, sider | equality | 5,000 | 4 |
| 2010 | IIMB | OWL | artificial | cross-domain | Freebase | equality | 1,416 | 80 |
| 2010 | PR | RDF, OWL | artificial | people | - | equality | 864 | 3 |
| 2011 | DI-NYT | RDF | real | people, geography, organizations | NYTimes, DBpedia, Freebase, Geonames | equality | 9,958 | 7 |
| 2011 | IIMB | OWL | artificial | cross-domain | Freebase | equality | 1,500 | 80 |
| 2012 | SB | OWL | artificial | cross-domain | Freebase | equality | 375 | 10 |
| 2012 | IIMB | OWL | artificial | cross-domain | Freebase | equality | 375 | 80 |
| 2013 | RDFT | RDF | artificial | people | DBpedia | equality | 430 | 5 |
| 2014 | id-rec | OWL | artificial | publications | ? | equality | 2,649 | 1 |
| 2014 | sim-rec | OWL | artificial | publications | ? | similarity | 173 | 1 |

Table 4
OAEI instance matching tasks over the years ("-" means not existing, "?" unclear from publication)

OAEI 2014: Two benchmark tasks have to be performed in 2014, the first one (id-rec) requiring the identification of the same real-world book entities (sameAs links). For this purpose, 1,330 book instances have to be matched with 2,649 synthetically modified instances in the target dataset. Data transformations include changes like the substitution of book titles and labels with keywords as well as language transformations. The second task (sim-rec) requires determining the similarity of pairs of instances which do not reflect the same real-world entities. This addresses common preprocessing tasks, e.g., to reduce the search space for LD. In 2014, the central evaluation platform SEALS [14] is used for instance matching, too. Therefore, a runtime evaluation can be conducted for participating tools. The sim-rec task is not further evaluated in this paper.

5.2. Evaluation results of OAEI tasks

Table 5 shows the participation of the considered tools in the different OAEI contests and benchmarks. Overall, many tools participated only once or twice (AgreementMaker, SERIMI, Zhishi.links, SLINT+) and several benchmarks have only been evaluated by one or two systems (IIMB 2010, 2011 and 2012, SB, DI 2010, id-rec 2014).
The learning-based tools have the search space for LD. In 2014, the central evaluation used the PR and DI-NYT benchmarks but not within platform SEALS [14] is used for instance matching, the contest so that a direct comparability is not given. too. Therefore, a runtime evaluation can be conducted This is because outside the contest tools could apply a

more intensive tuning and utilize additional information such as training data. Our comparison will thus focus on the benchmarks with most participants: PR, DI-NYT and RDFT.

Table 6 shows the reported F-measure results for the PR benchmark tasks for matching people and restaurant records. The original reference mapping proved to be erroneous, so that it was corrected after the OAEI contest, making it difficult to compare the achieved results. Within the contest, the RiMOM system could clearly outperform the CODI system. The evaluations outside the competition used the corrected reference mapping and show especially good results for KnoFuss. In general, the small size of the linking problems and the achievable F-measure of 0.98-1.0 indicate that the benchmark tasks are easy to solve.

The F-measure results for the DI-NYT benchmark in Table 7 indicate a more diverse situation. Of the three frameworks participating in the contest, Zhishi.links achieved the best results with consistent F-measure values between 0.87 and 0.97 for the seven tasks. By contrast, AgreementMaker and SERIMI performed somewhat worse due to problems for one or two of the tasks. The results reported for the three systems that did not participate in the contest are generally better. The achievable F-measure results for all tasks are between 0.93 and 0.99, indicating that these tasks are also relatively easy to solve.

F-measure results for the RDFT benchmark from the OAEI 2013 contest are summarized in Table 8. Again, the different tasks could be solved to a large degree, with maximal F-measure values between 0.96 and 1.0. The overall best results are achieved by RiMOM, followed by SLINT+ and LogMap. The 2014 id-rec task turned out to be much more challenging. Of the two participants, RiMOM again outperformed LogMap with an F-measure result of only 0.56 vs. 0.10.

| FMeas. | RiMOM2013 | SLINT+ | LogMap |
|---|---|---|---|
| test01 | 1.00 | 0.98 | 0.80 |
| test02 | 1.00 | 0.97 | 0.88 |
| test03 | 0.98 | 0.92 | 0.84 |
| test04 | 0.96 | 0.91 | 0.80 |
| test05 | 0.96 | 0.88 | 0.74 |

Table 8
F-measure results for test cases of the OAEI 2013 benchmark RDFT.

In summary, most of the OAEI instance benchmarks so far have been of small size and relatively easy to solve, or attracted only few frameworks participating in the contest. RiMOM could outperform competing systems in three different benchmarks. The frameworks using OAEI benchmarks outside the contest achieved generally very good results that unfortunately are not directly comparable with the results for the frameworks participating in the OAEI contests. Runtime values and thus scalability have not yet been evaluated for OAEI instance matching.

5.3. Other evaluations

The learning-based frameworks KnoFuss, Silk and LIMES did not yet participate in the OAEI contest but evaluated their effectiveness and runtime efficiency with several evaluations using diverse data sources including DBpedia, DrugBank, GeoNames, DBLP and LinkedMDB. Unfortunately, the studies typically used different test cases with specific configurations, so that the results can hardly be compared with each other. For example, Silk [19] and LIMES [28] both evaluate a LinkedMDB-DBpedia dataset but use varying numbers of resources. Similarly, reported execution times strongly depend on the used hardware configuration, so that they mainly serve to show the relative performance of the respective system w.r.t. different data sizes and other configuration parameters.

For genetic programming algorithms, efficiency largely depends on the number of needed iterations. As an example, Silk needed 2,558.8s for 25 iterations to link DrugBank with DBpedia but already 21,387.5s for 50 iterations [19]. The genetic algorithm also faces a quadratic complexity of the selection phase w.r.t. the data volume. Hence, random sampling is applied to reduce the number of possible candidates for the generation of the next population. Again, runtime and quality of the results compete with each other, as shown in [38] where bigger sampling sizes help to achieve a good F-measure at the expense of increased execution times.

Instance-based linking is similar to entity resolution, and the comparative evaluation of entity resolution frameworks faces similar challenges as the evaluation of LD frameworks. The study [26] evaluated several entity resolution tools on several real datasets on publications and product offers of e-commerce websites. While the publication-related match tasks were relatively easy to solve, the two e-commerce match tasks turned out to be especially challenging with a maximal F-measure of only 60 and 71% for the considered tools. These match tasks have also been used to

evaluate further tools including LD frameworks such as LIMES, e.g., in [34,28]. Results in [34] confirm the difficulty of the e-commerce match tasks with achieved F-measure values ranging below 35%.

| Task | AgreementMaker | SERIMI | CODI | Zhishi.links | LogMap | RiMOM | SLINT+ | LIMES | Silk | KnoFuss |
|---|---|---|---|---|---|---|---|---|---|---|
| 2010 PR | - | - | X | - | - | X | - | * | * | * |
| 2010 IIMB | - | - | X | - | - | X | - | - | - | - |
| 2010 DI | - | - | - | - | - | X | - | - | - | - |
| 2011 IIMB | - | - | X | - | - | - | - | - | - | - |
| 2011 DI-NYT | X | X | - | X | - | - | * | - | * | * |
| 2012 SB | - | - | - | - | X | - | - | - | - | - |
| 2012 IIMB | - | - | - | - | X | - | - | - | - | - |
| 2013 RDFT | - | - | - | - | X | X | X | - | - | - |
| 2014 id-rec | - | - | - | - | X | X | - | - | - | - |

Table 5
Tool participation in OAEI instance matching tracks over the years. "*" did not participate in OAEI contest.

| | CODI | RiMOM | KnoFuss * | LIMES (unsupervised) * | Silk * |
|---|---|---|---|---|---|
| Person1 | 0.91 | 1.00 | 1.00 | - | 1.00 |
| Person2 | 0.36 | 0.97 | 0.99 | - | 0.94 |
| Restaurant (OAEI) | 0.72 | 0.81 | - | 0.78 | - |
| Restaurant (fixed) | - | - | 0.98 | 0.82 | 0.99 |

Table 6
F-measure results of the OAEI 2010 benchmark PR (Person/Restaurant). "*" result was achieved outside the OAEI contest.

| | AgrMaker | SERIMI | Zhishi.links | KnoFuss * | Silk * | Slint+ * |
|---|---|---|---|---|---|---|
| nyt-dbpedia-loc. | 0.69 | 0.68 | 0.92 | 0.89 | 0.93 | 0.97 |
| nyt-dbpedia-org. | 0.74 | 0.91 | 0.92 | - | 0.95 | 0.88 |
| nyt-dbpedia-peo. | 0.88 | 0.97 | 0.97 | - | 0.99 | 0.94 |
| nyt-freebase-loc. | 0.85 | 0.91 | 0.88 | - | 0.95 | 0.93 |
| nyt-freebase-org. | 0.80 | 0.91 | 0.87 | 0.92 | - | 0.96 |
| nyt-freebase-peo. | 0.96 | 0.92 | 0.93 | 0.95 | - | 0.99 |
| nyt-geonames | 0.85 | 0.80 | 0.91 | 0.90 | - | 0.99 |
| H-mean | 0.82 | 0.85 | 0.91 | 0.93 | - | 0.97 |

Table 7
F-measure results for OAEI 2011 benchmark DI-NYT [10]. H-mean is calculated manually from the single F-measure values of the appropriate publication. "*" result was achieved outside the OAEI contest.

5.4. Observations

Despite the laudable effort of the OAEI instance matching tracks, the comparable evaluation of existing tools for LD is still a largely open challenge. This is mainly because the participation in the OAEI contest has been limited so far, and using the OAEI tasks outside the competition limits the comparability of the achieved results as they are typically based on different prerequisites, e.g., the use of training data.
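The H-mean reported for DI-NYT in Table 7 is the harmonic mean of the per-task F-measure values; a minimal sketch with illustrative (made-up) values:

```python
def h_mean(values):
    """Harmonic mean of per-task F-measure values."""
    return len(values) / sum(1.0 / v for v in values)

# Illustrative per-task F-measures for seven linking tasks.
hm = round(h_mean([0.92, 0.92, 0.97, 0.88, 0.87, 0.93, 0.91]), 2)  # 0.91
```

The harmonic mean penalizes a single weak task more strongly than the arithmetic mean, which suits quality summaries over several tasks.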
Evaluation results on a single system or approach aim at showing Despite the laudable effort of the OAEI instance their effectiveness and efficiency rather than provid- matching tracks the comparable evaluation of existing

their effectiveness and efficiency rather than providing a neutral comparative evaluation between systems. Given the general availability of LD tools, it would be a worthwhile investigation to apply them under the same prerequisites on a set of LD tasks, similar to the entity resolution study [26]. Such a study could evaluate not only the effectiveness and efficiency but also the usability of tools.

6. Conclusion

We investigated ten LD frameworks and compared their functionality based on a common set of criteria. The criteria cover the main steps such as the configuration of linking specifications and methods for matching and runtime optimization. We also covered general aspects such as the supported input formats and link types, support for a GUI, and software availability as open source. We observed that the considered tools already provide a rich functionality with support for semi-automatic configuration, including advanced learning-based approaches such as unsupervised genetic programming or active learning. On the other side, we found that most tools still focus on simple property-based match techniques rather than using the ontological context within structural matchers. Furthermore, existing links and background knowledge are not yet exploited in the considered frameworks. More comprehensive support of efficiency techniques is also necessary, such as the combined use of blocking, filtering and parallel processing.

We also analyzed comparative evaluations of the LD frameworks to assess their relative effectiveness and efficiency. In this respect the OAEI instance matching track is the most relevant effort, and we thus analyzed its match tasks and the tool participation and results for the last years. Unfortunately, the participation has been rather low, thereby preventing the comparative evaluation between most of the tools. Moreover, the focus of the contest has been on effectiveness so far, while runtime efficiency has not yet been evaluated. To better assess the relative effectiveness and efficiency of LD tools it would be valuable to test them on a common set of benchmark tasks on the same hardware. Given the general availability of the tools and the existence of a considerable set of match task definitions and datasets, this should be feasible with reasonable effort.

References

[1] José Luis Aguirre, Bernardo Cuenca Grau, Kai Eckert, Jérôme Euzenat, Alfio Ferrara, Robert Willem van Hague, Laura Hollink, Ernesto Jiménez-Ruiz, Christian Meilicke, Andriy Nikolov, et al. Results of the Ontology Alignment Evaluation Initiative 2012. In Proc. 7th ISWC workshop on ontology matching (OM), pages 73–115, 2012.
[2] Samur Araújo, Arjen de Vries, and Daniel Schwabe. SERIMI Results for OAEI 2011. In Proc. of the 6th International Workshop on Ontology Matching, page 212, 2011.
[3] Samur Araújo, Jan Hidders, Daniel Schwabe, and Arjen P. de Vries. SERIMI - Resource Description Similarity, RDF Instance Matching and Interlinking. In Proc. of the 6th International Workshop on Ontology Matching, 2011.
[4] Rohan Baxter, Peter Christen, and Tim Churches. A Comparison of Fast Blocking Methods for Record Linkage. In ACM SIGKDD, volume 3, pages 25–27. Citeseer, 2003.
[5] Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling Up All Pairs Similarity Search. In Proceedings of the 16th International Conference on World Wide Web, pages 131–140, 2007.
[6] Michelle Cheatham and Pascal Hitzler. String Similarity Metrics for Ontology Alignment. In Harith Alani, Lalana Kagal, Achille Fokoue, Paul Groth, Chris Biemann, Josiane Xavier Parreira, Lora Aroyo, Natasha Noy, Chris Welty, and Krzysztof Janowicz, editors, The Semantic Web – ISWC 2013, volume 8219 of Lecture Notes in Computer Science, pages 294–309. Springer Berlin Heidelberg, 2013.
[7] Peter Christen. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-centric systems and applications. Springer, 2012.
[8] Isabel F. Cruz, Flavio Palandri Antonelli, and Cosmin Stroe. AgreementMaker: efficient matching for large real-world schemas and ontologies. Proceedings of the VLDB Endowment, 2(2):1586–1589, 2009.
[9] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.
[10] Jérôme Euzenat, Alfio Ferrara, Willem Robert van Hague, Laura Hollink, Christian Meilicke, Andriy Nikolov, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondřej Šváb-Zamazal, et al. Final results of the Ontology Alignment Evaluation Initiative 2011. In Proc. 6th ISWC workshop on ontology matching (OM), pages 85–110, 2011.
[11] Jérôme Euzenat, Pavel Shvaiko, et al. Ontology matching. Springer, 2007.
[12] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondřej Šváb-Zamazal, Vojtěch Svátek, and Cássia Trojahn dos Santos. Results of the Ontology Alignment Evaluation Initiative 2010. In Pavel Shvaiko, Jérôme Euzenat, Fausto Giunchiglia, Heiner Stuckenschmidt, Ming Mao, and Isabel Cruz, editors, Proc. 5th ISWC workshop on ontology matching (OM), Shanghai (CN), pages 85–117, 2010.
[13] Alfio Ferrara, Andriy Nikolov, and François Scharffe. Data linking for the semantic web. International Journal on Semantic Web and Information Systems (IJSWIS), 7(3):46–76, 2011.
[14] Raúl García-Castro and Stuart N. Wrigley. SEALS Methodology for Evaluation Campaigns. Technical report, Seventh Framework Programme, 2011.
[15] Bernardo Cuenca Grau, Zlatan Dragisic, Kai Eckert, Jérôme Euzenat, Alfio Ferrara, Roger Granada, Valentina Ivanova, Ernesto Jiménez-Ruiz, Andreas Oskar Kempf, Patrick Lambrix, et al. Results of the Ontology Alignment Evaluation Initiative 2013. In Proc. 8th ISWC workshop on ontology matching (OM), pages 61–100, 2013.
[16] Michael Hartung, Anika Groß, and Erhard Rahm. Composition Methods for Link Discovery. In BTW, 2013.
[17] Stanley Hillner and Axel-Cyrille Ngonga Ngomo. Parallelizing LIMES for large-scale link discovery. In Proceedings of the 7th International Conference on Semantic Systems, I-Semantics '11, pages 9–16, New York, NY, USA, 2011. ACM.
[18] Jakob Huber, Timo Sztyler, Jan Noessner, and Christian Meilicke. CODI: Combinatorial optimization for data integration - results for OAEI 2011. In Proc. of the 6th International Workshop on Ontology Matching, page 134, 2011.
[19] Robert Isele and Christian Bizer. Active Learning of Expressive Linkage Rules using Genetic Programming. Journal of Web Semantics, 2013.
[20] Robert Isele, Anja Jentzsch, and Christian Bizer. Efficient Multidimensional Blocking for Link Discovery without losing Recall. In Fourteenth International Workshop on the Web and Databases (WebDB 2011), 2011.
[21] Ernesto Jiménez-Ruiz and Bernardo Cuenca Grau. LogMap: Logic-Based and Scalable Ontology Matching. In The Semantic Web–ISWC 2011, pages 273–288. Springer, 2011.
[22] Ernesto Jiménez-Ruiz, Bernardo Cuenca Grau, and Ian Horrocks. LogMap and LogMapLt results for OAEI 2013. Ontology Matching, 2013.
[23] Ernesto Jiménez-Ruiz, Bernardo Cuenca Grau, Yujiao Zhou, and Ian Horrocks. Large-scale Interactive Ontology Matching: Algorithms and Implementation. In ECAI, pages 444–449, 2012.
[24] Lars Kolb, Andreas Thor, and Erhard Rahm. Dedoop: Efficient Deduplication with Hadoop. Proceedings of the VLDB Endowment, 5(12):1878–1881, 2012.
[25] Hanna Köpcke and Erhard Rahm. Frameworks for entity matching: A comparison. Data & Knowledge Engineering, 69(2):197–210, 2010.
[26] Hanna Köpcke, Andreas Thor, and Erhard Rahm. Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow., 3(1-2):484–493, September 2010.
[27] Markus Nentwig, Tommaso Soru, Axel-Cyrille Ngonga Ngomo, and Erhard Rahm. LinkLion: A Link Repository for the Web of Data. In ESWC 2014 Posters & Demo session, 2014.
[28] Axel-Cyrille Ngonga Ngomo and Klaus Lyko. EAGLE: Efficient Active Learning of Link Specifications Using Genetic Programming. In The Semantic Web: Research and Applications - 9th Extended Semantic Web Conference, pages 149–163, 2012.
[29] Axel-Cyrille Ngonga Ngomo and Klaus Lyko. Unsupervised Learning of Link Specifications: Deterministic vs. Non-Deterministic. In Proc. of the 8th International Workshop on Ontology Matching, pages 25–36, 2013.
[30] Axel-Cyrille Ngonga Ngomo, Mohamed Ahmed Sherif, and Klaus Lyko. Unsupervised Link Discovery through Knowledge Base Repair. In The Semantic Web: Trends and Challenges, pages 380–394. Springer, 2014.
[31] Axel-Cyrille Ngonga Ngomo. HELIOS - Execution Optimization for Link Discovery. In The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014, pages 17–32. Springer, 2014.
[32] Axel-Cyrille Ngonga Ngomo and Sören Auer. LIMES - A Time-Efficient Approach for Large-Scale Link Discovery on the Web of Data. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI'11, pages 2312–2317. AAAI Press, 2011.
[33] Axel-Cyrille Ngonga Ngomo, Lars Kolb, Norman Heino, Michael Hartung, Sören Auer, and Erhard Rahm. When to Reach for the Cloud: Using Parallel Hardware for Link Discovery. In The Semantic Web: Semantics and Big Data, pages 275–289. Springer, 2013.
[34] Axel-Cyrille Ngonga Ngomo, Klaus Lyko, and Victor Christen. COALA - Correlation-Aware Active Learning of Link Specifications. In The Semantic Web: Semantics and Big Data, pages 442–456. Springer, 2013.
[35] Khai Nguyen, Ryutaro Ichise, and Bac Le. SLINT: A Schema-Independent Linked Data Interlinking System. In Proc. of the 7th International Workshop on Ontology Matching, 2012.
[36] Khai Nguyen, Ryutaro Ichise, and Bac Le. Interlinking Linked Data Sources Using a Domain-Independent System. In Semantic Technology, pages 113–128. Springer, 2013.
[37] M. Niepert, C. Meilicke, and H. Stuckenschmidt. A Probabilistic-Logical Framework for Ontology Matching. In Proc. of the 24th AAAI Conference on Artificial Intelligence, 2010.
[38] Andriy Nikolov, Mathieu d'Aquin, and Enrico Motta. Unsupervised learning of link discovery configuration. In 9th Extended Semantic Web Conference (ESWC 2012), 2012.
[39] Andriy Nikolov, Victoria Uren, and Enrico Motta. KnoFuss: A comprehensive architecture for knowledge fusion. In Proceedings of the 4th international conference on Knowledge capture, pages 185–186. ACM, 2007.
[40] Xing Niu, Shu Rong, Yunlong Zhang, and Haofen Wang. Zhishi.links results for OAEI 2011. In Pavel Shvaiko, Jérôme Euzenat, Tom Heath, Christoph Quix, Ming Mao, and Isabel F. Cruz, editors, Proc. of the 6th International Workshop on Ontology Matching, volume 814 of CEUR Workshop Proceedings, 2011.
[41] Jan Noessner and Mathias Niepert. CODI: combinatorial optimization for data integration: results for OAEI 2010. In Proc. of the 5th International Workshop on Ontology Matching (OM-2010), page 142, 2010.
[42] Jan Noessner, Mathias Niepert, Christian Meilicke, and Heiner Stuckenschmidt. Leveraging Terminological Structure for Object Reconciliation. In The Semantic Web: Research and Applications, pages 334–348. Springer, 2010.
[43] Natalya F. Noy, Nigam H. Shah, Patricia L. Whetzel, Benjamin Dai, Michael Dorf, Nicholas Griffith, Clement Jonquet, Daniel L. Rubin, Margaret-Anne Storey, Christopher G. Chute, et al. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research, 37:W170–W173, 2009.
[44] Erhard Rahm. Towards Large-Scale Schema and Ontology Matching. In Zohra Bellahsene, Angela Bonifati, and Erhard Rahm, editors, Schema Matching and Mapping, Data-Centric Systems and Applications, pages 3–27. Springer Berlin Heidelberg, 2011.
[45] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10:334–350, 2001.
[46] Muhammad Saleem and Axel-Cyrille Ngonga Ngomo. HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation. In Proceedings of the 11th Extended Semantic Web Conference. Springer, 2014.
[47] Muhammad Saleem, Shanmukha S. Padmanabhuni, Axel-Cyrille Ngonga Ngomo, Jonas S. Almeida, Stefan Decker, and Helena F. Deus. Linked Cancer Genome Atlas Database. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, pages 129–134, New York, NY, USA, 2013. ACM.
[48] Max Schmachtenberg, Christian Bizer, and Heiko Paulheim. Adoption of the linked data best practices in different topical domains. In The Semantic Web–ISWC 2014, pages 245–260. Springer, 2014.
[49] Saeedeh Shekarpour, Axel-Cyrille Ngonga Ngomo, and Sören Auer. Question Answering on Interlinked Data. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 1145–1156, Republic and Canton of Geneva, Switzerland, 2013. International World Wide Web Conferences Steering Committee.
[50] Giorgos Stoilos, Giorgos Stamou, and Stefanos Kollias. A string metric for ontology alignment. In The Semantic Web–ISWC 2005, pages 624–637. Springer, 2005.
[51] Jie Tang, Bang-Yong Liang, Juanzi Li, and Kehong Wang. Risk Minimization Based Ontology Mapping. In Chi-Hung Chi and Kwok-Yan Lam, editors, Content Computing, volume 3309 of Lecture Notes in Computer Science, pages 469–480. Springer Berlin Heidelberg, 2004.
[52] Andreas Thor and Erhard Rahm. MOMA - A Mapping-based Object Matching System. In CIDR, pages 247–258, 2007.
[53] Christina Unger, Lorenz Bühmann, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, and Philipp Cimiano. Template-based question answering over RDF data. In WWW, pages 639–648, 2012.
[54] Julius Volz, Christian Bizer, Martin Gaedke, and Georgi Kobilarov. Discovering and Maintaining Links on the Web of Data. In Proceedings of the 8th International Semantic Web Conference, ISWC '09, pages 650–665, Berlin, Heidelberg, 2009. Springer-Verlag.
[55] Stephan Wölger, Katharina Siorpaes, Tobias Bürger, Elena Simperl, Stefan Thaler, and Christian Hofer. A Survey On Data Interlinking Methods. Technical report, STI Innsbruck, March 2011.
[56] Qian Zheng, Chao Shao, Juanzi Li, Zhichun Wang, and Linmei Hu. RiMOM2013 Results for OAEI 2013. In Proceedings of the 8th International Workshop on Ontology Matching, page 161, 2013.
