Discovering and Maintaining Links on the Web of Data

Julius Volz¹, Christian Bizer², Martin Gaedke¹, and Georgi Kobilarov²

¹ Chemnitz University of Technology, Distributed and Self-Organizing Systems Group, Straße der Nationen 62, 09107 Chemnitz, Germany
[email protected], [email protected]

² Freie Universität Berlin, Web-based Systems Group, Garystr. 21, 14195 Berlin, Germany
[email protected], [email protected]

Abstract. The Web of Data is built upon two simple ideas: employ the RDF data model to publish structured data on the Web, and create explicit data links between entities within different data sources. This paper presents the Silk – Linking Framework, a toolkit for discovering and maintaining data links between Web data sources. Silk consists of three components: 1. A link discovery engine, which computes links between data sources based on a declarative specification of the conditions that entities must fulfill in order to be interlinked; 2. A tool for evaluating the generated data links in order to fine-tune the linking specification; 3. A protocol for maintaining data links between continuously changing data sources. The protocol allows data sources to exchange both linksets and detailed change information, and enables continuous link recomputation. The interplay of all components is demonstrated within a life science use case.

Key words: Linked data, web of data, link discovery, link maintenance, record linkage, duplicate detection

1 Introduction

The central idea of Linked Data is to extend the Web with a data commons by creating typed links between data from different data sources [1, 2]. Technically, the term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web in a way that data is machine-readable, its meaning is explicitly defined, it is linked to other external datasets, and can in turn be linked to from external datasets.
The data links that connect data sources take the form of RDF triples, where the subject of the triple is a URI
reference in the namespace of one dataset, while the object is a URI reference in the other [2, 3].

The most visible example of the adoption and application of Linked Data has been the Linking Open Data (LOD) project³, a grassroots community effort to bootstrap the Web of Data by interlinking open-license datasets. Out of the 6.7 billion RDF triples that are served as of July 2009 by participants of the project, approximately 148 million are RDF links between datasets⁴.

As Linked Data sources often provide information about large numbers of entities, it is common practice to use automated or semi-automated methods to generate RDF links. In various domains, there are generally accepted naming schemata, such as ISBN and ISSN numbers, ISIN identifiers, and EAN and EPC product codes. If both datasets already support one of these identification schemata, the implicit relationship between entities in the datasets can easily be made explicit as RDF links. This approach has been used to generate links between various data sources in the LOD cloud. If no shared naming schema exists, RDF links are often generated by computing the similarity of entities within both datasets using a combination of different property-level similarity metrics.

While there are more and more tools available for publishing Linked Data on the Web, there is still a lack of tools that support data publishers in setting RDF links to other data sources, as well as tools that help data publishers to maintain RDF links over time as data sources change. The Silk – Linking Framework contributes to filling this gap. Silk consists of three components: 1. A link discovery engine, which computes links between data sources based on shared identifiers and/or object similarity; 2. A tool for evaluating the generated RDF links in order to fine-tune the linking specification; 3. A protocol for maintaining RDF links between continuously changing data sources.
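Such a data link is a single RDF triple connecting an entity in one dataset with an entity in another. As an illustration, a minimal Python sketch that serializes an owl:sameAs link in N-Triples syntax (the two drug URIs in the example are illustrative, not taken from the actual datasets):

```python
# Serialize an owl:sameAs link as a single N-Triples line.
# The subject and object URIs used below are illustrative examples.
def make_same_as_link(source_uri, target_uri):
    predicate = "http://www.w3.org/2002/07/owl#sameAs"
    return f"<{source_uri}> <{predicate}> <{target_uri}> ."

link = make_same_as_link(
    "http://dbpedia.org/resource/Aspirin",
    "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00945",
)
```

The subject lives in the namespace of the link source, the object in that of the link target; publishing many such triples is what "interlinking" two datasets amounts to.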
Using the declarative Silk – Link Specification Language (Silk-LSL), data publishers can specify which types of RDF links should be discovered between data sources as well as which conditions data items must fulfill in order to be interlinked. These link conditions can apply different similarity metrics to multiple properties of an entity or related entities, which are addressed using a path-based selector language. The resulting similarity scores can be weighted and combined using various similarity aggregation functions. Silk accesses data sources via the SPARQL protocol and can thus be used to discover links between local or remote data sources.

The main features of the Silk link discovery engine are:

• it supports the generation of owl:sameAs links as well as other types of RDF links;
• it provides a flexible, declarative language for specifying link conditions;
• it can be employed in distributed environments without having to replicate datasets locally;

³ http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
⁴ http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/LinkStatistics
• it can be used in situations where terms from different vocabularies are mixed and where no consistent RDFS or OWL schemata exist;
• it implements various caching, indexing and entity preselection methods to increase performance and reduce network load.

Datasets change and are extended over time. In order to keep links between two data sources current and to avoid dead links, new RDF links should be continuously generated as entities are added to the target dataset, and invalidated RDF links should be removed. For this task, we propose the Web of Data – Link Maintenance Protocol (WOD-LMP). The protocol automates the communication between two cooperating Web data sources: the link source and the link target, where the link source is a Web data source that publishes RDF links pointing at data published by the target data source. The protocol supports:

• notifying the target data source that the link source has published a set of links pointing at the target source. This allows the target to track incoming links and decide whether it wants to set back-links;
• the link source to request a list of changes from the target source. Based on the changes, the link source can recompute existing links and generate additional links pointing at new resources;
• the link source to monitor resources in the target dataset by subscribing to be informed about changes that occur to these resources.

This paper is structured as follows: Section 2 gives an overview of the Silk – Link Specification Language along with a concrete usage example. In Section 3, we present the Silk user interface to evaluate generated links. We describe the Web of Data – Link Maintenance Protocol in Section 4 and give an overview of the implementation of the Silk framework in Section 5. Section 6 reviews related work.
2 Link Specification Language

The Silk – Link Specification Language (Silk-LSL) is used to express heuristics for deciding whether a semantic relationship exists between two entities. The language is also used to specify the access parameters for the involved data sources, and to configure the caching, indexing and preselection features of the framework. Link conditions can use different aggregation functions to combine similarity scores. These aggregation functions, as well as the implemented similarity metrics and value transformation functions, were chosen by abstracting from the link heuristics that were used to establish links between data sources in the LOD cloud.

Figure 1 contains a complete Silk-LSL link specification. In this particular use case, we want to discover owl:sameAs links between the URIs that are used by DBpedia, a data source publishing information extracted from Wikipedia, and by the Linked Data version of DrugBank, a pharmaceutical database, to identify medical drugs.
Fig. 1. Silk-LSL link specification for interlinking drugs in DBpedia and DrugBank
2.1 Data Access

For accessing the source and target data sources, we first specify access parameters for the DBpedia and DrugBank SPARQL endpoints.
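Silk's SPARQL-based access is not detailed further here; as an illustration of the underlying SPARQL protocol, a small Python sketch that builds a GET request URL for an endpoint (the endpoint URL and query are examples, and no request is actually sent):

```python
from urllib.parse import urlencode

def sparql_request_url(endpoint, query):
    """Build a SPARQL protocol GET request URL for a given endpoint."""
    return endpoint + "?" + urlencode({"query": query})

# Example: ask an endpoint for a few drug labels.
url = sparql_request_url(
    "http://dbpedia.org/sparql",
    "SELECT ?drug ?label WHERE { ?drug rdfs:label ?label } LIMIT 10",
)
```

Because access goes over the SPARQL protocol, the same specification works whether the datasets are hosted locally or remotely.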
Table 1. Available similarity metrics in Silk

jaroSimilarity          String similarity based on the Jaro distance metric
jaroWinklerSimilarity   String similarity based on the Jaro-Winkler metric
qGramSimilarity         String similarity based on q-grams
stringEquality          Returns 1 when strings are equal, 0 otherwise
numSimilarity           Percentual numeric similarity
dateSimilarity          Similarity between two date values
uriEquality             Returns 1 if two URIs are equal, 0 otherwise
taxonomicSimilarity     Metric based on the taxonomic distance of two concepts
maxSimilarityInSet      Returns the highest encountered similarity of comparing a single item to all items in a set
setSimilarity           Similarity between two sets of items

Within an aggregation, each metric can be weighted individually, with higher weighted metrics having a greater influence on the aggregated result.
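The table does not give the underlying formulas; as an illustration, a Python sketch of two of the simpler metrics together with a weighted-average aggregation. The normalization used in num_similarity is an assumption for this sketch, not Silk's documented definition:

```python
def string_equality(a, b):
    """stringEquality: 1.0 if both strings are equal, 0.0 otherwise."""
    return 1.0 if a == b else 0.0

def num_similarity(a, b):
    """Percentual numeric similarity; assumed here to be
    1 - |a - b| / max(|a|, |b|), floored at 0."""
    if a == b:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b)))

def weighted_average(scores_and_weights):
    """Combine (score, weight) pairs; higher weights dominate."""
    total_weight = sum(w for _, w in scores_and_weights)
    return sum(s * w for s, w in scores_and_weights) / total_weight
```

For example, weighted_average([(1.0, 3), (0.5, 1)]) yields 0.875: the score with weight 3 pulls the aggregate three times as strongly as the score with weight 1.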
On the DBpedia side, the drug code is split into a prefix and a suffix part, while it is stored in a single property on the DrugBank side. Hence, we use the concat transformation function to concatenate the code's prefix and suffix parts on the DBpedia side before comparing it to the single-property code in DrugBank (lines 55 to 58).

Table 2. Available transformation functions in Silk

removeBlanks             Remove whitespace from string
removeSpecialChars       Remove special characters from string
lowerCase                Convert a string to lower case
upperCase                Convert a string to upper case
concat                   Concatenate two strings
stem                     Apply word stemming to a string
alphaReduce              Strip all non-alphabetic characters from a string
numReduce                Strip all non-numeric characters from a string
replace                  Replace all occurrences of a string with a replacement
regexReplace             Replace all occurrences of a regex with a replacement
stripURIPrefix           Strip the URI prefix from a string
translateWithDictionary  Translate string using a provided CSV dictionary file
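Several of these transformations map directly onto standard string operations. A Python sketch of three of them, with names mirroring Table 2 (the exact character classes are assumptions where the table is not specific):

```python
import re

def remove_blanks(s):
    """removeBlanks: strip all whitespace from a string."""
    return re.sub(r"\s+", "", s)

def alpha_reduce(s):
    """alphaReduce: keep only alphabetic characters."""
    return re.sub(r"[^A-Za-z]", "", s)

def concat(a, b):
    """concat: concatenate two strings, e.g. a code prefix and suffix."""
    return a + b
```

Applying such transformations before a comparison lets an exact metric like stringEquality succeed even when the two datasets format the same value differently.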
Sometimes, we need to navigate backwards along a property edge. For example, drugs in DrugBank contain a drugbank:target property pointing to the drug's target molecule. However, there exists no explicit reverse property like drugbank:drug in the drug target's resource. So if a path begins with a drug target and we need to select all of the drugs that apply to it, we may use the backward operator (\) to navigate property edges in reverse. Navigating backwards along the property drugbank:target would select the applicable drugs.

The filter operator ([ ]) can be used to restrict selected resources to those matching a certain predicate. To select only those drugs amongst the ones applicable to a target molecule which have been marked as approved, we could for instance use the RDF path "?target\drugbank:target[drugbank:drugType drugType:approved]". The filter operator also supports numeric comparisons. For example, to select drugs with a molecular weight above 200, the path "?target\drugbank:target[drugbank:molecularWeightAverage > 200]" can be used.

2.4 Pre-Matching

To compare all pairs of entities of a source dataset S and a target dataset T would result in an unsatisfactory runtime complexity of O(|S| · |T|). Even after using SPARQL restrictions to select suitable subsets of each dataset, the required time and network load to perform all pair comparisons might prove to be impractical in many cases. To avoid this problem, we need a way to quickly find a limited set of target entities that are likely to match a given source entity. Silk provides this by offering rough index pre-matching.

When using pre-matching, all target resources are indexed by the values of one or more specified properties (most commonly, their labels) before any detailed comparisons are performed.
During the subsequent resource comparison phase, the previously generated index is used to look up potential matches for a given source resource. This lookup uses the BM25 weighting scheme⁸ for the ranking of search results and additionally supports spelling corrections of individual words of a query. Only a limited number of target resources found in this lookup are then considered as candidates for a detailed comparison.

An example of such a pre-matching specification that could be applied to our drug linking example is presented in Figure 2. This directive instructs Silk to index the drugs in the target dataset by their rdfs:label and drugbank:synonym property values. When performing comparisons, the rdfs:label of a source resource is used as a search term into the generated indexes, and only the first ten target hits found in each index are considered as link candidates for detailed comparisons.

If we neglect a slight index insertion and search time dependency on the target dataset size, we now achieve a runtime complexity of O(|S| + |T|), making it feasible to interlink even large datasets under practical time constraints. Note, however, that this pre-matching may come at the cost of missing some links.

⁸ http://xapian.org/docs/bm25.html
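A toy version of this pre-matching can be sketched as an inverted index over label tokens. Real Silk delegates this to Xapian with BM25 ranking and spelling correction; the sketch below only illustrates the candidate-lookup idea, and the resource URIs and labels are invented:

```python
from collections import defaultdict

def build_index(resources, properties=("label",)):
    """Index target resources by the tokens of selected property values."""
    index = defaultdict(set)
    for uri, props in resources.items():
        for prop in properties:
            for token in props.get(prop, "").lower().split():
                index[token].add(uri)
    return index

def candidates(index, source_label, limit=10):
    """Return up to `limit` target resources sharing a label token."""
    hits = set()
    for token in source_label.lower().split():
        hits |= index[token]
    return sorted(hits)[:limit]

targets = {
    "ex:aspirin": {"label": "Aspirin"},
    "ex:ibuprofen": {"label": "Ibuprofen"},
}
drug_index = build_index(targets)
```

Only the candidates returned by the lookup are compared in detail, so each source entity touches a small, roughly constant number of targets instead of the whole target dataset.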
Fig. 2. Pre-matching specification for the drug linking example
Fig. 3. Comparing resource pairs with the Silk web interface

Due to a dataset error, the section aggregating exact numeric drug identifiers contains a similarity value of 0 for the CAS registry numbers. This erroneously low value is corrected by the availability of other exactly matching identifiers in the same aggregation.
Fig. 4. Detailed drill-down into a resource pair comparison

Silk generated 1,227 confident links above the threshold of 0.9 and found 32 more links above the threshold of 0.7. To evaluate the quality of the retrieved links, we created a reference linkset pertaining to 50 drugs selected randomly from DrugBank and found 38 manually researched links to DBpedia. We then ran Silk a second time to find only links from these 50 selected DrugBank resources to DBpedia and compared the generated and the reference linksets. The evaluation revealed 4 missing links and one incorrectly discovered link. This corresponded to a Precision of 0.97, a Recall of 0.89 and an F₁-measure of 0.93.

To better understand why certain links were missing and why one link was incorrect, we then compared their source and target resources via the resource comparison web interface. One link was missed because of radically differing molecular weights in both datasets. Three other missing links were not discovered due to the fact that their CAS registry numbers did not match, while at the same time no other exact identifiers were present. Finally, one link was discovered incorrectly since the resource labels were very similar and no other relevant property values were present in the datasets.

In a subsequent tuning of the link specification, we mitigated the effect of a single mismatching exact identifier by lowering the weight of the surrounding aggregation to 3 and setting a default value of 0.85 for the IDs in the same aggregation.
Fig. 5. Evaluating linksets with the Silk web interface

After this tuning, only two links were missing, which means that we now reached a Recall value of 0.95 and an F₁-measure of 0.96.

4 Web of Data – Link Maintenance Protocol

Changes or additions in either of the interlinked datasets can invalidate existing links or imply the need to generate new ones. With the Web of Data – Link Maintenance Protocol (WOD-LMP), we propose a solution to this problem.

The WOD-LMP protocol automates the communication between two cooperating Web data sources. It assumes two basic roles: link source and link target, where the link source is a Web data source that publishes RDF links pointing at data published by the target data source. The protocol covers the following three use cases:

4.1 Link Transfer to Target

In the simplest use case, a link source wants to send a set of RDF links to the target data source so that the target may keep track of incoming links and can decide whether it wants to set back-links. Afterwards, the source wants to keep the target informed about subsequent updates (i.e. additions and deletions) to the transferred links. To achieve the transfer of the initial set of links and of subsequently generated ones, a Link Notification message is sent to the target data source. This notification includes the generated links along with the URL of the WOD-LMP protocol endpoint at the source side. The deletion of links
by the source is communicated to the target in a Link Deletion Notification message, which in turn contains the link triples to be deleted.

4.2 Request of Target Change List

In this use case, the source data source asks the target to supply a list of all changes that have occurred to RDF resources in the target dataset within a specific time period. The source may then use the provided change information for periodic link recomputation. The protocol also provides for requesting only additions, updates or deletions of resources.

WOD-LMP uses incremental sequence numbers to identify resource changes. The changes are requested by the remote data source by sending a Get Changes message, which contains both the desired change sequence number range as well as the desired change type filter options. The target replies to this with a Change Notification, which lists the requested changes together with their corresponding sequence numbers and change types. If no upper sequence number is supplied, the target sends all changes up to the latest change. This case of selective link recomputation requires periodic polling of the remote data source by the source, but has the advantage of working without maintaining a persistent relationship between the linked data sources.

4.3 Subscription of Target Changes

The protocol also supports fine-grained link recomputation by monitoring the resources in the target dataset that were used to compute links. As illustrated in Figure 6, the source informs the target dataset via a Link Notification message about a group of generated links and, for each transferred link, supplies the URIs of the resources in the target dataset that were used to compute the link. The target saves this information and monitors the resources. If one of them changes or is deleted, the target notifies the source about these changes by sending a Change Notification message.
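The sequence-number bookkeeping shared by the last two use cases can be sketched in Python as follows; the message and field names follow the text, while the in-memory storage model is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass
class Change:
    seq_num: int        # incremental sequence number
    resource_uri: str
    change_type: str    # "addition", "update" or "deletion"

class ChangeLog:
    """Target-side change log answering Get Changes requests."""

    def __init__(self):
        self._changes = []

    def record(self, resource_uri, change_type):
        """Record a resource change under the next sequence number."""
        self._changes.append(
            Change(len(self._changes) + 1, resource_uri, change_type))

    def get_changes(self, from_seq, to_seq=None, types=None):
        """Changes in [from_seq, to_seq]; to_seq=None means 'up to
        the latest change', types optionally filters by change type."""
        return [
            c for c in self._changes
            if c.seq_num >= from_seq
            and (to_seq is None or c.seq_num <= to_seq)
            and (types is None or c.change_type in types)
        ]
```

A Get Changes request then maps to a get_changes call, and the returned list is what a Change Notification would carry back to the link source.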
The source may then use these notifications to recompute affected links and possibly delete invalidated ones. In this case, it notifies the target about deleted links with a Link Deletion Notification, which cancels the subscription of resources relevant to these links.

The implementation of the WOD-LMP protocol is based on SOAP. The complete specification of the protocol is available at http://www4.wiwiss.fu-berlin.de/bizer/silk/wodlmp/.

The WOD-LMP protocol is used to maintain the links between DBpedia and DrugBank. Links generated on the DrugBank side are sent to and integrated into DBpedia, while DBpedia notifies the DrugBank Silk instance about changes to subscribed resources. This synchronization will become especially important as DBpedia starts to utilize the Wikipedia live update stream to continuously extract data from changed Wikipedia pages. Thus, DBpedia resources will be continuously updated to match Wikipedia, while at the same time the DrugBank Silk instance will be able to maintain and recompute links to DBpedia.
Fig. 6. Subscribing to resource changes in the target data source: the source sends linkNotification(endpointUri : string, links : LinkList) to the target, the target replies with changeNotification(targetSeqNum : int, changes : ChangeList), and linkDeletionNotification(links : LinkList) cancels the subscription

5 Implementation

Silk is written in Python and is run from the command line. When generating linksets, Silk is started as a batch process. It runs as a daemon when serving the web interface or WOD-LMP protocol endpoints. The framework may be downloaded from Google Code⁹ under the terms of the BSD license. For calculating string similarities, a library from Febrl¹⁰, the Freely Extensible Biomedical Record Linkage Toolkit, is used, while Silk's pre-matching features are achieved using the search engine library Xapian¹¹. The web interface was realized with the Werkzeug¹² toolkit, while the link maintenance protocol endpoints use the free soaplib¹³ library for the exchange of SOAP messages.

6 Related Work

There is a large body of related work on record linkage [6, 7] and duplicate detection [9] within the database community, as well as on ontology matching [10] in the knowledge representation community. Silk builds on this work by implementing similarity metrics and aggregation functions that proved successful within other scenarios. What distinguishes Silk from this work is its focus on the Linked Data scenario, where different types of semantic links should be discovered between Web data sources that often mix terms from different vocabularies and where no consistent RDFS or OWL schemata spanning the data sources exist.

Related work that also focuses on Linked Data includes Raimond et al. [11], who propose a link discovery algorithm that takes into account both the similarities of web resources and of their neighbors.
The algorithm is implemented within the GNAT tool and has been evaluated for interlinking music-related

⁹ http://silk.googlecode.com
¹⁰ http://sourceforge.net/projects/febrl
¹¹ http://xapian.org
¹² http://werkzeug.pocoo.org
¹³ http://trac.optio.webfactional.com/
datasets. In [12], Hassanzadeh et al. describe a framework for the discovery of semantic links over relational data which also introduces a declarative language for specifying link conditions. Their framework is meant to be used together with relational-database-to-RDF wrappers like D2R Server or Virtuoso RDF Views. Differences between LinQL and Silk-LSL are the underlying data model and Silk's ability to combine metrics more flexibly through aggregation functions. A framework that deals with instance coreferencing as part of the larger process of fusing Web data is the KnoFuss architecture proposed in [13]. In contrast to Silk, KnoFuss assumes that instance data is represented according to consistent OWL ontologies.

Furthermore, approaches to track changes and updates in Linked Data sources include PingtheSemanticWeb¹⁴, a central registry for Web of Data documents which offers XML-RPC and REST APIs to notify the service about new or changed documents. A further approach to making change information available is proposed by Auer et al. [14] and implemented in Triplify. Similar to the second WOD-LMP use case, change information is requested on a peer-to-peer basis instead of being aggregated into a central registry such as PingtheSemanticWeb. This approach is also implemented by DSNotify [15], which runs as an add-on to a local data source and uses indexes to track resource changes. DSNotify supports the active notification of subscribers as well as providing change data on demand. It further uses heuristics to determine the cause of a resource change and whether a deleted link target has become available under a different URI.

7 Conclusion

We presented the Silk framework, a flexible tool for discovering links between entities within different Web data sources. The Silk-LSL link specification language was introduced and its applicability was demonstrated within a life science use case.
We then proposed the WOD-LMP protocol for synchronizing and maintaining links between continuously changing Linked Data sources.

Future work on Silk will focus on the following areas: We will implement further similarity metrics to support a broader range of linking use cases. To assist users in writing Silk-LSL specifications, machine learning techniques could be employed to adjust weightings or optimize the structure of the matching specification. Finally, we will evaluate the suitability of Silk for detecting duplicate entities within local datasets instead of using it to discover links between disparate RDF data sources.

The value of the Web of Data rises and falls with the amount and the quality of links between data sources. We hope that Silk and other similar tools will help to strengthen the linkage between data sources and thereby contribute to the overall utility of the network.

The complete Silk-LSL language specification, the WoD Link Maintenance Protocol specification and further Silk usage examples can be found on the Silk project website at http://www4.wiwiss.fu-berlin.de/bizer/silk/.

¹⁴ http://pingthesemanticweb.com
References

1. Berners-Lee, T.: Linked Data - Design Issues. http://www.w3.org/DesignIssues/LinkedData.html
2. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The Story So Far. Journal on Semantic Web and Information Systems (in press), 2009.
3. Bizer, C., Cyganiak, R., Heath, T.: How to Publish Linked Data on the Web. http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
4. Bizer, C., et al.: DBpedia - A Crystallization Point for the Web of Data. Journal of Web Semantics: Sci. Serv. Agents World Wide Web, doi:10.1016/j.websem.2009.07.002, 2009.
5. Jentzsch, A., et al.: Enabling Tailored Therapeutics with Linked Data. In: Proceedings of the 2nd Workshop about Linked Data on the Web, 2009.
6. Jaro, M.: Advances in Record-Linkage Methodology as Applied to the 1985 Census of Tampa, Florida. Journal of the American Statistical Society, 84(406):414-420, 1989.
7. Winkler, W.: Overview of Record Linkage and Current Research Directions. Bureau of the Census - Research Report Series, 2006.
8. Zhong, J., et al.: Conceptual Graph Matching for Semantic Search. In: Proceedings of the 2002 International Conference on Computational Science, 2002.
9. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1-16, 2007.
10. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg, 2007.
11. Raimond, Y., Sutton, C., Sandler, M.: Automatic Interlinking of Music Datasets on the Semantic Web. In: Proceedings of the 1st Workshop about Linked Data on the Web, 2008.
12. Hassanzadeh, O., et al.: Semantic Link Discovery over Relational Data. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009.
13. Nikolov, A., et al.: Integration of Semantically Annotated Data by the KnoFuss Architecture. In: 16th International Conference on Knowledge Engineering and Knowledge Management, 265-274, 2008.
14.
Auer, S., et al.: Triplify – Light-Weight Linked Data Publication from Relational Databases. In: Proceedings of the 18th International World Wide Web Conference, 2009.
15. Haslhofer, B., Popitsch, N.: DSNotify – Detecting and Fixing Broken Links in Linked Data Sets. In: Proceedings of the 8th International Workshop on Web Semantics, 2009.