VolzBizerGaedkeKobilarov ISWC2009 Silk

Transcript

Discovering and Maintaining Links on the Web of Data

Julius Volz¹, Christian Bizer², Martin Gaedke¹, and Georgi Kobilarov²

¹ Chemnitz University of Technology, Distributed and Self-Organizing Systems Group, Straße der Nationen 62, 09107 Chemnitz, Germany
[email protected], [email protected]

² Freie Universität Berlin, Web-based Systems Group, Garystr. 21, 14195 Berlin, Germany
[email protected], [email protected]

Abstract. The Web of Data is built upon two simple ideas: Employ the RDF data model to publish structured data on the Web and to create explicit data links between entities within different data sources. This paper presents the Silk – Linking Framework, a toolkit for discovering and maintaining data links between Web data sources. Silk consists of three components: 1. A link discovery engine, which computes links between data sources based on a declarative specification of the conditions that entities must fulfill in order to be interlinked; 2. A tool for evaluating the generated data links in order to fine-tune the linking specification; 3. A protocol for maintaining data links between continuously changing data sources. The protocol allows data sources to exchange both linksets as well as detailed change information and enables continuous link recomputation. The interplay of all the components is demonstrated within a life science use case.

Key words: Linked data, web of data, link discovery, link maintenance, record linkage, duplicate detection

1 Introduction

The central idea of Linked Data is to extend the Web with a data commons by creating typed links between data from different data sources [1, 2]. Technically, the term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web in a way that data is machine-readable, its meaning is explicitly defined, it is linked to other external datasets, and can in turn be linked to from external datasets. The data links that connect data sources take the form of RDF triples, where the subject of the triple is a URI reference in the namespace of one dataset, while the object is a URI reference in the other [2, 3].

The most visible example of adoption and application of Linked Data has been the Linking Open Data (LOD) project³, a grassroots community effort to bootstrap the Web of Data by interlinking open-license datasets. Out of the 6.7 billion RDF triples that are served as of July 2009 by participants of the project, approximately 148 million are RDF links between datasets⁴.

As Linked Data sources often provide information about large numbers of entities, it is common practice to use automated or semi-automated methods to generate RDF links. In various domains, there are generally accepted naming schemata, such as ISBN and ISSN numbers, ISIN identifiers, EAN and EPC product codes. If both datasets already support one of these identification schemata, the implicit relationship between entities in the datasets can easily be made explicit as RDF links. This approach has been used to generate links between various data sources in the LOD cloud. If no shared naming schema exists, RDF links are often generated by computing the similarity of entities within both datasets using a combination of different property-level similarity metrics.

While there are more and more tools available for publishing Linked Data on the Web [3], there is still a lack of tools that support data publishers in setting RDF links to other data sources, as well as tools that help data publishers to maintain RDF links over time as data sources change. The Silk – Linking Framework contributes to filling this gap. Silk consists of three components: 1. A link discovery engine, which computes links between data sources based on shared identifiers and/or object similarity; 2. A tool for evaluating the generated RDF links in order to fine-tune the linking specification; 3. A protocol for maintaining RDF links between continuously changing data sources.

Using the declarative Silk – Link Specification Language (Silk-LSL), data publishers can specify which types of RDF links should be discovered between data sources as well as which conditions data items must fulfill in order to be interlinked. These link conditions can apply different similarity metrics to multiple properties of an entity or related entities which are addressed using a path-based selector language. The resulting similarity scores can be weighted and combined using various similarity aggregation functions. Silk accesses data sources via the SPARQL protocol and can thus be used to discover links between local or remote data sources.

The main features of the Silk link discovery engine are:

• it supports the generation of owl:sameAs links as well as other types of RDF links.
• it provides a flexible, declarative language for specifying link conditions.
• it can be employed in distributed environments without having to replicate datasets locally.
• it can be used in situations where terms from different vocabularies are mixed and where no consistent RDFS or OWL schemata exist.
• it implements various caching, indexing and entity preselection methods to increase performance and reduce network load.

³ http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
⁴ http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/LinkStatistics

Datasets change and are extended over time. In order to keep links between two data sources current and to avoid dead links, new RDF links should be continuously generated as entities are added to the target dataset, and invalidated RDF links should be removed. For this task, we propose the Web of Data – Link Maintenance Protocol (WOD-LMP). The protocol automates the communication between two cooperating Web data sources, the link source and the link target, where the link source is a Web data source that publishes RDF links pointing at data published by the target data source. The protocol supports:

• notifying the target data source that the link source has published a set of links pointing at the target source. This allows the target to track incoming links and decide whether it wants to set back-links.
• the link source to request a list of changes from the target source. Based on the changes, the link source can recompute existing links and generate additional links pointing at new resources.
• the link source to monitor resources in the target dataset by subscribing to be informed about changes that occur to these resources.

This paper is structured as follows: Section 2 gives an overview of the Silk – Link Specification Language along with a concrete usage example. In Section 3, we present the Silk user interface to evaluate generated links. We describe the Web of Data – Link Maintenance Protocol in Section 4 and give an overview of the implementation of the Silk framework in Section 5. Section 6 reviews related work.

2 Link Specification Language

The Silk – Link Specification Language (Silk-LSL) is used to express heuristics for deciding whether a semantic relationship exists between two entities. The language is also used to specify the access parameters for the involved data sources, and to configure the caching, indexing and preselection features of the framework. Link conditions can use different aggregation functions to combine similarity scores. These aggregation functions as well as the implemented similarity metrics and value transformation functions were chosen by abstracting from the link heuristics that were used to establish links between data sources in the LOD cloud.

Figure 1 contains a complete Silk-LSL link specification. In this particular use case, we want to discover owl:sameAs links between the URIs that are used by DBpedia [4], a data source publishing information extracted from Wikipedia, and by the Linked Data version of DrugBank [5], a pharmaceutical database, to identify medical drugs. In line 22 of the link specification, we thus configure the link type to be owl:sameAs.

[Figure 1 (lines 01-82) gives the complete Silk-LSL link specification for this use case. The XML markup was lost during text extraction; the recoverable fragments are the DBpedia SPARQL endpoint http://dbpedia.org/sparql with the graph http://dbpedia.org and further access parameter values 1 and 1000 (Section 2.1 identifies a page size of 1,000 results per query), the DrugBank SPARQL endpoint http://www4.wiwiss.fu-berlin.de/drugbank/sparql, the link type owl:sameAs (line 22), and the dataset restrictions ?a rdf:type dbpedia:Drug (line 25) and ?b rdf:type drugbank:drugs (line 30).]

Fig. 1. Example: Interlinking drugs in DBpedia and DrugBank

2.1 Data Access

For accessing the source and target data sources, we first specify access parameters for the DBpedia and DrugBank SPARQL endpoints using the data source directives. The only mandatory data source parameter is the SPARQL endpoint URI. Besides this, it is possible to define other data source access options, such as the graph name, and to enable in-memory caching of SPARQL query results. In order to restrict the query load on remote SPARQL endpoints, it is possible to set a delay between subsequent queries using a dedicated parameter specifying the delay time in milliseconds. For working against SPARQL endpoints that restrict result sets to a certain size, Silk uses a paging mechanism. The maximal result size is configured using a page size parameter. The paging mechanism is implemented using SPARQL LIMIT and OFFSET queries. Lines 2 to 7 in Figure 1 show how the access parameters for the DBpedia data source are set to select only resources from the named graph http://dbpedia.org, enable caching and limit the page size to 1,000 results per query.

The configured data sources are later referenced in the source and target dataset clauses of the link specification. Since we only want to match drugs, we restrict the sets of examined resources to instances of the classes dbpedia:Drug and drugbank:drugs in the respective datasets by supplying SPARQL conditions within the restriction directives in lines 25 and 30. These statements may contain any valid SPARQL expressions that would usually be found in the WHERE clause of a SPARQL query.
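As an illustration of this data access scheme, the paging and query-delay behaviour can be pictured with a short Python sketch (Silk itself is written in Python, see Section 5). This is not Silk's data access code; the endpoint, restriction pattern and parameter values are illustrative, and the SPARQLWrapper library is used merely as one common way to issue SPARQL queries.

    # Illustrative sketch (not Silk's implementation) of paged SPARQL access with a
    # delay between queries, using LIMIT/OFFSET as described in Section 2.1.
    import time
    from SPARQLWrapper import SPARQLWrapper, JSON

    def fetch_instances(endpoint, restriction, page_size=1000, delay_ms=1000):
        """Yield the ?x bindings matching the SPARQL restriction, page by page."""
        client = SPARQLWrapper(endpoint)
        client.setReturnFormat(JSON)
        offset = 0
        while True:
            client.setQuery("SELECT ?x WHERE { %s } LIMIT %d OFFSET %d"
                            % (restriction, page_size, offset))
            bindings = client.query().convert()["results"]["bindings"]
            if not bindings:
                break
            for row in bindings:
                yield row["x"]["value"]
            offset += page_size
            time.sleep(delay_ms / 1000.0)  # throttle the load on the remote endpoint

    # Example call, mirroring the restriction of line 25 in Figure 1 (the class URI
    # is illustrative and would need a matching PREFIX or full URI in practice):
    # for uri in fetch_instances("http://dbpedia.org/sparql",
    #                            "?x rdf:type <http://dbpedia.org/ontology/Drug>"):
    #     print(uri)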

2.2 Link Conditions

The link condition section is the heart of a Silk link specification and defines how similarity metrics are combined in order to calculate a total similarity value for a pair of entities.

For comparing property values or sets of entities, Silk provides a number of built-in similarity metrics. Table 1 gives an overview of these metrics. The implemented metrics include string, numeric, date, URI, and set comparison methods as well as a taxonomic matcher that calculates the semantic distance between two concepts within a concept hierarchy using the distance metric proposed by Zhong et al. in [8]. Each metric in Silk evaluates to a similarity value between 0 and 1, with higher values indicating a greater similarity.

Table 1. Available similarity metrics in Silk

jaroSimilarity           String similarity based on the Jaro distance metric [6]
jaroWinklerSimilarity    String similarity based on the Jaro-Winkler metric [7]
qGramSimilarity          String similarity based on q-grams
stringEquality           Returns 1 when strings are equal, 0 otherwise
numSimilarity            Percentual numeric similarity
dateSimilarity           Similarity between two date values
uriEquality              Returns 1 if two URIs are equal, 0 otherwise
taxonomicSimilarity      Metric based on the taxonomic distance of two concepts
maxSimilarityInSet       Returns the highest encountered similarity of comparing a single item to all items in a set
setSimilarity            Similarity between two sets of items

These similarity metrics may be combined using the following aggregation functions:

• AVG – weighted average
• MAX – choose the highest value
• MIN – choose the lowest value
• EUCLID – Euclidean distance metric
• PRODUCT – weighted product

To take into account the varying importance of different properties, the metrics grouped inside the AVG, EUCLID and PRODUCT operators may be weighted individually, with higher weighted metrics having a greater influence on the aggregated result.

In the link condition section of the example (lines 33 to 76), we compute similarity values for the labels, PubChem IDs⁵, CAS registry numbers⁶, ATC codes⁷ and molecular weights between the datasets and calculate a weighted average of these values. Most metrics are configured to be optional since the presence of the respective RDF property values they refer to is not always guaranteed. In cases where alternating properties refer to an equivalent feature (such as rdfs:label, drugbank:synonym and drugbank:genericName), we choose to perform comparisons for all of these properties and select the best evaluation by using the MAX aggregation operator. The MAX operator is also used to choose the maximum value of the comparisons between any of the exact drug identifiers. Weighting of results is used within the metrics comparing these exact values (line 52), with the metric weight raised to 5, as well as within the molecular weight comparison, which uses a weighting factor of 2.

Lines 11 to 20 demonstrate how a user-defined metric is specified. User-defined metrics may be used like built-in metrics. In the example, the defined jaroSets metric is used as a submetric for the maxSimilarityInSet evaluations in lines 36 to 50 for the pairwise comparison of elements of the compared sets. In this case, the user-defined metric is mainly a wrapper around a jaroWinklerSimilarity call to achieve type-compatibility with the set comparison interface.

⁵ http://pubchem.ncbi.nlm.nih.gov/
⁶ http://www.cas.org/expertise/cascontent/registry/regsys.html
⁷ http://www.who.int/classifications/atcddd/en/
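A small sketch may help to picture how these aggregations combine individual metric scores. It is a simplified, hypothetical re-implementation of the AVG and MAX behaviour described above, not Silk's own code; the metric values and weights are loosely modelled on the drug example.

    # Simplified sketch of the weighted AVG and MAX aggregations (not Silk's code).
    # Optional metrics that produced no value are represented as None and skipped.

    def avg(scores_and_weights):
        """Weighted average of (score, weight) pairs; None scores are ignored."""
        pairs = [(s, w) for s, w in scores_and_weights if s is not None]
        if not pairs:
            return None
        return sum(s * w for s, w in pairs) / float(sum(w for _, w in pairs))

    def max_agg(scores):
        """Highest available similarity; used to combine alternative properties."""
        scores = [s for s in scores if s is not None]
        return max(scores) if scores else None

    # Name-related comparisons (label vs. label, label vs. synonym, label vs. genericName):
    name_similarity = max_agg([1.0, 0.867, 0.91])        # -> 1.0

    # Exact identifiers (e.g. CAS, PubChem, ATC): one matching identifier is enough.
    identifier_similarity = max_agg([None, 1.0, None])   # -> 1.0

    # Total score with illustrative weights: identifiers 5, names 1, molecular weight 2.
    total = avg([(identifier_similarity, 5),
                 (name_similarity, 1),
                 (0.98, 2)])                             # 0.98: molecular weight similarity
    print(round(total, 3))                               # -> 0.995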

Property values are often represented differently across datasets and thus need to be normalized before being compared. For handling this task, it is possible to apply data transformation functions to parameter values before passing them to a similarity metric. The available transformation functions are shown in Table 2. In the drug linking example, a drug's ATC code in the DBpedia dataset is split into a prefix and a suffix part, while it is stored in a single property on the DrugBank side. Hence, we use the concat transformation function to concatenate the code's prefix and suffix parts on the DBpedia side before comparing it to the single-property code in DrugBank (lines 55 to 58).

Table 2. Available transformation functions in Silk

removeBlanks              Remove whitespace from a string
removeSpecialChars        Remove special characters from a string
lowerCase                 Convert a string to lower case
upperCase                 Convert a string to upper case
concat                    Concatenate two strings
stem                      Apply word stemming to a string
alphaReduce               Strip all non-alphabetic characters from a string
numReduce                 Strip all non-numeric characters from a string
replace                   Replace all occurrences of a string with a replacement
regexReplace              Replace all occurrences of a regex with a replacement
stripURIPrefix            Strip the URI prefix from a string
translateWithDictionary   Translate a string using a provided CSV dictionary file

After specifying the link condition, we finally specify within the thresholds clause that resource pairs with a similarity score above 0.9 are to be interlinked, whereas pairs scoring between 0.7 and 0.9 should be written to a separate output file in order to be reviewed by an expert. A limit clause is used to restrict the number of outgoing links from a particular entity within the source dataset. If several candidate links exist, only the highest evaluated one is chosen and written to the output files as specified by the output directive. In this example, we permit only one outgoing owl:sameAs link from each resource. Discovered links can be output either as simple RDF triples and/or in reified form together with their creation date, confidence score and the URI identifying the employed interlinking heuristic.
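To make the transformation and output behaviour concrete, the following hypothetical Python sketch mimics the concat normalization of the split ATC code, the 0.9 / 0.7 thresholds and the one-link-per-resource limit described above. It does not reproduce Silk's actual internals, and the ATC values are invented.

    # Hypothetical sketch: concat-style normalization of a split ATC code, followed
    # by keeping only the best outgoing link per source resource and bucketing the
    # scored links with the 0.9 (accept) / 0.7 (review) thresholds.

    def concat(*parts):
        """Counterpart of the concat transformation: join value parts into one string."""
        return "".join(parts)

    def string_equality(a, b):
        """Counterpart of the stringEquality metric: 1.0 if the strings are equal."""
        return 1.0 if a == b else 0.0

    # DBpedia stores the ATC code as prefix and suffix, DrugBank as a single property
    # (the concrete code below is an invented example value).
    dbpedia_atc = concat("N05", "BA06")
    drugbank_atc = "N05BA06"
    print(string_equality(dbpedia_atc, drugbank_atc))   # 1.0

    def best_link_per_source(scored_links):
        """Keep only the highest-scored outgoing link for each source resource."""
        best = {}
        for source, target, score in scored_links:
            if source not in best or score > best[source][2]:
                best[source] = (source, target, score)
        return list(best.values())

    def bucket_links(scored_links, accept=0.9, review=0.7):
        """Split (source, target, score) triples into accepted and to-be-reviewed links."""
        accepted = [l for l in scored_links if l[2] >= accept]
        to_review = [l for l in scored_links if review <= l[2] < accept]
        return accepted, to_review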

2.3 Silk Selector Language

Especially for discovering semantic relationships other than entity equality, a flexible way of selecting sets of resources or literals in the RDF graph around a particular resource is needed. Silk addresses this requirement by offering a simple RDF path selector language for providing parameter values to similarity metrics and transformation functions. A Silk selector language path starts with a variable referring to an RDF resource and may then use several path operators to navigate the graph surrounding this resource.

To simply access a particular property of a resource, the forward operator (/) may be used. For example, the path "?drug/rdfs:label" would select the set of label values associated with the drug referred to by the ?drug variable.

Sometimes, we need to navigate backwards along a property edge. For example, drugs in DrugBank contain a drugbank:target property pointing to the drug's target molecule. However, there exists no explicit reverse property like drugbank:drug in the drug target's resource. So if a path begins with a drug target and we need to select all of the drugs that apply to it, we may use the backward operator (\) to navigate property edges in reverse. Navigating backwards along the property drugbank:target would select the applicable drugs.

The filter operator ([ ]) can be used to restrict the selected resources to those matching a certain predicate. To select only those drugs amongst the ones applicable to a target molecule which have been marked as approved, we could for instance use the RDF path "?target\drugbank:target[drugbank:drugType drugType:approved]". The filter operator also supports numeric comparisons. For example, to select drugs with a molecular weight above 200, the path "?target\drugbank:target[drugbank:molecularWeightAverage > 200]" can be used.
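A toy evaluator may make the semantics of the three path operators concrete. The triple list and helper names below are hypothetical and have nothing to do with Silk's parser; the sketch only mirrors the forward, backward and filter steps described above.

    # Toy sketch of the /, \ and [ ] path operators over an RDF graph stored as a
    # plain list of triples (hypothetical helper code, not Silk's selector engine).

    TRIPLES = [
        ("drug:Lorazepam", "drugbank:target", "target:GabaReceptor"),
        ("drug:Diazepam",  "drugbank:target", "target:GabaReceptor"),
        ("drug:Lorazepam", "drugbank:drugType", "drugType:approved"),
    ]

    def forward(nodes, prop):
        """x/prop: follow the property edge from each node to its objects."""
        return {o for s, p, o in TRIPLES if s in nodes and p == prop}

    def backward(nodes, prop):
        """x\\prop: follow the property edge in reverse, from objects back to subjects."""
        return {s for s, p, o in TRIPLES if o in nodes and p == prop}

    def restrict(nodes, prop, value):
        """x[prop value]: keep only the nodes that carry the given property value."""
        return {n for n in nodes if (n, prop, value) in TRIPLES}

    # "?drug/rdfs:label" would be forward({...}, "rdfs:label"); the backward and filter
    # steps of "?target\drugbank:target[drugbank:drugType drugType:approved]" become:
    drugs = backward({"target:GabaReceptor"}, "drugbank:target")
    approved = restrict(drugs, "drugbank:drugType", "drugType:approved")
    print(sorted(approved))   # ['drug:Lorazepam']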

2.4 Pre-Matching

Comparing all pairs of entities of a source dataset S and a target dataset T would result in an unsatisfactory runtime complexity of O(|S| · |T|). Even after using SPARQL restrictions to select suitable subsets of each dataset, the required time and network load to perform all pair comparisons might prove to be impractical in many cases. To avoid this problem, we need a way to quickly find a limited set of target entities that are likely to match a given source entity. Silk provides this by offering rough index pre-matching.

When using pre-matching, all target resources are indexed by the values of one or more specified properties (most commonly, their labels) before any detailed comparisons are performed. During the subsequent resource comparison phase, the previously generated index is used to look up potential matches for a given source resource. This lookup uses the BM25 weighting scheme⁸ for the ranking of search results and additionally supports spelling corrections of individual words of a query. Only a limited number of target resources found in this lookup is then considered as candidates for a detailed comparison.

An example of such a pre-matching specification that could be applied to our drug linking example is presented in Figure 2. This directive instructs Silk to index the drugs in the target dataset by their rdfs:label and drugbank:synonym property values. When performing comparisons, the rdfs:label of a source resource is used as a search term into the generated indexes and only the first ten target hits found in each index are considered as link candidates for detailed comparisons.

[Figure 2 shows this pre-matching directive; its markup was lost during text extraction.]
Fig. 2. Pre-Matching

If we neglect a slight index insertion and search time dependency on the target dataset size, we now achieve a runtime complexity of O(|S| + |T|), making it feasible to interlink even large datasets under practical time constraints. Note however that this pre-matching may come at the cost of missing some links during discovery, since it is not guaranteed that a pre-matching lookup will always find all matching target resources.

⁸ http://xapian.org/docs/bm25.html
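The complexity argument behind pre-matching can be illustrated with a plain inverted index. The sketch below is a drastic simplification: it ignores BM25 ranking and spelling correction and uses hypothetical helper names and data; it only shows how indexing the target labels once turns all-pairs comparison into a cheap per-source candidate lookup.

    # Simplified sketch of rough index pre-matching (no BM25, no spelling correction).
    from collections import defaultdict

    def build_index(target_resources, properties=("rdfs:label", "drugbank:synonym")):
        """Index target resources by the individual words of the given property values."""
        index = defaultdict(set)
        for uri, values in target_resources.items():
            for prop in properties:
                for value in values.get(prop, []):
                    for word in value.lower().split():
                        index[word].add(uri)
        return index

    def candidates(index, search_term, limit=10):
        """Look up at most `limit` target resources sharing a word with the search term."""
        hits = set()
        for word in search_term.lower().split():
            hits |= index.get(word, set())
        return sorted(hits)[:limit]

    # The index is built once over the target dataset (roughly O(|T|)); each of the
    # |S| source resources then triggers one cheap lookup instead of |T| comparisons.
    targets = {"drugbank:DB00186": {"rdfs:label": ["Lorazepam"],
                                    "drugbank:synonym": ["Ativan"]}}
    index = build_index(targets)
    print(candidates(index, "Lorazepam"))   # ['drugbank:DB00186']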

3 Evaluating Links

In real-world settings, data is often not as clean and complete as we would wish it to be. For instance, two data sources might both support the same identification schema, like EAN, ISBN or ISIN numbers, but due to a large number of missing values it is nevertheless necessary to use similarity computations in addition to identifier matching to generate links. Such data quality problems are usually not known in advance but discovered when a data publisher tries to compute links pointing to a target data source. Therefore, finding a good linking heuristic is usually an iterative process. In order to support users with this task, Silk provides a Web interface for evaluating the correctness and completeness of generated links. Based on this evaluation, users can fine-tune their linking specification, for example by changing weights or applying different metrics or aggregation functions.

3.1 Resource Comparison

The resource comparison component of the Silk web interface allows the user to quickly evaluate links according to the currently loaded linking specification. A screenshot of this interface is shown in Figure 3.

Fig. 3. Comparing resource pairs with the Silk web interface

The user first enters a set of RDF links into the box at the top of the screen. Silk then recomputes these links and displays the resulting similarity scores for each link in an overview table. For further examination, a drill-down view of a specific pair comparison can be shown by clicking on one of the table rows. This drill-down shows in a tree-like fashion the exact parameterizations and evaluations of all submetrics and aggregations employed. This information allows the user to spot parts of the similarity evaluation which did not behave as expected.

An example drill-down of a comparison between the DrugBank and DBpedia resources describing the drug Lorazepam is shown in Figure 4. As evident from the illustration, the two drug entries are matched successfully with a high total similarity score although several subcomparisons return unfavorable results. For example, the comparison of the DBpedia resource's label with the synonyms on the DrugBank side finds only a similarity of 0.867. However, since perfectly matching labels exist on both sides, the MAX operator grouping these name-related property comparisons evaluates to a total similarity value of 1. Similarly, due to a dataset error, the section aggregating the exact numeric drug identifiers contains a similarity value of 0 for the CAS registry numbers. This erroneously low value is corrected by the availability of other exactly matching identifiers in the MAX aggregation.

Fig. 4. Detailed drill-down into a resource pair comparison

3.2 Evaluation against a Reference Linkset

A methodology that proved useful for optimizing link specifications is to manually create a small reference linkset and then optimize the Silk linking specification to produce these reference links, before Silk is run against the complete target data source. Once such a reference linkset is available, the Silk web interface provides a linkset evaluation component which allows the comparison of generated linksets to the reference set. This component is shown in Figure 5.

Fig. 5. Evaluating linksets with the Silk web interface

Silk displays which links are missing from the generated set as well as which resource pairs were interlinked erroneously. To give an overall indication of the linkset quality, Silk also computes statistical measures pertaining to the completeness and correctness of the generated links. A Precision value indicates the correctness of the generated links, while a Recall value measures their completeness. Finally, the F1-measure calculates the weighted harmonic mean of both, providing an overall quality measure of the linkset.
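These three measures can be written down directly. The helper below is a plain sketch, not part of the Silk web interface, and treats both linksets as Python sets of (source, target) pairs.

    # Sketch: precision, recall and F1-measure of a generated linkset against a
    # manually created reference linkset.

    def linkset_quality(generated, reference):
        """Both arguments are sets of (source URI, target URI) pairs."""
        correct = len(generated & reference)
        precision = correct / float(len(generated)) if generated else 0.0
        recall = correct / float(len(reference)) if reference else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
        return precision, recall, f1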

3.3 Improving the DBpedia/DrugBank Link Specification

We compared 3,134 drugs in DBpedia with 4,772 drugs in DrugBank. As a result of applying the linking specification shown in Figure 1, Silk discovered 1,227 confident links above the threshold of 0.9 and found 32 more links above the threshold of 0.7. To evaluate the quality of the retrieved links, we created a reference linkset pertaining to 50 drugs selected randomly from DrugBank and found 38 manually researched links to DBpedia. We then ran Silk a second time to find only links from these 50 selected DrugBank resources to DBpedia and compared both the generated and the reference linkset. The evaluation revealed 4 missing links and one incorrectly discovered link. This corresponded to a Precision of 0.97, a Recall of 0.89 and an F1-measure of 0.93.

To better understand why certain links were missing and why one link was incorrect, we then compared their source and target resources via the resource comparison web interface. One link was missed because of radically differing molecular weights in both datasets. Three other missing links were not discovered due to the fact that their CAS registry numbers did not match while at the same time no other exact identifiers were present. Finally, one link was discovered incorrectly since the resource labels were very similar and no other relevant property values were present in the datasets.

In a subsequent tuning of the link specification, we mitigated the effect of a single mismatching exact identifier by lowering the weight of the surrounding aggregation to 3 and setting a default value of 0.85 for the IDs in the same aggregation in case the corresponding RDF properties were not available. This lowered the negative effect of a single incorrect identifier while preserving a high rating in this aggregation whenever a matching value is found. After this improvement, only 2 links were missing, which means that we now reached a Recall value of 0.95 and an F1-measure of 0.96.
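The reported values are consistent with the definition above: from the 50 selected drugs, Silk must have produced 35 links, 34 of which appear among the 38 reference links (4 missing, 1 incorrect). These counts are inferred from the reported measures rather than stated explicitly in the text, so the snippet below is only a plausibility check using the helper from Section 3.2.

    # Plausibility check with counts inferred from the evaluation (not stated verbatim):
    # 38 reference links, 4 of them missed, 1 generated link incorrect.
    correct, generated_total, reference_total = 34, 35, 38
    precision = correct / float(generated_total)        # 0.971 -> reported as 0.97
    recall = correct / float(reference_total)           # 0.895 -> reported as 0.89
    f1 = 2 * precision * recall / (precision + recall)  # 0.931 -> reported as 0.93
    print(round(precision, 2), round(recall, 2), round(f1, 2))
    # After tuning (only 2 links missing): recall = 36/38 = 0.947 and, assuming the one
    # incorrect link remains, precision = 36/37 = 0.973, giving an F1-measure of 0.96.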

4 Web of Data – Link Maintenance Protocol

Changes or additions in either of the interlinked datasets can invalidate existing links or imply the need to generate new ones. With the Web of Data – Link Maintenance Protocol (WOD-LMP), we propose a solution to this problem. The WOD-LMP protocol automates the communication between two cooperating Web data sources. It assumes two basic roles: link source and link target, where the link source is a Web data source that publishes RDF links pointing at data published by the target data source. The protocol covers the following three use cases:

4.1 Link Transfer to Target

In the simplest use case, a link source wants to send a set of RDF links to the target data source so that the target may keep track of incoming links and can decide whether it wants to set back-links. Afterwards, the source wants to keep the target informed about subsequent updates (i.e. additions and deletions) to the transferred links. To achieve the transfer of the initial set of links and of subsequently generated ones, a Link Notification message is sent to the target data source. This notification includes the generated links along with the URL of the WOD-LMP protocol endpoint at the source side. The deletion of links by the source is communicated to the target in a Link Deletion Notification message, which in turn contains the link triples to be deleted.
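The two messages of this use case can be pictured as simple data structures. The Python classes below are purely illustrative stand-ins based on the description above; they are assumptions for the sake of the example and do not reproduce the SOAP schema defined in the WOD-LMP specification.

    # Illustrative stand-ins for the first WOD-LMP use case (not the actual SOAP schema).
    from dataclasses import dataclass
    from typing import List, Tuple

    # An RDF link as a (source URI, link type, target URI) triple.
    Link = Tuple[str, str, str]

    @dataclass
    class LinkNotification:
        """Sent from the link source to the link target with newly published links."""
        source_endpoint: str   # URL of the WOD-LMP protocol endpoint at the source side
        links: List[Link]

    @dataclass
    class LinkDeletionNotification:
        """Tells the link target which previously transferred links were deleted."""
        links: List[Link]

    # Example: announcing a single owl:sameAs link (URIs are invented).
    note = LinkNotification(
        source_endpoint="http://example.org/wodlmp",
        links=[("http://example.org/a/drug1",
                "http://www.w3.org/2002/07/owl#sameAs",
                "http://example.org/b/drug1")])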

4.2 Request of Target Change List

In this use case, the link source asks the target to supply a list of all changes that have occurred to RDF resources in the target dataset within a specific time period. The source may then use the provided change information for periodic link recomputation. The protocol also allows requesting only additions, updates or deletions of resources.

WOD-LMP uses incremental sequence numbers to identify resource changes. The changes are requested from the remote data source by sending a Get Changes message, which contains both the desired change sequence number range as well as the desired change type filter options. The target replies to this with a Change Notification, which lists the requested changes together with their corresponding sequence numbers and change types. If no upper sequence number is supplied, the target sends all changes up to the latest change. This kind of selective link recomputation requires periodic polling of the remote data source by the link source, but has the advantage of working without maintaining a persistent relationship between the linked data sources.
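The request/reply cycle of this use case amounts to periodic polling with a sequence-number cursor. The loop below is a hypothetical Python sketch: get_changes and recompute_links are assumed callables that stand in for the Get Changes / Change Notification exchange and for Silk's link recomputation, and the record layout is not taken from the WOD-LMP specification.

    # Hypothetical sketch of the change-list use case: periodic polling of the target's
    # changes using incremental sequence numbers (the SOAP layer is omitted).
    import time

    def poll_changes(get_changes, recompute_links, interval_s=3600):
        """Periodically fetch the target's change list and recompute affected links.

        get_changes(from_seq, types) is assumed to return a list of change records of
        the form {'seq': int, 'resource': uri, 'type': 'addition'|'update'|'deletion'};
        omitting an upper sequence number means "everything up to the latest change".
        """
        last_seq = 0
        while True:
            changes = get_changes(from_seq=last_seq + 1,
                                  types=("addition", "update", "deletion"))
            for change in changes:
                recompute_links(change["resource"], change["type"])
                last_seq = max(last_seq, change["seq"])
            time.sleep(interval_s)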

4.3 Subscription of Target Changes

The protocol also supports fine-grained link recomputation by monitoring the resources in the target dataset that were used to compute links. As illustrated in Figure 6, the source informs the target dataset via a Link Notification message about a group of generated links and, for each transferred link, supplies the URIs of the resources in the target dataset that were used to compute the link. The target saves this information and monitors the resources. If one of them changes or is deleted, the target notifies the source about these changes by sending a Change Notification message. The source may then use this information to recompute affected links and possibly delete invalidated ones. In this case, it notifies the target about deleted links with a Link Deletion Notification, which cancels the subscription of resources relevant to these links.

[Figure 6 shows a sequence diagram between a SourceDataset and a TargetDataset with the messages linkNotification(endpointUri : string, links : LinkList), changeNotification(targetSeqNum : int, changes : ChangeList) and linkDeletionNotification(links : LinkList).]
Fig. 6. Subscribing to resource changes in the target data source

The implementation of the WOD-LMP protocol is based on SOAP. The complete specification of the protocol is available at http://www4.wiwiss.fu-berlin.de/bizer/silk/wodlmp/.

The WOD-LMP protocol is used to maintain the links between DBpedia and DrugBank. Links generated on the DrugBank side are sent and integrated into DBpedia, while DBpedia notifies the DrugBank Silk instance about changes to subscribed resources. This synchronization will become especially important as DBpedia starts to utilize the Wikipedia live update stream to continuously extract data from changed Wikipedia pages. Thus, DBpedia resources will be continuously updated to match Wikipedia, while at the same time the DrugBank Silk instance will be able to maintain and recompute links to DBpedia.

5 Implementation

Silk is written in Python and is run from the command line. When generating linksets, Silk is started as a batch process; it runs as a daemon when serving the web interface or the WOD-LMP protocol endpoints. The framework may be downloaded from Google Code⁹ under the terms of the BSD license. For calculating string similarities, a library from Febrl¹⁰, the Freely Extensible Biomedical Record Linkage Toolkit, is used, while Silk's pre-matching features are realized using the search engine library Xapian¹¹. The web interface was built with the Werkzeug¹² toolkit, while the link maintenance protocol endpoints use the free soaplib¹³ library for the exchange of SOAP messages.

6 Related Work

There is a large body of related work on record linkage [7] and duplicate detection [9] within the database community as well as on ontology matching [10] in the knowledge representation community. Silk builds on this work by implementing similarity metrics and aggregation functions that proved successful within other scenarios. What distinguishes Silk from this work is its focus on the Linked Data scenario, where different types of semantic links should be discovered between Web data sources that often mix terms from different vocabularies and where no consistent RDFS or OWL schemata spanning the data sources exist.

Related work that also focuses on Linked Data includes Raimond et al. [11], who propose a link discovery algorithm that takes into account both the similarities of web resources and of their neighbors. The algorithm is implemented within the GNAT tool and has been evaluated for interlinking music-related datasets.

⁹ http://silk.googlecode.com
¹⁰ http://sourceforge.net/projects/febrl
¹¹ http://xapian.org
¹² http://werkzeug.pocoo.org
¹³ http://trac.optio.webfactional.com/

In [12], Hassanzadeh et al. describe a framework for the discovery of semantic links over relational data which also introduces a declarative language (LinQL) for specifying link conditions. Their framework is meant to be used together with relational-database-to-RDF wrappers like D2R Server or Virtuoso RDF Views. Differences between LinQL and Silk-LSL are the underlying data model and Silk's ability to combine metrics more flexibly through aggregation functions. A framework that deals with instance coreferencing as part of the larger process of fusing Web data is the KnoFuss Architecture proposed in [13]. In contrast to Silk, KnoFuss assumes that instance data is represented according to consistent OWL ontologies.

Furthermore, approaches to track changes and updates in Linked Data sources include PingtheSemanticWeb¹⁴, a central registry for Web of Data documents which offers XML-RPC and REST APIs to notify the service about new or changed documents. A further approach to making change information available is proposed by Auer et al. and implemented in Triplify [14]. Similar to the second WOD-LMP use case, change information is requested on a peer-to-peer basis instead of being aggregated into a central registry such as PingtheSemanticWeb. This approach is also implemented by DSNotify [15], which runs as an add-on to a local data source and uses indexes to track resource changes. DSNotify supports the active notification of subscribers as well as providing change data on demand. It further uses heuristics to determine the cause of a resource change and whether a deleted link target has become available under a different URI.

7 Conclusion

We presented the Silk framework, a flexible tool for discovering links between entities within different Web data sources. The Silk-LSL link specification language was introduced and its applicability was demonstrated within a life science use case. We then proposed the WOD-LMP protocol for synchronizing and maintaining links between continuously changing Linked Data sources.

Future work on Silk will focus on the following areas: We will implement further similarity metrics to support a broader range of linking use cases. To assist users in writing Silk-LSL specifications, machine learning techniques could be employed to adjust weightings or optimize the structure of the matching specification. Finally, we will evaluate the suitability of Silk for detecting duplicate entities within local datasets instead of using it to discover links between disparate RDF data sources.

The value of the Web of Data rises and falls with the amount and the quality of links between data sources. We hope that Silk and other similar tools will help to strengthen the linkage between data sources and therefore contribute to the overall utility of the network. The complete Silk-LSL language specification, the WOD-LMP protocol specification and further Silk usage examples can be found on the Silk project website at http://www4.wiwiss.fu-berlin.de/bizer/silk/.

¹⁴ http://pingthesemanticweb.com

References

1. Berners-Lee, T.: Linked Data - Design Issues. http://www.w3.org/DesignIssues/LinkedData.html
2. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The Story So Far. Journal on Semantic Web and Information Systems (in press), 2009.
3. Bizer, C., Cyganiak, R., Heath, T.: How to Publish Linked Data on the Web. http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
4. Bizer, C., et al.: DBpedia - A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, doi:10.1016/j.websem.2009.07.002, 2009.
5. Jentzsch, A., et al.: Enabling Tailored Therapeutics with Linked Data. In: Proceedings of the 2nd Workshop about Linked Data on the Web, 2009.
6. Jaro, M.: Advances in Record-Linkage Methodology as Applied to the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414-420, 1989.
7. Winkler, W.: Overview of Record Linkage and Current Research Directions. Bureau of the Census, Research Report Series, 2006.
8. Zhong, J., et al.: Conceptual Graph Matching for Semantic Search. In: The 2002 International Conference on Computational Science, 2002.
9. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1-16, 2007.
10. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg, 2007.
11. Raimond, Y., Sutton, C., Sandler, M.: Automatic Interlinking of Music Datasets on the Semantic Web. In: Proceedings of the 1st Workshop about Linked Data on the Web, 2008.
12. Hassanzadeh, O., et al.: Semantic Link Discovery over Relational Data. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009.
13. Nikolov, A., et al.: Integration of Semantically Annotated Data by the KnoFuss Architecture. In: Proceedings of the 16th International Conference on Knowledge Engineering and Knowledge Management, 265-274, 2008.
14. Auer, S., et al.: Triplify – Light-Weight Linked Data Publication from Relational Databases. In: Proceedings of the 18th International World Wide Web Conference, 2009.
15. Haslhofer, B., Popitsch, N.: DSNotify – Detecting and Fixing Broken Links in Linked Data Sets. In: Proceedings of the 8th International Workshop on Web Semantics, 2009.
