Same as Magic

When I started my journey in the world of Semantic Web Technologies and Linked Data, I couldn’t quite get what all the fuss about the property owl:sameAs was. Later, when actively using Linked Data, I came to understand the idea better and appreciate it. But it wasn’t until I personally created graphs from heterogeneous data stores and then applied different strategies for merging them that I realised the “magical” power of owl:sameAs.

The idea behind “same as” is simple: it says that although the two identifiers it links are distinct, what they represent is not.

Let’s say you want to bring together different things recorded for and by the same person, Sam E. There is information in a personnel database, and his profiles and activities in Yammer, LinkedIn, Twitter and Facebook. Sam E. also does research, so he has publications in different online libraries. He also makes highlights in Kindle and check-ins on Foursquare.

Let’s imagine that at least one of the email addresses recorded as Sam E’s personal email is used in all these data sets. Sam E. is also somehow uniquely identified in each system, and it doesn’t matter whether the identifiers use his email or not. When creating an RDF graph from each of the sources, a URI for Sam E. should be generated in that graph if one doesn’t exist already. The only other thing needed is to declare that Sam E’s personal email is the object of foaf:mbox, where the subject is the respective URI for Sam E. in each of the data sets.
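As a minimal sketch of that step, here is how two of the source graphs might look, using Python with rdflib; the URIs and the email address are made up for illustration:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF

# Hypothetical namespaces for the per-source identifiers of Sam E.
YAMMER = Namespace("http://example.org/yammer/")
TWITTER = Namespace("http://example.org/twitter/")

# The shared personal email as a mailto: URI (foaf:mbox expects a mailbox URI).
SAM_MBOX = URIRef("mailto:sam@example.com")

# One graph per source, each minting its own URI for Sam E.
g_yammer = Graph()
g_yammer.add((YAMMER["sam-e"], FOAF.name, Literal("Sam E.")))
g_yammer.add((YAMMER["sam-e"], FOAF.mbox, SAM_MBOX))

g_twitter = Graph()
g_twitter.add((TWITTER["sam_e"], FOAF.name, Literal("Sam E.")))
g_twitter.add((TWITTER["sam_e"], FOAF.mbox, SAM_MBOX))
```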

The interesting thing about foaf:mbox is that it is “inverse functional”. When a property is asserted to be an owl:InverseFunctionalProperty, the object uniquely identifies the subject of that statement. To see what that means, let’s first look at what a functional property means in OWL. If Sam E. “has birth mother” Jane, and Sam E. “has birth mother” Mary, and “has birth mother” is declared as a functional property, a DL reasoner will infer that Jane and Mary are the same person. An “inverse functional” property works the same way in the opposite direction. So if Sam.E.Yammer has foaf:mbox “sam@example.com”, and Sam.E.Twitter has foaf:mbox “sam@example.com”, then Sam.E.Yammer refers to the same person as Sam.E.Twitter. That is because a new triple Sam.E.Yammer–owl:sameAs–Sam.E.Twitter is inferred as a consequence of foaf:mbox being an owl:InverseFunctionalProperty. And that single triple has a massive effect: all facts about Sam E. from Yammer are inferred for Sam E. from Twitter and vice versa. The same applies for LinkedIn, Facebook, the online libraries, Foursquare and so on.
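Here is a rough sketch of that inference, continuing the snippet above. It assumes the owlrl package for OWL RL reasoning and asserts the inverse-functional axiom locally (the FOAF vocabulary itself declares foaf:mbox that way):

```python
import owlrl
from rdflib.namespace import FOAF, OWL, RDF

# Union of the two source graphs, plus the axiom that foaf:mbox is inverse functional.
g = g_yammer + g_twitter
g.add((FOAF.mbox, RDF.type, OWL.InverseFunctionalProperty))

# Compute the OWL RL closure: the inverse-functional rule fires on the shared mailbox.
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

# The owl:sameAs triple should now be in the graph.
print((YAMMER["sam-e"], OWL.sameAs, TWITTER["sam_e"]) in g)  # expected: True
```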

Now, imagine you do that not just for Sam E. but for your whole Twitter network. Then you’ll get a graph that can answer questions such as “Of those who tweeted about topic X within my network, give me the names and emails of all people that work within 300 km from here”, or “Am I in the same discussion group as somebody who liked book Y?”. But wait, you don’t need to imagine it; you can easily do it. Here is, for example, one way to turn Twitter data into an RDF graph.
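And once such a graph exists, the querying side might look roughly like the SPARQL sketch below, run over the merged graph from the earlier snippets. The ex:follows and ex:tweetedAbout predicates and the URIs are invented for illustration and would have to come from the Twitter-derived data:

```python
results = g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX ex:   <http://example.org/vocab/>

    SELECT ?name ?mbox WHERE {
        <http://example.org/twitter/me> ex:follows ?person .  # people in my network
        ?person ex:tweetedAbout "topic X" .   # who tweeted about the topic
        ?person foaf:name ?name .             # the name may come from the HR graph
        ?person foaf:mbox ?mbox .             # the mailbox may come from yet another source
    }
""")
for name, mbox in results:
    print(name, mbox)
```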

Of course, apart from persons, similar approaches can be applied to any other thing represented on the web: organisations, locations, artefacts, chemical elements, species and so on.

To better understand what’s going on, it’s worth remembering that there is no unique name assumption in OWL. The fact that two identifiers X and Y are different does not mean that they represent different things. If we know, or if it can be deduced, that they represent different things, this can be asserted or respectively inferred as a new triple X–owl:differentFrom–Y. In a similar way, a triple saying just the opposite, X–owl:sameAs–Y, can be asserted or inferred. Basically, as far as sameness is concerned, we can have three states: same, different, and neither same nor different. Unfortunately, a fourth state, both same and different, is not allowed; why that would be of value will be discussed in another post. Now, let’s get back to the merging of graphs.
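A tiny continuation of the sketch shows that the first two states have to be stated (or inferred) explicitly, while two distinct URIs on their own simply sit in the third state:

```python
from rdflib import URIRef
from rdflib.namespace import OWL

x = URIRef("http://example.org/people/x")  # hypothetical identifiers
y = URIRef("http://example.org/people/y")

# Distinct identifiers alone say nothing: neither sameness nor difference is known.
print((x, OWL.sameAs, y) in g, (x, OWL.differentFrom, y) in g)  # False False

# Either relation has to be asserted (or inferred) to move out of the third state.
g.add((x, OWL.differentFrom, y))
```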

Bringing together the RDF graphs about Sam E. created from the different systems links them into one graph just by using foaf:mbox. Most triple stores, such as Virtuoso, can do this kind of basic inferencing at run time. If you want to merge the graphs in an ontology editor, you have to use a reasoner such as Pellet if you are using Protégé, or run inferencing with SPIN if you are using TopBraid Composer. Linking knowledge representations from different systems in a way that is independent of their underlying schemas can bring a lot of value, from utilising chains of relations to learning things not known before the linking.
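As a rough sketch of what such a merge amounts to (still in the rdflib and owlrl terms used above, with a made-up Foursquare-derived graph), adding one more source and recomputing the closure makes its facts reachable from every identifier for Sam E.:

```python
# A further hypothetical source: a check-in recorded in a Foursquare-derived graph.
EX = Namespace("http://example.org/vocab/")
FOURSQ = Namespace("http://example.org/foursquare/")

g_foursquare = Graph()
g_foursquare.add((FOURSQ["samE"], FOAF.mbox, SAM_MBOX))
g_foursquare.add((FOURSQ["samE"], EX.checkedInAt, URIRef("http://example.org/places/cafe-42")))

# Merge it in and recompute the closure: the Foursquare URI is linked to the others
# via the shared mailbox, so the check-in is inferred for the Yammer and Twitter URIs too.
g += g_foursquare
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)
print((YAMMER["sam-e"], EX.checkedInAt, URIRef("http://example.org/places/cafe-42")) in g)  # expected: True
```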

The power of “same as” has been used a lot for data integration, both in controlled environments and in the wild. But let’s not forget that in the latter, the web, “Anyone can say anything about anything”. This was in fact one of the leading design principles for RDF and OWL. And even with the best intentions in mind, people can create a lot of almost “same as” relations that get mixed with the reliable “same as” relations. And they did, and they do.

The problems with “same as” have received a lot of attention. In one of the most cited papers on the subject, Harry Halpin et al. outline four categories of problems with owl:sameAs: “Same Thing As But Different Context”, “Same Thing As But Referentially Opaque”, “Represents”, and “Very Similar To”. Others warn about problems with provenance. Still, almost all agree that the benefits of owl:sameAs for Linked Data far outweigh the risks, and that the latter can be mitigated by various means.

Whatever the risks with owl:sameAs on the web, they are insignificant or non-existent in corporate environments. And yet most of the data stays in silos and gets integrated only partially, based on some concrete requirements. These requirements represent local historical needs and bring expensive solutions to local historical problems. Those solutions typically range from point-to-point interfaces with some ETL to the realisation of REST services. They can get quite impressive with their means for access and synchronisation, and yet they are all dependent on the local schemas in each application silo and helpless for enterprise-wide usage or any unforeseen need. What those solutions bring is a more complicated application landscape, additional IT investments with every change of requirements, and usually a lot of spending on MDM software, data warehouses and suchlike.

All that can be avoided if the data from heterogeneous corporate and open data sources is brought together into an enterprise knowledge graph, with distributed linked ontologies and vocabularies to give sense to it, and elegant querying technologies that can bring answers instead of just search results. The Semantic Web stack is full of capabilities such as owl:sameAs that make this easy, and beautiful. Give it a try.