Same as Magic

When I started my journey in the world of Semantic Web Technologies and Linked Data, I couldn’t quite get what all the fuss about the property owl:sameAs was. Later I was able to understand the idea better and to appreciate it when actively using Linked Data. But it wasn’t until I personally created graphs from heterogeneous data stores and then applied different strategies for merging them that I realised the “magical” power of owl:sameAs.

The idea behind “same as” is simple. It is a way to say that although the two identifiers linked by it are distinct, what they represent is not.

Let’s say you want to bring together different things recorded for and by the same person, Sam E. There is information in a personnel database, and he has a profile and activities in Yammer, LinkedIn, Twitter and Facebook. Sam E. also does research, so he has publications in different online libraries. He also makes highlights in Kindle and checks in on Foursquare.

Let’s imagine that at least one of the email addresses recorded as Sam E.’s personal email is used in all these data sets. Sam E. is also somehow uniquely identified in each of these systems, and it doesn’t matter whether the identifiers use his email or not. When creating RDF graphs from each of the sources, a URI for Sam E. should be generated in each graph if one doesn’t exist already. The only other thing needed is to declare Sam E.’s personal email as the object of foaf:mbox, where the subject is the respective URI for Sam E. from each of the data sets.
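To make that concrete, here is a minimal sketch in Python with rdflib of how two such graphs might be created. The URIs, the email address and the choice of sources are made up just for the illustration:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF

EX = Namespace("http://example.com/")

# Graph built from the personnel database
personnel = Graph()
sam_personnel = EX["personnel/emp-00042"]   # minted URI for Sam E.
personnel.add((sam_personnel, FOAF.name, Literal("Sam E.")))
personnel.add((sam_personnel, FOAF.mbox, URIRef("mailto:sam@example.com")))

# Graph built from a Twitter export about the same person
twitter = Graph()
sam_twitter = EX["twitter/sam_e"]           # a different URI for the same person
twitter.add((sam_twitter, FOAF.nick, Literal("sam_e")))
twitter.add((sam_twitter, FOAF.mbox, URIRef("mailto:sam@example.com")))
```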

The interesting thing about foaf:mbox is that it is “inverse functional”. When a property is asserted as an owl:InverseFunctionalProperty, the object uniquely identifies the subject of the statement. To get what that means, let’s first see the meaning of a functional property in OWL. If Sam E. “has birth mother” Jane, and Sam E. “has birth mother” Mary, and “has birth mother” is declared as a functional property, a DL reasoner will infer that Jane and Mary are the same person. An inverse functional property works the same way in the opposite direction. So if Sam.E.Yammer has foaf:mbox “sam@example.com”, and Sam.E.Twitter has foaf:mbox “sam@example.com”, then Sam.E.Yammer refers to the same person as Sam.E.Twitter. That is because a new triple Sam.E.Yammer–owl:sameAs–Sam.E.Twitter is inferred as a consequence of foaf:mbox being an owl:InverseFunctionalProperty. That single inferred triple brings a massive effect: all facts about Sam E. from Yammer are inferred for Sam E. from Twitter and vice versa. And the same applies for LinkedIn, Facebook, the online libraries, Foursquare and so on.
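A DL reasoner does this inference for you. The sketch below, continuing from the previous one, just spells out the inverse functional rule by hand so it is visible what gets inferred:

```python
from collections import defaultdict
from rdflib.namespace import FOAF, OWL

# Merge the two source graphs into one
merged = personnel + twitter

# Group subjects by their foaf:mbox value; because foaf:mbox is inverse
# functional, subjects sharing a mailbox denote the same person.
by_mbox = defaultdict(set)
for person, mbox in merged.subject_objects(FOAF.mbox):
    by_mbox[mbox].add(person)

# Materialise the owl:sameAs triples a reasoner would infer
for subjects in by_mbox.values():
    for a in subjects:
        for b in subjects:
            if a != b:
                merged.add((a, OWL.sameAs, b))

print(merged.serialize(format="turtle"))
```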

Now, imagine you don’t do that just for Sam E., but for your whole Twitter network. Then you’ll get a graph that is able to answer questions such as “Of those that tweeted about topic X within my network, give me the names and emails of all people that work within 300 km of here”, or “Am I in the same discussion group as somebody that liked book Y?”. But wait, you don’t need to imagine it; you can easily do it. Here is for example one way to turn Twitter data into an RDF graph.
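To give a flavour of such questions, here is a simplified, hypothetical SPARQL query over the merged graph from the sketches above. The ex:tweetedAbout property and the topic URI are invented for the example, and the geographic part is left out, since it would need additional vocabulary:

```python
# The variable names and the ex: vocabulary are assumptions for illustration;
# a real query would use whatever terms the Twitter-to-RDF conversion produced.
query = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX ex:   <http://example.com/vocab/>

    SELECT DISTINCT ?name ?mbox
    WHERE {
        ?person ex:tweetedAbout ex:TopicX ;
                foaf:name ?name ;
                foaf:mbox ?mbox .
    }
"""
for row in merged.query(query):
    print(row.name, row.mbox)
```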

Of course, apart from persons, similar approaches can be applied to any other thing represented on the web: organisations, locations, artefacts, chemical elements, species and so on.

To better understand what’s going on, it’s worth remembering that there is no unique name assumption in OWL. The fact that two identifiers X and Y are different does not mean that they represent different things. If we know, or if it can be deduced, that they represent different things, this can be asserted or respectively inferred as a new triple X–owl:differentFrom–Y. In a similar way a triple saying just the opposite, X–owl:sameAs–Y, can be asserted or inferred. Basically, as far as sameness is concerned, we can have three states: same, different, neither same nor different. Unfortunately, a fourth state, both same and different, is not allowed; why that would be of value will be discussed in another post. Now, let’s get back to the merging of graphs.

Bringing together the RDF graphs about Sam E. created from the different systems would link them into one graph just by using foaf:mbox. Most triple stores, Virtuoso for example, would do this kind of basic inferencing at run time. If you want to merge them in an ontology editor, you have to use a reasoner such as Pellet if you are using Protégé, or run inferencing with SPIN if you are using TopBraid Composer. Linking knowledge representations from different systems in a way independent of their underlying schemas can bring a lot of value, from utilising chains of relations to learning things not known before the linking.
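For a quick experiment outside a triple store or an ontology editor, the same basic inferencing can also be computed in code. Here is a small sketch, assuming the owlrl Python library and the merged graph from the sketches above; instead of the hand-written rule, the OWL RL closure materialises the owl:sameAs triples once foaf:mbox is declared inverse functional:

```python
# Assumes the owlrl library (pip install owlrl); expand() computes the
# OWL RL closure of the graph in place.
import owlrl
from rdflib.namespace import FOAF, OWL, RDF

# The FOAF ontology asserts this characteristic, but the declaration has to
# be present in (or imported into) our graph for the rule to fire.
merged.add((FOAF.mbox, RDF.type, OWL.InverseFunctionalProperty))

owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(merged)

for a, b in merged.subject_objects(OWL.sameAs):
    print(a, "owl:sameAs", b)
```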

The power of “same as” has been used a lot for data integration, both in controlled environments and in the wild. But let’s not forget that in the latter, the web, “Anyone can say anything about anything”. This was in fact one of the leading design principles for RDF and OWL. And so, even with the best intentions in mind, people can create a lot of almost-“same as” relations that get mixed with the reliable “same as” relations. And they did, and they do.

The problems with “same as” have received a lot of attention. In one of the most cited papers on the subject, Harry Halpin et al. outline four categories of problems for owl:sameAs: “Same Thing As But Different Context”, “Same Thing As But Referentially Opaque”, “Represents”, and “Very Similar To”. Others warn about problems with provenance. Still, almost all agree that the benefits of owl:sameAs for Linked Data far outweigh the risks, and that the latter can be mitigated by various means.

Whatever the risks with owl:sameAs on the web, they are insignificant or non-existent in corporate environments. And yet most of the data stays in silos, and it gets integrated only partially and based on some concrete requirements. These requirements represent local historical needs and bring expensive solutions to local historical problems. Those solutions typically range from point-to-point interfaces with some ETL to the realisation of REST services. They can get quite impressive with their means for access and synchronisation, and yet they are all dependent on the local schemas in each application silo and helpless for enterprise-wide usage or any unforeseen need. What those solutions bring is a more complicated application landscape, additional IT investments with every change of requirements, and usually a lot of spending on MDM software, data warehouses and suchlike. All that can be avoided if the data from heterogeneous corporate and open data sources is brought together into an enterprise knowledge graph, with distributed linked ontologies and vocabularies to give sense to it, and elegant querying technologies that can bring answers instead of just search results. The Semantic Web stack is full of capabilities such as owl:sameAs that make this easy, and beautiful. Give it a try.

 

On Semantic Technologies

A conversation with Eddy Vanderlinden

Semantic technologies have been a temptation for me for quite some time. That was mainly due to my growing frustration with how data is managed both inside and outside corporations. Also, all mainstream modelling methods used for analysis or for database design and application development, with all their charms and weaknesses, often leave me feeling that they put too many constraints on people expressing things and too few on computers understanding them. So neither the frustration nor the suspicion about the potential of Semantic Technologies is new to me. What is new is the experience of their pragmatic application and the opportunity to see some familiar areas through ST lenses. I owe both of these to Eddy Vanderlinden. In this sort-of interview, I asked him a few questions, the answers to which might be of interest to the readership of this blog.

Ivo: When was your first encounter with ST?

Eddy: In 2007, when searching for possibilities to avoid the registered dysfunctions.

Ivo: What was the most fascinating thing for you in the beginning, and how did this change? I mean, which part or capability of ST is the main driver for you now, after so many years of practice?

Eddy: The most fascinating aspect was the data modelling, which made data real information. This upgrade of data towards information meant a common and precise understanding of concepts by all stakeholders, through their relationships with other concepts. Also, the flexibility of the data model contributed greatly to the benefit.

There are two main drivers added later on:

  1. The knowledge discovery possibilities through the open-world assumption, on the condition that we adapt our state of mind from comfortably categorising “things” to a state in which we discover new aspects through the ST technique. In philosophical terms, we should become a bit Tao: flexible, accept change as the only certainty, and be attentive enough to capture change and utilise it for our benefit.

  2. The possibility to convert knowledge models straight into running applications so that unambiguous goals are obtained through commonly understood and mastered methods.

Ivo: There are people over-enthusiastic about the Semantic Web, calling it “Web 3.0”, “the next big wave”, “the gigantic …. graph” and so forth. At the same time, there are many sceptics and even people that were once enthusiastic about it and now seem rather disappointed. What do you think about this? Is ST rather overrated or undervalued, or somehow both? And why?

Eddy: ST is, to my understanding, by far undervalued. Let’s first see why one could be disappointed. One reason might be that lots of people bet on the possibility that web publishers would start to use HTML tagging techniques massively, so that slightly adapted search engines could simply exploit the semantics in the text. Then the search engines would not only have keywords (from the HTML header) and page titles but also tags from within the text as sources for semantic searches. The most popular standard proposed to execute this is RDFa. Personally, I never believed this would be a success. Not only because the user needs to put in additional effort, but also because tagging does not provide relationship information between the tags. So, these people may be disappointed, but they can consider text analysis, and spongers for email and spreadsheets, as alternative approaches (see IBM’s DeepQA).

Another group of disappointed people could be the ones who thought that putting all information on a subject into an ontology would solve all their problems. This is by far not the case. Modelling should be done with a purpose in mind. As Dean Allemang writes, it is a craft. Don’t underestimate the power of the models, but accept that they provide answers to the questions you want answered. If more than that is needed, ST has to be combined with probabilistic methods (see earlier on IBM’s DeepQA project).

Another disappointed group could be people who want massive amounts of data treated with today’s reasoners. Although reasoners have become much better performing, industries with huge amounts of data (like the finance industry) should approach that data in specific ways, sometimes emulating the reasoners with alternative technologies (e.g. SPIN and the rule engines of triple store suppliers). Honestly, if we don’t know what the reasoner should deliver, we cannot model effectively. It would probably mean we are not involving ST in the right way.

The reason for my conviction of undervaluation is that for the above objections there are far-reaching alternatives that are opening new horizons in all fields of application. Furthermore, there are new domains starting to adopt semantic technologies. Artificial intelligence is such a domain. To my knowledge, the most impressive result of this is IBM’s DeepQA project. I know the reluctance of ST people to be associated with AI, because they feel AI did not bring them much; on the contrary, they brought a lot to AI. In my opinion, implementing the probabilistic approaches, besides other AI techniques, with ST brings a lot of added value to ST. Let’s not forget ST is strongly domain-oriented, while probabilistic approaches may help generalise solutions from combined domains. I expect a lot of input also from the KM community. While this community has used ontologies from the beginning (in different ways than web ontologies), the love between the ST and KM communities is not to be called ideal. When the KM community embraces the possibilities offered by ontologies in Description Logic (OWL-DL compliant), the benefits will contribute to both communities. See http://semanticadvantage.wordpress.com/category/semantic-technology/

Ivo: It seems that you don’t associate yourself much with the Semantic Web community. Is that so? What do you think are the main mistakes or fallacies of the Semantic Web movement that might actually jeopardise or postpone Semantic Technologies getting more tangible traction?

Eddy: First, I would like to stress that the solution proposed for the operational dysfunctions owes everything to the SEMWEB community. I cannot thank them enough. If I am less associated today, the reason might be laziness; I have to update my vision of their activities again. You see, when these technologies became of strategic importance for the USA and for Europe, lots of financial means were directed to the development of standards. In the beginning, the SEMWEB community was mainly busy developing standards complying with Tim Berners-Lee’s architecture vision.

Later came the tools, with pioneers such as Clark & Parsia (the Pellet reasoner), Stanford University (Protégé), Berlin University (software language adaptations for ST), Zurich University (Controlled English communication) and HP Labs with Jena (triple stores and SPARQL). Very rapidly a vast community of tool producers followed. This was really exciting. I could only participate as a tester and a commenter in small areas, since the main discussion point then was tool development.

Later, commercial players joined the community: TopQuadrant with a software development platform including the creation of services and a web server, OpenLink Software with heavy-duty triple and quad stores, and a few more. The problem is that the applications are mainly being adopted by the academic and scientific world, not yet by business users. I’ll check this for updates again.

Ivo: A recurring pattern in the lessons learnt that you share on your site is related to losses when conceptual data models are transformed into physical ones. Another is related to the missing time dimension. Can you please tell us more about those two and how they are solved by applying ST?

Eddy: On the transformation from conceptual to physical: business analysts have different tools at their disposal to represent real-world operations in a conceptual manner. They are mainly grouped in what we know as the UML diagrams, a collection of diagramming methods. Starting from these conceptualisations, technical analysts, programmers and a collection of other specialised people (data engineers, interface designers, service engineers, …) start developing functionality involving those concepts. I will not repeat the classic picture of the swing illustrating the difference for the user; anyone can make his own version here. This is solved in ST by software using the ontology as its data source. A perfect example is the TopBraid suite. Another popular tool is provided by OpenLink Software, not to forget the Fresnel lenses.

On the time dimension: there are only a few universal laws; knowledge about facts is typically related to, and valid only for, a portion of time. Meaning, an assertion is true at a certain moment or for a certain time period, e.g. an organisation, the price of an item, the specification of a product. Whoever has been involved in data mining will recognise two main challenges: denormalisation of related data and time-dimensional information. The purpose is to reconstruct any state of the company at any moment in time. The reason is that we cannot find out cause-and-effect information if we can’t partition facts into the time dimension. This is done in ST mainly in a similar way as with conventional data-management systems: by adding a time-dimension property to any individual member of a class.
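As a small illustration of that last point, here is a minimal sketch, again in Python with rdflib and with a made-up ex: vocabulary, of attaching a validity period to a price modelled as an individual:

```python
# The ex: vocabulary, the URIs and the values are assumptions, purely for
# illustration of the pattern: the price is an individual of its own, so
# validity-period properties can be attached to it.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.com/vocab/")

g = Graph()
item = URIRef("http://example.com/data/widget-7")
price = URIRef("http://example.com/data/widget-7-price-2014")

g.add((item, EX.hasPrice, price))
g.add((price, RDF.type, EX.Price))
g.add((price, EX.amount, Literal("19.95", datatype=XSD.decimal)))
g.add((price, EX.validFrom, Literal("2014-01-01", datatype=XSD.date)))
g.add((price, EX.validTo, Literal("2014-12-31", datatype=XSD.date)))
```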

Ivo: What are currently the best examples of using ST?

Eddy: Cancer research, DBpedia, NASA’s database of metrics methods, Linked Data analysis, US government enterprise models, search engines.

Ivo: Is there something that you see as a big potential of ST which has not come to fruition, which, for some reason, nobody has been able to realise so far?

Eddy: To me, it is strange that the possibilities for application development are not really being used.

Ivo: You have been, and currently are, involved in BPM activities. What do you think ST can bring to BPM? Do you see it as a flavour of BPM (there is something called Semantic BPM already) or as an alternative approach to what most companies use BPM for?

Eddy: I am not in favour of “semantic BPM”. The reason is linked to my answer in the beginning: it requires tagging the models and their objects. I certainly see it as an alternative approach to what most companies use BPM for.

Ivo: You worked in banking for a long time. Let’s talk about Semantic Technologies as an investment. It seems that many types of innovation, if judged using DCF or other popular methods, look riskier than do-nothing strategies. What type of evaluation would convincingly justify investment in Semantic Technologies today?

Eddy: The answer focuses on semantic technologies as an investment, not as a banker investing in companies with an expected DCF of X EUR/USD. The latter requires a much broader approach than just DCF, RONIC or similar: notably, what we may understand by “expected” cash flows, introducing beta factors, sector and currency valuation of future trends, market valuation in an economic context, … So working in the banking sector will not help provide an answer here. Prior to the investment selection step, the proposal is to position ST in a strategic perspective analysis. That analysis would answer the following questions:

A. On the portfolio management issues:

  A1. What new products can be offered in our portfolio with ST?
  A2. How will ST affect existing products in our portfolio?
  A3. Which new markets can be explored with the products discovered in A1 and A2?
  A4. How are these new markets evolving from the strategic time perspective?
  A5. How do the products discovered in A1 and A2 influence our competitive position in the new markets discovered in A3?
  A6. How is our competitive position evolving compared to A4?

Suggested method: the portfolio management matrix of the Boston Consulting Group.

B. On the strength of the organisation:

  B1. How does the application of ST reduce the risk of new competitors coming into the market? Where can the organisation start competing in new markets?
  B2. How does the application of ST affect our negotiating power with suppliers?
  B3. How does the application of ST affect our negotiating power with customers?
  B4. How is the threat of substitute products in our markets mitigated through the application of ST? How can we form a defence against substitutes in existing markets?
  B5. How is our strength affected by competitors, considering the price of our products and services, the quality of our products and the service level?

(Inspired by Porter’s 5-forces model.) After a thorough SWOT analysis, when the product/market matrix is finalised, the production simulations are run and the investments are figured out, cash flows can start to be projected over a strategic horizon (3-5 years). Eventually, other analysis techniques may be needed. Choosing semantic technologies gets a completely different perspective compared to a “do nothing” scenario if we do the exercise as described above, rather than take a pure accounting approach.

Ivo: Now, about reasoners. Why are there so many of them? How do they differ from each other? How to choose? When to use which?

Eddy: I wish I had a clear-cut answer to that question. When building the finance ontology in 2007-2009, I tested five reasoners. On the correctness of the inferences: for one expected inference, I could get four different answers. The conclusion at that time was that the best reasoner depended on the inference needed. The reason might in part have been the shift from OWL 1 to OWL 2, but it makes me test the inferences with different reasoners each time I construct a model. On the speed of the inference: in the beginning, some very well-performing reasoners went commercial-only. Meanwhile, some free reasoners have upgraded their versions, and new players have entered the field. Further, I would like to remind you of the relative importance of reasoners in heavy real-life applications: see above under SPIN and rule engines.

Ivo: If I start asking about ST languages, the answer might be bigger than all the answers combined so far. Maybe we can post a separate conversation about that. Still, this one could not be complete without it. About OWL then… OWL is (arguably) the leader in knowledge representation. Some think that its main advantage is that it is decidable, not probabilistic. Others state that the best thing about OWL is that it’s based on a solid mathematical foundation, unlike all the OMG standards, which are criticised for lacking one. What do you think are the main strengths of OWL, and the main weaknesses?

Eddy: To me, it is the mathematical foundation, as this enables advanced inferencing. On the other hand, I very much appreciate the profiles with their different syntactic subsets. There is no clear-cut solution, and OWL has to be tailored. This strength is also its weakness, together with the fact that we cannot reason with classes (for the moment?).

Ivo: Thank you, Eddy! Maybe there will be some more questions put here by others. Now, instead of a conclusion, and for those wondering why the Web Ontology Language is abbreviated as OWL and not WOL, one familiar passage:

He [Owl] could spell his own name WOL, and he could spell Tuesday so that you knew it wasn’t Wednesday, but his spelling goes all to pieces over delicate words like measles and buttered toast.

A. A. Milne, Winnie-the-Pooh