Buckets and Balls

Linked Data is still largely unknown, or misunderstood and undervalued. Often, people find it simply too difficult. So I keep looking for new ways to make Linked Data more accessible. And with some success. In my training courses so far over 60% of the participants had no IT background. I hope even to increase this percentage in the future.

What seems to be most challenging is writing SPARQL queries. The specification is written for IT people. There are some great courses and books but they also target people with some or more IT experience. If anything, that scares the rest and keeps SPARQL away from the masses.

I keep learning what is challenging. A recurring problem – and an unexpected one – is the concept of variable.

What is a variable in SPARQL? Just a placeholder. But how can you imagine a placeholder? It’s abstract. We have no way of grasping abstract things unless we associate them with something physical and concrete. It’s difficult to imagine time, but once we draw it in space it gets easier. We can’t picture furniture, but we have no problem with chair.

The other issue is how a SPARQL query looks. While working with SPARQL helps to understand how a knowledge graph works, a SPARQL query doesn’t look like one. It is like with symbols in mathematics. “5 doesn’t look like five, while ||||| is five”. The problem with SPARQL is similar:

You want to query knowledge graph.
You want to learn new things.
But your query doesn’t look like knowledge graph.
It looks like lines of strings.

So, how to handle together the problems with grasping variables and with the look of SPARQL?

My suggestion is to imagine every SPARQL query as a graph of linked buckets and balls.

Variables are placeholders but abstract. We need a physical container1The idea of using containers is very powerful. The whole arythmetics and alegbra can be done using only the concept of container as demonstrated by William Bricken. to fill with things. We need buckets. And nodes are like balls. So, think of running2“Running” is also a metaphore and what it stands for can be communicated more gracefully. And that’s important. As you know, language shapes the way we think. a query as filling buckets with balls.

A graph pattern then will look like this:

A bucket ?A should be filled with those balls which have a relation R to ball B.

But it looks nicer when we abbreviate it like this:

This is a graph pattern in Buckets’n’Balls notation. The direction of the relation R is not shown but it’s always from left to right.

The process of writing and running a SPARQL query would then go through the following steps: Continue reading

  • 1
    The idea of using containers is very powerful. The whole arythmetics and alegbra can be done using only the concept of container as demonstrated by William Bricken.
  • 2
    “Running” is also a metaphore and what it stands for can be communicated more gracefully. And that’s important. As you know, language shapes the way we think.

SASSY Architecture

SASSY Architecture is a practice of combining two seemingly incompatible worldviews. The first one is based on non-contradiction and supports the vision for an ACE enterprise (Agile, Coherent, Efficient), through 3E enterprise descriptions (Expressive, Extensible, Executable), achieving “3 for the price of 1”: Enterprise Architecture, Governance, and Data Integration.

The second is based on self-reference and is a way of seeing enterprises as topologies of paradoxical decisions. Such a way of thinking helps deconstruct constraints to unleash innovation, reveal hidden dependencies in the decisions network, and avoid patterns of decisions limiting future options.

As a short overview, here are the slides from my talk at the Enterprise Architecture Summer School in Copenhagen last week.

Continue reading

Same as Magic

When I started my journey in the world of Semantic Web Technologies and Linked Data, I couldn’t quite get what was all that fuss about the property owl:sameAs. Later I was able to better understand the idea and appreciate it when actively using Linked Data. But it wasn’t until I personally created graphs from heterogeneous data stores and then applied different strategies for merging them, when I realised the “magical” power of owl:sameAs.

The idea behind “same as” is simple. It works to say that although the two identifiers linked with it are distinct, what they represent is not.

Let’s say you what to bring together different things recorded for and by the same person Sam E. There is information in a personnel database, his profile and activities in Yammer, LinkedIn, Twitter, Facebook. Sam E. is also somebody doing research, so he has publications in different online libraries. He also makes highlights in Kindle and check-ins in Four-square.

Let’s imagine that at least one of the email addresses recorded as Sam E’s personal email is used in all these data sets. Sam E. is also somehow uniquely identified in these systems, and it doesn’t matter if the identifiers use his email or not. When creating RDF graphs from each of the sources,  URI for Sam E. should be generated in each graph if such doesn’t exist already. The only other thing needed is to declare that Sam E’s personal email is the object of foaf:mbox, where the subject is the respective URI for Sam E from in each of the data sets.

The interesting thing about foaf:mbox is that it is “inverse functional”. When a property is asserted as owl:inverseFunctionalProperty, then the object uniquely identifies the subject in that statement. To get what that means, let’s first see the meaning of a functional property in OWL. If Sam E. “has birth mother” Jane, and Sam E. “has birth mother” Marry, and “has birth mother” is declared as functional property, a DL reasoner will infer  that Jane and Marry are the same person. The “inverse functional” works the same way in the opposite direction. So if Sam.E.Yammer has foaf:mbox “sam@example.com”, and Sam.E.Twitter has foaf:mbox “sam@example.com”, then Sam.E.Yammer refers to the same person as Sam.E.Twitter. That is because a new triple Sam.E.Yammer–owl:sameAs–Sam.E.Twitter is inferred as a consequence of foaf:mbox being owl:inverseFunctionalProperty. But that single change brings a massive effect: all facts from Yammer about Sam E are inferred for Sam E from Twitter and vice versa. And the same applies for LinkedIn, Facebook, online libraries, Four-square and so on.

Now, imagine you don’t do that for Sam E, but for all your Twitter network. Then you’ll get a graph that will be able to answer questions such as “From those that tweeted about topic X within my network, give me the names and emails of all people that work within 300 km from here”, or “Am I in a same discussion group with somebody that liked book Y?”. But wait, you don’t need to imagine it, you can easily do it. Here is for example one way to turn Twitter data into an RDF graph.

Of course, apart for persons, similar approaches can be applied for any other thing represented on the web: organisations, locations, artefacts, chemical elements, species and so on.

To better understand what’s going on, it’s worth reminding that there is no unique name assumption in OWL. The fact that two identifiers X and Y are different, does not mean that they represent different things. If we know or if it can be deduced that they represent different things, this can be asserted or respectively inferred as a new triple X–owl:differentFrom–Y. In a similar way a triple saying just the opposite X–owl:sameAs–Y can be asserted or inferred. Basically, as long as sameness is concerned, we can have three states: same, different, neither same nor different. Unfortunately, a fourth state, both same and different, is not allowed, and why would that be of value will be discussed in another post. Now, let’s get back to the merging of graphs.

Bringing the RDF graphs about Sam E, created from the different systems, would link them in one graph just by using foaf:mbox. Most triple stores like Virtuoso, would do such kind of basic inferencing at run time. If you want to merge them in an ontology editor, you have to use a reasoner such as Pallet if you are using Protégé, or run inferencing with SPIN, if you are using TopBraid Composer. Linking knowledge representation from different systems in a way independent from their underlying schemas, can bring a lot of value, from utilising chains of relations to learning things not known before linking.

The power of “same as” has been used a lot for data integration both in controlled environments and in the wild. But let’s not forget that in the latter, the web,   “Anyone can say anything about anything”. This was in fact one of the leading design principles for RDF and OWL. And then even with the best intentions in mind, people can create a lot of almost “same as” relations that would be mixed with the reliable “same as” relations. And they did and they do.

The problems with “same as” have received a lot of attention. In one of the most cited papers on the subject, Harry Halpin et al. outline four categories of problems for owl:sameAs: “Same Thing As But Different Context”; “Same Thing As But Referentially Opaque”, “Represents”, and “Very Similar To”. Others worn about problems with provenance. Still, almost all agree that the benefits for the owl:sameAs for Linked Data by far outnumber the risks, and the latter can be mitigated by various means.

Whatever the risks with owl:sameAs in the web, they are insignificant, or non-existent in corporate environments. And yet, most of the data stays in silos and it gets integrated only partially and based on some concrete requirements. These requirements represent local historical needs and bring expensive solutions to local historical problems. Those solutions typically range from point-to-point interfaces with some ETL, to realisation of REST services. They can get quite impressive with the means for access and synchronisation, and yet they are all dependant on the local schemas in each application silo and helpless for enterprise-wide usage or any unforeseen need. What those solutions bring is more complicated application landscape, additional IT investments brought by any change of requirements and usually a lot of spendings for MDM software, data warehouses and suchlike. All that can be avoided if the data from heterogeneous corporate and open data sources is brought together into an enterprise knowledge graph, with distributed linked ontologies and vocabularies to give sense to it, and elegant querying technologies, that can bring answers, instead of just search results. The Semantic Web stack is full of capabilities such as owl:sameAs, that make this easy, and beautiful. Give it a try.