Data-centric project requirements?

Several times last week, in different circumstances, I was asked a question containing these three words or their synonyms. That’s not new. It happened previously. But this concentration triggered the write-up that follows. Nothing original and neither is the reason to write it:

Everything that needs to be said has already been said. But since no one was listening, everything must be said again.

— André Gide

Let’s first clarify what data-centric is and then see why it doesn’t go well with “project” and even less so with “requirements”.

 

What is data-centric?

The short answer is in these three of the five data-centric principles:

  1. Data is self-describing and does not rely on an application for interpretation and meaning.
  2. Data is expressed in open, non-proprietary formats.
  3. Applications are allowed to visit the data, perform their magic and express the results of their process back into the data layer for all to share.
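To make the first two principles a bit more tangible, here is a minimal sketch in Python. It assumes a JSON-LD-style record whose `@context` maps each local key to a global identifier. The record itself, the choice of schema.org terms, and the `expand` helper are all illustrative assumptions, and the helper is only a toy stand-in for real JSON-LD processing.

```python
import json

# Hypothetical record in an open format (JSON). The "@context" travels with
# the data and states what each key means, so no application is needed to
# interpret it.
record_text = """
{
  "@context": {
    "name": "https://schema.org/name",
    "price": "https://schema.org/price",
    "weight": "https://schema.org/weight"
  },
  "name": "Example bike",
  "price": 950,
  "weight": 9.8
}
"""


def expand(record: dict) -> dict:
    """Replace each local key with the identifier given in the record's own context.

    A toy stand-in for JSON-LD expansion, not the real algorithm.
    """
    context = record.get("@context", {})
    return {
        context.get(key, key): value
        for key, value in record.items()
        if key != "@context"
    }


expanded = expand(json.loads(record_text))
for term, value in expanded.items():
    print(term, "=", value)
```

Any application that visits this record can work out what `price` or `weight` refers to from the data alone, and can write its results back in the same form for others to use.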

I find these three are the most important data-centric principles, but another reason for selecting them is that they are context-independent. The data-centric manifesto they are taken from currently has an enterprise-only focus[2]. Yes, the problem they address is most severely felt (or rather not felt, because the symptoms are ignored or misattributed) in large organizations. Yet, what is behind these principles is equally important for personal information management and on the open web. Let’s go quickly through all three levels then, from big to small, and see what data-centric means for the world wide web, for corporate IT, and then for personal information management.

The web was designed to be a decentralized system where the agreement on a few standards, basically HTTP and HTML, enabled free choice on just about anything else. People were finally free to express themselves and to choose where and how to get information. They became free to innovate by building new browsers, websites, and whatever web applications and services they could think of. A system like this, with a self-maintained organization, can work well and have a natural tendency for virtuous cycles. In other words, it can amplify goodness and develop its own immune system against whatever threatens its viability. All it needs is the right kind of enabling constraints, for example, the standards I mentioned above, and autonomy for all subsystems. This is the balance between autonomy and cohesion. It works for animals, people, tribes, organizations, society, and a socio-technical system like the web.

So the web flourished as a decentralized system, where people were free to choose and create more choices. And then one day the platforms appeared. They offered good and free services. Or at least they looked good and free at first. In reality, they were (and are) neither good nor free. The platforms are not nearly as good at providing information as the decentralized web before them was. What we see is not what we are looking for, but what their algorithms decide to show us. And the services of these platforms are not free. Quite the contrary. We pay with our data, and we pay twice. Once by being their content providers and a second time by giving them our personal data. Importantly, we don’t give them only our current personal data but also our future data, by allowing them to track our online behaviour. Who’s them? I’m talking of course about IT giants like Google, but the best example of extreme centralization and lock-in is Facebook[3]. In this way, the web, a decentralized system shaped by its users, turned into a hyper-centralized system shaped by a few powerful corporations[4]. It also shaped users’ expectations. In 2019 Facebook and Google announced that it was now possible to copy images from Facebook to Google Photos. That’s the new norm for innovation. Only a few people noted the absurdity. As Ruben Verborgh pointed out, 50 years after being able to send video signals over a distance of 380,000 km, we celebrate that we can finally move a photo by 11 km (the distance between the Facebook and Google headquarters). A bit dystopian, isn’t it?

Yet, the problems with this centralization are not widely understood. For example, very few people realize how platform-based political propaganda works, and that’s why it works so well. Even fewer relate it to the hyper-centralization of the web. The same goes for fake news and so on. Maybe the least understood damage is how it suffocates innovation. It’s easy to illustrate. Even when you use Google for product search, where it should excel after so many years of work, huge investments, massive feedback, and the use of language models with a trillion parameters, it’s really lame. Try searching for a bike below a certain price and a certain weight. You’ll get results for bikes above those limits, but okay, you can fix that using the shopping filter. Currently, that filter will not let you specify the weight, even though it’s available in most technical specifications published online. But even if they add it at some point, the final selection will still exclude the majority of the offerings by smaller companies. As a result, you can’t get an answer to this simple question.
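To show how simple the bike question is once the specifications are available as open, structured data, here is a small Python sketch. The `Bike` type, the `find_bikes` helper, the catalog entries, and the units are all hypothetical; the sketch only illustrates the shape of the query, not how any real search service works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bike:
    name: str
    price_eur: float
    weight_kg: float


def find_bikes(bikes, max_price_eur, max_weight_kg):
    """Return the bikes below both limits, cheapest first."""
    return sorted(
        (
            b
            for b in bikes
            if b.price_eur < max_price_eur and b.weight_kg < max_weight_kg
        ),
        key=lambda b: b.price_eur,
    )


# Hypothetical catalog, as if collected from specifications published
# by makers of any size.
catalog = [
    Bike("Maker A road bike", 1200.0, 8.9),
    Bike("Maker B gravel bike", 950.0, 10.4),
    Bike("Maker C city bike", 700.0, 13.5),
]

matches = find_bikes(catalog, max_price_eur=1000.0, max_weight_kg=11.0)
print([b.name for b in matches])
```

The filter is two comparisons. The hard part is not the query but getting the data out of the pages and platforms that currently hold it.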


  • 2
    After the publication of this article, the scope of the manifesto was adapted, and now it includes the other two scales.
  • 3
    This has many facets. Facebook can be looked at as a very successful aggregator or as a prime example of a new form of capitalism.
  • 4
    The centralization of the web is not only about the content but also about the infrastructure. The convenience of the cloud increased the dependency of both individual users and companies on the strategy and fate of a few powerful providers, namely Amazon, Microsoft, and Google.