Thursday, March 25, 2021

My thoughts on the data mesh

The concept of a "data warehouse" has been around for a long time. A really, really long time. The term started being used in the 1970s and has retained essentially the same meaning for its roughly 50 years of existence, which is an eternity in the realm of computer science.

The data warehouse consists of two basic components:

  1. An organizational process for ingesting data from operational data stores, cleaning and transforming this data, and then serving it from a central and accessible location within an enterprise.
  2. Database software that implements the storage, modeling, and serving of the data in the data warehouse.

The second component --- the database software --- has made tremendous progress over the past five decades. The current software is much faster, more fault tolerant, and more scalable than the original data warehousing software.

However, the first component --- the organizational process --- has been much slower to modernize, and still scales very poorly. 

The data mesh, recently proposed by Zhamak Dehghani, is a major paradigm shift: a rare redesign of the organizational process that has the potential to bring this first component forward.

I wrote up my thoughts in detail in a guest post on Starburst's blog. The short summary of my thesis in that post is that a major reason data warehousing organizational processes fail is that they don't scale. The data mesh has the potential to do for the organizational side of the data warehouse what the parallel DBMS did for the scalability of its database software.