There has been a lot of talk of building 'apps' on the modern data stack. It's not clear who exactly is to blame, but it may be Martin Casado and Benn Stancil. What does this even mean? Would it bring any benefits? Would it have any downsides? What are the technical challenges? I think the proposition is interesting... but these are tough questions that must be answered if the concept is to be taken seriously.
As always, it pays to get our definitions straight, so let me start by outlining what I understand 'data apps' to be. Importantly, I do not think we're talking about internal analytical applications that a tool like Hex might help you build. While these are apps, and are powered by data, they are also very simple, and far smaller in scope than traditional software applications. These kinds of applications typically just involve uni-directional data flow and transformation - they don't deal with concerns like maintaining their own state, authentication, scaling, etc.
I suppose that implies that I think true data apps must deal with all of those things. If that's the case, what makes them any different to just ordinary apps? The catch is the data. Traditional apps are built on top of state that they manage - they control and own all of their own data. Data apps are built on top of data that already exists. They may add to it, edit it, or simply use it, but the critical distinction is that they work with data that already exists.
Why is this an important distinction, and why does it only matter now? Well, quite simply, it wasn't that long ago that most organisations simply didn't own that much data and didn't consider it central to their organisations. Indeed, many organisations are still 'digitising', and will be for some time. These days, however, many organisations have a lot of their own data already, and they want to use that data to do interesting things. Sometimes they generate the data themselves through their own operations, sometimes they generate it by using third party applications. Either way, it's data that was already made and now it must be used for "other stuff".
To me, this is what differentiates a data application, and is why it is now a big deal. Applications no longer have the luxury of pretending that the data they create lives in isolation. Instead, they must learn to live within a pre-existing ecosystem of data, and the MDS is our nascent attempt to solve that problem.
Like everything, the answer is probably that "it depends". The premise is fairly hard to argue with; companies have lots of data, they can't use it effectively, 3rd party tools create silos and add to the problem, it would be great if they all just worked off of the same data.
One compelling advantage of applications built this way is consistency. If every application works off of the same dataset, then you remove all the reconciliation issues that any data or finance professional has come to accept as normal. Two systems failing to reconcile is fundamentally the result of there being two data sets that record the same thing, and it's inevitable that they come to disagree.
Another great advantage in removing silos is completeness. If every application has access to the same dataset, then there is no need to do things like reverse ETL to put a number in Salesforce. Salesforce should just have access to the numbers already.
Next up on the list of advantages is portability. When there is just one source of truth for data and you own it, in theory it's easy to swap vendors. This is the dream and the ultimate rebellion against the economic moat that is the "system of record". Burn in hell Salesforce!!
There are always downsides, everything is a tradeoff. So, what do we lose in the world of data applications?
One glaring issue I see is the risk of centralisation in decision making around buying software, and in building internal applications. I see this issue playing out in a small way already in the endless back and forth between those entrusted with maintaining the dbt project and the business folks clamoring for more models. Central planning does not really work in big organisations and there is a huge amount of power in a non technical team being able to just buy a CRM (or whatever) and move on with it. Moving fast matters in business and bottlenecking on the data team is not moving fast.
In my opinion, we need to think carefully about how to ensure that different parts of the business are able to move independently on the bit of the model that they own. Doing this without recreating the problems of silos and inconsistency is very difficult and I don't have an answer. Maybe this is what the data mesh was about?
The biggest downside, perhaps, is that right now this is all just hot air and there is a vast amount of work to be done if it's ever to actually happen.
I think there are two broad ways to categorise the challenges we need to overcome. The first category is technical challenges, which is fairly self explanatory. The second I've called organisational, which I'll try to explain in a minute.
There are a huge number of technical challenges to overcome. I won't pretend to know what they all are, but here are three that I think are particularly challenging and interesting too.
In order for us (and 3rd parties) to build apps on top of our data, we will need to establish stable interfaces to it. This is a very old and common idea in more traditional software development, but something that has not had that much attention yet in the world of data.
Put simply, for two applications to interact, there must be an agreed upon convention for their interaction. This idea permeates every level of software, from calling conventions in ABIs, all the way up to OpenAPI specifications for web services. Ideally, this convention never changes. Never is a strong word, but successful interfaces tend to go to extreme lengths to avoid breaking their contract. When contracts are stable, people have confidence in building upon them. For example, the Java language spec and runtime promise never to break old code, and indeed they don't. You can take something written in the days of Java 1 and run it just fine using the latest Java 17 JDK. This stability is a big part of why Java is (unfortunately) such a popular choice for building serious systems.
Without some sort of equivalent idea for data, it is not really possible for a 3rd party app to come in and just run on top of your warehouse. The question is, what does this even look like for data? Do we standardise on certain data models? Do applications provide contracts for businesses to map their data to? How do we specify data contracts? Are they just schemas? Or do we need something more akin to GraphQL or OpenAPI? Are they just dbt exposures?
To me, this question is both important and fascinating.
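To make the schema option concrete, here's a minimal sketch of what a data contract could look like if it were "just a schema": a 3rd party app publishes the columns it needs, and a check verifies the warehouse provides them. Every name here (the `Column` type, the contract, the table shape) is hypothetical, not an existing standard.

```python
# A sketch of a data contract as a plain schema. A hypothetical CRM app
# declares the columns it needs; the business maps its warehouse tables
# onto that shape and checks conformance before granting the app access.

from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    nullable: bool = False

# The contract the hypothetical app publishes.
CUSTOMER_CONTRACT = [
    Column("customer_id", "string"),
    Column("email", "string"),
    Column("created_at", "timestamp"),
    Column("churn_risk", "float", nullable=True),
]

def check_contract(contract, actual):
    """Return violations of `contract` against an actual schema, given as
    {column_name: (dtype, nullable)}. An empty list means conformance."""
    violations = []
    for col in contract:
        if col.name not in actual:
            violations.append(f"missing column: {col.name}")
            continue
        dtype, nullable = actual[col.name]
        if dtype != col.dtype:
            violations.append(f"{col.name}: expected {col.dtype}, got {dtype}")
        if nullable and not col.nullable:
            violations.append(f"{col.name}: must not be nullable")
    return violations
```

Even this toy version surfaces the hard questions: who owns the mapping, how contracts evolve without breaking consumers, and whether types alone are enough (they almost certainly aren't - semantics matter too).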
It's probably best to split this up into read and write performance, as they both have different issues.
On the read side, the main problems are around latency and throughput. Cloud data warehouses have incredible read performance on analytical queries, but when you start opening up the data to the kind of interactivity you might expect out of JIRA (badum tss) you need a very different kind of read performance. The first problem is latency: users of these applications expect interactivity, which means <100ms response times, which means you can't hit Snowflake all the time, or the apps really will feel like JIRA. The second is throughput: these kinds of applications need to be able to support a large amount of concurrent use while remaining interactive.
This is relatively easy to solve, but is an important consideration. Cube.js is a great example of how to approach this problem, although something like ReadySet for the MDS could be even better (because you wouldn't need to manually configure pre-aggregations).
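The basic shape of the read-side fix can be sketched in a few lines: serve repeated dashboard-style queries from a short-lived cache instead of round-tripping to the warehouse every time. This is a deliberately naive TTL cache, not how Cube.js or ReadySet actually work, and `run_query` stands in for a real warehouse client.

```python
# A naive sketch of the read-side idea: answer repeated queries from an
# in-process cache with a time-to-live, and only pay the warehouse round
# trip on a miss or after the entry expires.

import time

class QueryCache:
    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}  # sql -> (expires_at, rows)

    def get(self, sql, run_query):
        now = time.monotonic()
        hit = self._store.get(sql)
        if hit and hit[0] > now:
            return hit[1]            # fresh hit: sub-millisecond response
        rows = run_query(sql)        # miss: the slow warehouse call
        self._store[sql] = (now + self.ttl, rows)
        return rows
```

The interesting engineering is in everything this sketch punts on: invalidating entries when upstream data changes, pre-aggregating so that misses are also fast, and doing it all without asking users to hand-tune TTLs - which is exactly the gap the tools above are competing to fill.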
Another possible approach to this is the 'all in one database', e.g. SingleStore, which in theory will just handle this problem for us. It's not widely used though, and a better outcome for the industry (in my opinion) would be a solution that could work with any data store.
There is far more work to do on the write side. At the heart of the issue is the fact that software engineering (engineering in general?) is about tradeoffs. The design decisions made by Snowflake that make it so good at analytics are the same design decisions that mean it is beyond useless for transactional workloads.
A couple of promising approaches I've seen are things like SingleStore or hydras.io, which both try to reconcile this gap between OLTP and OLAP in different ways. I'm rooting for the hydras team because they were on my batch in YC and they're great :).
Another possible approach might be for the data application to manage writes in its own application database and then replicate those writes to the underlying 'source of truth' (data warehouse?). This is basically what hydras tries to make easy, and it's how most businesses get their application data into the warehouse anyway (often using tools like Debezium). I guess I'm trying to say there's no new idea here, but having this kind of thing done by default, in a first-class way, is interesting.
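That write path can be sketched as an outbox: the app commits to its own fast operational store and logs each change, and a replicator drains that log into the warehouse asynchronously. In production this is roughly what CDC tooling like Debezium automates from the database's own log; the classes and dict-backed "warehouse" below are purely illustrative.

```python
# A sketch of the write path: the app owns a fast operational store and an
# outbox of pending changes; a replicator flushes the outbox into the
# warehouse, so the user never waits on the analytical system.

class DataApp:
    def __init__(self):
        self.app_db = {}   # operational store the app controls
        self.outbox = []   # ordered log of changes awaiting replication

    def write(self, key, value):
        self.app_db[key] = value               # serve the user immediately
        self.outbox.append(("upsert", key, value))

class Replicator:
    def __init__(self, app, warehouse):
        self.app = app
        self.warehouse = warehouse             # dict standing in for the DWH

    def flush(self):
        # In reality this would batch changes and run on a schedule or a
        # change stream; here we just drain the log in order.
        while self.app.outbox:
            op, key, value = self.app.outbox.pop(0)
            if op == "upsert":
                self.warehouse[key] = value
```

The obvious cost is that the warehouse lags the app between flushes, which reintroduces a (hopefully small and well-understood) consistency gap - exactly the thing data apps were supposed to eliminate.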
Once we start opening up our data to everyone, we are rapidly hit with the problem of controlling who can see what and when. Many organisations struggle with this and I'm sure many would say that they want to do it better! This problem is amplified when we start talking about 3rd party applications touching our data though. I think we can expect privacy laws and standards around data management to tighten over time (a good thing) and so making this work correctly is very important to the data application idea.
Perhaps almost as important as correctness, though, is making it easy. In my opinion, one of the best models for internal access control is companies running Google Workspace with their infra on GCP. Boy do they have it good when it comes to RBAC. The integration between users and groups in Workspace and access rules across the Google stack is pretty seamless. You can just say that 'this view is only visible to people in the Finance group' and it all just works as people join and leave that group.
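The reason the group model feels so good is that rules name groups rather than people, and membership is resolved at access time. A minimal sketch of that idea, with made-up groups, views, and users:

```python
# A sketch of group-based access control: views grant access to groups,
# and membership is checked when the read happens, so rules keep working
# as people join and leave groups. All names here are hypothetical.

GROUPS = {
    "finance": {"alice@example.com", "bob@example.com"},
    "engineering": {"carol@example.com"},
}

VIEW_ACLS = {
    "monthly_revenue": {"finance"},       # 'visible to the Finance group'
    "deploy_frequency": {"engineering"},
}

def can_read(user, view):
    """True if `user` belongs to any group granted access to `view`."""
    allowed_groups = VIEW_ACLS.get(view, set())
    return any(user in GROUPS.get(g, set()) for g in allowed_groups)
```

The hard part isn't this check - it's having one authoritative source of groups that the warehouse, the BI layer, and every 3rd party app all defer to, which is exactly what the Workspace/GCP combination gets you and a heterogeneous stack does not.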
It's easy to see how any 3rd party application providing SSO via Google could leverage the same rules. Making that work 'in general', however, for customers on any stack and authenticating via Ethereum (or something) is not easy.
This might not be the best term for this category of challenge but it will do for now. What I mean by this is that there is already momentum in how businesses handle their data and expose it to third parties. The "system of record" SaaS business model is deeply entrenched across the industry, and so are all the second order effects that entails.
For example, businesses know how to evaluate traditional SaaS vendors. They know how to integrate their products into their existing operations. They know how to ingest data from their APIs and they have built entire teams responsible for piecing together holistic views across all of them.
That's just a short list, reality is far messier than that. I guess my point is, even if the tech works, making data applications a reality also means changing behavior at a massive scale. It's totally possible and has happened many times in the past, but it's not a given.
This post is quite broad and very shallow, and I hate myself for that. What it does do though, is set a little bit of context for a lot of writing that's to come. Specifically, I'm planning to drill down into all of the technical issues around making this kind of application a reality. There are many more than the three I outlined earlier.
Despite my apparent scepticism (forgive me, I'm British), I'm actually super bullish on the idea of building fully fledged apps on top of the MDS. I think there's a lot of merit to the idea - it's just nascent, and it will be both technically and organisationally challenging to pull off. The fact there's a lot of work to do just gets me excited.
Until next time!