There has been a lot of talk of building 'data apps' on the modern data stack. It's not clear who exactly is to blame, but it may be Martin Casado [1] and Benn Stancil [2]. But what does it even mean to build a data app? Are there any benefits or downsides? What are the technical challenges? I think the proposition is interesting... but these are tough questions we must answer if the concept is to be taken seriously.
As always, it pays to get our definitions straight. Let me start by outlining what I understand “data apps” to be. Importantly, I do not think we're just talking about internal analytical applications, such as those a tool like Hex might help you build.
While these are apps and are indeed powered by data, they are also smaller in scope than traditional software applications. Internal analytical applications, such as interactive dashboards, typically just involve uni-directional data flow and transformation. They don't deal with concerns like storing and retrieving user input, authentication, authorization, or scaling.
I suppose that implies that true data apps should have the same scope as traditional software applications. If that's the case, what makes them any different to just ordinary apps? The key distinction is the origin of the data they work with. Traditional apps control and manage all of their own data. Data apps are built on top of data that has already been created by some other process. They may add to it, edit it, or simply use it, but the critical distinction is that they work with data that they did not create.
Why is this an important distinction? And why are people talking about it now? Well, it wasn't that long ago that most organisations didn't own that much data, and most of what they did own was siloed in third-party tools. That has changed: many businesses now centralise their data into a data warehouse, and they want to use that data as part of their operations.
Applications no longer have the luxury of pretending that the data they create lives in isolation. They operate within a large pre-existing ecosystem of data, and it would be very convenient if they embraced that. Beyond a customary API, however, they typically don’t, and businesses go to great lengths to work around the limitations imposed by this. Something’s got to give...
So are data apps actually worth building? Like everything, the answer is probably that "it depends."
One compelling advantage of data applications is consistency. If every application works off of the same data set, then you get rid of errors that are caused by there being multiple copies of data. Two systems disagreeing is something that all data professionals are familiar with, and it leads to distrust in the numbers.
Another great advantage in removing silos is completeness. If every application has access to the same data, then there is no need for processes like reverse ETL to put numbers in Salesforce. Salesforce should just have access to the numbers already.
Another possible advantage is portability. When there is just one source of truth for data and you own it, switching costs for third-party apps are lower. This is the dream and the ultimate rebellion against the economic moat that is the "system of record." Burn in hell, Salesforce!
So what do we lose in the world of data applications?
One glaring issue I see is the risk of centralising decision-making around buying software and building internal applications. There’s a lot of power in a sales team being able to buy and start using a CRM without anyone else having to be involved. If every app requires integration with the organisation’s data, then every buying decision requires getting the data team involved. Moving fast matters in business, and bottlenecking on the data team is not moving fast.
In my opinion, we need to think carefully about that and ensure that different parts of the business are able to move independently. Likely this involves creating clear boundaries of ownership for a business's data model, and tooling that allows simple integrations to be as quick and easy as a couple of clicks, or calling your Salesforce account rep... Doing this without recreating the problems of silos and inconsistency is very difficult, and I don't have an answer. (Maybe this is what the data mesh was about?)
Another potential downside is a loss of separation between analytical and transactional workloads. Each requires a different data model to do its job well, and that seems unlikely to change soon. However, if we all start working off of the same data store, then what’s stopping the boundaries from blurring? This may sound innocuous, but there is a reason that it is best practice for microservices to never share a data store and to interact only through application interfaces. Without clear boundaries, people come to depend on things that they were not meant to, and the result is that the two use cases end up intertwined and unable to move independently. Sadly, you can’t rely on discipline or good faith to manage large systems, so I think this needs some careful thought. It’s quite a similar idea to Hyrum’s Law [8], which is worth checking out if you haven’t heard of it.
The biggest downside, perhaps, is that right now this is all just hot air, and there is a vast amount of work to be done if it's ever to actually happen.
I think there are two broad ways to categorise the challenges we need to overcome. The first category is technical challenges, which is fairly self-explanatory. The second I've called organisational challenges, which I'll explain further down.
There are a huge number of technical challenges to overcome. I won't pretend to know what they all are, but here are three that I think are particularly challenging (and interesting).
In order for us (and third parties) to build apps on top of our data, we will need to establish stable interfaces to it. This is a very old and common idea in more traditional software development but something that has not had that much attention yet in the world of data.
Put simply, for two applications to interact, there must be an agreed upon convention for their interaction. This convention is typically referred to as an interface, or contract. This idea permeates every level of software, from calling conventions in ABIs [9] all the way up to OpenAPI specifications for web services. Ideally, the interface never changes, and successful interfaces tend to go to extreme lengths to avoid breaking their contracts. When contracts are stable, people have confidence in building upon them — a good example is the Java language spec, which promises backwards compatibility and delivers it. You can take something written in the days of Java 1 and run it just fine using the latest Java 17 JDK. This stability is a big part of why Java is (unfortunately) such a popular choice for building serious systems.
Without some sort of equivalent idea for data, it is not really possible for a third-party app to come in and just run on top of your warehouse. The questions are: What does this even look like for data? Do we adopt standardised data models? Do applications provide contracts for businesses to map their data to? How do we specify data contracts? Are they just schemas? Or do we need something more akin to GraphQL or OpenAPI? Can we just use GraphQL or OpenAPI? Is this just dbt exposures?
To me, these questions are both important and fascinating.
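To make the idea a bit more concrete, here's a minimal sketch (in Python, purely illustrative) of what a data contract could look like for a hypothetical `analytics.orders` table: the owning team promises a set of columns, types, and a freshness SLA, and a consuming app can check the warehouse against that promise before relying on it. None of the names or fields here are a standard; they're just the shape such a contract might take.

```python
from dataclasses import dataclass

# A minimal, hypothetical data contract: the owning team promises that a table
# with this name, these columns, and this freshness will keep existing, and a
# consuming app validates the warehouse against the promise before building on it.

@dataclass(frozen=True)
class Column:
    name: str
    dtype: str           # e.g. "string", "timestamp", "numeric"
    nullable: bool = False

@dataclass(frozen=True)
class DataContract:
    table: str
    version: int
    columns: tuple
    freshness_sla_minutes: int   # how stale the data is allowed to get

    def violations(self, observed_schema: dict) -> list:
        """Compare an observed {column: dtype} schema against the contract."""
        problems = []
        for col in self.columns:
            if col.name not in observed_schema:
                problems.append(f"missing column: {col.name}")
            elif observed_schema[col.name] != col.dtype:
                problems.append(
                    f"type drift on {col.name}: "
                    f"expected {col.dtype}, got {observed_schema[col.name]}"
                )
        return problems

# The hypothetical 'orders' table a third-party app might want to build on.
ORDERS_V1 = DataContract(
    table="analytics.orders",
    version=1,
    columns=(
        Column("order_id", "string"),
        Column("customer_id", "string"),
        Column("ordered_at", "timestamp"),
        Column("total_amount", "numeric"),
        Column("discount_code", "string", nullable=True),
    ),
    freshness_sla_minutes=15,
)

# Pretend we introspected the warehouse and someone has renamed total_amount.
observed = {"order_id": "string", "customer_id": "string",
            "ordered_at": "timestamp", "amount": "numeric"}
print(ORDERS_V1.violations(observed))
# -> ['missing column: total_amount', 'missing column: discount_code']
```

In practice something like this might live as YAML next to a dbt model, or as a richer OpenAPI/GraphQL-style spec; the important part is that it's versioned and that breaking it is treated as seriously as breaking an API.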
Performance is probably best split into read and write performance, as each has its own set of issues.
On the read side, the main problems are latency and throughput. Cloud data warehouses have incredible read performance on analytical queries, but when you start opening up the data to the kind of interactivity you might expect out of JIRA (badum tss), you need a very different kind of read performance. Users of these applications expect interactivity, which means <100ms response times, and that means you can't hit Snowflake on every request, or the apps really will feel like JIRA. The second problem is throughput: these kinds of applications need to be able to support a large amount of concurrent use while remaining interactive.
This is relatively easy to solve, but is an important consideration. Cube.js [3] is a great example of how to approach this problem, although something like ReadySet [4] for the MDS could be even better (because you wouldn't need to manually configure pre-aggregations).
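To illustrate the read-side pattern (a toy only, not how Cube.js or ReadySet actually work), the basic move is to put a cache or pre-aggregation layer between the app and the warehouse, so repeated interactive queries don't each pay the warehouse's latency. The `run_warehouse_query` function below is a stand-in for a real Snowflake/BigQuery client.

```python
import time

def run_warehouse_query(sql: str) -> list:
    """Stand-in for a real warehouse client; pretend this is a slow analytical query."""
    time.sleep(1.5)
    return [("2024-01-01", 42), ("2024-01-02", 57)]

class CachedWarehouse:
    """Serve repeated queries from a short-lived in-memory cache instead of the warehouse."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._cache = {}   # sql -> (cached_at, rows)

    def query(self, sql: str) -> list:
        now = time.monotonic()
        hit = self._cache.get(sql)
        if hit and now - hit[0] < self.ttl:
            return hit[1]                      # fresh enough: skip the warehouse
        rows = run_warehouse_query(sql)        # slow path: go to the warehouse
        self._cache[sql] = (now, rows)
        return rows

wh = CachedWarehouse(ttl_seconds=60)
sql = "select order_date, count(*) from analytics.orders group by 1"
t0 = time.perf_counter(); wh.query(sql); print(f"cold: {time.perf_counter() - t0:.2f}s")
t0 = time.perf_counter(); wh.query(sql); print(f"warm: {time.perf_counter() - t0:.4f}s")
```

Real systems do this far better: they pre-aggregate on a schedule, normalise queries so different requests can share cache entries, and invalidate when new data lands rather than relying on a wall-clock TTL.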
Another possible approach to this is the 'all-in-one', or HTAP [11], database, e.g. SingleStore [5], which in theory will just handle this problem for us. HTAP isn't widely used, though, and a better outcome for the industry, in my opinion, would be a solution that could work with any data store.
As for the write side, there is far more work to do. At the heart of the issue is the fact that software engineering (engineering in general?) is about tradeoffs. The design decisions that make Snowflake so good at analytics are the same design decisions that make it beyond useless for transactional workloads.
A couple of promising approaches I've seen are things like SingleStore or hydras.io [6], which both try to reconcile this gap between OLTP and OLAP in different ways. I'm rooting for the hydras team because they were on my YC batch and they're great :).
Another possible approach might be for the data application to manage writes in its own application database and then replicate them to the underlying 'source of truth' (the data warehouse?). This is basically what hydras tries to make easy, and it's how most businesses get their application data into the warehouse anyway (often using tools like Debezium [7]). I guess I'm trying to say there's no new idea here, but having this kind of thing done by default, in a first-class way, is interesting.
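To sketch the shape of that pattern (and only the shape; this isn't how hydras or Debezium actually do it), imagine a data app that writes to its own transactional database and records every change in an outbox table within the same transaction; a background job then forwards the outbox to the warehouse. All the table and column names here are made up.

```python
import json
import sqlite3

# The app's own transactional database (SQLite standing in for Postgres et al.).
app_db = sqlite3.connect(":memory:")
app_db.executescript("""
    CREATE TABLE comments (id INTEGER PRIMARY KEY, order_id TEXT, body TEXT);
    CREATE TABLE outbox   (id INTEGER PRIMARY KEY, payload TEXT, shipped INTEGER DEFAULT 0);
""")

def add_comment(order_id: str, body: str) -> None:
    # The user-facing write and its outbox record commit atomically,
    # so a change can never be lost between the app DB and the warehouse.
    with app_db:
        app_db.execute("INSERT INTO comments (order_id, body) VALUES (?, ?)",
                       (order_id, body))
        app_db.execute("INSERT INTO outbox (payload) VALUES (?)",
                       (json.dumps({"table": "comments", "order_id": order_id, "body": body}),))

def ship_to_warehouse() -> None:
    # Stand-in for loading into Snowflake/BigQuery; here we just print each change.
    with app_db:
        for row_id, payload in app_db.execute(
                "SELECT id, payload FROM outbox WHERE shipped = 0").fetchall():
            print("replicating to warehouse:", payload)
            app_db.execute("UPDATE outbox SET shipped = 1 WHERE id = ?", (row_id,))

add_comment("order-123", "Customer asked to delay shipping")
ship_to_warehouse()
```

Change-data-capture tools do essentially the same job by tailing the database's write-ahead log rather than requiring an explicit outbox table.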
Once we start opening up our data to everyone, we are rapidly hit with the problem of controlling who can see what and when. Many organisations struggle with this, and I'm sure many would say that they want to do it better! This problem is amplified when we start talking about third-party applications touching our data. I think we can expect privacy laws and standards around data management to tighten over time, and so, making this work correctly is very important to the data application idea.
The solution to this is role-based access control (RBAC [10]), which is pervasive in the software world but not as much in the world of data. There are islands of sanity, however. If you’re on a full Google stack (Google Workspace and BigQuery), you’ve got it pretty good when it comes to RBAC. You can just say that “this column on this view is only visible to people in the Finance group,” and it all just works as people join and leave that group. Data access control that HR can manage without even knowing they’re managing it.
It's easy to see how any third-party application providing SSO via Google could leverage the same rules. Making that work “in general,” however, for customers on any stack and authenticating via Ethereum (or whatever is hot these days) is not easy.
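As a toy illustration of the rule described above (and not how BigQuery or Google Workspace actually implement it), here's a sketch of column-level RBAC: groups carry per-column read grants, users inherit grants through group membership, and results are filtered before they're returned. The group, user, and column names are invented for the example.

```python
# Toy column-level RBAC sketch: groups grant per-column read permissions,
# users inherit them via membership, and rows are filtered before being returned.

GROUP_COLUMN_GRANTS = {
    "finance": {"orders": {"order_id", "customer_id", "total_amount", "discount_code"}},
    "support": {"orders": {"order_id", "customer_id"}},
}

USER_GROUPS = {
    "alice@example.com": {"finance"},
    "bob@example.com":   {"support"},
}

def visible_columns(user: str, table: str) -> set:
    cols = set()
    for group in USER_GROUPS.get(user, set()):
        cols |= GROUP_COLUMN_GRANTS.get(group, {}).get(table, set())
    return cols

def filter_rows(user: str, table: str, rows: list) -> list:
    allowed = visible_columns(user, table)
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]

rows = [{"order_id": "o1", "customer_id": "c9", "total_amount": 120.0, "discount_code": None}]
print(filter_rows("bob@example.com", "orders", rows))
# -> [{'order_id': 'o1', 'customer_id': 'c9'}]
```

The nice property from the Google example carries over: moving someone out of the finance group changes what they can see without anyone touching the grants themselves. The hard part is making this work across arbitrary identity providers and data stores.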
This might not be the best term for this category of challenge, but it will do for now. What I mean by this is that there is already momentum in how businesses handle their data and expose it to third parties. The "system of record" SaaS business model is deeply entrenched across the industry, and so are all the second-order effects that entails.
For example, businesses know how to evaluate traditional SaaS vendors. They know how to integrate their products into their existing operations. They know how to ingest data from their APIs, and they have built entire teams responsible for piecing together holistic views across all of them.
That's just a short list; reality is far messier than that. I guess my point is, even if the tech works, making data applications a reality also means changing behaviour at a massive scale. It's totally possible and has happened many times in the past, but it's not a given.
This post is broad and very shallow, and therefore asks far more questions than it answers. What it does do, though, is set a little bit of context for a lot of writing that's to come. Specifically, I'm planning to drill down into all of the technical issues around making this kind of application a reality. There are many more than the three I outlined earlier.
Despite my apparent skepticism (forgive me, I'm British), I'm actually super bullish on the idea of building fully fledged apps on top of the MDS. I think there's a lot of merit to the idea - it's just nascent, and it will be both technically and organisationally challenging to pull off.
Until next time!
[1] https://www.youtube.com/watch?v=q1nERFM9brA&t=3458s
[2] https://benn.substack.com/p/the-data-app-store
[5] https://www.singlestore.com/
[8] https://www.hyrumslaw.com/
[9] An ABI is an ‘Application Binary Interface’. Basically, application binaries adhere to a set of conventions that allows the operating system (or some other program) to know how to load and execute them. https://en.wikipedia.org/wiki/Application_binary_interface
[10] RBAC stands for ‘Role Based Access Control’. It is a common pattern/data model for assigning permissions to access resources to ‘roles’. Users can then be given a role and inherit the associated permissions. https://en.wikipedia.org/wiki/Role-based_access_control
[11] HTAP databases are ‘hybrid transactional/analytical processing’ databases. They claim to handle both transactional and analytical workloads in one system. https://en.wikipedia.org/wiki/Hybrid_transactional/analytical_processing