How do we design DATA INTENSIVE applications?
In this round of Cocktails, we talk to the co-founder of Rapportive and author of the critically acclaimed book “Designing Data-Intensive Applications”. We delve into the benefits of “local-first software”, a project which aims to enable both software collaboration and ownership, with the ability for users to work offline, while also improving the security, privacy, long-term preservation, and user control of data.
- Martin Kleppmann describes his role as a Senior Research Associate at the University of Cambridge and what led him to the path towards academe.
- We learn the concept behind “local-first” software and how it can change the game for ownership in cross-team collaborations.
- Kleppmann answers what helped his book, Designing Data-Intensive Applications, stay relevant in the fast-changing tech space four years after it was published.
- We find out the basic principles in making data-intensive applications reliable, scalable and maintainable.
- Kleppmann shares his thoughts on what has changed his vision for the future of data systems.
All right. And our guest for today is a researcher in distributed systems at the University of Cambridge. Previously, he was the co-founder of Rapportive, which was acquired by LinkedIn in 2012. He's also the author of Designing Data-Intensive Applications, described by the Chief Technology Officer of Microsoft as “required reading for software engineers.”
He is a regular conference speaker, blogger and open source contributor. He believes that profound technical ideas should be accessible to everyone, and that deeper understanding will help us develop better software.
Hello! Thank you very much for having me, and thank you for that very kind intro.
All right. So, before we dive into the more technical questions, we want to know more about you. Can you please tell us about your role as Senior Research Associate and affiliated lecturer at the University of Cambridge?
Yeah, it's a role I sort of fell into by accident a little bit. So, I'd say I do a mixture of research and teaching. Research is the primary focus, which means I'm spending a lot of time thinking through algorithms or trying to write some code, trying to make it better, and then writing those things up in the form of research papers and trying to get those published. And I work with various collaborators. And I can tell you a little bit more shortly about the topics that we work on. But before then, I spent a little while full-time writing my book, the book that you mentioned. And then before that I was a software engineer in industry. So, I did the whole “Silicon Valley internet companies” thing. We started a startup that we moved to San Francisco, and we were part of that exciting ecosystem there for a while.
How do you make that transition from having a successful startup to a life in academe? Why would you, and what made you make that choice?
Yeah, it's a slightly unusual thing to do. For me, I think it was the right thing. So, I enjoyed the startup time in terms of getting practical hands-on experience of building systems. But after a while, I also got a bit frustrated that it was all very short-term. Like, you're always just thinking like one week ahead. “Okay. What is the next thing we need to ship? What's our next sprint?” You're always fighting fires. You're very close to the next step of what you're building. Whereas really, what I was hoping for was something where I could think a bit longer term, have the luxury to actually try and attack problems that are hard and will take some time to solve properly, but which will be valuable if we can solve them.
And so, when I left LinkedIn in 2014 or so, first of all, I took a year off as a sabbatical to work full-time on my book. And during that time, I spent a lot of time reading, mostly background research for the book. And that sort of drew me into research a bit because I was doing one aspect of research at least, which is understanding the literature of what's been said before. And I had these ideas for technologies that I thought would help users gain better ownership of the data that they create. It wasn't quite well formulated at the time.
But you know, I had this feeling that cloud software was a bit of a dead-end, you know? In a way, cloud software's able to do so many wonderful things. Like with Google Docs, we can collaborate on a document in real time. We don't have to send it back and forth as a Word document or as an email attachment anymore. You can just have everyone log in and edit at the same time. And it makes things so much more convenient. But at the same time, there's this risk that if Google decides to lock your account, then you're locked out of Google Docs and you're locked out of every document that you've ever created on Google Docs. And so all of your data is held hostage by Google in this case, essentially. And that's a huge risk, that an automated system decides you violated the terms of service. This happens all the time. Apparently, millions of Google accounts get closed every year, just on the basis of some automated system that decided that you violated the terms of service, and then that's it.
You have no more access to anything you ever created with a Google account with no warning and no recourse. And I thought that's really terrible. Wanting to solve that problem is part of what got me into research. So, then I started looking at algorithms and techniques that would allow us to build collaboration software which behaves the same as Google Docs, and has the same kind of convenience. You can have several people editing in real-time and so on, but also which makes sure that every user has a copy of the data on their own computer, where nobody can take it away from them. If it's a file on your own computer, you know, that's something much more concrete, much more tangible, much safer.
So, “local-first”, the name implies that the file, first and foremost, exists on your local drive, and then it's backed up to cloud storage. We still have that sort of concept in a lot of applications today. So, as opposed to just a cloud backup, how does your concept differ from that?
Yes. Well, nowadays, you can have a file in Dropbox or in Google Drive. Dropbox or Google Drive doesn't really look inside that file. It just treats it as a sequence of bytes, and you can have a Word document or a Markdown document or any other file format you want in there. And as long as you are the only person who's editing it, then everything is fine. Life is simple. The problems start arising when you've got several people contributing to the file. What happens, for example, if I modified the file and independently you also modified the file, and now we both save the file? In the case of Dropbox, for example, what you get is a file conflict. Dropbox will detect that the file was modified by two different users at the same time.
So it will give you two copies of the file, one containing your changes, and one containing your colleague’s changes. And now it's up to you to manually merge those two things back together again. So, good luck. Maybe if you're lucky, your software provides some kind of diffing view, which allows you to compare the two files. Otherwise it's going to be an extremely manual process, very labor intensive. And so that problem of having to do merges manually, we don't have that in Google Docs because Google Docs is constantly merging all of the users’ changes automatically. And so, we want to take that same concept of automatic merging of the file versions.
A similar thing kind of happens in Git. So in Git, you know, each user can work off on their own branch. You can make a commit, even if you're not connected to the internet right now, you can just do that offline on your computer. You can make as many commits as you like. And then at some point, you decide, “Okay, I'm ready to share my work. I'm going to push it to GitHub now,” for example, or make a pull request and then the other people can decide to merge that in. Again, we've got this kind of branching and merging workflow here. And again, in Git, well, the merging can happen automatically. Like, if you're editing two different files in the two different pull requests, then Git will very happily merge those. If you're editing different parts of the same file, I edit the top of the file, you edit the bottom of the file, then it'll probably still be able to merge those automatically. If you edit the same lines or very close by lines in the same file, then Git will give you a merge conflict and leave it to you to resolve.
But all of this merging and merge conflict detection works only when Git is working with plain text files, like source code files. If you put any other file format into Git, like anything that Git would call a “binary” file, it can't do automatic merging because it doesn't understand the file format. And so again, you're back to this situation of having to merge files manually. And you know, doing things as plain text is fine for software engineers. But people in real life work on, say, spreadsheets, and spreadsheets are not plain text. Or they work on CAD drawings, or the architectural plans for a building, or the score for a movie, or those types of things. And, you know, you can't really represent those things well as plain text formats. Generally they are going to be some sort of binary formats produced by some higher level software.
And so, where we're trying to get to with local-first software is that all these different types of software that produce all these different file types can continue storing their data in files on the local disk. But when several people independently modify their copies of their files, we can merge those together. And moreover, if several people want to work in real-time together, then we can also enable that sort of real-time collaboration, which is something you don't really get otherwise: seeing, character by character, what somebody else is typing.
And the cool thing is that we can actually do all of those things using just one programming model. So, we have one technique, which is called CRDTs, which I can explain a bit more about if you're interested. And that allows us to do all these nice things like real-time collaboration, but while storing the file on your local disk. It allows us to do asynchronous collaboration, which is the Git-style pull request type of workflow but with automatic merging. It can allow us to work offline on that document and then merge with other users when they come back online again. And it can even allow things like having several people in a remote location collaborating with each other over a local network while they're disconnected from the wider internet. So, just using device-to-device communication, like Bluetooth, is also sufficient. There doesn't necessarily have to be any servers involved in this type of software at all. And I found it really cool because it just enables so many new types of workflows and new types of applications and models for collaboration that current cloud software does not have. And at the same time, it also is better for users because it reduces the risk of, say, a cloud vendor going out of business and then taking all of the data away with them.
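To make the CRDT idea a little more concrete, here is a minimal sketch of one very simple conflict-free replicated data type, a last-writer-wins map. It is an illustration only and not how Automerge works internally; the class name `LWWMap` and its methods are made up for this example. Note that last-writer-wins keeps only one of two concurrent writes to the same key, whereas CRDTs for text merge edits at a much finer granularity.

```python
from dataclasses import dataclass, field


@dataclass
class LWWMap:
    """A last-writer-wins map: one simple kind of state-based CRDT (hypothetical)."""
    replica_id: str
    clock: int = 0
    # key -> (timestamp, replica_id, value). The (timestamp, replica_id)
    # pair totally orders writes, so every replica picks the same winner.
    entries: dict = field(default_factory=dict)

    def set(self, key, value):
        self.clock += 1
        self.entries[key] = (self.clock, self.replica_id, value)

    def get(self, key):
        return self.entries[key][2]

    def merge(self, other):
        """Fold another replica's state into this one."""
        for key, theirs in other.entries.items():
            mine = self.entries.get(key)
            if mine is None or theirs[:2] > mine[:2]:
                self.entries[key] = theirs
        # Keep the local clock ahead of every write seen so far.
        self.clock = max(self.clock, other.clock)


alice = LWWMap("alice")
bob = LWWMap("bob")

# Both replicas edit while disconnected from each other.
alice.set("title", "Local-first notes")
bob.set("body", "Written offline")
bob.set("title", "Draft")
bob.set("title", "Local-first software")

# Sync directly, device to device, in either order; no server is involved.
alice.merge(bob)
bob.merge(alice)
```

After the two merges both replicas hold the same state, and merging again changes nothing. That property (merges can be applied in any order, any number of times, and still converge) is what lets the same model cover offline work, Git-style branching, and real-time collaboration.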
How mature is the system, the protocol to facilitate this? Is it just something which is readily deployable today or still in the research phase? Where are you at?
You mentioned it's an open source project. Are there commercial applications for yourself or the university with this?
So, at the moment there's no commercial interest behind it. So, we as researchers are maintaining it as part of our research activities. We're not trying to make any commercial product out of it. Maybe one day the time will be right to try and commercialize it. But I don't think we're quite there yet. I'll be honest with you. It's still a fairly early stage technology. It works, like we have a good test suite and it's pretty robust. It's a bit slow at the moment and it uses quite a lot of memory. So, one of my main focus areas at the moment is just to improve the performance. There we have a long way to go, but also we have some very promising approaches that we're trying. So, whether it's fast enough or not right now depends a bit on the application.
What type of applications do you see for it? Is it in that data privacy security type space, or is it in certain verticals? Like you mentioned CAD, for example. I can also imagine, when you're working on large Photoshop files and working in the cloud is not necessarily ideal. Are there any sort of markets you see as a natural fit for it?
It's quite broad, but of course we do need to start somewhere. So, one of the first production use cases that Automerge currently has, which I find quite interesting, is with the Washington Post, the newspaper. They've put Automerge into production in the internal tooling for updating the website. So, their main website, washingtonpost.com, if you look at it, it's like several columns. In each column, there are articles. Each article may have an image, or may not. It'll have a headline with varying font size, with varying text, and underneath the headline there might be extra stuff. They might move the layout around or rejig it from time to time too, based on what's happening in the news.
And all of this layout is set up manually by editors. And there's a team of editors working around the clock at the Washington Post, so that whenever some important news comes in, they will figure out where to slot it in on the home page, what old news to take out and so on. And for this, they have their own in-house piece of software that allows them to edit this. And since they have several editors working on the homepage at the same time, they need a collaboration workflow. Moreover, they don't just want one editor to do a click, make a change and immediately [it’s] on the live website. Instead, they have a review workflow where one editor can essentially accumulate some changes that they want to make on what you would call a private branch in Git, kind of operating on their own private copy of the homepage. And they can drag things around, see what it would look like. Once they're happy with it, they'll click a button to request a review from a colleague. The colleague will then see what this person has done, and will also see what people have done to other sections of the website. Like, some people might be working on the news section, other people might be working on the sports section. And so we want to merge those edits together automatically. And at some point they decide, “Okay, we're happy with the layout now,” and they hit the “publish” button and it goes out to the live website.
So yeah, it's really nice. And what I find interesting about it is that, you know, it has quite a real-time collaboration element, but it also has this element of different users working in their own private copies for a while until they're ready to share their work. And then at the point where they are ready to share it, they hit the button and it becomes part of a shared document. Using Automerge allowed them to seamlessly combine those worlds, because Automerge is perfectly happy for you to have different branches and forks of a document, and for different people to have different views for a while, and then to reconcile those views when you're ready to reconcile them. Very cool.
Let's talk about your book, Designing Data-Intensive Applications, published in 2017. Four years on, the book is still going well, receiving positive reviews on Amazon. It's obviously maintained its relevance over the last four years. What do you think it is about the book that has sustained its relevance today?
Although people say tech is so fast changing... I found that the fundamentals actually change surprisingly slowly.
And so, on that, there's a bunch of commercial products. There's a bunch of open-source projects. Everyone claims that they're the best at everything. Obviously that can't be true because nobody is always the best at everything. Each project always has its strengths and weaknesses, but a lot of projects are not very good at articulating what their strengths and what their weaknesses are. And so what I wanted to try with this book is to really figure out, “Okay, what are the fundamentals?” Essentially, like if you want to store data, there might be three different primary ways you can do it. There’s Approach A, Approach B, Approach C. Then we can say, “Okay, let's categorize the products that exist. Databases X, Y, and Z store data according to Approach A. Databases E, F, G, and H store data according to Approach B…” and so on.
And so this now kind of helps people build up a bit of a mental map of the landscape, and in that way helps them figure out, roughly at least, what set of products they should be looking at. For example, you might have a system that needs to store large batches of data quickly and then be able to query over them all, or a system where data comes in only slowly but then gets queried many, many times, or a system where new writes come in at a fast rate but actually don't get queried that often, and so on. It depends on what your access patterns are for a system, what your consistency requirements are, and so on.
There are ways of figuring out which tools are better for the job and which are less good at the job. And I think part of what has made this book useful to people is that I don't try to teach people how to use a particular product because there's plenty of documentation out there. If you want to learn all the features of Postgres, that's fine. Just read the Postgres documentation, it's perfect. What I try to do is to help you figure out in which circumstances you would use Postgres versus in which circumstances you would use some totally different database system.
Where did your passion for data come from? Was it your time with your startups with Rapportive or you know, when you went in with LinkedIn and they're massive datasets with streaming services? Where did all that come from?
Yeah, I think certainly like when we were at Rapportive, we were dealing with a moderately large data set at the time, and we did struggle with it a bit. Like, we had essentially just one big database that we tried to put everything in, and trying to get the performance of that database to be what we wanted was always a bit of a challenge. So, then I started learning a bit more about techniques for scalability that would allow us to grow that further and still do the kind of operations on that database that we needed to. And then when I got to LinkedIn I started working on their stream processing efforts. So, this was just in the early days of Apache Kafka. So Kafka had just been made open source. But this was before Confluent spun out of LinkedIn and started to commercialize Kafka.
And we were just in this exciting time of trying to figure out how do you best use these tools? Like, okay, we've got the streaming log abstraction provided by Kafka. What sort of processing primitives can we provide on top of it? How do we make them scalable and reliable? LinkedIn was operating at a pretty large data volume for these things, so we wanted to be efficient. We wanted to make sure that we could just set up a job and have it run reliably without getting paged in the middle of the night and so on. So, there's a lot of motivation that comes from those sorts of personal experiences of trying to build systems and then later trying to learn the lessons from building those systems.
You’ve described data intensive applications as they should be reliable, scalable, and maintainable. So what approaches can people take to achieve this?
Well, it's hard to give a very short answer because essentially the book is a very long winded, 700 pages, answer to that question.
But are there some basic principles that they should be looking at?
As a basic principle I would try to be very conscious of exactly the operations that are happening and how often they're happening and how they can best be enabled. And so, for scalability, scalability is not a one-dimensional property. It doesn't make sense to say a system is scalable or non-scalable without saying what it's scalable with respect to. Generally scalability means that you can increase something, and that something might be the amount of data that it stores, or the number of queries it handles per second, or the number of distinct customers who are using it, or the number of concurrent users using it at any one time, or any of these various metrics of how busy the system is. And as that metric grows, you want the system as a whole to still provide reasonable performance.
And then performance, again, is not a single property. You could be measuring, like, is it the latency of a request until the request gets a successful response? Is it the throughput in terms of gigabytes per second? What is your metric of performance that you're trying to optimize here? And so, I think the whole domain of scalability essentially is wanting to say, “Okay, if I increase the load in a certain way, where load is defined in some way that makes sense for my application, then I want the performance to still remain good, where performance is defined in some way that makes sense for my application.” And then once you've broken it down like that, I think you have a degree of clarity and you can say, “Okay, what we're trying to do is just to store the maximum amount of data possible, and we're not going to worry about how it's going to get queried.” Or, “We're going to make sure that we make our queries really fast, and so we need to make sure that our scalability is in the query layer,” and so on.
So, I think that's how I would approach this really, because the concrete steps that you would take to make an application scalable depend massively on what the application is and what it needs. But the steps that you can take in order to figure out how to do that, they're repeatable. So, it's the types of questions that you need to ask yourself, and those are the sorts of questions that the book tries to teach you to ask.
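As a rough illustration of pinning down “load” and “performance” before talking about scalability, the sketch below summarizes one measurement window as requests per second (one possible load metric) and median and 99th-percentile latency (one possible performance metric). The function names and the sample numbers are invented for the example; which metrics actually matter depends on the application.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile (integer p in 1..100) of a sorted, non-empty list."""
    rank = (p * len(sorted_values) + 99) // 100  # ceiling of p * n / 100
    return sorted_values[max(rank, 1) - 1]


def summarize(latencies_ms, window_seconds):
    """Describe one measurement window: load as requests per second,
    performance as median and tail latency."""
    ordered = sorted(latencies_ms)
    return {
        "requests_per_second": len(ordered) / window_seconds,
        "p50_ms": percentile(ordered, 50),
        "p99_ms": percentile(ordered, 99),
    }


# Hypothetical measurements of the same service at two load levels.
light_load = summarize([12, 15, 11, 14, 13, 12, 16, 13, 12, 40], window_seconds=1)
heavy_load = summarize([30, 35, 28, 33, 31, 29, 400, 32, 30, 34] * 10, window_seconds=1)

for name, stats in (("light", light_load), ("heavy", heavy_load)):
    print(name, stats)
```

In this made-up data the load grows tenfold while the median latency rises from 13 ms to 31 ms and the 99th percentile from 40 ms to 400 ms. Whether that counts as “still good” is exactly the application-specific question described above.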
And you don't just talk about systems and architecture. You also talk about data models, which you've described as one of the most important parts of developing software. Run us through that, the importance of data models and your thought process behind those.
Yeah. When people compare data systems, often data models are like the first thing they focus on because it's just the thing upfront, really. So, say, there was a phase 10 years ago or so when MongoDB came out, and there were a bunch of other document databases that presented themselves as alternatives to the relational model. And they were saying, “Okay, like, it's much nicer to group your data together into these JSON documents rather than having it spread out across a bunch of rows in a relational database.” And this is the data model question, right? And then people looked at that and they said, “Yeah, okay, they have some points there.” But actually, over time, what we've seen is that these two different data models have converged somewhat.
And so a lot of relational databases now actually have pretty good JSON support, Postgres and MySQL included. So the need for a dedicated type of database to handle the sort of document model data is not as pressing anymore, because other databases can actually do that. Conversely, in the other direction, some of the document databases have started adopting relational style query languages because they realized that that is actually a really useful feature as well. So, for a while there was this phase where people said relational and document oriented are like these enemies, that they're total opposites of each other. And then it turned out that actually, the two just merged. You can think of them not as two different things, but just as two different aspects of a data model that may well be implemented in the same system.
And we can apply similar arguments with other types of data models as well. So, like a graph data model is another one that I quite like. I'm personally quite a fan of graphs because I find them a very flexible way of describing data, like relationships between things. In particular, graphs tend to be very extensible. So, if you want to add a new property to something or a new type of relationship between different entities, it's very easy to do that. But how do you represent a graph? Well, you can represent a graph on top of a relational database, for example, that's perfectly fine. You don't necessarily need to have a specialist graph database, although a graph database might be able to do some things faster than the relational database. Like, if you want to do some shortest path queries, for example, or other kinds of queries that depend on variable length paths through a data set, those are things that SQL databases don't currently support very well, but they do kind of support them as well.
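As a small sketch of representing a graph on top of a relational database, the example below stores vertices and edges in two SQLite tables and uses a recursive common table expression for a variable-length path query. The table layout and the sample data are assumptions made for illustration, not a schema taken from the book or from any particular product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (id TEXT PRIMARY KEY, label TEXT);
    CREATE TABLE edges (
        source TEXT REFERENCES vertices(id),
        target TEXT REFERENCES vertices(id),
        label  TEXT
    );
""")
conn.executemany("INSERT INTO vertices VALUES (?, ?)", [
    ("alice", "person"), ("bob", "person"),
    ("carol", "person"), ("dave", "person"),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    ("alice", "bob", "follows"),
    ("bob", "carol", "follows"),
    ("carol", "dave", "follows"),
    ("alice", "carol", "follows"),
])

# Variable-length path query: whom can alice reach, and in how few hops?
# The hop limit of 5 is an arbitrary bound so that cycles cannot recurse forever.
rows = conn.execute("""
    WITH RECURSIVE reachable(id, hops) AS (
        SELECT target, 1 FROM edges WHERE source = ?
        UNION
        SELECT e.target, r.hops + 1
        FROM reachable AS r JOIN edges AS e ON e.source = r.id
        WHERE r.hops < 5
    )
    SELECT id, MIN(hops) FROM reachable GROUP BY id ORDER BY id
""", ("alice",)).fetchall()

print(rows)
```

The query works, but it is noticeably more verbose than the equivalent in a dedicated graph query language, which fits the point that SQL databases “kind of” support this style of query.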
And so then again, I feel like, okay, we've got this graph data model, which is a useful, interesting contrast to the relational model, but at the same time, there's also a bit of convergence going on where databases essentially steal the best ideas from other data models and incorporate that.
Other models have been around for a while, but they don't seem to have gone mainstream. They still seem to have a certain segment of the market, as opposed to SQL data models, which obviously dominate. And you mentioned NoSQL-type JSON data models, which have become bigger in the last 10, 15 years. What is it about the graph data model that means it hasn't seen the same type of adoption?
I'm not sure, really. Because my feeling is that it's actually a really good fit for a large class of applications. And certainly a lot of the big companies that publish about the way they structure their data have adopted graph data structures. Like Facebook, for example, is quite vocal about the fact that everything they have, everything they store is essentially a graph. And so, you know, when you type an update or if you like an update written by somebody else, that “like” is an edge in a graph between yourself and the update that you liked. And the update that you liked has an edge in the graph to the person who wrote it and also to the three other people who are tagged in that update. And then from there, you have an edge to the picture that's included in the update, which then has a link to the vertex representing the location where the picture was taken, and so on.
And this stuff fits beautifully, easily into a graph. And because it's a graph, Facebook can add new types of entities into the system quite easily and maintain all of this sort of rich interaction information. And my sense is that a lot of enterprise apps could really take a similar approach there.
In the final chapter of your book, you talk about the future of data systems. Are we executing on that future, do you think, or has your vision for the future changed?
Yeah, so I explore a whole bunch of more speculative ideas in that chapter, and some aspects are definitely happening. So, what I was trying to think through is what does a world look like in which streaming data flows become more the center of how we design systems? And the reason I was thinking about that is if you think about a typical database query, I want to know how many socks are in stock right now of a particular color, I make a query to the database and I get back, “Okay. There are currently five pairs of socks in stock.” And then what happens if that changes? Well, the database doesn't tell me if that number changes. If somebody buys two pairs, and only three pairs are left in stock, I can find that out only by repeating my query.
Then I'll find out the new result, but there's no way that the system can notify me, “Hey, you earlier queried about the socks, but the stock level for socks has now changed. You might want to know about that.” And so these database queries are still stuck in this very request-response type model. And likewise, most of the APIs that we use now, say REST APIs between microservices, have that exact same request-response model, where you make a request to a service, you get a response back, but then if the response subsequently becomes outdated, there's no way of finding out other than polling again to see if something has changed. Polling is super inefficient. So, really you want some way of getting notified when stuff changes. And that notification really needs to go through all of the layers of the stack, all the way up to the mobile app or the web browser the user is using, because why would you want stale data being displayed on somebody's screen, right?
You'd like to have the ability to update in real time. Some information came from a database, it went through various levels of being rendered and going through business processes and stuff, and eventually it ended up in HTML on somebody's screen. And really, if that information goes out of date, it would be nice to be able to push an update all the way up to the user's screen to reflect the change that has happened. And very few systems are currently set up in a way that really allows those changes in data to be propagated through all of the layers of the stack, but you get streaming systems now built in a few narrow niches.
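The contrast between repeating a query and being notified of changes can be sketched with a toy in-memory store. Everything here (the `Inventory` class, its `subscribe` method, the `screen` dictionary standing in for a user interface) is hypothetical and only meant to show the shape of the idea, not any real database's API.

```python
class Inventory:
    """Toy in-memory store whose readers can subscribe to changes to an item."""

    def __init__(self):
        self._stock = {}
        self._subscribers = {}

    def query(self, item):
        # Classic request-response: the answer can go stale right after returning.
        return self._stock.get(item, 0)

    def subscribe(self, item, callback):
        # Deliver the current value, then every later change, to the callback.
        self._subscribers.setdefault(item, []).append(callback)
        callback(item, self.query(item))

    def set_stock(self, item, count):
        self._stock[item] = count
        for callback in self._subscribers.get(item, []):
            callback(item, count)


inventory = Inventory()
inventory.set_stock("red socks", 5)

screen = {}  # stands in for whatever is rendered on the user's screen


def render(item, count):
    screen[item] = count


inventory.subscribe("red socks", render)  # screen shows 5
inventory.set_stock("red socks", 3)       # somebody buys two pairs
```

After the second write the `screen` value is 3 without the reader having asked again. The hard part in practice, as described above, is carrying that notification through every layer between the database and the screen.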
So, one thing that is becoming quite popular is something called change data capture, where if you have a database, you don't just write your data to the database and read it like usual, but you also capture a stream of all of the changes, all of the updates that are written to the database, and that stream can then be put in something like Kafka, where you can subscribe to it.
And you can have a bunch of consumers decide what to do with that information. Maybe they will update a cache, or maybe they'll update a search index, or maybe they will do some analytics, or maybe they will notify something else that some data has changed. Whatever it is, at least there's now the ability to respond to changes to a database. But this is still quite a long way from this bigger idea of, “Okay, we don't just capture the changes from the database. The next step is that we push them through all of the layers of the stack,” which are currently just probably REST APIs or other kinds of RPC, which don't really support a streaming type data flow.
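A minimal sketch of change data capture, under the assumption that an in-memory list can stand in for a log such as a Kafka topic: every write is also appended to a change log, and independent consumers read that log from their own offsets to keep a cache and a search index up to date. The class and function names are invented for the example and no real Kafka API is used.

```python
class Database:
    """Toy database that records every write in an append-only change log."""

    def __init__(self):
        self.rows = {}
        self.changelog = []  # stands in for a log such as a Kafka topic

    def write(self, key, value):
        self.rows[key] = value
        self.changelog.append({"offset": len(self.changelog), "key": key, "value": value})


class Consumer:
    """Reads the change log from its own offset and applies each change."""

    def __init__(self, apply):
        self.offset = 0
        self.apply = apply

    def poll(self, changelog):
        for change in changelog[self.offset:]:
            self.apply(change)
        self.offset = len(changelog)


cache = {}
search_index = {}  # word -> set of keys whose value contains that word


def update_cache(change):
    cache[change["key"]] = change["value"]


def update_search_index(change):
    # Drop stale postings for this key, then index the new value's words.
    for keys in search_index.values():
        keys.discard(change["key"])
    for word in change["value"].lower().split():
        search_index.setdefault(word, set()).add(change["key"])


db = Database()
consumers = [Consumer(update_cache), Consumer(update_search_index)]

db.write("sku-1", "Red wool socks")
db.write("sku-2", "Blue cotton socks")
db.write("sku-1", "Red cotton socks")

for consumer in consumers:
    consumer.poll(db.changelog)
```

Because each consumer tracks its own offset, a new derived view can be added later and built by replaying the log from the start, without touching the code that writes to the database.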
You can take a leaf out of the book of these real-time collaboration apps that we were talking about earlier, such as Google Docs. Google Docs has the ability to update in real-time on somebody else's screen when something changes in the underlying document. Why don't we have that sort of capability for absolutely all software? All software would update immediately, live on the screen, when something changes in the underlying data. That's going to be hard to get to because so much of our software stack is currently based on this request-response paradigm, and changing that is going to be a very big job.
So, I don't expect this thing to be fully realized in even the next 10 years, I think, because it's just a bit too much of a jump for people. But I do think it's a very interesting idea to pursue, and maybe bits of it will be put into practice. And at least if it inspires people to think a little bit differently about their systems, then maybe it will still have some effect.
Martin Kleppmann, that's some super interesting stuff. You're working on some very interesting things and very interesting ideas. How can our listeners follow you and what you're writing about, talking about?
Well, I'm on Twitter as “@martinkl”, which you're welcome to follow if you like. I occasionally write blog posts, only a few a year, but I try to go into some detail when I do write something. So, on my blog, martin.kleppmann.com, you can also find an email sign up form so that you get a little email when I write a new post. And finally, if you're interested in supporting this kind of thing financially, I did set up a Patreon account with the goal of trying to turn this into a potential career as an independent researcher, not tied to any institution necessarily, but just being able to continue doing the research and the teaching work that I do, perhaps writing books. A second edition of my current book is potentially in the works.
So, those types of things, if you're interested in that and have a bit of money to spare, you're very welcome to chip in. And I send detailed updates to my supporters every month on the latest work that has been happening. So, it's also a way for you to get a front row seat in the research process and see how these kinds of things happen internally and, you know, how the sausage gets made. So if you find that sort of thing interesting, then you might find the Patreon interesting.
Good stuff. And of course, Designing Data Intensive Applications is available on Amazon.com as well. Martin Kleppmann, thank you very much for joining us today and we wish you well on those future projects.
Great. Thank you for having me. Thank you.
- Designing Data Intensive Applications by Martin Kleppmann
- Martin Kleppmann’s blog