Since Kafka was originally developed at LinkedIn in 2011, it has gone on to become the de facto standard for large scale messaging and event processing. Whilst Kafka is open source and managed by the Apache Software Foundation, the original co-creators of Kafka went on to form Confluent to offer commercial services and features on top of the Kafka platform.
In this episode of Cocktails, we talk to a senior developer advocate from Confluent about Apache Kafka, the advantages that Kafka’s distributed pub-sub model offers, how an event processing model for integration can address the issues associated with traditional static datastores, and the future of the event streaming space.
Kevin Montalbo: Welcome to episode 40 of the Coding Over Cocktails podcast. My name is Kevin Montalbo. Joining us from Sydney, Australia is Toro Cloud CEO and founder, David Brown. Hey, David!
David Brown: Good day, Kevin!
KM: Our guest for today is a Senior Developer Advocate at Confluent, the company founded by the original creators of Apache Kafka. He is a top-rated speaker and has been speaking at several conferences since 2009 including QCon, Devoxx, Strata, Kafka Summit, and Øredev. Joining us for a round of Cocktails is Robin Moffatt. Hey Robin, glad you could be here with us!
Robin Moffatt: Thanks for having me!
KM: All right. So, we want to start by asking you about your background at Confluent. What does a Senior Developer Advocate do and how did that experience lead you to become part of the Kafka Summit Program Committee?
RM: So, I develop products. It's a funny role because, you know that kind of meme: what I think I do, what my parents think I do, what my coworkers think I do. And it's like, there's a different one for different roles. But people's perception of developer advocates, if you follow them on Twitter or whatever, it's like they go on airplanes, they go to all these fancy conferences and all this stuff amongst the developer community, everyone's complaining about jet lag and like, “yeah it sucks” and all this kind of stuff.
But in reality, particularly since COVID, what developer advocates do, it's all about advocating for and to developers. So, it's like working with developers to help them get on their journey with whichever software or technology or concepts are going to be a benefit to them, and usually to the company or the foundation or whatever it is that's employing the advocates.
So, I work for Confluent, like you say. So, I've got a big interest and I've got a big passion for Apache Kafka, which is like how I got into this role. So, it's about helping developers and architects understand, well, Apache Kafka, what Confluent can do for them and how to get the most out of it. And also, where does it work for them. So, it's one of these funny things. It's not marketing, it's not sales, it's not engineering, it's not product. It's kind of all of them and none of them, and then other stuff as well.
So, it's helping them understand what it is. Kind of like, you're just doing a talk to promote it, but it's also helping them understand what it isn't and what they can use it for and what they shouldn't use it for, and just helping them have a happy time with it, if it is going to be the right thing for them. So, that’s what a developer advocate does.
DB: Sorry to interrupt you there, Robin. Did I hear you say you also represent the developers? Like almost, as on behalf of the developers to Confluent? Like, presumably this is what the community is asking for. This is the feedback they're giving me. Did you take on that sort of role?
RM: Yeah, it's exactly that. Yeah. So, that's kind of a very deliberate part in my opinion, of being a developer advocate. So, I think any good one will have that ability to actually go back to the company or whoever is kind of employing them to be able to say, “Look, this thing here doesn't work,” or like, you may be thinking this thing's great, but no one wants that. What everyone wants is this. And obviously that's kind of like, “Well, everyone's got their opinion.” And so that kind of product and engineering, they do a great job, they’re kinda like, working for what's appropriate for us.
But for developer advocates, it's not about saying, “Hey, here's our latest feature. And this is why you must use it.” It's about saying, “Here's this cool thing that we've just developed. We think it'll help you like this.”
And developers might say, “No, it won't, because X, Y, Z.” I need to take that back to the product managers and say, “So we’ve combined the feedback we're getting, and it looks like this.” And that's really, really valuable feedback because that's not just from internally. That's actually from people using the software day-to-day, saying like, “These are our pain points.” So, a good developer advocate is always listening, always feeding back. It’s not just kind of like telling people stuff. It's like helping people with stuff outbound, but it's also gathering that feedback and taking it back also. So, that's a key part of the role.
DB: You said that in a pre-COVID world, you would've been doing that presumably in conferences and the like. You would have been doing a lot of traveling. So, how does that feedback loop occur in a post-COVID world? Is it community forums?
RM: Yeah. So, there's lots of conferences and yeah, definitely. So, I suppose pre-COVID, I was traveling about a third of the time and there's conferences, there's talks, but I’ve always been a big fan of kind of like the online world, like way back in the day. And I'm showing my age, like I was into IRC and stuff like that. And I've always found it fascinating the way people can connect online like that.
So, I guess nowadays there's lots of Slack Workspaces, there's Discourse forums. We set up one of those recently at Confluent. We launched that late last year. And that's, again, been really useful for bringing developers together. And kind of just going back to the advocate thing as well. It's also about meeting developers where they're at. So, we’ve got at Confluent a Slack Workspace, we've got a forum, but there's also things like Twitter, there's Reddit, where developers are, so advocates will kind of like go and meet them there. So, you get a lot of feedback just chatting to people online.
Sometimes it's public conversations. A lot of the time, you'll have very fruitful conversations from someone complaining about something publicly. And like, we all like to take to Twitter and rant about stuff, but actually following up with people on that privately. And not like those anodyne company responses like, “Hey, sorry, we heard you had a problem. Can you DM me your account number, blah, blah, blah.” That's like a corporate, boring, awful thing. But actually saying like, “Oh, that sounds like that sucks. Like, what was the problem?” And sometimes they'll ignore you, sometimes they’ll be cross. But a lot of the time, it's like, “Oh, well actually, here's the issue.” And you can actually help them through, even if it's just working out “Well, here's the docs page. Maybe you missed that out.” And then, they have a happy time. And if there isn't a docs page then that's good feedback to take back to the team and say, “This thing is kind of unclear and perhaps we should document it better.”
DB: That's hard because you're almost getting dragged into becoming technical support.
RM: Yeah. And it's a funny, fine line between how much of advocacy is actually just supporting people on the forums. Because in a sense, if all you're doing is answering questions on a forum, that is just support, but that's also building up your knowledge of the area. And one of the things that I've found quite useful is, answering a bunch of stuff gives you an understanding of the areas that people have pain in, which then actually gives rise to really useful conference talks. Because if people are always having trouble with something and asking about it, and then you write a conference talk which talks about that thing, you can pretty much guarantee an audience, because everyone always has problems with this stuff. So, your conference talk almost writes itself because it defines the area that you want to talk about.
DB: And who would you report to within the organization? Would that be a product owner or the developer team? Would you get involved in the sprint planning itself? How do you bring that feedback back?
RM: So, developer relations as a whole, as a discipline, tends to vary between reporting into marketing, or it could report into product. At Confluent, it kind of moves around over time. It's reported to both over time. Sometimes it sits under engineering. But a lot of the time that feedback loop comes from building up relationships with the product managers, with the engineering teams and just going back to them directly and saying, “Look, there's this thing.” And I guess as companies get bigger, maybe that gets more formalized. But certainly in smaller companies, it's just like reaching out directly and building that personal relationship.
DB: Well, I guess we should start talking about Kafka. So look, to begin with, what advantages does Kafka offer being a distributed publish-subscribe system versus a traditional request-response model?
RM: So Kafka, the way I like to put it, is about events. Events let us model what's happening in the world around us. So, events describe things, and events are like, something happens, and “What happened there? What time did it happen?” And that's how a lot of the data we work with originates. By modeling and working with data in that kind of way, you’re actually capturing it in a way where there's very low friction in terms of the conceptual way of dealing with it. And as soon as you start to kind of bucket it into other ways of doing it, it's fine, but there's trade-offs to be made. So, that's why capturing events as they happen is a great way to go about doing it. In terms of request-response: request-response is great in some cases, but a lot of the time you want to actually have an asynchronous approach to things, because with request-response you start to kind of block things up, waiting for a response, and then something else fails.
And you kind of get that knock-on effect, that’s domino-like. All the things are blocking, waiting for that one endpoint to respond. So, by using Kafka as your broker, you can actually put messages onto a topic and then other services deal with those asynchronously. So, you have more loose coupling between your systems. They're still coupled in a sense, because they're still working with each other. But you're doing it asynchronously, which in a lot of cases is the right way to do it, but not always.
And that's the thing, it depends. It's working out which way is what you actually need in your system. Do you need that direct coupling that deliberately doesn't do anything until it's had a response? Or is it more a case of saying this customer has just placed an order? So, I'm going to say they placed an order, I put that onto a message queue, onto a topic, and then anyone else who needs to know that an order was placed, whether it's the inventory service or the shipping service or the fraud detection service, they can subscribe to that topic. They can find out about that, but then placing an order isn't dependent on each one of those responding and saying, “Yep, I've heard about it.” You can actually just put it onto that topic and those services receive that message when they're able to.
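The order example can be sketched in a few lines of Python. This is a toy in-memory model, not the Kafka client API: the `Topic` class, its `publish` and `poll` methods, and the service names are all hypothetical, chosen only to show that the producer never waits on its consumers and that each consumer reads at its own pace.

```python
from collections import defaultdict


class Topic:
    """Toy stand-in for a topic: an append-only list of messages."""

    def __init__(self, name):
        self.name = name
        self.messages = []               # messages stay here after being read
        self.offsets = defaultdict(int)  # next position to read, per subscriber

    def publish(self, message):
        # The producer returns immediately; it never waits for any consumer.
        self.messages.append(message)

    def poll(self, subscriber):
        # Each subscriber reads from its own position, whenever it is ready.
        start = self.offsets[subscriber]
        batch = self.messages[start:]
        self.offsets[subscriber] = len(self.messages)
        return batch


orders = Topic("orders")

# The order service publishes and moves on.
orders.publish({"order_id": 1, "item": "widget", "qty": 2})
orders.publish({"order_id": 2, "item": "gadget", "qty": 1})

# Downstream services consume independently.
inventory_seen = orders.poll("inventory")
shipping_seen = orders.poll("shipping")

# A third order arrives. The fraud service has not polled yet,
# so its first poll returns all three orders.
orders.publish({"order_id": 3, "item": "widget", "qty": 5})
fraud_seen = orders.poll("fraud")
inventory_next = orders.poll("inventory")  # only the new order
```

Placing an order here is a single `publish` call. Whether the fraud service is slow, down, or not yet written makes no difference to the order service.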
DB: Yeah. I'm interested in the use cases and architecturally, the deployment models for Kafka including a pub-sub model, like you’re talking about there. But first, I'd like to talk about data integration because in previous writings on your blog, you've talked about how data integration built on a traditional static data store will inevitably end up with a high degree of coupling and poor scalability. How can switching to an event processing model for integration overcome that issue?
RM: So, in terms of scalability, I suppose, in terms of integration, the way that we've historically built systems is, I've got data in one place and I want it in this other place. Going back many, many years, it's like, well, that's fine. I've got one great big mainframe. I would maybe copy it from one subsystem to another subsystem. And then move on a few years, and it's like, well, we've got this one great big central transactional server and another great big central data warehouse. I would just copy the things between them. And that's kind of point to point, and that's fine.
And then fast forward a few more years, and I guess we're talking about 10, 15 years ago, and suddenly the whole thing exploded. And suddenly there were numerous different databases to choose from, numerous different cloud services to choose from. People were running software on their desks, and it was no longer the purview of just this elite kind of data team.
It was like anyone who could spin up a server or had a credit card could now start storing data and producing data, and wanted to extract data or send data. And so you ended up with this huge, huge spaghetti bowl of tightly coupled mess, where you say, “I want to get data from this place to this place.” And someone else will say, “Well, I also want data from this place, so I'm going to copy it to here, but I can't copy it here until this feed is run.” And then that feed breaks and like 10 people start screaming, and we only knew about one of them, and nine other people had piggybacked onto the back of that.
So, the point around using something like Kafka for integration is that when an event happens, it gets published onto a topic. It doesn't get deleted from that topic when it's read; the person who created that topic defines how long to keep that data. That could be based on time, like, “Let's keep this data here for 10 years or 10 days or whatever is appropriate to that business case.” Or based on size: “Let's keep the last, like, 10 terabytes' worth of that particular topic.” Or indeed, “Let's keep it forever.” It depends entirely on the particular piece of data or the entity that you're working with. Anyone else who wants that data can subscribe to that topic and independently read from it.
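The three retention choices (by time, by size, or forever) can be illustrated with a small sketch. This is a toy model under stated assumptions: `RetainedLog` and its parameters are hypothetical names, and it drops individual records, whereas a real broker's retention mechanics are more involved. The point is only that retention is a property of the log, independent of who has read what.

```python
from collections import deque


class RetainedLog:
    """Toy log that drops old records by age and/or total size."""

    def __init__(self, max_age_seconds=None, max_bytes=None):
        self.max_age_seconds = max_age_seconds
        self.max_bytes = max_bytes
        self.records = deque()  # (timestamp, payload) pairs, oldest first
        self.size_bytes = 0

    def append(self, timestamp, payload):
        self.records.append((timestamp, payload))
        self.size_bytes += len(payload)
        self.enforce_retention(now=timestamp)

    def enforce_retention(self, now):
        # Time-based: drop records older than max_age_seconds.
        if self.max_age_seconds is not None:
            while self.records and now - self.records[0][0] > self.max_age_seconds:
                self._drop_oldest()
        # Size-based: drop oldest records until the log fits in max_bytes.
        if self.max_bytes is not None:
            while self.records and self.size_bytes > self.max_bytes:
                self._drop_oldest()

    def _drop_oldest(self):
        _, payload = self.records.popleft()
        self.size_bytes -= len(payload)

    def payloads(self):
        return [payload for _, payload in self.records]


# Keep at most 10 seconds of data.
by_time = RetainedLog(max_age_seconds=10)
by_time.append(0, b"a")
by_time.append(5, b"b")
by_time.append(12, b"c")  # "a" is now 12 seconds old, so it is dropped

# Keep at most 6 bytes of data.
by_size = RetainedLog(max_bytes=6)
for t, payload in enumerate([b"aaa", b"bbb", b"cc"]):
    by_size.append(t, payload)  # "cc" takes the total to 8 bytes, so "aaa" goes

# No limits configured: keep everything.
forever = RetainedLog()
forever.append(0, b"x")
forever.append(10**9, b"y")
```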
So, you can have very, very near real-time exchange of data: data gets produced, like an order gets written, and these other services can read from that and know about it almost instantaneously. You can have other systems, maybe it's like an audit system or a machine learning model that wants to get some training data, and they can read from that. They could hook up to it once a day and say, “Give me all of the new data.”
But the point is the data's there on that topic in Kafka for anyone to read, who's got permission to access it. So, it's a much more loosely coupled way of saying here's some data that got created and now anyone who needs that data can access it, but without building these tight couplings together. So, it makes it more loosely coupled. It also makes it more scalable because Kafka is a distributed system. So, as you have more data and more throughput, you add in more and more Kafka brokers and you get more scalability from that, and your consuming systems can consume in parallel. So, it's much better that way also.
DB: Yeah. I mean, when I think about event processing engines like Kafka, I obviously think of models like Internet of Things devices, which are generating lots of events, or LinkedIn or whatever it may be, where there's some event occurring on the website and they just need to track vast amounts of data as people are doing stuff. But of course, those transactional databases that you are talking about are still incredibly important within an enterprise. So, how do you stream events from those large SQL databases into Kafka?
RM: So, there's two different approaches. One is the application that wrote the data to that database writes it to Kafka instead. So, it depends on why we are writing it to a database. Are we writing it to the database just because that's what we've always done? We've always written data to a database, so we'll keep on writing to a database. Or actually do we say, “Well, we didn't actually need it in a database in the first place. We only put it into a database as a way of exchanging it with other systems.” In which case you say, “Well, if it's appropriate for the project, let's just change it and write it to Kafka instead.”
A lot of the time, that's not an option, at least initially. Initially everyone's like, “No, we're not changing anything.” Or “The scope of this project isn't to actually do that.” So, you can use something called change data capture, which lets you take data from the database and stream it into anywhere else, including Kafka. Many, many different databases support this way of doing it. The details differ, but Oracle's got a redo log, and you can get the data out of the transaction log in MySQL and Postgres. All the relational databases have got the concept of a transaction log, and you can actually capture the events such as inserts and updates, even deletes. You can capture that data out of the database and you can stream it into other places, including Kafka.
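As a rough illustration of log-based change data capture, here is a toy Python sketch. It assumes nothing about any real database or connector: `TrackedTable`, `capture_changes`, and the event shape (`op`, `key`, `before`, `after`) are hypothetical. The list `change_log` stands in for a transaction log, and the capture function reads it from a saved position so that each run picks up only new changes.

```python
class TrackedTable:
    """Toy table keyed by primary key that records every change in a log."""

    def __init__(self):
        self.rows = {}
        self.change_log = []  # stand-in for a database transaction log

    def insert(self, key, row):
        self.rows[key] = dict(row)
        self.change_log.append(
            {"op": "insert", "key": key, "before": None, "after": dict(row)}
        )

    def update(self, key, changes):
        before = dict(self.rows[key])
        self.rows[key].update(changes)
        self.change_log.append(
            {"op": "update", "key": key, "before": before, "after": dict(self.rows[key])}
        )

    def delete(self, key):
        before = self.rows.pop(key)
        self.change_log.append(
            {"op": "delete", "key": key, "before": before, "after": None}
        )


def capture_changes(table, from_position):
    """Return the change events after from_position, plus the new position."""
    events = table.change_log[from_position:]
    return events, len(table.change_log)


customers = TrackedTable()
customers.insert(42, {"name": "Ada", "city": "Leeds"})
customers.update(42, {"city": "York"})

# The first capture run picks up the insert and the update.
first_batch, position = capture_changes(customers, 0)

customers.delete(42)

# A later run resumes from the saved position and sees only the delete.
second_batch, position = capture_changes(customers, position)
```

Each captured event could then be published to a topic, so downstream systems see inserts, updates, and deletes without the application that writes to the database changing at all.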
DB: Yeah. Great. And a lot of people would be familiar with Kafka, but I think few perhaps would be familiar with the ksqlDB, which is also produced by Confluent. Now it's described as a database for building stream processing applications on top of Kafka. Tell us what is the purpose of ksqlDB and how does it differ from a traditional SQL database?
RM: Yeah, so ksqlDB is really cool. It's one of the things that, when I joined Confluent, was just in its infancy and it's fantastic to see how it's grown. It's like a database for building event driven applications, but ksqlDB is also a way of doing stream processing, declaring stream processing using SQL. So, when I came in, I suppose I was in the big data space. My background is in analytics. And then I started reading about Hadoop and things like that. And there was this thing called Spark, all this stuff, and I felt I must be left out because I couldn't write Java, I couldn’t write Scala or any of that kind of stuff that people use for doing stream processing. And then I started reading about ksqlDB and got my hands on it.
I was like, “This is really cool.” I can take a stream of events and I can say, “I'd like to transform that stream of events into another one.” I could filter it and I can aggregate it. I could even join it to another stream. And I could express that using SQL. And SQL is like my bread and butter because that's what my background is. So, that's one of the things that ksqlDB lets you do; you create these queries in SQL and they're continuous queries: when you create this query, it actually continues running on the server. If a server stops, when it restarts, the query keeps on running. So, you continue processing these streams of data.
So, one of the key purposes of ksqlDB is that you can build these stream processing applications that you're expressing using SQL. So, if you start thinking about things like ETL and the old days, you would kind of pull some data out and then you transform it and then put it somewhere else or you pull it out and put it somewhere else and then transform it depending on if you're doing ELT or ETL, you can actually do this concept of streaming ETL by taking data out of a system.
So, like from a transactional database, like you just asked about a minute ago, as that data is coming out of the database, you can be transforming it and enriching it and doing the stuff you want to do to it. And then you can store that into a Kafka topic for other systems to use or services to use. You can also stream it on downstream using something like Kafka Connect and it'll push it out to another system.
One of the other really cool things that ksqlDB also does, and this is where “DB”, the name comes from because it used to be called ksql. And then it got renamed ksqlDB because it actually stores data. It builds a state store internally. So, ksqlDB itself is built on Apache Kafka. So, it reads data from Kafka topics, it writes data to Kafka topics.
Apache Kafka is the persistence layer. But within ksqlDB, there's a state store, which I think uses RocksDB in the background, and that state store actually builds up the state. So, if you're building an aggregation, like I want to know how many orders were processed in the last 10 minutes, you can build up this kind of cumulative aggregate and actually hold that internally, which is pretty useful because you have that stateful aggregation. It can be scaled out across ksqlDB workers, and it kind of handles that automatically. But you can actually query the aggregate directly. So, you can do what's called a pull query against ksqlDB either directly, or you can use the REST API or the Java client. So if you take a step back from it, it means that you can say, “I've got this data coming in from anywhere.”
I produce it to Kafka directly, or I pull it from a database stream or I pull it from anywhere else. I can run a query, which is going to build up an aggregation saying, “What was the maximum order value in this time period?”, or “How many did we process?” Whatever. It holds that state, continually updated, within ksqlDB. And then an external application can query ksqlDB and say, “How many orders have we currently processed in this 10-minute window?” So, you don't need an additional cache or store elsewhere. You've just got your data being created. It's going into a Kafka topic. And then ksqlDB is maintaining that state on top of it.
DB: Interesting. So as I understand it, ksqlDB is basically an interface to Kafka Streams to be able to query or set up Kafka streams using SQL syntax. The DB element, like you're saying, is this persistent data store related to that where you can presumably set up the stream to point to another stream to output it to a persistent data store in a Kafka stream. Are there limits in terms of that time window where you can query that data?
RM: So yeah, ksqlDB, you’re right, it runs as a Kafka Streams application. And as a user, you don't need to know any Java, which is great because I don't know any, so you just write SQL. And then in terms of retention and stuff like that, you can define that when you're creating what's called a table.
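The idea of a continuously updated aggregate that an application can look up on demand can be sketched in plain Python. This is not ksqlDB syntax or its API: `OrderStats`, `process`, and `pull` are hypothetical names, and the sketch assumes fixed, non-overlapping 10-minute windows keyed by event time in seconds.

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # 10-minute windows


class OrderStats:
    """Toy state store: order count and maximum value per 10-minute window."""

    def __init__(self):
        self.state = defaultdict(lambda: {"count": 0, "max_value": 0.0})

    @staticmethod
    def window_start(timestamp):
        # Align the event time down to the start of its window.
        return timestamp - (timestamp % WINDOW_SECONDS)

    def process(self, event):
        # Called once per incoming event; updates the running aggregate.
        window = self.state[self.window_start(event["ts"])]
        window["count"] += 1
        window["max_value"] = max(window["max_value"], event["value"])

    def pull(self, timestamp):
        # Point-in-time lookup of the current aggregate for one window.
        key = self.window_start(timestamp)
        return dict(self.state[key]) if key in self.state else None


stats = OrderStats()
for event in [
    {"ts": 30, "value": 25.0},
    {"ts": 540, "value": 10.0},
    {"ts": 610, "value": 99.5},  # falls into the next window
]:
    stats.process(event)

first_window = stats.pull(100)    # window covering seconds 0 to 599
second_window = stats.pull(650)   # window covering seconds 600 to 1199
empty_window = stats.pull(1300)   # no orders seen in this window
```

The application asking "how many orders in this window?" calls `pull` and gets the answer from state that was built incrementally, which is why no separate cache is needed in this model.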
So within ksqlDB, you have the concept of streams. So, you can select against the stream. You can also build a table. So you could say “create table” as selecting your aggregates. And that table is backed by a Kafka topic. And you can say within that table, “What retention period do I want on that data?” And that's going to come down to the business use case that you're writing an application for.
DB: How about relationships?
RM: Yeah. So you can do joins within that. The latest version of ksqlDB is 0.19, which dropped, I think yesterday, now supports foreign key joins also. So yeah, you can do joins between streams and tables and tables and tables and streams and streams. And so you can do that as well.
DB: Can you also join out to an external SQL database, like an Oracle, Postgres database?
RM: In a sense, because you can integrate those into Kafka. So, the pattern to follow would be to pull that data into a Kafka topic. The kind of canonical example would be, I've got all the data coming in from like my platform and my website platform. It's writing audit data into a Kafka topic, and it's got like a foreign key reference out to the customer. So, it doesn't have the full record. It's just like mostly normalized. I've got my customer data in a database. So, you pull that data from your database into a Kafka topic.
And within that Kafka topic, you take that, you model it as a ksqlDB table. So there's this thing called the stream-table duality. We can get into it if you want to, but it's basically how you semantically deal with the data. Is it an unbounded stream of events or is it key-value information? So, you can take that data from the database that you're integrating into Kafka, as a snapshot plus a continuous stream of changes, and you can then join that to that event stream. So, you can join to data in external sources. You just make sure you pull that data from the external source into Kafka first.
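A minimal sketch of that pattern in Python, assuming a simplified model rather than ksqlDB itself: customer changes are folded into a key-value table (latest value per key), and each audit event is enriched by looking up its `customer_id` in that table at the moment it arrives. The function names, field names, and the convention that a `None` value marks a delete are assumptions made for illustration.

```python
def apply_change(table, change):
    """Fold one change event into a key-value table (latest value per key)."""
    if change["value"] is None:
        table.pop(change["key"], None)  # a null value marks a delete
    else:
        table[change["key"]] = change["value"]


def enrich(event, table):
    """Join one stream event to the table on customer_id (a left join)."""
    customer = table.get(event["customer_id"])
    return {**event, "customer_name": customer["name"] if customer else None}


customer_table = {}
enriched = []

# Interleaved inputs in arrival order: table changes and stream events.
inputs = [
    ("change", {"key": 7, "value": {"name": "Ada"}}),
    ("event", {"customer_id": 7, "action": "login"}),
    ("change", {"key": 7, "value": {"name": "Ada L."}}),
    ("event", {"customer_id": 7, "action": "checkout"}),
    ("event", {"customer_id": 8, "action": "login"}),  # no such customer yet
]

for kind, payload in inputs:
    if kind == "change":
        apply_change(customer_table, payload)
    else:
        enriched.append(enrich(payload, customer_table))
```

The same stream of customer changes is treated as a table here (only the latest value per key matters), while the audit events are treated as a stream (every event matters). That is the duality in miniature.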
DB: Right. You previously alluded to microservices with your pub-sub example, with your order example, posting new orders to a message queue and having services subscribing to that queue, and executing some logic based on that. So, can you explain a bit more in depth about how a Kafka streaming engine can facilitate a microservices architecture?
RM: Yeah. So, it's this idea of being able to exchange the information within the services. So you've got your model for each microservice and instead of building around this idea of request-response. So, you know, your order service sends a request over to the fraud checks service, and then does nothing until it gets a response from that. It can actually put a message onto a Kafka topic. So, you're asynchronously doing these relationships between your different microservices, your fraud check service.
In fact, maybe that should be actually a request-response because you maybe don't want the order to proceed until it's been fraud-checked. So it depends on the kind of the business process behind that. Something like an inventory update, perhaps you'd simply say the order service puts out a message saying, “We've just sold this thing and we've allocated this thing.”
So your inventory service will be subscribing to that topic. It would find out about that, it could update its own internal data store, but you start building out your services in that way, using Kafka as the broker between them, and the data is retained on Kafka. So, this is one of the key differences between Kafka and message queue solutions that people may be familiar with, because Kafka’s got elements where it behaves kind of like a message queue, but it's not like a drop-in replacement for other ones. It’s a broader technology than that. Because it retains the data, not only can the systems that you initially designed build around it (you have your order service put a message onto a topic, and your fraud-check service and your inventory service read from that), other services can also come along subsequently and also build on that data.
So in terms of microservices, one of the key things to consider is also around the schemas of your data and how you're going to manage that. And the compatibility between that. Because if you're doing request-response, you've got kind of like the message formats that you can exchange. And that's your API that you're gonna support. If you start working asynchronously with messages onto a topic, the schema of those messages becomes your API.
So, you need to make sure that when you place an order, when you say like this order was placed at this timestamp, and it goes to this address here, or something like that, those fields are understood by the other services to be in a particular format, and whether they exist, are they optional and things like that. So that's why something like a schema registry becomes really important because on day one, you say, “Well, I'm writing this proof of concept and I understand what the format of this data is. So it's kind of obvious. I'm not bothered with any of that.”
And then on day a hundred, you get some new developers in, or maybe you go on holiday and people start saying, “I guess this is just like orders, so we'll just put some data on here, which looks like that.” And then things start changing. Someone changed the timestamp, and instead of doing it as a varchar, they've been producing it as POSIX epoch time, since 1970, and things start to break. So, you have to have those compatibility guarantees and decide how you're gonna work with things like schemas, which act as the API between the different microservices.
DB: It's funny because in a lot of this stuff, you sort of think about it as not being important because the system is so flexible. And we had the advocate from MongoDB on the podcast some weeks ago. And he said exactly the same thing about “You guys, you really do need to put some thought into your schema development here.” And it's interesting you saying the same thing.
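The timestamp example can be made concrete with a small, hand-rolled check in Python. This is not a schema registry or any real serialization format: `ORDER_SCHEMA`, `validate`, and the field names are hypothetical, and the sketch only checks field presence and Python types. It shows how an agreed schema catches a producer that quietly switches a timestamp from a string to epoch seconds.

```python
ORDER_SCHEMA = {
    "order_id": {"type": int, "required": True},
    "placed_at": {"type": str, "required": True},  # e.g. an ISO 8601 string
    "address": {"type": str, "required": True},
    "gift_note": {"type": str, "required": False},
}


def validate(message, schema):
    """Return a list of problems; an empty list means the message conforms."""
    problems = []
    for field, rule in schema.items():
        if field not in message:
            if rule["required"]:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(message[field], rule["type"]):
            problems.append(
                f"{field}: expected {rule['type'].__name__}, "
                f"got {type(message[field]).__name__}"
            )
    for field in message:
        if field not in schema:
            problems.append(f"unknown field: {field}")
    return problems


good = {"order_id": 1, "placed_at": "2021-06-01T10:15:00Z", "address": "1 Main St"}

# A producer silently switched the timestamp to epoch seconds.
changed = {"order_id": 2, "placed_at": 1622542500, "address": "1 Main St"}

good_problems = validate(good, ORDER_SCHEMA)
changed_problems = validate(changed, ORDER_SCHEMA)
```

Running the check at produce time means the bad message is rejected before any consumer sees it, instead of breaking the inventory or shipping service later.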
RM: I was just going to completely agree with that point, right? It's the flexibility and the freedom and the ease of it, which is so attractive to developers because it's like, “I can just pick this thing up and do things,” but it's kind of juggling with knives in a sense. It's like, “Go for it, be my guest.” But if you're going to do it, make sure you're aware of the pain that can come along down the line.
And sometimes people just have to learn by doing it the less than brilliant way and just being burnt by it. And then they're like, “Okay, next time around, we're going to do this.” And it's always that interesting trade-off in how you build systems. Do you go with something that has tons and tons of guardrails and it's like super restricted, but you can't make any mistakes, which is usually really tedious and boring? Or do you say, “Here's this completely greenfield. Knock yourselves out, but you're on your own.” And it's kind of getting the right balance between what's gonna be productive and supports you in the long run, versus what's just not going to be such a great idea.
DB: Look, to finish off, I'd like to get some thoughts if you have any on what the future holds for event streaming. We've had some guests in the program before, which should have had some big grand visions for event streaming, but I guess you're more on the ground with developers and what they're doing and what they're looking for. Do you have any thoughts as to where this industry is headed?
RM: So, I think it's quite easy to get caught up. In things like Twitter, you’ve got these echo chambers where, because you follow people who are interested in the stuff that you are, it's quite easy to sometimes feel that everyone is at the same point. Of course, everyone's building streaming systems. Of course, everyone's doing live streaming ETL. When actually in practice, I think people are just starting to catch up to things like, “I guess cloud makes sense for running our workloads, because why would you run it yourself if you can get someone else to do it for you?”
And so I think what the near-term future and the midterm future holds is people realizing that because events model the world around us, starting with an event streaming platform, like Apache Kafka, like Confluent, for working with your data, whether you're building integration pipelines, whether you're building applications, that's a really, really powerful thing to do because you're not losing any fidelity in your data.
You're actually working with events, which is how data originates a lot of the time. From there, you can go and build state and stick it in a relational database or a NoSQL store or do whatever you already do. But I think that shift in mindset from, “This is how we've always done things the last 50 years,” and then “Let's chuck away schemas and call it NoSQL or whatever.” It's all part of the same way of doing things. But working with events as that fundamental piece on which you then build, I think it's a mind shift which is starting to happen, but it's still got quite a long way to go. I truly believe that it's a very, very powerful foundation on which to actually build systems.
DB: Yeah. And we hear the same thing from our other guests. We're just at the starting point of this, whether it be event streaming or microservice or digital transformation or any buzzword that you decide on. A lot of these concepts and technologies and architectures are starting to mature now. And yes, we're starting to understand how they are used and how they're deployed. And for the next 10 years, most enterprises are just going to be busy doing it.
RM: Which is good, because that's what the whole point of all this technology is. It's fun, but hopefully it's also going to develop, deliver value. It's not just shiny boxes.
DB: Yeah. Robin, it's been a pleasure having you on the program. How can our audience follow you and keep in touch with what you're writing about?
RM: So I'm always on Twitter. I'm “@rmoff” on Twitter. I've got a blog, rmoff.net. You can also check out Confluent Cloud, and the Confluent Blog, where I write lots of stuff also.
DB: Brilliant. Thanks for joining us on the program today.
RM: Thanks so much for having me.