Increasing traffic volumes and deeper complexity – these are just some of the common challenges organizations face as their applications grow. To manage that growth, applications have to be designed to deal with traffic spikes and minimise downtime. Architects need to consider availability and risk management as part of the application design.
Joining us today for a round of Cocktails is an experienced expert on doing just that. Listen to his sound advice on how organizations can design their applications to scale by improving availability, designing for failure, and managing risk.
Kevin Montalbo: All right. Joining us today for a round of Cocktails, of course, is Toro Cloud CEO and founder David Brown. Hi, David!
David Brown: Good day, Kevin!
KM: And our guest for today is an experienced and recognized industry thought leader in cloud computing, with over 30 years of experience in architecting and building high-demand SaaS applications.
He helps companies modernize their applications by focusing on scaling and availability, facilitating cloud migration, DevOps transformations, and risk-based problem analysis.
He’s the author of the book "Architecting for Scale", published by O’Reilly Media. He’s also a renowned speaker and industry expert, widely quoted in several publications, and has been a featured speaker at events across the globe.
Today, he joins us for a round of Cocktails. Lee Atchison, we’re glad to have you on the podcast! Welcome.
Lee Atchison: Thank you. It's great to be here. And I even brought my cocktail with me, ready to go.
DB: It looks like a Martini with an olive. Is it?
LA: Yeah. That's exactly what it is. Yes.
DB: Very appropriate.
KM: Yeah, that's perfect. So, while we were reading your bio, we noticed that it mentioned that during your seven years, as a senior manager at Amazon, you led the creation of the company's first software download store. You also created AWS Elastic Beanstalk and managed the migration of Amazon's retail platform to a new service-based architecture. So can you tell us more about your experience of migrating Amazon to the service-based architecture?
LA: Sure, sure. Yeah. So when I started at Amazon, it was in 2005 and they were, you know - by all modern definitions - a small company. There were probably hundreds of thousands of employees overall, but there were only about a hundred engineers working on the retail side of the company, which is a lot smaller, of course, than it is today. But even at that level, they were running into problems. The biggest problem they were running into is that they had designed the system so that every transaction that went through the website had to go through this one application called Obidos. And the word Obidos comes from a spot - and I've never actually verified this - but the story that was told is it's a spot on the Amazon river where all the tributaries come in and form a funnel for the main part of the river.
And at the time they built it, they thought, "Hey, this is great. We'll funnel everything into one location and that'll make it easier to build," and all that great stuff. Well, of course we all know now that that's the exact opposite of the right way to build an application. And they ran into that problem. By the time I came along, there were about a hundred engineers all working, in some way, shape or form, on this one piece of code - there were other services, but this one piece of code was touched a lot by everybody. They tried deploying it twice a week, but there was always something: some team that didn't run a test suite or some team that made a change and broke something, and so it rarely actually got deployed. It was a major problem with the application, and it made making changes to the application very slow.
So, what they did was they built a new system called Gurupa - which again, I didn't check this, but the way I understand it, Gurupa is the part of the river after the narrowing, where it widens back out again. So the whole idea was, “Let's build this application so that instead of having one funnel point, we have an infrastructure where we can plug in all these modules that are all independently developed, independently deployed, independently tested - services, essentially, frontend services as well as backend services. And we'll change the entire website over to this new model and throw away the old model.” So over the course of what I think was probably about a two-year project, they moved the entire Amazon website from Obidos to Gurupa. And I came on board working for the person who, at the time, had started that whole project.
But by the time it was over, I was running the team that was doing the coordination to get all of that activity done and moving over to the new architecture. At the end, we had something like a hundred-something engineers working over the course of months and months, just getting to this one day. We did some migrations little by little, but we did most of the changes in one 24-hour period, where we had this war room. We got together and, you know, we had metrics on the wall and phones back to other teams - and yes, we had real phones, and we called back to real people on other teams. And we were all in this room together doing this migration, country by country, over the course of 24 hours - one night, one whole day.
And it was great. I mean you know our CEO stopped in and then thanked us and said "hi," and all that sort of stuff. I got to shake his hand. And that was the first time I ever met him. But it was a great time to make all that happen. And at one point we estimated that 0.1% of all internet traffic changed that day because of this one project. And we had this term that we used that we called the New York Times event. And the whole idea is New York Times events were bad things. That’s when there was an article posted in the New York Times that said Amazon screwed up again. And we tried to avoid those. And our whole goal was to try and make it through the entire day without generating a New York Times event.
And we did, you know? Nobody noticed it, and that was perfect from our standpoint. It was very smooth and it just happened. And I was like, “That was a great project.” That was my first project at Amazon, and by the end I was actually running it. It was just a fantastic experience. It also shows some of the things we were learning about the internet back then. After it was done, over the course of about a month or so, traffic started trailing off - we were getting less search traffic and we didn't understand what was happening. And this is getting towards November now, with Black Friday and, you know, the big time for shopping, and our search traffic started to drop off rather dramatically.
Well, that's when we learned what search engine optimization is all about. And literally - you know, we had a team that did this, but we really didn't understand it. It was a time when people were still trying to figure out what Google was doing with search results, what really drove search results and what didn't. And the thing that we didn't know at that time - obviously everybody knows this now - is that the URL was an important part of search results. And we had failed to do the redirections correctly in a way that made Google happy.
And so Google basically said, "Well, they're not around anymore, so we're not going to send traffic to Amazon anymore." It ended up being a - well, not so minor - a major emergency with a minor fix that we put in the night before Thanksgiving, just to try and get the traffic to come back up again. And it worked and everybody was fine, but there were some very stressful moments right before Thanksgiving with all of that happening. And it's amazing, some of the things we take for granted now but that, back then, were new things people were learning about how the internet worked.
DB: You learn by cutting your teeth. Is this where your passion for service-orientated architectures came from?
LA: Yes, very definitely. Yeah. I learned a lot. I mean, you hear all sorts of good things and bad things about Amazon, and yes, it's a pressure cooker environment and yes, it's hard to work for. It's the best job I ever had that I will never do again. But I learned so much about service-based architectures, about scaling, about what high availability really means and why it's important to maintain availability and what happens when you don't. And I can tell all sorts of other stories, but when you're sitting there eating Thanksgiving dinner, looking at a monitor because there's a problem on the website and you have to have it fixed, you know it's affecting millions of dollars of orders. Right then, at that moment, you're looking at a chart that shows the number of dollars in orders currently coming in, and you're seeing that chart going down, and you get a perspective of what availability is about and how important it is. So I learned a lot about the importance of those things, and what to do about them, from those years, as well as going into AWS and learning about it there as well.
DB: Well, let's jump into some of the concepts of high availability and architecting for scale. So, can it all be boiled down to a service-oriented architecture - and now microservices - in a modern architectural world?
LA: Sure. So, I think the simplest answer is probably no. I don't think you can build a modern application today that is highly scalable and highly available without using services and without using service-oriented architecture techniques, whether it's microservices or traditional services, whatever it is, but that's not sufficient. You have to use services, but there's a whole lot more to scaling than just services. And probably the biggest single thing that's involved in high availability and high scaling besides just architecting in a service oriented way is architecting your organization in the correct way.
And you know, organization matters. It really does matter. And how you structure your organization really does make a difference on both scaling and availability. In the book, I talk about a concept called STOSA, which stands for Single-Team Owned Service Architecture. It's a term I made up as part of the book, but it's kind of caught on a little bit and it's been very, very valuable. It's a concept about how you organize your development team - essentially how you organize your company - in such a way that you can build a scalable and highly available application. So it overlaps a lot with DevOps topics, with SRE topics, with a lot of those sorts of things, but it basically talks about, you know, ownership of services. Building a service-oriented architecture is one part of the problem; assigning ownership, and defining what ownership means, is another part of it. Service level agreements - inter-service service level agreements - are incredibly important. And I know that Google likes to use the terms SLOs and SLIs. I use the term SLA for everything, because the terms SLO and SLI imply that an internal agreement is somehow less important than an external SLA. And it's not - it's just as important.
So I'd like to use the term SLA everywhere in an application, whether it's an agreement you're making with a customer or agreement that one service makes with another service on how to perform, it's the same term for both. And so defining SLAs between teams, you know, when a problem is occurring, where the problem is coming from, and who's responsible and knowing who's responsible, isn't a blame thing. It's about finding the problem and knowing what it takes to fix it and knowing who to work on it. And you need to do those sorts of things. So you can scale the organization as well as scale the application.
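Lee's single-SLA vocabulary can be pictured in code. In this minimal sketch (the service names, latency targets, and the `violations` helper are all invented for illustration), internal and external agreements share exactly the same structure:

```python
from dataclasses import dataclass

@dataclass
class SLA:
    """One agreement type for both internal and external consumers."""
    provider: str          # service making the promise
    consumer: str          # a team or an end customer
    p99_latency_ms: float  # 99th-percentile latency target
    availability: float    # e.g. 0.999 == "three nines"

# Internal and external agreements live in the same registry,
# expressed in the same terms - neither is second-class.
slas = [
    SLA(provider="login-service", consumer="checkout-service",
        p99_latency_ms=50, availability=0.9999),
    SLA(provider="storefront", consumer="end-customers",
        p99_latency_ms=300, availability=0.999),
]

def violations(measured_p99_ms, measured_availability, sla):
    """Return which terms of an SLA a measurement breaks."""
    broken = []
    if measured_p99_ms > sla.p99_latency_ms:
        broken.append("latency")
    if measured_availability < sla.availability:
        broken.append("availability")
    return broken
```

Because every agreement is the same shape, "when a problem is occurring, where is it coming from and who is responsible" becomes a lookup rather than a negotiation.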
DB: The STOSA concept, is that an extension of Conway's law?
LA: So they are certainly related, but I don't think it's a direct extension. I do think there's some relationship there. However you build your organization - using STOSA or not - will dictate how your application ends up getting structured. That's definitely true; that's really what Conway's Law says. But I think what STOSA is really talking about is the best practices and the methods for how the different teams within the organization interact in order to make that happen. So, I guess that is kind of an extension of Conway's Law. It's related, but somehow it seems a little bit different.
DB: It’s providing a blueprint and an action plan for organizing teams.
DB: Let’s talk about some of the specifics of scalability. So, you know, we have event-based architectures, microservices - there's different ways of architecting services. Within each of these, sometimes we have to manage state, maintain state - something like, "This person did this," and remember that state in the next service. So how important is the way applications manage state, and how does that affect their ability to scale?
LA: Yeah, so state is kind of the enemy of scaling in many respects. You know, the more state you have, the harder it is to scale. And in fact, the easiest service to scale is a stateless service. So state is very, very important. There are a lot of different models for how to handle state, and how to handle data in general, within the application. I'm a firm believer that, with some exceptions, state needs to be part of the service that is the most responsible for that data. It's part of the ownership model: rather than having one centralized state store, you distribute state correctly throughout the application. Now, that tends to mean you have fewer stateless services and more stateful services, which seems counterintuitive - if a stateless service scales better, why have more stateful services? But by spreading the state around that way, each individual service has less state it has to worry about, can deal with the parts of the state it needs to worry about, and can scale appropriately. And it fits very well into the scaling model - the organization’s scaling as well, too. The more data you have, and the more complex the data interactions are, the more you need to split it up, create barriers between it, and be thoughtful about how the data interacts. One of the biggest problems you run into with state, with data in general, is when you lose track of the connectivity of the data, to the point where you don't know which queries in your application are being run against this data.
And you don't know if there are joins going on in weird ways, and that makes it very difficult, as you're scaling that database, to split the data apart. You can't do it easily because you don't know who's using it or how it's being used. But by pre-splitting the data, if you will - putting the data next to the owner, next to the service that knows the data best and uses it the most - and then building an API between your services for accessing that data, you make very clear boundaries upfront, and that makes the scaling a lot better as time goes on.
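One way to picture "state lives with the service that owns it": each service keeps a private store and exposes the data only through an API, so cross-service reads become explicit calls instead of hidden joins. A toy sketch - the service and method names here are invented, not from Lee's book:

```python
class OrderService:
    """Owns order state; nothing else touches its store directly."""
    def __init__(self):
        self._orders = {}  # private store: this is the ownership boundary

    def place_order(self, order_id, customer_id, total):
        self._orders[order_id] = {"customer": customer_id, "total": total}

    def get_order(self, order_id):
        # The only sanctioned way to read order state; a copy is
        # returned so callers cannot mutate the owner's data.
        return dict(self._orders[order_id])

class ReportingService:
    """A consumer: asks the owner via its API instead of joining tables."""
    def __init__(self, orders: OrderService):
        self._orders = orders

    def order_total(self, order_id):
        return self._orders.get_order(order_id)["total"]

orders = OrderService()
orders.place_order("o-1", "c-42", 99.50)
reporting = ReportingService(orders)
```

The point of the boundary is exactly what Lee describes: when order data needs to move to its own database or shard, only `OrderService` changes, because no other service ever queried the table directly.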
DB: You’ve also mentioned SLAs between teams. And often when we're talking about scaling an application, we're trying to avoid failure. But should architects actually be designing for failure when they're building an application for scale?
LA: So, you know, absolutely. I'm a big fan of chaos testing. The original model that Netflix put out was golden, and something I very much promote anytime I'm working with an organization: I talk about chaos testing as an integral part of the application. I often hear a lot of companies talk about chaos testing as something they do in dev environments or staging environments. And no, you want to do chaos testing in production, in live running production systems. You want to do what sounds crazy. You want to disable services in production. You want to turn them off. You want to break them. You want to do odd things to them, and you want to do that on a regular basis. You want to do it continuously, because you want to see how your system responds.
You want to be able to do that at a time when people are paying attention to the system - people in the company are paying attention to the system - so they can understand how failure works in the system and can apply the proper techniques to prevent it from happening. You know, think of a programmer who's putting in a change to service A, and one of the things service A does is talk to service B. Someone says, "Well, what happens if service B is down?" And often the response is, "I can't worry about that now, I've got a deadline. Service B never goes down. We'll worry about that when that problem comes up." We'll just put this out and assume it's going to work. Well, you've just introduced technical debt into the system, because service B will fail at some point.
And now you have an unknown issue that's going to occur. But if you know, as an engineer, that service B fails regularly and you have to deal with it, you're not going to make those decisions. You're going to say, "I need to deal with this now, because at some point after I deploy, service B is going to fail and I'm going to have to deal with it." And, you know, it's going to happen. You're not going to defer those decisions as often; you're going to make the right choices upfront. And if you don't make the right choices, either intentionally or accidentally, you're going to notice, and you're going to find the problems earlier - with more attention on them than at two o'clock in the morning, six months later, when some random thing happens and you don't know what's going on. So I'm a firm believer in chaos testing. I'm a firm believer in building for failure. And I'm also a firm believer in failing fast and finding the problems quickly.
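The "what happens when service B is down" decision Lee describes can be made explicit in code: catch the failure and fall back to a degraded-but-working answer. A hedged sketch - the services and the cached default below are invented for the example:

```python
def call_service_b():
    """Stand-in for a network call to service B that can fail."""
    raise ConnectionError("service B is down")

# Stale but safe: generic results beat a broken page.
CACHED_RECOMMENDATIONS = ["bestseller-1", "bestseller-2"]

def recommendations_for(user_id):
    # Service A decides *up front* what to do when B fails,
    # instead of discovering the question at 2 a.m. in production.
    try:
        return call_service_b()
    except (ConnectionError, TimeoutError):
        # Degrade gracefully rather than propagating the outage.
        return CACHED_RECOMMENDATIONS
```

Real systems typically wrap this pattern with timeouts and circuit breakers, but the design question is the same one: the call to the other service is optional, and the fallback is chosen before the outage, not during it.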
DB: It takes some confidence to do chaos testing in production environments though, right?
LA: It does. It does. It's a very tough pill, almost an impossible pill for a company who's never done it to say, "Okay, now we're going to do it in production." So you usually have to build them up to that. You say, "Okay, let's, let's start with staging. Let me show you how it works. And let's do something simple. Let's bring down a server that's a spare. Let's see what happens. I'll show you how you respond to that." And you build it up little by little and it's a lot easier to build chaos testing at the beginning than it is later on. Not only from a technical standpoint and building it in, but from a cultural standpoint. The other thing is if you're developing a new application, build these sorts of scaling concepts in from the beginning, these availability concepts in from the beginning, because it's so much easier to think about chaos testing before you have a customer than it is after you have a million customers.
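A chaos experiment at its smallest is just deliberately injecting the failures you fear. This toy wrapper (the names and failure rate are invented) makes a call fail with a configured probability - the kind of simple exercise Lee suggests starting with in staging before graduating to production:

```python
import random

def chaotic(func, failure_rate, rng=random.random):
    """Wrap func so a fraction of calls raise, simulating an outage."""
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("chaos: injected failure")
        return func(*args, **kwargs)
    return wrapper

def fetch_profile(user_id):
    """Stand-in for a normally reliable dependency."""
    return {"id": user_id}

# Roughly 30% of calls now fail; callers must prove they cope.
flaky_fetch = chaotic(fetch_profile, failure_rate=0.3)
```

Running with `failure_rate=1.0` is the "bring down a server that's a spare" drill: every caller's failure handling gets exercised while the whole team is watching.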
DB: Well, you mentioned that concept of introducing technical debt. And I guess, you know, we could probably have a whole podcast on just technical debt and having to deal with these issues later. But that's the concept, right? By failing to recognize or address something early, you're just pushing it down the track - you're going to have to eventually review it, and it could have more implications than it would have had originally.
LA: Yep, absolutely. And sometimes you do that intentionally - you plan to insert technical debt for project reasons. And, you know, there's pros and cons to that, and there's times that you just have to do it. Technical debt by itself isn't bad. Technical debt that you are unaware of - that's what's bad. So if you're putting technical debt into the system as a business decision, and you're knowingly doing it, or you have a plan to remove it later, that's part of risk management - an important part of risk management. But when you don't know about it - when you're doing things that are causing technical debt to be inserted, and you have no idea that you're doing it, and the effects are going to be unknown later - that's the sort of technical debt that's very, very bad.
DB: So, you've mentioned risk management a couple of times, and I know you dedicate a section of your book to it. We keep talking about risk - so what are the things we're actually talking about with risk when we're talking about scalability?
LA: Yeah. So the main thing we're talking about with risk is to understand and plan for risk in advance, so you know what's going on and you can make plans for it. I talk about building a risk matrix. On the teams I work with, I expect every team, for every service, to build a risk matrix that defines all of their known risks - and their thoughts on unknown risks, and where the unknown areas are; those become part of the risk matrix too. And assign both a severity and a priority to all of them. Then you sort them, you organize them, you categorize them, and knowing the risk plays into your planning process for how you're going to add new functionality later. That's essentially your documentation, if you will, of your technical debt - of your risk plan.
Now, any team that has a risk matrix that's empty - that says, "Well, we're really, really good. We've got all the problems solved. There's no risk here at all" - you know, first of all, they're lying, or at least they don't understand what's going on. It means they haven't thought hard enough, and they need to spend more time on it. You should not have a goal or an expectation of having no risk in your application, but you want to get to the state where you have as little unknown risk as possible. Known risk is okay, as long as you know about it and have a mitigation plan associated with it. The other thing that I didn't mention is that for every line item in the risk matrix, you have to have a mitigation plan. If this risk does fire, what do you do?
So if this problem does occur, we have a plan in advance. Maybe it's part of your runbooks or processes or certain circumstances, whatever it is, you know in advance what you're going to do if that risk actually occurs. So you have risks, you can plan for it. Risk is a natural part of the development process of an application, natural process of everything. It's just the unknown risks or risks that you don't have a plan for that you want to avoid. So by recognizing the risks, seeing it, organizing it, prioritizing it and then planning for it, you can be prepared when problems occur. That'll improve availability and that'll improve scalability because a lot of risks fire as you scale up. And that's a common point when risks actually occur.
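The risk matrix Lee describes is simple enough to keep as structured data: each known risk carries a severity, a likelihood, and a mandatory mitigation. A sketch with invented example rows (the scoring scheme is one plausible choice, not prescribed by the book):

```python
# Each row: a known risk, scored, with a mitigation decided in advance.
risk_matrix = [
    {"risk": "primary DB overloaded during sale events",
     "severity": "high", "likelihood": "medium",
     "mitigation": "enable read replicas; follow runbook RB-12"},
    {"risk": "third-party payment API rate-limits us",
     "severity": "high", "likelihood": "low",
     "mitigation": "queue and retry with backoff"},
    {"risk": "stale cache shows outdated prices",
     "severity": "low", "likelihood": "high",
     "mitigation": "shorten cache TTL; add price disclaimer"},
]

_SCORE = {"low": 1, "medium": 2, "high": 3}

def prioritized(matrix):
    """Highest severity x likelihood first: what to plan for next."""
    return sorted(
        matrix,
        key=lambda r: _SCORE[r["severity"]] * _SCORE[r["likelihood"]],
        reverse=True,
    )

def unmitigated(matrix):
    """A row with no mitigation is the unplanned risk Lee warns about."""
    return [r["risk"] for r in matrix if not r.get("mitigation")]
```

Sorting the matrix feeds the planning process Lee mentions, and `unmitigated` flags the rule that every line item needs an answer to "if this fires, what do we do?"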
DB: Right. And that's the relationship to scale.
LA: Right, yes. Scaling and availability go very much hand in hand. More often than not, availability issues have, as a root cause, some form of scalability issue, and scaling issues normally come back to availability. So, when I talk about architecting for scaling, I'm really talking about architecting for availability, because they're so intertwined that they really are the same thing.
DB: Yeah - availability despite increasing throughput. Right. So, you've talked about a labeling system for service tiers within a microservices architecture to avoid potential disasters. Can you run us through this concept?
LA: Sure, sure. So I have what I call service tiers - and actually this isn’t something I made up. This is something we had at Amazon that worked very effectively, and I've just carried it through in other places. What we do is we have four tiers: tier one through tier four. Tier one are your most business-critical services, and tier four are your least business-critical services. You assign each service to a tier. An example of a tier-one, business-critical service is a service where, if it fails, your application will fail - you don't have any choice. A classic example of that is a login service. If you can’t log into a website, that application is not useful to you.
So that's a good example of a tier-one service, and there's obviously lots of examples of that. An example of a tier-four service is a service that is not mission-critical in any way, shape or form. If it goes down, you can run your application without anybody noticing - without any customer noticing - for some not-insignificant period of time. A good example of a tier-four service is a backend reporting service, something that reports information about what's going on with the system. I'm not trying to say that reporting and measurement aren't important, but they don't have a direct effect on the performance of the application at that moment. So you associate a tier with every single one of your services, and you use those tier numbers in your processes - your system processes - in a couple of different ways. One way is in your problem severity assignment process.
So when you associate a ticket or a problem with a service, it's got a severity that reflects how critical the issue is, but it's also assigned to a service, and the service has a tier number. When you're trying to prioritize what things are important to the company, or to a team - if a team owns more than one service - you use those two numbers together. Obviously a high-severity problem on a high-tier service is the most critical thing to fix, and a low-severity problem on a low-tier service is the least important. But what about a sev-two problem on a tier-three service, or a sev-three problem on a tier-two service? By having both those dimensions, you can make plans for which problems you fix first, what order problems are fixed in, and how important different parts of the application are.
The other place to use service tiers is in the connections between services. Services talk to other services, and they talk to services at different tier levels. So if you have a mission-critical, tier-one service that ends up communicating with a tier-four, non-mission-critical service, you'd better have a solid plan in place for what happens when that tier-four service is down - because, guaranteed, it's going to be down more often; a lower tier implies a more relaxed acceptable level of downtime. So you need to figure out how you can operate your service in the event that that other service is down, or unavailable, or problematic, or whatever - the service that you're calling has to be optional. In the opposite case, if you have a tier-four service calling a tier-one service, you can pretty much ignore problems, because if that tier-one service is down, there's a whole lot of other problems going on in the system.
The fact that the management reports aren't going out isn't that big of a deal. Now, those are two extremes; the details are in the middle interactions. You’d be amazed at the number of times you find places where, say, a tier-two service has connections with an awful lot of tier-three services that you didn't really realize were tier-three services - that combination in particular is pretty common. So, it gives visibility to these interactions that can affect availability, across teams. And it's kind of part of STOSA that way as well, as far as organizing how things work - it's part of the SLAs that go with things. But it's just a way of labeling and understanding the importance of a service, so you can apply processes to make sure that you're dealing with the interactions correctly.
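The two-dimensional prioritization Lee outlines - problem severity crossed with service tier - reduces to a small scoring function. A sketch (this particular scoring scheme is one plausible choice, not the one Amazon used):

```python
def priority(severity, tier):
    """Lower score = fix sooner.

    severity: 1 (worst) .. 4 (least severe), like sev-1 .. sev-4
    tier:     1 (most business-critical) .. 4 (least critical)
    """
    return severity + tier  # sev-1 on tier-1 -> 2; sev-4 on tier-4 -> 8

# Invented incidents: (description, severity, service tier)
incidents = [
    ("checkout latency spike", 2, 1),  # sev-2 on a tier-1 service
    ("report job failed",      1, 4),  # sev-1 on a tier-4 service
    ("login outage",           1, 1),  # sev-1 on a tier-1 service
]

ordered = sorted(incidents, key=lambda i: priority(i[1], i[2]))
```

With both dimensions in play, the sev-1 on the tier-4 reporting service correctly loses out to the sev-2 on the tier-1 checkout path, which is exactly the "which problems do we fix first" question the tiers exist to answer.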
DB: The second edition of “Architecting for Scale” was published by O'Reilly last year, which I think is four years after the first edition. I'm guessing quite a lot has changed. In fact, I was reading some comments of yours on what you incorporated into the second edition - advances in the industry, such as serverless computing - and on your tours and speaking engagements, you spoke to a lot of experts and incorporated some of their feedback into the second edition of the book. What changed in that four-year period? What can people expect to see in the second edition?
LA: Sure, sure. So a lot of things changed in the book, and I'll definitely get to your question specifically, but one of the things that I'd learned was that the way the first edition was structured wasn't optimal. So the book is dramatically reorganized into five tenets that correspond better to topics that individual people will want to talk about. That's one of the major changes. But I also added a lot of content in a couple of key areas related to changes in the industry. There's a bunch of content related to serverless. Serverless was something that was around four years ago, but it's higher on the acceptance curve now and it's used in a lot more cases - perhaps more cases than it should be - but it's used in many, many cases.
Edge computing is something that's grown a lot in importance and prevalence in the last four years. Artificial intelligence and machine learning algorithms, and how central they are to key applications now - you no longer see them only in fringe areas like voice recognition; you see them in mainstream data processing, at the heart of applications. You know, even just with cloud - it seems weird to say this, but cloud is mainstream now. Four years ago, it may have seemed that way, but it really wasn't. It really wasn't in some industries, and it really wasn't in some parts of the world. For instance, four years ago, Europe was very, very hesitant to use the cloud. Asia accepted the cloud very readily, and the US pretty much did too - in some industries and not in others.
Europe was very anti-cloud for a long time. They just didn't trust the cloud, and they didn't trust the security aspects of the cloud. Most of that's shifted within the last few years and it's much more accepted, but there are still some holdouts. My very last business trip - and that was a year ago, before the pandemic hit - was to a company in this space. One of the big holdouts still is private banking in Switzerland. That's a huge industry that flatly refuses to use the cloud. It's just not in their mindset; it's just not something that fits their use case. And as such, how many regions does AWS have in Switzerland? At the time they had zero, because there just wasn't a market for them.
They just couldn't break into that very well. I'm not sure if it's still zero or if they've actually launched one now or not, but you get the idea. And so, the cloud is much more mainstream now than it was four years ago and it used to be the conversations back then were, "Okay, we’ll consider the cloud, but tell us why we need it and how it's going to work." And now it's a matter of, if you're not using the cloud, tell me why you're not, and what you're going to do instead. That's a drastic shift in the conversations that occurred in a lot of companies.
DB: Lee Atchison, just one of those books that every architect should have in their library. Thank you so much for joining the podcast today. And it would be fantastic to have the opportunity to talk to you again on some of the areas in more detail that we've sort of just briefly covered today.
LA: More than glad to. Obviously I love talking about these topics and enjoyed being here and thank you for inviting me.
DB: Great. Thanks, Lee. And where can people go to find out more about you and the book?
LA: Sure. So Leeatchison.com, L-E-E-A-T-C-H-I-S-O-N.com, is the best place. I have all my writings on there and links to other things and to what I'm doing. And you can buy the book at Amazon. You can also get it if you're an O’Reilly Safari member, you get the book for free in the O’Reilly Safari membership package. And it's available on other platforms as well, but Amazon is the main place.
KM: All right, cool. Thanks very much, Lee.