PODCAST

Logging in Action with Phil Wilkins


As systems become more and more distributed, optimising and organising their logs becomes increasingly important as well.

In this episode, Oracle Cloud Developer Evangelist and author of "Logging in Action" Phil Wilkins leads us in a discussion on unifying logs and proper log management through the use of Fluentd. He also talks about the importance of making log entries easier to understand and achieving "Clear Language."

Episode outline

  • What is unified logging and what has caused its emergence?
  • What is the difference between log analytics and unified logging?
  • How does Fluentd help with unified logging?
  • How do you prepare data to be ingested by Fluentd?
  • Is Fluentd the only game in town?
  • How can we generate better log data?
  • What’s the difference between audit events and log events?
  • What is the importance of achieving "human and machine-readable" language?

Transcript

Kevin Montalbo

Welcome to Episode 69 of the Coding Over Cocktails podcast. My name is Kevin Montalbo. Joining us is Toro Cloud CEO and Founder, David Brown. Good day, David!

David Brown

Hi, Kevin!

Kevin Montalbo

All right. Our guest for today has spent over 25 years in the software industry, with a breadth of experience in different businesses and environments, from multinationals to software startups, and user businesses to consultancies, working with UK and internationally known brands and organisations. He's the author of "Logging in Action: With Fluentd, Kubernetes and more," which teaches readers how to record and analyse application and infrastructure data using Fluentd. We'll talk about that book today, and if you stick around until the end of the show, you'll learn how you can get a copy for yourself. Ladies and gentlemen, joining us for a round of cocktails is Phil Wilkins. Hey Phil, welcome to the show.

Phil Wilkins

Thank you! Good day, everybody.

Kevin Montalbo

All right, good day! So, let's begin. Systems have been producing logs for decades. What has caused the emergence of unified logging?

Phil Wilkins


So, logging has always been around. But what we've seen over probably the last 20 years is systems becoming more and more distributed in their construction. So, we'd need to bring the logs together. But as virtualisation was introduced, we saw that extend even further. And then in the last five to 10 years, containerisation has driven that even further, particularly microservices.

So, now you've got one application that might be spread across multiple environments and to understand what's going on, you need to bring the logs from each of those environments together to be able to get a holistic picture. And that's the key to log unification, it’s to gather that information together to a position where you can see what is going on end to end in your solution.

David Brown 

And when you talk about servers running in parallel, or a distributed system, or a cluster of servers, and unifying logs across those, is it unifying the logs across the same services, or also across the tiers of the application? Unifying logs across, say, the network tier, the database tier, the application tier, is that what unified logging is about as well?

Phil Wilkins

Yes, so, many people are on a journey of varying levels of maturity. And what you've described is almost like the path of maturity for log unification. To start with, just get your app servers talking; that's probably the easiest. But your infra guys are going to want to look at what you are doing and what's happening on the server infrastructure. Your DBAs are going to want to know how hard you are pushing their database. So, if they see database performance issues, it's like, "is there something that's happening in my database?" or "has your consumer saturated me with requests?" And the more you can bring that information together, the better your understanding is of your landscape, as an application or server does not operate in isolation.

David Brown

Yes, of course. So what is the difference between log analytics and unified logging?

Phil Wilkins

So log analytics, like most analytical processes, is about processing a volume of data, analysing it, and quite often looking for trends and patterns, whether that's in log entries or transactional records. Log unification is about getting the logs together. All these independent components are contributing to the sum total of your environment and solution, and you need to bring that data together into a single place to be analysed.

Now, one of the tricks with more contemporary unification tools is that rather than just grabbing the data and putting it into a big pot to be analysed later, you can start to do some event-based processing, and that enables you to be a lot more reactive. So rather than waiting an hour for the next analytical run to happen, perhaps, in your environment the unification tool can go, "Oh, that's an exception, and I've been given some routing rules to say 'This exception is particularly important. I'm going to go and ping Joe Bloggs and tell him that that exception has happened now.' "

David Brown

I'm glad you brought up the tooling because your book talks extensively about Fluentd, which is an extensible open source framework for data collection that filters and routes logs for their consumption. When you look at a diagram of Fluentd, how it routes logs from the source to their destinations, it looks kind of like a middleware ESB-type tool. But it's specifically designed for log analytics. And you mentioned it can do more than just routing of logs. It can actually do some event-based processing as well. So can you just run us through how Fluentd can help with this unified logging?

Phil Wilkins

Sure. So, the first thing in unifying your logs is to be able to gather the log content up from a vast pool of resources. That could be your system logs and SNMP traps in your infrastructure, through to many, many different types of application logging formats. We're in the world of polyglot now, so your enterprise might be running a combination of .NET solutions and Java and Node, and the list goes on.

And they don't all work with the same log formats; being able to cope with that is important. Then, once you've started to ingest those, you need to do a number of things. One, you've got to decide whether the log event is of help or use. Sometimes, particularly with more brittle solutions, you may be deployed and everything's running smoothly, but there are debug logs being put into your logging.

And people become nervous of any change. So, rather than changing that system's log configuration, or even its code, to change the log thresholds, it's easier to say, "Okay, we will put rules in to filter that out in the unification process." We don't take it any further than where we've grabbed it from, and you're not polluting your aggregated views of all the logs with undue noise. You can also route it to different systems. In large organisations you'll get specialist teams dealing with monitoring of your solution. Traditionally, your sysadmins will work with tools like Nagios that are focused more on the infrastructure. Other tools are more focused on application logs; if you feed into Logz.io and things like that, you are more oriented towards supporting the AppDev and AppOps teams.
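As a concrete illustration, a Fluentd filter along these lines can drop that debug noise during unification without touching the brittle system itself. This is a minimal sketch; the tag (`legacy.app`) and field name (`level`) are assumptions for illustration, not from the conversation:

```
# Drop DEBUG-level events from a brittle system without changing it.
# Tag and record field names are illustrative.
<filter legacy.app>
  @type grep
  <exclude>
    key level
    pattern /DEBUG/
  </exclude>
</filter>
```

The `grep` filter is part of Fluentd's core plugin set, so a rule like this lives entirely in the unification layer.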

And they want a different set. They're less interested in the minutiae of what's happening on the server, unless it's significant, and want to know more about what's happening at the application and database layers, in terms of how their SQL is performing and what SQL is being executed. So you can start to route the right events out to the different tools, rather than saying, "Okay, everybody's got to use this one tool. There's an enterprise-wide edict that says we shall only use this." You can start to think about best of breed, if you want. So, that's one of the key use cases.

The one that I like to show, to get people thinking about it, is more the social alerting, if you like, or the collaborative mechanisms, where you can tease out specific events which are forewarners of something significant. You can then filter those out when they occur and send a signal to someone, saying, "Look, I've had an event I've recognised as a warning of a bigger problem." And if you get there quickly, because you've got, say, five minutes before things go belly up, you can get in there and prevent it rather than cure it, which is a lot more useful.
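A minimal Fluentd sketch of that idea: every event still lands in the aggregated store, while exception-looking events are copied in parallel onto a notification path. The tags, paths, and match pattern are illustrative assumptions:

```
# Sketch only: copy each event to storage, and relabel a copy so
# exception-like events can be picked out for immediate attention.
<match app.**>
  @type copy
  <store>
    @type file
    path /var/log/fluent/aggregated
  </store>
  <store>
    @type relabel
    @label @ALERTS
  </store>
</match>

<label @ALERTS>
  <filter app.**>
    @type grep
    <regexp>
      key message
      pattern /Exception/
    </regexp>
  </filter>
  <match app.**>
    # A real setup would use a chat or email output plugin here.
    @type stdout
  </match>
</label>
```

`copy`, `relabel`, `grep`, and `stdout` are all core Fluentd plugins; only the final output would need swapping for a real notification channel.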

David Brown

Interesting. Yeah. And I'm just thinking about how we get the data into Fluentd in the first place. I imagine with popular frameworks there are pre-built connectors for Fluentd to ingest the logs from popular systems. What if my system is not covered by an out-of-the-box connector? What do I need to do to prepare my data so it can be consumed by Fluentd?

Phil Wilkins

So, the simplest and most common approach is to just let your application log as it already does, which is typically to a file or rotating file; sometimes people will write to a database. Then, rather than connecting your application directly to Fluentd through an appender in your logging framework, which is the more optimal route, you can set Fluentd up to say, "Alright, I'm going to tail that file," or "[I'm] going to go every minute and grab the latest entries in that database."

You know, when you're developing, if you're working in a Linux environment, you'll be familiar with the idea of "tail -f", where you are just literally watching the end of a log file as events go through. Well, Fluentd's got some fairly sophisticated connectors that are able to do that for you, so it's hoovering up the events as they go. And that way you make no invasive change to the application, which is ideal for those sensitive, brittle use cases; those legacy systems that everybody's terrified of touching, but that are so critical to your business that the smoother they keep running, the better.
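That tail-style source looks roughly like this in Fluentd configuration; the paths and tag are illustrative assumptions:

```
# Follow an application's log file, much like `tail -f`, with no change
# to the application. The pos_file remembers the read position across
# Fluentd restarts, so no lines are lost or double-read.
<source>
  @type tail
  path /var/log/legacy-app/app.log
  pos_file /var/log/fluent/legacy-app.pos
  tag legacy.app
  <parse>
    @type none   # keep each line as raw text under the "message" key
  </parse>
</source>
```

With a format-aware parser in place of `none`, the same source can also split each line into structured fields as it is ingested.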

David Brown

And is it a client-server setup? Is there an agent running on my server, a Fluentd agent, which is collecting and sending those logs to a Fluentd server, or am I streaming the logs to a server?

Phil Wilkins

So, you can set it up either way, and this is the beauty of Fluentd, because it has to deal with, particularly, a lot of microservices and IoT. You can deploy it as a central solution and either stream to it, so it receives the streams, or, if your network will allow it, it can reach out and connect. But the more common model is to put agents out, using Fluentd in its agent model, and deploy it close to the application.

And in the world of microservices, you can see this happen in a number of ways. You can deploy Fluentd as a sidecar, using that kind of deployment pattern. If you're using a service mesh, then there's an element of Fluentd engaged with Istio, for example. But even in your legacy environments, you could put a small-footprint agent Fluentd node on your server, right next to the application, and run it as a parallel process, because it's such a small footprint.

And there's a version of Fluentd, I call it the "little brother," because it uses the exact same principles but is stripped back, with its kernel written in C. It's called Fluent Bit. Fluent Bit has such a small footprint that it's very easy to deploy into Internet of Things devices. And rather than doing any of that processing and filtering, it's designed to just grab and forward. So it's a true agent in that sense, but Fluentd can act as the server as well as an agent.
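A minimal Fluent Bit agent doing exactly that grab-and-forward role might look like the following; the paths and host name are assumptions for illustration:

```
# Read local log lines and forward everything, unprocessed, to a
# central Fluentd node listening on the standard forward port.
[INPUT]
    Name   tail
    Path   /var/log/app/*.log
    Tag    edge.app

[OUTPUT]
    Name   forward
    Match  *
    Host   fluentd-aggregator.internal
    Port   24224
```

The `forward` protocol is shared between Fluent Bit and Fluentd, which is what lets the lightweight agent and the central server plug together.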

David Brown

You also wrote that, “log processing is only as good as the logs that are generated,” that's a quote from your book. [It] sounds like the old adage of “garbage in, garbage out.” So, how do we generate better log data?

Phil Wilkins

Yes, you couldn't have summarised it better than "garbage in, garbage out." If you are writing logs and just treating them as quick hacks to help you debug and do local testing, then your logs are going to be difficult to understand, if not meaningless, a year, two years, five years down the line when you are no longer involved with it, and the messages you've put in there are a bit unique to your understanding. So, the best thing you can do, whether you write it to a file or use Fluentd or something else, is to think about the content that you're putting into the message and make sure that it's meaningful, but data-aware. If you are dealing with a financial system, just dumping the entire transaction could create some real headaches, because you could be writing sensitive data into the log.

So, you have to start thinking that logging is almost as important as the transaction itself that you are processing. And the more semantically meaningful the log and the more insight you offer into it, the better. So, pumping out the key variables that affect how your application is behaving into a log entry is always going to make your logs more useful, and providing it in a structured manner will mean that it's an awful lot easier to start expressing rules over it, whether that's in the unification layer of Fluentd or even downstream, when you start to do log analytics. If you understand the structure of the data being logged, it's an awful lot easier to tease out meaningful insights and activity.
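To make that concrete, here is a minimal Python sketch of structured, data-aware logging. The logger name and field names are illustrative assumptions, not taken from the book:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured context passed via `extra={"context": {...}}`.
        # Only named fields go in; we never dump a whole transaction,
        # which keeps sensitive data out of the log.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def make_order_logger(stream):
    """Build an illustrative 'orders' logger writing JSON lines to `stream`."""
    logger = logging.getLogger("orders")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

A call such as `log.info("payment declined", extra={"context": {"order_id": "A-1001"}})` then yields one self-describing JSON line that both a person and a unification rule can act on.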

David Brown

[That] makes sense. What's the difference between an audit event and a log event?

Phil Wilkins

So, very much the same sort of thing; an audit event normally differentiates itself by the fact that it's going to be used beyond just understanding operational state and application behaviour. You can record audit events through your logging framework as well, but they're there not only to help you understand your application and what it's up to; you will use them to support evidence of compliance and things like that, or for dealing with security matters. So, who signed in and out, and when, for example, you can characterise as an audit event. Just the fact that someone's logged in, or that the login logic is running, is more of a traditional log event, because you can't tease out that meaning from it.

So, the audit is very much there to be a trail and show what people are doing and what's happening in your system. So, if someone comes and says they think there's a data leak, you can go to your audit trails and go, "Okay, this is what actually went on, and I can demonstrate to governance bodies that my system is running true and correct."

David Brown

Why is it important to distinguish between audit events and log events? Are we treating them so differently in terms of unified logging?

Phil Wilkins

In the short term, you're probably not going to treat them that differently. But the key difference is that because it's supporting compliance, you're likely to have rules about how long you retain that information, and you might need to store it slightly separately, so that it's easier to pull out and present if you have to show evidence of compliance activities.

David Brown
You've devoted a section of your book to "Achieving Clear Language," which is followed by "human and machine-readable." Can you expound on the importance of these factors, clear language and human and machine readability?

Phil Wilkins


Yeah, so all developers have probably done it some time or another, got really bored writing log entries and put something funny in there. You know, it's going to throw an exception. So, I put “Geronimo!” in there or something like that, just to lighten the process, because it can get tedious if you were to write lots of very dry log entries. But you know, if you do that and leave that there, that really doesn't tell anybody anything meaningful when it’s not you. I will know what it means because I wrote it and know where to go looking for it and perhaps it might reflect a particular exception that was annoying me during testing. But you know, for an Ops team, even in the DevOps environment where the developer is involved in the operations sooner, you're going to move on to a new project or a new product and someone's gotta keep your solution alive.

So yeah, the more meaningful it is, the easier it is to understand that statement in the eyes of another person, the better it's going to be. So, we need to think about that, and therefore we do need to be aware of our semantics and our technical language. If you're going to use specific terms, that's great, because it helps in understanding the meaning; if, in the language of accounting, a particular type of transaction has a name, call it that. But make sure that there is a dictionary of your terms, if you like. And you can add more meaning, particularly when you're dealing with error scenarios, by using things like unique error codes. That allows you to attach far more comprehensive information explaining what the causes of the error can be and what the remediation is.

So, you're giving more meaning than just, "Oh, I've caught an exception, here's the stack trace, move along." Adding that detail is really helpful. And then making it machine-readable as well comes back to the workload involved in processing it. If you make the log events easier to process, by giving them structure, then it's easier in the event stream to make them actionable. If I get an event with this attribute, which has a particular value, that's a lot easier than trying to run a regex across stream-of-consciousness text to tease out and say, "Actually, I need to tell someone now about this rather than just send it to the log analytics platform," or "route the count of this type of event in the last 30 seconds into Prometheus."
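As a sketch of why structure plus error codes pays off, assume a hypothetical catalogue keyed by code; every name here is invented for illustration:

```python
import json

# Hypothetical error-code catalogue: each code carries a cause and a
# remediation, so an alert can say more than "an exception happened".
ERROR_CATALOGUE = {
    "DB-503": {
        "cause": "database connection pool exhausted",
        "remediation": "check pool sizing and long-running transactions",
    },
}


def actionable_events(raw_lines):
    """Yield structured events carrying a known error code.

    Because each line is a JSON object, selection is a key lookup on an
    attribute, not a regex over stream-of-consciousness text.
    """
    for line in raw_lines:
        event = json.loads(line)
        code = event.get("error_code")
        if code in ERROR_CATALOGUE:
            yield {**event, **ERROR_CATALOGUE[code]}
```

The same key-lookup pattern is what a unification rule or analytics query would do; the catalogue lets it attach cause and remediation automatically.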

David Brown 

In that regard, what are you suggesting? Like a JSON format or an XML or a simple CSV, so long as it's machine readable?

Phil Wilkins 

Yeah, it's all down to the culture of the organisation; some formats are better than others. JSON's better than XML because it's less verbose but still carries the readable meaning. With CSVs, at least you can see each of the values, but you put more cognitive workload on the consumer looking at it, because you've got to know what each column is. And you really don't want to repeat the header every time you write a CSV row; you can do it, but it's just harder to read.

David Brown

And Fluentd is dealing natively in JSON format, right?

Phil Wilkins

Internally, it processes everything as JSON. Every log event will get a very basic JSON event structure applied to it, even if you're only sending it text, because what it does is take the log event, treat it as an element called a message, and attach a timestamp to it. And you can link metadata to that as well. Then you can start doing things like examining the payload in your configuration, because it's able to interpret JSON very easily.
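As a rough Python approximation of that internal shape (a sketch, not Fluentd's actual code), a bare text line becomes a tag, timestamp, and record triple:

```python
import time


def to_fluentd_style_event(tag, line, now=None):
    """Approximate how Fluentd wraps a bare text line: every event is a
    (tag, timestamp, record) triple, with the raw text under "message"."""
    return {
        "tag": tag,
        "time": now if now is not None else time.time(),
        "record": {"message": line},
    }
```

Routing rules then match on the tag, and filters examine keys inside the record, which is why even plain-text sources become queryable once inside Fluentd.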

David Brown

Right. We've talked about clear language, achieving clear language. What about context? So, can you explain what is the context of logging?

Phil Wilkins

So, context is all about helping the person looking at the log understand under what conditions an error occurred. To help you diagnose things, you need to know what's going on around it. Imagine yourself in a forest, but you can't see. You hear a loud thump. Now, is that a tree falling? Or is it a wild animal running around the forest that's about to steamroll you? The more context you're given, the better you can understand the problem or the situation you're in, and therefore what to do.

You hear that tree go down. Well, if you can also feel the wind blowing against your face really strongly, you've probably got a storm, and that might just be a tree being blown over; whilst that's not great, unless it happens right next to you it's not a problem. But if that thump sounds more like a crash against something hard, you're probably going to want to run, because that could be a bear coming for you.

Yeah, in that forest. What I'm trying to say is that the more information you can associate with the event when you record it, the easier it is to diagnose and determine the course of action. So, if we take that to, say, a database connection issue being thrown: what's the URL of that database? Is it a particular database that's causing the problems? And that helps not only in the short term, but also in the log analytics phase: okay, is it the same database that threw a wobbly and caused me a connection issue before? Is it the same one causing me problems because a fault is actually developing, but it's intermittent?
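In code terms, that advice amounts to attaching named context to the failure when it is recorded. This Python sketch uses invented field names (`db_url`, `attempt`) to show the idea:

```python
import json


def describe_failure(exc, *, db_url, attempt):
    """Build a context-rich, structured record for a connection failure.

    With the database URL recorded as its own field, analytics can later
    group failures per database instead of parsing message text.
    """
    return json.dumps({
        "level": "ERROR",
        "message": "database connection failed",
        "error_type": type(exc).__name__,
        "db_url": db_url,
        "attempt": attempt,
    })
```

Asking "is it always the same database?" then becomes a group-by on `db_url` rather than detective work across free-text messages.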

David Brown 

It's interesting; there's so much value in what you're saying. And do you find that logging, this attention to the language, the machine-readable formats, the context, is getting as much love as it should in the developer community?

Phil Wilkins


To be honest, I think we can always do better. We're always under pressure to deliver and get things out the door. So we tend to write logging when we're thinking about and testing our own application, and we tend to think about it from the viewpoint of what do I need now, rather than looking at the application again and asking, well, could this code be alive in 10 years? You know, I don't want someone asking me about code I wrote years and years ago, if I'm still in the same organisation. Or worse, something's gone wrong in the middle of the night, and the first tier of support have just got me out of bed at two in the morning after I've been out and had my cocktails.

And they're asking me what this log message means, because they're trying to figure out how to get an application back on its feet, and they've got management screaming at them that a major system is down and the business is losing revenue. So yeah, the more you do to your logs to make it easier for people to deal with those situations, and most importantly the unexpected, the better. We write code to deal with the expected conditions; it's the unexpected ones that are always the issue. The more we do to help ourselves and think about those, the better our lives are going to be.

David Brown 

And your book talks extensively about Fluentd. Fluentd is a tool set for unified logging, but the book itself obviously goes into the principles of logging in great detail as well. Is Fluentd the only game in town, or are there other solutions for unified logging?

Phil Wilkins 

Unified logging? No, there are quite a lot of solutions out there. Log unification is a newer idea, so there's a smaller set of products for it, but probably the biggest one that lines up with Fluentd that people have heard of is Logstash, from the Elastic organisation, which is part of the ELK stack. And you can swap Logstash for Fluentd and it becomes the EFK stack. So there are options out there in that direction. You can also go to the more classic aggregation model and log analytics; there are plenty of well-known products that do that. Splunk is probably one of the best-known commercial ones, which again has this agent model with the ability to interpret and grab a lot of different data sources. But the key difference is it tends to work on the basis of pumping it back to the Splunk storage and processing things there.

David Brown

What about the public cloud providers? What are they doing in this space?

Phil Wilkins 

So they're interesting. The hyperscalers, Google, AWS, Oracle (I'm not so familiar with Azure), have all actually built support for, or leverage, big chunks of the Fluentd tool set under the hood. Oracle, for example, uses a lot of it and can ingest Fluentd events. So they give you an endpoint in your account and you can fire your events straight at it, as if you're talking to a Fluentd node, which makes life really easy. GCP was actually one of the first to start adopting the Fluentd framework and building it into part of the logging mechanisms on their cloud-native platform. So a lot of these providers are offering means to consume Fluentd events.

Phil Wilkins

Some of them also allow you to pull log events out of their environment using Fluentd connectors, which makes life really easy. So if you're using a PaaS or a SaaS service where you can't get at the monitoring that's going on, you might be able to pick up some of the behavioural information by going and examining its log collection with Fluentd, and pump it into whatever system you want to use. And that's really useful when you're getting into a multicloud or hybrid use case.

David Brown

Really interesting stuff. The book is called "Logging in Action," published by Manning. Phil, how can our listeners keep in touch with what you're writing about? Do you use particular social media channels or a blogging platform?

Phil Wilkins

So I'm on WordPress, and I blog across a number of subjects, including extra tidbits and information that support and work with the book. I can be found at two addresses. One is mp3monster.org, or blog.mp3monster.org; that's pretty easy to remember. The other one is cloud-native.info, which is perhaps the more meaningful one, given the content I write about a lot of the time.

David Brown

Fantastic. Phil Wilkins. Thank you very much for your time today.

Phil Wilkins

Thank you.

