Logging - information vs noise

11 June 2021 — approx 6 min read.

In this post, we will try to curate the best practices that we should follow for logging during application development.

A log file is logically equivalent to a flight recorder in an aircraft. It keeps track of everything happening inside the application context (and sometimes system or external environment-related information). Thus, whenever we need any insight into the application (be it any functional scenario investigation or troubleshooting any technical issue), the log file is our door to the mystery land.

But for a moment, let’s try to visualize the situation that we might end up in if we do not pay attention to any logging best practices or follow any templates.

We start to log everything without considering the appropriate logging levels, with no indicators to distinguish technical log messages from functional messages depicting the application workflow. This quickly turns our log file into a collection of jigsaw puzzle pieces that we first need to arrange before we can derive any meaningful information. On top of that, during a production issue (when we are almost always already short of time), this is an additional overhead.

In the absence of any logging guidelines, every developer writes log statements that match their own coding style. This makes it quite challenging to use log aggregators like the ELK stack or Splunk, which expect a set pattern when parsing log statements.

So it is of paramount importance that we log correct, relevant, and easily interpretable information using appropriate log levels in the log file. It should be like going through the lifecycle story of any event or business workflow with additional relevant information.

With that goal in mind, here are the logging practices we should follow during application development:

Logging libraries

As there are many logging libraries available these days, we should choose one that keeps our application code loosely coupled from the logging library implementation. This loose coupling lets us switch between different implementations when and if required.
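As a minimal sketch of this loose coupling, the example below uses Python's standard `logging` module, where application code talks only to a logger and the backend is attached once at startup (facades such as SLF4J play a similar role on the JVM). The logger name `shop.orders`, the `ListHandler` class, and the `configure` and `place_order` functions are hypothetical names chosen for illustration.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical in-memory backend, used here only to demonstrate a swap."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def configure(handler):
    # Called once at startup; switching the backend only touches this function.
    root = logging.getLogger("shop")
    root.handlers.clear()
    root.addHandler(handler)
    root.setLevel(logging.INFO)
    root.propagate = False


def place_order(order_id):
    # Application code depends only on the logger interface,
    # never on where or how the records end up being written.
    log = logging.getLogger("shop.orders")
    log.info("order accepted: [id: %s]", order_id)
```

Calling `configure(logging.StreamHandler())` sends the same records to the console instead, without any change to `place_order`.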

Relevant information

We should always log only the information that is relevant to the context. Of course, this takes a bit of practice and lessons from past experience, and it varies from application to application. But a few standard guidelines are almost always applicable:

Avoid logging sensitive information like credentials or confidential domain-specific data.
If log files need to be shared externally, sanitize those to ensure only allowed information is published.
Always assign a unique request identifier to all client requests. It will allow both the service provider and client to keep their investigations in sync while troubleshooting any issues. The request identifier will also enable you to trace the request lifecycle across the services when working with distributed components like micro-services.
Log the request before it is pushed to any validation or execution flow to ensure that the request details can always be traced, even if it was rejected.
Always add contextual information like timestamp, request identifiers, workflow identifiers, thread id, etc., in the log statements.
If you are maintaining a cache or local buffers in your service, make sure to log their size at regular intervals to detect any bottlenecks. We should also log cache hits and misses, which will help us tune the cache accordingly.
When relying on log files to estimate service throughput, we should log the relevant request and response details to ensure we are targeting the correct scenario.
When logging exceptions, we should include relevant details like request params, scenario, impacted records, and most importantly, the stack trace to indicate the exception’s origin.
Prefer separate log files for KPIs and instrumentation metrics (availability, request counts, latency, etc.) rather than mixing them with other application logs.
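Several of the guidelines above (a request identifier on every line, contextual fields, logging the request before validation, and stack traces on exceptions) can be sketched together with Python's standard `logging` and `contextvars` modules. This is only an illustration under assumptions: the logger name `svc.requests`, the `amount` field, and the `build_logger` and `handle_request` functions are hypothetical.

```python
import contextvars
import logging
import uuid

# Holds the identifier of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copies the current request id onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def build_logger(handler):
    handler.addFilter(RequestIdFilter())
    # Timestamp, level, thread, and request id are added to every statement.
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s [%(threadName)s] "
        "[req=%(request_id)s] %(message)s"
    ))
    log = logging.getLogger("svc.requests")
    log.handlers.clear()
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.propagate = False
    return log


def handle_request(log, payload, request_id=None):
    # Assign the id first, then log the request before validating it,
    # so that rejected requests can still be traced.
    request_id_var.set(request_id or uuid.uuid4().hex)
    # Only field names are logged here, to keep sensitive values out of the file.
    log.info("request received: [fields: %s]", sorted(payload))
    try:
        if "amount" not in payload:
            raise ValueError("missing field: amount")
        log.info("request processed: [amount: %s]", payload["amount"])
    except ValueError:
        # log.exception logs at ERROR level and appends the stack trace.
        log.exception("request rejected: [fields: %s]", sorted(payload))
```

Because the identifier lives in a context variable, every statement emitted while the request is being handled carries it without passing it through each function call.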

Logging levels

Info - Log statements with info level are primarily used for logging functional or business workflows. It should be the choice of logging level when we need to trace the lifecycle of any request.
Debug - Developer-centric information that can be switched on when required. This optional level (turned off by default in most cases) allows us to write detailed information about any scenario we might not usually need. It is a common practice to enrich debug log statements with additional information to allow easy troubleshooting.
Error - Anything that is not in line with the expected behavior, caused by either a technical issue or a business exception, should be logged with an error logging level. It enables the aggregators or monitoring tools to highlight it and raise alerts when required.
Warning - Used to log technical or workflow scenarios that did not go as expected and might lead to a problem if ignored.
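To make the four levels concrete, here is a small sketch using Python's standard `logging` module. The payment scenario, the `process_payment` function, and the `CapturingHandler` class are hypothetical and exist only to show which level fits which kind of message.

```python
import logging


class CapturingHandler(logging.Handler):
    """Hypothetical handler that stores (level, message) pairs in memory."""

    def __init__(self):
        super().__init__()
        self.events = []

    def emit(self, record):
        self.events.append((record.levelname, record.getMessage()))


def process_payment(log, payment_id, amount, retries):
    # Info: the business workflow, one statement per lifecycle step.
    log.info("payment received: [id: %s]", payment_id)
    # Debug: extra detail for developers, dropped unless the level is enabled.
    log.debug("payment details: [id: %s, amount: %s, retries: %s]",
              payment_id, amount, retries)
    if retries > 0:
        # Warning: unexpected, not yet a failure, but worth watching.
        log.warning("gateway needed retries: [id: %s, retries: %s]",
                    payment_id, retries)
    if amount <= 0:
        # Error: behavior that is not in line with expectations.
        log.error("payment rejected, invalid amount: [id: %s, amount: %s]",
                  payment_id, amount)
        return False
    log.info("payment completed: [id: %s]", payment_id)
    return True
```

With the logger level set to `logging.INFO`, the debug statement is skipped; switching the level to `logging.DEBUG` turns it on without touching the code.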

Structure

We should follow a consistent log pattern throughout the application. It allows users to predict what to expect in the log file and where to look for the details.
We should always separate log messages from parameters to enable log aggregators to extract meaningful information. For example, consider the following statements (statement ii is preferred as it is more intuitive and is easily parsable by log aggregators):

i. processing message id: 101 from user: Newton for the scenario: a and region: x
ii. processing message: [id: 101, user: Newton, scenario: a, region: x]
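The preferred pattern can be produced by one small helper so that every developer emits the same shape. The sketch below is an assumption about how such a helper might look; `structured` and `parse` are hypothetical names, and the parser is deliberately naive (it would break on values that themselves contain `, ` or `: `).

```python
import re

# Matches "event: [k1: v1, k2: v2]".
_PATTERN = re.compile(r"^(?P<event>[^\[]+): \[(?P<fields>.*)\]$")


def structured(event, **params):
    """Render the message text and its parameters as separate parts."""
    fields = ", ".join(f"{key}: {value}" for key, value in params.items())
    return f"{event}: [{fields}]"


def parse(line):
    """Recover the event text and a dict of parameters from a structured line."""
    match = _PATTERN.match(line)
    if match is None:
        return None
    pairs = (item.split(": ", 1) for item in match["fields"].split(", ") if item)
    return match["event"], dict(pairs)


line = structured("processing message",
                  id=101, user="Newton", scenario="a", region="x")
print(line)
# prints: processing message: [id: 101, user: Newton, scenario: a, region: x]
```

The same regular expression is the kind of rule a log aggregator can apply reliably, which is much harder with free-form sentences where the parameters are scattered through the text.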

Availability

Rollover strategies: We should pay attention to our log rollover strategies (based on time or number of files generated). If the application creates a lot of logs, those might not be available when required if their rollover or purging strategy is not well defined.
Logs inside containers: Containers are ephemeral and do not retain state, so any log files generated inside a container will not survive restarts or crashes. When running applications inside containers, we should always write the generated logs to volumes, bind mounts, or other external storage.
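A size-based rollover strategy can be sketched with `RotatingFileHandler` from Python's standard library. The function name `build_rotating_logger`, the logger name `svc.rotating`, and the default limits are assumptions for illustration; real limits depend on how much the application logs and how long the files must remain available.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path, max_bytes=1_000_000, backup_count=5):
    # Once the active file would exceed max_bytes, it is renamed to
    # "<path>.1", older backups shift up by one, and anything beyond
    # backup_count is deleted.
    handler = RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count, encoding="utf-8"
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    log = logging.getLogger("svc.rotating")
    log.handlers.clear()
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.propagate = False
    return log
```

Inside a container, `path` would point to a directory backed by a volume or bind mount so that the rotated files outlive the container. For time-based rollover, the standard library also provides `TimedRotatingFileHandler`.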

Overhead

If we have chosen to write logs on external storage like a filer or network drive, this will add to the overall latency, which we should consider. Similarly, when streaming log events over a network, we must be prepared for external factors like network failures.
When choosing a logging framework, the choice between synchronous and asynchronous logging plays a vital role in the overall application performance.
Log file size is another overhead that we need to consider, as it contributes to the storage or disk requirements of an application.
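One way to keep slow destinations (network drives, remote collectors) off the request path is asynchronous logging. The sketch below uses `QueueHandler` and `QueueListener` from Python's standard library; `build_async_logger` and the logger name `svc.async` are hypothetical names.

```python
import logging
import logging.handlers
import queue


def build_async_logger(slow_handler):
    # Unbounded queue: callers never block, at the cost of memory if the
    # destination stalls. A bounded queue would need a drop or block policy.
    log_queue = queue.Queue(-1)

    # The listener drains the queue on a background thread and hands each
    # record to the slow handler.
    listener = logging.handlers.QueueListener(
        log_queue, slow_handler, respect_handler_level=True
    )

    log = logging.getLogger("svc.async")
    log.handlers.clear()
    # The application thread only pays for putting the record on the queue.
    log.addHandler(logging.handlers.QueueHandler(log_queue))
    log.setLevel(logging.INFO)
    log.propagate = False

    listener.start()
    return log, listener
```

The listener's `stop()` method should be called at shutdown; it processes the records still waiting in the queue before the background thread exits. The trade-off is that records still in the queue are lost if the process is killed abruptly.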

Log Aggregation

We should use log aggregation tools, especially when working with distributed components like micro-services, to give us a single view of the application state. With unique request IDs, it becomes easy to trace the state of any event or message across the services and reduce valuable dev time spent in the analysis process.
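Aggregators generally have an easier time with machine-readable lines than with free text. As one possible approach (an assumption, not something a specific tool requires), each service could emit one JSON object per line that includes the request ID, so the aggregator can index and join on it. `JsonLineFormatter` and the field names below are hypothetical.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Formats every record as a single-line JSON object."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            # Set via the `extra` argument of the logging call, if provided.
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)
```

A call such as `log.info("order shipped: [id: %s]", 101, extra={"request_id": "abc123"})` then produces a line whose `request_id` field can be searched across all services that handled the same request.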

@jvmaware