Welcome to the chapter on Telemetry.
Before providing any formal definition for observability, we can immediately see that a core need is the ability to observe.
Putting this into the context of software, we need the ability to see what is happening in our code and to our running systems.
So, before we discuss the nuances of observability, we should all have some familiarity with the term "telemetry" and how that provides us with the visibility required.
Telemetry is data collected about anything in our systems.
This includes infrastructure, such as servers and databases; networking, such as HTTP requests; and, of course, our own software.
The system produces data as it is running.
Some of this data appears to be generated automatically because it is built into the software that we depend on.
For example, when you run a test suite, there are a lot of log messages written to the screen.
This is a type of telemetry introduced by the creators of our testing tools.
It helps us investigate our own suites, and it also helps those tool authors during their own development.
Other data is unique to our own systems.
This data is the real gem of telemetry because it is what we specifically identified as useful in our own context.
A great example of this is a custom error message for when a test fails.
Once this data is generated, we need to make it available for querying, which means we need to store it somewhere.
The most common way to do this is to write the data into a specific file on the server.
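For instance, a minimal sketch of this approach in Python might use the standard logging module to append messages to a local file; the file name and message here are purely illustrative.

```python
import logging

# A minimal sketch of file-based telemetry using Python's standard
# logging module; the file name and message are placeholders.
logging.basicConfig(
    filename="telemetry.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.info("test suite finished: 120 passed, 3 failed")
```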
Another way is to write the information as a web page that engineers can view.
This is really only useful if the system is already serving web pages.
But at this stage, we have data being collected about our system and its dependencies, and that data is stored so that we can view it.
If these systems are at all scaled or remote, though, this is not a viable solution for long.
Engineers cannot go to ten, or a hundred, or a thousand servers and look at each one individually to see its telemetry.
This becomes outright impossible if you are working in a web-scale or serverless environment.
Because of this, most people today consider telemetry complete only when it is made available in a central location.
There are a few ways we can collect this data into central locations, and we will go through one example later in this course.
There are no fixed rules for what counts as telemetry data.
It's really anything that can provide you insights into your system.
Yet over time, there have been a few common formats people have relied on.
These formats can be described by focusing on their business and technical use cases.
The first, most common use case, is needing to understand the state of the system in a specific moment.
This is usually not only a specific moment, but even more so, it is understanding how a specific piece of logic or line of code behaved.
An example is when an if statement doesn't behave the way you expect, or when you have one of those famous off-by-one errors in a loop.
In these cases, it's helpful to know the value of certain variables, what lines of code ran and in what order, and what the outcome of any given function was.
This problem space is commonly supported by logs.
Logs come in all different shapes and sizes, but there are a few characteristics that are worth calling out.
First of all, they always have a timestamp.
This is how we know the order in which things happened.
It also clearly showcases that logs are a point-in-time action.
In addition, they can support really rich data points.
Logs can include things like what version of the software is running and what user ID is making the request; details like that are really juicy when you're trying to actually debug something.
One last characteristic to look at is their shape.
Traditionally, logs would be a line of text with fields delimited by something like spaces.
More recently, people have focused on moving towards a structured approach.
With structured logs, we do not need to parse a line of text, which can be difficult.
Instead, we have key-value pairs, which are easy to use for filtering.
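To make the contrast concrete, here is a small, purely illustrative sketch in Python; the field names and values are invented for the example.

```python
import json
from datetime import datetime, timezone

timestamp = datetime.now(timezone.utc).isoformat()

# Traditional log line: fields delimited by spaces, which have to be
# split apart and interpreted by position before they can be filtered.
print(f"{timestamp} ERROR test_checkout failed user=42 version=1.4.2")

# Structured log: the same event as explicit key-value pairs, which can
# be filtered on fields like "test" or "version" without custom parsing.
print(json.dumps({
    "timestamp": timestamp,
    "level": "ERROR",
    "message": "test failed",
    "test": "test_checkout",
    "user_id": 42,
    "version": "1.4.2",
}))
```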
But not all problems need the nitty-gritty detail of logs.
Another really common use case is wanting to understand how something has behaved over time.
This can help us identify regressions or prove how improvements have actually helped.
An example of the questions we may ask here is: how has the duration of our test suite changed over time?
Or, what level of consistency do we have in our test results?
These types of questions often get answered by metrics.
Some common characteristics of metrics are that they have a name that stays consistent over time and that the value is always numeric.
While always numeric, some types of values are like thermometers: they can go up and down and track differently over time.
Others are a bit more like stopwatches, where they can only go up.
A final characteristic of metrics is that they carry a few tags or fields that can store key identifiers.
They can't hold as many specifics as logs, but storing things like the test suite name or the developer who made the commit can help us understand changes in these trends.
These tags let us break down the higher level numbers into categories and can be useful when trying to run calculations like "percentages passing".
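As a loose sketch, not tied to any particular metrics library, a metric emitted from a test pipeline could be a stable name, a numeric value, and a small set of tags; everything below is illustrative.

```python
import time

def emit_metric(name, value, tags):
    # Illustrative only: print the metric instead of sending it to a backend.
    print({"name": name, "value": value, "tags": tags, "timestamp": time.time()})

# A value that only ever goes up, like the stopwatch example above.
emit_metric("test_suite.tests_failed_total", 3,
            {"suite": "checkout", "committer": "alex"})

# A value that can go up and down over time, like the thermometer example.
emit_metric("test_suite.duration_seconds", 412.7, {"suite": "checkout"})
```

Because the name stays stable and the tags are a small, bounded set, we can later break the numbers down by suite or committer and compute things like the percentage of runs passing.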
So, if metrics help us identify when there's a problem and logs help us solve that problem in specific, what's left?
Well, sometimes we know that there is a problem, but we don't actually know where it is or what's causing it.
In these situations we may be asking questions like, which service is causing all these tests to fail?
Or which bit of our continuous integration pipeline should we prioritize making faster?
In these cases, we can use a trace visualization.
Traces show us all the working parts behind a high level request.
For example, it could show the stages of a continuous integration pipeline: build, automated unit tests, linting, deployment to test environments, and so on.
At minimum, a trace will have a single item that indicates how long something took to complete, and it can include the same rich data that logs can.
This item is described as either the parent or the root of the trace.
But most of the time, you'll also see children, or sub-spans, underneath it, which detail which parts of the system did work and for how long.
The idea is that a trace can help us narrow in on where we need to start debugging and where to look for the logs.
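As a very simplified sketch of that idea (real tracing libraries handle this for you; the stage names and sleeps below just stand in for pipeline work), a trace can be modeled as a root span with timed child spans.

```python
import time
from contextlib import contextmanager

spans = []

@contextmanager
def span(name, parent=None):
    # Record how long a named piece of work took, plus its parent span.
    start = time.time()
    try:
        yield name
    finally:
        spans.append({"name": name, "parent": parent,
                      "duration_s": round(time.time() - start, 3)})

# Illustrative pipeline stages; the sleeps stand in for real work.
with span("ci_pipeline") as root:
    with span("build", parent=root):
        time.sleep(0.1)
    with span("unit_tests", parent=root):
        time.sleep(0.2)
    with span("deploy_to_test", parent=root):
        time.sleep(0.05)

# The child spans show which stages did work and for how long,
# while the root span covers the whole pipeline run.
for s in spans:
    print(s)
```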
These different types of visualizations (the deep detail of logs, the long-term graphing of numbers, and the waterfall display of system relationships) will appear in most observability tools you use, though the data behind them can be very different.
Getting to know how to collect data in a shape that lets us use these visualizations in an effective way is the most important part of the journey towards understanding observability.
Today, it is most common to store this data independently, powering these three different visualizations with three different databases.
This is because the speed of generating each display is so important that the focus has been on data structures that keep each one very quick.
However, having three different datasets powering these three different visualizations has had its drawbacks.
In particular, it is time-consuming and noisy in the codebase to generate all these different pieces of telemetry.
It is also challenging to learn the different query languages, and to line things up from one visualization tool to another when you want to move between, say, a trend graph and the related logs.
By the end of this course, you will have hands-on experience with at least two different types of data collection so you can compare and contrast the different strategies yourself.