Tuesday, July 2, 2019

Visualizing Status Codes

If you follow me on LinkedIn, you will have noticed that I changed jobs and moved from Houston to Austin. I'm now working for LogicMonitor as a Sales/Monitoring Engineer. It's a great job and I'm enjoying it a lot more than previous jobs. One of the advantages of this job is that I should have much more opportunity, desire, and content for blog posts.

If this is your first time here, know that this blog is not written for you. It's written for me. I increasingly need more and more reminders of how to do things. That goes especially for things that I devise since no one else knows it unless I tell them. This blog is primarily a place for me to keep those things written down.

Anyway, on to this blog post. LogicMonitor monitors IT infrastructure. After collecting data through various mechanisms, it stores the data in a big database in the cloud and then provides a cloud hosted front end website to display the data. Part of the display is graphs. Many times, the metrics being graphed lend themselves to being plotted on a Cartesian coordinated graph. However, sometimes the metric being polled is a status code. A good example of this is license status on Sophos' XG Firewall. This metric is found at .1.3.6.1.4.1.21067.2.1.3.4.1.
asSubStatus OBJECT-TYPE
    SYNTAX          SubscriptionStatusType
    MAX-ACCESS      read-only
    STATUS          current
    DESCRIPTION     " "
    ::= { liAntispam 1 }
The syntax is "SubscriptionStatusType", which is an enumerated type meaning that only a number is returned, but that number has a meaning depending on the different values returned. Looking at the syntax definition in the MIB will help illustrate:
SubscriptionStatusType ::= TEXTUAL-CONVENTION
        STATUS            current
        DESCRIPTION       "enumerated type for subscription status"
        SYNTAX INTEGER {
                 trial          ( 1 ),
                 unsubscribed   ( 2 ),
                 subscribed     ( 3 ),
                 expired        ( 4 )
        }
So each different value returned indicates a particular state of the license subscription. It's not like a percentage where 100% is good and 0% is bad and there might be values in between. It's not like a rate, where a high number is fast and a low number is slow. It only has discreet values and values in between don't actually have any meaning.
Normally, without putting in much effort, someone might easily create a graph that just plots this number, putting time on the x-axis and the value retuned on the y-axis. This results in what you see here:
As you can see, it's not very helpful. It's a flat line because the status has been the same for the entire time range. That's ok. However, there's no real indicator of meaning. Some effort was made to add the meaning to the data point description (which appears in the tooltip). However, the enumeration is so long that it doesn't really fit in the tooltip. What does a 3 mean? Also, it's not illustrated here, but what happens if the value changes from 3 to 4? There would be two flat lines, one at 3 before the change and one at 4 after the change. But what would be shown at the change? Would it be a vertical line from 3 to 4? Would it be slightly slanted? Also not illustrated here, but what happens when larger timeframes are chosen and values are aggregated together (most often using an average)? Imagine that line at 3 that transitions to 4. What if that happened in middle of the quarter and you viewed it at the end of the quarter? If this status was polled every hour, that would mean 2190 data points to display! That's too many. Almost every graphing solution would attempt to decrease the data points by grouping points and averaging every group. In the case of a quarterly timeframe, it might simplify by averaging all data points for a single day together. This could be fine for most days, except for the one where there was a change. That would show an average of 3's and 4's, yielding a value of 3.5. WTH does 3.5 mean? It gets worse if you have a 2 that transitions to a 3 which then later transitions to a 4. You could end up with an average of 3, indicating no problem at all!?!?  It's not intuitive; and graphs need to be intuitive.

So, what do we do?

Well, we might be tempted to normalize the data. This is actually a very good idea. Let me explain: normalizing the data transforms it into a scale that is more intuitive. For example, we might say that we will normalize the data using the following rules:
  1. A value of 3 is good, so we'll call that 1
  2. Any other value is bad, so we'll call that 0
Pretty cool. That's a pretty good one. Any time everything is ok, we would plot a 1. Any other status is undesirable and we would plot a 0. Transitions are still ugly and potentially troublesome. If we make one tweak, it could allow us to put some context around the resulting values. What if we changed it to 100% instead of 1 and 0% instead of 0? If we did that, we could actually put some additional meaning behind the resulting values. Thing about it, if everything is good, you're plotting 100%. What does that mean? It means that for 100% of the timeframe displayed, the status was good. If the status is 4, we'd see a line down at 0% meaning that for 100% of the timeframe displayed, the status was not good, or conversely: the status was good 0% of the timeframe. We could also create another data point which is the inverse of our normalized data:
  1. If value is 3, plot 100, else plot 0
  2. Plot (100 - the value from above)
Doing this would also let us use a more intuitive graph called a stacked area graph. We could plot our normalized data using a pleasant color like blue or green and plot the inverse data using a warning color like red or orange. This would give us two series of data that compliment each other and when plotted as a stacked graph would look like the second graph here (the first graph is a status code plot for reference):
See how much more intuitive that is? The first graph in the above picture is a plot of the raw status code. How easy is it to know when things aren't good (without reading the axis labels)? Doable but not instantly intuitive. Imagine you had 12 of these kinds of graphs on a single dashboard on a wall. Would you want to take the time and effort to read the axis labels of each one to know how things are going? No. Now look at the second graph. It's pretty easy to tell that there were two different problems between 8pm and 11am and 1pm and 5pm.  We could even remove the y-axis labels and you'd still probably be able to tell with a glance how things are doing. Imagine 12 different graphs like this. How easy is it to see if there's a problem? Just look for the red!

There are only two drawbacks. Have you noticed? We see that there are two problems, but are they the same problem? Actually, they're not. Also, are they the same severity of problem? The morning problem is that status goes from "subscribed" to "expired". The afternoon problem is that the status goes from "subscribed" (did you notice that it returned to a good status at noon?) to "trial". The morning problem is worse than the afternoon problem.  There's a way to visualize this so that it all looks good. Let me explain:

Essentially we want to normalize the data but still keep as much detail as possible. We'll need to have four series, each with its own color. For any one data point, we'll only have a value of 100% in one of the series. All the others will be 0%. Meaning that for the timeframe that data point represents, whichever series has a value of 100% indicates the status for that moment. Let's look at the normalization rules:
  1. If the status code is 1, return 100, else return 0 (100 here means Trial status)
  2. If the status code is 2, return 100, else return 0 (100 here means Unsubscribed status)
  3. If the status code is 3, return 100, else return 0 (100 here means Subscribed status)
  4. If the status code is 4, return 100, else return 0 (100 here means Expired status)
This is what it would look like (graph on the right, first two shown for reference):
Notice how the problem in the morning is highlighted with a red and the problem in the afternoon is highlighted with a yellow? Easy to tell that there are two problems, that they are different, and that the morning problem is the more severe. 

This is what the final version would look like in the LogicMonitor web gui (not interesting I know since the status code didn't change the whole time I was building this):


Here's how it's built in the GUI:

No comments:

Post a Comment