Friday, July 5, 2019

Discovering Enumerated Properties

tl;dr - the scripts are here and here.

In the same vein as the previous post, this post will talk about SNMP enumeration. While the previous post talked about polling enumerated values and how to display them as intuitively as possible, this post will talk about polling more static values and using them as properties.

There are two levels of properties that can be obtained via SNMP: device level properties (that pertain to the whole device) and instance level properties (that pertain to a particular thing on the device, of which there may be more than one).  Polling those properties is easy; this post will go over how to improve the quality of the data stored so that the data can be intuitively used.

Polling properties vs. data points

Polling properties is easy. Knowing which properties to poll as properties and which to poll as data points is a separate discussion entirely. Suffice it to say that things that represent a characteristic about an object should be properties and things that represent a behavior should be data points. Depending on the tool, only data points can be alerted upon, so that may influence a decision to make something that would normally be a property also a data point.  In other cases, sometimes data points can't be used to influence enablement of monitoring; so sometimes data points need to be properties too.
Assuming you're past the point where you've looked at the MIB and figured out which OIDs should be properties and which should be data points, the next thing to look at is whether or not the OIDs you will be storing as properties have enumerations.

Enumerations

An enumeration is used in SNMP to keep the simple in Simple Network Management Protocol. Instead of passing back a string comprised of ASCII characters, a map is created that connects a meaningful string to a single integer, which is passed back to the NMS. Passing back a single integer is much simpler than passing back a string of characters of varying length.
For example, the admin status of an interface is found at 1.3.6.1.2.1.2.2.1.7. It's not a particularly good example of a property since it lends itself more to a data point, but having it as a property can allow for advanced filtering which may only be available using properties.
ifAdminStatus OBJECT-TYPE
    SYNTAX  INTEGER {
                up(1),       -- ready to pass packets
                down(2),
                testing(3)   -- in some test mode
            }
    MAX-ACCESS  read-write
    STATUS      current
    DESCRIPTION
            "The desired state of the interface.  The testing(3) state
            indicates that no operational packets can be passed.  When a
            managed system initializes, all interfaces start with
            ifAdminStatus in the down(2) state.  As a result of either
            explicit management action or per configuration information
            retained by the managed system, ifAdminStatus is then
            changed to either the up(1) or testing(3) states (or remains
            in the down(2) state)."
    ::= { ifEntry 7 }
You can see in the definition of the OID that it has a custom syntax that allows for one of three values. When this OID is polled, either a 1, a 2, or a 3 is returned. It's up to the NMS to interpret the different values to understand the meaning. If the NMS tool has MIB compilation, this may have a certain level of automation. If it doesn't, interpretation is up to you. Obviously, storing a 1, 2, or 3 as a property could be sufficient, but it would be better to store the interpreted meaning making use by humans easier and more intuitive.

Interpretation of an instance property enumeration

So how do we interpret this? It's actually pretty easy, but it involves some scripting to process the returned data. It also involves some research into the MIB to find out all the meanings. Let's start with a simple case of interfaces. In most cases a MIB will contain a "Table" OID with an "Entry" child OID containing all the instances with whatever OIDs go along with the instances. In the case of interfaces (ignoring the ifXTable), the table is called "ifTable" at 1.3.6.1.2.1.2.2. You'll see that there is a child OID called "ifEntry" at 1.3.6.1.2.1.2.2.1.  This kind of table typically has an index along with perhaps some properties and data points as columns. Each row is an instance. We're going to ignore the data points for now and focus on the OIDs that would do well stored as properties. MTU(4), speed(5), and MAC address(6) are good items to store as properties. They require no interpretation. However, ifType(3) and ifAdminStatus(7) require some interpretation to be useful.
LogicMonitor's multi-instance datasource can be configured to use a groovy script (yes, it's a real thing) to do auto-discovery, which is the mechanism that discovers poll instances and sets properties per poll instance. The concept is pretty simple, use the groovy SNMP libraries to retrieve the data, use groovy to interpret the data, then just print the data to standard output, one line per poll instance.
I recently wrote a script to do this. It's all self documented with explanations and sample output and everything. Some points to consider at the following lines:
  1. This line defines the address of the Entry table. Everything we will be polling happens to live under this branch of the OID tree.
  2. This is where we define which column of the ifEntry table contains the name that we should be using for each of our instances.
  3. This line shows the information from the MIB added to the script so that the script can interpret the meaning from the returned value for ifAdminStatus
  4. This line shows the information from the MIB added to the script so that the script can interpret the meaning from the returned value for ifOperStatus
  5. This line shows that we're still polling ifMtu as a property, but there is no interpretation available for the value. We could alternatively put ["1500":"Default (1500)"] instead of [:] to tell the script to add some meaning to the most common value of MTU
  6. ifPhysAddress is the MAC address and needs no interpretation
  7. This line shows an interpretation that isn't defined in the MIB. Instead of just passing the raw value through, we can provide our own interpretation of the speed to give some more intuitive values for common speeds. If the speed of the interface isn't in our map, the speed itself will be stored as the value. If it does happen to match on eof
  8. This is where we start to define the enumeration for ifType. Turns out ifType has over 200 different enumerated values. Each of these is defined in the script so that the proper type name can be stored as a property.
  9. Here's where you can see some sample output against a device here in my house. Notice that there's one line for each port (both physical and logical).
  10. This line is a good example showing the interpreted values of ifAdminStatus, ifOperStatus, ifSpeed, and ifType.

Interpretation of device level properties

Polling and storing device level properties is a bit simpler mainly because looping through instances is not required. We do have to provide each OID, the name we want the property stored under, and the interpretation, if any. All of this comes from the MIB. The script to do this is here. This example comes from the mGuard MIB, which is from some work I did recently. However, the OIDs can be replaced with any OID from any MIB (notice there's not a baseOID), as long as the OID returns a single value because the script does an SNMP get on that OID.

Conclusion

That's about it. These two scripts can be used to add real meaning to device and instance level properties. The only things that have to change are the data that come from the MIB itself. Perhaps one of these days I'll get around to writing a MIB parser that will output this information for all OIDs in the MIB. Yeah, when I have time.

Tuesday, July 2, 2019

Visualizing Status Codes

If you follow me on LinkedIn, you will have noticed that I changed jobs and moved from Houston to Austin. I'm now working for LogicMonitor as a Sales/Monitoring Engineer. It's a great job and I'm enjoying it a lot more than previous jobs. One of the advantages of this job is that I should have much more opportunity, desire, and content for blog posts.

If this is your first time here, know that this blog is not written for you. It's written for me. I increasingly need more and more reminders of how to do things. That goes especially for things that I devise since no one else knows it unless I tell them. This blog is primarily a place for me to keep those things written down.

Anyway, on to this blog post. LogicMonitor monitors IT infrastructure. After collecting data through various mechanisms, it stores the data in a big database in the cloud and then provides a cloud hosted front end website to display the data. Part of the display is graphs. Many times, the metrics being graphed lend themselves to being plotted on a Cartesian coordinated graph. However, sometimes the metric being polled is a status code. A good example of this is license status on Sophos' XG Firewall. This metric is found at .1.3.6.1.4.1.21067.2.1.3.4.1.
asSubStatus OBJECT-TYPE
    SYNTAX          SubscriptionStatusType
    MAX-ACCESS      read-only
    STATUS          current
    DESCRIPTION     " "
    ::= { liAntispam 1 }
The syntax is "SubscriptionStatusType", which is an enumerated type meaning that only a number is returned, but that number has a meaning depending on the different values returned. Looking at the syntax definition in the MIB will help illustrate:
SubscriptionStatusType ::= TEXTUAL-CONVENTION
        STATUS            current
        DESCRIPTION       "enumerated type for subscription status"
        SYNTAX INTEGER {
                 trial          ( 1 ),
                 unsubscribed   ( 2 ),
                 subscribed     ( 3 ),
                 expired        ( 4 )
        }
So each different value returned indicates a particular state of the license subscription. It's not like a percentage where 100% is good and 0% is bad and there might be values in between. It's not like a rate, where a high number is fast and a low number is slow. It only has discreet values and values in between don't actually have any meaning.
Normally, without putting in much effort, someone might easily create a graph that just plots this number, putting time on the x-axis and the value retuned on the y-axis. This results in what you see here:
As you can see, it's not very helpful. It's a flat line because the status has been the same for the entire time range. That's ok. However, there's no real indicator of meaning. Some effort was made to add the meaning to the data point description (which appears in the tooltip). However, the enumeration is so long that it doesn't really fit in the tooltip. What does a 3 mean? Also, it's not illustrated here, but what happens if the value changes from 3 to 4? There would be two flat lines, one at 3 before the change and one at 4 after the change. But what would be shown at the change? Would it be a vertical line from 3 to 4? Would it be slightly slanted? Also not illustrated here, but what happens when larger timeframes are chosen and values are aggregated together (most often using an average)? Imagine that line at 3 that transitions to 4. What if that happened in middle of the quarter and you viewed it at the end of the quarter? If this status was polled every hour, that would mean 2190 data points to display! That's too many. Almost every graphing solution would attempt to decrease the data points by grouping points and averaging every group. In the case of a quarterly timeframe, it might simplify by averaging all data points for a single day together. This could be fine for most days, except for the one where there was a change. That would show an average of 3's and 4's, yielding a value of 3.5. WTH does 3.5 mean? It gets worse if you have a 2 that transitions to a 3 which then later transitions to a 4. You could end up with an average of 3, indicating no problem at all!?!?  It's not intuitive; and graphs need to be intuitive.

So, what do we do?

Well, we might be tempted to normalize the data. This is actually a very good idea. Let me explain: normalizing the data transforms it into a scale that is more intuitive. For example, we might say that we will normalize the data using the following rules:
  1. A value of 3 is good, so we'll call that 1
  2. Any other value is bad, so we'll call that 0
Pretty cool. That's a pretty good one. Any time everything is ok, we would plot a 1. Any other status is undesirable and we would plot a 0. Transitions are still ugly and potentially troublesome. If we make one tweak, it could allow us to put some context around the resulting values. What if we changed it to 100% instead of 1 and 0% instead of 0? If we did that, we could actually put some additional meaning behind the resulting values. Thing about it, if everything is good, you're plotting 100%. What does that mean? It means that for 100% of the timeframe displayed, the status was good. If the status is 4, we'd see a line down at 0% meaning that for 100% of the timeframe displayed, the status was not good, or conversely: the status was good 0% of the timeframe. We could also create another data point which is the inverse of our normalized data:
  1. If value is 3, plot 100, else plot 0
  2. Plot (100 - the value from above)
Doing this would also let us use a more intuitive graph called a stacked area graph. We could plot our normalized data using a pleasant color like blue or green and plot the inverse data using a warning color like red or orange. This would give us two series of data that compliment each other and when plotted as a stacked graph would look like the second graph here (the first graph is a status code plot for reference):
See how much more intuitive that is? The first graph in the above picture is a plot of the raw status code. How easy is it to know when things aren't good (without reading the axis labels)? Doable but not instantly intuitive. Imagine you had 12 of these kinds of graphs on a single dashboard on a wall. Would you want to take the time and effort to read the axis labels of each one to know how things are going? No. Now look at the second graph. It's pretty easy to tell that there were two different problems between 8pm and 11am and 1pm and 5pm.  We could even remove the y-axis labels and you'd still probably be able to tell with a glance how things are doing. Imagine 12 different graphs like this. How easy is it to see if there's a problem? Just look for the red!

There are only two drawbacks. Have you noticed? We see that there are two problems, but are they the same problem? Actually, they're not. Also, are they the same severity of problem? The morning problem is that status goes from "subscribed" to "expired". The afternoon problem is that the status goes from "subscribed" (did you notice that it returned to a good status at noon?) to "trial". The morning problem is worse than the afternoon problem.  There's a way to visualize this so that it all looks good. Let me explain:

Essentially we want to normalize the data but still keep as much detail as possible. We'll need to have four series, each with its own color. For any one data point, we'll only have a value of 100% in one of the series. All the others will be 0%. Meaning that for the timeframe that data point represents, whichever series has a value of 100% indicates the status for that moment. Let's look at the normalization rules:
  1. If the status code is 1, return 100, else return 0 (100 here means Trial status)
  2. If the status code is 2, return 100, else return 0 (100 here means Unsubscribed status)
  3. If the status code is 3, return 100, else return 0 (100 here means Subscribed status)
  4. If the status code is 4, return 100, else return 0 (100 here means Expired status)
This is what it would look like (graph on the right, first two shown for reference):
Notice how the problem in the morning is highlighted with a red and the problem in the afternoon is highlighted with a yellow? Easy to tell that there are two problems, that they are different, and that the morning problem is the more severe. 

This is what the final version would look like in the LogicMonitor web gui (not interesting I know since the status code didn't change the whole time I was building this):


Here's how it's built in the GUI: