Thursday, August 15, 2013

Using Distributions to show Performance of Multiple Objects on a Time Scale

Many people building custom views in NV will no doubt build one of two types of views: Details Trend or Management TopN.  Unfortunately, this bypasses some of the cooler views like the distribution views.  Consider this scenario: I have multiple third-party devices and the manufacturer has provided a special MIB to monitor CPU utilization (instead of doing the smart thing like publishing their CPU statistics into the hrprocessor or UCDavis MIB OIDs).  So, I now have the opportunity to build a custom dataset to pull in the CPU utilization for these devices.  (Side note: I should probably republish my instructions on how to build a custom dataset.)
After I build the dataset, I'll start building my views.  Let's suppose that the vendor has only provided the CPU utilization as an average of all the CPUs on the device or that the device will only ever have one CPU.  The end result is that there is only one poll instance per device for that dataset.  This means that I'll only really build views on the device level and configure the views to drill down to the device page instead of the poll instance page.  After building the appropriate trends on the device page, I'd go to an overview page and build a table or bar chart to show the devices with the highest CPU utilization.  All of this is great and normal and is what most people do when building views for this kind of data.
The problem with stopping here is that there is no way to look at multiple devices over a period of time and see how the devices were performing within that timeframe.  The reason for this is that a TopN table or bar chart will display the rollup (usually the average) of the metric within the timeframe.  In the case of my custom dataset, I'd see the average CPU utilization over the last hour, last day, last week, etc.  This is OK as long as I pick one of the standard timeframes.  Notice what happens when you pick last 4 hours in NPC: a table or bar chart will only do last hour.  That's because NV hasn't pre-calculated rollups on a 4-hour basis.  So, it becomes important to show the performance of the metric over time, showing the individual values within the timeframe, be it a standard rollup period or not.
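To make the limitation of a rollup concrete, here is a small Python sketch.  It is not how NV calculates its rollups internally; the device names and readings are made up for illustration.  It only shows that two devices with very different behavior during the hour can end up with the same average in a TopN table.

```python
from statistics import mean

# Hypothetical CPU readings (percent) for the last hour at 5-minute polling:
# 12 data points per device.  Names and values are invented for this example.
readings = {
    "device-a": [50.0] * 12,              # steady at 50% the whole hour
    "device-b": [5.0] * 6 + [95.0] * 6,   # idle for 30 minutes, then pegged
}

def hourly_rollup(samples):
    """Collapse a list of samples into one value, like an average rollup."""
    return mean(samples)

rollups = {device: hourly_rollup(values) for device, values in readings.items()}

for device, value in rollups.items():
    print(f"{device}: average {value:.1f}% over {len(readings[device])} polls")
```

Both devices report the same average, so a table or bar chart built on the rollup cannot tell them apart.  Only the per-poll values show that the second device spent half the hour near its limit.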
That's where distribution views can help.  While they don't necessarily show the value of each one of the poll instances analyzed, they do categorize the metric into groups.  For example, I could build a distribution view to group the metrics like: 0-25%, 25-50%, 50-75%, 75-95%, and over 95%.  In this case, NPC would look at all the data during the timeframe (for last hour with 5-minute polling, it will look at 12 data points for each poll instance included in the context) and categorize each data point into one of the buckets I've defined.  The end result is a trend plot over time showing how many devices are in which buckets for each point in time.
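The bucketing described above can be sketched in a few lines of Python.  This is only an illustration of the idea, not NPC's actual implementation.  The function names, the sample data layout, and the choice to put a value that sits exactly on a boundary (for example 25%) into the lower bucket are all my own assumptions.

```python
from bisect import bisect_left
from collections import Counter

# Bucket boundaries taken from the example in the text.
EDGES = [25, 50, 75, 95]
LABELS = ["0-25%", "25-50%", "50-75%", "75-95%", "over 95%"]

def bucket_for(value):
    """Return the bucket label for one CPU reading (percent).

    Assumption: a value exactly on a boundary goes to the lower bucket,
    so 25 lands in "0-25%" and only values above 95 land in "over 95%".
    """
    return LABELS[bisect_left(EDGES, value)]

def distribution(samples):
    """Count devices per bucket for each poll cycle.

    samples: {poll_time: {device_name: cpu_percent}}  (hypothetical layout)
    returns: {poll_time: {label: (device_count, percent_of_devices)}}
    """
    result = {}
    for poll_time, per_device in samples.items():
        counts = Counter(bucket_for(v) for v in per_device.values())
        total = len(per_device)
        result[poll_time] = {
            label: (counts[label], 100.0 * counts[label] / total if total else 0.0)
            for label in LABELS
        }
    return result

# Two poll cycles for 10 made-up devices: 20% CPU, then 60% CPU.
devices = [f"dev{i}" for i in range(10)]
samples = {
    "10:00": {d: 20.0 for d in devices},
    "10:05": {d: 60.0 for d in devices},
}
dist = distribution(samples)
for poll_time, buckets in dist.items():
    print(poll_time, {label: cp for label, cp in buckets.items() if cp[0]})
```

Each poll cycle becomes one bar in the view, and each bucket's device count becomes one colored segment of that bar.  The percentage alongside the count corresponds to the share of devices in that bucket.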
Users need to be instructed in the proper way to interpret the view.  If the view is set up properly, the undesirable buckets will have more extreme colors (reds and oranges).  When a user sees a time period in which a larger number of devices are in the undesirable buckets, they should understand that a large number of devices have experienced higher CPU utilization.  If 10 devices' CPU utilization goes from 20% to 60%, the bars before the increase will show 10 devices in the 0-25% bucket while the bars after the increase will show 10 devices in the 50-75% bucket.  NPC also calculates the percentage of total devices in each bucket.  So, if half of my devices are in the 50-75% range, a mouseover will reveal 50% in that bucket.
This visualization can be equated to creating a pie chart for each poll cycle.  If you looked at one poll cycle for all the devices and created a pie chart with 5 slices, it would be easy to understand how many devices need attention.  Imagine taking the crust off the pie, stretching it out flat, and stacking it next to the pie crusts for the other poll cycles in the period.
One disadvantage of the distribution charts is that they lack drill down.  So, while a distribution is good for a summary page, a table showing the rollups over the same timeframe will be helpful to identify which devices are experiencing the higher CPU utilization.  This table would allow drill down to the device page, where that device's trend plot could be analyzed on its own.  It could also be compared to the rest of the data being gathered by NV for the device.