Friday, November 1, 2019

Counting Resources Per Group

This week I was able to revisit something I'd made for a previous customer, and this time I think I made it even better.

In both cases, the customers were MSPs who wanted to generate statistics about how many objects were in particular groups so they could easily bill their customers for services delivered. The first iteration worked just fine for the original customer, but it was difficult to implement for the second because the groups were not only named differently, the group structure itself was completely different.

The Datasource Itself

You'll need the datasource file imported into your portal for any of this to work for you. I'm not publishing the datasource configuration file or the dashboard configuration file yet, for several reasons. Once I do, I'll update this post.

The second iteration was an opportunity to innovate, so I stuck to the "S" in SOLID: single responsibility. I decided that my datasource shouldn't try to do anything too specific with groups. Instead, I started by creating a datasource that pulls a list of all groups and gets the counts for every group. This puts all the data in the hands of the user and lets them filter out the things that aren't needed.

Filtering to only report on certain groups

In order to make filtering easier, I added two instance level properties to the datasource definition: depth and fullPath.  Before I go into that, let's see what an example folder structure might look like. This example will be referenced throughout this post.

Example Folder Structure

/ (root)
├── Clients
│   ├── Acme
│   │   ├── Collectors
│   │   ├── Network
│   │   ├── Security
│   │   ├── Server
│   │   └── Virtual Machines
│   └── NetQoS
│       ├── Collectors
│       ├── Network
│       ├── Server
│       └── Virtual Machines
├── Devices By Type
├── Minimal Monitoring
└── My Devices
    ├── Office A
    └── Office B

fullPath

fullPath contains the actual path of the folder. This can be used in a filter to only include folders under certain other folders. For example, since I know my customers' devices are sorted into the Acme and NetQoS folders, I could add a filter like this:

auto.fullPath CONTAIN "Clients/"

This would limit discovery to only those folders containing the string "Clients/", which are the following:

Clients/Acme
Clients/Acme/Collectors
Clients/Acme/Network
Clients/Acme/Security
Clients/Acme/Server
Clients/Acme/Virtual Machines
Clients/NetQoS
Clients/NetQoS/Collectors
Clients/NetQoS/Network
Clients/NetQoS/Server
Clients/NetQoS/Virtual Machines

This gets us halfway to where we want to be. It includes the main client folders (Acme and NetQoS), but it also includes the groups under them (the children of Acme and NetQoS, which are grandchildren of Clients).

Depth

Depth refers to each folder's distance from the root. Setting a filter like this:

auto.depth == 2

would get the following folders:

Clients/Acme
Clients/NetQoS
My Devices/Office A
My Devices/Office B

Combining the two filters would return only the main client folders (Acme & NetQoS). At any rate, some combination of these two filters should allow the user to filter down to just the folders for which counts are wanted. If no filters are applied, data is collected for each and every group.
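Applied together (when multiple discovery filters are defined, all of them must match), the pair from this example would look like this, leaving just Clients/Acme and Clients/NetQoS:

auto.fullPath CONTAIN "Clients/"
auto.depth == 2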

The Data

The data comes from the LM API. Normally, data in LM is collected from a target and stored under the resource in LM that corresponds to that target. The simplest, most direct way to do this is to add your LM portal to LM as a device. In my case, since my portal is lmstuartweenig, I added "lmstuartweenig.logicmonitor.com" as a device using expert mode. I then disabled everything that was discovered by default (because I don't really need to monitor that; to each his own).

Alternatively, you could avoid adding another resource to your portal by having the datasource collect the data from the API but store/associate the data with one of your collectors. This saves you a license. Either way, you need a resource in your resource tree with which the data can be associated. In my case, I added my portal as a resource. The only difference is whether or not you burn a license.

Once you've identified the resource with which the data will be associated, you need to add a few properties. These properties serve two purposes: 1) they signal LM to enable the datasource and tell it where to associate the data, and 2) they provide some key information the datasource will need to access the API. Add the following properties to the device:

billing_api.account: This is the account name for your portal. In my case, since my portal address is lmstuartweenig.logicmonitor.com, I'll store "lmstuartweenig" in this property.
billing_api.id: Generate an API token and store the ID here.
billing_api.pass: Store the API key here.

Once this is done, you can wait 15 minutes and the datasource should have discovered the groups and started collecting data. At this point, you may want to go to the datasource definition and tweak the filters for your purposes.
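I'm not publishing the datasource itself yet (see above), but to give you a feel for what the collection script does with these three properties, here's a minimal sketch of authenticating to the LM REST API and listing resource groups with their host counts. The LMv1 signing is the standard recipe; the query fields and printed output here are illustrative, not a copy of the real script:

import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec
import groovy.json.JsonSlurper

def account = hostProps.get("billing_api.account")
def apiId   = hostProps.get("billing_api.id")
def apiKey  = hostProps.get("billing_api.pass")

// LMv1 signature: base64(hex(hmac-sha256(verb + epoch + requestBody + resourcePath)))
def epoch        = System.currentTimeMillis().toString()
def resourcePath = "/device/groups"
def mac          = Mac.getInstance("HmacSHA256")
mac.init(new SecretKeySpec(apiKey.getBytes(), "HmacSHA256"))
def hex       = mac.doFinal(("GET" + epoch + "" + resourcePath).getBytes()).encodeHex().toString()
def signature = hex.getBytes().encodeBase64().toString()

// size=50 is why billing_api.query_limit exists (see the caveat later in this post)
def conn = new URL("https://${account}.logicmonitor.com/santaba/rest${resourcePath}" +
                   "?size=50&fields=id,fullPath,numOfHosts").openConnection()
conn.setRequestProperty("Authorization", "LMv1 ${apiId}:${signature}:${epoch}")

// v1-style response wrapper: { status, errmsg, data: { total, items: [...] } }
new JsonSlurper().parse(conn.inputStream).data.items.each { group ->
    println("${group.fullPath}: ${group.numOfHosts}")
}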

Datapoints

The following datapoints are gathered for each folder:

host_count: The total count of all hosts in the group (including hosts in sub-groups)
gcp_count: How many GCP devices (not sure if it includes sub-groups, I think it does)
azure_count: How many Azure devices (not sure if it includes sub-groups, I think it does)
aws_count: How many AWS devices (not sure if it includes sub-groups, I think it does)
Total_Count: Simply the sum of host_count, gcp_count, azure_count, and aws_count

Adding Billing Information

Since one of the original purposes of this datasource was to bill customers, it made sense to build in some billing capability. This component is optional and will show "No Data" until it is configured (which I'll detail below).

The resulting set of folders normally represents customers. Some customers pay different prices for the same services (gold customers may pay more, for example), and different types of objects can have different costs, so you can add a few properties that tell the datasource to calculate costs as part of collection, with a lot of flexibility.

These properties can be added at the device level, which applies the same unit cost to every customer unless a unit cost is applied at the instance level. Let me be clear: the device and instances I'm talking about here are the ones with which the data is associated. Putting these properties on the groups you want to track will have no effect. In my case, I added these properties to the lmstuartweenig.logicmonitor.com resource, then added some overrides on the instances under the "Resource Group Member Counts" and "Website Group Member Counts" datasources.

Specifically, you can add the following properties:
billing_api.aws_cost: Unit cost of AWS instances
billing_api.azure_cost: Unit cost of Azure instances
billing_api.gcp_cost: Unit cost of GCP instances
billing_api.resource_cost: Unit cost of devices/resources

Once any of these properties are added, their values feed into new datapoints, populated using the formulas shown below:

aws_total_cost = aws_count X billing_api.aws_cost
azure_total_cost = azure_count X billing_api.azure_cost
gcp_total_cost = gcp_count X billing_api.gcp_cost
host_total_cost = host_count X billing_api.resource_cost
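For example (hypothetical numbers): if a customer's folder reports aws_count = 12 and you set billing_api.aws_cost = 2.50, the aws_total_cost datapoint for that instance reports 12 X 2.50 = 30.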

Using these newly populated datapoints, you can create a dashboard to display all these metrics along with their corresponding cost. Additionally or alternatively, you could create reports containing all the billing data. I recommend dashboards sent out as reports because they are prettier and easier to digest than a flat table of data. Dashboards also allow you to simplify the verbiage.

One other little caveat: by default, this datasource only pulls 50 groups. If you have more than 50 groups in your resource and/or website structure, you'll need to add a property called billing_api.query_limit with a value greater than the total number of folders, counting all nested folders, in the larger of the two structures.

Creating the Dashboard

There's a lot that goes into creating a dashboard. Luckily, I've done all the work for you, so you can just import the dashboard. However, when importing, you need to understand how the data is filtered down to one customer. The dashboard makes use of tokens, which are just variables that you create at the dashboard level. In particular, you need to create the ##customer## token, where the value is the name of the customer. It's important that the value of this token match the name of the folder containing the customer's resources and websites. In other words, for website data to show up alongside resource data, the folders should be named the same (technically they don't have to be, because the filter looks for any discovered folders containing the value of ##customer##).

In my case, my website folder structure is a bit simpler than my resource group structure:
Websites/
├── Acme
├── My Stuff
└── NetQoS

Since I have a folder for Acme and a folder for NetQoS, the ##customer## token can cover both my resources and my websites. After selecting the dashboard definition file to import, all I do is change the dashboard name and update the token. As long as the data is there, it should show everything. Every widget (and you can see the configuration of each one) references the token, so they all filter down to only show data concerning the customer of interest.

Additional (Accidental) Uses

One interesting discovery: if you're only presenting the data through the dashboard, you can do all the filtering at the dashboard level. Meaning: you could actually discover and track every group's metrics, but by properly using the dashboard token, you can filter down to the one group you're interested in showing.

Another discovery: if you choose your token wisely, you can get multiple customers' data together on one dashboard to get a breakdown of the counts per customer. The token just has to match all the customer groups you want to show. Luckily, in my case, my customer groups live under "Clients", so the token value would be "Clients-" (the "/" in the path is replaced with a "-" when the instance is named, so Clients/Acme becomes the instance Clients-Acme).

Notice in the legends of the pie charts that there's one slice for NetQoS' resources and a different slice for Acme's resources. (FYI: ignore the oddness in the top trend plots; a configuration change right before this snapshot caused some historical data to be lost and gathering to start over.)

Tuesday, September 3, 2019

Groovy SNMPwalk Helper Functions

tl;dr - Go here.

I've been doing a lot of SNMP polling through Groovy lately. One of the methods I use most often is Snmp.walkAsMap(), which returns a map that looks like this:
[8.6:2, 4.10:1500, 8.7:1, 8.8:1, 4.12:1500, 4.14:1500, 4.16:1500, 3.44:6, 22.5:0.0, 22.4:0.0, 
22.3:0.0, 22.2:0.0, 22.1:0.0, 22.8:0.0, 22.7:0.0, 22.6:0.0, 17.16:164136, 16.44:6639007, 17.12:8634086, 
18.8:0, 17.14:8745523, 18.7:0, 17.10:21076, 10.6:0, 10.5:0, 10.4:0, 10.3:0, 10.2:3968863511, 10.1:8980402, 
22.44:0.0, 18.6:0, 18.5:0, 7.1:1, 18.4:0, 7.2:1, 18.3:0, 7.3:1, 18.2:0, 7.4:1, 18.1:0, 7.5:1, 10.8:1684285, 
7.6:2, 10.7:4181064938, 7.7:1, 7.8:1, 15.44:0, 16.12:2114735865, 16.10:4077101, 16.16:16467729, 
16.14:2361323502, 22.14:0.0, 22.16:0.0, 9.44:2 days, 14:54:42.04, 21.6:0, 21.5:0, 21.4:0, 21.3:0, 
21.2:0, 21.1:0, 22.10:0.0, 22.12:0.0, 21.44:0, 21.8:0, 21.7:0, 5.10:4294967295, 4.44:1500, 5.12:4294967295, 
5.14:4294967295, 11.10:0, 5.16:4294967295, 10.44:1620014, 11.12:12487957, 17.8:30812, 11.14:12673362, 
17.7:17502386, 6.1:, 17.6:0, 6.2:90:e6:ba:59:1b:18, 11.16:136053, 17.5:0, 6.3:00:50:56:c0:00:01, 
17.4:20226, 6.4:00:50:56:c0:00:08, 17.3:20229, 6.5:52:54:00:9b:4b:e0, 17.2:37200831, 6.6:52:54:00:9b:4b:e0, 
17.1:44466, 6.7:02:42:39:1b:cc:e3, 6.8:02:42:55:44:3f:06, 10.10:0, 20.7:0, 20.6:0, 20.5:0, 20.4:0, 20.3:0, 
20.2:0, 20.1:0, 10.12:30849041, 10.14:131456967, 10.16:77954842, 20.8:0, 5.1:10000000, 16.8:17045376, 
5.2:100000000, 16.7:189459674, 5.3:0, 16.6:0, 5.4:0, 16.5:0, 5.5:0, 16.4:0, 5.6:10000000, 16.3:0, 5.7:0, 
16.2:1009509757, 5.8:0, 16.1:8980402, 6.12:4e:c1:4e:e6:07:90, 5.44:4294967295, 6.14:0a:2c:4f:3a:eb:5c, 
6.16:12:db:e0:82:ca:a2, 19.44:0, 6.10:06:31:50:31:a5:fb, 15.12:0, 14.44:0, 15.10:0, 21.16:0, 15.16:0, 
21.14:0, 15.14:0, 20.44:0, 21.12:0, 1.12:12, 1.10:10, 15.1:0, 4.1:65536, 4.2:1500, 4.3:1500, 15.8:0, 
21.10:0, 4.4:1500, 15.7:0, 4.5:1500, 15.6:0, 4.6:1500, 15.5:0, 4.7:1500, 15.4:0, 4.8:1500, 15.3:0, 
15.2:0, 14.10:0, 20.16:0, 14.14:0, 20.14:0, 13.44:0, 14.12:0, 20.12:0, 1.16:16, 1.14:14, 20.10:0, 14.16:0, 
6.44:62:d0:3f:14:ef:85, 7.12:1, 7.14:1, 7.16:1, 7.10:1, 14.2:0, 14.1:0, 3.1:24, 3.2:6, 3.3:6, 3.4:6, 3.5:6, 
14.8:0, 3.6:6, 14.7:0, 3.7:6, 14.6:0, 3.8:6, 14.5:0, 14.4:0, 14.3:0, 2.14:veth9a64fa1, 2.12:veth2bc8fbd, 
1.44:44, 2.10:veth92ca0b0, 19.14:0, 19.16:0, 19.10:0, 19.12:0, 18.44:0, 13.3:0, 13.2:0, 13.1:0, 2.1:lo, 
2.2:NVIDIA Corporation MCP77 Ethernet, 2.16:veth497df7f, 2.3:vmnet1, 2.4:vmnet8, 2.5:virbr0, 2.6:virbr0-nic, 
2.7:br-6a2604a91ac1, 13.8:0, 2.8:docker0, 13.7:0, 13.6:0, 13.5:0, 13.4:0, 18.16:0, 8.16:1, 8.14:1, 
17.44:18857, 18.12:0, 18.14:0, 18.10:0, 8.12:1, 7.44:1, 8.10:1, 13.10:0, 12.44:0, 13.14:0, 13.12:0, 
3.14:6, 2.44:vethd608288, 3.12:6, 3.10:6, 12.4:0, 12.3:0, 12.2:2662, 12.1:0, 1.1:1, 1.2:2, 1.3:3, 1.4:4, 
1.5:5, 1.6:6, 1.7:7, 1.8:8, 13.16:0, 9.1:0:00:00.00, 12.8:0, 9.2:0:00:00.00, 12.7:0, 9.3:0:00:06.31, 12.6:0, 
9.4:0:00:06.31, 12.5:0, 9.5:0:00:09.32, 9.6:0:00:12.32, 9.7:3:03:43.67, 9.8:0:00:18.32, 12.10:0, 11.44:4506, 
12.12:0, 3.16:6, 12.14:0, 12.16:0, 9.16:3:03:43.67, 9.14:3:03:43.67, 19.8:0, 19.7:0, 19.6:0, 8.44:1, 
9.12:3:03:43.67, 9.10:0:00:18.32, 11.5:0, 11.4:0, 11.3:0, 11.2:28692938, 11.1:44466, 19.5:0, 19.4:0, 
19.3:0, 8.1:1, 19.2:0, 8.2:1, 19.1:0, 8.3:1, 11.8:6882, 8.4:1, 11.7:25297372, 8.5:2, 11.6:0]



This is great and usable, but I found myself constantly wanting an easier way to display the data. I also wanted the data structured a little more hierarchically, grouped by row in the SNMP table, and I wanted an easier way to address individual pieces of the data. So I built a couple of helper functions that transform the data and make it easier to address. They can be found here.
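The real helpers are at the link above, but conceptually snmpMapToTable() is just a pivot; a minimal sketch (not the published code) looks like this:

// Pivot walkAsMap's flat ["column.instance" : value] map into [instance : [column : value]]
def snmpMapToTable(Map walkResult) {
    def table = [:].withDefault { [:] }
    walkResult.each { key, value ->
        def dot = key.indexOf(".")           // split on the FIRST dot only, since compound
        def col = key.substring(0, dot)      // instance indexes (e.g. "1.2.3") can contain
        def idx = key.substring(dot + 1)     // dots of their own
        table[idx][col] = value
    }
    return table
}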




The output of the snmpMapToTable() function looks like this:
[6:[8:2, 22:0.0, 10:0, 18:0, 7:2, 21:0, 17:0, 6:52:54:00:9b:4b:e0, 20:0, 16:0, 5:10000000, 15:0, 4:1500, 
3:6, 14:0, 2:virbr0-nic, 13:0, 1:6, 12:0, 9:0:00:12.32, 19:0, 11:0], 10:[4:1500, 17:21076, 16:4077101, 22:0.0, 
5:4294967295, 11:0, 10:0, 6:06:31:50:31:a5:fb, 15:0, 1:10, 21:0, 14:0, 20:0, 7:1, 2:veth92ca0b0, 19:0, 18:0, 
8:1, 13:0, 3:6, 12:0, 9:0:00:18.32], 7:[8:1, 22:0.0, 18:0, 10:4181064938, 7:1, 21:0, 17:17502386, 
6:02:42:39:1b:cc:e3, 20:0, 16:189459674, 5:0, 15:0, 4:1500, 14:0, 3:6, 2:br-6a2604a91ac1, 13:0, 1:7, 
12:0, 9:3:03:43.67, 19:0, 11:25297372], 8:[8:1, 22:0.0, 18:0, 10:1684285, 7:1, 21:0, 17:30812, 
6:02:42:55:44:3f:06, 20:0, 16:17045376, 5:0, 15:0, 4:1500, 14:0, 3:6, 13:0, 2:docker0, 1:8, 12:0, 
9:0:00:18.32, 19:0, 11:6882], 12:[4:1500, 17:8634086, 16:2114735865, 22:0.0, 5:4294967295, 
11:12487957, 10:30849041, 6:4e:c1:4e:e6:07:90, 15:0, 21:0, 1:12, 14:0, 20:0, 7:1, 2:veth2bc8fbd, 
19:0, 18:0, 8:1, 13:0, 3:6, 12:0, 9:3:03:43.67], 14:[4:1500, 17:8745523, 16:2361323502, 22:0.0, 
5:4294967295, 11:12673362, 10:131456967, 6:0a:2c:4f:3a:eb:5c, 21:0, 15:0, 14:0, 20:0, 1:14, 7:1, 
2:veth9a64fa1, 19:0, 8:1, 18:0, 13:0, 3:6, 12:0, 9:3:03:43.67], 16:[4:1500, 17:164136, 16:16467729, 
22:0.0, 5:4294967295, 11:136053, 10:77954842, 6:12:db:e0:82:ca:a2, 21:0, 15:0, 20:0, 1:16, 14:0, 
7:1, 19:0, 2:veth497df7f, 18:0, 8:1, 13:0, 3:6, 12:0, 9:3:03:43.67], 44:[3:6, 16:6639007, 22:0.0, 
15:0, 9:2 days, 14:54:42.04, 21:0, 4:1500, 10:1620014, 5:4294967295, 19:0, 14:0, 20:0, 
13:0, 6:62:d0:3f:14:ef:85, 1:44, 18:0, 17:18857, 7:1, 12:0, 2:vethd608288, 11:4506, 8:1], 5:[22:0.0, 
10:0, 18:0, 7:1, 21:0, 17:0, 6:52:54:00:9b:4b:e0, 20:0, 16:0, 5:0, 4:1500, 15:0, 3:6, 14:0, 2:virbr0, 
13:0, 1:5, 12:0, 9:0:00:09.32, 11:0, 19:0, 8:2], 4:[22:0.0, 10:0, 18:0, 7:1, 21:0, 17:20226, 
6:00:50:56:c0:00:08, 20:0, 5:0, 16:0, 4:1500, 15:0, 3:6, 14:0, 2:vmnet8, 13:0, 12:0, 1:4, 9:0:00:06.31, 
11:0, 19:0, 8:1], 3:[22:0.0, 10:0, 18:0, 7:1, 21:0, 6:00:50:56:c0:00:01, 17:20229, 20:0, 5:0, 16:0, 
4:1500, 15:0, 3:6, 14:0, 13:0, 2:vmnet1, 12:0, 1:3, 9:0:00:06.31, 11:0, 19:0, 8:1], 2:[22:0.0, 
10:3968863511, 7:1, 18:0, 21:0, 6:90:e6:ba:59:1b:18, 17:37200831, 20:0, 5:100000000, 16:1009509757, 
4:1500, 15:0, 14:0, 3:6, 13:0, 2:NVIDIA Corporation MCP77 Ethernet, 12:2662, 1:2, 9:0:00:00.00, 
11:28692938, 19:0, 8:1], 1:[22:0.0, 10:8980402, 7:1, 18:0, 21:0, 6:, 17:44466, 20:0, 5:10000000, 
16:8980402, 15:0, 4:65536, 14:0, 3:24, 13:0, 2:lo, 12:0, 1:1, 9:0:00:00.00, 11:44466, 8:1, 19:0]]





It may not look much better, but if you add some carriage returns and tabs you get this:
[
6:[
  8:2, 22:0.0, 10:0, 18:0, 7:2, 21:0, 17:0, 6:52:54:00:9b:4b:e0, 
  20:0, 16:0, 5:10000000, 15:0, 4:1500, 3:6, 14:0, 2:virbr0-nic, 
  13:0, 1:6, 12:0, 9:0:00:12.32, 19:0, 11:0], 
10:[
  4:1500, 17:21076, 16:4077101, 22:0.0, 5:4294967295, 11:0, 10:0, 
  6:06:31:50:31:a5:fb, 15:0, 1:10, 21:0, 14:0, 20:0, 7:1, 2:veth92ca0b0, 
  19:0, 18:0, 8:1, 13:0, 3:6, 12:0, 9:0:00:18.32], 
7:[
  8:1, 22:0.0, 18:0, 10:4181064938, 7:1, 21:0, 17:17502386, 6:02:42:39:1b:cc:e3, 
  20:0, 16:189459674, 5:0, 15:0, 4:1500, 14:0, 3:6, 2:br-6a2604a91ac1, 13:0, 1:7, 
  12:0, 9:3:03:43.67, 19:0, 11:25297372], 
8:[
  8:1, 22:0.0, 18:0, 10:1684285, 7:1, 21:0, 17:30812, 6:02:42:55:44:3f:06, 20:0, 
  16:17045376, 5:0, 15:0, 4:1500, 14:0, 3:6, 13:0, 2:docker0, 1:8, 12:0, 9:0:00:18.32, 
  19:0, 11:6882], 
12:[
  4:1500, 17:8634086, 16:2114735865, 22:0.0, 5:4294967295, 11:12487957, 10:30849041, 
  6:4e:c1:4e:e6:07:90, 15:0, 21:0, 1:12, 14:0, 20:0, 7:1, 2:veth2bc8fbd, 19:0, 18:0, 8:1, 
  13:0, 3:6, 12:0, 9:3:03:43.67], 
14:[
  4:1500, 17:8745523, 16:2361323502, 22:0.0, 5:4294967295, 11:12673362, 10:131456967, 
  6:0a:2c:4f:3a:eb:5c, 21:0, 15:0, 14:0, 20:0, 1:14, 7:1, 2:veth9a64fa1, 19:0, 8:1, 18:0, 
  13:0, 3:6, 12:0, 9:3:03:43.67], 
16:[
  4:1500, 17:164136, 16:16467729, 22:0.0, 5:4294967295, 11:136053, 10:77954842, 
  6:12:db:e0:82:ca:a2, 21:0, 15:0, 20:0, 1:16, 14:0, 7:1, 19:0, 2:veth497df7f, 18:0, 
  8:1, 13:0, 3:6, 12:0, 9:3:03:43.67], 
44:[
  3:6, 16:6639007, 22:0.0, 15:0, 9:2 days, 14:54:42.04, 21:0, 4:1500, 10:1620014, 
  5:4294967295, 19:0, 14:0, 20:0, 13:0, 6:62:d0:3f:14:ef:85, 1:44, 18:0, 17:18857, 
  7:1, 12:0, 2:vethd608288, 11:4506, 8:1], 
5:[
  22:0.0, 10:0, 18:0, 7:1, 21:0, 17:0, 6:52:54:00:9b:4b:e0, 20:0, 16:0, 5:0, 4:1500, 
  15:0, 3:6, 14:0, 2:virbr0, 13:0, 1:5, 12:0, 9:0:00:09.32, 11:0, 19:0, 8:2], 
4:[
  22:0.0, 10:0, 18:0, 7:1, 21:0, 17:20226, 6:00:50:56:c0:00:08, 20:0, 5:0, 16:0, 
  4:1500, 15:0, 3:6, 14:0, 2:vmnet8, 13:0, 12:0, 1:4, 9:0:00:06.31, 11:0, 19:0, 8:1], 
3:[
  22:0.0, 10:0, 18:0, 7:1, 21:0, 6:00:50:56:c0:00:01, 17:20229, 20:0, 5:0, 16:0, 4:1500, 
  15:0, 3:6, 14:0, 13:0, 2:vmnet1, 12:0, 1:3, 9:0:00:06.31, 11:0, 19:0, 8:1], 
2:[
  22:0.0, 10:3968863511, 7:1, 18:0, 21:0, 6:90:e6:ba:59:1b:18, 17:37200831, 20:0, 
  5:100000000, 16:1009509757, 4:1500, 15:0, 14:0, 3:6, 13:0, 
  2:NVIDIA Corporation MCP77 Ethernet, 12:2662, 1:2, 9:0:00:00.00, 11:28692938, 19:0, 8:1], 
1:[
  22:0.0, 10:8980402, 7:1, 18:0, 21:0, 6:, 17:44466, 20:0, 5:10000000, 16:8980402, 15:0, 
  4:65536, 14:0, 3:24, 13:0, 2:lo, 12:0, 1:1, 9:0:00:00.00, 11:44466, 8:1, 19:0]
]




As you can see, it's getting easier to see what's going on. At this point, it'd be nice to order by instance ID and then by metric ID. One of the helper functions I built, pprintSnmpWalkTable, does just that:
Wildvalue: 1:
Data sorted by column ID
  1.##WILDVALUE##: 1
  2.##WILDVALUE##: lo
  3.##WILDVALUE##: 24
  4.##WILDVALUE##: 65536
  5.##WILDVALUE##: 10000000
  6.##WILDVALUE##: 
  7.##WILDVALUE##: 1
  8.##WILDVALUE##: 1
  9.##WILDVALUE##: 0:00:00.00
  10.##WILDVALUE##: 8980402
  11.##WILDVALUE##: 44466
  12.##WILDVALUE##: 0
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 8980402
  17.##WILDVALUE##: 44466
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0
Wildvalue: 2:
Data sorted by column ID
  1.##WILDVALUE##: 2
  2.##WILDVALUE##: NVIDIA Corporation MCP77 Ethernet
  3.##WILDVALUE##: 6
  4.##WILDVALUE##: 1500
  5.##WILDVALUE##: 100000000
  6.##WILDVALUE##: 90:e6:ba:59:1b:18
  7.##WILDVALUE##: 1
  8.##WILDVALUE##: 1
  9.##WILDVALUE##: 0:00:00.00
  10.##WILDVALUE##: 3968863511
  11.##WILDVALUE##: 28692938
  12.##WILDVALUE##: 2662
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 1009509757
  17.##WILDVALUE##: 37200831
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0
Wildvalue: 3:
Data sorted by column ID
  1.##WILDVALUE##: 3
  2.##WILDVALUE##: vmnet1
  3.##WILDVALUE##: 6
  4.##WILDVALUE##: 1500
  5.##WILDVALUE##: 0
  6.##WILDVALUE##: 00:50:56:c0:00:01
  7.##WILDVALUE##: 1
  8.##WILDVALUE##: 1
  9.##WILDVALUE##: 0:00:06.31
  10.##WILDVALUE##: 0
  11.##WILDVALUE##: 0
  12.##WILDVALUE##: 0
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 0
  17.##WILDVALUE##: 20229
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0
Wildvalue: 4:
Data sorted by column ID
  1.##WILDVALUE##: 4
  2.##WILDVALUE##: vmnet8
  3.##WILDVALUE##: 6
  4.##WILDVALUE##: 1500
  5.##WILDVALUE##: 0
  6.##WILDVALUE##: 00:50:56:c0:00:08
  7.##WILDVALUE##: 1
  8.##WILDVALUE##: 1
  9.##WILDVALUE##: 0:00:06.31
  10.##WILDVALUE##: 0
  11.##WILDVALUE##: 0
  12.##WILDVALUE##: 0
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 0
  17.##WILDVALUE##: 20226
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0
Wildvalue: 5:
Data sorted by column ID
  1.##WILDVALUE##: 5
  2.##WILDVALUE##: virbr0
  3.##WILDVALUE##: 6
  4.##WILDVALUE##: 1500
  5.##WILDVALUE##: 0
  6.##WILDVALUE##: 52:54:00:9b:4b:e0
  7.##WILDVALUE##: 1
  8.##WILDVALUE##: 2
  9.##WILDVALUE##: 0:00:09.32
  10.##WILDVALUE##: 0
  11.##WILDVALUE##: 0
  12.##WILDVALUE##: 0
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 0
  17.##WILDVALUE##: 0
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0
Wildvalue: 6:
Data sorted by column ID
  1.##WILDVALUE##: 6
  2.##WILDVALUE##: virbr0-nic
  3.##WILDVALUE##: 6
  4.##WILDVALUE##: 1500
  5.##WILDVALUE##: 10000000
  6.##WILDVALUE##: 52:54:00:9b:4b:e0
  7.##WILDVALUE##: 2
  8.##WILDVALUE##: 2
  9.##WILDVALUE##: 0:00:12.32
  10.##WILDVALUE##: 0
  11.##WILDVALUE##: 0
  12.##WILDVALUE##: 0
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 0
  17.##WILDVALUE##: 0
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0
Wildvalue: 7:
Data sorted by column ID
  1.##WILDVALUE##: 7
  2.##WILDVALUE##: br-6a2604a91ac1
  3.##WILDVALUE##: 6
  4.##WILDVALUE##: 1500
  5.##WILDVALUE##: 0
  6.##WILDVALUE##: 02:42:39:1b:cc:e3
  7.##WILDVALUE##: 1
  8.##WILDVALUE##: 1
  9.##WILDVALUE##: 3:03:43.67
  10.##WILDVALUE##: 4181064938
  11.##WILDVALUE##: 25297372
  12.##WILDVALUE##: 0
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 189459674
  17.##WILDVALUE##: 17502386
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0
Wildvalue: 8:
Data sorted by column ID
  1.##WILDVALUE##: 8
  2.##WILDVALUE##: docker0
  3.##WILDVALUE##: 6
  4.##WILDVALUE##: 1500
  5.##WILDVALUE##: 0
  6.##WILDVALUE##: 02:42:55:44:3f:06
  7.##WILDVALUE##: 1
  8.##WILDVALUE##: 1
  9.##WILDVALUE##: 0:00:18.32
  10.##WILDVALUE##: 1684285
  11.##WILDVALUE##: 6882
  12.##WILDVALUE##: 0
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 17045376
  17.##WILDVALUE##: 30812
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0
Wildvalue: 10:
Data sorted by column ID
  1.##WILDVALUE##: 10
  2.##WILDVALUE##: veth92ca0b0
  3.##WILDVALUE##: 6
  4.##WILDVALUE##: 1500
  5.##WILDVALUE##: 4294967295
  6.##WILDVALUE##: 06:31:50:31:a5:fb
  7.##WILDVALUE##: 1
  8.##WILDVALUE##: 1
  9.##WILDVALUE##: 0:00:18.32
  10.##WILDVALUE##: 0
  11.##WILDVALUE##: 0
  12.##WILDVALUE##: 0
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 4077101
  17.##WILDVALUE##: 21076
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0
Wildvalue: 12:
Data sorted by column ID
  1.##WILDVALUE##: 12
  2.##WILDVALUE##: veth2bc8fbd
  3.##WILDVALUE##: 6
  4.##WILDVALUE##: 1500
  5.##WILDVALUE##: 4294967295
  6.##WILDVALUE##: 4e:c1:4e:e6:07:90
  7.##WILDVALUE##: 1
  8.##WILDVALUE##: 1
  9.##WILDVALUE##: 3:03:43.67
  10.##WILDVALUE##: 30849041
  11.##WILDVALUE##: 12487957
  12.##WILDVALUE##: 0
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 2114735865
  17.##WILDVALUE##: 8634086
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0
Wildvalue: 14:
Data sorted by column ID
  1.##WILDVALUE##: 14
  2.##WILDVALUE##: veth9a64fa1
  3.##WILDVALUE##: 6
  4.##WILDVALUE##: 1500
  5.##WILDVALUE##: 4294967295
  6.##WILDVALUE##: 0a:2c:4f:3a:eb:5c
  7.##WILDVALUE##: 1
  8.##WILDVALUE##: 1
  9.##WILDVALUE##: 3:03:43.67
  10.##WILDVALUE##: 131456967
  11.##WILDVALUE##: 12673362
  12.##WILDVALUE##: 0
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 2361323502
  17.##WILDVALUE##: 8745523
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0
Wildvalue: 16:
Data sorted by column ID
  1.##WILDVALUE##: 16
  2.##WILDVALUE##: veth497df7f
  3.##WILDVALUE##: 6
  4.##WILDVALUE##: 1500
  5.##WILDVALUE##: 4294967295
  6.##WILDVALUE##: 12:db:e0:82:ca:a2
  7.##WILDVALUE##: 1
  8.##WILDVALUE##: 1
  9.##WILDVALUE##: 3:03:43.67
  10.##WILDVALUE##: 77954842
  11.##WILDVALUE##: 136053
  12.##WILDVALUE##: 0
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 16467729
  17.##WILDVALUE##: 164136
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0
Wildvalue: 44:
Data sorted by column ID
  1.##WILDVALUE##: 44
  2.##WILDVALUE##: vethd608288
  3.##WILDVALUE##: 6
  4.##WILDVALUE##: 1500
  5.##WILDVALUE##: 4294967295
  6.##WILDVALUE##: 62:d0:3f:14:ef:85
  7.##WILDVALUE##: 1
  8.##WILDVALUE##: 1
  9.##WILDVALUE##: 2 days, 14:54:42.04
  10.##WILDVALUE##: 1620014
  11.##WILDVALUE##: 4506
  12.##WILDVALUE##: 0
  13.##WILDVALUE##: 0
  14.##WILDVALUE##: 0
  15.##WILDVALUE##: 0
  16.##WILDVALUE##: 6639007
  17.##WILDVALUE##: 18857
  18.##WILDVALUE##: 0
  19.##WILDVALUE##: 0
  20.##WILDVALUE##: 0
  21.##WILDVALUE##: 0
  22.##WILDVALUE##: 0.0



For me, this is pretty easy to read and find individual values that I'm looking for. The three lines that resulted in the above output were these:
walkResult = Snmp.walkAsMap(host, Oid, props, timeout) // flat map keyed by "column.instance"
entryRaw = snmpMapToTable(walkResult)                  // pivot into one entry per instance
pprintSnmpWalkTable(entryRaw)                          // pretty-print, sorted by instance and column ID



Accessing the data is pretty easy now. (Here, ifEntryRaw and ifXEntryRaw are walks of ifTable and ifXTable respectively, each pivoted through snmpMapToTable; the three lines above built only one such table.)
ifEntryRaw.each { wildvalue, data ->
 println("""Interface ${wildvalue}:
 ifAlias: ${ifXEntryRaw[wildvalue]["1"]}
 ifDescr: ${data["2"]}
 ifType: ${data["3"]}
    """)
}



This gives us the following output:
Interface 6:
 ifAlias: virbr0-nic
 ifDescr: virbr0-nic
 ifType: 6

Interface 10:
 ifAlias: veth92ca0b0
 ifDescr: veth92ca0b0
 ifType: 6

Interface 7:
 ifAlias: br-6a2604a91ac1
 ifDescr: br-6a2604a91ac1
 ifType: 6

Interface 8:
 ifAlias: docker0
 ifDescr: docker0
 ifType: 6

Interface 12:
 ifAlias: veth2bc8fbd
 ifDescr: veth2bc8fbd
 ifType: 6

Interface 14:
 ifAlias: veth9a64fa1
 ifDescr: veth9a64fa1
 ifType: 6

Interface 16:
 ifAlias: veth497df7f
 ifDescr: veth497df7f
 ifType: 6

Interface 44:
 ifAlias: vethd608288
 ifDescr: vethd608288
 ifType: 6

Interface 5:
 ifAlias: virbr0
 ifDescr: virbr0
 ifType: 6

Interface 4:
 ifAlias: vmnet8
 ifDescr: vmnet8
 ifType: 6

Interface 3:
 ifAlias: vmnet1
 ifDescr: vmnet1
 ifType: 6

Interface 2:
 ifAlias: enp0s10
 ifDescr: NVIDIA Corporation MCP77 Ethernet
 ifType: 6

Interface 1:
 ifAlias: lo
 ifDescr: lo
 ifType: 24

Friday, July 5, 2019

Discovering Enumerated Properties

tl;dr - the scripts are here and here.

In the same vein as the previous post, this post will talk about SNMP enumeration. While the previous post talked about polling enumerated values and how to display them as intuitively as possible, this post will talk about polling more static values and using them as properties.

There are two levels of properties that can be obtained via SNMP: device level properties (that pertain to the whole device) and instance level properties (that pertain to a particular thing on the device, of which there may be more than one).  Polling those properties is easy; this post will go over how to improve the quality of the data stored so that the data can be intuitively used.

Polling properties vs. data points

Polling properties is easy. Knowing which OIDs to poll as properties and which to poll as data points is a separate discussion entirely. Suffice it to say that things that represent a characteristic of an object should be properties, and things that represent a behavior should be data points. Depending on the tool, only data points can be alerted on, which may influence a decision to make something that would normally be a property a data point as well. In other cases, data points can't be used to influence enablement of monitoring, so sometimes data points need to be properties too.
Assuming you're past the point where you've looked at the MIB and figured out which OIDs should be properties and which should be data points, the next thing to look at is whether the OIDs you'll be storing as properties have enumerations.

Enumerations

An enumeration is used in SNMP to keep the simple in Simple Network Management Protocol. Instead of passing back a string of ASCII characters, a map is defined that ties each meaningful string to a single integer, and only the integer is passed back to the NMS. Passing back a single integer is much simpler than passing back a string of varying length.
For example, the admin status of an interface is found at 1.3.6.1.2.1.2.2.1.7. It's not a particularly good example of a property since it lends itself more to a data point, but having it as a property can allow for advanced filtering which may only be available using properties.
ifAdminStatus OBJECT-TYPE
    SYNTAX  INTEGER {
                up(1),       -- ready to pass packets
                down(2),
                testing(3)   -- in some test mode
            }
    MAX-ACCESS  read-write
    STATUS      current
    DESCRIPTION
            "The desired state of the interface.  The testing(3) state
            indicates that no operational packets can be passed.  When a
            managed system initializes, all interfaces start with
            ifAdminStatus in the down(2) state.  As a result of either
            explicit management action or per configuration information
            retained by the managed system, ifAdminStatus is then
            changed to either the up(1) or testing(3) states (or remains
            in the down(2) state)."
    ::= { ifEntry 7 }
You can see in the definition of the OID that it has a custom syntax allowing one of three values. When this OID is polled, either a 1, a 2, or a 3 is returned, and it's up to the NMS to interpret the different values to understand the meaning. If the NMS tool has MIB compilation, this interpretation may be somewhat automated; if it doesn't, interpretation is up to you. Obviously, storing a 1, 2, or 3 as a property could be sufficient, but it would be better to store the interpreted meaning, making the property easier and more intuitive for humans to use.

Interpretation of an instance property enumeration

So how do we interpret this? It's actually pretty easy, but it involves some scripting to process the returned data. It also involves some research into the MIB to find out all the meanings. Let's start with a simple case of interfaces. In most cases a MIB will contain a "Table" OID with an "Entry" child OID containing all the instances with whatever OIDs go along with the instances. In the case of interfaces (ignoring the ifXTable), the table is called "ifTable" at 1.3.6.1.2.1.2.2. You'll see that there is a child OID called "ifEntry" at 1.3.6.1.2.1.2.2.1.  This kind of table typically has an index along with perhaps some properties and data points as columns. Each row is an instance. We're going to ignore the data points for now and focus on the OIDs that would do well stored as properties. MTU(4), speed(5), and MAC address(6) are good items to store as properties. They require no interpretation. However, ifType(3) and ifAdminStatus(7) require some interpretation to be useful.
LogicMonitor's multi-instance datasource can be configured to use a Groovy script (yes, it's a real thing) to do auto-discovery, which is the mechanism that discovers poll instances and sets properties on each poll instance. The concept is pretty simple: use the Groovy SNMP libraries to retrieve the data, use Groovy to interpret the data, then print the result to standard output, one line per poll instance.
I recently wrote a script to do this. It's all self-documented, with explanations and sample output and everything (a condensed sketch of the overall shape follows the list below). Some points to consider at the following lines:
  1. This line defines the address of the Entry table. Everything we will be polling happens to live under this branch of the OID tree.
  2. This is where we define which column of the ifEntry table contains the name that we should be using for each of our instances.
  3. This line shows the information from the MIB added to the script so that the script can interpret the meaning from the returned value for ifAdminStatus
  4. This line shows the information from the MIB added to the script so that the script can interpret the meaning from the returned value for ifOperStatus
  5. This line shows that we're still polling ifMtu as a property, but there is no interpretation available for the value. We could alternatively put ["1500":"Default (1500)"] instead of [:] to tell the script to add some meaning to the most common value of MTU
  6. ifPhysAddress is the MAC address and needs no interpretation
  7. This line shows an interpretation that isn't defined in the MIB. Instead of just passing the raw value through, we provide our own interpretation of the speed to give more intuitive values for common speeds. If the speed of the interface isn't in our map, the speed itself is stored as the value; if it does happen to match one of the entries, the friendlier label is stored instead.
  8. This is where we start to define the enumeration for ifType. Turns out ifType has over 200 different enumerated values. Each of these is defined in the script so that the proper type name can be stored as a property.
  9. Here's where you can see some sample output against a device here in my house. Notice that there's one line for each port (both physical and logical).
  10. This line is a good example showing the interpreted values of ifAdminStatus, ifOperStatus, ifSpeed, and ifType.
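Since the full script is long, here's a condensed sketch of its overall shape (the enumeration map is abbreviated, and only ifAdminStatus and ifMtu are carried as properties here; the real script at the link handles ifOperStatus, ifSpeed, and all 200+ ifType values):

import com.santaba.agent.groovyapi.snmp.Snmp

def host     = hostProps.get("system.hostname")
def ifEntry  = "1.3.6.1.2.1.2.2.1"                       // everything we poll lives under ifEntry
def nameCol  = "2"                                        // ifDescr provides the instance name
def adminMap = ["1": "up", "2": "down", "3": "testing"]   // enumeration straight from the MIB

// Pivot the flat walk into one row per interface: [instance : [column : value]]
def rows = [:].withDefault { [:] }
Snmp.walkAsMap(host, ifEntry, null).each { key, value ->
    def dot = key.indexOf(".")
    rows[key.substring(dot + 1)][key.substring(0, dot)] = value
}

// One line of standard output per discovered instance; auto properties come after "####"
rows.each { idx, cols ->
    def adminStatus = adminMap[cols["7"]] ?: cols["7"]    // interpret, or pass the raw value through
    println("${idx}##${cols[nameCol]}######auto.ifadminstatus=${adminStatus}&auto.ifmtu=${cols['4']}")
}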

Interpretation of device level properties

Polling and storing device-level properties is a bit simpler, mainly because looping through instances is not required. We do have to provide each OID, the name we want the property stored under, and the interpretation, if any. All of this comes from the MIB. The script to do this is here. This example comes from the mGuard MIB, from some work I did recently. However, the OIDs can be replaced with any OID from any MIB (notice there's no baseOID), as long as the OID returns a single value, because the script does an SNMP get on that OID.
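Condensed, the shape of that script is something like this (the OID and interpretation map below are illustrative stand-ins, not entries from the mGuard MIB):

import com.santaba.agent.groovyapi.snmp.Snmp

def host = hostProps.get("system.hostname")

// property name -> [OID to get, interpretation map (empty map = store the raw value)]
def propSpecs = [
    "auto.ifadminstatus": ["1.3.6.1.2.1.2.2.1.7.1", ["1": "up", "2": "down", "3": "testing"]],
]

propSpecs.each { propName, spec ->
    def (oid, interp) = spec
    def raw = Snmp.get(host, oid)                   // a single get, so the OID must return one value
    println("${propName}=${interp[raw] ?: raw}")    // PropertySources just print key=value lines
}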

Conclusion

That's about it. These two scripts can be used to add real meaning to device and instance level properties. The only things that have to change are the data that come from the MIB itself. Perhaps one of these days I'll get around to writing a MIB parser that will output this information for all OIDs in the MIB. Yeah, when I have time.

Tuesday, July 2, 2019

Visualizing Status Codes

If you follow me on LinkedIn, you will have noticed that I changed jobs and moved from Houston to Austin. I'm now working for LogicMonitor as a Sales/Monitoring Engineer. It's a great job and I'm enjoying it a lot more than previous jobs. One of the advantages of this job is that I should have much more opportunity, desire, and content for blog posts.

If this is your first time here, know that this blog is not written for you. It's written for me. I increasingly need more and more reminders of how to do things. That goes especially for things that I devise since no one else knows it unless I tell them. This blog is primarily a place for me to keep those things written down.

Anyway, on to this blog post. LogicMonitor monitors IT infrastructure. After collecting data through various mechanisms, it stores the data in a big database in the cloud and then provides a cloud hosted front end website to display the data. Part of the display is graphs. Many times, the metrics being graphed lend themselves to being plotted on a Cartesian coordinated graph. However, sometimes the metric being polled is a status code. A good example of this is license status on Sophos' XG Firewall. This metric is found at .1.3.6.1.4.1.21067.2.1.3.4.1.
asSubStatus OBJECT-TYPE
    SYNTAX          SubscriptionStatusType
    MAX-ACCESS      read-only
    STATUS          current
    DESCRIPTION     " "
    ::= { liAntispam 1 }
The syntax is "SubscriptionStatusType", which is an enumerated type meaning that only a number is returned, but that number has a meaning depending on the different values returned. Looking at the syntax definition in the MIB will help illustrate:
SubscriptionStatusType ::= TEXTUAL-CONVENTION
        STATUS            current
        DESCRIPTION       "enumerated type for subscription status"
        SYNTAX INTEGER {
                 trial          ( 1 ),
                 unsubscribed   ( 2 ),
                 subscribed     ( 3 ),
                 expired        ( 4 )
        }
So each different value returned indicates a particular state of the license subscription. It's not like a percentage, where 100% is good and 0% is bad and values in between have meaning. It's not like a rate, where a high number is fast and a low number is slow. It only has discrete values, and values in between don't actually mean anything.
Normally, without putting in much effort, someone might easily create a graph that just plots this number, putting time on the x-axis and the value returned on the y-axis. This results in what you see here:
As you can see, it's not very helpful. It's a flat line because the status has been the same for the entire time range. That's ok. However, there's no real indicator of meaning. Some effort was made to add the meaning to the data point description (which appears in the tooltip), but the enumeration is so long that it doesn't really fit. What does a 3 mean?

Also, it's not illustrated here, but what happens if the value changes from 3 to 4? There would be two flat lines, one at 3 before the change and one at 4 after the change. But what would be shown at the change? Would it be a vertical line from 3 to 4? Would it be slightly slanted?

Also not illustrated here: what happens when larger timeframes are chosen and values are aggregated together (most often using an average)? Imagine that line at 3 that transitions to 4. What if that happened in the middle of the quarter and you viewed it at the end of the quarter? If this status was polled every hour, that would mean 2190 data points to display. That's too many. Almost every graphing solution would attempt to decrease the data points by grouping points and averaging each group. In the case of a quarterly timeframe, it might simplify by averaging all the data points for a single day together. This could be fine for most days, except the one where there was a change: that day would show an average of 3's and 4's, yielding a value of 3.5. WTH does 3.5 mean? It gets worse if you have a 2 that transitions to a 3 which then later transitions to a 4. You could end up with an average of 3, indicating no problem at all!?!? It's not intuitive; and graphs need to be intuitive.

So, what do we do?

Well, we might be tempted to normalize the data. This is actually a very good idea. Let me explain: normalizing the data transforms it onto a scale that is more intuitive. For example, we might say that we will normalize the data using the following rules:
  1. A value of 3 is good, so we'll call that 1
  2. Any other value is bad, so we'll call that 0
Pretty cool. That's a pretty good one. Any time everything is ok, we plot a 1. Any other status is undesirable, and we plot a 0. Transitions are still ugly and potentially troublesome, though. If we make one tweak, we can put some context around the resulting values: what if we used 100% instead of 1 and 0% instead of 0? If we did that, we could put some additional meaning behind the resulting values. Think about it: if everything is good, you're plotting 100%. What does that mean? It means that for 100% of the timeframe displayed, the status was good. If the status is 4, we'd see a line down at 0%, meaning that for 100% of the timeframe displayed the status was not good, or conversely: the status was good for 0% of the timeframe. We could also create another data point that is the inverse of our normalized data (see the formula sketch after the list):
  1. If value is 3, plot 100, else plot 0
  2. Plot (100 - the value from above)
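In LogicMonitor, these two rules could be complex datapoints, using the same if/in/unkn expression functions shown in the GUI notes at the end of this post (StatusCode is my assumed name for the raw datapoint):

Good = if(in(StatusCode,3),100,0)
Bad = 100 - Good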
Doing this would also let us use a more intuitive graph type: a stacked area graph. We could plot our normalized data using a pleasant color like blue or green and plot the inverse data using a warning color like red or orange. This gives us two series of data that complement each other and, when plotted as a stacked graph, look like the second graph here (the first graph is a status code plot for reference):
See how much more intuitive that is? The first graph in the above picture is a plot of the raw status code. How easy is it to know when things aren't good (without reading the axis labels)? Doable, but not instantly intuitive. Imagine you had 12 of these kinds of graphs on a single dashboard on a wall. Would you want to take the time and effort to read the axis labels of each one to know how things are going? No. Now look at the second graph. It's pretty easy to tell that there were two different problems, one between 8pm and 11am and another between 1pm and 5pm. We could even remove the y-axis labels and you'd still probably be able to tell at a glance how things are doing. Imagine 12 different graphs like this. How easy is it to see if there's a problem? Just look for the red!

There are only two drawbacks. Have you noticed? We see that there are two problems, but are they the same problem? Actually, they're not. And are they the same severity? The morning problem is that the status goes from "subscribed" to "expired". The afternoon problem is that the status goes from "subscribed" (did you notice that it returned to a good status at noon?) to "trial". The morning problem is worse than the afternoon problem. There's a way to visualize this so that it all shows up intuitively. Let me explain:

Essentially, we want to normalize the data but still keep as much detail as possible. We'll need four series, each with its own color. For any one data point, only one of the series will have a value of 100%; all the others will be 0%. For the timeframe each data point represents, whichever series has the 100% indicates the status at that moment. Here are the normalization rules, with a datapoint-formula sketch after the list:
  1. If the status code is 1, return 100, else return nothing (100 here means Trial status)
  2. If the status code is 2, return 100, else return nothing (100 here means Unsubscribed status)
  3. If the status code is 3, return 100, else return nothing (100 here means Subscribed status)
  4. If the status code is 4, return 100, else return nothing (100 here means Expired status)
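In the same expression syntax as before (again assuming the raw datapoint is named StatusCode), the four series would be:

Trial = if(in(StatusCode,1),100,unkn())
Unsubscribed = if(in(StatusCode,2),100,unkn())
Subscribed = if(in(StatusCode,3),100,unkn())
Expired = if(in(StatusCode,4),100,unkn())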
This is what it would look like (graph on the right, first two shown for reference):
Notice how the problem in the morning is highlighted in red and the problem in the afternoon is highlighted in yellow? It's easy to tell that there are two problems, that they are different, and that the morning problem is the more severe.

This is what the final version would look like in the LogicMonitor web gui (not interesting I know since the status code didn't change the whole time I was building this):



Here's how it's built in the GUI:
Notes on the screenshot:

  • it shows line types of "Area" but they should be "Stacked" to display properly
  • the formulas should be `if(in(StatusCode,3),100,unkn())`

Monday, June 3, 2019

How to share a ton of stuff with someone else


If you have the stuff to share:

  1. Download and install Resilio Sync
  2. After opening the app, click the plus sign in the top left corner and select "Standard Folder"
  3. Browse to the folder you want to share with someone else and click "Open"
  4. A new entry will appear in your list of folders. At the right end of this entry (when you mouse over the row), three dots will appear. Click the three dots and select "Copy Read Only key" or "Copy Read & Write key", depending on whether or not you want the recipient to be able to change what's in the shared folder. The key is now in your clipboard.
  5. Send the key to the person you want to share with.

If you have received a key:


  1. Download and install Resilio Sync
  2. After opening the app, click the plus sign in the top left corner and select "Enter key or link"
  3. Paste in the key that was sent to you
  4. Browse to the folder you want to synchronize and select Open.
  5. Go grab a coffee and chips and wait until the synchronization finishes.


Disclaimer: using any protocol to transmit/receive data that you are not legally allowed to transmit/receive is obviously illegal. I'm not responsible if you use this to do something illegal.

Tuesday, April 16, 2019

One Trailer for Each Piece of the Saga

Saw an article bringing together all the trailers for all the Star Wars productions. They did the episodes 1-9 first, then the ancillary productions. I decided to put them into a YouTube playlist. You're welcome.

Don't forget the one that you can't find on YouTube; it goes after Episode 6, before 7.

Friday, March 22, 2019

Web GL Globe

I have spent some time now working in the oil industry. I've been working in technology, so I haven't had much to do with actual oil. However, I did come across this really cool visualization of oil imports and exports when I was searching for a way to visualize some network performance data. It's based on a bit of code by Google called the WebGL Globe. WebGL Globe is built on a technology called WebGL, which is like OpenGL except that it runs natively in the browser. It allows for interaction like what you would expect from a 3D graphics game, but right in the browser with web code. Some of the examples are pretty cool and load extremely quickly because they don't really use the normal document structure. (Some other examples of really neat uses of WebGL and/or three.js, a library that facilitates using WebGL: Galactic Neighbors, an Internet Safety game for Kids by Google, and Lubricious.)

The concept is pretty expansive when you think about the kinds of data that can be shown. WebGL Globe combines data that has several dimensions of information encoded and visualized (the stock data format is sketched after this list):

  1. Node Location - latitude and longitude
  2. Node connections - showing that two nodes are connected in some way (shown as an arc when you click on a country in the World of Oil visualization).
  3. Connection intensity - this one can have multiple dimensions depending on how creative you get. For example:
    1. the line color itself can indicate some sort of status (red/green/yellow/orange). It's even possible that you could show percentages of the arc length as certain colors to indicate distribution of the status (i.e. 10% is bad while 90% is good might mean a line that is mostly green with a segment of length 10% that is red).
    2. the thickness of the arc can be another dimension, indicating something like volume
    3. the maximum altitude of the arc can be another, indicating sample size
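For reference, the stock WebGL Globe ingests named series of flat latitude/longitude/magnitude triples; this sketch follows the format in Google's WebGL Globe README (the series names and coordinates here are made up):

[
  ["response_time_good", [29.76, -95.37, 0.35, 30.27, -97.74, 0.12, ...]],
  ["response_time_bad", [29.76, -95.37, 0.05, 30.27, -97.74, 0.02, ...]]
]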
All this is neat, but what good does it do anyone? Well, if you've watched my Analyzing TCP Application Performance video, you know that monitoring response time is the most important part of any infrastructure monitoring. If you're doing application response time monitoring, you should have data for each transaction showing how each transaction performed. That's a huge amount of data (big data anyone?).

Visualizing that data requires a few steps of summarization and grouping. First, every transaction's metrics should be retained individually so that you can dive into the details if needed. Second, you can group by user and then summarize into time buckets (1 minute or 5 minutes). Another level of grouping is by network location. This is legitimate because most users at a particular location share a very high percentage of the network path that gets them to the services they consume.

Let me rephrase: You should have data that describes sources, destinations, and the performance between them. Sound familiar? What I'm envisioning is taking this data and plotting it out using WebGL Globe. Each network location and each service hosting location is a node (hm, could cloud services actually be represented as clouds?!?). They'd be connected with arcs. Thickness of the arcs could represent the number of transactions between those two nodes. Height of the arc could represent the recent negative variability in performance or a way to highlight the selected arc. Color of the arc could show a measure of the deviance in performance. 

Click on a node and you'd see the arcs from all the user locations consuming services from that site (if any) and the arcs to all the service hosting locations consumed by that site (if any) pop up in altitude (normally they'd be displayed at sea level or hidden based on a GUI setting). This is similar to the action you see in the Global Oil visualization when you click on a country (you see the countries connected to it). 

From there, if you saw that all the arcs were showing problems, you would know (from problem domain isolation) that there is a problem common to all users at that site. You could click on the node again and get into performance stats for the infrastructure at that location (from a tool like LogicMonitor). Following the colors should get you to the root of the problem pretty quickly.

Alternatively, if you saw a problem in only one of the arcs (or a small selection) follow problem domain isolation tactics and click on the arc having the problem. That should dive you into the infrastructure connecting those two nodes so you can find the problem.

Nodes themselves can have problems that don't evidence themselves outside the node. If that's the case, you should be going into the node to look at why there are performance issues. You'd know to go to there because the node icon itself would be showing some color indicator that there's a problem.

Friday, March 8, 2019

Using Docker

Just a couple articles that have been camping out in my browser since I found them. They were very useful in helping me get docker doing useful stuff, like Ansible.

https://blog.docker.com/2016/09/build-your-first-docker-windows-server-container/

https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-18-04

alpine/git          latest    a1d22e4b51ad      10 days ago     27.5MB
ansible-docker      latest    25b39c3ffd15      2 weeks ago     153MB
ubuntu              latest    47b19964fb50      4 weeks ago     88.1MB

Ansible-docker is an image I modified to suit my needs. You can use it by either cloning the git repository and building from source, or just pulling it from the hub:
docker pull sweenig/ansible-docker

I used this to know how to push images to docker: https://ropenscilabs.github.io/r-docker-tutorial/04-Dockerhub.html

alpine/git is the easiest way I've found to run git on Windows (hint, it's not actually running in Windows but in Linux inside the container).
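For example, cloning a repo without a local git install might look like this (illustrative repo URL; the image uses git as its entrypoint and the /git volume as its working directory, and %cd% works in cmd while $(pwd) works in a Unix-style shell):

docker run --rm -v "%cd%:/git" alpine/git clone https://github.com/example/some-repo.git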

Thursday, March 7, 2019

Object Oriented Programming

Today's blog post may have been posted before, but it's a really good one. If you are looking to get into object oriented programming, you should give this a look: https://inventwithpython.com/blog/2014/12/02/why-is-object-oriented-programming-useful-with-a-role-playing-game-example/.