UPDATE 6/9/2015: Version 1.7 is now released. This update adds standalone support. Since CA is including newer versions of MySQL in its products, DBToolv3 no longer works, so this change lets you tell the script to use MySQLDump instead of DBToolv3. Essentially, you un-remark line 15 and remove or remark out line 14. If I get enthusiastic about it, I may update the script to accept a command-line switch that specifies which method to use. I'm just not there yet.
UPDATE 2/10/15: Version 1.6 now released. This update changes the way harvesters and DSAs are backed up, by only backing up the ReaperArchive, ReaperArchive15, and HarvesterArchive directories to a single directory (no redundant rolling backups). It only backs up files that have the archive bit set, so before running it the first time, set the archive bit for all the files in those directories. I also fixed the date naming method so it's YYYYMMDD instead of YYYYDDMM. I also added timestamping to the log so you know how long it takes to perform the file backups vs. the database backups.
UPDATE 2/27/14: Version 1.5 now released. This version doesn't have too many changes. I just added the lines below that allow the NFA mess of data files to be backed up along with everything else. This one script can still be used on any product; however, when running on a Harvester or DSA, extra commands back up the data files.
The syntax for running the tool hasn't changed since 1.4 (though 1.4 introduced some major changes), so you should be able to drop the script in place without changing any scheduled tasks.
nqbackup.bat <dbname> <num_backups_to_keep>
Remember, if you need a reminder of how to run the tool, just run it without any arguments (or just double-click it from Windows Explorer).
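For example, to run the backup nightly from Task Scheduler, something like the following would work (the script path, database name, and retention count are placeholders; adjust them for your install):
schtasks /create /tn "NetQoS nightly DB backup" /sc daily /st 01:00 /tr "D:\netqos\scripts\nqbackup.bat reporter 5"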
Friday, December 12, 2014
Enabling or Disabling the Flow Cloner in RA9.0
I know, 9.0 is an old version, but I had a customer who is transitioning and needed to temporarily enable and disable cloning of flows from the old harvesters to the new harvesters. Here's the resulting script. The first argument should be Y or N depending on whether you want to enable (Y) or disable (N) the flow cloner. The second argument is optional and is the IP address you want to clone to. If you specify the IP address, the flowclonedef.ini file is created. If you don't specify it, no changes are made.
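Assuming the script is saved as something like flowcloner.bat (the name and IP address here are just examples), usage would look like this:
rem Enable cloning and (re)create flowclonedef.ini pointing at the new harvester
flowcloner.bat Y 10.10.10.25
rem Disable cloning; with no IP address given, flowclonedef.ini is left untouched
flowcloner.bat N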
Monday, November 11, 2013
Preparing Windows 2008 for NetQoS Installations
UPDATE: I've added the ability to disable IPv6 and configure SNMP with default settings (public community string and access from any host). I've also corrected the spelling in two places.
UPDATE: There have been several changes lately that I haven't published out here on the blog. Suffice it to say that you can now re-run the script and choose which parts to run and which parts not to run. You are prompted with a yes/no dialog before each section of the script runs. Also, the .Net 4.0 uninstaller now runs as a part of the script. It runs right before Windows Updates and will reboot if .Net 4.0 is uninstalled. You'll need to rerun the script if you want it to run Windows Updates for you.
I've got some ambitious plans for the next version. I want the script to allow you to run it once manually and it will create a response file. That response file can then be used to repeat the script on any number of servers. So, if you want to only run certain parts, the first time you manually run it will create an output file with that info. Then you can use that output file as an input for future runs. This should make preparing a bunch of w2k8 servers easier.
There are a bunch of things that have to be done to a Windows 2008 server before the NetQoS software can be installed. Being the efficient (a.k.a. lazy) engineer that I am, I decided to script the whole thing. Here are the pieces of my script. Download the whole thing here. If you want to run the script without copying and pasting each piece, you must run 'Set-ExecutionPolicy RemoteSigned -Force' first; otherwise PowerShell won't allow the script to run. I usually copy and paste each part so I see each part as it happens.
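If you'd rather not change the machine-wide execution policy, you can also launch the downloaded script with a per-process policy instead (the file name and path are just an example):
powershell.exe -ExecutionPolicy RemoteSigned -File C:\temp\prep-w2k8.ps1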
Friday, August 16, 2013
Device level context switching
UPDATE: I've developed some code to make this a regular view that can be dragged onto the page without any configuration required. The following code will create the standalone view and add it to all four device context pages:
UPDATE: I've rewritten this widget to make it easier to implement. Now instead of having to specify the {Item.ItemID} variable in the browser view URL, the widget just grabs the information from the parent URL. This is also better because any additional arguments you had in the URL will continue through to the other context pages. Here's the updated code:
Now all you have to do is point to this widget in your custom content directory.
Enjoy!
You may not know about it, but NPC classifies every device as either a router, switch, server, or device. The device category is for every type of device that isn't a router, switch, or server. This is too bad, because NetVoyant actually has an extendable list of device classifications; you can make as many as you want. However, any additional classes will show up in NPC as 'devices' because NPC doesn't understand them. This is fine in most cases, but certain cases will cause problems.
For example, if I have an F5 load balancer and I'm monitoring the device in SuperAgent as well as NetVoyant, NPC has to choose whether to classify the device as a server (as SuperAgent reports it) or as a device (since NetVoyant either classifies it as 'other' or 'load balancers' if you've classified it). Turns out the NV classification is last on the list. If a device is monitored by RA or SA, NPC will classify it as a router or server, respectively, regardless of what classification exists in NV.
In this case, what I usually do is instruct customers how to switch from one context page to another after drilling in. For example, after I drill into the F5 and get to the server page, I would update the URL to read pg=d instead of pg=s. This loads the device page for the F5 instead of the server page. This can be handy since the device page may have specific F5 views on it that don't appear on the server page.
In order to make this easier, I built a simple HTML page that can be loaded into a browser view to allow quick switching between all four context view types. Here's the page:
<html>
<script type="text/javascript">
var url1='<a target="_top" href="/npc/Default.aspx?pg=';
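// Pull the ItemID out of the query string (the browser view URL passes ?ItemID={Item.ItemID})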
var str=location.search;
str=str.replace("?ItemID=","");
document.write(url1 + 'r' + '&DeviceID=' + str + '">Router</a> ');
document.write(url1 + 'sw' + '&DeviceID=' + str + '">Switch</a> ');
document.write(url1 + 'd' + '&DeviceID=' + str + '">Device</a> ');
document.write(url1 + 's' + '&DeviceID=' + str + '">Server</a> ');
<a target="_blank" href="http://stuart.weenig.com/2012/08/device-level-context-switching.html"><img src="/npc/images/DialogQuestion.gif" border=0></a>
</script>
</html>
Link to this page from a browser view with a title like 'View this device as a...' and a URL like this:
/content/viewdeviceas.html?ItemID={Item.ItemID}
As long as this page is named 'viewdeviceas.html' and it's hosted under a virtual directory on NPC's IIS web server with an alias of 'content', it should load just fine. Give it a height of 33, turn off the border, and hide the scroll bars. This makes an excellent small browser view that can go right at the top of the page, displayed right under the page tabs.
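If the 'content' virtual directory doesn't exist yet, one way to create it on IIS 7 is with appcmd; the site name and physical path below are assumptions, so adjust them to match your NPC server:
%windir%\system32\inetsrv\appcmd add vdir /app.name:"Default Web Site/" /path:/content /physicalPath:"D:\netqos\custom"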
Thursday, August 15, 2013
Using Distributions to show Performance of Multiple Objects on a Time Scale
Many people building custom views in NV will no doubt build one of two types of views: Details Trend or Management TopN. Unfortunately, this bypasses some of the cooler views like the distribution views. Consider this scenario: I have multiple third-party devices, and the manufacturer has provided a special MIB to monitor CPU utilization (instead of doing the smart thing and publishing their CPU statistics into the hrProcessor or UC Davis MIB OIDs). So, I now have the opportunity to build a custom dataset to pull in the CPU utilization for these devices. (Side note: I should probably republish my instructions on how to build a custom dataset.)
After I build the dataset, I'll start building my views. Let's suppose that the vendor has only provided the CPU utilization as an average of all the CPUs on the device or that the device will only ever have one CPU. The end result is that there is only one poll instance per device for that dataset. This means that I'll only really build views on the device level and configure the views to drill down to the device page instead of the poll instance page. After building the appropriate trends on the device page, I'd go to an overview page and build a table or bar chart to show the devices with the highest CPU utilization. All of this is great and normal and is what most people do when building views for this kind of data.
The problem with stopping here is that there is no way to look at multiple devices over a period of time and see how the devices were performing within that timeframe. The reason for this is that a TopN table or bar chart will display the rollup (usually the average) of the metric within the timeframe. In the case of my custom dataset, I'd see the average CPU utilization over the last hour, last day, last week, etc. This is OK as long as I pick one of the standard timeframes. Notice what happens when you pick last 4 hours in NPC: a table or bar chart will only do the last hour, because NV hasn't pre-calculated rollups on a 4-hour basis. So, it becomes important to show the performance of the metric over time, displaying the values within the timeframe, whether it's a standard rollup period or not.
That's where distribution views can help. While they don't necessarily show the value of each one of the poll instances analyzed, they do categorize the metric into groups. For example, I could build a distribution view to group the metrics like this: 0-25%, 25-50%, 50-75%, 75-95%, and over 95%. In this case, NPC would look at all the data during the timeframe (for the last hour with 5-minute polling, that's 12 data points for each poll instance included in the context) and categorize each data point into one of the buckets I've defined. The end result is a trend plot over time showing how many devices are in which buckets at each point in time.
Users need to be instructed in the proper way to interpret the view. If the view is set up properly, the undesirable buckets will have more extreme colors (reds and oranges). When a user sees a time period in which a larger number of devices are in the undesirable buckets, they should understand that a large number of devices has experienced higher CPU utilization. If 10 devices' CPU utilization goes from 20% to 60%, the bars before the increase will show 10 devices in the 0-25% bucket while the bars after the increase will show 10 devices in the 50-75% bucket. NPC also calculates the percentage of total devices in each bucket, so if half of my devices are in the 50-75% range, a mouseover will reveal 50% in that bucket.
This visualization can be equated to creating a pie chart for each poll cycle. If you looked at one poll cycle for all the devices and created a pie chart with 5 slices, it would be easy to understand how many devices need attention. Imagine taking the crust off the pie, stretching it out flat, and stacking it next to the pie crusts for the other poll cycles in the period.
One disadvantage of distribution charts is that they lack drill-down. So, while a distribution is good for a summary page, a table showing the rollups over the same timeframe will be helpful to identify which devices are experiencing the higher CPU utilization. That table allows drill-down to the device page, where the individual trend plot can be analyzed on its own and compared to the rest of the data being gathered by NV for the device.
Friday, June 7, 2013
NetVoyant Trap Processing
I decided to go through some of the content I've garnered over the years and make a few videos. This one shows how NetVoyant goes through its processing of incoming traps. Enjoy!
Monday, May 20, 2013
Setting a default group for an NPC report page
UPDATE: An additional tool is now available: autorefresh.html. This widget uses its own script to handle automatically refreshing a page; it does not call the built-in Auto-Refresh capability of NPC. You provide the number of minutes between refreshes in the browser view URL: /custom/autorefresh.html?interval=5
UPDATE: An additional tool now available for download lets you set the default timeframe for a page. Do you have a page that you always want to show the last 24 hours of data? Just add this widget.
UPDATE: The newest version is out. I updated the script to fix a problem when going back to NPC when a page using the default page context setter was used. I also have released a modified version that allows you to set the default IP SLA test type on an IP SLA test report page.
Page IPSLA Type Default
Page Group Default
Friday, May 17, 2013
Interface Summary Table Ultimate Tweak
A while back, I did a major customization of the Top Least Interfaces Table in NPC. This is a NetVoyant view that normally shows interface availability and utilization in and out of every interface. There's no reason, however, that that table can't contain many more metrics. That's essentially what I did with this customization.
In order to implement this, run the following command on the NPC server:
That should do it. Now the default definition for that view should contain all the advanced metrics shown above. The view also has a new title in the view list: 'Interface Utilization Summary'. The way you can know it's the right one is by hovering over the view in the view list; it should pop up a description with my name in it.
This can also be applied directly to the NV view. You can do it through the wizard, or just run the following command on the NV server:
Wednesday, May 8, 2013
NV Default Tweaks
To go along with my post about the default tweaks that I do to a vanilla SuperAgent (ADA) installation, I decided to go ahead and document my default tweaks for NetVoyant. Note the disclaimer at the bottom of this page. All of these tweaks should be done before the first discovery cycle begins.
- Add discovery scopes by network, not individual IP address. This is a hot topic, but I maintain that using networks is better than individual IP addresses, if only for the sake of administration. If you've configured DNS and discovery properly (see point 5 below) IP address changes won't require any intervention. If you'd rather keep a super tight grip on your stack, go right ahead.
- Enable daily Periodic Discovery: just a checkbox under Discovery.
- Tweak SNMP Timeout: Change the timeout from 5 seconds to 2. If it hasn't responded after 2 seconds, it's not going to respond after 5.
- Enable Reachability Only Monitoring: If you want to monitor devices that are in scope but not SNMP capable, you can, using ICMP only. Enable this by unchecking the box that says 'Ignore Non-SNMP Devices'. You'll also need to go to Config>>Discovery>>Device Models and check the 'Enabled' checkbox on the 'NonSNMP Devices' model.
- Update Device Naming: This one takes some thinking. If you know you will have DNS entries for all of your devices, the best would be to let NV poll via FQDN (vs. polling by IP address). That way, if your discovery scopes include networks instead of individual IP addresses you won't have to change anything in NV when the IP address of a device changes. Since NV will be polling via FQDN and the new IP address is still in scope, NV won't know any different. Set Default device name to 'DNS Name'. If there isn't one, NV will poll via IP address.
- Give NV more resources: Slide the resource usage slider up to its max. If NV isn't the only thing on the server, do this carefully.
- Disable Undesired Classes: Under Discovery>>Device Classes, disable any device classes you don't want to monitor. This is one way to prevent NV from monitoring everything on your network even though you've added scopes by network. I typically disable printers and workstations. You will need to keep an eye on any SNMP-capable devices that show up in the 'Other' group; this means NV doesn't know what class the device belongs to. Right-click the device and click Change Classification. If you need a new class, go to Config>>Discovery>>Device Classes and create it. After you make a classification change, make sure your undesired classes still say 'No Device Models Enabled Upon Discovery'.
Tip: when you're reclassifying devices, you can set the icon that gets used by the NV console when displaying the device. This is only for the console, but it can make things easier to troubleshoot. You can either use one of the built-in images (found at D:\netqos\netvoyant\classes\redpoint\images) or store your own there (keep it to less than 20x20 pixels) by entering the image name (without the .gif) in the change classification dialog box.
- Disable polling of the System Idle Process: If the Host Resources Software Performance (hrswrun) dataset is going to be used, set up a discovery rule called 'Default' with expression:
hrSWRunName <> 'System Idle Process'
It's also a good idea to go ahead and set the poll event severity to none. Otherwise, you'll get an alarm every time a process fails to poll. This can be a good thing, since it indicates that a process has gone down. However, if NV is polling a process that is being run by a user, the process will disappear when the user logs off. In fact, I usually go through and disable poll events for all datasets. This should be done with an understanding of what is lost by not getting poll events.
- Disable Host Resource Device Table (hrdevice): Create a discovery rule called 'None' with expression:
1==2
If you've already discovered some/all of your devices, set the poll instance expiration to 0 and enable the 'None' discovery rule. Then run a full rediscovery. After that's done, disable polling and periodic discovery on that dataset.
- Disable VMware datasets: You will only get data for these datasets if you own CA Virtual Assurance. If you do, skip this step. If you don't, disable polling and periodic discovery for VMware Datacenter Element (aimdc), VMware Host (aimhost), and VMware Virtual Machine (aimvm).
- Disable NBAR and RMON2: if you have NBAR or RMON2 probes and want to poll them from NV, skip this step. Otherwise, disable polling and periodic discovery for Protocol Distribution (NBAR) (nbarstats) and Protocol Distribution (RMON2) (protodist).
- Disable polling of optical, removable, and floppy drives: Add a discovery rule to the Host Resource Storage (hrstorage) dataset called 'Default' with expression:
hrStorageType NOT IN ('1.3.6.1.2.1.25.2.1.7','1.3.6.1.2.1.25.2.1.5')
If you've already discovered some/all of your devices, set the poll instance expiration to 0 and enable the 'Default' discovery rule. Then run a full rediscovery. After that's done, set the poll instance expiration back to something reasonable like 28.
- Disable polling of various interface types: Add a discovery rule called 'Default' with expression:
ifInOctets+ifOutOctets<>0 AND ifType NOT IN (1, 18, 24, 134, 37, 100, 101, 102, 103, 104) AND ifSpeed<>0
If you're curious about which interface types this excludes, look on the Config tab under Discovery>>Interface Types.
- Enable Verbosity on the Topology service: Go to Services>>Topology and change the drop down from 'Normal' to 'Normal (Verbose)'. There's no save button. Turn this back to 'Normal' after NV is up and running and stable in production.
- Disable Traps: If NV isn't going to be your trap handler, prevent stray traps from getting logged into the database by going to Services>>Traps and setting start mode to 'Manual'. Then click 'Stop' to stop the service.
- Configure your view options: Under the View menu, make sure everything is enabled.
Tuesday, May 7, 2013
Default SuperAgent Tweaks
Whenever I'm setting up a new SuperAgent system, there are always a few things I go through and do before I start data collection. So, here's my list so I don't have to remember it:
- Add collectors by DNS - I like to add by NetBIOS name then click the IP button. This helps me make sure I get the right IP address. Then, after SA has done its check of the collector, I click the DNS button, which finds the DNS name from the IP address it previously resolved from the NetBIOS name. This double check makes sure I have the right server since the FQDN name should be fairly similar to the NetBIOS name.
- Add a port exclusion - Given the troubles I've had with large deployments and auto-discovery, I've decided to start adding a huge port exclusion from the get go. I add 1025-65535 for the whole domain. When/if I need to monitor an application in that range, I can always add an exception. This can be done via the GUI or through a query:
insert into application_rules
(application_id,exclude_port_begin,exclude_port_end,rule_type)
values (0,1025,65535,0);
New in ADA 9.3! - This option can be enabled on each collector. For a standard or virtual collector, create a new text file, drive:\CA\bin\saConfigInterfaceRuntimeOptions.ini, with the following line:
/force positive config
Then restart the CA ADA Monitor service (a scripted sketch of this file-plus-restart step appears after this list).
- Add actions to the default incident responses - Add all the possible actions to the incident responses. If I have the email address of the person/group that will be monitoring the SA for problems, I put an email action in the collection device default incident response.
- Create a 'No Response' network incident response - create this incident response with no actions.
- Adjust network types - I like to have only 4 network types: Internet - VPN, LAN, Not Rated, WAN. I delete all the other network types. Assign the Internet - VPN and Not Rated network types to the 'No Response' incident response created earlier.
- Edit the 'Weekends' maintenance schedule - Change the name to 'Full Time Maintenance' and change the period to all day, every day. If there is a standing maintenance window that affects every server everywhere, add that period to the default.
- Change the data retention settings - Bump everything up to their max. If it becomes a problem later on, I can always tune it down.
- Change the free space alarm - Change this from 5GB to 20GB and put somebody's email address in there.
- Import a networks list - I prefer to use the GXMLG, but at least understand regions if doing it by hand. You can also use the standard private networks list if you have nothing to start with.
- Bump up the default observation numbers
New in ADA 9.3! - You don't have to do this via direct database manipulation any more. Just go to Administration >> Policies >> Performance Thresholds. The middle table allows manipulation of the default threshold settings. You can also set up the default threshold settings for the 'Not Rated' and 'Internet - VPN' network types; set them up for no thresholds on the network and combined thresholds.
- Import servers as subnets instead of individual servers - This just makes sense. If possible, try to group servers together into subnets by application. This makes it easier to assign groups of servers to an application. If this isn't possible, enter the entire subnet.
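As promised above, here's a rough sketch of scripting the collector-side 'force positive config' step; it assumes the product is installed on D: and that 'CA ADA Monitor' is the exact service name on your collector:
echo /force positive config> D:\CA\bin\saConfigInterfaceRuntimeOptions.ini
net stop "CA ADA Monitor"
net start "CA ADA Monitor"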
Those are all the tweaks I can think of at the moment. If I think of any others, I'll add them to this list.
SuperAgent Application Thresholds
SuperAgent thresholds consist of several different components. The most critical part of the thresholds is the baseline sensitivity. Out of the box, SA thresholds are applied to every application and are set with the sensitivity values dictated by NetQoS. There are actually two types of thresholds that can be applied: sensitivity and milliseconds.
Sensitivity
Sensitivity is a unit-less scalar number between 0 and 200. This type of threshold looks for deviations from baseline. A higher number (think 'more sensitive') will alert on a slight deviation; a lower number will not alert until the deviation is more extreme. Think of it as that sensitive co-worker who goes to HR for everything: if that person is very sensitive, any little thing will cause them to go to HR, and if they were less sensitive, it would take something more extreme for them to march over and report you. Sensitivity baselines are really handy since the actual numbers involved in the threshold change as the baseline changes. This means that if one day of the week is typically different than the other days, the baseline reflects that, and since the baseline reflects it, so do the thresholds for that day. SuperAgent baselines take into consideration the hour of the day and other factors to get very good baselines. The other thing SuperAgent does with regard to baselines is baseline every combination individually. Since every combination has its own baseline, a single set of thresholds that refers only to the baseline can be set across the board. This is how things come out of the box.
Milliseconds
The second type of threshold is a more traditional threshold that looks at the value and determines if it is over a specified number of milliseconds. This threshold is much harder to set since you'd have to track data and understand what values you should set. This type of threshold does have one advantage: baseline creep protection. Baseline creep is when the baseline increases over time because of slowly degrading performance; thresholds tied to that baseline also slowly increase. This is like boiling a frog: you start out with a live frog in cool water and heat it up gradually, and by the time the water is hot enough to kill and boil the frog, it's too late for the frog to jump out.
Minimum Observation Count
SuperAgent also takes into consideration the fact that a single observation of a transaction exceeding a threshold (either sensitivity or millisecond) is nothing to pay attention to. The problems really come into play when many observations are seen exceeding the threshold. The minimum observation count is the number of observations that must exceed the threshold within a 5-minute period before the whole 5-minute period is marked as degraded or excessively degraded. These numbers are quite low out of the box. It is common practice to bump them up (usually by a power of 10) to reduce the amount of noise reported by SA. More on this later.
Default Application Thresholds
When an application is configured, either by a user or by the system, a default set of thresholds is applied. The same settings are used for all applications. This can be a problem with newer SA systems since auto-discovery tends to create many applications; if they are all using the default thresholds, the result can be a lot of noise. This is not because the thresholds are too low (remember, the default thresholds are tied to the baseline). The real problem is that the default minimum observation numbers are too low. Luckily, these numbers can be changed.
Changing Thresholds Through the Web GUI
The thresholds and minimum observations can be changed in the GUI in two different places: the applications list or under policies. The applications list is the better place to be if you want to change more than one application/network type set at a time. In the applications list, multiple applications can be selected (a maximum of 100 at a time) and the thresholds edited for all of them. This is handy at least for editing the thresholds of the user-created applications.
New in ADA 9.3! - A new option has been added to the GUI that allows modification of the default threshold for new applications (new system-discovered applications and new user-defined applications). Go to Administration >> Policies >> Performance Thresholds. The middle table allows modification of the default threshold set. You should also go back to applications that have already been defined and update those thresholds; once an application is discovered by the system or created by the user, its thresholds are independent of the default set.
Changing Thresholds Through a MySQL Query
Doing anything in the database directly isn't supported by CA and you may break your stuff. If you do, I'm not responsible, and CA will probably have you revert to a db backup before even considering talking to you. So either don't tell them that you did this or make sure you can back up and restore your database as needed. There, you have been warned.
When changing the thresholds for the system applications, there are several tactics. The first involves increasing the minimum observation count. This can be done with a fairly simple query that both increases the minimum observation count for all defined applications and modifies the default application thresholds so that all future applications use the same settings.
--run this query to increase the minimum observation count by a power of 10.
update performance_incident set observations = observations * 10;
You shouldn't have to reload the collectors to get this change to take effect; however, if you do experience problems seeing the updated threshold values, reloading the collectors should fix it.
Setting Thresholds for Internet/VPN Network Type
A best practice when configuring SuperAgent is to configure a special network type for all the network definitions in SA whose network performance is not entirely within your control. Alarming on networks like this is ineffective since the resulting alarms aren't actionable. I usually create a network type called 'Internet - VPN' to indicate any networks that are entirely or partially out of my domain of control. In other words, I set the network type to 'Internet - VPN' for any client IP address ranges across the internet or on another organization's network. If I were to detect a problem with the network metrics to a user within one of these networks, I wouldn't know whether the problem was within my portion of the network or out on the internet, and if it were out on the internet, I wouldn't be able to do much about it.
So, first of all, create the 'Internet - VPN' network type and assign all your non-internal IP address ranges to it. This would include VPN IP addresses, since a portion of their conversation occurs over the internet.
The next step is optional, since the third step negates its necessity. However, if you don't want to go ahead with the third step, implementing this step will at least prevent you from getting alerts on the network metrics for those networks. All you need to do is create a new network incident response for the 'Internet - VPN' network type and don't assign any actions to it. This should weed out email notifications from issues detected for networks where you can't help the network performance.
New in ADA 9.3! - A new option has been added to the GUI that negates having to perform step three using direct database manipulation. Instead, go to Administration >> Policies >> Performance Thresholds. Click 'Add Custom by Network Type' in the second table. Pick the 'Internet - VPN' network type. Change the Network and Combined thresholds from 'Use Default' to 'Customize', then change the now-enabled drop downs from 'Sensitivity' to 'None'. You'll want to do this for NRTT, RTD, NCST, ERTT, DTT, and TTT.
Step three involves a little database manipulation. Essentially, you will need to add a record to the performance_incident table for every metric/app combo you want to ignore. Since you'll need to ignore NRTT, RTD, NCST, ERTT, DTT, and TTT, you'll need to add 6 rows for every application. Luckily, this isn't too hard. The only downside is that this doesn't set things up for any future applications; you'll have to repeat the process. If you do, the query will fail unless you do a complete undo of everything else first. This first query undoes all the threshold sets for the network type containing the string 'VPN'. Make sure your network type has this string or modify the query below.
-- run this query to remove any thresholds currently tied to that network type
Delete from performance_incident where agg_id = (select max(agg_id) from aggregates where agg_type=1 and agg_name like '%VPN%');
Once you've done that, or if this is the first time you're running this, run the following query. Again, make sure your network type has the string 'VPN' in the name. Essentially, this inserts a row ignoring thresholding for the VPN network type (hence the 0's in the query below right after m.metric_type) for every application and for each of the metrics we want to ignore (hence the last set of numbers).
-- run this query to disable network and combined metrics for the network type whose name contains the string: VPN
INSERT INTO performance_incident (app_id, agg_id, metric_type, thres1, thres1_type, thres2, thres2_type, observations)
SELECT a.app_id, (select max(agg_id) from aggregates where agg_type=1 and agg_name like '%VPN%'), m.metric_type, 100, 0, 90, 0, 50 as observations
FROM applications as a, metric_types as m where m.metric_type in ( 0 , 1 , 2 , 3 , 4 , 9 );
Tuesday, April 16, 2013
Finding the data source for a particular device in NPC
Recently, we needed to know which data source was contributing to the report data for a particular device in NPC. This was fairly easy to find out given a simple query:
mysql -P 3308 -D netqosportal -e "select a.itemname as Device, v6_ntoa(a.address) as Address, b.consolename as DataSource from dst_device as a, data_sources2 as b where a.sourceid=b.sourceid and itemname like '%devicename%' order by a.itemname;"
Simply replace devicename with the device name and execute this at a command prompt on the NPC server. The result should look something like this:
+-------------+----------------+------------------+
| Device      | Address        | DataSource       |
+-------------+----------------+------------------+
| center      | 192.168.100.2  | NetVoyant        |
| nacogdoches | 192.168.100.3  | NetVoyant        |
| nacogdoches | 192.168.100.3  | ReporterAnalyzer |
| houston     | 192.168.100.4  | ReporterAnalyzer |
| houston     | 192.168.100.4  | NetVoyant        |
| dallas      | 192.168.100.5  | ReporterAnalyzer |
| dallas      | 192.168.100.5  | NetVoyant        |
| sanfelipe   | 192.168.100.6  | ReporterAnalyzer |
| sanfelipe   | 192.168.100.6  | NetVoyant        |
| austin      | 192.168.100.7  | ReporterAnalyzer |
| austin      | 192.168.100.7  | NetVoyant        |
| elpaso      | 192.168.100.8  | NetVoyant        |
| brownsville | 192.168.100.9  | NetVoyant        |
| beaumont    | 192.168.100.10 | NetVoyant        |
| lufkin      | 192.168.100.11 | NetVoyant        |
| ftworth     | 192.168.100.12 | NetVoyant        |
| ftworth     | 192.168.100.12 | ReporterAnalyzer |
| tyler       | 192.168.100.13 | ReporterAnalyzer |
| tyler       | 192.168.100.13 | NetVoyant        |
| henderson   | 192.168.100.14 | NetVoyant        |
| amarillo    | 192.168.100.15 | NetVoyant        |
| amarillo    | 192.168.100.15 | ReporterAnalyzer |
| sanantonio  | 192.168.100.16 | NetVoyant        |
| bexar       | 192.168.100.17 | NetVoyant        |
+-------------+----------------+------------------+
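If you'd rather start from an IP address than a device name, the same query can be filtered on the address column instead (an untested variation of the query above):
mysql -P 3308 -D netqosportal -e "select a.itemname as Device, v6_ntoa(a.address) as Address, b.consolename as DataSource from dst_device as a, data_sources2 as b where a.sourceid=b.sourceid and v6_ntoa(a.address) like '192.168.100.%' order by a.itemname;"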
It would be nice if NPC or the new CAPC had some kind of feature that showed the datasource(s) for a particular object on the device details page.
Wednesday, March 6, 2013
Page Retirement Methods
Continuing my effort to document the various ways I've used the ODBC connector for the NetQoS products, here's the next query and control I've built and use in production. Today's query comes from a need to notify users that a particular page is being retired. Instead of immediately deleting a page and having to deal with confused users, I used the browser view to display a banner telling the users that they have X days until the page is retired. This could be done manually by creating a different static HTML page displaying a different date for each page, but I wanted a way to make it easy.
First of all, I didn't want to have to rebuild the banner for each page that is going to be retired, so I needed a way for each instance of the banner to store its own expiration date. I also wanted a way for users who don't want their precious pages deleted to notify me. A simple email would have done, but that would have almost always involved an immediate ping back to the user asking which page. So, the banner needs a way to start an email and automatically include the URL of the page.
The implementation is pretty easy.
- First of all, get this first section of code, update 'email@company.com' to your email address on the fourth to last line and save it as pageexpiring.html in your custom content directory. Download a copy of the code here.
- Then put a browser view on the page to be retired and set the URL to point to pageexpiring.html in your custom content directory and save the view. When the page refreshes, you should see a message saying that the page has been scheduled to be retired on an unknown date.
- To add a specific date, edit the browser view and append ?expirationdate=3/31/2013 to the URL (replacing 3/31/2013 with the page retirement date).
- Repeat steps 2 and 3 for any remaining pages to be retired.
Now that you've marked a couple pages to be retired, you can use an ODBC connection to display all those pages. Here's the SelectCommand and OdbcConnection String to put in the configuration.xml:
To create the view, run the following SQL commands against the NPC server:
Wednesday, January 23, 2013
NetVoyant Duplicates through ODBC
Continuing my effort to document the various ways I've used the ODBC connector for the NetQoS products, here's the next query and control I've built and use in production. Today's query comes from a need to view duplicate devices and make it easy to eliminate the duplicates. I had built a fairly extensive method to do this using a Perl script and batch files, but it suffered from all the problems that most NPC browser views suffer from. ODBC is a better way to accomplish this, so I'm officially retiring that script. Here's the SelectCommand and OdbcConnection String to put in the configuration.xml:
To create the view, run the following SQL commands against the NPC server:
Wednesday, January 16, 2013
Automating NFA Parser Reports
UPDATE: CA Support has endorsed the NAST tool as the replacement for the NFAParser. I haven't tested it, but if it's like the other updated tools it will run faster. The nice thing is that the syntax for running the NAST tool silently is the same as the NFAParser. So, it doesn't take much to update this tool to use the new tool.
A while back I was tasked with making it possible to view NFA Parser output inside NPC. It was actually easier than I thought. I came up with something that isn't as optimal as I would like it (I'll explain why later), but it works for now.
The first thing you have to do is download the NFA Parser, which is part of CA's Support Tools 6, and copy it to each harvester. If you don't want to use all the tools, you can just download the parser and put it on each harvester. The output of the parser is an HTML file, which is ready to be published to a web service so you can link to it from NPC. The easiest way to do this is to call the parser with a working directory of C:\inetpub\wwwroot on the harvester. That way the output will be put in that directory, ready to be viewed in a browser. However, every time you run the parser, the output file's name contains a date/time stamp, which makes it a little difficult to link to. The solution is to wrap it all in a batch file that clears the old output, calls the parser, then renames the new output to some static name. Here's what that batch script would look like:
This could be tweaked a bit to keep the last X files using the following batch script:
This second option moves the existing files up in a queue by renaming them with a higher number, except for the highest one, which gets deleted. So, if I created a scheduled task like this: C:\inetpub\wwwroot\nfa.bat 1 5, I would eventually end up with 5 files, each one representing one of the last 5 runs, with nfaout1.htm being the most recent and each spanning 1 minute. This second method is the option I'm using in production and it seems to work just fine. In order to easily give access to the files, I create an HTML table with a column for the servers and then a column for each of the retained reports, with a row for each harvester. I put that HTML in my custom content directory and load it into a browser view.
Obviously running the report more frequently and with a longer timespan will increase load on the harvester, so don't turn it on to run for 1 minute every minute.
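Since the actual script is behind the download link, here's a rough sketch of just the rotation logic described above; the parser invocation and its timestamped output pattern are placeholders, not the real NFA Parser syntax:
@echo off
setlocal enabledelayedexpansion
rem Usage: nfa.bat <minutes_to_parse> <files_to_keep>
set MINUTES=%1
set KEEP=%2
cd /d C:\inetpub\wwwroot
rem Drop the oldest report, then shift the rest up one slot (nfaout1.htm stays the newest).
if exist nfaout%KEEP%.htm del nfaout%KEEP%.htm
for /l %%i in (%KEEP%,-1,2) do (
  set /a PREV=%%i-1
  if exist nfaout!PREV!.htm ren nfaout!PREV!.htm nfaout%%i.htm
)
rem Run the parser silently here for the last %MINUTES% minutes (actual command line omitted).
rem Then rename its timestamped output (the pattern below is a guess) to the static name.
for %%f in (NFAParser*.htm) do ren "%%f" nfaout1.htm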
Thursday, January 10, 2013
Giving Existing Users Access to New Data Sources
UPDATE: It turns out I had released a more limited set of commands previously. Think of this method as version 2; it's more complete and replaces the previous one.
There is a limitation in NPC that is a little annoying. If you have a slew of users created in NPC (either using LDAP integration or just local product authentication) and you add a new data source, only nqadmin and nquser get access to the new data source. By default, all other users get no access to it. That doesn't mean they can't see the data; it just means they can't log into the web GUI for that data source. The 'proper' fix would be to edit every single user and grant them user, power user, or admin rights to the new data source. With SSO and LDAP integration, that just won't scale (especially if you have more than a couple dozen users). And if you've made it a habit to use the nqadmin account only for root-level tasks and you're working from your own account (set up as an administrator), you would be able to add the data source but not access it until you edit your own user account and give yourself rights. The silly thing is that any new users based on the nqadmin or nquser account would get access to the data source; the problem is only with existing users.
This is a difficult nut to crack. There are a few features that could be built into NPC that would save whoever adds a data source from having to touch every single user:
- Give all users no access. This is what happens now.
- Give all users 'user' access to the data source. This means you would still have to edit your own account and grant yourself admin access, not to mention any other accounts that need to be administrators on the new data source.
- Give all users 'user' access except the person adding the data source, who would get admin access. This would give you immediate admin rights, but any other admins would still have to be edited manually.
- Give all users the same access to the new data source as they have in NPC. By this I mean that if a user is an administrator in NPC, they become an administrator for the new data source as well; if they're a user in NPC, they become a user on the new data source.
- Give all cloned users the same access as the account they were cloned from. So, if a user was cloned from the nquser account (which inherits access to the new data source by default), then that user would also inherit access to the new data source.
- Set up product privilege sets and assign them to users. I'd create three: all admin, all power user, and all user. Then I could go to one place to make the change and all the users would be affected.
There are two ways that this query can be run: 1) scheduled task or 2) manual batch file after adding a data source. Either way should be sufficient. Option 2 would be more efficient since this only needs to be done after a new data source is added to NPC.
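For option 2, a wrapper along these lines should do it; the .sql file name is a placeholder for the query itself, and the user, password, and database name are assumptions, so substitute whatever your NPC installation actually uses:

@echo off
rem Run the access-granting query against the NPC database after adding a new
rem data source. grant_datasource_access.sql stands in for the query; adjust
rem the credentials, database name, and the path to mysql.exe as needed.
mysql -uroot -pYourRootPassword netqosportal < C:\scripts\grant_datasource_access.sql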
Monday, January 7, 2013
RA Router List through ODBC
Continuing my effort to document the various ways I've used the ODBC connector for the NetQoS products, here's another query, and its controls, that I've built and use in production. Today's query comes from a need to view the router status for all routers monitored by ReporterAnalyzer (NFA). The goal is to show a list of the routers along with the address from which RA is receiving NetFlow, which harvester each router is assigned to, how many interfaces have been discovered, how many are enabled, when the router last rebooted, when the last refresh happened, and when it was last discovered. Here's the SelectCommand and OdbcConnection String to put in the configuration.xml:
To create the view, run the following SQL commands against the NPC server:
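As a rough sketch of the shape such a view could take; the view, table, and column names below are hypothetical placeholders rather than the real NFA/NPC schema, so treat it purely as illustration and build the actual view against the tables your version exposes:

-- Hypothetical tables: ra_routers and ra_interfaces stand in for whatever
-- tables hold router and interface state in your installation.
CREATE OR REPLACE VIEW router_status AS
SELECT r.RouterName,
       r.RouterAddress,
       r.HarvesterName,
       COUNT(i.InterfaceID) AS InterfacesDiscovered,
       SUM(i.Enabled = 1)   AS InterfacesEnabled,
       r.LastReboot,
       r.LastRefresh,
       r.LastDiscovered
FROM ra_routers r
LEFT JOIN ra_interfaces i ON i.RouterID = r.RouterID
GROUP BY r.RouterID;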
Wednesday, December 12, 2012
SA Collector Feed List through ODBC
Continuing my effort to document the various ways I've used the ODBC connector for the NetQoS products, I'm going to post a couple of the queries and controls I've built and use in production. Today's query comes from a need to review all the collector feeds for all the collectors on two different master consoles. Collector feeds are configured for each collection NIC on standard collectors. I haven't run this query on an MC with an MTP connected, but it should return the same kind of results.
The query pulls from three different tables and produces a list of feeds along with options to edit each collector. Unfortunately, the functionality to modify the feed properties directly within the GUI isn't there yet, but something like it should be coming soon. Here's the SelectCommand and OdbcConnection String to put in the configuration.xml:
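As a rough illustration of the two values (the view name, driver version, server, credentials, and database name are all assumptions; use whatever is actually installed on your NPC server and whatever you named the view):

SelectCommand: SELECT * FROM collector_feeds
OdbcConnection: Driver={MySQL ODBC 3.51 Driver};Server=NPCSERVER;Database=netqosportal;User=root;Password=YourPassword;Option=3;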
To create the view, run the following SQL commands against the NPC server:
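Along the same lines, and again with purely hypothetical table and column names standing in for the real SA schema, the three-table join behind such a view might look like this (the view name matches the SelectCommand example above):

-- Hypothetical tables: sa_collectors, sa_collection_nics, and sa_feeds stand
-- in for whatever tables hold the collector, NIC, and feed definitions.
CREATE OR REPLACE VIEW collector_feeds AS
SELECT c.CollectorName,
       n.NicAddress,
       f.FeedAddress,
       f.FeedPort
FROM sa_collectors c
JOIN sa_collection_nics n ON n.CollectorID = c.CollectorID
JOIN sa_feeds f ON f.NicID = n.NicID;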