Monday, May 20, 2013

Setting a default group for an NPC report page

UPDATE: An additional tool is now available: autorefresh.html.  This widget uses its own script to refresh a page automatically; it does not call the built-in Auto-Refresh capability of NPC.  You provide the number of minutes between refreshes in the browser view URL: /custom/autorefresh.html?interval=5
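For illustration only, here is a small Python sketch of how that URL is put together and how a script might turn the interval into a timer value. The helper names and the fallback of 5 minutes are made up for this example; the widget's own script may handle the parameter differently.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def autorefresh_url(interval_minutes, base="/custom/autorefresh.html"):
    """Build the browser view URL for a refresh interval given in minutes."""
    if interval_minutes <= 0:
        raise ValueError("interval must be a positive number of minutes")
    return f"{base}?{urlencode({'interval': interval_minutes})}"

def interval_ms(url, default_minutes=5):
    """Read the interval back out of the URL and convert it to milliseconds.

    default_minutes is a hypothetical fallback, not documented widget behavior.
    """
    query = parse_qs(urlparse(url).query)
    minutes = float(query.get("interval", [default_minutes])[0])
    return int(minutes * 60 * 1000)

print(autorefresh_url(5))                                      # /custom/autorefresh.html?interval=5
print(interval_ms("/custom/autorefresh.html?interval=5"))      # 300000
```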


UPDATE: An additional tool, now available for download, lets you set the default timeframe for a page. Do you have a page that you always want to show the last 24 hours of data? Just add this widget.


UPDATE: The newest version is out.  I updated the script to fix a problem that occurred when going back to NPC after using a page with the default page context setter.  I have also released a modified version that lets you set the default IP SLA test type on an IP SLA test report page.

Page IPSLA Type Default


Page Group Default


Friday, May 17, 2013

Interface Summary Table Ultimate Tweak

A while back, I did a major customization of the Top Least Interfaces Table in NPC.  This is a NetVoyant view that normally shows interface availability and utilization in and out of every interface.  There's no reason, however, that that table can't contain many more metrics.  That's essentially what I did with this customization.

To implement this, run the following command on the NPC server:


That should do it. Now the default definition for that view should contain all the advanced metrics shown above. The view also has a new title in the view list: "Interface Utilization Summary". To confirm it's the right one, hover over the view in the view list; a description with my name in it should pop up.

This can also be applied directly to the NV view.  You can do it through the wizard, or just run the following command on the NV server:

Wednesday, May 8, 2013

NV Default Tweaks

To go along with my post about the default tweaks that I do to a vanilla SuperAgent (ADA) installation, I decided to go ahead and document my default tweaks for NetVoyant.  Note the disclaimer at the bottom of this page.  All of these tweaks should be done before the first discovery cycle begins.
  1. Add discovery scopes by network, not individual IP address.  This is a hot topic, but I maintain that using networks is better than individual IP addresses, if only for the sake of administration.  If you've configured DNS and discovery properly (see point 5 below) IP address changes won't require any intervention.  If you'd rather keep a super tight grip on your stack, go right ahead.
  2. Enable daily Periodic Discover: just a checkbox under discovery
  3. Tweak SNMP Timeout: Change the timeout from 5 seconds to 2.  If it hasn't responded after 2 seconds, it's not going to respond after 5.
  4. Enable Reachability Only Monitoring: If you want to monitor devices in scope but not SNMP capable, you can by only using ICMP.  Enable this by unchecking the box that says 'Ignore Non-SNMP Devices'.  You'll also need to go to Config>>Discovery>>Device Models and check the 'Enabled' checkbox on the 'NonSNMP Devices' model.
  5. Update Device Naming: This one takes some thinking.  If you know you will have DNS entries for all of your devices, the best would be to let NV poll via FQDN (vs. polling by IP address).  That way, if your discovery scopes include networks instead of individual IP addresses you won't have to change anything in NV when the IP address of a device changes.  Since NV will be polling via FQDN and the new IP address is still in scope, NV won't know any different.  Set Default device name to 'DNS Name'.  If there isn't one, NV will poll via IP address.
  6. Give NV more resources: Slide the resource usage slider up to its max.  If NV isn't the only thing on the server, do this carefully.
  7. Disable Undesired Classes: Under Discovery>>Device Classes, disable any device classes you don't want to monitor.  This is one way to prevent NV from monitoring everything on your network even though you've added scopes by network.  I typically disable printers and workstations.  Keep an eye on any SNMP-capable devices that show up in the 'Other' group; that means NV doesn't know what class the device belongs to.  Right-click the device and click change classification.  If you need a new class, go to Config>>Discovery>>Device Classes and create it.  After you make a classification change, make sure your undesired classes still say 'No Device Models Enabled Upon Discovery'.
    Tip: when you're reclassifying devices, you can set the icon that the NV console uses when displaying the device.  This is only for the console, but it can make things easier to troubleshoot.  You can use one of the built-in images (found at D:\netqos\netvoyant\classes\redpoint\images) or store your own there (keep it to less than 20x20 pixels); either way, enter the image name (without the .gif) in the change classification dialog box.
  8. Disable polling of the System Idle Process: If the Host Resources Software Performance (hrswrun) dataset is going to be used, set up a discovery rule called 'Default' with expression:
    hrSWRunName <> 'System Idle Process'
    It's also a good idea to go ahead and set the poll event severity to none.  Otherwise you'll get an alarm every time a process fails to poll.  This can be a good thing, since it indicates that a process has gone down.  However, if NV is polling a process that is being run by a user, when the user logs off, the process will disappear.  In fact, I usually go through and disable poll events for all datasets.  This should be done understanding what is lost when not getting poll events.
  9. Disable Host Resource Device Table (hrdevice): Create a discovery rule called 'None' with expression:
    1==2
    If you've already discovered some/all of your devices, set the poll instance expiration to 0 and enable the 'None' discovery rule.  Then run a full rediscovery.  After that's done, disable polling and periodic discovery on that dataset.
  10. Disable VMware datasets: You will only get data for these datasets if you own CA Virtual Assurance.  If you do, skip this step.  If you don't, disable polling and periodic discovery for VMware Datacenter Element (aimdc), VMware Host (aimhost), and VMware Virtual Machine (aimvm).
  11. Disable NBAR and RMON2: if you have NBAR or RMON2 probes and want to poll them from NV, skip this step.  Otherwise, disable polling and periodic discovery for Protocol Distribution (NBAR) (nbarstats) and Protocol Distribution (RMON2) (protodist).
  12. Disable polling of optical, removable, and floppy drives: Add a discovery rule to the Host Resource Storage (hrstorage) dataset called 'Default' with expression:
    hrStorageType NOT IN ('1.3.6.1.2.1.25.2.1.7','1.3.6.1.2.1.25.2.1.5')
    If you've already discovered some/all of your devices, set the poll instance expiration to 0 and enable the 'Default' discovery rule.  Then run a full rediscovery.  After that's done, set the poll instance expiration back to something reasonable like 28.
  13. Disable polling of various interface types: Add a discovery rule called 'Default' with expression:
    ifInOctets+ifOutOctets<>0 AND ifType NOT IN (1, 18, 24, 134, 37, 100, 101, 102, 103, 104) AND ifSpeed<>0
    If you're curious about which interface types this excludes, look on the Config tab under Discovery>>Interface Types.
  14. Enable Verbosity on the Topology service: Go to Services>>Topology and change the drop down from 'Normal' to 'Normal (Verbose)'.  There's no save button.  Turn this back to 'Normal' after NV is up and running and stable in production.
  15. Disable Traps: If NV isn't going to be your trap handler, prevent stray traps from getting logged into the database by going to Services>>Traps and setting start mode to 'Manual'.  Then click 'Stop' to stop the service.
  16. Configure your view options: Under the View menu, make sure everything is enabled.
That's it for now.  Make sure the discovery monitor is open and kick off discovery.  That should get you started.
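The discovery rules in items 12 and 13 are plain boolean filters over SNMP values, so it can help to see them as code before typing them into NV. The sketch below mirrors those two expressions in Python; the function names and the sample interfaces are invented for the example, and NV's own expression engine is not being reproduced here.

```python
# ifType values excluded by the interface rule in item 13.
EXCLUDED_IF_TYPES = {1, 18, 24, 134, 37, 100, 101, 102, 103, 104}

# hrStorageType OIDs excluded by the hrstorage rule in item 12.
EXCLUDED_STORAGE_TYPES = {
    "1.3.6.1.2.1.25.2.1.7",  # hrStorageCompactDisc
    "1.3.6.1.2.1.25.2.1.5",  # hrStorageRemovableDisk
}

def keep_interface(if_in_octets, if_out_octets, if_type, if_speed):
    """Same logic as the interface rule: has traffic, allowed type, non-zero speed."""
    return (if_in_octets + if_out_octets != 0
            and if_type not in EXCLUDED_IF_TYPES
            and if_speed != 0)

def keep_storage(hr_storage_type):
    """Same logic as the hrstorage rule: drop the listed storage types."""
    return hr_storage_type not in EXCLUDED_STORAGE_TYPES

# Hypothetical sample rows, not taken from a real device.
interfaces = [
    {"name": "Gi0/1", "in": 1200, "out": 3400, "type": 6, "speed": 10**9},  # Ethernet with traffic
    {"name": "Lo0", "in": 10, "out": 10, "type": 24, "speed": 8 * 10**9},   # software loopback
    {"name": "Gi0/2", "in": 0, "out": 0, "type": 6, "speed": 10**9},        # Ethernet, no traffic
]
kept = [i["name"] for i in interfaces
        if keep_interface(i["in"], i["out"], i["type"], i["speed"])]
print(kept)  # ['Gi0/1']
```

Note the traffic test: an interface that has never passed a byte at discovery time is filtered out, which is usually what you want for unused ports.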

Here's a picture:

Tuesday, May 7, 2013

Default SuperAgent Tweaks

Whenever I'm setting up a new SuperAgent system, there are always a few things I go through and do before I start data collection.  So, here's my list so I don't have to remember it:

  1. Add collectors by DNS - I like to add by NetBIOS name then click the IP button.  This helps me make sure I get the right IP address.  Then, after SA has done its check of the collector, I click the DNS button, which finds the DNS name from the IP address it previously resolved from the NetBIOS name.  This double check makes sure I have the right server, since the FQDN should be fairly similar to the NetBIOS name.
  2. Add a port exclusion - Given the troubles I've had with large deployments and auto-discovery, I've decided to start adding a huge port exclusion from the get go.  I add 1025-65535 for the whole domain.  When/if I need to monitor an application in that range, I can always add an exception.  This can be done via the GUI or through a query:
    insert into application_rules
    (application_id,exclude_port_begin,exclude_port_end,rule_type)
    values (0,1025,65535,0);

    New in ADA 9.3! - This option can be enabled on each collector.  For standard collector or virtual collector, create a new text file: drive:\CA\bin\saConfigInterfaceRuntimeOptions.ini with the following line:
    /force positive config
    Then restart the CA ADA Monitor service.
  3. Add actions to the default incident responses - add all the possible actions to the incident responses. If I have the email address of the person/group that will be monitoring the SA for problems, I put an email action in the collection device default incident response.
  4. Create a 'No Response' network incident response - create this incident response with no actions.
  5. Adjust network types - I like to have only 4 network types: Internet - VPN, LAN, Not Rated, WAN.  I delete all the other network types.  Assign the Internet - VPN and Not Rated network types to the 'No Response' incident response created earlier.
  6. Edit the 'Weekends' maintenance schedule - Change the name to 'Full Time Maintenance' and change the period to all day, every day.  If there is a standing maintenance window that affects every server everywhere, add that period to the default.
  7. Change the data retention settings - Bump everything up to its max.  If it becomes a problem later on, I can always tune it down.
  8. Change the free space alarm - Change this from 5GB to 20GB and put somebody's email address in there.
  9. Import a networks list - I prefer to use the GXMLG, but at least understand regions if doing it by hand.  You can also use the standard private networks list if you have nothing to start with.
  10. Bump up the default observation numbers
    New in ADA 9.3! - You don't have to do this via direct database manipulation any more.  Just go to Administration >> Policies >> Performance Thresholds.  The middle table allows manipulation of the default threshold settings.  You can also set up the default threshold settings for the 'Not Rated' and 'Internet - VPN' network types; set them up for no thresholds on the network and combined thresholds.
  11. Import servers as subnets instead of individual servers - This just makes sense.  If possible, try to group servers together into subnets by application.  This makes it easier to assign groups of servers to an application.  If this isn't possible, enter the entire subnet.  
Those are all the tweaks I can think of at the moment.  If I think of any others, I'll add them to this list.
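To picture what the port exclusion in item 2 does, here is a toy Python model: everything from 1025 to 65535 is ignored unless a port has been added back as an exception. This is only an illustration of the idea. How SA actually evaluates its application rules is not described in this post, and the function and parameter names are made up.

```python
# The excluded range from item 2.
EXCLUDE_BEGIN, EXCLUDE_END = 1025, 65535

def is_monitored(port, exceptions=frozenset()):
    """Return True if traffic on this port would still be considered.

    exceptions is a hypothetical set of ports added back for specific applications.
    """
    if port in exceptions:
        return True
    return not (EXCLUDE_BEGIN <= port <= EXCLUDE_END)

print(is_monitored(443))                       # True: below the excluded range
print(is_monitored(8080))                      # False: inside the excluded range
print(is_monitored(8080, exceptions={8080}))   # True: added back as an exception
```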

SuperAgent Application Thresholds

SuperAgent thresholds are made up of several different components.  The most critical part of the thresholds is the baseline sensitivity.  Out of the box, SA thresholds are applied to every application and are set with the sensitivity values dictated by NetQoS.  There are actually two types of thresholds that can be applied: sensitivity and milliseconds.

Doing anything in the database directly isn't supported by CA and you may break your stuff.  If you do, I'm not responsible and CA will probably have you revert to a db backup before even considering talking to you.  So either don't tell them that you did this or make sure you can back up and restore your database as needed.  There, you have been warned.

Sensitivity 

Sensitivity is a unitless number between 0 and 200. This type of threshold looks for deviations from baseline. A higher number (think 'more sensitive') will alert on a slight deviation. A lower number will not alert until the deviation is more extreme. Think of it as that sensitive co-worker who goes to HR for everything. If that person is very sensitive, any little thing will cause them to go to HR. If they were less sensitive, it would take something more extreme for them to march over and report you.

Sensitivity thresholds are really handy since the actual numbers involved in the threshold change as the baseline changes.  This means that if one day of the week is typically different from the other days, the baseline reflects that, and so do the thresholds for that day.  SuperAgent baselines take into consideration hour of the day and other factors to get very good baselines.

The other thing SuperAgent does is baseline every combination individually.  Since every combination has its own baseline, a single set of thresholds that refer only to the baseline can be set across the board.  This is how things come out of the box.


Milliseconds

The second type of threshold is a more traditional threshold that looks at the value and determines if it is over a specified value.  This threshold is much harder to set since you'd have to track data and understand what values you should set.  This type of threshold does have one advantage: baseline creep protection.

Baseline creep is when the baseline rises over time because performance is slowly degrading.  Thresholds tied to that baseline rise along with it, so the degradation never trips an alert.  This is like boiling a frog.  You start out with a live frog in cool water and heat it up gradually.  By the time the water is hot enough to kill and boil the frog, it's too late for the frog to jump out.  A millisecond threshold stays put no matter what the baseline does.
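Baseline creep is easy to see in a toy simulation. The Python sketch below assumes a response time that gets 2% worse every day, a baseline that is simply the average of the previous 7 days, a baseline-relative threshold of 1.5 times the baseline, and a fixed threshold of 150 ms. All of those numbers and the baseline formula are assumptions picked for the example; SuperAgent's real baseline and sensitivity math is not shown in this post.

```python
def simulate(days=60, start_ms=100.0, daily_growth=0.02,
             window=7, relative_factor=1.5, fixed_ms=150.0):
    """Return the days on which each kind of threshold would have fired."""
    history = []
    relative_alerts, fixed_alerts = [], []
    for day in range(days):
        value = start_ms * (1 + daily_growth) ** day
        if len(history) >= window:
            # Toy baseline: plain average of the previous `window` days.
            baseline = sum(history[-window:]) / window
            if value > baseline * relative_factor:
                relative_alerts.append(day)
        if value > fixed_ms:
            fixed_alerts.append(day)
        history.append(value)
    return relative_alerts, fixed_alerts

relative_alerts, fixed_alerts = simulate()
print("baseline-relative alerts:", len(relative_alerts))           # 0
print("first fixed-threshold alert on day:", fixed_alerts[0])      # 21
```

The response time more than triples over the 60 days, yet the baseline-relative check never fires because the baseline keeps pace. The fixed threshold catches it as soon as the value crosses 150 ms.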


Minimum Observation Count 

SuperAgent also takes into consideration the fact that a single observation of a transaction that exceeds a threshold (either sensitivity or millisecond) is nothing to pay attention to.  The problems really come into play when many observations are seen exceeding the threshold.  The minimum observation count is the number of observations that must exceed the threshold within a 5 minute period before the whole 5 minute period is marked as degraded or excessively degraded.  These numbers are quite low out of the box.  It is common practice to bump these numbers up (usually by a factor of 10) in order to reduce the amount of noise that is reported by SA.  More on this later.
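As a rough sketch of that rule, the Python function below rates one 5 minute period from a list of observed values. The function name, the use of one shared minimum count for both levels, and the sample numbers are assumptions for illustration; this is not SA's actual implementation.

```python
def rate_interval(observations_ms, degraded_ms, excessive_ms, min_observations):
    """Rate a 5 minute period based on how many observations exceed each threshold."""
    over_excessive = sum(1 for v in observations_ms if v > excessive_ms)
    over_degraded = sum(1 for v in observations_ms if v > degraded_ms)
    if over_excessive >= min_observations:
        return "excessively degraded"
    if over_degraded >= min_observations:
        return "degraded"
    return "normal"

# Hypothetical period: 95 fast transactions and 5 slow ones.
sample = [120] * 95 + [400] * 5

print(rate_interval(sample, degraded_ms=300, excessive_ms=500, min_observations=5))   # degraded
print(rate_interval(sample, degraded_ms=300, excessive_ms=500, min_observations=50))  # normal
```

With a minimum count of 5, five slow transactions out of a hundred are enough to mark the whole period as degraded. Raising the count to 50 makes the same period come out normal, which is the noise reduction described above.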


Default Application Thresholds 

When an application is configured, either by a user or by the system, a default set of thresholds is applied.  The same settings are used for all applications.  This can be a problem with newer SA systems since auto-discovery tends to create many applications.  If they are all using the default thresholds, it can result in a lot of noise.  This is not because the thresholds are too low.  Remember, the default thresholds are tied to the baseline.  The real problem is that the default minimum observation numbers are too low.  Luckily, these numbers can be changed.


Changing Thresholds Through the Web GUI 

The thresholds and minimum observations can be changed in the GUI in two places: in the applications list or under policies.  The applications list is the better place to be if you want to change more than one application/network type set at a time.  In the applications list, multiple applications can be selected (maximum of 100 applications selected at a time) and the thresholds edited for all those applications.  This may be handy at least for editing the thresholds of the user created applications.

New in ADA 9.3! - A new option has been added to the GUI that allows the modification of the default threshold for new applications (new system discovered applications and new user defined applications).  Go to Administration >> Policies >> Performance Thresholds.  The middle table allows modification of the default threshold set.  You should also go back to applications that have already been defined and update those thresholds.  Once an application is discovered by the system or created by the user, the thresholds are independent of the default set.

Changing Thresholds Through a MySQL Query 

When changing the thresholds for the system applications, there are several tactics.  The first involves increasing the minimum observation count.  This can be done with a fairly simple query that increases the minimum observation count for all defined applications and also modifies the default application thresholds so that all future applications use the same settings.

-- run this query to increase the minimum observation count by a factor of 10.
update performance_incident set observations = observations * 10;

You shouldn't have to reload the collectors for this change to take effect.  However, if you have trouble seeing the updated threshold values, reloading the collectors should fix it.
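If you want to see what that update does before touching a real system, you can try it on a scratch table first. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, with a cut-down performance_incident table and made-up rows. The real table has more columns, and the starting values here are not SA's defaults.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced stand-in for the real table; only the columns needed for the demo.
conn.execute(
    "CREATE TABLE performance_incident "
    "(app_id INTEGER, agg_id INTEGER, metric_type INTEGER, observations INTEGER)"
)
# Hypothetical rows.
conn.executemany(
    "INSERT INTO performance_incident VALUES (?, ?, ?, ?)",
    [(0, 0, 0, 5), (1, 0, 0, 5), (1, 0, 1, 10)],
)

query = "SELECT observations FROM performance_incident ORDER BY rowid"
before = [row[0] for row in conn.execute(query)]

# Same statement as in the post: every row's count is multiplied by 10.
conn.execute("UPDATE performance_incident SET observations = observations * 10")

after = [row[0] for row in conn.execute(query)]
print(before, "->", after)  # [5, 5, 10] -> [50, 50, 100]
```

Keep in mind that the statement has no WHERE clause, so running it twice multiplies everything by 100.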


Setting Thresholds for Internet/VPN Network Type 

A best practice when configuring SuperAgent is to configure a special network type for all the network definitions in SA whose network performance is not entirely within your control.  Alarming on networks like this is ineffective since the resulting alarms are not actionable.  I usually create a network type called 'Internet - VPN' to indicate any networks that are entirely or partially out of my domain of control.  In other words, I set the network type to 'Internet - VPN' for any client IP address ranges across the internet or on another organization's network.  If I were to detect a problem with the network metrics to a user within one of these networks, I wouldn't know if the problem were within my portion of the network or out on the internet.  If it were out on the internet, I wouldn't be able to do much about it.

So, first of all, create the 'Internet - VPN' network type and assign all your non-internal IP address ranges to it.  This would include VPN IP addresses since a portion of their conversation occurs over the internet.

The next step is optional, since the third step negates its necessity.  However, if you don't want to go ahead with the third step, implementing this step will at least prevent you from getting alerts on the network metrics for those networks.  All that you need to do is create a new network incident response for the 'Internet - VPN' network type and don't assign any actions to it.  This should weed out email notifications from issues detected for networks where you can't help the network performance.

New in ADA 9.3! - A new option has been added to the GUI that negates having to perform step three using direct database manipulation.  Instead, go to Administration >> Policies >> Performance Thresholds.  Click 'Add Custom by Network Type' in the second table.  Pick the 'Internet - VPN' network type.  Change the Network and Combined thresholds from 'Use Default' to 'Customize' then change the now enabled drop downs from 'Sensitivity' to 'None'.  You'll want to do this for NRTT, RTD, NCST, ERTT, DTT, and TTT.

Step three involves a little database manipulation.  Essentially, you will need to add a record to the performance_incident table for every metric/app combo you want to ignore.  Since you'll need to ignore NRTT, RTD, NCST, ERTT, DTT, and TTT, you'll need to add 6 rows for every application.  Luckily, this isn't too hard.  The only downside is that this doesn't set things up for any future applications, so you'll have to repeat the process when new applications show up.  When you do, the insert query will fail unless you first undo what was inserted before.  This first query undoes all the threshold sets for the network type containing the string 'VPN'.  Make sure your network type has this string or modify the query below.

-- run this query to remove any thresholds currently tied to that network type
Delete from performance_incident where agg_id = (select max(agg_id) from aggregates where agg_type=1 and agg_name like '%VPN%');

Once you've done that, or if this is the first time you're running this, run the following query.  Again, make sure your network type has the string 'VPN' in the name.  Essentially, this inserts a row that ignores thresholding for the VPN network type (hence the 0's for thres1_type and thres2_type in the query below) for every application and for each of the metrics we want to ignore (hence the list of metric types at the end).

-- run this query to disable network and combined metrics for the network type whose name contains the string: VPN
INSERT INTO performance_incident (app_id, agg_id, metric_type, thres1, thres1_type, thres2, thres2_type, observations)
SELECT a.app_id, (select max(agg_id) from aggregates where agg_type=1 and agg_name like '%VPN%'), m.metric_type, 100, 0, 90, 0, 50 as observations
FROM applications as a, metric_types as m where m.metric_type in ( 0 , 1 , 2 , 3 , 4 , 9 );
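To get a feel for the delete-then-insert pair without going near a production database, here is a dry run in Python using the built-in sqlite3 module as a stand-in for MySQL. The table layouts are guesses built from the column names in the queries above, the sample data is invented, and the unique key on performance_incident is an assumption made to mimic the failure described for a repeated insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE applications (app_id INTEGER PRIMARY KEY);
CREATE TABLE metric_types (metric_type INTEGER PRIMARY KEY);
CREATE TABLE aggregates (agg_id INTEGER PRIMARY KEY, agg_type INTEGER, agg_name TEXT);
CREATE TABLE performance_incident (
    app_id INTEGER, agg_id INTEGER, metric_type INTEGER,
    thres1 INTEGER, thres1_type INTEGER, thres2 INTEGER, thres2_type INTEGER,
    observations INTEGER,
    -- Assumed unique key; the post only says a repeated insert fails.
    PRIMARY KEY (app_id, agg_id, metric_type)
);
-- Hypothetical sample data: 3 applications, 10 metric types, 2 network types.
INSERT INTO applications VALUES (1), (2), (3);
INSERT INTO metric_types VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9);
INSERT INTO aggregates VALUES (1, 1, 'LAN'), (2, 1, 'Internet - VPN');
""")

# Subquery from the post that picks the network type whose name contains 'VPN'.
VPN_AGG = ("(SELECT MAX(agg_id) FROM aggregates "
           "WHERE agg_type = 1 AND agg_name LIKE '%VPN%')")

def clear_vpn_thresholds(db):
    """Remove any thresholds currently tied to the VPN network type."""
    db.execute(f"DELETE FROM performance_incident WHERE agg_id = {VPN_AGG}")

def disable_vpn_thresholds(db):
    """Insert one row per application for each of the six ignored metrics."""
    db.execute(f"""
        INSERT INTO performance_incident
            (app_id, agg_id, metric_type, thres1, thres1_type,
             thres2, thres2_type, observations)
        SELECT a.app_id, {VPN_AGG}, m.metric_type, 100, 0, 90, 0, 50
        FROM applications AS a, metric_types AS m
        WHERE m.metric_type IN (0, 1, 2, 3, 4, 9)
    """)

row_counts = []
for _ in range(2):  # run the pair twice to show it can be repeated safely
    clear_vpn_thresholds(conn)
    disable_vpn_thresholds(conn)
    count = conn.execute("SELECT COUNT(*) FROM performance_incident").fetchone()[0]
    row_counts.append(count)
print(row_counts)  # [18, 18]: 3 applications x 6 metrics each time
```

Because the FROM clause lists both tables with no join condition, every application is paired with every selected metric type, which is where the 6 rows per application come from.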