Tuesday, May 7, 2013

SuperAgent Application Thresholds

SuperAgent thresholds are made up of several components.  The most critical one is the baseline sensitivity.  Out of the box, SA thresholds are applied to every application and use the sensitivity values chosen by NetQoS.  Two types of thresholds can be applied: sensitivity and milliseconds.

A warning before we get to the queries further down: doing anything in the database directly isn't supported by CA and you may break your stuff.  If you do, I'm not responsible and CA will probably have you revert to a db backup before even considering talking to you.  So either don't tell them that you did this or make sure you can back up and restore your database as needed.  There, you have been warned.

Sensitivity 

Sensitivity is a unitless number between 0 and 200. This type of threshold looks for deviations from baseline. A higher number (think 'more sensitive') will alert on a slight deviation. A lower number will not alert until the deviation is more extreme. Think of it as that sensitive co-worker who goes to HR for everything. If that person is very sensitive, any little thing will cause them to go to HR. If they were less sensitive, it would take something more extreme for them to march over and report you.

Sensitivity thresholds are really handy since the actual numbers involved in the threshold change as the baseline changes.  This means that if one day of the week is typically different from the other days, the baseline reflects that, and so do the thresholds for that day.  SuperAgent baselines take into consideration hour of the day and other factors to get very good baselines.  The other thing that SuperAgent does with regard to baselines is that it baselines every combination individually.  Since every combination has its own baseline, a single set of thresholds that refer only to the baseline can be set across the board.  This is how things come out of the box.


Milliseconds

The second type of threshold is a more traditional threshold that compares the measured value against a fixed number of milliseconds.  This threshold is much harder to set since you'd have to track data and understand what values make sense for each application.  This type of threshold does have one advantage: baseline creep protection.

Baseline creep is when the baseline increases over time because of slowly degrading performance.  Thresholds tied to that baseline slowly increase along with it, so the degradation never looks like a deviation and never triggers an alert.  This is like boiling a frog.  You start out with a live frog in cool water and heat it up gradually.  By the time the water is hot enough to kill and boil the frog, it's too late for the frog to jump out.
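To make the frog concrete, here is a toy simulation.  It is only a sketch: the moving-average baseline, the 1.5x multiplier, and every number in it are assumptions picked for illustration, not SuperAgent's actual baseline math.  It compares a threshold that follows the baseline with a fixed millisecond limit while response time degrades by 1% per day.

```python
def simulate_creep(start_ms=100.0, growth_per_day=0.01, days=120,
                   baseline_multiplier=1.5, fixed_limit_ms=200.0,
                   smoothing=0.1):
    """Return (first day the baseline-relative threshold fires,
    first day the fixed millisecond threshold fires).
    None means that threshold never fired during the simulation."""
    baseline = start_ms            # assumed: baseline starts at the healthy value
    relative_alert_day = None
    fixed_alert_day = None
    for day in range(days):
        # Performance degrades slowly and steadily.
        observed = start_ms * (1 + growth_per_day) ** day
        # Threshold tied to the baseline (stand-in for a sensitivity threshold).
        if relative_alert_day is None and observed > baseline * baseline_multiplier:
            relative_alert_day = day
        # Traditional fixed threshold in milliseconds.
        if fixed_alert_day is None and observed > fixed_limit_ms:
            fixed_alert_day = day
        # The baseline learns from what it sees, so it creeps upward too.
        baseline += smoothing * (observed - baseline)
    return relative_alert_day, fixed_alert_day


relative_day, fixed_day = simulate_creep()
print("baseline-relative threshold first fired on day:", relative_day)
print("fixed 200 ms threshold first fired on day:", fixed_day)
```

With these made-up numbers the baseline-relative threshold never fires, because the baseline keeps catching up with the slow degradation, while the fixed limit eventually trips once response time has doubled.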


Minimum Observation Count 

SuperAgent also takes into consideration the fact that a single observation of a transaction that exceeds a threshold (either sensitivity or millisecond) is nothing to pay attention to.  The problems really come into play when many observations are seen exceeding the threshold.  The minimum observation count is the number of observations that must exceed the threshold within a 5 minute period before the whole 5 minute period is marked as degraded or excessively degraded.  These numbers are quite low out of the box.  It is common practice to bump these numbers up (usually by a factor of 10) in order to reduce the amount of noise that is reported by SA.  More on this later.
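The counting rule above can be sketched in a few lines.  This is a hedged illustration, not SuperAgent's code: the function name, the parameter names, and the choice of "at least" rather than "more than" the minimum are all assumptions.

```python
def classify_interval(response_times_ms, degraded_ms, excessive_ms,
                      min_observations):
    """Classify one 5 minute interval from its individual observations.

    degraded_ms and excessive_ms are hypothetical names for the two
    threshold levels; min_observations is the minimum observation count.
    """
    over_excessive = sum(1 for t in response_times_ms if t > excessive_ms)
    over_degraded = sum(1 for t in response_times_ms if t > degraded_ms)
    # Assumption: the interval is flagged once the count reaches the minimum.
    if over_excessive >= min_observations:
        return "excessively degraded"
    if over_degraded >= min_observations:
        return "degraded"
    return "normal"


# Made-up interval: three slow transactions out of six.
interval = [120, 480, 130, 510, 125, 495]
print(classify_interval(interval, degraded_ms=400, excessive_ms=900,
                        min_observations=3))
print(classify_interval(interval, degraded_ms=400, excessive_ms=900,
                        min_observations=30))
```

The same three slow transactions mark the interval as degraded with a minimum of 3 but leave it normal with a minimum of 30, which is why raising the count cuts noise without touching the thresholds themselves.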


Default Application Thresholds 

When an application is configured, either by a user or by the system, a default set of thresholds is applied.  The same settings are used for all applications.  This can be a problem with newer SA systems since auto-discovery tends to create many applications.  If they are all using the default thresholds, the result can be a lot of noise.  This is not because the thresholds are too low.  Remember, the default thresholds are tied to the baseline.  The real problem is that the default minimum observation numbers are too low.  Luckily, these numbers can be changed.


Changing Thresholds Through the Web GUI 

The thresholds and minimum observations can be changed in the GUI in two different places: in the applications list or under policies.  The applications list is the better place to be if you want to change more than one application/network type set at a time.  In the applications list, multiple applications can be selected (maximum of 100 applications selected at a time) and the thresholds edited for all those applications.  This may be handy at least for editing the thresholds of the user created applications.

New in ADA 9.3! - A new option has been added to the GUI that allows the modification of the default threshold for new applications (new system discovered applications and new user defined applications).  Go to Administration >> Policies >> Performance Thresholds.  The middle table allows modification of the default threshold set.  You should also go back to applications that have already been defined and update those thresholds.  Once an application is discovered by the system or created by the user, the thresholds are independent of the default set.

Changing Thresholds Through a MySQL Query 

When changing the thresholds for the system applications, there are several tactics.  The first involves increasing the minimum observation count.  This can be done with a fairly simple query that both increases the minimum observation count for all defined applications and modifies the default application thresholds so that all future applications use the same settings.

-- run this query to increase the minimum observation count by a factor of 10.
update performance_incident set observations = observations * 10;

You shouldn't have to reload the collectors to get this change to take effect.  However, if you do experience problems seeing the updated threshold values, reloading the collectors should fix it.
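If you want to see what that update does before touching the real thing, here is a small rehearsal.  It is a sketch only: it uses Python's built-in sqlite3 as a stand-in for MySQL, the sample rows are invented, and the performance_incident_backup table is a hypothetical name I made up for the safety copy.  Only the column names come from the queries in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway database, not the SA database
conn.execute(
    "CREATE TABLE performance_incident ("
    " app_id INTEGER, agg_id INTEGER, metric_type INTEGER,"
    " thres1 INTEGER, thres1_type INTEGER,"
    " thres2 INTEGER, thres2_type INTEGER, observations INTEGER)"
)
# Invented rows; only the last column (observations) matters here.
sample_rows = [
    (1, 0, 0, 100, 1, 90, 1, 5),
    (1, 0, 1, 100, 1, 90, 1, 10),
    (2, 0, 0, 100, 1, 90, 1, 50),
]
conn.executemany(
    "INSERT INTO performance_incident VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    sample_rows,
)

# Hypothetical safety copy so the change can be undone.
conn.execute(
    "CREATE TABLE performance_incident_backup AS"
    " SELECT * FROM performance_incident"
)

# The query from the post.
conn.execute("UPDATE performance_incident SET observations = observations * 10")
after = [row[0] for row in conn.execute(
    "SELECT observations FROM performance_incident ORDER BY rowid")]
print("after update:", after)

# Undo by restoring from the copy.
conn.execute("DELETE FROM performance_incident")
conn.execute(
    "INSERT INTO performance_incident SELECT * FROM performance_incident_backup"
)
restored = [row[0] for row in conn.execute(
    "SELECT observations FROM performance_incident ORDER BY rowid")]
print("after restore:", restored)
```

Note that the update is not idempotent: running it twice multiplies by 100, so a copy of the original values is worth having.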


Setting Thresholds for Internet/VPN Network Type 

A best practice when configuring SuperAgent is to configure a special network type for all the network definitions in SA whose network performance is not entirely within your control.  Alarming on networks like this is ineffective since the resulting alarms are not actionable.  I usually create a network type called 'Internet - VPN' to indicate any networks that are entirely or partially out of my domain of control.  In other words, I set the network type to 'Internet - VPN' for any client IP address ranges across the internet or on another organization's network.  If I were to detect a problem with the network metrics to a user within one of these networks, I wouldn't know if the problem were within my portion of the network or out on the internet.  If it were out on the internet, I wouldn't be able to do much about it.

So, first of all, create the 'Internet - VPN' network type and assign all your non-internal IP address ranges to it.  This would include VPN IP addresses since a portion of their conversation occurs over the internet.

The next step is optional, since the third step negates its necessity.  However, if you don't want to go ahead with the third step, implementing this step will at least prevent you from getting alerts on the network metrics for those networks.  All that you need to do is create a new network incident response for the 'Internet - VPN' network type and don't assign any actions to it.  This should weed out email notifications from issues detected for networks where you can't help the network performance.

New in ADA 9.3! - A new option has been added to the GUI that negates having to perform step three using direct database manipulation.  Instead, go to Administration >> Policies >> Performance Thresholds.  Click 'Add Custom by Network Type' in the second table.  Pick the 'Internet - VPN' network type.  Change the Network and Combined thresholds from 'Use Default' to 'Customize' then change the now enabled drop downs from 'Sensitivity' to 'None'.  You'll want to do this for NRTT, RTD, NCST, ERTT, DTT, and TTT.

Step three involves a little database manipulation.  Essentially, you will need to add a record to the performance_incident table for every metric/app combo you want to ignore.  Since you'll need to ignore NRTT, RTD, NCST, ERTT, DTT, and TTT, you'll need to add 6 rows for every application.  Luckily, this isn't too hard.  The only downside is that this doesn't set things up for any future applications, so you'll have to repeat the process whenever applications are added.  When you repeat it, the insert query will fail unless you first undo everything it added the last time.  This first query does that undo: it removes all the threshold sets for the network type containing the string 'VPN'.  Make sure your network type has this string or modify the query below.

-- run this query to remove any thresholds currently tied to that network type
Delete from performance_incident where agg_id = (select max(agg_id) from aggregates where agg_type=1 and agg_name like '%VPN%');

Once you've done that, or if this is the first time you're running this, run the following query.  Again, make sure your network type has the string 'VPN' in the name.  Essentially, this inserts a row ignoring thresholding for the VPN network type (hence the 0's in the thres1_type and thres2_type positions of the query below) for every application and for each of the metrics we want to ignore (hence the list of metric_type numbers at the end).

-- run this query to disable network and combined metrics for the network type whose name contains the string: VPN
INSERT INTO performance_incident (app_id, agg_id, metric_type, thres1, thres1_type, thres2, thres2_type, observations)
SELECT a.app_id, (select max(agg_id) from aggregates where agg_type=1 and agg_name like '%VPN%'), m.metric_type, 100, 0, 90, 0, 50 as observations
FROM applications as a, metric_types as m where m.metric_type in ( 0 , 1 , 2 , 3 , 4 , 9 );
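Here is a rehearsal of the delete-then-insert pair on a toy database, to show why the order matters.  Treat it as a sketch under assumptions: it uses Python's built-in sqlite3 instead of MySQL, the tables hold only the columns the queries touch, the sample applications and network types are invented, and the primary key on (app_id, agg_id, metric_type) is my guess at why a repeated insert fails on the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway database, not the SA database
conn.executescript("""
CREATE TABLE applications (app_id INTEGER PRIMARY KEY);
CREATE TABLE metric_types (metric_type INTEGER PRIMARY KEY);
CREATE TABLE aggregates (agg_id INTEGER PRIMARY KEY, agg_type INTEGER, agg_name TEXT);
CREATE TABLE performance_incident (
    app_id INTEGER, agg_id INTEGER, metric_type INTEGER,
    thres1 INTEGER, thres1_type INTEGER, thres2 INTEGER, thres2_type INTEGER,
    observations INTEGER,
    PRIMARY KEY (app_id, agg_id, metric_type)  -- assumed uniqueness rule
);
INSERT INTO applications VALUES (1), (2);
INSERT INTO metric_types VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9);
INSERT INTO aggregates VALUES (3, 1, 'LAN'), (7, 1, 'Internet - VPN');
""")

VPN_AGG = ("(select max(agg_id) from aggregates "
           "where agg_type=1 and agg_name like '%VPN%')")

# The two queries from the post, unchanged apart from line wrapping.
DELETE_SQL = "DELETE FROM performance_incident WHERE agg_id = " + VPN_AGG
INSERT_SQL = (
    "INSERT INTO performance_incident (app_id, agg_id, metric_type, thres1,"
    " thres1_type, thres2, thres2_type, observations) "
    "SELECT a.app_id, " + VPN_AGG + ", m.metric_type, 100, 0, 90, 0,"
    " 50 as observations "
    "FROM applications as a, metric_types as m "
    "WHERE m.metric_type in (0, 1, 2, 3, 4, 9)"
)


def disable_vpn_thresholds(connection):
    """Undo any earlier run, then insert one row per application and metric."""
    connection.execute(DELETE_SQL)
    connection.execute(INSERT_SQL)
    connection.commit()


disable_vpn_thresholds(conn)
disable_vpn_thresholds(conn)   # safe to repeat because the delete runs first
row_count = conn.execute("SELECT count(*) FROM performance_incident").fetchone()[0]
print("rows:", row_count)      # 2 applications x 6 metrics

# Running the insert on its own a second time trips the assumed unique key.
try:
    conn.execute(INSERT_SQL)
    insert_alone_failed = False
except sqlite3.IntegrityError:
    insert_alone_failed = True
print("insert without delete failed:", insert_alone_failed)
```

Wrapping both queries in one helper means the undo can't be forgotten when you rerun this after new applications show up.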