Friday, March 30, 2012

Understanding SuperAgent Network Regions

I've found that many people don't understand the concept of regions in a network definition in SuperAgent.  Given that regions make defining networks easier and reports more granular, I'm actually quite surprised the feature hasn't been evangelized more.  So, here's my explanation:

SuperAgent organizes data into buckets.  SA could store the analysis data for every single client IP address in its own bucket in the database, but that's kind of the point of MTP.  Also, reports that granular are only helpful if you already know where the problem is.  In addition, if you think about it, storing the analysis of two client IP addresses in two individual buckets doesn't make sense if those two clients are connected to the same switch, which uses the same router to reach the WAN, which comes into the same network hardware in the datacenter.  If the two clients are using all the same network hardware, measuring two different network round trip times for them is virtually impossible.  Think about it: the only thing that differs is the client's NIC, which doesn't really affect SA metrics thanks to modern technologies like the TCP offload engine (TOE), which brings the ACK turnaround time on the NIC down to sub-millisecond.

OK, so that's the reason to summarize networks according to the network path.  If a bunch of IP addresses use the same network path to get back to the servers monitored by SA, there's not much value in storing the analysis on a per-IP basis.

However, for groups of IP addresses that do use different network infrastructure, it is imperative to separate them so that the differentiating network hardware can be isolated, identified, and troubleshot.

Therefore SA provides the ability to define client networks.  Each client network instructs SA how to group IP address blocks together and treat them as one unit for analysis and storage.  Each network definition should only contain the IP addresses that share all of their network infrastructure.

This is nice because it cuts down on the amount of configuration required in SA.  To illustrate, let me give an example.  A US company has decided that its IP addressing scheme is to allocate an entire /10 block of IP addresses to each time zone (e.g. 10.0.0.0/10 for Eastern, 10.64.0.0/10 for Central, 10.128.0.0/10 for Mountain, and 10.192.0.0/10 for Pacific).  It then decides to allocate an entire /19 block of IP addresses to each site within that time zone (e.g. 10.0.32.0/19 for NYC, 10.64.32.0/19 for Chicago, and 10.192.128.0/19 for LAX, among others).  This is actually really easy to configure in SA.  The networks would be defined in SA as follows:
Network Name    Network         Mask
EST             10.0.0.0        10
CST             10.64.0.0       10
MST             10.128.0.0      10
PST             10.192.0.0      10
NYC             10.0.32.0       19
...             ...             19
CHI             10.64.32.0      19
...             ...             19
LAX             10.192.128.0    19
...             ...             19
Each of the time zone IP address blocks should be configured so that any clients in the time zone that don't match a site definition still get categorized somewhere.  Any traffic showing up in those networks is an indicator that a site definition is missing.  On a side note, the time zone networks could be given their own network type, and special, tighter thresholds could be applied so that incidents trip immediately for any amount of NRTT.  A special network incident response could be set up to send an email to the SA admin notifying him/her that traffic has been seen on a time zone network (indicating a site network definition that is either missing or incomplete).
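The implication of this catch-all behavior is that a client gets counted under the most specific network definition that contains it.  Here's a toy Python sketch of that longest-prefix idea (just an illustration of the concept, not SA's actual code), using the hypothetical networks from the table above:

import ipaddress

# Hypothetical network definitions mirroring the table above
networks = {
    "EST": ipaddress.ip_network("10.0.0.0/10"),
    "CST": ipaddress.ip_network("10.64.0.0/10"),
    "MST": ipaddress.ip_network("10.128.0.0/10"),
    "PST": ipaddress.ip_network("10.192.0.0/10"),
    "NYC": ipaddress.ip_network("10.0.32.0/19"),
    "CHI": ipaddress.ip_network("10.64.32.0/19"),
    "LAX": ipaddress.ip_network("10.192.128.0/19"),
}

def classify(client_ip):
    # Pick the most specific (longest-prefix) defined network containing the client
    ip = ipaddress.ip_address(client_ip)
    matches = [(name, net) for name, net in networks.items() if ip in net]
    return max(matches, key=lambda m: m[1].prefixlen)[0] if matches else None

print(classify("10.0.40.17"))  # NYC -- covered by a site definition
print(classify("10.5.12.9"))   # EST -- no site match, falls back to the time zone block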

While this is great, the network administrators at the US-based company decided that a standard of 32 VLANs should be implemented at every site.  Each VLAN should be a /24 subnet, and each VLAN has a standard use (floor 1, floor 2, floor 3, printers, servers, wireless, etc.).  With only the networks above defined in SA, the network administrator won't be able to differentiate between bad performance on a wireless VLAN and bad performance on a wired VLAN.  At this point the administrator has two options: 1) rebuild all the network definitions, defining every single /24 subnet, or 2) define 32 regions in each of the site network definitions.  The better option is #2.  Here's why:

Defining 32 regions on a /19 network definition in SA is equivalent to defining all 32 /24 sub-subnets within that /19 network.  It's shorthand.  Once defined, the /19 network definition will have a plus sign (+) next to it.  When clicked, the admin can see that SA actually has 32 networks defined within that /19.  The nice thing is that they are all grouped together according to site (/19 network).
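As a quick sanity check on that arithmetic (a throwaway Python snippet, nothing SA-specific), a /19 block splits into exactly 32 /24 subnets:

import ipaddress

subnets = list(ipaddress.ip_network("10.0.32.0/19").subnets(new_prefix=24))
print(len(subnets))              # 32
print(subnets[0], subnets[-1])   # 10.0.32.0/24 10.0.63.0/24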

One disadvantage is that the name originally assigned to the /19 network is also assigned to all of its sub-subnets (regions).  This can be overcome by expanding the /19 (hitting the plus sign) and renaming the regions as necessary; each region can be named individually.  The other way around this is to fall back to option 1 and create a CSV containing all the /24 networks, each with a site name prefix and a VLAN designator (name and/or VLAN number).
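If option 1 is chosen, the CSV doesn't have to be built by hand.  Here's a rough Python sketch that emits one row per /24 VLAN for each site; the column layout, file name, and VLAN naming are placeholders that would need to match whatever import format your SA version actually expects:

import csv
import ipaddress

# Hypothetical site blocks; adjust to the real addressing plan
sites = {
    "NYC": "10.0.32.0/19",
    "CHI": "10.64.32.0/19",
    "LAX": "10.192.128.0/19",
}

with open("sa_networks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["NetworkName", "Network", "MaskLength"])  # placeholder header row
    for site, block in sites.items():
        for vlan, subnet in enumerate(ipaddress.ip_network(block).subnets(new_prefix=24), start=1):
            # e.g. NYC-VLAN01, 10.0.32.0, 24
            writer.writerow([f"{site}-VLAN{vlan:02d}", str(subnet.network_address), subnet.prefixlen])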

Thursday, March 29, 2012

Understanding SA Discovery and Pruning/Grooming

First of all, a little conceptual history around SuperAgent:
SuperAgent was meant to automate the task of analyzing packet captures for essential metrics indicating server or network latency.  An engineer wanted a better way to do it than manually, and SA was born.  Since its inception, it has grown by leaps and bounds, increasing its capabilities.  Despite the growth, one major concept has remained: SA is meant to automate a manual process for your top applications.  This is not a scalability issue; it's something fundamental to the thought process behind every revision of the product.  SA is meant to analyze the transactions of applications of interest to determine where latency lies.

With the most recent version, SA added a feature that automatically discovers and configures applications.  This opened up a whole new area of SA since admins no longer had to manually configure the applications they were interested in.  All they had to do was identify the servers that might be involved and SA did the rest.  Expectations began to rise since admins could now easily increase the bounds of what was considered an 'application of interest'.

In order to prevent performance problems that might arise in very complex environments, the developers imposed a limit on the discovery process.  When the discovery process has discovered and configured 1000 servers or 1000 applications (whichever comes first), a pruning process begins.  This algorithm reevaluates the active combinations every 5 minutes to determine which 1000 servers and which 1000 applications will remain in the configuration.  This doesn't affect any applications configured by the administrator, and it shouldn't affect the largest, most active applications.  Administrators have to understand that this is by design and that the applications configured in SA don't necessarily represent all the applications hosted by a server.

Luckily, the server and application limits can be raised with a simple query in the database.  To view the current limits execute the following query:
select * from parameter_descriptions where parameter like 'maxNumAuto%';
Updating those values will change the limits.  Remember, those limits were put into place to prevent performance problems.  Also, SA hasn't been tested by CA's QA department with any limit other than 1000, so if you run into problems after changing those limits, you'll get pushback from support because of it.  This is one of the things included in the CIG, which is basically required for every case, so support will know that you did it.
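For what it's worth, here is a rough Python sketch of both steps (read the current values, then raise them) run against the SA database.  The connection details, database name, and new value are placeholders, and I'd confirm with support whether a service restart is needed for the change to take effect:

import pymysql

# Placeholder connection details for the SA master console database
conn = pymysql.connect(host="sa-mc.example.com", user="netqos", password="changeme", database="netqos")
with conn.cursor() as cur:
    # Current limits (same query as above)
    cur.execute("SELECT Parameter, DefaultValue FROM parameter_descriptions WHERE Parameter LIKE 'maxNumAuto%'")
    for parameter, value in cur.fetchall():
        print(parameter, value)
    # Example: raise both limits to 1500 -- unsupported territory beyond 1000
    cur.execute("UPDATE parameter_descriptions SET DefaultValue = '1500' WHERE Parameter LIKE 'maxNumAuto%'")
conn.commit()
conn.close()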

I have increased the limits by 500 in some cases, just to push the envelope a little.  I didn't experience severe, immediate problems.  If you need much more than that, consider more infrastructure (read: more SA master consoles).

Wednesday, March 21, 2012

Disabling SuperAgent Relationship Groups

When an application is configured in SuperAgent, the application actually consists of two parts: the application configuration item itself and the servers assigned to the application.  In order to grant permissions to specific applications, administrators would need to create a group in NPC containing both the application configuration item and the servers.  The problem is that such a group would be static; any time the application configuration changed in SuperAgent, the group would have to be updated.

The answer to this problem is a pair of group sets created by SuperAgent that contain these items.  One set contains a group for every application in SuperAgent, holding the application CI and the servers assigned to it.  The other set contains a group for every server, holding the server and any application CIs the server belongs to.

This worked well in the past when applications were manually configured: it only resulted in twice as many groups as the number of applications/servers the admin was willing to configure.  However, with the advent of automatic application discovery and configuration in SuperAgent, the number of these groups can skyrocket.  A high number of groups can degrade the performance of the NPC sync, so there is a need to disable them.  The only downside of disabling them is that you can no longer take advantage of the dynamic groups for permissions purposes.

In order to disable these groups, you have to go to SuperAgent and execute a couple of queries.  Before executing any queries you find on the internet, you should back up your database.  There.  You have been warned.
REPLACE INTO parameter_descriptions (Parameter, Level, Type, DefaultValue, Description) VALUES
  ('SyncRelationshipGroups',   'ProductSync', 'boolean', 'false', 'Sends App/Svr relationship groups to the performance center'),
  ('SyncRelationshipsEnabled', 'ProductSync', 'boolean', 'true',  'Sends App/Svr relationship combinations to the performance center');
UPDATE parameter_descriptions SET DefaultValue = '0' WHERE Parameter IN ('pullLastFullSyncTime', 'pushLastFullSyncTime', 'pullLastIncrSyncTime', 'pushLastIncrSyncTime');
UPDATE parameter_descriptions SET DefaultValue = 'true' WHERE Parameter = 'pullForceFullSync';
After running these queries, kick off a full resync of NPC.  It may take some time for all those groups to get deleted from NPC.  I don't suggest doing this on multiple SA MCs at once.  Do them one at a time and let NPC sync with one before moving on to the next.

Friday, March 16, 2012

How to Build or Modify Groups in NPC by using XML

In a previous post, I discussed how to build application reporting groups using dynamic system groups.  While this strategy is the recommended way of building application reporting groups, it can become tedious to actually copy and paste all the system groups into your application group, especially if you have the most recent version of SuperAgent and it has discovered a ton of applications and networks.  Luckily, there is an easier, unpublished way: XML.  NPC has the ability to use XML to modify the group structure, including the ability to put referential copies of system groups into custom application reporting groups.

The following is an excerpt from some documentation I wrote about how to use the web service to modify various portions of the NetQoS systems:

Group management in NPC is performed through the AdminCommand web service (found at PortalWebService/AdminCommandWS.asmx?WSDL).  Adding or removing groups and adding or removing members from groups is accomplished by invoking the UpdateGroups operation (found at PortalWebService/AdminCommandWS.asmx?op=UpdateGroups).  Three main parameters are required: (1) useIDs, (2) allowDeletes, and (3) the XML defining the groups to be added/removed/updated.  Groups themselves are defined explicitly in the XML, while membership of groups is managed by a set of rules applied to each group.  Managed objects that match the rules are included as members of the group; objects that don't match are excluded.
The first two parameters are global and don’t change often.  The parameter useIDs instructs the operation whether or not to use the group ID values from the XML definition when identifying the group to be updated.  The useIDs parameter will need to be set to ‘true’ to ensure the correct groups are updated.  The only time a value of false would be used is when a single XML file is being used to import a group structure from one NPC system to another.