Friday, May 27, 2011

SNMP Polling vs. Traps

SNMP has been around for decades.  Many manufacturers build SNMP agents into their products so NMS nodes can monitor their status.  There are two ways SNMP can be used to monitor a device: 1) active, regular polling of the device by the NMS and 2) traps sent by the device to the NMS.  Unfortunately, many people seem to know about only one method or the other.  I'm approached regularly with requests to monitor a particular set of devices via SNMP.  I ask what metrics they'd like to monitor (approaching from method #1) and they usually respond with the MIB and say, "We want to monitor everything."  After a short discussion about what they expect to happen, their requests usually come down to wanting NV to be able to receive any and all traps defined in the MIB.  Oh the humanity.

First of all, any SNMP receiver can receive any trap from any device, usually with minimal configuration (having to do with firewalls and ACLs).  Whether it can do something with that trap is something else entirely.  NV, out of the box, will receive traps from any device that has UDP connectivity with NV.  Using the MIBs already compiled, NV will try to interpret the trap to understand what the trap means.  If you want better interpretation of the trap, compile in a MIB that describes the trap.
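To make the "receiving is easy, interpreting is something else" point concrete, here is a minimal sketch of a bare trap listener in Python.  It is not how NV works internally; the function names are made up for illustration, and the only SNMP-specific fact it relies on is that traps conventionally arrive on UDP port 162.

```python
import socket

SNMP_TRAP_PORT = 162  # conventional SNMP trap port; binding it usually needs elevated privileges


def open_trap_socket(host="0.0.0.0", port=SNMP_TRAP_PORT):
    """Bind a UDP socket that any device with UDP connectivity can send to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_one(sock, bufsize=65535):
    """Block until one datagram arrives and return (sender_ip, raw_bytes)."""
    data, (sender_ip, _sender_port) = sock.recvfrom(bufsize)
    return sender_ip, data


def describe(sender_ip, data):
    """Without a decoder and a compiled MIB, the payload is just opaque bytes."""
    return f"trap from {sender_ip}: {len(data)} undecoded bytes"
```

Calling `describe(*receive_one(open_trap_socket()))` in a loop would log every trap that reaches the host, but it says nothing about what any of them mean.  That gap is what compiling in the right MIB closes.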

However, traps are not the best way to go.  Active polling is better for a couple of reasons.  Let me draw an analogy: traps are how a typical college kid would communicate with his parents, only calling them when he needs something.  While this does count as communication, the parents aren't very well informed of the kid's progress in school.  Active polling is like a good son who is on a call with his parents every Sunday afternoon, filling them in on his progress on a regular basis.  Obviously, the good son is maintaining better communication with his parents and will be more likely to get the help he needs when or before he needs it.

Active polling has several advantages: 1) since polled metrics can be stored, historical analysis can be performed, 2) given that history, thresholds can be set just above 'normal' values to flag a problem while it is still forming, before it becomes apparent, 3) active polling usually involves some sort of discovery process, which can help identify devices that come online that might not have been configured with the proper SNMP target for traps, and 4) with NV in particular, configuring notifications and alarms based on active polling is much easier and more straightforward than configuring them on traps.
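Advantage 2 can be sketched in a few lines.  This is only one simple way to put a threshold "just above normal" (mean plus a few standard deviations of the stored samples); the function names, the multiplier `k`, and the sample values are all assumptions for illustration, not anything NV does.

```python
from statistics import mean, stdev


def baseline_threshold(history, k=3.0):
    """Return mean + k sample standard deviations of the stored poll results."""
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate the spread")
    return mean(history) + k * stdev(history)


def is_warning(value, history, k=3.0):
    """True when a new poll result sits above the learned 'normal' band."""
    return value > baseline_threshold(history, k)


# Hypothetical CPU utilisation (%) collected by earlier polls.
cpu_history = [40, 42, 41, 39, 43, 40, 41]
```

With this history the threshold lands a little under 45%, so a reading of 50% would be flagged long before a manufacturer's fixed "CPU critical" trap would typically fire, while a reading of 43% would not.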

Traps do have the advantage that if something is wrong that isn't being monitored, the device can still send out a trap indicating that.

9 comments:

  1. I admire this article for the well-researched content and excellent wording. I got so involved in this material that I couldn’t stop reading. I am impressed with your work and skill. Thank you so much. Best Rat Traps 2018

  2. Hi, I wanted to know whether we can POLL some attribute defined as TRAP in MIB File.

  3. Sorry for the delay. Let me clarify whether or not you can poll something defined as a trap. Technically, you cannot poll the OID of the trap. That OID references the trap definition and doesn't correspond to any actual value in the MIB on the device. I answered yes earlier because I assumed this meaning of the question: "Can you poll the values that a trap reports on?" To use correct terminology, the question would be restated thus: "When I receive a trap, it usually contains a value. Can I poll that value?" The answer to that question is yes.

    Here's why: when a trap is sent out, the trap itself has an OID, and that OID identifies the trap. However, traps are just an encapsulation and data transit mechanism, and each trap definition may or may not also have bindings. Traps therefore come in one of two flavors: empty and non-empty (this is my terminology). Empty traps indicate a problem in and of themselves, just by their presence. Non-empty traps indicate that something is going on and carry details about it.

    To help you, the trap receiver, know what the trap is indicating, the trap can contain one or more bindings. These bindings are key-value pairs, where the key is some OID in the MIB that contains data and the value is the data itself. In this way, the trap can say something like "There is a problem with your CPU" and the bindings could contain one OID that tells which CPU has the problem, another OID to indicate the severity of the problem, and yet another OID that gives you a measure of the problem. The trap OID itself can't be polled. However, the bindings should be pollable.
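A small sketch of the structure described above, assuming a made-up `Trap` class and made-up OIDs under a fictitious enterprise number (99999); none of these come from a real MIB.

```python
from dataclasses import dataclass, field


@dataclass
class Trap:
    trap_oid: str                                  # identifies the trap definition; not pollable
    bindings: dict = field(default_factory=dict)   # OID -> value carried inside the trap

    def is_empty(self):
        """An 'empty' trap signals a problem purely by being sent."""
        return not self.bindings

    def pollable_oids(self):
        """The binding keys are ordinary MIB objects, so these are what you would poll."""
        return list(self.bindings)


# Hypothetical "CPU problem" trap with three bindings.
cpu_trap = Trap(
    trap_oid="1.3.6.1.4.1.99999.0.1",
    bindings={
        "1.3.6.1.4.1.99999.1.1.0": 2,    # which CPU
        "1.3.6.1.4.1.99999.1.2.0": 4,    # severity
        "1.3.6.1.4.1.99999.1.3.0": 97,   # measured load (%)
    },
)
```

Here `cpu_trap.pollable_oids()` returns the three binding OIDs, and `cpu_trap.trap_oid` is deliberately not among them.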

  4. Polling assumes that you know what you are looking for. Akin to the old "Can you hear me now?" Verizon commercials, polling also requires a response, so there is more overhead on edge devices. Traps report on exception; therefore, there is less overhead.

    If you poll for something that can exceed a threshold and recover between two polls, your monitor will miss the event. For example, third degree burns occur at 185 degrees Fahrenheit in just two seconds of exposure. A poll for surface skin temperature set every three seconds could miss a "hot coffee spill" event. Also, the polled chart history would not show the extent of the high-temperature event. On the other hand, a trap would report on exception. It would not miss the cause of the proverbial burn...

    So, depending on what you are monitoring and why, polls are not always the best way to go.
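The hot-coffee scenario can be simulated directly. This is a toy sketch with invented readings and an invented `poll_gauge` helper; it only shows that a gauge read every three seconds can step right over a two-second spike.

```python
def poll_gauge(samples, interval):
    """Return the values a poller sees when it reads every `interval` seconds.

    samples[t] is the true gauge value at second t.
    """
    return samples[::interval]


# Hypothetical skin-temperature readings in degrees F, one per second.
# The spill covers seconds 4 and 5 only.
temps = [93, 93, 93, 93, 185, 185, 93, 93, 93, 93]

seen = poll_gauge(temps, interval=3)      # reads seconds 0, 3, 6 and 9
spike_missed = max(temps) > max(seen)     # True: the poller never saw 185
```

Shifting the spike by one second, or polling every second, would catch it, which is exactly why a gauge that is polled offers no guarantee for short-lived events.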

  5. That's fairly true for gauge-based OIDs. However, for counter-based OIDs, you can catch deviations like that. As long as you check back in at some point, you will see the increase in the counter value. Dividing the delta between the current and previous poll results by the time between polls gives you an average rate of change. The smaller the polling interval, the more granular the data.
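The delta-over-interval calculation looks roughly like this. The function name is made up, and the wrap handling for 32-bit counters is an addition that the comment above does not discuss; it assumes the counter wrapped at most once between polls.

```python
COUNTER32_MODULUS = 2 ** 32  # a Counter32 wraps back to 0 after 2**32 - 1


def counter_rate(prev_value, curr_value, seconds, modulus=COUNTER32_MODULUS):
    """Average change per second between two polls of a counter-based OID."""
    if seconds <= 0:
        raise ValueError("polls must be separated by a positive duration")
    # The modulo keeps the delta positive if the counter wrapped once.
    delta = (curr_value - prev_value) % modulus
    return delta / seconds
```

For example, `counter_rate(1000, 4000, 300)` gives 10.0 units per second over a five-minute poll, no matter when inside those five minutes the increase actually happened.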

    Replies
    1. Gauge-based OIDs correlate well with the analogy about weekly college status.

      "How are your grades?"
      "My grades are fine."

      If the device, the college student in this case, could be instructed, trained, or configured to report on exception to grades, acme/car/money, or other [problem], the parent wouldn't have to poll and graph every status object.

      Cold-call to parents: "My grades may suffer because [my car just broke down] [I just broke my leg playing football]."

      It is possible to graph and average counters between polling intervals. But in the case of an electrical fuse, network link saturation, or other condition, we would want it to operate or take action (trip/trap) not over the mean of an arbitrary polling interval, but when a condition is actually met.

      My point is that depending on the device and purpose of monitoring, traps can be useful. I agree with almost all of what you are writing here but feel that the trap's usefulness requires a little defense: polling for specifics is not _always_ the better way to go.

    2. Hey, this is good discussion. Keep it coming.

      I think you're right about traps being able to notify you more quickly in the case of long polling intervals. Arguably, for short polling intervals, learning about the problem up to 59 seconds sooner (for 1-minute polling) than active polling would discover it is not worth the effort. Active polling still has the advantage of letting you set your own thresholds instead of depending on the manufacturer's, plot the values on a graph for innate human visual analysis, run computational analysis, etc. Also, perhaps it wasn't clear, but I'm comparing traps with bindings to active polling of the same OIDs in the trap bindings. In either case, you're getting the same data. With traps you only get the data when the manufacturer thinks there is a problem. With active polling, you get the data on whatever schedule you want and you can apply smart analytics to the data, like anomaly detection. This can result in threshold violations that are much more meaningful and dynamic than those traps can provide.

      That said, if you can do both, you probably should. Be prepared, though: you might be setting yourself up for some noise.

      Yes, gauge-based OIDs correlate with the grades because the previous value of the grade doesn't matter as long as the grade is good now. However, if the student were driving a new car his parents gave him, they might want to keep an eye on things. There are two ways the student can communicate things back to his parents:
      1) Trap method - don't talk to parents for the first 9 weeks of the semester. Then on week 10: "Hey mom and dad, I've got an odometer reading of 10001. This is higher than Honda thinks it should be, so I'm calling you."
      2) Active polling - Week 1: "Hey mom and dad, I've got an odometer reading of 1000." Week 2: "Hey mom and dad, I've got an odometer reading of 2000.", etc.

      If I were this kid's dad, I would have looked into things on week 2, if not week 1 because I could see the trend and that we'd have a problem very soon. In my head, I have smarter analytics and trend forecasting than Honda built into their car.

      Also, think of speeding in that brand new car. If the student does speed, a trap would obviously catch it immediately, provided the speed passed the manufacturer's threshold. However, active polling would also catch it, since it could derive speed from a counter-based OID like the odometer. If I'm actively polling the odometer every 5 minutes, I would still see a jump in the derived rate (odometer delta / poll interval). That jump would be less pronounced the longer my polling interval, but it would still be there and definitely visible with a polling interval of 5 minutes or less. I could apply smarter thresholds (anomaly detection) to pick up on these changes.
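Here is a toy version of the odometer example showing how the jump gets diluted as the polling interval grows. The numbers are invented, `derived_rates` is a made-up helper, and the burst of speeding is lined up exactly with a 5-minute poll window, which is the most favorable case for polling.

```python
def derived_rates(odometer, interval):
    """Average miles per minute between successive polls taken every `interval` minutes."""
    polled = odometer[::interval]
    return [(later - earlier) / interval for earlier, later in zip(polled, polled[1:])]


# Hypothetical odometer, one reading per minute for 30 minutes.
# The car covers 1 mile per minute, except 2 miles per minute during minutes 10-14.
odometer = [0]
for minute in range(30):
    odometer.append(odometer[-1] + (2 if 10 <= minute < 15 else 1))

five_minute = derived_rates(odometer, 5)       # one window clearly stands out at 2.0
fifteen_minute = derived_rates(odometer, 15)   # the same burst averages out to about 1.33
```

With 5-minute polling the speeding window shows up at double the normal rate; with 15-minute polling it is still visible, but only as a rise of about a third.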

      In the end, it comes down to context: traps don't usually give enough context because of their instantaneous, point-in-time nature. Active polling gives much more context. Combine the two and you have context with point-in-time events.
