My employer is in the same situation at the moment. We have been
using a combination of a very customised Nagios, Cacti, and Perl
syslog parsing scripts for years, and are currently evaluating
various free and commercial offerings.
We would like to have a single package that can monitor and graph
the figures it gets back from the various pollers or checks without
duplication of SNMP gets/walks on every network device, and
something that can handle SNMP traps and other arbitrary events in
some intelligent way. For example, one of the more annoying things
about Nagios is that you can't send it an alarm for something unless
you've first defined that alarm. In other words, I can't receive a
critical *-1-* message from a Cisco device and pass it on to Nagios
intact - I have to at best create a generic "critical cisco event"
alarm, and submit it there, which can be problematic if I then
receive another similar alarm from a different device while the
first is already acknowledged. I could create hundreds of passive
critical cisco event checks, one for each device, and do it that
way, but then what if we get more than one critical event for the
same device? I also get very annoyed by the flap detection, which
results in us getting a critical (hard) alarm for a device, and then
never seeing the OK message because flap detection quietly
suppresses it. That might be a result of the way we've customised
it though - I'm not sure.
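As a rough illustration of the generic-alarm workaround described above, here is a minimal sketch that turns a Cisco syslog line into a Nagios passive check result using the PROCESS_SERVICE_CHECK_RESULT external command. The function name, the host name "core-sw1", the sample message, and the severity-to-state mapping are assumptions made for the example, not our production scripts. The service "critical cisco event" must already be defined on that host, which is exactly the limitation complained about above.

```python
import re

# Cisco IOS messages look like "%FACILITY-SEVERITY-MNEMONIC: text".
# The facility may itself contain hyphens (e.g. "%PM-SP-4-...").
CISCO_RE = re.compile(r"%([A-Z0-9_-]+?)-([0-7])-([A-Z0-9_]+): ?(.*)")


def cisco_to_passive_result(host, message, now,
                            service="critical cisco event"):
    """Build one Nagios external command line, or None if no match.

    'host' and 'service' must match objects already defined in Nagios,
    otherwise Nagios discards the result.
    """
    match = CISCO_RE.search(message)
    if match is None:
        return None
    facility, severity, mnemonic, text = match.groups()

    # Assumed mapping from Cisco severity to Nagios state:
    # 0-2 -> CRITICAL (2), 3-4 -> WARNING (1), 5-7 -> OK (0).
    level = int(severity)
    if level <= 2:
        state = 2
    elif level <= 4:
        state = 1
    else:
        state = 0

    # Pass the original message through as the plugin output.
    output = f"%{facility}-{severity}-{mnemonic}: {text.strip()}"
    return (f"[{int(now)}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{state};{output}")


line = cisco_to_passive_result(
    "core-sw1",
    "%SYS-1-OVERTEMP: System detected OVERTEMPERATURE condition",
    1314699480,
)
print(line)
```

The resulting line would be appended to the Nagios external command file (the path set by command_file in nagios.cfg). Because every event for a host lands on the same service, a second critical message simply overwrites the first, which is the problem with this approach.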
However, I would very much like to hear more on this thread about
what people are using, and have found to work. Even the commercial
packages seem to have serious limitations on what they can do, and
run aground when <unknown but critical device that can only be
queried via expect scripts> is added to the mix and expected to
be monitored and graphed.
On 30/08/2011 10:18 p.m., Jonathan Brewer wrote:
Hi Folks,
If you had it all to do over again, what would you use for network
monitoring: Nagios, OpenNMS, or something else entirely?
I care about availability, latency, loss, jitter, and trap handling
for interface up/down, loss of power, etc. Sensible behavior in
situations where parent routers/links are flapping is also
important.
I would very much appreciate input from folks monitoring 1000+
network elements.
Cheers,
Jon