
Hyperic is not (yet?) a Full Nagios Replacement

October 7, 2009

These days, among the many projects I’m shuffling, monitoring is taking some rather significant brain cycles. In any operational environment, monitoring is the closest link between humans and machines, primarily because monitoring is one of the main channels machines use most often to talk to humans (terminal sessions notwithstanding).

In prior lives, Nagios has been one of the primary fault monitoring systems in use. It has a relatively long history as the monitoring workhorse in a large number of production environments, and with reason: high quality, open source, and free has meant easy adoption, especially in cash-conscious startups. It compiles on essentially all contemporary Unix-based platforms, and there is a significant knowledge base on and off the Intertron. It gets the job done. Add some good design to its configuration files and its management tools, and it’s top-notch. Minus, of course, its archaic web-based GUI. Details, details, details.

Hyperic HQ is one of the monitoring tools I have been involved with most recently. Installation is relatively painless on both the server and the client (and admittedly faster than getting Nagios up and running). It is also pretty, sporting a modern, somewhat configurable interface and offering features like autodiscovery out of the box.

A notable section in the Hyperic documentation is related to Nagios (Nagios Data), and the Hyperic HQ Tour also offers some comparative notes (PDF, page 5: “Hyperic HQ Compared to Nagios”). This is no coincidence: Nagios boasts a large installed base, and that makes it a tasty target to aim for (a search for “Zenoss”, for instance, yields no results). The comparison is, by all accounts, fair. There is nothing misleading or dishonest in it, and Hyperic HQ is indeed superior to Nagios in many respects (particularly in terms of presentation). There are some points, however, where I am not quite in sync with the Hyperic story, but these are generally small (though by no means insignificant) details. For instance, I’m quite happy to have configuration data stored in plain text files when its complexity does not merit an RDBMS.

The one significant sticking point (which is not mentioned in the tour but is prominently highlighted in the documentation) is the apparent lack of any passive-like monitoring in Hyperic (no NSCA or NSCA-like support). Personally, I am a big fan of passive monitoring: components perform self-monitoring duties and report back to central command as necessary. I have used it quite successfully to monitor scripts that run periodically and are unpokeable (think all those wonderful tools you’ve written that ship logs back and forth), NetApp filers (by proxy), and almost any kind of custom software. The Nagios documentation does not do true justice to passive monitoring: it is mostly described as a way to build distributed monitoring or reach components behind firewalls. And yet, it is an incredibly elegant methodology for monitoring components asynchronously, from within the components themselves. By definition, it is highly distributed: each component (or a proxy acting on its behalf) is responsible for its own health, which leaves Nagios to serve as a great alert, escalation, and notification manager. Add powerful inheritance and the resulting monitoring setup is scalable, efficient, and straightforward to automate. It is still (no point in hiding it) ugly, but hard to beat.
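To make the idea concrete, here is a minimal Python sketch of what a self-monitoring script might emit when it reports its own status. It formats a passive service result in the two shapes Nagios accepts: the tab-separated line that the send_nsca client reads on standard input, and the PROCESS_SERVICE_CHECK_RESULT external command. The host and service names in the usage note are hypothetical, and the transport itself (piping to send_nsca or writing to the command file) is left out.

```python
import time

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def _clean(output):
    # Tabs and newlines act as delimiters, so collapse all whitespace runs.
    return " ".join(output.split())


def nsca_line(host, service, code, output):
    """Format one passive service result as send_nsca reads it on stdin:
    host, service, return code, and output, tab-separated, newline-terminated."""
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError(f"invalid return code: {code}")
    return f"{host}\t{service}\t{code}\t{_clean(output)}\n"


def external_command(host, service, code, output, now=None):
    """Format the same result as a Nagios external command line."""
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError(f"invalid return code: {code}")
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{code};{_clean(output)}\n"
    )
```

A log-shipping script, for example, could call `nsca_line("web01", "log-shipper", CRITICAL, "rsync failed")` only when something goes wrong, and hand the resulting line to send_nsca.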

While Hyperic HQ has a home for performance monitoring, my money is still on Nagios for fault monitoring. I will be learning a lot more about Hyperic HQ over the next few weeks, so I will likely have more details to report in future posts (I am seriously exploring what it would take to feed passive test results to Hyperic so they can be available in the Hyperic dashboard, for instance). Stay tuned.


7 comments

  1. I notice you mentioned Zenoss Core as an option for monitoring but that you prefer Nagios. Zenoss has a lot of additional capabilities, such as automatic discovery of devices, a somewhat more sophisticated alerting system, and syslog monitoring, among other features, so I am curious why you didn’t choose it for fault monitoring. Plus, Zenoss combines fault monitoring and performance monitoring, and it can execute Nagios plugins if you have favorites. I also noticed in an earlier post that you do some Python programming, which I would think would be a reason for you to consider Zenoss, since it is a Python application.

    I would love to get your thoughts.


  2. Hi Mark,

    Technically, I never claimed to prefer Nagios over Zenoss. I mentioned Zenoss as an example because there is no mention of it in the Hyperic documentation, whereas Nagios has a section devoted to it, specifically geared towards helping in migrating from Nagios to Hyperic (in fact, they have tools to crawl Nagios configuration files and import them into Hyperic). This leads me to speculate that Hyperic sees current Nagios installations as a good source of potential Hyperic users, whereas Zenoss (for reasons known only to Hyperic) does not seem to be as appetizing. I suspect this has to do with raw numbers and probabilities (i.e., quantity), but it’s all conjecture.

    My last direct (and admittedly brief, since I had a local resource far more familiar with Zenoss than I was) contact with Zenoss was a couple of years ago. I mention this because much has probably changed since then, and I am therefore in no position to make any authoritative comparisons in regards to Zenoss. I will say, however, that at the time we found Zenoss to be very SNMP-centric (in an environment where we had very little of it) and easy to bog down under load with active checks (the system population was fairly large). Having Nagios-based infrastructure that was performing well provided no incentive to migrate to Zenoss, particularly as brittle as it seemed to us.

    My key point is that, as much as possible, I am in favor of passive test methodology for fault monitoring because, from my point of view, it is inherently highly distributed (given components are keeping track of their own health), takes advantage of levels of introspection not readily available with active checks (i.e., external poking mechanisms), and is asynchronous (faults are reported as they happen, not during the next check). This is not to say that everything should be passively monitored: it depends upon the semantics of the test itself, and, in many cases, it is impossible or impractical to do so. Regardless, Hyperic doesn’t seem to do it: the agent model is really an extension of the active check model, where checks still poke some other entity and run on a schedule (but they do so on the client hosts). Zenoss, I believe, doesn’t offer this capability either.

    Whether the monitoring software is Python or C is almost irrelevant because I want to put (and in many cases have put) the onus of keeping track of health on the application itself. Thus, I have simple scripts or system software that is concerned about its own health, and my monitoring platform only has to be concerned about receiving status and acting upon it.

    Clearly, none of this is really relevant to performance monitoring (i.e., monitoring that isn’t really about faults but about maintaining historical data about system load and other metrics along those lines). Zenoss did well in that regard, as does Hyperic.

    I have been working on a more abstract post about my views on these ideas, but it is a complex subject, and one that, admittedly, is probably somewhat subjective. At the end of the day, I want simplicity, and there is much to be said for having a solid, well-designed template configuration in Nagios that lends itself to simple generation of host configurations from a CMDB that inherit the right properties from the right places. There is no denying that this is possibly doable with both Hyperic and Zenoss as well, but I’m certainly not qualified (yet) to ascertain how difficult it might be (so far, my Hyperic experience has left a lot to be desired, but it takes time to hit the eureka moment when automating becomes an extension of your environment).
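As an illustration of generating host configurations from a CMDB, here is a minimal Python sketch that renders Nagios host objects which inherit from templates through the `use` directive. The record layout (`name`, `address`, `template`) and the template names in the usage note are assumptions made for the example; a real CMDB export would need its own mapping, and the templates themselves would be maintained by hand.

```python
def host_definition(record):
    """Render one Nagios host object that inherits from a template."""
    directives = [
        ("use", record["template"]),      # template supplying inherited settings
        ("host_name", record["name"]),
        ("address", record["address"]),
    ]
    # Pad directive names so the values line up in a column.
    width = max(len(key) for key, _ in directives)
    body = "\n".join(
        f"    {key.ljust(width)}  {value}" for key, value in directives
    )
    return f"define host {{\n{body}\n}}\n"


def render_hosts(records):
    """Render all hosts, sorted by name so regenerated files diff cleanly."""
    ordered = sorted(records, key=lambda record: record["name"])
    return "\n".join(host_definition(record) for record in ordered)
```

Calling `render_hosts` on a list such as `[{"name": "web02", "address": "10.0.0.12", "template": "web-server"}]` yields text that could be written to a file under the Nagios configuration directory; everything beyond those three directives comes from the template.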


  3. Passive monitoring is possible via something like a Script Plugin to monitor the output of some activity. The script would be executed on a regular frequency (1 min, 5 min, 24 hours, etc.) and the normal state would be “all good”, showing green in the HQ resource and Dashboard. And then if an exception occurs, state would become “all bad”, and the resource/dashboard would go red. And since HQ does a lot more than just up/down metrics, your script could also be tracking many other metrics of interest. Basically anything you can return in a name=value pair.

    We’ve used this same approach in many, many a Nagios->Hyperic migration.

    It’s not a push model, but it looks like it from the outside.
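For readers unfamiliar with the approach described in this comment, here is a minimal Python sketch of a polled script that reports its findings as name=value pairs, one per line. The metric names are hypothetical, and the exact output format a Hyperic Script Plugin expects is an assumption based only on the description above.

```python
def format_metrics(metrics):
    """Render a mapping of metric names to values as name=value lines."""
    lines = []
    for name, value in metrics.items():
        # A name containing '=' or a newline would make the output ambiguous.
        if "=" in name or "\n" in name:
            raise ValueError(f"invalid metric name: {name!r}")
        lines.append(f"{name}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metrics for a log-shipping job.
    print(format_metrics({"available": 1, "files_shipped": 42}), end="")
```

The monitoring agent would run such a script on its schedule and read the printed pairs, which is what makes this a polling model rather than a push.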


    • Passive monitoring is possible via something like a Script Plugin to monitor the output of some activity. The script would be executed on a regular frequency (1 min, 5 min, 24 hours, etc)[...]

      In the context of the post, passive fault monitoring is defined by two key characteristics:

      • it happens asynchronously
      • it is intrinsic to the component being monitored

      There are, of course, exceptions. “Keep-alives” (regular status OK reports) must be somewhat synchronous so that if something goes wrong and no OK reports take place, the monitoring server can flag it as a problem. Furthermore, there are components (such as appliances) for which it is simply impossible to have [usable] monitoring built-in (some SNMP implementations are rather broken), in which case, a proxy is necessary (asynchronicity is usually lost as well in this case). Finally, there are cases where what we want to monitor for is external behavior (for instance, where multiple components are involved in a transaction), for which active monitoring is the right tool.
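The keep-alive case can be sketched on the receiving side as a small staleness tracker: the server records when each component last reported and flags any component that has been silent for longer than a threshold. This is an illustrative Python sketch with hypothetical component names, not how any particular monitoring server implements it (Nagios covers the same need with its freshness checking).

```python
class FreshnessTracker:
    """Track the last report time per component and flag silent ones."""

    def __init__(self, threshold_seconds):
        self.threshold = threshold_seconds
        self.last_seen = {}  # component name -> time of most recent report

    def report(self, component, now):
        """Record a keep-alive (or any status report) from a component."""
        self.last_seen[component] = now

    def stale(self, now):
        """Return, sorted, the components silent for longer than the threshold."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.threshold
        )
```

With a 300-second threshold, a component that last reported at time 1000 is still considered fresh at 1250 and stale at 1400.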

      What is, I think, a significant gain in the passive model is that I can save myself the step of watching for, parsing and interpreting output by having the monitoring intelligence built right in, taking place in real-time, and with a richer environment from which to draw conclusions as to the state of the process and the reasons behind said state, because the monitoring is taking place in the runtime environment. My preference, particularly for home-grown software, is to push as much of the monitoring smarts for a given component to the component itself (and, by extension, to the engineers writing said component) because it is the One True Source. None of this is to say that passive monitoring is the One and Only True Way.

      In the realm of operations we write a fair number of tools to aid in running the environment. A lot of these tools perform their work in the background, for instance, moving logs back and forth or pushing changes to systems. If I control those tools, I will definitely make them use passive fault monitoring, and only pester me when necessary. No digestion of output needed. For engineering, where they write large amounts of code to make up the product or service being delivered, passive monitoring is likely the right answer. Take some application server performing database queries to build web pages being consumed by an end user. The best place to know that the running average database query time has slowed by 20% in the last five minutes is the application server itself. Again, no output to grok, and better yet, no polling schedules where swings in the metrics can return to normal.
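The application-server example above can be sketched as a small in-process monitor. This Python sketch keeps a five-minute sliding window of query durations and compares the running average against a baseline, returning a Nagios-style return code and message. The baseline, the 20% tolerance, and the class name are assumptions chosen to mirror the example; a real service would also need to decide how the baseline is established and how the result is shipped to the monitoring server.

```python
from collections import deque

OK, WARNING, UNKNOWN = 0, 1, 3  # Nagios-style return codes


class QueryTimeMonitor:
    """Sliding-window average of query durations, checked in-process."""

    def __init__(self, baseline, window_seconds=300, tolerance=0.20):
        self.baseline = baseline      # expected average duration
        self.window = window_seconds  # how far back the average looks
        self.tolerance = tolerance    # allowed slowdown (0.20 = 20%)
        self.samples = deque()        # (timestamp, duration), oldest first

    def record(self, duration, now):
        """Add a sample and evict any that fell out of the window."""
        self.samples.append((now, duration))
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()

    def average(self):
        if not self.samples:
            return None
        return sum(d for _, d in self.samples) / len(self.samples)

    def check(self):
        """Return (return_code, message) for the current window."""
        avg = self.average()
        if avg is None:
            return UNKNOWN, "no query samples in window"
        if avg > self.baseline * (1 + self.tolerance):
            return WARNING, f"average query time {avg:.1f} exceeds baseline {self.baseline}"
        return OK, f"average query time {avg:.1f}"
```

The application would call `record` after each query and report the result of `check` only when the return code changes, which avoids both output parsing and polling schedules.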

      It is, of course, possible to write these stats somewhere and have the Script Plugin read them. But that’s an extra step I find no need to take. This, of course, requires bringing a lot of moving parts onto the same wavelength. A solid definition for monitoring thresholds, procedures, responses and specifications must be in place, well documented and publicized to the appropriate parties. But having this done, contracts can be enacted so that those that are writing code know how they can raise awareness about something going amiss. These definitions are mostly tool agnostic, but having them allows for levels of automation that are impossible otherwise.

      I still believe that the summary of the post is true: Hyperic is not yet a full Nagios replacement as far as passive monitoring is concerned. I don’t simply want it to look like a push model from the outside (fakeable by increasing frequency): I want it to be a push model because I find said model (and I want to stress that I am referring to fault monitoring) to be more accurate, resource friendly and easier to manage.


      • I think what you are saying is similar to an SNMP trap, which Zenoss and Hyperic both appear capable of handling (so is Nagios, but of course Nagios also does NSCA [http://nagios.sourceforge.net/docs/3_0/addons.html]).

        For instance, you don’t want to poll every N minutes to see if a disk has failed, rather you have an SNMP trap setup to alert you when the disk has failed.


      • Hi Nick,

        Yep, pretty much. Indeed, when I talk about passive checks in Nagios, I am referring to NSCA. The two key differences are that SNMP is complex to understand out of the box (and requires a lot of wiring to generate MIBs and such) and that with NSCA you also get “I’m OK” checkpoints (so you know that reporting is taking place).


  4. Gerirgaudi,

    Thanks for the clarification, I get your points now.


