In recent months, we (“we” as in Ning) have started to open up some of our code to the community at large. There is quite a bit of useful stuff in there (23 public repositories and counting), courtesy of powerhouses like brianm, davidsklar, or tomdz (to name a few). We have also started sharing code from the Operations side of the house, in hopes that it is useful to other Operations shops out in the ether.
Our first entry in this regard, at pierre's suggestion, is a Nagios plugin for Tableau servers, check_tableau_systeminfo, which is currently a little rough around the edges but quite usable. There is a new version right around the corner with some polish applied to said edges, and we are preparing a host of other tools for release that we currently use every day in our production environment.
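For the curious, wiring a plugin like this into Nagios typically looks something like the sketch below. Note that the command-line flags, paths, and host names here are assumptions for illustration, not the plugin's actual interface; check the plugin's own documentation for the real arguments.

```
# Hypothetical command definition -- consult the plugin's --help for real flags.
define command {
    command_name    check_tableau_systeminfo
    command_line    $USER1$/check_tableau_systeminfo -H $HOSTADDRESS$
}

# Attach it to a (hypothetical) Tableau host.
define service {
    use                     generic-service
    host_name               tableau01
    service_description     Tableau System Info
    check_command           check_tableau_systeminfo
}
```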
Our production environment is currently comprised of 2000+ nodes, which makes it a relatively large environment that provides a fair amount of interesting operational problems to solve. And problem solvers are something we are actively looking for, i.e., we are hiring! Ning is the largest platform in the world for creating custom social networks, currently hosting 70,000 paid subscribers (up from 15,000 before we transitioned from the prior “freemium” model), and serving over 80 million unique visitors monthly. This makes us one of the top 100 sites in the US, and, according to CNBC.com, one cool company to work for (indeed!).
I found out recently that Nagios was forked into Icinga. It looks interesting, and the new web interface is heading over to sexyland fast. I will take it out for a spin soon and see how it handles our current configuration (which relies heavily on object inheritance). The team at Icinga has already built a fair number of improvements for Nagios proper. It may be the fastest path to a more usable Nagios install for shops heavily invested in Nagios.
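To give a flavor of the object inheritance mentioned above, here is a minimal sketch: a host template plus a host that inherits from it. The template and host names are illustrative, not from our actual configuration.

```
# Template: register 0 means this is never treated as a real host.
define host {
    name                    prod-host
    register                0
    check_command           check-host-alive
    max_check_attempts      3
    notification_period     24x7
}

# Real host: inherits every directive above via "use".
define host {
    use                     prod-host
    host_name               web01
    address                 10.0.0.11
}
```

With a few thousand nodes, this is what keeps the configuration manageable: fleet-wide changes happen in one template rather than in every host definition.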
Although I am not a full-time developer, I do deal with a fair amount of information that benefits from the joys of versioning (including software I write). A lot of that work today happens in an IDE (Xcode), but a lot of it happens in other contexts, so having a pretty (and above all, useful) tool to navigate a repository comes in handy. I have been using Versions for the Mac since it was beta, and it has proven to be a worthy helper. Recently, I ran into Cornerstone, and I decided to try it out of curiosity. Both are solid apps, and either one will service your SVN needs nicely. Hopefully Git will get one in the near future.
Go watch: Reconnoiter: a whirlwind tour. Any piece of software with a tagline like “another product built from pain” deserves your attention.
defaults write com.apple.NetworkBrowser EnableODiskBrowsing -bool true
defaults write com.apple.NetworkBrowser ODSSupported -bool true
Context: we have an older MacBook with what appears to be a busted DVD-ROM drive, preventing a clean install of the newly arrived cat. After enabling CD/DVD Sharing on the “server”, nothing would show up on the “client”. After tweaking the above properties and restarting the Finder, it’s all good now. Install in progress.
These days, among the many projects I’m shuffling, monitoring is taking some rather significant brain cycles. In any operational environment, monitoring is the closest link between humans and machines, primarily because it is one of the main channels machines use to talk to humans (terminal sessions notwithstanding).
In prior lives, Nagios has been one of the primary fault monitoring systems in use. It has a relatively long history as the monitoring workhorse in a large number of production environments, and with reason: high quality, open source, and free has meant easy adoption, especially in cash-conscious startups. It compiles on essentially all contemporary Unix-based platforms, and there is a significant knowledge base on and off the Intertron. It gets the job done. Add some good design to its configuration files and its management tools, and it’s top-notch. Minus, of course, its archaic web-based GUI. Details, details, details.

Hyperic HQ is one of the monitoring tools I have been involved with most recently. Installation is relatively painless on both the server and the client (and admittedly faster than getting Nagios up and running), it is pretty, sporting a modern, somewhat configurable interface, and offering features like autodiscovery out of the box.
A notable section in the Hyperic documentation is related to Nagios (Nagios Data), and the Hyperic HQ Tour also offers some comparative notes (PDF, page 5: “Hyperic HQ Compared to Nagios”). This is no coincidence: Nagios boasts a large installed base, and that makes it a tasty target to aim for (a search for “Zenoss”, for instance, yields no results). The comparison is, by all accounts, fair. There is nothing misleading or dishonest in it, and Hyperic HQ is indeed superior to Nagios in many respects (particularly in terms of presentation). There are some points, however, where I am not quite synched up with the Hyperic story, but these are generally small (but by no means insignificant) details. For instance, I’m quite happy to have configuration data stored in plain text files when its complexity does not merit an RDBMS.
The one significant sticky point (which is not mentioned in the tour but is prominently highlighted in the documentation) is the apparent lack of any passive-like monitoring in Hyperic (no NSCA-like support). Personally, I am a big fan of passive monitoring: components perform self-monitoring duties and report back to central command as necessary. I have used it quite successfully to monitor scripts that run periodically and are unpokeable (think all those wonderful tools you’ve written that ship logs back and forth), NetApp filers (by proxy), and almost any kind of custom software. The Nagios documentation does not do true justice to passive monitoring: it is mostly described as a way to build distributed monitoring or reach components behind firewalls. And yet, it is an incredibly elegant methodology to monitor components, and to do so from within the components themselves, asynchronously; by definition, it is highly distributed, making each component (or proxies on their behalf) responsible for its own health, and leaving Nagios as a great alert, escalation, and notification manager. Add powerful inheritance and the resulting monitoring setup is scalable, efficient, and straightforward to automate. It is still (no point in hiding it) ugly, but hard to beat.
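To make the passive-monitoring idea concrete, here is a minimal sketch of a component reporting its own health by writing a check result to the Nagios external command file. The command file path, host name, and service name are assumptions for illustration; the line format itself (PROCESS_SERVICE_CHECK_RESULT) is the one Nagios expects.

```shell
#!/bin/sh
# Sketch: submit a passive service check result to Nagios by appending
# to its external command file. The default path varies by install.
submit_passive_result() {
  cmdfile=$1 host=$2 service=$3 code=$4 output=$5
  # Nagios format: [epoch] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;plugin_output
  printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
    "$(date +%s)" "$host" "$service" "$code" "$output" > "$cmdfile"
}

# Example (hypothetical host, service, and path):
# submit_passive_result /usr/local/nagios/var/rw/nagios.cmd web01 log_shipper 0 "OK - logs shipped"
```

A periodic script calls something like this as its last step; if results stop arriving, Nagios freshness checking can raise the alarm.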
While Hyperic HQ has a home for performance monitoring, my money is still on Nagios for fault monitoring. I will be learning a lot more about Hyperic HQ over the next few weeks, so I will likely have more details to report in future posts (I am seriously exploring what it would take to feed passive test results to Hyperic so they can be available in the Hyperic dashboard, for instance). Stay tuned.
Today I picked up a cheap Compaq Presario (yep, C-o-m-p-a-q) with an Athlon X2, 4GB of RAM and a 320GB SATA drive for under $300. I have some VMs and other cruft to run and I’d rather not do it on the Mac Pro anymore. This will do fine, especially as a playground of sorts. I wanted to put OpenSolaris 2009.06 on it, which took less than 20 minutes, including the creation of a ZFS root mirror. Another 10 minutes to flip it into a Dom0 and I’m ready to go.
Very sweet indeed.