In recent months, we (“we” as in Ning) have started to open up some of our code to the community at large. There is quite a bit of useful stuff in there (23 public repositories and counting), courtesy of powerhouses like brianm, davidsklar, or tomdz (to name a few). We have also started sharing code from the Operations side of the house, in the hope that it is useful to other Operations shops out in the ether.
Our first entry in this regard, at pierre‘s suggestion, is a Nagios plugin for Tableau servers, check_tableau_systeminfo, which is currently a little rough around the edges but quite usable. There is a new version right around the corner with some polish applied to said edges, and we are preparing a host of other tools for release that we currently use every day in our production environment.
Our production environment currently comprises 2000+ nodes, which makes it a relatively large environment with a fair share of interesting operational problems to solve. And problem solvers are exactly what we are actively looking for, i.e., we are hiring! Ning is the largest platform in the world for creating custom social networks, currently hosting 70,000 paid subscribers (up from 15,000 before we transitioned from the prior “freemium” model) and serving over 80 million unique visitors monthly. This makes us one of the top 100 sites in the US and, according to CNBC.com, one cool company to work for (indeed!).
I found out recently that Nagios was forked into Icinga. It looks interesting, and the new web interface is heading over to sexyland fast. I will take it out for a spin soon and see how it handles our current configuration (which relies heavily on object inheritance). The Icinga team has already built a fair number of improvements on top of Nagios proper, so it may be the fastest path to a more usable Nagios install for shops heavily invested in Nagios.
Although I am not a full-time developer, I do deal with a fair amount of information that benefits from the joys of versioning (including software I write). A lot of that work today happens in an IDE (Xcode), but a lot of it happens in other contexts, so a pretty (and, above all, useful) tool to navigate a repository comes in quite handy. I have been using Versions for the Mac since it was in beta, and it has proven to be a worthy helper. Recently I ran into Cornerstone and decided to try it out of curiosity. Both are solid apps, and either one will serve your SVN needs nicely. Hopefully Git will get one in the near future.
Go watch: Reconnoiter: a whirlwind tour. Any piece of software with a tagline like “another product built from pain” deserves your attention:
defaults write com.apple.NetworkBrowser EnableODiskBrowsing -bool true
defaults write com.apple.NetworkBrowser ODSSupported -bool true
Context: we have an older MacBook with what appears to be a busted DVD-ROM drive, preventing a clean install of the newly arrived cat. After enabling CD/DVD Sharing on the “server”, nothing would show up on the “client”. After tweaking the above properties and restarting the Finder, it’s all good now. Install in progress.
These days, among the many projects I’m shuffling, monitoring is taking some rather significant brain cycles. In any operational environment, monitoring is the closest link between humans and machines, primarily because it is one of the main channels machines use to talk to humans (terminal sessions notwithstanding).
In prior lives, Nagios has been one of the primary fault monitoring systems in use. It has a relatively long history as the monitoring workhorse in a large number of production environments, and with reason: high quality, open source, and free has meant easy adoption, especially in cash-conscious startups. It compiles on essentially all contemporary Unix-based platforms, and there is a significant knowledge base on and off the Intertron. It gets the job done. Add some good design to its configuration files and its management tools, and it’s top-notch. Minus, of course, its archaic web-based GUI. Details, details, details.

Hyperic HQ is one of the monitoring tools I have been involved with most recently. Installation is relatively painless on both the server and the client (and admittedly faster than getting Nagios up and running), it is pretty, sporting a modern, somewhat configurable interface, and it offers features like autodiscovery out of the box.
A notable section in the Hyperic documentation relates to Nagios (Nagios Data), and the Hyperic HQ Tour also offers some comparative notes (PDF, page 5: “Hyperic HQ Compared to Nagios”). This is no coincidence: Nagios boasts a large installed base, and that makes it a tasty target to aim for (a search for “Zenoss”, for instance, yields no results). The comparison is, by all accounts, fair. There is nothing misleading or dishonest in it, and Hyperic HQ is indeed superior to Nagios in many respects (particularly in terms of presentation). There are some points, however, where I am not quite synced up with the Hyperic story, though these are generally small (but by no means insignificant) details. For instance, I’m quite happy to have configuration data stored in plain text files when its complexity does not merit an RDBMS.
The one significant sticking point (not mentioned in the tour, but prominently highlighted in the documentation) is the apparent lack of any passive-style monitoring in Hyperic (no NSCA-like support). Personally, I am a big fan of passive monitoring: components perform self-monitoring duties and report back to central command as necessary. I have used it quite successfully to monitor scripts that run periodically and are unpokeable (think all those wonderful tools you’ve written that ship logs back and forth), NetApp filers (by proxy), and almost any kind of custom software. The Nagios documentation does not do true justice to passive monitoring: it is mostly described as a way to build distributed monitoring or to reach components behind firewalls. And yet it is an incredibly elegant way to monitor components asynchronously, from within the components themselves; by definition it is highly distributed, making each component (or a proxy on its behalf) responsible for its own health and leaving Nagios to be a great alert, escalation, and notification manager. Add powerful inheritance, and the resulting monitoring setup is scalable, efficient, and straightforward to automate. It is still (no point in hiding it) ugly, but hard to beat.
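Mechanically, there is not much to a passive result: it is a timestamped line handed to Nagios’s external command interface. A minimal sketch (the host name, service name, and plugin output below are made up, and the command-file path varies by install):

```shell
# Build a passive service check result in Nagios's external command format:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return_code>;<plugin_output>
now=$(date +%s)
result="[$now] PROCESS_SERVICE_CHECK_RESULT;web01;log_shipper;0;OK - logs shipped"
echo "$result"
# In production you would write it to the command file on the Nagios host
# (or hand it to send_nsca for remote submission), e.g.:
#   echo "$result" >> /usr/local/nagios/var/rw/nagios.cmd
```

A component (or a cron job wrapping it) emits one of these at the end of each run, and Nagios takes care of freshness checking, escalation, and notification from there.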
While Hyperic HQ has a home for performance monitoring, my money is still on Nagios for fault monitoring. I will be learning a lot more about Hyperic HQ over the next few weeks, so I will likely have more details to report in future posts (I am seriously exploring what it would take to feed passive test results to Hyperic so they can be available in the Hyperic dashboard, for instance). Stay tuned.
Today I picked up a cheap Compaq Presario (yep, C-o-m-p-a-q) with an Athlon X2, 4GB of RAM, and a 320GB SATA drive for under $300. I have some VMs and other cruft to run, and I’d rather not do it on the Mac Pro anymore. This will do fine, especially as a playground of sorts. Putting OpenSolaris 2009.06 on it took less than 20 minutes, including the creation of a ZFS root mirror. Another 10 minutes to flip it into a Dom0 and I’m ready to go.
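For the curious, the root mirror step amounts to a couple of commands (device names here are hypothetical; substitute your own):

```shell
# Attach a second disk to the root pool, turning it into a mirror:
pfexec zpool attach rpool c7d0s0 c8d0s0
# Watch the resilver complete:
zpool status rpool
# Make the second disk bootable (x86):
pfexec installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c8d0s0
```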
Very sweet indeed.
In what now seems like a long time ago, when I started running systems, a premium was placed on both knowing and understanding what happens under the covers, at least at a conceptual level. Is that no longer the case? Or does it just feel that way sometimes? Maybe it has always been like this (not likely) and I simply had not noticed for whatever reason. Or perhaps I was (this much is true, and definitely so today) lucky enough to work alongside rather bright individuals who both knew and understood, and so it seemed the natural state of things.
While I don’t expect people to be able to slice and dice schedulers on a whim (even though I have met a couple who probably can), I do expect them to have a basic understanding of such matters if they’re supposed to be running systems. Soft links and hard links. Forking and threading. Concepts of that nature. It’s certainly good whenever someone is able to install Linux or Solaris, but given the quality of today’s installers, it’s far from impressive.

I ran into this a short while ago during a conversation with an individual with a fairly long sysadmin resume (and some neteng experience for good measure). At some point, we started talking about how we would implement poor man’s remote snapshots. There are, I suspect, commercial solutions available to do just this sort of thing, but if you happen to be working in a startup where cash conservation is a mantra, cheap is good. To see if anyone has already invented that pretty wheel you need, go to Google and search for snapshot and rsync; quite a few hits come back, and at least one of them is widely considered to be the source of most of the implementations available. The trivial part of the problem is efficient network utilization (rsync essentially gives you that for free). The slightly trickier bit is optimizing disk usage, and that’s where hard links come in handy (understanding them is critical to even get this off the ground). Even trickier is optimizing run times once the data sets grow large enough (or, in particular, contain large numbers of files). Circular rings, anyone?
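The disk-usage half of the trick is worth spelling out: unchanged files in consecutive snapshots are hard links to the same inode, so every snapshot looks like a full copy while only changed files consume new space. A toy demonstration of the mechanism (paths are throwaway):

```shell
# Two "snapshots" of the same unchanged file: the second is a hard link,
# so it costs a directory entry, not a second copy of the data.
tmp=$(mktemp -d)
mkdir -p "$tmp/snap.1" "$tmp/snap.2"
echo "unchanged data" > "$tmp/snap.1/file"
ln "$tmp/snap.1/file" "$tmp/snap.2/file"
# Both paths resolve to the same inode:
if [ "$tmp/snap.1/file" -ef "$tmp/snap.2/file" ]; then
  echo "same inode"
fi
```

In the real scheme, something like `rsync -a --link-dest=/backups/snap.1 remote:/data/ /backups/snap.2/` does this for you: unchanged files in snap.2 become hard links into snap.1, changed files are transferred fresh, and rotation then just prunes the oldest snapshot.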
But the details aren’t really relevant. What is relevant is that the definition of basic sysadmin skills (and by basic I’m referring to things beyond adding accounts and listing the processes running on a box) seems to have been relaxed quite a bit. This may be due to the environment we lived in during the go-go years of the net, when breathing and being able to type a few shell commands on a keyboard almost certainly guaranteed a job, or to the ease of use of the newer Linux distros (no more fudging around with X Window configuration files), which make it easy to claim success in getting a Linux system up and running.
At times, this is a bit frustrating. Can I get that half an hour back, please? I remember a time when I could bring up Uresh Vahalia’s Unix Internals and get a good conversation started. That’s not so common anymore. In that conversation I mentioned earlier, the name barely (didn’t, really) rang a bell. I suppose it’s a new reality and I have to get on with it. Nevertheless, curiosity strikes, so I’m going to close this post with the title of the book I’m going to pick up in the next few minutes for some late night reading: The Design and Implementation of the FreeBSD Operating System (I still have the 4.4 version; sysadmin or not, I do know what nostalgia feels like).