Archive for the ‘ops’ Category

Zettabee and Theia

October 21, 2011

It’s hard to believe it has been almost a year since we started the process of open sourcing tools, but it has indeed been that long, and it picked up steam a few weeks ago, when we pushed out nddtune, which is admittedly a very simple tool. Today we’re continuing that effort with a couple of more significant tools: Zettabee and Theia.

A Little History

About four years ago, we had a very real need to have fairly detailed performance metrics for NetApp filers. At the time, the available solutions relied on SNMP (NetApp’s SNMP support has historically been weak) or were NetApp’s own, which, aside from being expensive, were hard to integrate with the rest of our monitoring infrastructure (which is composed of Nagios and Zenoss). As such, we set out to write a tool that would both perform detailed filer monitoring (for faults and performance) and be able to interface with those systems. Theia was born.

In more recent times, as we were looking at beefing up our DR strategy, we found ourselves needing a good ZFS-based replication tool, and set out to write Zettabee, which gave us an opportunity to dive deeper into ZFS capabilities.

Let the Games Begin

Today we’re very excited to be releasing those two tools into the open. Theia has been in production for the last four years, dutifully keeping an eye on our filers, while Zettabee has been pushing bits long-distance for well over nine months. We are working on putting together a roadmap for future work, but are happy to have them out in the open for further collaboration. Tim has written a good post on some of the work he has done to make this happen, and I am grateful for his help on this endeavor.

Categories: python, ruby, tools

NRPE and Solaris SMF

September 7, 2011

NRPE running under Solaris SMF control.

The SMF manifest:

<?xml version='1.0'?>
<!DOCTYPE service_bundle SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<service_bundle type='manifest' name='export'>
  <service name='application/monitoring/nrpe' type='service' version='0'>
    <single_instance/>
    <dependency name='fs-local' grouping='require_all' restart_on='none' type='service'>
      <service_fmri value='svc:/system/filesystem/local'/>
    </dependency>
    <dependency name='network-service' grouping='require_all' restart_on='none' type='service'>
      <service_fmri value='svc:/network/service'/>
    </dependency>
    <dependency name='name-service' grouping='require_all' restart_on='none' type='service'>
      <service_fmri value='svc:/milestone/name-services'/>
    </dependency>
    <instance name='default' enabled='true'>
      <dependency name='config-file' grouping='require_all' restart_on='refresh' type='path'>
        <service_fmri value='file://localhost/usr/local/etc/nrpe/nrpe.cfg'/>
      </dependency>
      <exec_method name='start' type='method' exec='/local/lib/svc/method/nrpectl start' timeout_seconds='30'>
        <method_context working_directory='/var/tmp'>
          <method_credential user='nrpe' group='nrpe' privileges='basic,sys_resource,!proc_info,!file_link_any' limit_privileges='basic,sys_resource,!proc_info,!file_link_any'/>
        </method_context>
      </exec_method>
      <exec_method name='stop' type='method' exec=':kill' timeout_seconds='60'/>
      <exec_method name='refresh' type='method' exec='/local/lib/svc/method/nrpectl refresh' timeout_seconds='60'/>
      <property_group name='nrpectl' type='application'>
        <propval name='NRPE_CFG' type='astring' value='/usr/local/etc/nrpe/nrpe.cfg'/>
        <propval name='NRPE_FQB' type='astring' value='/usr/local/sbin/nrpe'/>
      </property_group>
    </instance>
    <template>
      <common_name>
        <loctext xml:lang='C'>NRPE</loctext>
      </common_name>
      <documentation>
        <doc_link name='nagios.org' uri='http://nagios.sourceforge.net/docs/nrpe/NRPE.pdf'/>
      </documentation>
    </template>
  </service>
</service_bundle>

The associated method:

#!/bin/sh

. /lib/svc/share/smf_include.sh

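# Read the daemon path and config location from the service's property group
NRPE_FQB=`svcprop -p nrpectl/NRPE_FQB $SMF_FMRI`
NRPE_BIN=`basename "$NRPE_FQB"`
NRPE_CFG=`svcprop -p nrpectl/NRPE_CFG $SMF_FMRI`

# PIDs of any nrpe processes already running (empty if none)
pid=`pgrep -x -d " " "$NRPE_BIN"`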

case $1 in
   'start')   if [ -z "$pid" ]; then
                 smf_clear_env
                 $NRPE_FQB -c $NRPE_CFG -d >&2
                 if pgrep -x -d " " $NRPE_BIN >/dev/null 2>&1; then
                    :
                 else
                    echo "NRPE failed to start" >&2
                    exit $SMF_EXIT_ERR_FATAL
                 fi
              else
                 echo "NRPE already running (pid=$pid)" >&2
                 exit $SMF_EXIT_ERR_OTHER
              fi
              ;;
   'refresh') if [ -z "$pid" ]; then
                 echo "NRPE not running; nothing to refresh" >&2
                 exit $SMF_EXIT_ERR_OTHER
              else
                 pkill -x $NRPE_BIN
              fi
              ;;
esac
exit $SMF_EXIT_OK

Season to taste.
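
For completeness, here is a rough sketch of how a manifest like the one above could be registered with SMF. The manifest location (`/var/svc/manifest/site/nrpe.xml`), the `run` helper and the `DRY_RUN` switch are assumptions made for illustration and are not part of the setup described above; the FMRI is derived from the service and instance names in the manifest. By default the function only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical location for the manifest shown above; adjust to your layout.
MANIFEST=${MANIFEST:-/var/svc/manifest/site/nrpe.xml}
# FMRI built from the manifest's service name and its 'default' instance.
FMRI=svc:/application/monitoring/nrpe:default

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Validate and import the manifest, then enable and inspect the service.
register_nrpe() {
    run svccfg validate "$MANIFEST" &&
    run svccfg import "$MANIFEST" &&
    run svcadm enable "$FMRI" &&
    run svcs -l "$FMRI"
}
```

Calling `register_nrpe` as is lists the four commands; setting `DRY_RUN=0` (with sufficient privileges) would actually run them. Since the instance is declared with enabled='true', the `svcadm enable` step is mostly a safety net.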

Categories: solaris, sysadmin

TimeMachine and Logged Out Users

January 5, 2011

With the deployment of the MacMini3,1 as an important box, I wanted to have timely backups and easy recovery, and that is one thing Snow Leopard does rather well with TimeMachine. Attach a disk, configure as a TimeMachine destination, and done, right? Not exactly: I noticed that TimeMachine was only backing up the system if there was a user logged in, something that’s rather rare on this box (in fact, there is generally no display or keyboard attached to it).

It turns out that this is normal behavior, as the system unmounts all external volumes when a user logs out, including TimeMachine volumes (this does not apply to network volumes, just volumes physically attached to the system). There are some edge cases that somewhat affect this behavior (such as when FileVault is in use), but it can be completely disabled:

defaults write /Library/Preferences/SystemConfiguration/autodiskmount \
     AutomountDisksWithoutUserLogin -bool true

I went ahead and rebooted the system. TimeMachine now works even when users are not logged in.

Categories: macos, sysadmin

Macmini3,1 and PowerBook5,8

January 4, 2011

A few months ago the aging Early 2009 Mac Mini in the living room was replaced with a 2010 model. The old one was having a hard time keeping up with HD content (mainly in terms of performance, but it also flat out refused to display iTunes HD content after the upgrade to Snow Leopard), and the over-scanning issues on the 1080p display connected through the DVI-to-HDMI adapter were rather tiresome. The 2010 model did away with all that: a faster CPU, more memory and native HDMI took care of those issues. That left a perfectly functional Macmini3,1 searching for a mission in life, a mission I had found even before I pulled the trigger on the new model.

A small server in the office that I use to store backup copies of precious data away from my main desktop system, such as music and photos, is also the authoritative repository of software that gets pushed to all the other systems I use or care for. Additionally, it runs a small mail setup (mx + imap) for two personal domains and other bits of useful software, such as a personal wiki. It has been working flawlessly for quite some time, but I have been wanting to reduce the office’s power footprint, especially while I travel, which was challenging given the system needed to be up all the time.

Thus, the mission is defined: the Mac Mini needs to take over the services that run continuously so the other system can be powered off at will.

The migration is nearly complete: mail is flowing and the software repository is up to date. The wiki bits are still a work in progress, but those are not as critical, primarily because Evernote has largely replaced (and enhanced) the wiki use. None of this would have been possible without the MacPorts Project community, at least not as fast and seamlessly as it has been. So there is happiness in the living room and there is happiness in the office.

In other related news, the aging PowerBook5,8 is finally headed for retirement. It has been a good 5+ year run, but in the end, it was entirely too slow now that its last user had embraced digital photography and was using it quite heavily. I’m not sure what I will do with it: the recycling center should be its final destination, but there is an emotional link to that laptop that keeps me from doing it. It was the first laptop I bought at Ning (before we actually purchased Apple products at the office) and it has served us very well.

Categories: macos, sysadmin

Githubed!

November 3, 2010

In recent months, we (“we” as in Ning) have started to open up some of our code to the community at large. There is quite a bit of useful stuff in there (23 public repositories and counting), compliments of powerhouses like brianm, davidsklar or tomdz (to name a few). We have also started sharing code from the Operations side of the house, in hopes that it is useful to other Operations shops out in the ether.

Our first entry in this regard, at pierre’s suggestion, is a Nagios plugin for Tableau servers, check_tableau_systeminfo, which is currently a little rough around the edges but quite usable. There is a new version right around the corner with some polish applied to said edges, and we are preparing a host of other tools for release that we currently use every day in our production environment.

Our production environment is currently composed of 2000+ nodes, which makes it a relatively large environment that provides a fair number of interesting operational problems to solve. And problem solvers are something we are actively looking for, i.e., we are hiring! Ning is the largest platform in the world for creating custom social networks, currently hosting 70,000 paid subscribers (up from 15,000 before we transitioned from the prior “freemium” model), and serving over 80 million unique visitors monthly. This makes us one of the top 100 sites in the US, and, according to CNBC.com, one cool company to work for (indeed!).

Check out the openings, check out the code, and come play in our playground.

Nagios Forked: Icinga

March 18, 2010

I found out recently that Nagios was forked into Icinga. It looks interesting, and the new web interface is heading over to sexyland fast. I will take it out for a spin soon and see how it handles our current configuration (which relies heavily on object inheritance). The team at Icinga has already built a fair number of improvements over Nagios proper. For shops heavily invested in Nagios, it may be the fastest path to nirvana, i.e., a more usable Nagios install.

Categories: sysadmin

Mac SVN clients: Versions vs Cornerstone

January 20, 2010

Although I am not a full-time developer, I do deal with a fair amount of information that benefits from the joys of versioning (including software I write). A lot of that work today happens in an IDE (Xcode), but a lot of it happens in other contexts, so having a pretty (and above all, useful) tool to navigate a repository is quite handy. I have been using Versions for the Mac since it was in beta, and it has proven to be a worthy helper. Recently, I ran into Cornerstone, and I decided to try it out of curiosity. Both are solid apps, and either one will serve your SVN needs nicely. Hopefully Git will get one in the near future.

Categories: computing, macos, sysadmin

Reconnoiter

November 17, 2009

Go watch: Reconnoiter: a whirlwind tour. Any piece of software with a tagline like another product build from pain deserves your attention:

Reconnoiter

Categories: miniblog, sysadmin

Enabling Remote Disc on non-MacBook Air clients

November 11, 2009

Courtesy of bstreiff: Enabling Remote Disc on not-Airs:

defaults write com.apple.NetworkBrowser EnableODiskBrowsing -bool true
defaults write com.apple.NetworkBrowser ODSSupported -bool true

Context: we have an older MacBook with what appears to be a busted DVD-ROM drive, preventing a clean install of the newly arrived cat. After enabling CD/DVD Sharing on the “server”, nothing would show up on the “client”. After tweaking the above properties and restarting the Finder, it’s all good now. Install in progress.

Categories: macos, sysadmin

Hyperic is not (yet?) a Full Nagios Replacement

October 7, 2009 7 comments

These days, among the many projects I’m shuffling, monitoring is taking some rather significant brain cycles. In any operational environment, monitoring is the closest link between humans and machines, primarily because it is one of the channels machines use most often to talk to humans (terminal sessions notwithstanding).

In prior lives, Nagios has been one of the primary fault monitoring systems in use. It has a relatively long history as the monitoring workhorse in a large number of production environments, and with reason: high quality, open source, and free has meant easy adoption, especially in cash-conscious startups. It compiles on essentially all contemporary Unix-based platforms, and there is a significant knowledge base on and off the Intertron. It gets the job done. Add some good design to its configuration files and its management tools, and it’s top-notch. Minus, of course, its archaic web-based GUI. Details, details, details.

Hyperic HQ is one of the monitoring tools I have been involved with most recently. Installation is relatively painless on both the server and the client (and admittedly faster than getting Nagios up and running). It is pretty, sporting a modern, somewhat configurable interface, and it offers features like autodiscovery out of the box.

A notable section in the Hyperic documentation is related to Nagios (Nagios Data), and the Hyperic HQ Tour also offers some comparative notes (PDF, page 5: “Hyperic HQ Compared to Nagios”). This is no coincidence: Nagios boasts a large installed base, and that makes it a tasty target to aim for (a search for “Zenoss”, for instance, yields no results). The comparison is, by all accounts, fair. There is nothing misleading or dishonest in it, and Hyperic HQ is indeed superior to Nagios in many respects (particularly in terms of presentation). There are some points, however, where I am not quite synched up with the Hyperic story, but these are generally small (but by no means insignificant) details. For instance, I’m quite happy to have configuration data stored in plain text files when its complexity does not merit an RDBMS.

The one significant sticking point (which is not mentioned in the tour but is prominently highlighted in the documentation) is the apparent lack of any passive-style monitoring in Hyperic (no NSCA or NSCA-like support). Personally, I am a big fan of passive monitoring: components perform self-monitoring duties and report back to central command as necessary. I have used it quite successfully to monitor scripts that run periodically and are unpokeable (think all those wonderful tools you’ve written that ship logs back and forth), NetApp filers (by proxy), and almost any kind of custom software. The Nagios documentation does not do true justice to passive monitoring: it is mostly described as a way to build distributed monitoring or reach components behind firewalls. And yet, it is an incredibly elegant way to monitor components asynchronously from within the components themselves; by definition, it is highly distributed, making each component (or a proxy acting on its behalf) responsible for its own health, and leaving Nagios to act as a great alert, escalation and notification manager. Add powerful inheritance and the resulting monitoring setup is scalable, efficient, and straightforward to automate. It is still (no point in hiding it) ugly, but hard to beat.
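
To make the passive model a bit more concrete, here is a minimal sketch of how a periodic script might report its own status through NSCA. It relies on the tab-separated input that send_nsca reads on stdin (host, service description, return code, plugin output). The function names, the Nagios host (`nagios.example.com`) and the config path are hypothetical and only there for illustration; this is not lifted from any production setup.

```shell
#!/bin/sh
# Format one passive service check result the way send_nsca expects it
# on stdin: host<TAB>service<TAB>return code<TAB>plugin output.
passive_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Ship a result to the central Nagios server.
# Hypothetical values: nagios.example.com and the send_nsca.cfg path.
report() {
    passive_result "$1" "$2" "$3" "$4" |
        send_nsca -H nagios.example.com -c /usr/local/etc/send_nsca.cfg
}
```

A log-shipping job could then end with something like `report web01 log-shipper 0 "OK: shipped 42 files"` on success, or use return code 2 (CRITICAL) on failure, and Nagios would only need a matching passive service definition to handle alerting and escalation.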

While Hyperic HQ has a home for performance monitoring, my money is still on Nagios for fault monitoring. I will be learning a lot more about Hyperic HQ over the next few weeks, so I will likely have more details to report in future posts (I am seriously exploring what it would take to feed passive test results to Hyperic so they can be available in the Hyperic dashboard, for instance). Stay tuned.

Categories: sysadmin