Tweaking Nagios For Performance

April 19th, 2009 | Categories: Technology, Tutorials

The company I work for has about 3,000 servers that need to be monitored in our Dallas datacenter. For the past few years we’ve been using a fairly standard Nagios setup. If you don’t take the time to really learn Nagios and tweak the config files, it’ll run fairly well until you’re monitoring more than a few hundred servers. The reason Nagios slows down when checking 300+ servers is that it stores all state/check information in flat text files on the system’s hard drive. When you have only a few servers and services to check it’s not so bad, but the more you add, the more IOPS you generate. At 3,000 servers, disk IO is a huge bottleneck.
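
For reference, the flat files in question are controlled by a handful of nagios.cfg directives. The paths below are the Debian/Ubuntu defaults; yours may differ:

# nagios.cfg -- the directives behind the flat-file IO
# all host/service state is rewritten to this file every update interval
status_file=/var/cache/nagios3/status.dat
status_update_interval=10

# every check result is spooled as a small temp file in this directory
check_result_path=/var/lib/nagios3/spool/checkresults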

A lot of systems like this will be fairly responsive but show a really high load average; that’s IO wait. Fortunately, the guy who set up Nagios at our DC was smart enough to realize we had a massive issue with disk IO, so he had everything running off 4x 15k RPM SCSI drives in a hardware RAID 10. Unfortunately, even with that fairly substantial hardware, Nagios still took nearly 20 minutes to check every system in the datacenter. For a while this was considered acceptable, because we didn’t want to pay thousands of dollars for a commercial system, and this particular admin was convinced that Nagios was already running as fast as possible and that maybe the Nagios developers would speed things up in a later version.

The old Nagios system grabbed its information about what to monitor from a program we had called “Server Locator”, which would soon be replaced with a database in Microsoft’s SharePoint. So it fell to me to modify our existing Nagios system to grab its configuration information from SharePoint instead of Server Locator. I had just recently been promoted to Jr. Admin, so I hadn’t had a chance to look at Nagios and see how things were set up. I took one look at the system and decided it would be easier to set up a new one, on my terms. This meant that I could do things my way (hopefully that means the right way), and the old system could be kept ready in case the switchover didn’t go smoothly.

The first thing I looked into was a distributed Nagios setup, but after only a day of playing with it I ran into a huge problem. It was slow, really slow, and I had no idea why. I had 3 boxes set up; the main system was called mother.nag, and the others were named after the phases (sections of the datacenter) that they would be monitoring. Eventually I discovered that the problem was due to how Nagios communicated back to the main server. A daughter server (mother/daughter, get it? I know I’m clever) would perform a service check and then report the result back to the mother server. Even if the check itself took only a fraction of a second, the entire exchange always took at least 1 second. While this alone wasn’t a huge issue, it was some fairly significant overhead. The real problem was that while Nagios on the daughter server was communicating with the mother server, it did nothing else, not even run other service checks. This meant that if there were 300 servers in a phase (about average), it would take 5 minutes to scan the entire phase and report back.

Maybe I’m just weird, but to me this was unacceptable; I knew Nagios could do things faster. The servers showed almost no load or network traffic during normal operation, so there was no reason why one system shouldn’t be able to monitor everything in the datacenter. So I went back to the drawing board. Then it hit me: the system we had before only really had issues with disk IO, so what if I could just cut out that bottleneck? How fast would things go? So I moved all the files Nagios uses to a ramdisk, and then set up a quick cron job to sync them back to disk every five minutes:

# m h  dom mon dow   command
*/5 *   *   *   *    /usr/bin/rsync -a /dev/shm/ /var/nagios3/; rm /var/nagios3/lib/spool/checkresults/*; rm /var/nagios3/cache/nagios.tmp*; rm -rf /var/nagios3/nagios-config-*; chgrp www-data /dev/shm/lib/rw/nagios.cmd;
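
One thing to keep in mind with this approach: /dev/shm is wiped on reboot, so the on-disk snapshot has to be copied back into the ramdisk before Nagios starts. A minimal sketch of a boot-time restore, assuming the same paths as the crontab above (the script itself and the nagios user/group are my illustration, not something Nagios ships with):

#!/bin/sh
# restore-nagios-ramdisk.sh (hypothetical) -- run before the nagios3 init
# script at boot; the ramdisk is empty after a restart, so seed it from
# the last snapshot the cron job saved to disk
rsync -a /var/nagios3/ /dev/shm/
mkdir -p /dev/shm/lib/spool/checkresults
chown -R nagios:nagios /dev/shm/lib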

The server I had set up was running Ubuntu, which means it has a tmpfs ramdisk mounted at /dev/shm. By default the ramdisk is allowed to use up to half of the memory in the system, which was fine with me since I had 2GB of RAM and, in total, all the configuration and cache files for Nagios came out to a whopping 16MB.
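
If you want to sanity-check how much of the ramdisk you’re actually using, or cap its size so nothing else can eat half your RAM, tmpfs makes both easy (the 256m below is just an example figure):

# see how big the tmpfs is and how much Nagios actually uses
df -h /dev/shm
du -sh /dev/shm

# optionally cap the ramdisk at an example 256MB
mount -o remount,size=256m /dev/shm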

With Nagios now using the ramdisk, I decided to test it out by monitoring everything in the datacenter. Why start small? If I was right this would be really fast, and sometimes I just love being recklessly over-confident in my ideas. So I started up Nagios and watched it make a scan of the datacenter. Once it was done I was disappointed: it still took nearly 5 minutes to complete the scan. Even though that was a 300% speed increase over the original Nagios setup, my test box still showed virtually no load while running the service checks. I knew I could make it go faster, but how?

So I decided to thoroughly read the parts of the Nagios manual that I had glossed over, namely the section that deals with scheduling of host/service checks. I figured the issue was that the system just wasn’t scheduling enough checks to happen in parallel. This was backed up by the performance data, which showed that while my Check Execution Time was low (around 0.105 seconds on average), my Check Latency was much higher (I believe around 10 seconds on average). After much testing and tweaking I finally found the perfect settings. Please note that the system I am running Nagios on has two dual-core Xeon processors. This is important to keep in mind, because the settings below will cause Nagios to spawn hundreds of processes at the same time. I strongly recommend getting as many cores (real cores, not hyper-threading) into your monitoring system as you can, so that you can make the most of this setup.

service_inter_check_delay_method=0.01
service_interleave_factor=s
host_inter_check_delay_method=0.02
max_concurrent_checks=0
use_large_installation_tweaks=1

I set the Service Inter-Check Delay to a static 0.01 seconds, so that checks would be scheduled as fast as possible. Combined with Max Concurrent Checks set to unlimited, this means that Nagios spawns processes for service checks like it’s Zerg rushing. Luckily, the system’s 4 cores handle everything pretty well. We only monitor one or two services (by default) for each server, so I didn’t care too much about service interleaving; the “s” setting leaves it up to Nagios to determine how to interleave service checks. Using Large Installation Tweaks is pretty standard for large Nagios installs, so you should already have that set. With these settings, and everything in the /dev/shm ramdisk, Nagios can now monitor every system in our datacenter in about 50 seconds. Yes, that’s right, we can monitor the entire datacenter once a minute.
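
If you want to watch the same Check Latency and Check Execution Time numbers I was tuning against, the nagiostats utility that ships with Nagios reads them from the live daemon. The config path below is the Debian/Ubuntu default; adjust for your install:

# dump scheduling statistics from the running Nagios
nagiostats -c /etc/nagios3/nagios.cfg

# or pull out just the scheduling numbers while you tweak
nagiostats -c /etc/nagios3/nagios.cfg | grep -E 'Latency|Execution'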

This screen lists out servers that are down (white means it's down but acknowledged)

I decided that, in order to not overload customers’ servers with traffic, by default we’d run a service check every 3 minutes. If a problem is detected, the interval for that service/host drops down to 1 minute until the service/host is determined to be in a “hard down” state. The best part is that, because the old Nagios was so slow, we had purchased a commercial system (SiteScope) to monitor the core infrastructure. About a week after my Nagios setup went live, it detected that the webserver running our main website had gone down and sent alerts to our BlackBerrys. The commercial solution alerted us 4 minutes later. I was told that our CEO was extremely impressed with how quickly Nagios was able to detect and report the issue.
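
In Nagios object-config terms, that polling policy is just the check intervals on the service definitions, with max_check_attempts deciding when a problem becomes a “hard” state. A minimal sketch of such a template (the template name is made up; the values are the ones described above):

# hypothetical template implementing the 3-minute/1-minute policy
define service{
        name                    customer-check-template
        normal_check_interval   3   ; check every 3 minutes while OK
        retry_check_interval    1   ; drop to 1 minute once a problem shows up
        max_check_attempts      3   ; 3 failed checks in a row = hard down
        register                0   ; template only, pulled in via "use"
        }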

Good job, here is a lolcat

If you have any questions, feel free to use the comments section.

This is mah job

  1. Rabbit
    April 20th, 2009 at 16:21

    They should have given you a raise for THAT. But alas I suppose the migration was just as good a reason.

  2. June 18th, 2009 at 11:04

    I just started to implement a Nagios setup and it is great. There are a few things I still need to figure out. It would be cool if we could do an email or phone conversation. I would really like to hear more, and maybe you could even give me some tips on how I could improve my system here. Thanks

  3. Magic Nagios
    September 24th, 2009 at 00:18

    Pretty interesting post.. I am curious to know how many service checks there are in your setup. In my client’s setup we have around 4000 servers spread geographically across the world, and the total check count comes to 160000, an average of 40 checks per slave. We are currently using multiple master/slave setups: 14 masters having 6 slaves each.. We would like to reduce the number of masters to two. So my question is: can a Nagios instance handle 75000 passive check results per five minutes? I assume there will be a bottleneck at the nagios.cmd level when NSCA tries to write so rigorously at the same time. Will nagios.cmd being on a ramfs help in this case?
    Thanks again for your nice document. It is really encouraging to be with Nagios !!!!

  4. September 24th, 2009 at 16:07

    @Magic Nagios

    I have no idea if Nagios can handle that many passive check results, our setup only has about 3,000 right now. It might be possible. I strongly recommend putting any of the files that have heavy read/writes on a ram disk, as that is what allows our setup to run smoothly.

    It’s been a long time since I tried the distributed Nagios setup, so I can’t remember if I had nagios.cmd on a ram disk. I may not have, and that may have been why I saw such a slowdown with NSCA sending the passive check results.

  5. Brandon
    October 16th, 2009 at 16:44

    How did you get the custom web UI? I really want the last pic, the “Good Job” one.

  6. October 16th, 2009 at 18:17

    Brandon:

    How did you get the custom web UI? I really want the last pic, the “Good Job” one.

    I grabbed the Nagios Dashboard PHP script from Nagios Exchange and heavily modified it. You can find the original script here:

    http://exchange.nagios.org/directory/Addons/Frontends-(GUIs-and-CLIs)/Web-Interfaces/Nagios-Dashboard-%252D-PHP/details

  7. December 9th, 2009 at 07:12

    Thanks for sharing your experience, very valuable.

  8. February 26th, 2010 at 17:20

    Thanks for the tips. There seem to be a lot of people pushing parts of Nagios into shared memory, which is a great idea.

    For large distributed setups consider looking at DNX (Distributed Nagios Executor) and Merlin. We are in the process of evaluating both of these solutions and they look very promising.

  9. November 17th, 2010 at 09:49

    @AndrewL
    Could you share your version of nagios.php, or at least the cat picture? :)

  10. November 17th, 2010 at 12:23

    @pugnacity
    I’m sorry, but I can’t post the source for the modified dashboard script due to company policy. I was able to add the original lolcat picture to the post though.

  11. Steffan
    December 16th, 2010 at 10:50

    Hi Andrew,

    I like your post.. and I have some questions for you (hope that’s alright ;D)

    1. I am setting up Nagios at work right now, and I noticed that the CPU load (openSUSE server with 1x 2.33 GHz core) was pretty high (about 20-50%), and for now I only have 20 servers with about 5-6 services each (checking stuff via SNMP), but I need about 400 servers/switches/printers….. in there. So I wonder, how is it that your server is not using that much CPU with 3k servers? Are you not checking anything else but ping?

    2. I liked your nagios.php page a lot, and I made one almost identical. The only thing I need is to display some text (for you, the green text saying “Good job…..”) when there are no errors. Could you tell me how you made it do this? I can’t really figure it out (I can only get it to write the text for every check that’s OK) :)

    Thanks
    -Steffan

  12. December 17th, 2010 at 15:26

    @Steffan
    1. I’ve got Nagios running on a quad-core server and the load average peaks at about 3. Every server has at least 1 service check (usually HTTP).

    2. I just count() the variables like $critcount to see if they all equal zero.

  13. Steffan
    December 18th, 2010 at 06:55

    Hey Andrew, thanks for your answer!

    I gave my server another core, so it has 2x 2.33 GHz, but I think I will need to give it more when I get more servers/service checks. Anyway, I looked into my nagios.cfg file and found out that use_large_installation_tweaks was set to 0, so I enabled it and that helped a little..

    Thanks for the tip on the counter, I will try that Monday!

  14. Derek Brewer
    June 7th, 2012 at 09:31

    Thanks for the tips. I just made your tweaks on a Nagios server with 850 hosts and almost 12,000 services. It brought the average service latency from ~350 seconds to less than a second. I feel like I’m actually utilizing my server more now.