Debugging

  1. Learn how to Debug
  2. Fix the root cause
  3. Fix it once
 
Learn how to Debug

The ability to debug your operating system, and the software running on it, is a central part of managing that system. This all comes back to reading your manual, but it really is important that you understand how your operating system works from the ground up. This kind of knowledge doesn't come overnight, but with some research and time, you can get there.

Spend some time learning to use strace/truss, ldd, lsof, /proc/, ipcs, netstat, netcat and so on; any effort spent learning them will pay for itself tenfold. Reading RFCs, HOWTOs and FAQs will also help you get a grip on the protocols and applications you run. Ideally, you should be able to interact manually with any plaintext protocol using telnet.
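
To make the telnet point concrete, here is a minimal Python sketch of the same idea: it builds the lines you would otherwise type by hand for an HTTP HEAD request, sends them over a plain TCP socket, and splits the status line of the reply. The function names (build_http_request, parse_status_line, send_raw) are made up for this illustration, and HTTP is only one example of a plaintext protocol.

```python
import socket


def build_http_request(host: str, path: str = "/") -> bytes:
    """Return the bytes you would type by hand into a telnet session."""
    lines = [
        f"HEAD {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # the empty line that ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(response: bytes) -> tuple:
    """Split the first response line into (version, status code, reason)."""
    first_line = response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    version, code, *reason = first_line.split(" ", 2)
    return version, int(code), reason[0] if reason else ""


def send_raw(host: str, port: int, payload: bytes, timeout: float = 5.0) -> bytes:
    """Open a TCP connection, send the payload, and read until the peer closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Against a web server you run yourself, something like parse_status_line(send_raw(host, 80, build_http_request(host))) should return the protocol version, status code and reason phrase. Seeing the raw bytes in both directions is the point of the exercise, whether you use telnet, netcat or a few lines of code.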

Fix the root cause

Fixing the symptom of a problem is not a real solution. Just because a problem goes away when you reboot your server, or restart your application, does not mean that you've solved anything. It is important that you determine and fix the root cause. This can take more time, but once you do it, it's likely you won't have to repeat yourself. In the longer term, it's a much better way of solving your problem.

When we hit problems, we tried not to leave them until we were satisfied we had discovered a root cause. Sometimes this takes a lot of effort, but it's worth it. In one example, we were seeing Apache hang for IPv6 requests over 256 bytes, but only after the second request. It took hours of debugging to even determine the characteristics of the bug.

Being able to replicate a bug from a known state is a very important part of debugging. Once you've gotten this far, you can show other people how to produce the bug and get some help fixing it. In our case, it turned out to be a bug with the network card not doing TCP checksumming correctly in IPv6. It took Red Hat kernel developers to diagnose it fully.
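
A reproduction recipe is easiest to share when it is a small script. The sketch below is not the procedure we actually used. It is a hypothetical Python illustration of the idea: send HTTP requests of a chosen size over IPv6 and record which ones get a reply before a timeout. The helper names, the padding header, the timeout and the choice to use a fresh connection per request are all assumptions.

```python
import socket


def make_request(total_size: int, host: str) -> bytes:
    """Build a GET request padded to exactly total_size bytes."""
    head = f"GET / HTTP/1.1\r\nHost: {host}\r\nX-Padding: ".encode("ascii")
    tail = b"\r\n\r\n"
    padding = total_size - len(head) - len(tail)
    if padding < 0:
        raise ValueError("total_size is smaller than the minimal request")
    return head + b"a" * padding + tail


def request_gets_reply(conn, payload: bytes) -> bool:
    """Send one request; True if any bytes come back before the timeout."""
    conn.sendall(payload)
    try:
        return bool(conn.recv(4096))
    except socket.timeout:
        return False


def reproduce(host: str, port: int, size: int, attempts: int = 2,
              timeout: float = 5.0) -> list:
    """Return one True/False per request, each sent on a fresh IPv6 connection."""
    results = []
    for _ in range(attempts):
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as conn:
            conn.settimeout(timeout)
            conn.connect((host, port))
            results.append(request_gets_reply(conn, make_request(size, host)))
    return results
```

Running reproduce() with sizes just below and just above the suspected threshold, starting from a freshly restarted server each time, turns "it sometimes hangs" into a list of results that someone else can regenerate on their own machine.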

Fix it once

Fixing the same problem twice doesn't make much sense. If you find a fix for a problem, make a note of it. If you use the same solution elsewhere, make sure you apply the fix there as well. Make sure you integrate the fix into your future deployments. Document the fix if appropriate.
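
One way to carry a fix into future deployments is to write it as a step that is safe to run repeatedly. The Python sketch below assumes, purely for illustration, that the fix is a single line in a configuration file. The helper name ensure_line is hypothetical, and real fixes will often need more than this.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Append line to the file unless it is already present.

    Returns True if the file was changed, False if the fix was already applied.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

Because the second and later runs change nothing, a step like this can live in whatever script or tool builds your servers, and the comment next to it is a natural place to document why the fix exists.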