Catastrophic Failures

I just got bitten by a bad programming habit, and I wanted to share some advice with you. This is on the order of a tip that an old hand is sharing with you, which might save you some grief in the long run.

You know how you’ll be coding along on a routine that’s supposed to open a file, write something to it, and then close the file? You write the routine, compile the code (or not), and go about your business.

Then one day, the thing fails because, for one reason or another, you couldn’t open the output file the way you wanted. Oops, you say to yourself. It never occurred to you at the time that the file wouldn’t be there or openable in the way you specified. That happens more than you think in the programming world. While you’re writing this code in the first place, it just doesn’t occur to you that something might go wrong. If it does, you add a little “else” clause to return some kind of flag to the calling program or routine to indicate failure. But you never handle the failure. And you never stop and consider whether the error or failure is such that it would more or less obviate the rest of the code.

There’s a class of failures one could call “catastrophic”. That is, if this operation cannot be completed properly, there’s no point in going on with the rest of the program. That’s it, you’re done. Let’s say the point of a routine is to write some text to a log file, then close it and terminate. But what if the log file’s been deleted by some over-enthusiastic sysadmin, or the system’s been upgraded so that you no longer have permission to write to that file? Now what? Well, if the whole point of this operation is to write to that file, and you can’t do so, then your whole routine (or program) is a bust. Might as well never have called it in the first place. That’s a catastrophic failure.

The tendency for a lot of programmers is to test whether the file can be opened and written to, and if that fails, return a flag or signal to the calling program or routine indicating failure. Then, in the main part of the program, somehow deal with the error. But what if your routine could fail in multiple ways? How do you indicate to the calling program which part failed, and any additional info (like the filename) needed to correct the problem?
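To make that concrete, here is a rough sketch in Python of the flag-returning style. The names `LogResult` and `write_log` are made up for illustration, and the two failure modes shown are just an assumption about how such a routine might break.

```python
import enum


class LogResult(enum.Enum):
    """Hypothetical status flags handed back to the caller."""
    OK = 0
    OPEN_FAILED = 1
    WRITE_FAILED = 2


def write_log(path, text):
    """Append one line to the log and return (status, detail) instead of aborting."""
    try:
        log = open(path, "a", encoding="utf-8")
    except OSError as err:
        # The caller needs the filename and the reason, so both go in the detail.
        return LogResult.OPEN_FAILED, f"{path}: {err}"
    try:
        with log:  # closes the file; a failed flush on close lands in the except below
            log.write(text + "\n")
    except OSError as err:
        return LogResult.WRITE_FAILED, f"{path}: {err}"
    return LogResult.OK, ""
```

Every caller now has to unpack the status, check it against each possible value, and decide what to do with the detail string. That bookkeeping grows with every new failure mode you add.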

The more of your error handling you hand off to the main program, the more complex the handling part of your program is going to be. And the more complex your API or interface to it will be. Also, is there really a point in returning to the main part of your program with an error flag in the case of a catastrophic failure?

The answer is, “no”. The best handling in the case of a catastrophic failure is to stop right there, output some sort of error message and abort. You can’t continue, so there’s no point in returning with a flag and expecting the calling routine to do something with the error. If you want to, write a simple error-reporting routine that you can call in the case of a catastrophic failure. Call it “die”, or “oops”, or “uh_oh” or something, and simply pass a message to it. All the info about the error is contained in the routine which failed, so bundle whatever part of that you like into a message, pass that to the error routine, and then have the error routine abort the whole program or script.
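Here is a minimal sketch of what such a routine might look like in Python. The `fatal:` prefix, the exit code of 1, and the `append_to_log` helper are my own choices for illustration, not anything a particular library requires.

```python
import sys


def die(message, exit_code=1):
    """Report a catastrophic failure on stderr and abort the whole program."""
    print(f"fatal: {message}", file=sys.stderr)
    sys.exit(exit_code)


def append_to_log(path, text):
    """Append one line to the log file; abort if the file cannot be written."""
    try:
        with open(path, "a", encoding="utf-8") as log:
            log.write(text + "\n")
    except OSError as err:
        # Everything worth reporting is known right here: the path and the OS's reason.
        die(f"cannot write to log file {path!r}: {err}")
```

The calling code just calls `append_to_log(path, text)` and moves on. There is no flag to check, because if the write fails the program has already stopped with a message that names the file and the reason.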

My point here is two-fold. First, be on the lookout for places where a failure could occur (all file-level operations should be suspect) and where that failure, if it does occur, would make continuing impossible or at least pointless. These are “catastrophic” failures. Second, in handling such errors, don’t continue on with the program or script from there. Stop immediately, issue an error message or error return code or something, and then immediately exit or abort. Working it this way will save you a lot of time and trouble in the end, and simplify your coding.

By the way, if some other programmer tells you this isn’t the best way to handle things, ask them why and listen carefully to their explanation. Then compare what they’re saying to what I’ve said above. Then decide what to do based on the better reasoning. And if you do have some compelling reason not to do as above, let me know. Don’t listen to academic arguments about “best practices” and “unit testing requirements” and crap like that. That stuff usually amounts to horse hockey, and comes from guys who haven’t actually written a line of code in 25 years or something. Or guys who’ve thought and theorized a lot about coding, but never learned to do it well. Lot of guys out there like that.