
Style Guide from 1998

Foreword

This is a document basically scraped from the Wayback Machine’s archive of some of my old content. It’s… really, really old. The last revisions were from 1998, and it was originally written in 1995.

It’s included here because I thought it was interesting to see how I wrote back then (it’s changed a little) and because some of the information is still useful.

I’ve changed links where they’re outdated; I’ve also made sure the spelling is consistent (just in case I misspelled something) and fixed some minor formatting things (including a conversion to Markdown), but this should be equivalent to what I published back then.

Introduction

I’m a programmer/analyst currently working for the State of Florida Department of Revenue. Having spent my entire professional career caring about programming, both personally and professionally, I’ve managed to form some opinions on programming style, in a lot of areas. What follows is a collection of information gathered from a lot of reading, and a lot of experience.

This is written and aimed primarily at C and C++ programmers, but hopefully it will be useful to programmers in all languages, although some features may not be usable in certain environments. This isn’t meant to be an end-all discussion; Steve McConnell has a book out called “Code Complete” that addresses every one of these issues and more. It’s good reading, and he goes into more issues and in more detail than I can. Hey, he’s got 850 pages to do it in…

Also: let me know if you have a comment or a differing opinion regarding some of the issues and resolutions I’ll bring up. I’m happy to mention other viewpoints; I’ve even learned a bit by encouraging dialogue.

  1. Writing and Formatting Code
  2. Using Language Features
  3. Naming Variables and Functions
  4. Designing Functions
  5. Comments in Programs and Functions
  6. Writing for the Future
  7. Debugging
  8. Improving
  9. But What About Perl?

Writing and Formatting Code

Two of the most religious issues in programming – once you get past operating system and language concerns – are: the best editor and the best format. Both of these issues have caused many a flame, and much gnashing of teeth. I must confess I’ve done my own bit of gnashing regarding both of these, but I think I’ve finally settled down.

Note that this section is tangential to real programming style; how you write and format your code really isn’t as important as using consistent styles, and the existence of tools like GNU indent (for C) and Tal Davidson’s Beautifier (for Java) really wipes the problem out. That said, this particular page may or may not interest you; if you’re more interested in real coding issues, go to the next section.

Which editor you write code in can only be called a preference. Some editors have more features than others, and if you prefer those features… well, use them. The bottom line for me here is that an editor is your own choice, and no amount of rhetoric can really alter that. All that said, my favorite editor is Multi-Edit Professional, for both DOS and Windows. Man, I wish they’d port that to Linux. Emacs is nice, but for some reason I can fly in Multi-Edit like I can in no other editor. For the record: what I use now is joe and Emacs. The only problem with Emacs is, oddly enough, the language support (in particular, it thinks that files ending in .l are Lisp files and not lex input files). Oh, well – that’s probably more my problem than Emacs’.

The “best format” is just as variable, but fortunately, some formats are more useful than others. For the most part, I’ll be dealing with C and C++ here, but other languages can certainly borrow the structure. Why the structure in question was chosen matters more than what the structure actually is.

One of the most popular formats is the “K&R” format, from Kernighan and Ritchie’s book The C Programming Language. This format looks roughly like this:

char* StringCopy(char *Destination, const char *Source)
{
  const char *CurrentSrcChar;
  char *CurrentDstChar;

  CurrentSrcChar=Source;
  CurrentDstChar=Destination;

  while(*CurrentSrcChar) {
    *CurrentDstChar=*CurrentSrcChar;

    CurrentSrcChar++;
    CurrentDstChar++;
  }

  *CurrentDstChar='\0';
  return Destination;
}

One of the hallmarks of this style is that the brace beginning a block sits on the line that begins the block (in this case, the while()), and the block-ending brace lines up with the statement that begins the block. Note, however, that the brace that begins the function itself is on a line after the function declaration.

Dean Velasco (http://dora.eeap.cwru.edu/vbv) pointed out that he used a slightly modified version of K&R, in which he lines the block-ending brace up with the text of the block. In that case, my example function would look a bit different:

char* StringCopy(char *Destination, const char *Source)
{
  const char *CurrentSrcChar;
  char *CurrentDstChar;

  CurrentSrcChar=Source;
  CurrentDstChar=Destination;

  while(*CurrentSrcChar) {
    *CurrentDstChar=*CurrentSrcChar;

    CurrentSrcChar++;
    CurrentDstChar++;
    }

  *CurrentDstChar='\0';
  return Destination;
}

An advantage of this slightly modified version is that it encapsulates entire blocks very well (and preserves a bit of screen space, if you need it).

The GNU coding standard advocates separating the function return type from the name of the function and putting braces on the line after the block initialization (the while() above), indenting the braces, and then indenting the lines within the braces as well. This uses a bit more screen space than the K&R style, but tends to open up the code a bit:

char*
StringCopy(char *Destination, const char *Source)
{
  const char *CurrentSrcChar;
  char *CurrentDstChar;

  CurrentSrcChar=Source;
  CurrentDstChar=Destination;

  while(*CurrentSrcChar)
    {
      *CurrentDstChar=*CurrentSrcChar;

      CurrentSrcChar++;
      CurrentDstChar++;
    }

  *CurrentDstChar='\0';
  return Destination;
}

My personal style is the Whitesmith style, similar to the GNU convention, except that I don’t indent code inside the braces (I find the extra level of indentation distracting). Here’s the same code, in the Whitesmith format:

char*
StringCopy(char *Destination, const char *Source)
  {
  const char *CurrentSrcChar;
  char *CurrentDstChar;

  CurrentSrcChar=Source;
  CurrentDstChar=Destination;

  while(*CurrentSrcChar)
    {
    *CurrentDstChar=*CurrentSrcChar;

    CurrentSrcChar++;
    CurrentDstChar++;
    }

  *CurrentDstChar='\0';
  return Destination;
  }

This isn’t an exhaustive list of styles, obviously. (I’ve shown these four styles to two other programmers in my office, and gotten two more variants on these. Oh, well.) I’ve used a tab size of two characters here, although many studies have shown four to six spaces are optimal. Your mileage may vary.

A side issue of tab formatting is the page width. With a large tab size, some programmers complain that a lot of nesting tends to make the page too wide for most printers. Personally, that’s not an issue; if I have more than four or so levels of indentation, I break some of the blocks out into separate functions to make the procedure simple enough so that the indentation isn’t a problem. (More on this in later sections.)

The neatest things about formatting styles are: they’re great ways to get into pointless arguments with other programmers, and tools exist to render the whole point moot anyway. GNU indent is readily available for nearly every operating system I can think of, and it can be given parameters to emit nearly every variation of output format there is. (Oddly enough, I haven’t found my variation, but you can’t win them all. I may dig up the source code for indent and try to put that in someday.)

Author’s note: Since writing this, I’ve more or less given in to Emacs’ formatting style, since I’m lazy. (2016 note: haven’t done it that way for years.)

The existence of tools like indent wipes the religious argument aside. For example, the first thing I do when I look at any significant amount of code from another programmer is run it through indent with my settings, so I can see the code the way I’m used to seeing it; when and if I send code to someone else, I have no qualms about them doing the exact same thing.

Note that there are a lot of other issues in formatting that I’m ignoring. The above is primarily a (small) discussion of control-structure blocks. Other issues that are important (that I’m not addressing) are whitespace in routines, and the spacing between parameters. These may seem like minor issues – and the tools that can reformat code for you certainly make them seem small – but anything that improves code readability is a Good Thing (tm).

There are other issues in formatting that I am addressing. Read on!

Using Language Features

Nearly every language has its hackers, wizards, and gurus. When I worked in BASIC 2.0 those many years ago, I remember using shortcut commands (you could use “?” and the language would know you meant the PRINT token) to get massive lines entered. There was an actual speed difference involved when you did that; there was a small bit of processing for each new line. The tradeoff, of course, was that if you made an error in the line, you had to meticulously go back through entering the line again, and hope that you didn’t go over the limit in terms of how large the entry buffer was.

From my standpoint (as a young would-be hacker), that was great coding style; it was just tricky enough that your average Joe Schmoe wouldn’t be able or willing to do it. Unfortunately, that attitude is still alive and well in the programming world.

Looking back, after being employed as a professional programmer in a workgroup environment, I wonder what I ever saw in obfuscated code, besides some egregious ego-stroking. It’s insane… and quite popular. Consider the following example, which came from a programmer on EFNet. This person, whose identity has been concealed to protect the guilty, was trying to sort strings in qsort(), based on the 20th character onward in the string.

int
cmpfnc(void *a, void *b)
  {
    return strcmp((char *)a+20, (char *)b+20);
  }

Amazingly, the code didn’t work. Can you see why? When I first looked at it, I saw a lot of potential problems with it, so I told the fellow to rewrite it and then ask me for help; only then would I consider it, because I didn’t feel like taking the time to trace through the code.

This code suffers from two distinct problems, even though it’s only one line! The first problem is that it is simply wrong; the programmer needed to handle his casts properly. The other problem is that the code does its best to make itself hard to read. The above single line does six distinct things:

  1. Convert a to a char*.
  2. Convert b to a char*.
  3. Offset a to the 20th position.
  4. Offset b to the 20th position.
  5. Call strcmp().
  6. Return the result of the strcmp() call.

Some programmers I know would think: “Wow, that’s the glory of C! Instead of six lines, I’ve got only one! How cool can you get?” Those C programmers aren’t idiots, necessarily, but it’s just poor style. Our exemplar of bad style here couldn’t even see what his problem was, because his debugger worked line-by-line, like most do. Instead of seeing that his original casts were incorrect, he only saw his strcmp() call fail. If he’d broken that statement into even four lines, he would have seen that something was amiss when he went from void* to char*.

Just to give you an example of very simple – and thus, preferable – code, here’s what I would have written:

int
QSortCompareStrings(const void *elem1, const void *elem2)
  {
  char *First;
  char *Second;
  int CompareResult;

  First=*(char **)elem1;
  Second=*(char **)elem2;

  First+=20;
  Second+=20;

  CompareResult=strcmp(First, Second);

  return CompareResult;
  }

Something that’s not-so-surprising about this code is that it ran the first time I tried it. I traced it in a debugger just to make sure it was correct, and it was. There’s a hole in it, to be honest, but I was making an assumption for brevity. (If I were really writing this code for a production environment, I’d have had an assert(), or something similar, in there to make sure the passed strings were at least 21 characters long.) Amazing, isn’t it – chances are that you can read that code clearly, and see how it’s working, without having to think about it very much.

I’ll be revisiting this code snippet in Designing Functions, to address it from start to finish.

Code complexity is a plague in C; it spreads like a virus. There’s an ego trip tied up in every line, it seems, and the more complex you can make your code, the better. Hogwash. People who do that have no feeling for maintenance programmers, and have never had to maintain code like that themselves, or they’d know better. The simpler you can make your code, the better – not just for you, but for everyone else, because while people may enjoy poring over good code, they rarely if ever want to have to pore over code just to have a chance of understanding what it does.


Flags are another area in which C programmers tend to get stars in their eyes. C has lots of neat ways to access boolean flags: ints, bit fields, and bit accesses. Using an int for a flag is rather obvious: you have a single variable, defined as int, that contains either zero or non-zero to indicate a status of some kind. Bit fields are a way to get the compiler to break down an int into its component bits; single-bit fields contain either one or zero. Bit fields are accessed as if they were structure members, and they yield very compact data structures. Bit accesses are programmer-defined bit ranges in a given field, which means that the programmer has the joy of using C’s relatively arcane bit operators to determine or set values.

If you’ve been paying any sort of attention to what I’ve said already about code complexity, you’ll guess that I use ints for flags in every case where I’m not required to use a different access method. “That’s wasteful of space,” I’ve heard, and it’s true – on Linux, and every other 32-bit operating system, you can fit thirty-two single-bit fields in a single int, and if I had that many flags as ints, I’d be wasting 124 bytes. Wow. Sounds like a lot when it’s put that way, doesn’t it – especially in today’s systems with a minimum of 16MB of RAM. In addition to being extraordinarily simple to write and use, this kind of flag is usually faster, too, since simply loading the variable into a processor register tends to set the zero flag – voilà! Instant comparison to zero.

I haven’t used bit fields in a while, but they used to be rather slow. Their access methods are quite easy, though, since they look just like ints, except their allowable ranges are much smaller (unless for some reason you defined a bit field to be the same length as an int).

Bit accesses are relatively fast, although not as fast as using whole ints, but for a programmer to be comfortable (and thus skilled) with them, he has to spend a lot of time with them. This falls under my “neat-o C feature!” heading, so I avoid their use. Once again, the less thought you have to put into how a given statement works, the better – since you can then concentrate on what the statement does.

Another issue with bit fields is that ANSI doesn’t specify the sign of a bit field, so that’s a potential portability issue. To be honest, I don’t know of any situations where this has caused a problem, because I don’t use bit fields and I tend to eliminate them when I see them unless they’re used for hardware interfaces. Quick rule of thumb: declare the sign of your bit fields, if you have to use them.
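To make the three approaches concrete, here’s a minimal sketch; the flag names are hypothetical, and the bit-field version declares its sign explicitly, per the rule of thumb above.

/* Flag style 1: plain ints. One variable per flag, zero or non-zero. */
int ConnectionIsOpen=0;
int BufferIsDirty=0;

/* Flag style 2: bit fields. The compiler packs these into one unit;
   note the explicit "unsigned", since ANSI leaves the sign of a plain
   bit field up to the implementation. */
struct StatusBits
  {
  unsigned int ConnectionIsOpen:1;
  unsigned int BufferIsDirty:1;
  };

/* Flag style 3: bit accesses. Programmer-defined masks, manipulated
   with C's bit operators. */
#define CONNECTION_IS_OPEN 0x01
#define BUFFER_IS_DIRTY    0x02

unsigned int StatusFlags=0;

/* Setting and testing with bit accesses looks like this:
   StatusFlags|=BUFFER_IS_DIRTY;
   if(StatusFlags&CONNECTION_IS_OPEN) ... */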


Author’s note: the following section was drastically revised on November 4, 1997.

Another common complexity that bears some looking at is conditional assignments in loops. Conditional assignments in single if() statements aren’t worth the trouble; save everyone just a bit of hassle, and break the assignment out.
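Here’s a quick sketch of what I mean; InputFile, FileName, and ReportError() are hypothetical stand-ins.

/* The conditional assignment crammed into the if() -- legal, but it
   makes one line do two jobs: */
if((InputFile=fopen(FileName, "r"))==NULL)
  {
  ReportError();
  }

/* Broken out -- each statement does one thing: */
InputFile=fopen(FileName, "r");
if(InputFile==NULL)
  {
  ReportError();
  }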

Let’s look at a different construct, though, the while() loop. I used to advocate (here, in fact) that the rule about “no assignments in evaluation expressions” applied to while(), too, but I’m going to issue a retraction.

So here’s what’s going on. I used to say that the construction below wasn’t very good, in the long run, because the while() actually does two things instead of one: it assigns the variable, and then compares the variable’s value to something else.

int InputChar;
while((InputChar=getchar())!=EOF)
  {
  DispatchToKeyHandler(InputChar);
  }

There are other ways to do this, and other ways are necessary in languages like COBOL (yes, I know COBOL, and I’ve worked with it; no, I don’t do Y2K). There are priming reads, do { } while() constructs, and others, I’m sure.
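For illustration, here’s the same loop recast with a priming read; the assignment and the test are now separate statements.

/* The same keyboard loop, written with a priming read instead of an
   assignment inside the while() evaluation. */
int InputChar;

InputChar=getchar();          /* the priming read */
while(InputChar!=EOF)
  {
  DispatchToKeyHandler(InputChar);
  InputChar=getchar();        /* read again at the bottom of the loop */
  }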

However, after a year or so of playing around with this, I’ll come down firmly on the side of the assignment with evaluation… in this one case. All others, I’ll still say to break them apart.

So why do I suggest accepting this construct?

Strangely enough, it’s not performance-related; priming reads are just as efficient. It’s not complexity-related, because if the assignment is complex, it should be broken out. If anything, it’s culture-related. The assignment and evaluation construct is very, very common. Even inexperienced programmers can recognize it, so it’s not a matter of catering to the newbies among us.

What’s more, compilers can still catch bad syntax, as long as you use parentheses properly. If you accidentally write an assignment where you meant a test for equality, the compiler will complain that it can’t assign a value to an expression (an expression being different from a variable) and you’ll instantly hunt down the error and fix it.

Of course, the caveat is there: “As long as you use parentheses properly.” I tend to wrap parentheses around everything, even when order of evaluation is on my side, just in case I ever want to go back and change things… and I’m in such a hurry that I don’t notice an evaluation order conflict.


Related to both the issue of individual line complexity and loop blocks is the issue of block complexity. A good block of code is simple, performing as little as possible within the block itself. For example, a while() loop might call a set of related functions within the block that perform the repetitive work, instead of containing every last bit of repeated code within the block. This makes understanding the block easier, and since you naturally name your functions well, it even adds a bit of self-documentation to the program. In our example above, DispatchToKeyHandler() is the repeated function. We don’t know how complex DispatchToKeyHandler() is; we don’t care, really, unless it’s got a problem in it somewhere. If it does, the problem isn’t going to be related to this loop, so we can rule the loop code out as the source of the problem. This helps in debugging and in maintenance. You have been warned.


Some astute programmers out there are saying that sometimes, writing super-clean code just does not pay. I’ll put my foot down on the issue, and declare: they’re right. Sometimes it’s better to have a non-obvious algorithm, or a tricky bit of code, than it is to have plain-jane stuff everywhere. The key here is to know why you have non-obvious stuff in there, and document it. Are you using the Bresenham algorithm (for drawing a line, for instance) instead of the high-school geometry method of determining the slope? Good! Bresenham is an integer-only method, and for that and other reasons it’s super-fast compared to the slope method, but it’s also much longer, and isn’t quite as obvious. This is a candidate for good documentation of what’s going on, and the documentation should mention something about performance gains outweighing the need for obvious code.

That example stretches the issue a bit, really, because any graphics programmer worth his or her salt knows about the Bresenham line algorithm or a variation of it, so you might not have to explain that particular algorithm. The principle is sound, though.
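Just to make the contrast concrete, here’s a minimal sketch of the Bresenham approach, restricted to the easy octant (0 <= slope <= 1, left to right) to keep it short; PlotPixel() is a hypothetical output routine standing in for whatever your graphics library provides.

/* Bresenham line drawing, integer-only, x-major octant. Performance
   outweighs the need for obvious code here: instead of computing a
   floating-point slope, we accumulate an integer error term and step
   y whenever it overflows. */
void
DrawLine(int x0, int y0, int x1, int y1)
  {
  int DeltaX=x1-x0;
  int DeltaY=y1-y0;
  int Error=(2*DeltaY)-DeltaX;
  int x;
  int y=y0;

  for(x=x0; x<=x1; x++)
    {
    PlotPixel(x, y);
    if(Error>0)
      {
      y++;
      Error-=2*DeltaX;
      }
    Error+=2*DeltaY;
    }
  }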

Another reason you might favor less obvious code is optimization. Optimization is usually a red herring; I’ve seen programmers optimize minor code that’s executed once in the program’s execution. While their effort may have been well-intentioned, what I’ve found is that usually optimization is entirely unnecessary. Bottlenecks are usually found in tiny portions of the overall source, and are valid candidates for optimization, but the rest of the code should probably be made as simple as possible.

Even optimized code should be approached with the idea of future maintenance. According to Steve Maguire in “Writing Solid Code,” Microsoft has two versions of Excel’s calculation engine; one is in straight C, and the other is in super-optimized assembler. The C version is used to check the assembler version, and was probably used to help determine the assembler version’s algorithms in the first place. This two-pronged approach helps the code detect its own errors, and it also makes future maintenance a lot easier.

Naming Variables and Functions

Naming variables is crucial in programming; after all, a name is how you reference something. A good name carries with it information about the variable it references, without being language-dependent if at all possible, and does so in a non-intrusive manner. Beyond that, there’s really no better recommendation I can make than this: establish a standard and stick to it, no matter what it is.

Variable names are rarely simple, like “color” would be, because you’re likely to have to track more than one color in a program. This brings up the issue of how to separate descriptors. One school of thought says to use an underscore (for example, “normal_text_color“, known in 2016 as “snake case“) and another school of thought uses upper-cased starts of words (e.g., “NormalTextColor” – 2016’s “camel case“). There are variations on this theme, including a combination (“Normal_Text_Color“), and positional requirements (one style asks that the first word be lower-cased, and the rest upper-cased, like so: “normalTextColor“.)

Personally, I use “normalTextColor,” because it’s easy to type and easily readable (for me) but your preferences may be otherwise. Some avoid case-sensitive names because of the potential for abuse, a la Microsoft Windows… and if you’ve read Windows code and recoiled in horror, you might want to reconsider using case-sensitive names.

Now that we’ve lightly addressed how to name your variables (and ducked the temptation to dictate a style), the next question is “What do you name your variables?”

There’s no good, solid answer to this, in my opinion. There are three or four prevalent naming styles, applied with varying success and in different degrees. Three of them are: functional naming, Hungarian notation, and type-based naming. (If you have another that is not a strong variant of one of these, please tell me about it.)

Probably the strongest out of these is functional naming. Functional names consist of a description of what the variable represents in the program’s overall design. Examples of this might be InputFile, or InputChar, as used above. This style is strong because it’s usually language-independent, it’s usually very clear, and it’s easy to remember. Of course, it can be abused by assumptions, such as assuming everyone knows what abbreviations like TTF mean.

Hungarian Notation is a standard originally proposed by Dr. Charles Simonyi of Microsoft, and a bad form of it is extensively used in Microsoft Windows programming. Hungarian is very strong, once you’re used to it; it incorporates some of the strengths of functional naming along with a well-specified set of type descriptors. This means that you can look at a variable and know not only what it represents, but how it’s used and maintained. For example, pfInput might be an equivalent to InputFile from functional naming. hwinInput (from Windows code) would be read as “handle to a window used for Input.” The weaknesses of Hungarian notation are probably obvious: in addition to being quite arcane for those with little or no exposure to it, it also requires a well-defined prefix set, which may or may not be easy to design (especially in object-oriented environments like C++). The difficulty of designing prefix sets tends to give rise to type-based naming, a bastard and evil offshoot of Hungarian notation.

Type-based naming is usually what gives Hungarian notation its bad reputation. Hungarian notation uses prefixes based on the type and usage of a given variable (for the most part), rarely making references to specific language implementations, whereas type-based naming relies almost exclusively on the underlying language implementation. Occasionally this is useful, but rarely, and in specific environments. For example, for teaching materials (used to teach a specific language), type-based naming can be used to show exactly what types are being passed back and forth (iInput=getchar() would illustrate that getchar() returns an int, for example.) I use type-based naming extensively in my tutorial pages for this very reason. For complex (or just plain real) programs, though, type-based naming turns into a multi-headed, immortal monster.

Consider our iInput=getchar() example. Suppose, somewhere down the line, getchar() is replaced by a user function that allows for extensions (for mouse input, for example). We might still be able to fit the returned values in an int, but what happens when we decide our new input routine should return a data structure containing positional input, button selection, or the key pressed? At this point, a maintenance programmer gets to make a few changes: the first change is to the type declaration, to make it a structure. The next change is to fix the now incorrect name from “iInput” to “sMouseInput,” or whatever the standard is for a structure type in type-based naming. That’s relatively easy for an editor to do, since global search-and-replace is common now, but why? Why not make changes as small as possible? Such a philosophy makes future maintenance much easier, and less buggy.

Type-based naming also runs smack-dab into an issue I’ve covered lightly so far: what is a variable, exactly? Data types are not real; they’re only representations. A C char type may or may not represent a character (wide characters, anybody?), and an int can represent any number of things: time of day, a count of oranges, a count of apples. getchar() suffers from having a poor definition (it returns two different things in one variable), but the name is not necessarily misleading if you are able to remember that a keyboard character is not necessarily a C char.

As usual, the best way to go about naming conventions is to first try them out to the best of your ability, and then – once you’ve found one you’re comfortable with – apply it as strongly as you can tolerate. Watch how other people react to your convention, if other people might look at or use your code. If they gag when they see it, and they’re not giving you a knee-jerk reaction, you might need to revise your naming conventions to be a little more conventional for where you work.

Jeroen C. van Gelderen mentions that type-based naming isn’t that big of a deal, even in the case mentioned above. Even in the case where the type changes, you’ll have to walk the code to fix the references anyway, or the code won’t compile and link, right? So why not fix the name along the way?

My response is still the same: why not get the name right in the first place? You still have to change the references if the usage changes (a likelihood in C, and slightly less likely in C++), true. That’s enough, in my opinion. Why stack an extra task in there? The way I see it is this: if I can get a job done with less work and fewer runs through a compiler, that’s good.


Personally, I went through a phase where I thought Hungarian was the thing to do. Hungarian has a lot of advantages, when it’s used properly, but unfortunately it’s a pain to use properly. In addition to being cryptic, it also lends itself to a great amount of confusion about the proper prefix conventions. For example, consider C++ classes, which may have data members; Microsoft’s convention is to prefix members with m_, which, while clear, introduces a bad precedent, one of positional notation in naming (something most Hungarian conventions suffer from already). It’s bad enough that Hungarian requires study of a standard before it can be effectively used, but adding more dross to an already-crowded prefix is asking a bit much, in my opinion.

Designing Functions

Naming Variables and Functions and Using Language Features are both closely related to function design. Good function design will normally yield relatively simple code and well-formed names. So how do you design functions well?

The first step is determining exactly what a given function will do. For example, a function’s purpose might be: “Return a value indicating the collation sequence of two strings, based on their 20th character onward, for use with qsort().” (Yes, this is our example from the Using Language Features section.) If the purpose can’t be stated cleanly, then either the designer doesn’t know what he wants the function to do, or the function is doing too much. Either situation is a problem. At any rate, once we’ve got what a function will do, we can come up with a name: “CompareStrings()” is too generic, since that’s what strcmp() does; “MyCompareStrings()” is also too generic; how about “QSortCompareStrings()”? I don’t care for this one, either, for the same reason. In this case, we probably need to look at the need we’re fulfilling, which might be stated as “We need to sort this input file based on last name, which starts at position 20.” If that’s our purpose, we’re not sorting strings as much as we are sorting input records by last name. Hey! We have now stated our function’s purpose directly; the object of our design facilitates sorting input records. Why don’t we call it “CollateInputRecords()“, since that’s exactly what it’s doing?

The next step is to design how your function will accomplish its task. In our simple example, you might first say, “This function will call strcmp() with the two string values it’s passed.” Of course, this statement doesn’t take the right view; for one thing, it doesn’t really consider the offset nature of the data (we’re sorting from position 20 onward, remember?), it takes no account of the nature of our input, and it does no error checking whatsoever. It’s a decent start, though, since it’s direct. Let’s keep working with it.

How about “This function returns an indicator of the collation order of two records”? Now this has potential. It doesn’t rely on any specific language features, specifies that the data structures being passed to it are complex, and is direct. If we were doing more design stuff related to the rest of the program, we’d probably say what kind of records they were.

Now we can determine a little about implementation. We know that this function is to be called from qsort(), and because we’re astute developers, we know that the structures are actually held in C strings (pointers to char). (That’s a bad idea, in my opinion, but we’re just working for the sake of the discussion.) Since we’re collating, that sounds like a perfect job for strcmp or one of its related functions. We’ll be very simple here and just assume strcmp() is enough.

After determining what the function does and how it will work on a very basic level, it’s time to look at the actual function design. We have already thought of how the function should work – by calling the C library function strcmp() – but we haven’t really looked at what the function has to work with or on, or any issues like error checking. The next step is to work out what parameters we have.

qsort() is fairly rigid; it passes two void pointers (that point to the data items being sorted) to a collating function, which returns an int: a negative value if the first element sorts before the second element, a positive value if the second element sorts before the first, and zero if they collate identically. Because of the processing requirements of qsort(), our inputs are pre-determined for us: const void*, and const void*. qsort() passes them indiscriminately, so let’s call them FirstElement and SecondElement. Our output is also determined by qsort(): our CollateInputRecords() is to return an int – which strcmp() can do quite well for us.

Outputs are important, and my example here really doesn’t go into the issue very much. Why not? Because it’s a utility function for qsort(), which has specific requirements; we can either fulfill them or not, and we obviously want to fulfill them. However, we can’t really discuss function design without hitting the issue of function outputs a bit harder.

As an example of poor output design, consider getchar(), a C library function. getchar() returns an int, which is slightly misleading (its name makes it sound like it returns a char, not an int – but if you’ve read the section on variable types, you understand why this isn’t a serious gripe on my part). However, in this int, we have a dangerous problem: the int contains either a C char (in the low byte), or EOF, normally defined as -1. We have our output doing double duty as a status indicator and as a data container. To me, that’s bad design; I don’t want a function whose outputs I might have to explain someday. I’d be happier having getchar() reprototyped as “int GetChar(char *value)“, and having it return whether it was successful or not – and if it was, the character retrieved would be in *value. That’s much clearer, in my opinion, and since each variable involved has only one purpose, much better.
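Here’s a sketch of the reprototyped version I have in mind, written as a wrapper over the existing getchar(); the name and the 1/0 return convention are my own assumptions, not anything from a standard library.

/* A GetChar() with one purpose per variable: the return value reports
   success or failure, and the retrieved character goes in *value. */
int
GetChar(char *value)
  {
  int InputChar;

  InputChar=getchar();
  if(InputChar==EOF)
    {
    return 0;        /* failure: nothing was retrieved */
    }

  *value=(char)InputChar;
  return 1;          /* success: the character is in *value */
  }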

Back to our example: we now have our name, our inputs, and our outputs:

int
CollateInputRecords(const void *FirstElement, const void *SecondElement)
  {
  /* Code follows */
  }

Now we’re down to the nitty-gritty. The next step is to work out how the code will accomplish its task of collating… we can do this by writing pseudocode into the function that describes the function’s workings.

Here’s our next view of the code, after pseudocode:

int
CollateInputRecords(const void* FirstElement, const void* SecondElement)
  {
  /* Offset the elements to the 20th position */
  /* Determine the collating position of FirstElement */
  /* Return the result */
  }

Let’s see – does that accomplish our task properly? No – we’re still not adding any debugging code up front, to make sure our input records are present and, if present, long enough. We’ll rectify that in the next revision of our function, but it’s good to think about debugging up front. Adding code to help you determine errors is a good practice; it’s why I spend a lot more time testing than I do debugging.

It’s now time to start fleshing out our function. First, let’s add our grunt code, since that’s not only the easiest part, but it’s going to tell us what we need in terms of variables and testing code.

int
CollateInputRecords(const void* FirstElement, const void* SecondElement)
  {
  /* Offset the elements to the 20th position */
  FirstElement+=20;
  SecondElement+=20; /* Author's note: will this work? */

  /* Determine the collating position of FirstElement */
  Collation=strcmp(FirstElement, SecondElement);

  /* Return the result */
  return Collation;
  }

Well! We’ve now got the guts of our code written, but it won’t work, for a lot of reasons. The most obvious reason is that we don’t have any declared variables yet. Another is that we’re passing strcmp() two void pointers, which a strict type-checking compiler won’t like. Also, we’re modifying the void pointers, and they’re declared const, as the qsort() prototype specifies.

There’s a side issue mentioned above, about strict type-checking compilers. I enable every possible check I can find when I write code. I use lint, when I can. I want the compiler to warn me about every last possible screwup it thinks I could have made when I typed the code, down to an accidental “==” where I meant “=” (and vice versa) to … well, whatever it can find. I think it’s a good idea. There’s an anecdote that Steve Maguire tells, where programmers spent a good amount of time tracking down a bug, when a compiler warning had more or less told them of the problem the whole time…
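With gcc, for example, that means invoking the compiler with something along these lines; the exact warning set available varies by compiler, so treat this as illustrative rather than definitive:

gcc -ansi -pedantic -Wall -W -O -c program.c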

Anyway, we can see that first off, we need three variables: two “Element” placeholders, and a return value, which we’re calling “Collation” according to the code as we’ve written it so far. Let’s re-write our code to include the variables, and initialize them properly:

int
CollateInputRecords(const void *FirstElement, const void *SecondElement)
  {
  char *First;
  char *Second;
  int Collation;

  First=*(char **)FirstElement;
  Second=*(char **)SecondElement;

  /* Offset the elements to the 20th position */
  First+=20;
  Second+=20;

  /* Determine the collating position of our pseudo-FirstElement */
  Collation=strcmp(First, Second);

  /* Return the result */
  return Collation;
  }

This code now works, but it still isn’t perfect. The biggest problem is a lack of error checking. qsort() should never pass NULL pointers to its internal routines – I think – but what if it does? We might crash our machine entirely, if we’re using Windows, or merely dump core if we’re using Linux or another UNIX variant. (Hmm, which do we think is better?) That’s a broken qsort(), if that occurs, but do we want to rest our program’s validity on an external library? What if we have written our own version of qsort() – don’t we want to make sure it works, as well? Another possible error – and a more likely one – is that our input records might be too short. In this case, we’re sorting strings read in from a file, presumably in ASCII. What if a user inserted an errant end-of-line, and shortened the record to ten characters? In that case, we’re attempting to collate based on memory that doesn’t contain what it’s supposed to contain – a drastic error indeed.

Another gripe I have about this code is the final comment – “Return the result.” That’s fine pseudo-code; it allows us to see exactly how the function works before we code it, but now it’s just taking up space. (In Pascal, this may not be true, because Pascal assigns function result values by assigning the value to the function name.) We can yank the comment, and – more importantly – put in code to assure that our input is valid. We can leave the NULL checking for debug-only code, but the record length issue is one that we probably ought to leave in production code. Our last revision of this function might thus be:

int
CollateInputRecords(const void *FirstElement, const void *SecondElement)
  {
  char *First;
  char *Second;
  int Collation;
  int ElementLength;

  /* Debug-only error-checking */
  assert(FirstElement!=NULL);
  assert(SecondElement!=NULL);

  First=*(char **)FirstElement;
  Second=*(char **)SecondElement;

  /* Production error checking, if reading input doesn't do it */
  ElementLength=strlen(First);
  if(ElementLength<21)
    {
    fprintf(stderr, "FirstElement is too short!\n");
    exit(1);
    }
  ElementLength=strlen(Second);
  if(ElementLength<21)
    {
    fprintf(stderr, "SecondElement is too short!\n");
    exit(1);
    }

  /* Offset the elements to the 20th position */
  First+=20;
  Second+=20;

  /* Determine the collating position of our pseudo-FirstElement */
  Collation=strcmp(First, Second);

  return Collation;
  }

We probably could put more informative results in our error messages, but for the sake of this example we’ll leave them mercifully short. However, the point should be obvious: we should be able to pass this function any kind of bad data, and it will crash and burn – without harming anything else. If the programmer passes bad data, this code will explode so obviously that any programmer working on the project would be able to zero in on exactly what the problem is, and stamp out the bug. (In this case, a NULL pointer is a problem in the sort routine; a short element is a problem in the input file, which probably should be handled in the routine that builds the array of elements to be sorted.) This routine is ready to be tested – and if it fails a test, then seeing the error would be very simple, even without a source-level debugger. I think this is a good function. It’s wordy, yes, especially when you consider that it could be reduced to a one-line statement, but it also is easier to write, easier to watch run, easier to debug, easier to maintain, and easier to guarantee. And that’s very good.
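To close out the example, here’s a minimal test harness; it assumes the function above is in scope (with string.h and assert.h included for it), and that the records live in an array of char* – which is what our casts expect – with twenty filler characters before the last name we sort on.

#include <stdio.h>
#include <stdlib.h>

/* Minimal test harness for CollateInputRecords(). Each record is
   twenty filler characters, then the last name. */
int
main(void)
  {
  char *Records[]=
    {
    "12345678901234567890Zimmerman",
    "12345678901234567890Adams",
    "12345678901234567890Miller"
    };
  size_t RecordCount=sizeof(Records)/sizeof(Records[0]);
  size_t Index;

  qsort(Records, RecordCount, sizeof(char *), CollateInputRecords);

  /* Expected order: Adams, Miller, Zimmerman. */
  for(Index=0; Index<RecordCount; Index++)
    {
    printf("%s\n", Records[Index]);
    }
  return 0;
  }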

Comments in Programs and Functions

Comments are a form of documentation, one that the user rarely (if ever!) sees. They’re also a way of making code clearer, if done well, and that goes against the hacker’s mentality of “make the code as complex as possible.” So how does one comment well?

The first thing to think about here is: are comments really necessary? Consider three projects I’ve worked on recently. The first was a Pascal “split” program, that simply split up a file into pieces of equal size. It was probably eighty lines long, and was written in about fifteen minutes (after a bit of research into how Pascal handles files. ASSIGN(), RESET(), CLOSE(). That’s too intuitive – sarcasm intended.) The second was an economy simulator, in C++, roughly three thousand lines of code. (I had this counted originally at one thousand. Looking at an incomplete source listing that reached some 60 pages, I think it’s fair to adjust the number up.) The third was a program that formatted data according to a column description and output specifier, in C, designed to be sent and ported to various platforms and compilers.

The first project had absolutely no comments. None. Zippo. It was tested, verified, and run… and discarded.

The second project, the economic simulator, had a good bit of documentation at the start of the code, which primarily discussed how the simulator should work. Mixed in with the code itself were occasional references to the design documentation, which was meticulously maintained.

The third project had comments for every function and every module. Each function had a description of what the function’s purpose was, how it did it, inputs, outputs, modified parameters, maintenance histories, program design, the whole nine yards.

Personally, I think all three were documented properly. The first project had a lifetime of roughly thirty minutes, from start to finish; it was a spot-write, to split a single file for conversion to an OS that couldn’t read the file system I was using. The second program was pretty simple, really, once you understood how the system worked and how all the data interacted; occasional references to the documentation (which was in the source file itself) were enough. The third program was also documented properly. We couldn’t test our program on every platform the code would run on, and we had to anticipate someone not familiar with the code modifying it; because of that, we documented everything so that if a problem was found, the bug could be stamped out without our customers spending an awful lot of time on it. (We never actually got a bug report, but we were trying to anticipate one.)

All that is to answer the question: “Should I comment?” The answer is, of course, “Probably.” The main issue to consider in deciding whether you should comment is the life and complexity of the project; if it’s a once-off that’ll never see the light of day again, don’t worry about it – but if you ever want to revisit the code, or if anyone else has to, comment enough to make the code clear.

The best way to do that is to practice good function design, as mentioned above. If you turn your design into comments, you’ve already done a good job; keeping the actual code simple helps, too. Other things you can comment are: function inputs and outputs, coupling issues, references, optimization notes, and possible enhancements.


I might as well add a note or two related to Java here. My rule of thumb for Java documentation is this: if a class has a package statement in it, it should be fully and completely documented.

That sounds dictatorial, so I’ll detail what that means: every variable and method should get at least a blurb with some sort of detail. (Not so dictatorial after all, is it?) That’s because Java has a tool called javadoc, which can build HTML documentation for you. By using javadoc comments, you enhance your documentation with very little effort – effort that can pay you back in spades.

Writing for the Future

I’ve written an awful lot of utility programs in my career so far. I’ve even re-written utility programs, as the inadequacies of earlier versions – internally and externally – became painfully clear. I’m not proud of it, but there’s a point to be made here, as well.

In some cases, I was quite proud of the earlier versions. “Look at my killer state machine… it’s complex, but man, it works somehow.” “Check out this indexing scheme. It’s fast, it’s beautiful, it’s mine, all mine.” This is an immature attitude in both cases. My “state machine” I was so proud of was entirely unnecessary for its task; my “indexing scheme” was nice, but not documented very well. Both programs were satisfactory at first; the second (a subroutine, actually) is still in common usage, I believe.

In retrospect, I’d say both of these were vastly flawed. My state machine program was junked for a cleaner, much less complex program; my subroutine should have been commented and simplified. In both cases, I wrote for myself, and it turned out to be a problem. I should have written code that was less convoluted, relied less on tricks or hacks (or “Using Language Features“), and aimed more at the other people in my office. In both cases, there were very few other programmers that could really have delved into both sets of code and appreciated their complexity, and if I’d have presented the code to them, they’d have been right to point out that the complexity seemed to be there for complexity’s sake.

As it is, now that other programmers have to maintain the subroutine, it’s not maintained at all. The indexing scheme takes too long to work out, and while the gains might be worth it, nobody has the time or the will to slog through the code to work it out. That’s the danger of writing code for yourself as an audience; you tend to get caught up in tricks and traps, and the code turns out to be worth much less in the long run because the code costs too much to maintain. I could have kept modifying that subroutine, keeping it current, after I’d left that section, but I didn’t want to bother with legacy code like that. If I’d written it better, I wouldn’t have had the issue come up.

The moral? Write for the lowest common denominator, and you won’t feel like one as time goes on, and it improves co-worker relations as other programmers take on your legacy code. Well-written programs won’t require your particular expertise to modify, freeing you for more interesting projects.

Debugging

By now you can see how I write code, and how I think about code. I think that this style is under-emphasized in the industry, because of the ivory-tower concept (i.e., “if you’re not a hacker, you aren’t a programmer”), and because many early authors didn’t understand what a problem writing terse, misunderstood code would be. If you write code as I and some other authors have suggested, you end up spending less time debugging because your code has fewer bugs to begin with… and when testing reveals a bug, the bug is so obvious that it takes relatively little time to solve the problem.

That said, there’s still a bit to say about this area.

First: Test, test, test! You should be testing your modules as often as you can, under as many circumstances as you can. If you allocate memory, simulate memory failures. Use an allocation tester (a malloc library) to track your memory allocations, to make sure you’ve no unfreed blocks, to make sure you’re not using memory after you’ve freed it. Testing should stress your system to its limits, and throw as much at your program as it can expect. You want to expose bugs early, if you can, and you want to make their cause as obvious as possible.
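As a sketch of what simulating failures can look like, here’s a hypothetical allocation wrapper; real malloc-testing libraries also track every block to catch leaks and use-after-free, but the failure-injection principle is the same.

#include <stdlib.h>

/* A hypothetical failure-injecting allocator. Flip FailAllAllocations
   on in a test run and verify that every out-of-memory path in your
   program behaves sanely instead of crashing. */
int FailAllAllocations=0;

void *
TestMalloc(size_t Size)
  {
  if(FailAllAllocations)
    {
    return NULL;     /* simulate an out-of-memory condition */
    }
  return malloc(Size);
  }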

Trace your code in a debugger. This allows you to really see what’s going on, instead of assuming you know what’s going on, especially if it’s code you didn’t write, or code that isn’t written simply.

The assert() macro, used properly, is good for exposing bugs that make it past your trace. I use asserts on each input that needs to be ranged, and I keep my assertions simple (i.e., one expression only). This means that if I get an assertion, I can zero in on exactly what the problem was. Looking at our code for sorting input records, if the line I receive an assertion on is “FirstElement!=NULL,” I have no problem at all figuring out that my function received an invalid first element. Tracking that down is relatively easy.

Of course, debugging tends to be an ad hoc design revision process. If you can, refrain from this odious practice, since it tends to lead to invalid design documentation, and poor consequence analysis. If you find your design is flawed, change your design – in the design documentation – and then change your code. Often, programmers use the source code as a proving ground for ideas, trying one, then the other. This is a recipe for disaster, for a number of reasons.

One reason is that the practice implies that the programmer doesn’t know what he or she is doing. “Hmm,” you can hear, “How do I do this? Why don’t I try this… no, fatal error. How about this… okay, this seems to work for some voodoo reason.” Try again, says I. Know what you’re doing and why you’re doing it. If something doesn’t work out the way you think it should, then investigate. If you had a misconception about how exec() worked, for example, this is your chance to fix it and learn the right way.

Another reason is that you don’t change the design document until you’ve actually found an algorithm that works… and by the time you’ve tried and discarded various algorithms, you’ve probably gotten used to not following the design. The standard programmer thinks, “Why change it now? It’s already out of date.” This attitude invalidates and undermines the design document, so that when you need to have a set of requirements… you don’t have one. The design document, which is supposed to outline how to fulfill your requirements, is outdated and incorrect.

An anecdote: At one miserable point in my career I used a supposedly object-oriented, proprietary language. The language was at revision 2, and the documentation for it was for revision 1… in addition to the revisions being off, the language specification for revision 1 was incorrect in various minor ways, too. All we could really do was try to follow the documentation we had, and if that was wrong, we had no other recourse but to try things out until it seemed like the language was doing what we thought it should. The moral: if the proprietary authors had actually documented the design, they would have had a better design, and they would have saved everybody a lot of headaches in the process.

There’s really not a lot more I can say about debugging, except “Be aware, test, and prevent bugs before they start.”

Improving

So how does a programmer improve? I think I’ve improved as a programmer and as an analyst as time goes by, as I’ve indicated by using my own mistakes as examples. How did I do so?

There are two good answers. The first is that I read a lot, and the second answer is that I write a lot of code.

I can’t overstate the value of being well-read. I don’t think I’m well-read enough, mind you, but I spend a lot of time reading magazines (such as C User’s Journal [CUJ’s content now apparently exists via Dr. Dobb’s Journal, but the website itself is gone – note, 2019]), manuals (such as O’Reilly and Associates’ books on UNIX tools, and other things), books on coding per se (such as “Writing Solid Code”), books on software construction (such as “Code Complete”), and sometimes even ads (such as Gimpel Software’s PC-Lint ads). All that material shows me various ways of doing things, possibly even with figures to back up their statements. If what I see makes sense, I try it. If it works for me, it gets incorporated into my personal informal standard.

If reading a lot about my profession is important, coding a lot is critical. I think an inexperienced programmer might be able to get by without reading “Deep C Secrets,” but there’s no way a new programmer will ever get better without trying to apply his knowledge. If pointers are a foreign concept to you, you’re never going to be able to figure them out by just reading about them – you’ve simply got to play with them until their use is natural to you. Pointers are a simple example; the same applies to TCP/IP sockets. If you don’t use them, they’ll never be familiar to you.

So, go code!

But What About Perl?

Well, by now you’ve hopefully read all of my previous pages on programming style. You might have even gotten some useful things out of them. After all that, I know that in the back of your head there’s this still, small voice saying “But what about Perl?”

After all, pretty much everything I’ve said so far can be summed up in a few sentences: Standardize. Use simple constructs. Think about what could go wrong before it goes wrong.

Use One Way To Do It.

Well, the Perl community’s motto is “There’s More Than One Way To Do It,” often abbreviated as TMTOWTDI, pronounced as “tim-toady,” unless you want to pronounce it some other way, which is all right – after all, there’s more than one way to do it.

This flies in the face of pretty much everything I’ve said so far. You might think I dislike Perl, that I think it espouses poor or sloppy or unreadable code, that it encourages a lack of analysis.

Hoo-boy, are you wrong. Perl does all these things, and I love it.

Perl is weirder than I can possibly describe in a single web page. Larry Wall describes it incompletely in an O’Reilly book, Programming Perl; I don’t think I can compete with Perl’s author. Let’s just say that Perl has a neat way of bending itself around your problem.

Ay, there’s the rub.

I had an interesting discussion with some co-workers yesterday (August 11, 1998) about their desire to learn C. I like C; I use C. I have no problem with their desire to learn C, but they were less interested in C than they were in getting things done.

C is a decent language for expressing code. It’s also capable of handling problem solutions on a high level. The lower you get, though, the uglier it becomes, with the programmer wrestling with memory allocation, byte overruns, type casts, pointers to pointers, neat C hacks, ugly C “features,” and more warts. A simple C program has more error checks than I care for, and the definition and implementation of C more or less requires that you go through all this stuff so that C doesn’t bite you. As I mentioned earlier, most of my pages on style have focused on minimizing that bite.

Perl, on the other hand, doesn’t care. Perl makes a lot of assumptions for you, and frees you from translating your solution to code; in Perl, you just write your solution, generally in Perl, although you don’t have to. (And that’s a neat trick.) If your machine has a facility to do something, Perl will happily use it to accomplish that something, and manipulating data in Perl is a dream come true – and for those of us who have done a lot of data manipulation in languages like C, Perl establishes what the dream should have been in the first place.

Even the code is cool. For instance, in C, if you want to execute code based on a condition, you’re going to have this kind of code (unless you’re truly pathological):

if(somethingIsTrue)
    {
    doSomething();
    }
else
    {
    doSomethingElse();
    }

...
/* perception would say not to do this; poor idiom */
(somethingIsTrue?doSomething():doSomethingElse());

In Perl, you’re nowhere near as restricted in perception or in reality. Hold on to your hat, it could be a wild ride:

doSomething if $somethingIsTrue;
doSomethingElse unless $somethingIsTrue;
doSomethingElse until $somethingIsTrue;
doSomething while $somethingIsTrue;
doSomething or doSomethingElse;

These aren’t quite identical, and don’t assume the exact equivalence of code, in particular the last example.

Just as one more example of some of the differences, let’s open a file, dump it, and then close it. We’ll assume the filename is already established.

In C:

/* filename is a char* */
FILE* inputFile=fopen(filename, "rt");
char buffer[1024];

if(NULL==inputFile)
  {
    fprintf(stderr, "can't open %s for reading.\n", filename);
    /* "fopen" isn't quite right, but it will do. */
    perror("fopen");
    exit(1);
  }

do
  {
    fgets(buffer, sizeof(buffer), inputFile);
    if(feof(inputFile))
      {
        continue;
      }
    fputs(buffer, stdout);
  } while(!feof(inputFile));
fclose(inputFile);

Now in Perl:

open INPUT, $filename or die "can't open $filename: $!";
while(<INPUT>)
  {
  print $_;
  }
close INPUT;

This code does effectively the exact same thing. It’s much shorter, spends almost no time in error checking (part of one line, against a whole block plus a conditional in the C version!), and it expresses exactly what it does. It opens the file or it dies. Then, while data exists on the input stream, it prints the data. (The notation is odd, but that’s Perl for you.) Then it closes the file. Six lines against more than twenty.

I can’t really show you how great Perl is at molding itself around your specific problems. All I can do is recommend you investigate it for yourself, and mention a few caveats.

The Blessing of Individuality

s/Blessing/Curse/

Author’s note: that’s Perl’s expression for “replace ‘Blessing’ with ‘Curse.'”

Perl molds itself around your solution. In one way, that’s fantastic – in another, it’s not so hot. What this means is that if you think about something in a non-standard way, you might catch yourself without a clue when you revisit your solution. Document or pray – your choice, unless you’re very consistent.

What about my precious Standards?

Naming conventions still apply in Perl! In most other cases, standards are irrelevant, because of their problem domains (e.g., memory allocation, which Perl does for you.) Programming style is still an issue, but dictating style to a Perl programmer is a lot like dictating that all athletes should wear cleats… and pity the poor tennis and basketball players.

Last Words

Try Perl! It’s a wonderful language. It’s very UNIXish, so it may take you a bit to get used to farming out small tasks, but you’ll grow to love it. It’s fast, it’s relatively small, and it’s simply amazing in how it allows you to do things.