Rules for NULL
(Programming)
To NULL or not to NULL
In this essay, when I speak of database engines, I'm talking specifically about those which support PostgreSQL, MySQL or SQLite, and the PHP language. I am quite specifically not talking about MSSQL or Oracle. Oracle is massively overpriced and no better for most purposes than PostgreSQL. I use Linux almost exclusively, and don't and never have used MSSQL. Further, I've heard from some people whose opinion I trust that MSSQL is a botch compared to MySQL.
In databases, NULL is the value that usually gets inserted when a value is not specified for the contents of a field. Unfortunately, standards documents for SQL don't fully specify the behavior of database queries when NULL values are involved. This means that different database engines handle NULL differently depending on the circumstances.
That makes NULL a problematic value, as consistency is key when databases and programming are involved. So my preference is to avoid NULL whenever possible. Doing that involves creating some rules which will go forward from the design phase of every table.
First, let's tackle the worst offender of all: NULL reference values in dealing with foreign keys. Let's say we have an invoice table, and one field of the invoice table is a customer number. That number refers to a customer number in the customer table. Because of this referential relationship, it's possible to print out an invoice with the actual customer name and address. The number in the invoice table can be used to reach into the customer table and fetch the name and address for printing or listing. That works okay until you, for one reason or another, decide to try to store nothing in the invoice table for the customer number field. A NULL value would normally get inserted there. Except that, if you've specified a constraint that a valid customer number must be stored in the invoice table's customer number field, and there isn't a customer record indexed by a NULL value, your invoice insert queries are going to fail. In fact, even if there is a customer record with a NULL index value, this still might not work. The database engine may not treat the NULL value in the invoice customer number field as being the same value as the NULL customer number value in the customer table.
In this case, there are two options. First, at the outset (upon creation of the database and its tables) don't put referential constraints on foreign keys in. The problem with this is that you're robbing the database engine of one of its key functions: ensuring the referential integrity of every record in the database. It means that it's now up to you to ensure database referential integrity when it comes to your customer and invoice tables. You've got to set up extra queries every time you store an invoice record, in order to make sure that the integrity is enforced. This will slow down your programs.
The second solution for this problem is to (as one of your first actions) to store a bogus customer with a 0 customer number in your customer table, but maintain your built-in constraint. This is tricky, since customer numbers may not be actual numbers at all. They may be strings. In which case the answer is similar: storing some bogus value for the customer "number", like 'TEST', or perhaps even the empty string (''). This might or might not work, depending on the RDBMS.
In any case, the answer for any situation involving referential integrity, is to avoid NULLs. Even if it means you have to do the integrity processing yourself.
Next is the case where NULLs could be stored for the various types. For example, dates. In PostgreSQL (and possibly other engines) it's not possible to store an empty string in a date field. Normally, PostgreSQL wants its dates in ISO 8601 format and quotes (like a string). To wit: '2015-04-02'. It's understandable that PostgreSQL would want a date value quoted, since otherwise it looks like a math value involving subtraction. But the fact that it's quoted might make you believe that a "blank" quoted string would work when you try to store it in a date field. But no. A good example of a date field where you might store nothing is a reconciliation date field for a check table. When you first put in the check, there probably is no reconciliation date. But you'll come back later and update your check records with reconciliation dates. But when you first insert the records, you have to come up with a way of storing nothing or some "signal" or "sentinal" value in the date field. In this case, I'd recommend (for ease of use), allowing NULL as the "blank" value and specifying it as the default value. This is assuming, of course, that you know you're going to have to store "nothing" in that field at some point. In our case, we were talking about a reconciliation date. For a transaction date in the same table, a "blank" value would be forbidden. In that case, do a field constraint of NOT NULL, so that a NULL value couldn't possibly creep in. And don't specify a default value.
That about covers dates. The rule then is, if you anticipate having to store a blank value in a date field, allow and enforce NULL as a default value. Otherwise (if a "blank" value should be forbidden).
Next up, numbers. This includes integers and floating point values. If you anticipate having to store records which have a "blank" value in a numeric field, prefer 0 to NULL. Example from our checkbook program: the amount field on a void check. Normally checks and deposits would have plus or minus numbers in their amount fields, unless the record was a voided transaction. But for a voided transaction, a zero value in the amount field would be a perfectly valid entry. So enforce that. Ensure this is the default value, and specify as a default field constraint a zero value (0). This same logic might apply to the line item quantity on an invoice line item. It's probably difficult to imagine why one would have a line item with a zero quantity of something, but it could happen (I've seen it). So specify 0 and NOT NULL.
So for numbers, in almost all cases I would suggest avoiding NULL and specifying zero or some other sentinel value as the "blank" field default. Maybe a negative number. Maybe the negative maximum integer value for your platform. It's worth noting that for languages like PHP, the contents of database queries are always returned as strings or NULLs, not numerics. So comparing a retrieved number to the maximum (or minimum) integer value would be relatively easy.
Next up, strings. Most of your fields will likely be strings if you're storing data in an RDBMS. That includes CHAR, VARCHAR and the non-standard TEXT typed strings. Let's take these in turn.
CHAR typed fields store a fixed number of characters in a field (as specified in your CREATE TABLE statement. VARCHAR fields store a variable number of characters up to whatever you specify in your CREATE TABLE statement. For example, VARCHAR(255). This means the database will store up to 255 characters in this field and no more. But it also means that if you store less than that maximum number of characters, the field will only contain that many characters. Thus, in the course of many updates, the field may swell and shrink in size, but never exceed 255 maximum characters. The TEXT type (which I never use, is non-standard and which I don't recommend) appears to allow unlimited text storage, up to perhaps some maximum depending on the database engine. I also don't recommend the use of the BLOB type in database design. If you're going to store information which is not text (like a JPEG), store it in a file, not a database table. Store the data in a file and store a filename reference in the database, if you're going to use a database to refer to it. DBMSes were not devised to store rocks, automobiles, frozen foods and posters of beautiful women in them. They were designed to store text, numbers and boolean values in them. Stick to that and you'll be happier, even if you think it's easier to just store JPEGs, PNGs and Word documents in database fields. (See what I did there? I slipped Word documents in as forbidden database content as well. Same advice applies: store Word, spreadsheets or other types of "whole documnents" as files and refer to them in your database with filenames).
So what do we do when faced with storing a "blank" value in a text (CHAR or VARCHAR) field? The most convenient answer is to store a "blank string" (i.e. '') wherever such is needed. If we're talking about a CHAR value with a reasonably small size, I'd recommend storing a series of blanks equal to the size of the field. This is particularly true for single character fields. If you don't want to store a multi-space blank string, I would suggest simply an empty string, i.e. ''. For VARCHAR fields, I would store a blank string ('').
Last up (or at least the last I will comment on) are boolean fields. Some DBMSes don't provide full support for this type. For example, in MySQL, it's really just a tiny integer. So theoretically, you could store the number 18 in the field and get away with it. For SQLite, though the documentation says it's stored as an integer, you can store the letters 't' or 'f' in a boolean field. PostgreSQL will accept the words TRUE and FALSE for values; SQLite will not. I haven't thoroughly researched it, but it may be that the drivers for all these databases will accept TRUE and FALSE (in PHP) values and translate on the back end. I don't know. In light of the fact that the common DBMS choices don't handle the boolean type consistently, it might be wisest to cast this type as a straight integer at table-design time, and be careful what's inserted there at insert or update time.
If you're certain there will never be an "I don't know" case for your boolean field, I suggest disallowing the NULL value. For example, in the case where you have a hosting table with one domain per record and a field of "hostbyus", it should be clear whether you do or don't host a domain or not. In this case, disallow the NULL value (NOT NULL). Only in the case where there's compelling case to be made for the insertion of NULL values should they be allowed.
As you can see from the above, the allowance of NULL is always an option, depending on the circumstances. But it can be problematic in queries, field references and sorting. And because it's handled in different ways depending on the driver, it's best to adopt a standard for field typing which avoids it as much as possible.