Running without a database (Blog Software)

Paul M. Foster (Revised 30 March 2011 23:19:08)

One of the goals of grotblog was to run it without a database engine on the back end. Why? First, I can't think of any earthly reason to store a completely variable, but doubtless huge, block of text in a DBMS, as you would if you were storing blog posts in a database. I personally think that kind of RDBMS use is actually an abuse. If you're going to store a large block of text with markup in it, store it in a file. And if you have to use a database, store a pointer to the file, such as its path. Second, DBMSes slow down response time. Third, blogs see most of their traffic in surfers reading blog posts. They see very little traffic in storing blog posts and such. So if you're not storing blog posts in the database, you're going to see very few actual updates, inserts and deletes in the database. If that's the case, why not go the extra few steps and remove the database altogether?

Grotblog was a test case to see if this could be done. I found it could, for a limited application like this. Don't get me wrong, I'm all in favor of using databases for all kinds of things. But this was a chance to see if we could live without them in certain circumstances.

I still needed to store a variety of information in some sort of table or file. Things like usernames and passwords, details about posts, topic names, etc. So we were talking about a lot of different kinds of information. Normally, if you're going to store things in a flat file, you store like things. That is, all the records in a flat file "database" would have similar record structures. But why be bound to that conventional wisdom? Databases store lots of kinds of records. And COBOL had the concept of "variant records". Could we do that in a flat file? We could, if each record had some sort of indicator in it which would tell what kind of record it was. Each type of record would have its own indicator. And from a preliminary survey, I believed that any record would fit on a single line (no embedded newlines).

So how would we do this? First, we'd need to be able to uniquely identify each record with an ID, like the OID in modern databases. Whether the IDs were sequential or not wouldn't be important, but the fact that they were all unique would be. So we were going to need a separate file which contained a number. Each time we added a record, we'd get the number from that file, use it for our next ID, and increment the value in that file.
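A minimal sketch of that counter file might look like the following. The filename and function name here are my own inventions, not necessarily what Grotblog uses, and I've added an exclusive lock so two simultaneous inserts can't grab the same ID:

```php
<?php
// Fetch the next unique record ID from a counter file,
// incrementing the stored value under an exclusive lock.
function next_id($path = 'next_id.txt')
{
    $fp = fopen($path, 'c+');    // create if missing, don't truncate
    flock($fp, LOCK_EX);         // block other writers until we're done
    $id = (int) stream_get_contents($fp);
    if ($id < 1) {
        $id = 1;                 // fresh (empty) counter file
    }
    ftruncate($fp, 0);           // replace the old value...
    rewind($fp);
    fwrite($fp, (string) ($id + 1));  // ...with the incremented one
    flock($fp, LOCK_UN);
    fclose($fp);
    return $id;
}
```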

Next, we'd need a "segment ID" (a term borrowed from EDI) or "key" which would indicate the type of record being stored. Oh, and we'd need some way to separate fields within a record -- a character which wouldn't be likely to show up in the fields themselves. A good character for this would be the caret (^). If we saw this character in field input, for some reason, we could translate it into its HTML numeric entity, &#94; (94 decimal, 5E hex). Would we run into this character a lot? Not likely. We'd be storing numbers, hashes, blog post titles and such. So we should be okay there.
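The escaping could be as simple as a pair of str_replace() calls (the function names here are mine, for illustration):

```php
<?php
// Replace any literal caret in field input with its HTML numeric
// entity, so the caret stays free to act as the field delimiter.
function escape_field($text)
{
    return str_replace('^', '&#94;', $text);
}

// Reverse the translation when reading a field back out.
function unescape_field($text)
{
    return str_replace('&#94;', '^', $text);
}
```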

Now to specify each of the record types we'd need, their anatomy in terms of fields, and what their "segment IDs" would be.

I could have done this with XML, but as I've said before, I'm not a fan of the overhead represented by XML. JSON would have worked too, but I wanted to see if I could do without this type of thing. This meant that, rather than store the field name with every field, I'd have to rely on the placement of the field within the record. In other words, field placement would be incremental. First field would be phone number. Second field would be address. Third field would be city. Etc.

To make this easier, we could add define() statements for each of the fields in each record. For example:

define('AUTHOR_LOGIN', 2);
define('AUTHOR_PASSWORD', 3);
define('AUTHOR_NAME', 4);
define('AUTHOR_EMAIL', 5);
define('AUTHOR_NONCE', 6);
define('AUTHOR_LEVEL', 7);

Notice we start the index at two. That's because the first field in every record is its record ID, and the second field is the "segment ID". These two would be universal and go above all the others.

define('ID', 0);
define('SEGID', 1);

Each record type would have its own set of define() directives. We would put them in some include file or at the top of the class file which defined the "database" functions for our flat file system.
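Putting the pieces together: a stored line, split on the caret, becomes a positional array, and the constants make the positions readable. The "AUTH" segment ID and the field values below are invented for illustration:

```php
<?php
define('ID', 0);
define('SEGID', 1);
define('AUTHOR_LOGIN', 2);
define('AUTHOR_PASSWORD', 3);
define('AUTHOR_NAME', 4);
define('AUTHOR_EMAIL', 5);
define('AUTHOR_NONCE', 6);
define('AUTHOR_LEVEL', 7);

// One line from the flat file, exploded into a positional array.
$line = "17^AUTH^bob^5f4dcc3b^Bob Smith^bob@example.com^a3f9^1";
$record = explode('^', $line);

echo $record[AUTHOR_LOGIN];  // bob
echo $record[AUTHOR_EMAIL];  // bob@example.com
```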

So far, so good. Now we needed to define our "database" class. I called this class the "variable record length flat file database" class. VFDB for short. For this type of application, I didn't need a full SQL language. Besides, if I had to do that, I might as well use PostgreSQL or MySQL (or SQLite).

Now, while I'm not a big fan of the "active record" paradigm, here was a case where imitating it with some method chaining might be just right. The idea was that you could issue a call like:

$db->from('AUTHOR')->where(AUTHOR_LOGIN, 'bob')->get();

In writing the methods for this class, I wasn't entirely sure I could pull this off. As I was writing the model methods for posts, comments, topics, etc., I would write a new VFDB method as needed. It turned out I really needed a very limited set of query and command functions.
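Grotblog's actual method set may differ, but a minimal sketch of how that from()/where()/get() chaining can work over in-memory arrays looks something like this (the class internals and table layout are my assumptions):

```php
<?php
// Minimal sketch of a chainable query interface over
// in-memory arrays of positional records.
class VFDB
{
    private $tables;            // segment name => array of records
    private $segid;             // table selected by from()
    private $filters = array(); // field index => required value

    public function __construct(array $tables)
    {
        $this->tables = $tables;
    }

    public function from($segid)
    {
        $this->segid = $segid;
        $this->filters = array();  // fresh query
        return $this;              // returning $this enables chaining
    }

    public function where($field, $value)
    {
        $this->filters[$field] = $value;
        return $this;
    }

    public function get()
    {
        $results = array();
        foreach ($this->tables[$this->segid] as $record) {
            foreach ($this->filters as $field => $value) {
                if ($record[$field] !== $value) {
                    continue 2;    // record fails a filter; skip it
                }
            }
            $results[] = $record;
        }
        return $results;
    }
}
```

With an AUTHOR_LOGIN constant defined as 2, the call from the article -- `$db->from('AUTHOR')->where(AUTHOR_LOGIN, 'bob')->get()` -- would return every author record whose third field is 'bob'.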

One last question remained. The idea here was that, since the data stored in the database was very limited, there was no reason not to hold it all in memory. So the question was, when would we need to flush the contents to disk? The answer was that we'd need to flush the data to disk after every update, insert or deletion. In short, every time we changed the contents of the "tables" in the database, we'd have to synchronize the on-disk representation of the data with what was in memory. Otherwise, simultaneous visitors to the website wouldn't get the right data.

One more issue that came up was how to store the data in memory. I could have stored the data in memory as lines of text with field delimiters, just as it was on disk. But that would require a lot of parsing as we used the data. Better to store the data as arrays, since array operations in PHP are pretty straightforward and fast. Thus, the only time we needed to actually parse the database would be when we read it off of disk (and into the arrays), and when we had to write it back to disk (from our internal arrays to lines of text).
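Those two parse points -- disk to arrays on load, arrays to disk on flush -- might be sketched like this (function names and the grouping-by-segment layout are my assumptions):

```php
<?php
// Parse the on-disk flat file into arrays of records,
// grouped by each record's segment ID (field 1).
function vfdb_load($path)
{
    $tables = array();
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach ($lines as $line) {
        $record = explode('^', $line);
        $tables[$record[1]][] = $record;
    }
    return $tables;
}

// Flush the in-memory arrays back to disk as caret-delimited
// lines; called after every insert, update or delete.
function vfdb_save($path, array $tables)
{
    $lines = array();
    foreach ($tables as $records) {
        foreach ($records as $record) {
            $lines[] = implode('^', $record);
        }
    }
    file_put_contents($path, implode("\n", $lines) . "\n", LOCK_EX);
}
```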

So that was about it. I wrote the VFDB code as I went along, and it all worked out pretty well: it's fast, and in testing it ran without problems.

I should add one caveat: In adding records to the database, I did not and do not check to ensure the programmer is storing the right number of fields in the right order for a specific record type. This is not a DBMS. It's up to the programmer to make sure they're doing the right thing. I had to do the same thing when I was developing the methods for the model -- make sure I was storing the right things in the right places.

The VFDB code is now part of Grotblog, and available when you download the project at SourceForge. Will I use it in other projects? I might. It's tested code, and it might work in a lot of cases where I don't need full relational database support.