Noferblatz

Why I dislike XML
(Programming)

noferblatz (13 March 2011 06:36:40)

Some years ago, XML began taking over as the format of choice for files which were not otherwise stored in RDBMSes. For example, the native storage format for transactions in KMyMoney is XML. It’s a format used all over the place. It has some advantages, especially now that it has garnered such widespread use. First, it is theoretically human-readable, unlike other flat file formats and RDBMS files. Second, it makes use of existing experience and expertise in SGML and HTML, though its rules are a bit less strict and complex than those of SGML (from which it came). Third, it has near infinite flexibility, allowing various types of “records” and all manner of nesting of records within documents. Fourth… well, you get the idea. It was way kewl.

Those of you who have been around a looong time will recall debates about the wisdom of using HTML as a formatting option for email. When email was first conceived, it was unformatted text. Users found various ways to “spice up” and convey special meaning to words and phrases, by surrounding them with special punctuation marks, like asterisks (*) around words for emphasis. Then Microsoft and others started doing email, and ushered in the era of email which was formatted and readable as HTML. The objection to this came mainly from email lists, where the extra “baggage” of HTML formatting was seen as a waste of space and bandwidth on busy mail servers. In addition, many users were still using console-based email clients which couldn’t render HTML properly.

But that primary objection to HTML in email is the same one I have against XML for file formats. Each entity in an XML file is surrounded by some tag or associated with an explicit attribute which is supposed to impart context and meaning to the value it’s associated with. As a result, the size of any XML document can easily be three or more times the size of the document containing the values themselves. This is just a waste.

I know, I’ve had this argument countless times with people, particularly on the PHP mailing list. The counterargument is that CPU, disk and memory are cheap, and besides, there’s always caching or this or that to compensate. And while this is true, it misses the point. Yes, of course you can flog CPUs and disks with vast amounts of extra bytes, but does that mean you should? I can’t see a compelling reason to waste resources when you don’t have to. That doesn’t mean we should all program in assembler. But there’s also no reason to throw 100K bytes at something when 1000 bytes will do the job as well. (But then, what do I know? My first five cars were 60s era Volkswagens because they were cheap, simple, easy to work on and didn’t cost much to run.)

The second most important objection I have to XML is that it’s supposedly human-readable. Ever actually look inside XML documents? Human-readable is not something they are. Yes, they are straight text. But there are codes in there which are only understandable to the programmers who write the files and computers which read them. And in fact, there is little reason to even have human-readable files in the first place. How often will you be called upon to dig in to an XML file and figure out what it means? I’m guessing never. Data files like that are meant for computers to generate and computers to read. Humans will seldom have any reason to ever look at those files. So the supposed “advantage” of XML files being human-readable is just so much vapor.

Now, I can’t argue with the flexibility of XML. It really is massively flexible. Graphically, it is represented as a two or three dimensional tree, and there’s almost nothing of any real world value which can’t be represented that way. And XML does leverage existing expertise in SGML and particularly, HTML. That is, the same techniques and technology used to parse HTML can be utilized to parse XML.

Now, let me shift gears for a minute. Most of you probably know of something called “EDI”, even if you don’t know exactly what it is. So I’ll explain. For those who already know about EDI, forgive any oversimplifications.

Before XML was in use, there was a need for customers, vendors and trading partners to communicate and exchange data electronically. As a result, various standards and authorities were conceived to facilitate this type of “electronic data interchange”. Committees met, standards documents were drafted, and various companies were created to facilitate the transmission of EDI documents and to encode and decode them. Things like purchase orders and invoices were regularly transmitted between trading partners using EDI standards. I haven’t checked in quite a few years, but I suspect there is still a fair amount of EDI traffic taking place, even now that we have readily available email and such.

EDI was pretty flexible. It was made by standards committees to be so. Each type of document (for example, a purchase order) was designed to be able to express whatever variations could be conceived for that type of document. It was also a very concise and compressed format. Not compressed in the sense of zip compression and the like, but compressed in the sense that there was very little unnecessary information passed along inside an EDI document.

Now, there were a couple of major drawbacks to EDI. First, the “template” for a given type of document was only available from standards entities, and then only for a sometimes stiff fee. So, for example, if you wanted to transmit invoices to one of your trading partners, sooner or later you were probably going to have to pony up some money to purchase the standards document for invoices. Moreover, the standards had so much flexibility built in that even if all your trading partners followed the standard, the documents they sent might still differ in significant ways. This meant that you couldn’t just write one block of code for all trading partners. Bringing on a new trading partner meant that you had to look at a few of their documents first, in order to determine how they differed from your idea of what the standard said. Meaning a slightly different block of code for this trading partner.

So EDI wasn’t perfect. But it had some advantages. And one of them, like I mentioned before, was that it was quite “sparse”. Only the data you needed was transmitted. For example, here’s what an EDI document might look like inside:

ST*810*940002
BIG*941208*342043*941208*03228***DR
REF*RF*342043
REF*VR*775
N1*SE*WARNER MANUFACTURING*12*612-559-4740
N3*13435 INDUSTRIALPARK BLVD.
N4*MINNEAPOLIS*MN*55441
N1*BY*ALLPRO CORPORATION
N2*000562
N3*3014 US HWY 301 N.*SUITE 200
N4*TAMPA*FL*33619**SN*000562
N1*ST*RAINBOW PAINT & DEC
N3*5200 CAHABA RIVER RD
N4*BIRMINGHAM*AL*35243
ITD*14*3*2****30*****2% 10 DAYS NET 30
DTM*011*941208
IT1**50*EA*2.38**SW*3290
CTP***2.38
PID*F****GUIDE PAINT WLCVG ALLPRO 80030
REF*ZZ*119
IT1**30*EA*0.68**SW*7291
CTP***0.68
PID*F****PUTTY KNIFE,1 1/2F ALLPR 80182
IT1**20*EA*1.24**SW*7297
CTP***1.24
PID*F****BROAD KNIFE 6 ALLPRO     80186
IT1**10*EA*1.71**SW*7338
CTP***1.71
PID*F****SCRAPER, PLST 4 ED ALLPR 80731
IT1**120*EA*0.84**SW*7303
CTP***0.84
PID*F****HOOK POT SWIVEL ALLPRO   80411
IT1**40*EA*0.51**SW*7392
CTP***0.51
PID*F****GUIDE, PAINT 15   ALLPRO 80432
IT1**10*EA*3.72**SW*7855
CTP***3.72
PID*F****KNIFE, TAPING 12  ALLPRO 80235
IT1**20*EA*1.74**SW*430
CTP***1.74
PID*F****SHIELD SET 3 PIECE
IT1**4*EA*31**SW*104
CTP***31
PID*F****SCRAPER, SAFETY, 50 W/BUCKET
IT1**20*EA*8.6**SW*499
CTP***8.6
PID*F****MIXER, 5 GAL HD 5 TUBES/BOX
TDS*60345
CAD*****PREPAID
CTT*10
SE*51*940002
GE*2*94
IEA*1*000000094

Unless you’ve seen EDI before, it probably looks like gobbledegook to you. But it’s actually a real invoice, with addresses, terms, line items, costs, tax, totals and the like. Each line is called a segment and each field within a segment (separated by asterisks) is called an element. Each line begins with a segment identifier which specifies what sort of elements will be in that segment and in what order. Some segments also have as their first element an ID value which says that, yes, this is an address line, but more specifically it is an address line for the shipper. Looping is allowed under certain circumstances, so you can see above that certain sequences of segments repeat over and over (they’re line items, in case you’re curious).
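To make the segment and element idea concrete, here is a minimal Python sketch that splits a few of the lines above into segment identifiers and elements. The function name parse_segments is my own invention, and the sketch assumes one segment per line with asterisks as separators, as in the sample; a production EDI parser would have more to deal with than this.

```python
def parse_segments(text, element_sep="*"):
    """Split EDI-style text into (segment_id, elements) pairs.

    Assumes one segment per line and a fixed element separator,
    which matches the sample invoice but is a simplification.
    """
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        seg_id, *elements = line.split(element_sep)
        segments.append((seg_id, elements))
    return segments


# A few segments copied from the invoice above.
sample = """N1*ST*RAINBOW PAINT & DEC
N3*5200 CAHABA RIVER RD
N4*BIRMINGHAM*AL*35243
IT1**50*EA*2.38**SW*3290
PID*F****GUIDE PAINT WLCVG ALLPRO 80030"""

for seg_id, elements in parse_segments(sample):
    print(seg_id, elements)

# Empty elements are kept as empty strings, so positions stay meaningful:
# the item description is the fifth element (index 4) of the PID segment.
pid_elements = parse_segments(sample)[4][1]
print(pid_elements[4])
```

Note that the empty elements are kept rather than dropped. That is the whole trick: the position of a value within its segment is what tells you what the value means.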

Now, what’s missing from this which would be there if this were an XML document? The “human-readable” explanation of what each and every field is. And here’s why: you don’t need it. You know that the fifth element of a PID segment is an item description. That’s what the standard says it should be, and that’s what it is. And the first element of an N3 segment is a street address. In some ways, it’s like “place value” in numbers. You know that the third digit from the right in a base 10 number represents the number of hundreds. That’s what the “standard” says. And that’s one of the major ways in which an EDI document could be considered “sparse” or “compressed” compared to an XML document. Just for fun, here’s what a number might look like if subjected to XML formatting:

<number>
<thousands>1</thousands>
<hundreds>2</hundreds>
<tens>3</tens>
<ones>4</ones>
</number>
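If you want to put a figure on the overhead, a rough Python sketch like the following will do it. It builds the same number as XML using the standard library's xml.etree.ElementTree and compares the length with the plain "place value" form. The helper name to_xml is made up for illustration, and the XML is written without line breaks or indentation, so if anything this flatters XML.

```python
import xml.etree.ElementTree as ET


def to_xml(tag, fields):
    """Wrap each (name, value) pair in its own child element."""
    root = ET.Element(tag)
    for name, value in fields:
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


digits = [("thousands", "1"), ("hundreds", "2"), ("tens", "3"), ("ones", "4")]

xml_text = to_xml("number", digits)
positional = "".join(value for _, value in digits)  # just the values, in order

print(xml_text)
print(positional)
print(len(xml_text), "characters as XML versus", len(positional), "positional")
```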

Now, what if we combined the best of both XML and EDI in one sort of standard? EDI is plagued by the lack of a “template” carried along with each document; you have to purchase that from the standards committee. XML doesn’t usually contain its template (a DTD or schema) within the document either; it’s external, and often referenced at the beginning of an XML document. You’ve probably seen what this kind of XML template looks like. These days, most HTML pages reference some external template at the very beginning of the HTML page. So why not put that “template” in the beginning of each document? And then, instead of accompanying each attribute and value in the file with an explanatory tag or attribute name, use the “place value” assets of EDI to indicate the significance of each value, without having to explicitly state it?

The result is a template which travels along with the document, and allows it to be interpreted by any program familiar with the language of the templates, and a document containing all the information you need, but none of the redundant explanation which bloats XML documents.
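Here is one way such a thing might look, as a rough Python sketch and nothing more. Everything about the format is an assumption of mine: template lines start with "!" and name the elements of a segment once, and the data lines that follow carry only values, EDI style. The field names (city, quantity, unit_price and so on) are invented for illustration and are not taken from any EDI standard.

```python
def decode(text, sep="*", template_mark="!"):
    """Decode a hypothetical template-plus-positional document.

    Lines starting with template_mark declare the element names for a
    segment; every other line is data, interpreted by position.
    """
    templates = {}
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(template_mark):
            seg_id, *names = line[len(template_mark):].split(sep)
            templates[seg_id] = names
        else:
            seg_id, *values = line.split(sep)
            names = templates[seg_id]  # KeyError if no template was declared
            # Pair names with values by position; leave out empty elements.
            record = {n: v for n, v in zip(names, values) if v != ""}
            records.append((seg_id, record))
    return records


# The template appears once at the top; the data below stays sparse.
document = """!N4*city*state*postal_code
!IT1*line_no*quantity*unit*unit_price
N4*TAMPA*FL*33619
IT1**50*EA*2.38
IT1**30*EA*0.68
"""

for seg_id, record in decode(document):
    print(seg_id, record)
```

The cost of the explanation is paid once per segment type rather than once per value, so the overhead shrinks as the document grows.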

XML is a language which strikes me as something dreamed up by a committee in much the same way as the computer language Ada was. And you all know the jokes about Ada. I suspect if we’d left it up to someone with more attention on conserving computer resources, we would have an XML which is a lot more compact and easier to decode. In the meantime, I think it might be an interesting exercise to devise something along the lines of what I’ve described above. I’m in the middle of too many projects at the moment, but you’re welcome to take a stab at it. I’d like to see what you come up with.