I've been struggling for a while now with my dislike of XML, and with whether I could effectively make use of JSON instead. Two things I've been reading lately have converged in my mind into a realization that for me at least, XML has become irrelevant.
One of them is a description of the workings and philosophy of a new database system called CouchDB. The other is a thread on the XRI TC mailing list concerning the signing of XRDs, in which XML DSig has once again reared its ugly head.
XML DSig is a good example of one of the things that sucks about XML. Because applications can lawfully munge any XML stanza in myriad ways, and because it's desirable to be able to pass along a cryptographic signature over the said XML stanza, the signature must be taken over a canonicalized version of the XML in question, so it can later be compared with a re-canonicalized version of the potentially munged end product. The rules for canonicalizing XML, although apparently well loved by academics, standards-body geeks and other merchants of complexity, are enough to make developers gnash their teeth and rend their clothing.
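To make the problem concrete, here is a minimal sketch using the canonicalizer in Python's standard library (`xml.etree.ElementTree.canonicalize`, Python 3.8 or later, which implements C14N 2.0). The `xrd` and `link` element names are made up for illustration and are not taken from any real XRD. It shows two lawful serializations of the same stanza that differ byte for byte, so a signature over the raw text would not match, yet they canonicalize to the same string.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same stanza (hypothetical element names).
# Attribute order, spacing, and empty-element syntax all differ.
original = '<xrd id="1" type="demo"><link/></xrd>'
munged = '<xrd   type="demo" id="1"><link></link></xrd>'


def canonical(xml_text: str) -> str:
    """Return the C14N 2.0 canonical form of an XML string."""
    return ET.canonicalize(xml_data=xml_text)


# The raw strings differ, but their canonical forms should be identical,
# which is what lets a signature survive the munging.
same_after_c14n = canonical(original) == canonical(munged)
```

This covers only the easy cases. Namespaces, comments, and whitespace inside text nodes are where the canonicalization rules get painful.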
It looks like some progress is being made towards including XML canonicalization in the ubiquitous libxml2 library, but too late for me I'm sorry to say - I just don't care anymore!
The light bulb went on in my head when I read that CouchDB was a schema-free database.
Because I've been working with relational databases for many years, I've understood schema to mean basically the names and datatypes of the database's table columns. That's straightforward, and it's not meant to be useful outside of the management of the database itself.
But with XML, schema means something subtly different. XML is meant to carry communication across applications, and an XML schema, besides the list of element names and their datatypes, is generally understood to carry, or at least to have imposed on it, the semantics of those elements. Standards bodies like OASIS exist to create and codify XML semantics for various domains.
But an XML schema does not and cannot carry any semantic information in and of itself. There is implicit in the semantic web concept a notion that at some point machines will be able to make use of the schema doc to "understand" the XML payload. This is the AI bait and switch, so elegantly examined by Steve Talbott in his book Devices of the Soul. First program the machine to carry out some simple interaction, then make wild extrapolations from this, all the while carefully ignoring the fact that the interaction is actually, if asynchronously, with the programmer of the machine, not the machine itself.
So in practice the schema URI can reassure a developer that the information domain she is expecting is the correct one, but the schema itself is only useful, during the information processing stage, for the somewhat self-referential verification that the XML content conforms to the schema. And if it doesn't? Postel's Law and developer practice suggest that you'll probably use it anyway if you can.
In order to use the information being transmitted by the XML, you have to know and understand the information domain. You have read the documentation for the information domain, or otherwise acquired that knowledge, and you proceed to parse the XML into a data structure in your programming language of choice and make use of that data structure to achieve your aims.
Switching focus for a moment, let's consider CouchDB. CouchDB is a highly efficient object database supported by the Apache Software Foundation. They prefer to call it a "document database", the documents in question being serialized JavaScript objects, i.e. JSON. Because JSON primitives are a lowest common denominator for almost any other object-oriented language, JSON can be easily and efficiently transposed into and out of other languages, and libraries for this purpose exist in all the currently common languages.
To use JSON as a replacement for XML, you would have to know the semantics of the XML elements, and convert those element tags into JSON keys, with the contents of the tags becoming JSON values. But because JSON does not have to be parsed in the XML sense - it's already a data structure - you can just call up the keys you're looking for and use their values. If there are other keys in there that you don't know about, you can just ignore them. Or, you know, go find out what they mean.
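A rough sketch of that conversion in Python, assuming a flat XML stanza with made-up element names (`person`, `name`, `email`, `nickname` are all hypothetical). Nested elements, attributes, and repeated tags would need more thought than this shows.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical flat XML stanza; the element names are illustrative only.
xml_text = (
    "<person>"
    "<name>Ada</name>"
    "<email>ada@example.org</email>"
    "<nickname>al</nickname>"
    "</person>"
)


def xml_to_dict(text: str) -> dict:
    """Map each child tag to a key and its text to the value (flat documents only)."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


# Element tags become JSON keys, tag contents become JSON values.
json_text = json.dumps(xml_to_dict(xml_text))

# A consumer that only knows about "name" and "email" just asks for them
# and never has to care that "nickname" is in there too.
record = json.loads(json_text)
name = record["name"]
email = record.get("email")  # .get() returns None if the key is absent
```

The consumer never walks the document from top to bottom. It looks up the keys it understands and leaves the rest alone.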
This is what is meant by a "schema-free" data structure. It is organized as a key:value dictionary or hash, rather than the top to bottom spatial organization of XML. Keys are randomly accessed, and can be added or taken away without disrupting the lawfulness or usability of the other keys.
CouchDB internals use a modified B-tree structure that is extremely fast, fault tolerant and corruption resistant. Input and output are via HTTP REST so it's a piece of cake to access the database from any programming environment, and you can write and store JavaScript functions that it will use internally to index, select, sort and format data. Replication is built in at a low level.
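As an illustration of the HTTP side, here is a sketch that builds the two most basic requests with nothing but Python's standard library: a PUT to store a document under an ID and a GET to read it back. The server address (a local CouchDB on its default port 5984), the `contacts` database name, and the helper function names are all assumptions for the example. Authentication, which a real installation will usually require, is left out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5984"  # assumption: local CouchDB, default port
DB_NAME = "contacts"                # hypothetical database name


def put_document(doc_id: str, doc: dict) -> urllib.request.Request:
    """Build the request that stores `doc` as JSON under `doc_id`."""
    return urllib.request.Request(
        f"{BASE_URL}/{DB_NAME}/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def get_document(doc_id: str) -> urllib.request.Request:
    """Build the request that fetches the document stored under `doc_id`."""
    return urllib.request.Request(f"{BASE_URL}/{DB_NAME}/{doc_id}", method="GET")


# To actually send a request against a running server:
#   with urllib.request.urlopen(put_document("ada", {"name": "Ada"})) as resp:
#       print(json.load(resp))
```

The functions only construct the requests, so the sketch can be read and checked without a server. Any language with an HTTP client and a JSON library can do the same.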
Oh, and canonicalizing the JSON for signing? Mostly just remove non-required whitespace, an operation for which there are a variety of tools available since that's also a JavaScript compression technique; if the keys might get reordered along the way, sort them too, since the order of keys in a JSON object carries no meaning. Given the advantages of CouchDB, I'm having a hard time finding any reason to use anything but JSON as a data interchange format.
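For what it's worth, here is roughly what that looks like in Python. This is a sketch under a few assumptions, not a standard. It strips whitespace and also sorts the keys, because a JSON library is free to write them back in a different order. It uses an HMAC with a shared secret as a stand-in for whatever signature scheme you actually want. The function names are my own, and it makes no attempt to normalize different spellings of the same number (`1.0` versus `1`, say).

```python
import hashlib
import hmac
import json


def canonical_json(text: str) -> bytes:
    """Re-serialize JSON with sorted keys and no optional whitespace."""
    obj = json.loads(text)
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(text: str, key: bytes) -> str:
    """HMAC-SHA256 over the canonical form (a stand-in for a real signature)."""
    return hmac.new(key, canonical_json(text), hashlib.sha256).hexdigest()


# The same document, pretty-printed and with its keys in a different order.
compact = '{"name": "Ada", "email": "ada@example.org"}'
pretty = '{\n  "email": "ada@example.org",\n  "name": "Ada"\n}'

key = b"shared-secret"  # hypothetical key, for illustration only
signatures_match = sign(compact, key) == sign(pretty, key)
```

Both texts reduce to the same bytes before signing, so the signature survives reformatting and key reordering.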