TATWD on Nginx

January 1st, 2010

This blog is now running on Nginx instead of Apache, lean and speedy and using about one quarter of the memory that Apache took.

Configuration was fairly straightforward and, once you get used to it, a lot cleaner than the rather rococo Apache config files.

There was one little surprising thing though, having to do with Passenger, that I discovered by accident while troubleshooting the Nginx config. I was getting errors which on examining the logs seemed to be caused by the server not having permissions to write to the site cache. It was then that I discovered that the Passenger ApplicationSpawner was running as "nobody" instead of the expected user.

Now, when you run mongrels you set the application user in the mongrel config, but with Passenger I had never given a thought to how the user was set -- it just seemed to do the right thing. Turns out that Passenger by default runs the application as whichever user owns the config/environment.rb file. Except, that is, when that file is owned by root, which in the course of my moving things around it somehow now was. In that case Passenger runs the app as nobody. A quick chown and restart and all was well.
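If you want to check this before the errors show up in your logs, something like the following would do it. It's only a sketch: the `passenger_run_user` helper is a name I made up, not anything Passenger ships, and all it does is mirror the rule described above (owner of config/environment.rb, falling back to nobody when that owner is root).

```ruby
require 'etc'

# Hypothetical helper (not part of Passenger): guess which user the app
# would run as, based on the rule described in the post.
def passenger_run_user(app_root)
  env_file = File.join(app_root, 'config', 'environment.rb')
  uid = File.stat(env_file).uid

  # A root-owned environment.rb means the app runs as "nobody".
  return 'nobody' if uid.zero?

  # Otherwise it is the file's owner; fall back to the numeric uid if
  # the system has no passwd entry for it.
  entry = Etc.getpwuid(uid)
  entry ? entry.name : uid.to_s
end
```

Call it with the path to the Rails app, e.g. `passenger_run_user('/path/to/app')`, and compare the answer with the user you expected.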

I guess this feature makes sense, if you think about it (too much). But it definitely violated the principle of least surprise for me. Hope this helps someone else out there.

Recently a friend sent me a link to the FOAF+SSL site. This page and its linked articles were an entertaining journey into virtuoso hand-waving. A protocol that depends on having all of your users create FOAF files, um, somewhere, and then generate self-signed certificates and install them in their browsers? What user population did you have in mind exactly?

As serendipity would have it, right afterwards I came across an article entitled "Anti-Social Networks" by John Shade. It's a PDF, so you have to download it and then turn to page 38. Shade is a funny guy, and his article skewers the kind of thinking that went into FOAF+SSL. Here's my favorite paragraph:

"It used to be enough to make the software work. But when software is all about human-human interaction, the goal becomes to make the human-human interaction work. And its worse than that, because social software is not about individual users. You have to understand groups, which, it turns out, can't be done by understanding an individual user and iterating."

This of course got me thinking. Everyone wants to build the next big thing. No one seems to be asking whether we really need another big thing.

One capability of the internet that has been celebrated from its inception is that it makes it possible for anyone in the world to connect to anyone else. The first big thing that exploited this was email. The most recent big things are large social networks like MySpace and Facebook. They have their place, but they also create a big problem -- when anyone in the world can connect to you, anyone does.

People often try to filter this problem by creating small ad hoc groups within the open space, and various social software systems facilitate this with varying degrees of transparency and privacy. Smaller networks like LivingDirectory and Ning formalize group creation within their respective networks.

My little epiphany was that this network within which groups are created is an unnecessary construct. Even if a single web service has created many different groups, each group has its own identity and does not need to partake of an artificially created enclosing identity. A group's identity is formed by its stated purpose, its history and its participants and their contributions to the group. Since it is possible to move all of these things from one software host to another, the identity of the group does not of necessity have any relationship to its current host, any more than it does to the current hardware it is running on.

The whole notion of user-centric identity has been fraught from the start. The thing is, to be a "user" you have to be a user of something. The something of which you are a user is as much a part of the online identity created as you are. This is imho the misdirection facing efforts such as the Data Portability Project.

Your identity in what we are pleased to call the "real world" is based on your physical body. It came into being when you were born, and it will cease when you die. You can have many identity documents, but they presumably all point to one human individual. You only get one body (at a time). Online identities however are disembodied, and you can have as many as you like, but you probably have to share them with the group they exist within.

Stand-alone groups only need a few rules and a protocol. Some rules needed are: who can join and how; authorizations required for different kinds of access; how content moderation happens. The protocol must define how groups can affiliate with each other to share data if they wish, and a serialization standard for the group data and rules so they can be moved from one host to another.
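To make the serialization idea a little more concrete, here is a rough sketch of what a portable group might look like. Everything in it is an assumption for illustration: the field names, the rule keys and the choice of JSON are mine, not part of any existing standard.

```ruby
require 'json'

# Hypothetical shape for a portable group. None of these field names
# come from an existing protocol.
Group = Struct.new(:purpose, :members, :rules, :affiliates) do
  # Serialize the group's data and rules so another host can load them.
  def to_portable
    JSON.generate(
      'purpose'    => purpose,
      'members'    => members,
      'rules'      => rules,
      'affiliates' => affiliates
    )
  end

  # Rebuild a group from its serialized form.
  def self.from_portable(text)
    data = JSON.parse(text)
    new(data['purpose'], data['members'], data['rules'], data['affiliates'])
  end
end

group = Group.new(
  'Neighborhood tool library',
  ['alice', 'bob'],
  { 'join' => 'invitation', 'moderation' => 'any two members' },
  []
)

# Moving hosts: serialize on the old host, load on the new one.
moved = Group.from_portable(group.to_portable)
```

The point of the sketch is only that nothing in the serialized form mentions the host it came from.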

The ability to affiliate means that a group doesn't have to choose between being a virgin or a slut, i.e. a walled garden or globally available.

Shade uses the phrase "anti-social network" ironically. I'm going to appropriate it to mean (slightly less ironically) networks that don't want to connect to the whole world, just to their own and affiliated participants.

Like most tech people these days, when I need to know how to do something, the first thing I do is just google it.

This almost always works, and I feel gratitude for the software documentation gift economy.

On those occasions when I can't find an answer that way and I have to, you know, figure it out myself, I'd like to pay it forward and make my discoveries potentially available to fellow googlers, so I've started a new section called Tech Notes for that purpose.

My first Tech Note is about DataMapper sessions in Rails.

I'm using DataMapper/MySQL with this blog software for session management. CouchDB is great for storing schema-less mostly-read and only occasionally-write data. But for something like session management (which I need for the admin section), where you are writing to the data store with each access, and where Eventual Consistency is not good enough, CouchDB is not a good fit.

Of course, I could have easily used ActiveRecord sessions, but this whole blog software project is about teaching myself things I want to learn, and I've been liking what I see in DataMapper.

There's plenty of stuff out there about how to get DataMapper to work in Rails, and using DataMapper for your Rails sessions is easy -- in config/initializers/session_store.rb do:

ActionController::Base.session_store = :data_mapper_store

But how to expire old sessions so that they don't remain forever in the sessions table? I couldn't find out anything about that on the google. With ActiveRecord the standard advice is to run a cron job that looks something like:

@hourly /path/to/app/script/runner \
'CGI::Session::ActiveRecordStore::Session.destroy_all( \
["updated_at < ?", 12.hours.ago ] )' -e production

I couldn't figure out anything like that which would work with DataMapper sessions. DataMapper does however make grabbing a bare metal DB connection really easy, so what I finally came up with was a Rake task like this:

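namespace :session_sweeper do
  desc 'Remove stale sessions from DB'
  task :sweep => :environment do
    # Sessions untouched for more than an hour count as stale.
    cutoff = (Time.new - 3600).to_formatted_s(:db)
    # Collect the ids of the stale sessions over the raw DB connection.
    sess_ids = repository(:default).adapter.query(
      "SELECT id FROM sessions WHERE updated_at < '#{cutoff}'")
    # Delete them in one statement; skip the DELETE if nothing is stale.
    unless sess_ids.empty?
      repository(:default).adapter.execute(
        "DELETE FROM sessions WHERE id IN (#{sess_ids.join(',')})")
    end
  end
end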

Then I call it from a cron job like so:

@hourly /usr/bin/rake -f "/path/to/app/Rakefile" \
session_sweeper:sweep >/dev/null 2>&1

Hope that helps someone else out there. If you have a better idea, you can reach me with the Contact link over on the right.
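One possible variation, offered as a sketch rather than something I've run against this setup: build a single DELETE with a bind value, so no SELECT is needed first. The `stale_session_delete` helper name is made up, the table and column names are taken from the Rake task above, and whether your DataMapper adapter's execute accepts bind values is an assumption you should check for your version.

```ruby
# Hypothetical helper: returns the SQL and the bind value for removing
# sessions older than max_age_seconds, without a separate SELECT.
def stale_session_delete(now, max_age_seconds)
  # Same "YYYY-MM-DD HH:MM:SS" layout that to_formatted_s(:db) produces.
  cutoff = (now - max_age_seconds).strftime('%Y-%m-%d %H:%M:%S')
  ['DELETE FROM sessions WHERE updated_at < ?', cutoff]
end

sql, cutoff = stale_session_delete(Time.utc(2010, 1, 1, 12, 0, 0), 3600)
# sql    => "DELETE FROM sessions WHERE updated_at < ?"
# cutoff => "2010-01-01 11:00:00"
```

Inside the task the pair would then be handed to the adapter, along the lines of `repository(:default).adapter.execute(sql, cutoff)`, under the assumption noted above.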

TATWD on CouchDB

October 19, 2009

Until just now, this blog was running on a somewhat ancient version of Mephisto, which is a perfectly fine Ruby on Rails based blogging software. However, the company where it was hosted decided to upgrade their version of Ruby, and this broke the version of Mephisto/Rails that I was using. The blog was still visible, but I was unable to log into the admin interface to add any new posts.

Sigh. I knew perfectly well how to go about upgrading Mephisto, but doing it fell onto the chore side of the fun/not-fun scale.

On the other hand I've been wanting to learn about CouchDB, and there is a javascript-CouchDB-based blogging application called Sofa! (Sooner or later the list of synonyms for couch, and the G-rated things you can do on a couch, will be exhausted, and we will be delivered from this naming cuteness. ;-)

Thus I was seduced into a journey that has been as fun as it was time consuming. As Fen Labalme says, "open source software is only free if your time is worth nothing." Sofa, as it turns out, will only run on an edge version of CouchDB at this moment, which try as I might I could not get to install and pass its tests on my Mac. So I got a more stable version of CouchDB installed and went to plan B -- use Ruby on Rails as a front end and CouchDB as the data store, and make my own blog. It's a very simple blog, with zero bells and whistles, but it works for me.

I'm using a Ruby gem called CouchRest, which seems to be well thought out and finds a nice balance between abstracting the details of communicating with the database, and not hiding the CouchDB paradigm. You're looking at the result.

The CouchDB paradigm can seem a bit impenetrable at first to someone who is steeped in the SQL thought system, especially given the spotty state of the existing documentation. I read everything I could find about it, and then launched myself into the experience that is so aptly described by the aphorism that "in theory, theory and practice are the same, but in practice they're not."

So far I'm liking CouchDB a whole lot -- it changes the way you think about storing and sharing data.

The Irrelevance of XML

May 29th, 2009

I've been struggling for a while now with my dislike of XML, and with whether I could effectively make use of JSON instead. Two things I've been reading lately have converged in my mind into a realization that for me at least, XML has become irrelevant.

One of the things is a description of the workings and philosophy of a new database system called CouchDB. The other is a thread on the XRI TC mailing list concerning the signing of XRDs, into which has again reared the ugly head of XML DSig.

XML DSig is a good example of one of the things that sucks about XML. Because applications can lawfully munge any XML stanza in myriad ways, and because it's desirable to be able to pass along a cryptographic signature over the said XML stanza, the signature must be taken over a canonicalized version of the XML in question, so it can later be compared with a re-canonicalized version of the potentially munged end product. The rules for canonicalizing XML, although apparently well loved by academics, standards-body geeks and other merchants of complexity, are enough to make developers gnash their teeth and rend their clothing.
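A tiny illustration of why the canonicalization step exists at all. The element and attribute names are made up; the only facts relied on are that XML treats attribute order and quote style as insignificant, while a digest is computed over raw bytes.

```ruby
require 'digest/sha2'

# Two serializations of the same (made-up) XML element. Attribute order
# and quote style differ, which an XML processor treats as insignificant.
original = %q{<Service priority="10" id="a1"><URI>http://example.com/</URI></Service>}
munged   = %q{<Service id='a1' priority='10'><URI>http://example.com/</URI></Service>}

# A signature covers bytes, so the digests differ even though a parser
# would see the same element in both strings.
original_digest = Digest::SHA256.hexdigest(original)
munged_digest   = Digest::SHA256.hexdigest(munged)

puts original_digest == munged_digest # prints "false"
```

Hence the need to agree on one byte-for-byte form before signing and again before verifying, which is exactly where the teeth gnashing starts.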

It looks like some progress is being made towards including XML canonicalization in the ubiquitous libxml2 library, but too late for me I'm sorry to say - I just don't care anymore!

The light bulb went on in my head when I read that CouchDB was a schema-free database.

Because I've been working with relational databases for many years, I've understood schema to mean basically the names and datatypes of the database's table columns. That's straightforward, and it's not meant to be useful outside of the management of the database itself.

But with XML, schema means something subtly different. XML is meant to carry communication across applications, and an XML schema, besides the list of element names and their datatypes, is generally understood to carry, or at least to have imposed on it, the semantics of those elements. Standards bodies like OASIS exist to create and codify XML semantics for various domains.

But an XML schema does not and cannot carry any semantic information in and of itself. There is implicit in the semantic web concept a notion that at some point machines will be able to make use of the schema doc to "understand" the XML payload. This is the AI bait and switch, so elegantly examined by Steve Talbott in his book Devices of the Soul. First program the machine to carry out some simple interaction, then make wild extrapolations from this, all the while carefully ignoring the fact that the interaction is actually, if asynchronously, with the programmer of the machine, not the machine itself.

So in practice the schema URI can reassure a developer that the information domain she is expecting is the correct one, but the schema itself is only useful, during the information processing stage, for the somewhat self-referential verification that the XML content conforms to the schema. And if it doesn't? Postel's Law and developer practice suggest that you'll probably use it anyway if you can.

In order to use the information being transmitted by the XML, you have to know and understand the information domain. You have read the documentation for the information domain, or otherwise acquired that knowledge, and you proceed to parse the XML into a data structure in your programming language of choice and make use of that data structure to achieve your aims.

Switching focus for a moment, let's consider CouchDB. CouchDB is a highly efficient object database supported by the Apache Software Foundation. They prefer to call it a "document database", the documents in question being serialized javascript objects, i.e. JSON. Because JSON primitives are a lowest common denominator for almost any other object-oriented language, JSON can be easily and efficiently transposed into and out of other languages, and libraries for this purpose exist in all the currently common languages.

To use JSON as a replacement for XML, you would have to know the semantics of the XML elements, and convert those element tags into JSON keys, with the contents of the tags becoming JSON values. But because JSON does not have to be parsed in the XML sense - it's already a data structure - you can just call up the keys you're looking for and use their values. If there are other keys in there that you don't know about, you can just ignore them. Or, you know, go find out what they mean.
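In Ruby, for instance, that looks roughly like this. The payload and its key names are invented for the example; one library call turns the text into a native hash, and after that it's ordinary key lookup.

```ruby
require 'json'

# Made-up payload: the key names are for illustration only.
payload = '{"title": "TATWD on CouchDB", "tags": ["couchdb", "rails"], "x_internal": 42}'

# One call gives you a plain Ruby Hash.
post = JSON.parse(payload)

# Ask only for the keys you understand...
title = post['title']
tags  = post.fetch('tags', [])

# ...and anything you don't know about just sits there unused.
unknown = post.keys - %w[title tags]
```

Here `unknown` ends up holding the one key this reader doesn't recognize, and nothing breaks because of it.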

This is what is meant by a "schema-free" data structure. It is organized as a key:value dictionary or hash, rather than the top to bottom spatial organization of XML. Keys are randomly accessed, and can be added or taken away without disrupting the lawfulness or usability of the other keys.

CouchDB internals use a modified B-tree structure that is extremely fast, fault tolerant and corruption resistant. Input and output are via HTTP REST so it's a piece of cake to access the database from any programming environment, and you can write and store javascript functions that it will use internally to index, select, sort and format data. Replication is built in at a low level.
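As a sketch of how little ceremony the HTTP interface needs, here are the two requests for storing and fetching a document, built with nothing but Ruby's standard library. The assumptions are loud ones: a CouchDB server on localhost at its default port 5984, a database called "blog", and a made-up document id and fields. The requests are only constructed here, not sent.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Assumed location: local CouchDB, database "blog", document "first-post".
doc_uri = URI('http://localhost:5984/blog/first-post')

# Creating a document is an HTTP PUT with a JSON body.
put = Net::HTTP::Put.new(doc_uri.path, 'Content-Type' => 'application/json')
put.body = JSON.generate('title' => 'Hello', 'body' => 'First post')

# Reading it back is a plain GET on the same URL.
get = Net::HTTP::Get.new(doc_uri.path)

# Nothing goes over the wire until a request is handed to a connection:
#   Net::HTTP.start(doc_uri.host, doc_uri.port) { |http| http.request(put) }
```

A gem like CouchRest wraps this up more pleasantly, but it's reassuring that the plain version is this short.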

Oh, and canonicalizing the JSON for signing? Mostly just remove non-required whitespace, an operation for which there are a variety of tools available since that's also a javascript compression technique (both ends do also have to agree on a key order, since JSON doesn't define one). Given the advantages of CouchDB, I'm having a hard time finding any reason to use anything but JSON as a data interchange format.
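Here is roughly what that amounts to in Ruby. It's a minimal sketch, not a full canonicalization scheme: besides stripping whitespace it sorts object keys, and it makes no attempt to normalize number formatting or string escapes. The `sort_keys` and `canonical_json` names are mine.

```ruby
require 'digest/sha2'
require 'json'

# Recursively sort hash keys so equal documents serialize identically.
def sort_keys(value)
  case value
  when Hash
    value.keys.sort.each_with_object({}) { |k, h| h[k] = sort_keys(value[k]) }
  when Array
    value.map { |v| sort_keys(v) }
  else
    value
  end
end

# JSON.generate emits no optional whitespace by default.
def canonical_json(text)
  JSON.generate(sort_keys(JSON.parse(text)))
end

# The same made-up document, once with extra whitespace and once with
# its keys in a different order.
a = %({ "title": "Hello",  "tags": ["x", "y"] })
b = %({"tags":["x","y"],"title":"Hello"})

same = Digest::SHA256.hexdigest(canonical_json(a)) ==
       Digest::SHA256.hexdigest(canonical_json(b))
```

Both inputs reduce to the same string, so a signature taken over one verifies against the other.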