Let's talk about modeling revisioned, localized content in SQL! We're talking database engine agnostic here, people. This should work in SQLite, MySQL, PostgreSQL, and Bob's ValueSQL.

But first, if you're easily bored like me, here's a pretty visual aid I made for you describing the model we came up with. It's supposed to visualize how independently revisioned parts of an object can be combined into a single revision history. The rest of the post talks about how we ended up with this model. Click on it!

revisions.png

Content can mean any set of fields that applies to one type of object. For example, a page might have a title and text and some flags like whether or not to show up in the site's menu and search results.

Supporting revisions is useful for a couple of reasons. One is that it allows users to undo their actions in case something goes horribly wrong. Secondly, you might want a workflow with some approval process such that new changes can be in a pending state while older versions are still active.

A simple way to model this might look like the following. You can easily imagine what the SQL looks like.

revision
revision_id
A surrogate primary key. In an alternate schema it could be used as a composite key with content_id, in which case it could be interpreted as a revision number.
content_id
Some number identifying this content that won't change across revisions.
author
The user who made this change.
timestamp
The time of the change.
comment
The author's description of the change.
status
The workflow status (could be anything, depending on your workflow needs).
page
revision_id
A surrogate primary key that points to the revision table.
content_id
A foreign key to the revision table.
title
The title text.
text
The main page content.
show_in_menu
A flag indicating whether or not to show up in the site's menu.
show_in_search
A flag indicating whether or not to show up in the site's search results.

Pretty easy, right? revision_id and content_id are foreign keys to the revision table, which will store information like the author, timestamp, comment, and workflow status for revisions of any type. content_id could also only exist in the revision table and be retrieved with a join, but I'm including it in the page table to make the examples easier.

This is simple enough if you don't need to support content translations. If you want a software solution that people will take seriously, however, localization shouldn't just be possible—it should be easy. And sadly, it's not the easiest thing to model correctly. A Drupal developer has a good summary of why this is so hard. We need some way to have, for each content_id, multiple translations of the localizable fields—in this case, title and text.

A bad solution would be to add a column for every language-field pair:

page
...
en_title
en_text
jp_title
jp_text
es_title
es_text

Like I said, that's obviously no good. Not only do we not want to add columns for every new translation, but we also don't want a new record with all of these columns when only one of the translations is updated. In other words, we want the translations to be independently revisioned, which better reflects how sites are normally translated. Another solution might be to have locale-independent title and text fields whose values are identifiers for strings in a translation table shared among all content types. Unfortunately this makes it harder to create and update page records, and also forces us to give up on any fields that need a unique schema, such as a maximum content length.

A better solution would be to just add a locale field:

page
...
locale
The locale of the translated fields in this row. For example, "en_US".

So we might have some entries that look like this:

revision_id content_id locale title text show_in_menu show_in_search
1 1 en_US Warning! The kitchen is on fire! true true
2 1 es ¡Aviso! ¡La cocina se arde! true true
3 2 en Bathroom status... Safe! false false

There are two pages here. One has two translations in U.S. English and Spanish. The other has just an English translation. They all have a different revision_id—remember, this is just a link to another table that stores information about the change.

You might be able to see why this isn't ideal. What happens when we want a new revision that changes show_in_menu or show_in_search, two non-locale-specific fields? We'd have to insert a new record with the updates for every translation. That's a lot of work, and could be a lot of duplication!

To minimize duplication of data, we can split up the locale-dependent and locale-independent fields into two tables for each content type. Our tables would look like this:

page_generic
revision_id
content_id
show_in_menu
show_in_search
page_localized
revision_id
content_id
locale
title
text

So, to get whole page revisions now we can just line up the revision_id from each table, and if an equivalent revision_id doesn't exist, just use the previous one that does exist from the table where it's missing. For instance, if the revision history looks like this:

page_generic 1257
page_localized13456

The revisions would be the pairs (1, 1), (2, 1), (2, 3), (2, 4), (5, 5), (5, 6), (7, 6). But we need to take locale into account...

page_generic 1257
page_localized13456
enenjpjpen

There are more pairs now, since each page_generic record needs to be paired with not just the previous page_localized record, but the previous page_localized record in each locale. The pairs are (1, 1), (2, 1), (2, 3), (2, 4), (5, 3), (5, 5), (5, 6), (7, 5), (7, 6).

There's one more aspect to cover in this model. What about the fields that all content types will need, and what if they, too, should be independently revisioned? For example, every content type might have a url field and it would be nice if we could undo that, too. But moving a page isn't really changing the page, it's more like changing the site, and besides, it would be nice to have the url for every object, regardless of content type, in one place. Luckily this looks very similar to what we have now, only this time the table is shared among content types. If we call this shared table node, the revision history might look like this:

node 1378
page_generic 1257
page_localized13456
enenjpjpen

The revision triplets would then be (1, 1, 1), (1, 2, 1), (3, 2, 3), (3, 2, 4), (3, 5, 3), (3, 5, 5), (3, 5, 6), (7, 7, 5), (7, 7, 6), (8, 7, 5), (8, 7, 6).

For the curious, the SQL to select all these triplets is complex but certainly not complicated:

SELECT revision.revision_id,
       revision.content_id,
       revision.author,
       revision.timestamp,
       revision.comment,
       node.revision_id as node_revision,
       node.url,
       page_generic.revision_id as generic_revision,
       page_generic.show_in_menu,
       page_generic.show_in_search,
       page_localized.revision_id as localized_revision,
       page_localized.title,
       page_localized.text
FROM revision, node, page_generic, page_localized
WHERE node.revision_id = (
    SELECT max(node.revision_id)
    FROM node
    WHERE node.revision_id <= revision.revision_id
      AND node.content_id = revision.content_id
) AND page_generic.revision_id = (
    SELECT max(page_generic.revision_id)
    FROM page_generic
    WHERE page_generic.revision_id <= revision.revision_id
      AND page_generic.content_id = revision.content_id
    
) AND page_localized.revision_id IN (
    SELECT max(page_localized.revision_id)
    FROM page_localized
    WHERE page_localized.revision_id <= revision.revision_id
      AND page_localized.content_id = revision.content_id
    GROUP BY page_localized.locale
) AND (
    revision.revision_id = node.revision_id OR
    revision.revision_id = page_generic.revision_id OR
    revision.revision_id = page_localized.revision_id
)
GROUP BY node.revision_id,
         page_generic.revision_id,
         page_localized.revision_id

Notice that the revision table is correlated in those subqueries - that is, it refers to the revision table in the outer select. Very important!

We went through many designs of our revision model before ending up with this one, and so far it seems to alleviate all the problems our previous designs faced. I hope I've shown you that there are a lot of things to consider when using multilingual revisioned records, and I hope you believe me when I say that this solution is serving us pretty well. SQLAlchemy handles it beautifully, although we've run across many chances for SQLAlchemy to improve along the way, and even submitted a patch.

Since one of our goals in Pagoda is to allow people to easily create content types using this model, we have a few simple helpers that will hide many of the details. Next post: eat the sandwich!