Polymorphic, multilingual, revisioned content!
posted by brian at 07:58 PM
Let's talk about modeling revisioned, localized content in SQL! We're talking database engine agnostic here, people. This should work in SQLite, MySQL, PostgreSQL, and Bob's ValueSQL.
But first, if you're easily bored like me, here's a pretty visual aid I made for you describing the model we came up with. It's supposed to visualize how independently revisioned parts of an object can be combined into a single revision history. The rest of the post talks about how we ended up with this model. Click on it!
Content can mean any set of fields that applies to one type of object. For example, a page might have a title and text and some flags like whether or not to show up in the site's menu and search results.
Supporting revisions is useful for a couple of reasons. One is that it allows users to undo their actions in case something goes horribly wrong. Secondly, you might want a workflow with some approval process such that new changes can be in a pending state while older versions are still active.
A simple way to model this might look like the following. You can easily imagine what the SQL looks like.
- A surrogate primary key. In an alternate schema it could be used as a composite key with
content_id, in which case it could be interpreted as a revision number.
- Some number identifying this content that won't change across revisions.
- The user who made this change.
- The time of the change.
- The author's description of the change.
- The workflow status (could be anything, depending on your workflow needs).
- A surrogate primary key that points to the
- A foreign key to the
- The title text.
- The main page content.
- A flag indicating whether or not to show up in the site's menu.
- A flag indicating whether or not to show up in the site's search results.
Pretty easy, right?
content_id are foreign keys to the
revision table, which will store information like the author, timestamp, comment, and workflow status for revisions of any type.
content_id could also only exist in the
revision table and be retrieved with a join, but I'm including it in the
page table to make the examples easier.
This is simple enough if you don't need to support content translations. If you want a software solution that people will take seriously, however, localization shouldn't just be possible—it should be easy. And sadly, it's not the easiest thing to model correctly. A Drupal developer has a good summary of why this is so hard. We need some way to have, for each
content_id, multiple translations of the localizable fields—in this case,
A bad solution would be to add a column for every language-field pair:
Like I said, that's obviously no good. Not only do we not want to add columns for every new translation, but we also don't want a new record with all of these columns when only one of the translations is updated. In other words, we want the translations to be independently revisioned, which better reflects how sites are normally translated. Another solution might be to have locale-independent title and text fields whose values are identifiers for strings in a translation table shared among all content types. Unfortunately this makes it harder to create and update page records, and also forces us to give up on any fields that need a unique schema, such as a maximum content length.
A better solution would be to just add a locale field:
- The locale of the translated fields in this row. For example, "en_US".
So we might have some entries that look like this:
|1||1||en_US||Warning!||The kitchen is on fire!||true||true|
|2||1||es||¡Aviso!||¡La cocina se arde!||true||true|
There are two pages here. One has two translations in U.S. English and Spanish. The other has just an English translation. They all have a different
revision_id—remember, this is just a link to another table that stores information about the change.
You might be able to see why this isn't ideal. What happens when we want a new revision that changes
show_in_search, two non-locale-specific fields? We'd have to insert a new record with the updates for every translation. That's a lot of work, and could be a lot of duplication!
To minimize duplication of data, we can split up the locale-dependent and locale-independent fields into two tables for each content type. Our tables would look like this:
So, to get whole page revisions now we can just line up the
revision_id from each table, and if an equivalent
revision_id doesn't exist, just use the previous one that does exist from the table where it's missing. For instance, if the revision history looks like this:
The revisions would be the pairs (1, 1), (2, 1), (2, 3), (2, 4), (5, 5), (5, 6), (7, 6). But we need to take locale into account...
There are more pairs now, since each
page_generic record needs to be paired with not just the previous
page_localized record, but the previous
page_localized record in each locale. The pairs are (1, 1), (2, 1), (2, 3), (2, 4), (5, 3), (5, 5), (5, 6), (7, 5), (7, 6).
There's one more aspect to cover in this model. What about the fields that all content types will need, and what if they, too, should be independently revisioned? For example, every content type might have a
url field and it would be nice if we could undo that, too. But moving a page isn't really changing the page, it's more like changing the site, and besides, it would be nice to have the
url for every object, regardless of content type, in one place. Luckily this looks very similar to what we have now, only this time the table is shared among content types. If we call this shared table
node, the revision history might look like this:
The revision triplets would then be (1, 1, 1), (1, 2, 1), (3, 2, 3), (3, 2, 4), (3, 5, 3), (3, 5, 5), (3, 5, 6), (7, 7, 5), (7, 7, 6), (8, 7, 5), (8, 7, 6).
For the curious, the SQL to select all these triplets is complex but certainly not complicated:
SELECT revision.revision_id, revision.content_id, revision.author, revision.timestamp, revision.comment, node.revision_id as node_revision, node.url, page_generic.revision_id as generic_revision, page_generic.show_in_menu, page_generic.show_in_search, page_localized.revision_id as localized_revision, page_localized.title, page_localized.text FROM revision, node, page_generic, page_localized WHERE node.revision_id = ( SELECT max(node.revision_id) FROM node WHERE node.revision_id <= revision.revision_id AND node.content_id = revision.content_id ) AND page_generic.revision_id = ( SELECT max(page_generic.revision_id) FROM page_generic WHERE page_generic.revision_id <= revision.revision_id AND page_generic.content_id = revision.content_id ) AND page_localized.revision_id IN ( SELECT max(page_localized.revision_id) FROM page_localized WHERE page_localized.revision_id <= revision.revision_id AND page_localized.content_id = revision.content_id GROUP BY page_localized.locale ) AND ( revision.revision_id = node.revision_id OR revision.revision_id = page_generic.revision_id OR revision.revision_id = page_localized.revision_id ) GROUP BY node.revision_id, page_generic.revision_id, page_localized.revision_id
Notice that the
revision table is correlated in those subqueries - that is, it refers to the
revision table in the outer select. Very important!
We went through many designs of our revision model before ending up with this one, and so far it seems to alleviate all the problems our previous designs faced. I hope I've shown you that there are a lot of things to consider when using multilingual revisioned records, and I hope you believe me when I say that this solution is serving us pretty well. SQLAlchemy handles it beautifully, although we've run across many chances for SQLAlchemy to improve along the way, and even submitted a patch.
Since one of our goals in Pagoda is to allow people to easily create content types using this model, we have a few simple helpers that will hide many of the details. Next post: eat the sandwich!