Polymorphic, multilingual, revisioned content!
posted by brian at 07:58 PM
Let's talk about modeling revisioned, localized content in SQL! We're talking database engine agnostic here, people. This should work in SQLite, MySQL, PostgreSQL, and Bob's ValueSQL.
But first, if you're easily bored like me, here's a pretty visual aid I made for you describing the model we came up with. It's supposed to visualize how independently revisioned parts of an object can be combined into a single revision history. The rest of the post talks about how we ended up with this model. Click on it!
Content can mean any set of fields that applies to one type of object. For example, a page might have a title and text and some flags like whether or not to show up in the site's menu and search results.
Supporting revisions is useful for a couple of reasons. One is that it allows users to undo their actions in case something goes horribly wrong. Secondly, you might want a workflow with some approval process such that new changes can be in a pending state while older versions are still active.
A simple way to model this might look like the following. You can easily imagine what the SQL looks like.
- revision
-
- revision_id
- A surrogate primary key. In an alternate schema it could be used as a composite key with
content_id, in which case it could be interpreted as a revision number. - content_id
- Some number identifying this content that won't change across revisions.
- author
- The user who made this change.
- timestamp
- The time of the change.
- comment
- The author's description of the change.
- status
- The workflow status (could be anything, depending on your workflow needs).
- page
-
- revision_id
- A surrogate primary key that points to the
revisiontable. - content_id
- A foreign key to the
revisiontable. - title
- The title text.
- text
- The main page content.
- show_in_menu
- A flag indicating whether or not to show up in the site's menu.
- show_in_search
- A flag indicating whether or not to show up in the site's search results.
Pretty easy, right? revision_id and content_id are foreign keys to the revision table, which will store information like the author, timestamp, comment, and workflow status for revisions of any type. content_id could also only exist in the revision table and be retrieved with a join, but I'm including it in the page table to make the examples easier.
This is simple enough if you don't need to support content translations. If you want a software solution that people will take seriously, however, localization shouldn't just be possible—it should be easy. And sadly, it's not the easiest thing to model correctly. A Drupal developer has a good summary of why this is so hard. We need some way to have, for each content_id, multiple translations of the localizable fields—in this case, title and text.
A bad solution would be to add a column for every language-field pair:
- page
-
- ...
- en_title
- en_text
- jp_title
- jp_text
- es_title
- es_text
Like I said, that's obviously no good. Not only do we not want to add columns for every new translation, but we also don't want a new record with all of these columns when only one of the translations is updated. In other words, we want the translations to be independently revisioned, which better reflects how sites are normally translated. Another solution might be to have locale-independent title and text fields whose values are identifiers for strings in a translation table shared among all content types. Unfortunately this makes it harder to create and update page records, and also forces us to give up on any fields that need a unique schema, such as a maximum content length.
A better solution would be to just add a locale field:
- page
-
- ...
- locale
- The locale of the translated fields in this row. For example, "en_US".
So we might have some entries that look like this:
| revision_id | content_id | locale | title | text | show_in_menu | show_in_search |
|---|---|---|---|---|---|---|
| 1 | 1 | en_US | Warning! | The kitchen is on fire! | true | true |
| 2 | 1 | es | ¡Aviso! | ¡La cocina se arde! | true | true |
| 3 | 2 | en | Bathroom status... | Safe! | false | false |
There are two pages here. One has two translations in U.S. English and Spanish. The other has just an English translation. They all have a different revision_id—remember, this is just a link to another table that stores information about the change.
You might be able to see why this isn't ideal. What happens when we want a new revision that changes show_in_menu or show_in_search, two non-locale-specific fields? We'd have to insert a new record with the updates for every translation. That's a lot of work, and could be a lot of duplication!
To minimize duplication of data, we can split up the locale-dependent and locale-independent fields into two tables for each content type. Our tables would look like this:
- page_generic
-
- revision_id
- content_id
- show_in_menu
- show_in_search
- page_localized
-
- revision_id
- content_id
- locale
- title
- text
So, to get whole page revisions now we can just line up the revision_id from each table, and if an equivalent revision_id doesn't exist, just use the previous one that does exist from the table where it's missing. For instance, if the revision history looks like this:
| page_generic | 1 | 2 | 5 | 7 | |||
|---|---|---|---|---|---|---|---|
| page_localized | 1 | 3 | 4 | 5 | 6 |
The revisions would be the pairs (1, 1), (2, 1), (2, 3), (2, 4), (5, 5), (5, 6), (7, 6). But we need to take locale into account...
| page_generic | 1 | 2 | 5 | 7 | |||
|---|---|---|---|---|---|---|---|
| page_localized | 1 | 3 | 4 | 5 | 6 | ||
| en | en | jp | jp | en |
There are more pairs now, since each page_generic record needs to be paired with not just the previous page_localized record, but the previous page_localized record in each locale. The pairs are (1, 1), (2, 1), (2, 3), (2, 4), (5, 3), (5, 5), (5, 6), (7, 5), (7, 6).
There's one more aspect to cover in this model. What about the fields that all content types will need, and what if they, too, should be independently revisioned? For example, every content type might have a url field and it would be nice if we could undo that, too. But moving a page isn't really changing the page, it's more like changing the site, and besides, it would be nice to have the url for every object, regardless of content type, in one place. Luckily this looks very similar to what we have now, only this time the table is shared among content types. If we call this shared table node, the revision history might look like this:
| node | 1 | 3 | 7 | 8 | ||||
|---|---|---|---|---|---|---|---|---|
| page_generic | 1 | 2 | 5 | 7 | ||||
| page_localized | 1 | 3 | 4 | 5 | 6 | |||
| en | en | jp | jp | en |
The revision triplets would then be (1, 1, 1), (1, 2, 1), (3, 2, 3), (3, 2, 4), (3, 5, 3), (3, 5, 5), (3, 5, 6), (7, 7, 5), (7, 7, 6), (8, 7, 5), (8, 7, 6).
For the curious, the SQL to select all these triplets is complex but certainly not complicated:
SELECT revision.revision_id,
revision.content_id,
revision.author,
revision.timestamp,
revision.comment,
node.revision_id as node_revision,
node.url,
page_generic.revision_id as generic_revision,
page_generic.show_in_menu,
page_generic.show_in_search,
page_localized.revision_id as localized_revision,
page_localized.title,
page_localized.text
FROM revision, node, page_generic, page_localized
WHERE node.revision_id = (
SELECT max(node.revision_id)
FROM node
WHERE node.revision_id <= revision.revision_id
AND node.content_id = revision.content_id
) AND page_generic.revision_id = (
SELECT max(page_generic.revision_id)
FROM page_generic
WHERE page_generic.revision_id <= revision.revision_id
AND page_generic.content_id = revision.content_id
) AND page_localized.revision_id IN (
SELECT max(page_localized.revision_id)
FROM page_localized
WHERE page_localized.revision_id <= revision.revision_id
AND page_localized.content_id = revision.content_id
GROUP BY page_localized.locale
) AND (
revision.revision_id = node.revision_id OR
revision.revision_id = page_generic.revision_id OR
revision.revision_id = page_localized.revision_id
)
GROUP BY node.revision_id,
page_generic.revision_id,
page_localized.revision_id
Notice that the revision table is correlated in those subqueries - that is, it refers to the revision table in the outer select. Very important!
We went through many designs of our revision model before ending up with this one, and so far it seems to alleviate all the problems our previous designs faced. I hope I've shown you that there are a lot of things to consider when using multilingual revisioned records, and I hope you believe me when I say that this solution is serving us pretty well. SQLAlchemy handles it beautifully, although we've run across many chances for SQLAlchemy to improve along the way, and even submitted a patch.
Since one of our goals in Pagoda is to allow people to easily create content types using this model, we have a few simple helpers that will hide many of the details. Next post: eat the sandwich!
Comments
Nice and versatile schema you have here, thanks. A couple of observations.
First, what about localizing the URL too? It's nicer for the user if URLs are in the same language as the text, and not in another "master" language. It should "only" take moving the url field from the node the localized table.
Second, and more important, where do relations among content types (in the Semantic Web sense) figure, in this schema? For instance, relations between a blog entry and subject tags?
Good questions, Nicola.
We considered localizing the URLs and decided that we should start off the base URL with a locale or put it in the subdomain, but not change the path components in the URL from a primary language. Why? First, I think we decided that since Unicode characters in URLs need to be encoded, for some locales the URL wouldn't be meaningful in the ISO-Latin character set. And even if there is some meaningful translation (for example, Japanese sites commonly use phonetic spelling in URLs), I think that using the same path will allow people to browse different translations more easily. So while it could be as simple as putting the URL in the localized table, we're choosing to keep it non-locale-dependent.
As for relating the content types using this model to other content types, it would be pretty similar to how this is normally modeled, with the exception that there is more than one obvious primary key to use. The revision ID triplets (as demonstrated above) are one unique identifier, but another is the "dominant" revision ID (the highest of the triplets), the content ID, and the locale. I would recommend, when modeling relations among content types, to use the latter as the identifying key, since it's not dependent on the content type's exact tables.