JSON as a Migration Format

Paul Brown @ 2007-02-22T16:11:12Z

I'm making slow progress on my personal publishing platform rewrite in Haskell (see earlier posts here and here), so herein part 3 of n, wherein I experiment with data migration and an embarrassingly simple data model. A forthcoming part 4 will be really simple Atom serialization.

Data Out, Data In

As I experiment with the new platform, I'd like a way to move the data from the typo instance into the new environment on-demand; this post is my lab notebook for the export/import experiment.

One of the things that Rails has gotten 100% right is the ability to (easily) access configured environments via interactive (script/console) or scripting (script/runner) front-ends. (Using a framework like Spring in the Java space can provide similar functionality by constructing an application context, but it's more awkward to separate out the services that the runtime container would be providing.)

My first thought on exporting was to use YAML, but the significant whitespace and cryptic annotations ("|" for a free-form text block with a trailing linebreak and "|-" for a free-form text block without a trailing linebreak) just rubbed me the wrong way. JSON turns out to be a better choice because ActiveRecord supports JSON export (via ActiveSupport::JSON), and there are a couple of JSON libraries for Haskell. One is a predecessor of the other, and I'm going to work with the earlier version because it has no dependencies other than a baseline GHC 6.6 install.

Getting an entry out to play with is a piece of cake:

$ ./script/runner 'puts Article.find_by_state("Published").to_json' \
  > /tmp/entry.json

Parsing the JSON is similarly straightforward:

$ ghci
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base ... linking ... done.
Prelude> :load /tmp/JSON.hs
[1 of 1] Compiling JSON             ( /tmp/JSON.hs, interpreted )
Ok, modules loaded: JSON.
*JSON> entry <- P.parseFromFile json "/tmp/entry.json"
Loading package parsec-2.0 ... linking ... done.
Right (Object (fromList [("attributes",Object (fromList [
[...]

(JSON.hs aliases Data.Map as M and Text.ParserCombinators.Parsec as P, so that's where the P.parseFromFile is coming from.) The map of values is wrapped up a bit, but a couple of simple functions will get it out from behind the type constructors:

*JSON> let unR = \(Right r) -> r
*JSON> let unO = \(Object o) -> o
*JSON> :t (unO.unR) entry
(unO.unR) entry :: M.Map String Value

which gets us down to the level of the first map with one entry under the key "attributes". To get the map of attributes we want out:

*JSON> let m = unO ((((M.!).unO.unR) entry) "attributes")
*JSON> m
fromList [("allow_comments",String "1"),("allow_pings",String "1"),...

(Data.Map provides the ! operator for looking up a key; it raises an error if the key is missing.) And now the components of the entry are easy to extract:

*JSON> let atts = ((M.!) m)
*JSON> atts "allow_comments"
String "1"
*JSON> atts "updated_at"
String "2006-09-15 02:12:45"
*JSON> atts "body" 
String "<p>Although I really do like the...
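The unR and unO lambdas above are partial: they fail on anything other than the constructor they expect, which is fine at the prompt but less so in a batch import. A minimal sketch of a total alternative follows. The Value type here is a cut-down stand-in for the one in JSON.hs (only the two constructors that appear in this post), and asObject, asString, lookupPath, and attribute are hypothetical names, not part of that library.

```haskell
import qualified Data.Map as M

-- Cut-down stand-in for the Value type in JSON.hs (assumption: the real
-- type has more constructors; only the two seen in this post are mirrored).
data Value = String String
           | Object (M.Map String Value)
           deriving (Show, Eq)

-- Total counterpart of unO: Nothing instead of a pattern-match failure.
asObject :: Value -> Maybe (M.Map String Value)
asObject (Object o) = Just o
asObject _          = Nothing

-- Unwrap a JSON string, or Nothing for any other kind of value.
asString :: Value -> Maybe String
asString (String s) = Just s
asString _          = Nothing

-- Walk a chain of keys through nested objects (hypothetical helper).
lookupPath :: [String] -> Value -> Maybe Value
lookupPath []       v = Just v
lookupPath (k : ks) v = asObject v >>= M.lookup k >>= lookupPath ks

-- Fetch a string attribute from the "attributes" object of an entry.
attribute :: String -> Value -> Maybe String
attribute name entry = lookupPath ["attributes", name] entry >>= asString
```

With a parsed entry in hand, attribute "updated_at" returns the timestamp wrapped in Just, and a missing key or an unexpected shape gives Nothing rather than an exception.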

The first-cut data model for entries looks like this:

import System.Time (CalendarTime)  -- needed for the date fields below

-- A post; comments reuse the same structure via p_comments.
data BlogPost = BlogPost { p_title     :: String,
                           p_summary   :: Maybe String,
                           p_permalink :: String,
                           p_metadata  :: PostMetadata,
                           p_body      :: String,
                           p_tags      :: [String],
                           p_uid       :: String,
                           p_comments  :: [BlogPost]
                         }
                deriving (Show)

-- Dates, authorship, and publication state for a post.
data PostMetadata = PostMetadata { m_created   :: CalendarTime,
                                   m_publish   :: CalendarTime,
                                   m_updated   :: CalendarTime,
                                   m_author    :: PostAuthor,
                                   m_published :: Bool
                                 }
                    deriving (Show)

-- Author details; the URI and email address are optional.
data PostAuthor = PostAuthor { p_name      :: String,
                               p_uri       :: Maybe String,
                               p_email     :: Maybe String,
                               p_showEmail :: Bool
                             }
                  deriving (Show)
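Because p_comments holds BlogPost values, a comment thread is just a tree of posts, and anything that walks it is a short recursion. The sketch below uses a trimmed stand-in record (Post, with only a title and comments) so that it stays self-contained; Post, commentCount, and outline are hypothetical names for illustration only.

```haskell
-- Trimmed stand-in for BlogPost: only the fields needed to show the
-- recursion (assumption: the real record carries the rest unchanged).
data Post = Post { postTitle    :: String
                 , postComments :: [Post]
                 } deriving (Show)

-- Total number of comments under a post, counting replies at every depth.
commentCount :: Post -> Int
commentCount p = sum [ 1 + commentCount c | c <- postComments p ]

-- Indented outline of a thread, two spaces per level of nesting.
outline :: Post -> [String]
outline = go 0
  where
    go depth p =
      (replicate (2 * depth) ' ' ++ postTitle p)
        : concatMap (go (depth + 1)) (postComments p)
```

Loading this into GHCi and running mapM_ putStrLn (outline somePost) prints the thread with replies indented under their parents.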

Translating from typo's model to the new one is just a matter of putting the fields in the right place, plus a little date munging, since the new model expects dates to be represented as Haskell CalendarTime values. The reuse of the BlogPost structure for comments is intentional, both for Atom syndication and to support threaded comments.
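The date munging might look something like the following sketch. It swaps in UTCTime from the time package rather than the CalendarTime the data model uses, and it assumes the exported timestamps are in UTC, which the export itself does not say. parseStamp is a hypothetical name.

```haskell
import Data.Time (UTCTime, defaultTimeLocale, parseTimeM)

-- Parse a timestamp in the "YYYY-MM-DD HH:MM:SS" shape seen in the export,
-- e.g. "2006-09-15 02:12:45".
-- Assumption: the value is UTC; the exported string carries no zone.
-- The False argument rejects leading or trailing whitespace.
parseStamp :: String -> Maybe UTCTime
parseStamp = parseTimeM False defaultTimeLocale "%Y-%m-%d %H:%M:%S"
```

Returning Maybe keeps a malformed date from aborting the whole import; the caller can decide whether to skip the entry or fall back to another field.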

Pulling all of the entries out is also straightforward:

$ ./script/runner 'puts (Article.find_all_by_state("Published")).to_json' \
  > /tmp/entry.json
$ ./script/runner 'puts (Article.find_all_by_state("ContentState::Published")).to_json' \
  >> /tmp/entry.json

and pulling comments and trackbacks is a similar exercise:

$ ./script/runner 'puts (Comment.find_all_by_state("ContentState::Ham")).to_json' \
  > /tmp/comments.json
$ ./script/runner 'puts (Trackback.find_all_by_state("ContentState::Ham")).to_json' \
  > /tmp/trackbacks.json

although it takes a little doing to collate comments and trackbacks with their parent posts. So far, so good: unlike any of the other migrations (Radio Userland→SnipSnap, SnipSnap→WordPress, and WordPress→typo) I've done, this looks to be neither lossy nor labor-intensive.
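One way to approach the collation step is to group the exported comments and trackbacks by the id of their parent post and then look each post's bucket up by id. The sketch below assumes each exported record carries its parent's id (the field name in typo's export is not shown in this post, so c_parent is a placeholder), and Feedback, collate, and feedbackFor are hypothetical names.

```haskell
import qualified Data.Map as M

-- Minimal stand-in for an exported comment or trackback.
-- Assumption: c_parent holds the id of the post it belongs to.
data Feedback = Feedback { c_parent :: Int
                         , c_body   :: String
                         } deriving (Show, Eq)

-- Group feedback by parent post id, preserving the input order within
-- each group (flip (++) appends later items after earlier ones).
collate :: [Feedback] -> M.Map Int [Feedback]
collate fs = M.fromListWith (flip (++)) [ (c_parent f, [f]) | f <- fs ]

-- Feedback for one post; posts with none get the empty list.
feedbackFor :: Int -> M.Map Int [Feedback] -> [Feedback]
feedbackFor = M.findWithDefault []
```

Building the map once and querying it per post avoids rescanning the full comment list for every entry.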

As an aside, over about four years of blogging (2003-02-17 through the present), I've accumulated the equivalent of ~110 single-spaced pages of content.


Comment from ejboy @ 2007-02-24T05:11:35Z # permalink

If anybody is interested in JSON support for migrations based on Scriptella ETL, do not hesitate to ask me for it. I can add support for JSON export if I see any significant interest.