Haskell, del.icio.us, and JSON

Paul R. Brown @ 2008-01-27T00:35:15Z

I'd like to add both a sidebar with my bookmarks and some per-entry chrome for posts bookmarked on del.icio.us, but I don't want to use client-side JavaScript to do it. The alternative is to pull, cache, and manage the data on the server side. As a prototype, I whipped up a simple Haskell program that uses the del.icio.us JSON APIs (for posts and for URLs), and it contained a couple of surprising detours.

Some Haskell

First up, some Haskell. After going shopping on Hackage, I installed Network.HTTP, Thomas DuBuisson's pureMD5 package, and the JSON package from Masahiro Sakai and Jun Mukai (cabalized version is here). Like all code that builds on a decent set of libraries, the Haskell code to hit del.icio.us is straightforward; full source is here, so I'll just post some fragments below to give a flavor of the code.

Create a structure to hold the data:

data DeliciousBookmark = DeliciousBookmark { bookmark_url :: String
                                           , description :: String
                                           , tags :: [String] }
                         deriving ( Show, Eq, Ord )

Build the request:

bookmarks_fragment :: String
bookmarks_fragment = "http://del.icio.us/feeds/json/"

request_for_bookmarks :: String -> Request
request_for_bookmarks user = Request ( fromJust . parseURI $
                                       bookmarks_fragment ++ user ++ "?raw" )
                             GET [] ""

Send it:

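fetch_bookmarks :: String -> IO [DeliciousBookmark]
fetch_bookmarks user = do { res <- simpleHTTP . request_for_bookmarks $ user
                          ; case res of
                              Right (Response (2,0,0) _ _ body) ->
                                  return $ process_bookmarks_body body
                              -- Any other status code, or a connection error, is reported as a failure.
                              Right (Response code reason _ _) ->
                                  fail $ "unexpected response: " ++ show code ++ " " ++ reason
                              Left err -> fail $ show err
                          }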

And then parse and walk through the response body:

parse_crufty_json :: String -> J.Value
parse_crufty_json = parse_json . unescape . utf8_decode
    where
      parse_json = \s -> case (parse J.json "" s) of
                           Left err -> error . show $ err
                           Right v -> v

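process_bookmarks_body :: String -> [DeliciousBookmark]
process_bookmarks_body body =
    case parse_crufty_json body of
      J.Array a ->
          map (process_bookmark . uno) a
      -- The feed is expected to be a top-level array of objects.
      _ -> error "expected a JSON array of bookmarks"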

process_bookmark :: M.Map String J.Value -> DeliciousBookmark
process_bookmark m =
    DeliciousBookmark { bookmark_url = uns $ M.findWithDefault blank "u" m
                      , description = uns $ M.findWithDefault blank "d" m 
                      , tags = map uns $ una $ M.findWithDefault empty_array "t" m }

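-- Defaults and partial unwrappers; each unwrapper assumes the expected constructor.
blank = J.String ""
empty_array = J.Array []
uno (J.Object o) = o
una (J.Array a) = a
uns (J.String s) = s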
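
The unwrappers above are partial: they fail with a pattern-match error if the feed ever contains a value of an unexpected shape. As a rough sketch of a total alternative, the same walk can be written with Maybe. The Value type below is a stand-in defined locally for illustration (the JSON package's actual constructors may differ), and the names Bookmark, asString, asArray, asObject, toBookmark, and toBookmarks are hypothetical.

```haskell
import qualified Data.Map as M
import Data.Maybe (fromMaybe, mapMaybe)

-- Stand-in for the JSON library's value type (an assumption for this sketch).
data Value
  = JString String
  | JArray [Value]
  | JObject (M.Map String Value)
  deriving (Show, Eq)

data Bookmark = Bookmark
  { bookmarkUrl :: String
  , bookmarkDescription :: String
  , bookmarkTags :: [String]
  } deriving (Show, Eq)

-- Total unwrappers: Nothing instead of a pattern-match failure.
asString :: Value -> Maybe String
asString (JString s) = Just s
asString _ = Nothing

asArray :: Value -> Maybe [Value]
asArray (JArray a) = Just a
asArray _ = Nothing

asObject :: Value -> Maybe (M.Map String Value)
asObject (JObject o) = Just o
asObject _ = Nothing

-- Missing or wrongly typed fields fall back to empty defaults,
-- mirroring the findWithDefault calls in the original fragment.
toBookmark :: Value -> Maybe Bookmark
toBookmark v = do
  o <- asObject v
  let url = fromMaybe "" (M.lookup "u" o >>= asString)
      desc = fromMaybe "" (M.lookup "d" o >>= asString)
      tags = maybe [] (mapMaybe asString) (M.lookup "t" o >>= asArray)
  return (Bookmark url desc tags)

-- Entries that are not objects are skipped; a non-array feed yields [].
toBookmarks :: Value -> [Bookmark]
toBookmarks v = maybe [] (mapMaybe toBookmark) (asArray v)
```

Whether to skip malformed entries silently or report them is a design choice; for a sidebar cache, skipping seems tolerable, but a stricter client might prefer Either with an error message.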

And that's all there is to it, except that — as might be expected from the parse_crufty_json function — there were a few things that didn't work on the first pass.

Bytes and Characters

The first wrinkle I ran into with the simple del.icio.us client occurred in process_bookmarks_body. The Haskell String that comes from the HTTP response structure is just a straight conversion of the response body from bytes to character ordinals. This is all well and good if the body is encoded in ISO-8859-1, but it's fraught with peril otherwise. The del.icio.us service sends back UTF-8 (and ignores an Accept-Charset header instead of either returning a correctly encoded response or a 406 response code), so any interesting characters will cause problems. In this case, what should be Solutoire.com \8250 Plotr is coming through as Solutoire.com \226\128\186 Plotr. Writing a decoder is no big deal and an opportunity to play a quick round of golf.
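
For illustration, an ungolfed decoder might look like the sketch below. It is not the utf8_decode used in the program; the name decodeUtf8Bytes is hypothetical. It assumes every Char in the input is really a byte in the range 0..255, substitutes U+FFFD for malformed or truncated sequences, and does not reject overlong encodings.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)

-- | Reinterpret a String of byte ordinals as UTF-8 and return the
-- decoded code points. Bad sequences become U+FFFD.
decodeUtf8Bytes :: String -> String
decodeUtf8Bytes = go . map ord
  where
    go [] = []
    go (b : bs)
      | b < 0x80 = chr b : go bs                        -- single byte (ASCII)
      | b >= 0xC0 && b < 0xE0 = multi 1 (b .&. 0x1F) bs -- 2-byte sequence
      | b >= 0xE0 && b < 0xF0 = multi 2 (b .&. 0x0F) bs -- 3-byte sequence
      | b >= 0xF0 && b < 0xF8 = multi 3 (b .&. 0x07) bs -- 4-byte sequence
      | otherwise = '\xFFFD' : go bs                    -- stray continuation or invalid lead

    -- Consume n continuation bytes and fold their low 6 bits into the
    -- payload bits of the lead byte.
    multi n lead bs =
      let (conts, rest) = splitAt n bs
          cp = foldl step lead conts
      in if length conts == n && all isCont conts && cp <= 0x10FFFF
           then chr cp : go rest
           else '\xFFFD' : go bs

    isCont c = c .&. 0xC0 == 0x80
    step acc c = (acc `shiftL` 6) .|. (c .&. 0x3F)
```

With this sketch, decodeUtf8Bytes "Solutoire.com \226\128\186 Plotr" yields "Solutoire.com \8250 Plotr": the bytes 0xE2 0x80 0xBA carry the payload bits of U+203A, which is 8250 in decimal.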

In terms of making HTTP in Haskell better, there was apparently a Google SoC project proposed to integrate cURL via FFI and Haskell's ByteString API, but it doesn't look like anything's come of it.

RFC-compliant JSON versus Works For Me in JavaScript

The second wrinkle with the simple del.icio.us client is more pernicious. After I resolved the string encoding issues, I started getting errors of the form:

parse error at (line 1, column 1552):
unexpected "'"
expecting "\"", "\\", "/", "b", "f", "n", "r", "t" or "u"

And sure enough, on inspection, there's an escaped apostrophe lurking in the JSON. This probably wouldn't bother a client who simply evaluated the JSON as literal JavaScript (which seems to be the intent of the API), but it's not legal JSON and the parser correctly signals an error.

The JSON grammar (per RFC 4627) permits a few escapes, and apostrophe is not among them. To wit:

         string = quotation-mark *char quotation-mark

         char = unescaped /
                escape (
                    %x22 /          ; "    quotation mark  U+0022
                    %x5C /          ; \    reverse solidus U+005C
                    %x2F /          ; /    solidus         U+002F
                    %x62 /          ; b    backspace       U+0008
                    %x66 /          ; f    form feed       U+000C
                    %x6E /          ; n    line feed       U+000A
                    %x72 /          ; r    carriage return U+000D
                    %x74 /          ; t    tab             U+0009
                    %x75 4HEXDIG )  ; uXXXX                U+XXXX

         escape = %x5C              ; \

         quotation-mark = %x22      ; "

         unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

Apostrophe is U+0027.

As with the UTF-8 issues, it's a quick job to implement a filter to scan for escaped apostrophes and unescape them, but it would be nice if what is advertised as JSON was actually JSON.
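
A filter along those lines could be sketched as follows. The name unescapeApostrophes is hypothetical and this is not necessarily how the unescape function in the program works. It relies on the fact that in otherwise well-formed JSON a backslash only appears inside a string, so the whole document can be scanned without tracking string boundaries.

```haskell
-- | Rewrite the non-JSON escape \' to a plain apostrophe, leaving every
-- other escape sequence untouched.
unescapeApostrophes :: String -> String
unescapeApostrophes ('\\' : '\'' : rest) = '\'' : unescapeApostrophes rest
-- Any other escape is copied as a pair, so the second backslash of an
-- escaped backslash is never mistaken for the start of a new escape.
unescapeApostrophes ('\\' : c : rest) = '\\' : c : unescapeApostrophes rest
unescapeApostrophes (c : rest) = c : unescapeApostrophes rest
unescapeApostrophes [] = []
```

Consuming escapes two characters at a time matters: a naive search-and-replace for backslash-apostrophe would corrupt a string that ends in an escaped backslash followed by a literal apostrophe.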


Comment from Roberto @ 2008-01-27T03:21:56Z

Any programming language not having UTF-8 support these days cannot be considered for production.

Sure, del.icio.us needs to fix its JSON serialization if it's not right, but parsers should be able to have a permissive mode where some errors are accepted, just like the HTML parsers of most important web browsers.

Comment from Paul Brown @ 2008-01-27T05:09:11Z

Haskell does support Unicode internally (which is the reason that Haskell Strings burn so much memory). Like most other languages (e.g., Java), it reads handles as text using whatever default encoding the operating system uses. Unlike most other languages (e.g., Java again), it doesn't provide direct support for encodings when serializing Strings or deserializing bytes.

Comment from timb @ 2008-01-27T12:00:47Z

dammit, i remember fixing that apostrophe-escaping bug YEARS ago

Comment from Andreas Krey @ 2008-01-27T12:10:02Z

Haskell primarily burns lots of memory because of the decision that strings are lists of chars; regular lists. Needing the next pointer anyway, there is not much point in using only 8 bits for the char. :-)

Btw. the recent trend to use bytestrings in parsec and elsewhere feels a bit like throwing out the kid (unicode support) with the bath (massive storage overhead).

Json being parseable as javascript was indeed a design goal; as far as I know there exists a regular expression which checks whether a json string actually only contains data literals and won't do anything bad when evaluated.

Comment from duncan @ 2008-01-27T12:54:24Z

Try using the utf8-string package from hackage to decode the strings.

Comment from Joshua @ 2008-01-27T19:17:26Z

I sent this on to the relevant folks.

Comment from dons @ 2008-01-27T20:12:45Z

The utf8-string package specifically allows you to use the existing IO operations, with encoded Strings. It is used in production systems.

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/utf8-string

Comment from Paul Brown @ 2008-01-27T23:58:56Z

@dons and @duncan - I missed the utf8 package when I was rummaging on Hackage. I'll give it a go.

Comment from Erigami @ 2008-08-13T13:55:59Z

According to the RFC:

http://www.ietf.org/rfc/rfc4627.txt?number=4627

JSON's design goals were for it to be minimal, portable, textual, and a subset of JavaScript.

[...]

Any character may be escaped.

According to ECMAScript 262 (referenced in RFC 4627)

SingleEscapeCharacter is the character whose code point value is determined by the SingleEscapeCharacter according to the following table

[...]

\' [...] single quote [...] '

\' is a valid character in JSON.

Comment from Cowtowncoder @ 2009-02-17T07:24:07Z

Erigami: you are wrong.

Whether JavaScript allows it is irrelevant (except for historical interest): the JSON RFC that you link to clearly lists the allowed escape combinations, and this particular one is not included; hence it is invalid and is not to be used in well-formed JSON content.

This does not really contradict the subsetting part; the comment about aiming to be a subset is commentary, not a definition of JSON. JSON != JavaScript.