Up and Down and... Up

Paul R. Brown @ 2008-02-06T20:15:52Z

Doh. Starting last weekend, the server that hosts the virtual server that hosts this blog has been up and down, and then last night, it went down for the count. Linode, which I've been using since October 2006, has been a great hosting provider so far; this has been the only meaningful hiccup, and they dealt with it quickly and transparently. The virtual server is back up on a new physical server, so I'm looking forward to another trouble-free couple of years.

Things are back to normal:

light load showing in htop


Here Come the Spambots...

Paul R. Brown @ 2008-01-27T00:59:56Z

It didn't take long for someone (or something) to send the first comment spam:

some comment spam in the review display

It's interesting that the spambot appears to choose the same pages as on other blogs, and that either someone wrote a quick plugin for their spambot platform or they have a bot that figures out what links and form fields mean.


Haskell, del.icio.us, and JSON

Paul R. Brown @ 2008-01-27T00:35:15Z

I'd like to add both a sidebar with my bookmarks and some per-entry chrome for posts bookmarked on del.icio.us, but I don't want to use client-side JavaScript to do it. The alternative is to pull, cache, and manage the data on the server side. As a prototype, I whipped up a simple Haskell program that uses the del.icio.us JSON APIs (for posts and for URLs), and it involved a couple of surprising detours.

Some Haskell

First up, some Haskell. After going shopping on Hackage, I installed Network.HTTP, Thomas DuBuisson's pureMD5 package, and the JSON package from Masahiro Sakai and Jun Mukai (cabalized version is here). Like all code that builds on a decent set of libraries, the Haskell code to hit del.icio.us is straightforward; full source is here, so I'll just post some fragments below to give a flavor of the code.

Create a structure to hold the data:

data DeliciousBookmark = DeliciousBookmark { bookmark_url :: String
                                           , description :: String
                                           , tags :: [String] }
                         deriving ( Show, Eq, Ord )

Build the request:

bookmarks_fragment :: String
bookmarks_fragment = "http://del.icio.us/feeds/json/"

request_for_bookmarks :: String -> Request
request_for_bookmarks user = Request ( fromJust . parseURI $
                                       bookmarks_fragment ++ user ++ "?raw" )
                             GET [] ""

Send it:

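fetch_bookmarks :: String -> IO [DeliciousBookmark]
fetch_bookmarks user = do { res <- simpleHTTP . request_for_bookmarks $ user
                          ; case res of
                              Right (Response (2,0,0) _ _ body) ->
                                  return $ process_bookmarks_body body
                              -- anything else is a failure; report it rather than
                              -- dying on an incomplete pattern match
                              Right (Response code reason _ _) ->
                                  fail $ "unexpected response: " ++ show code ++ " " ++ reason
                              Left err -> fail $ show err
                          }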

And then parse and walk through the response body:

parse_crufty_json :: String -> J.Value
parse_crufty_json = parse_json . unescape . utf8_decode
    where
      parse_json = \s -> case (parse J.json "" s) of
                           Left err -> error . show $ err
                           Right v -> v

process_bookmarks_body :: String -> [DeliciousBookmark]
process_bookmarks_body body =
    case parse_crufty_json body of
      J.Array a ->
          map (process_bookmark . uno) a

process_bookmark :: M.Map String J.Value -> DeliciousBookmark
process_bookmark m =
    DeliciousBookmark { bookmark_url = uns $ M.findWithDefault blank "u" m
                      , description = uns $ M.findWithDefault blank "d" m 
                      , tags = map uns $ una $ M.findWithDefault empty_array "t" m }

blank = J.String ""
empty_array = J.Array []
-- partial accessors: each assumes the value has the expected constructor
uno (J.Object o) = o
una (J.Array a) = a
uns (J.String s) = s

And that's all there is to it, except that — as might be expected from the parse_crufty_json function — there were a few things that didn't work on the first pass.

Bytes and Characters

The first wrinkle I ran into with the simple del.icio.us client occurred in process_bookmarks_body. The Haskell String that comes from the HTTP response structure is just a straight conversion of the response body from bytes to character ordinals. This is all well and good if the body is encoded in ISO-8859-1, but it's fraught with peril otherwise. The del.icio.us service sends back UTF-8 (and ignores an Accept-Charset header instead of either returning a correctly encoded response or a 406 response code), so any interesting characters will cause problems. In this case, what should be Solutoire.com \8250 Plotr is coming through as Solutoire.com \226\128\186 Plotr. Writing a decoder is no big deal and an opportunity to play a quick round of golf.
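
For a flavor of what such a decoder can look like, here is a minimal sketch. It is not the code from the full source (and it is not golfed); it only assumes, as described above, that each Char in the input carries one byte of the response. Malformed or truncated sequences are passed through untouched, and overlong encodings are not rejected.

```haskell
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr, ord)

-- | Decode a String whose Chars are really UTF-8 byte values (0..255)
-- into a String of Unicode code points.  Anything that is not a
-- well-formed multi-byte sequence is copied through unchanged.
utf8_decode :: String -> String
utf8_decode [] = []
utf8_decode (c:cs)
    | b < 0x80              = c : utf8_decode cs         -- plain ASCII
    | b >= 0xC0 && b < 0xE0 = multi 1 (b .&. 0x1F)       -- 2-byte sequence
    | b >= 0xE0 && b < 0xF0 = multi 2 (b .&. 0x0F)       -- 3-byte sequence
    | b >= 0xF0 && b < 0xF8 = multi 3 (b .&. 0x07)       -- 4-byte sequence
    | otherwise             = c : utf8_decode cs         -- stray byte
  where
    b = ord c
    -- consume n continuation bytes and fold their low six bits into lead
    multi n lead =
        let (ks, rest) = splitAt n cs
            cp = foldl step lead ks
        in if length ks == n && all isCont ks && cp <= 0x10FFFF
             then chr cp : utf8_decode rest
             else c : utf8_decode cs
    step acc k = (acc `shiftL` 6) .|. (ord k .&. 0x3F)
    isCont k = ord k .&. 0xC0 == 0x80
```

Applied to the example above, the three bytes \226\128\186 collapse into the single character \8250, and pure ASCII input comes back unchanged.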

In terms of making HTTP in Haskell better, there was apparently a Google SoC project proposed to integrate cURL via FFI and Haskell's ByteString API, but it doesn't look like anything's come of it.

RFC-compliant JSON versus Works For Me in JavaScript

The second wrinkle with the simple del.icio.us client is more pernicious. After I resolved the string encoding issues, I started getting errors of the form:

parse error at (line 1, column 1552):
unexpected "'"
expecting "\"", "\\", "/", "b", "f", "n", "r", "t" or "u"

And sure enough, on inspection, there's an escaped apostrophe lurking in the JSON. This probably wouldn't bother a client who simply evaluated the JSON as literal JavaScript (which seems to be the intent of the API), but it's not legal JSON and the parser correctly signals an error.

The JSON grammar (per RFC 4627) permits a few escapes, and apostrophe is not among them. To wit:

         string = quotation-mark *char quotation-mark

         char = unescaped /
                escape (
                    %x22 /          ; "    quotation mark  U+0022
                    %x5C /          ; \    reverse solidus U+005C
                    %x2F /          ; /    solidus         U+002F
                    %x62 /          ; b    backspace       U+0008
                    %x66 /          ; f    form feed       U+000C
                    %x6E /          ; n    line feed       U+000A
                    %x72 /          ; r    carriage return U+000D
                    %x74 /          ; t    tab             U+0009
                    %x75 4HEXDIG )  ; uXXXX                U+XXXX

         escape = %x5C              ; \

         quotation-mark = %x22      ; "

         unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

Apostrophe is U+0027.

As with the UTF-8 issues, it's a quick job to implement a filter to scan for escaped apostrophes and unescape them, but it would be nice if what is advertised as JSON was actually JSON.
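
A filter along those lines might look like the sketch below. The name unescape matches the function used in parse_crufty_json, but the body is an illustration rather than the code from the full source. The one subtlety is that an escaped backslash followed by an apostrophe must be left alone, so escapes are consumed in pairs.

```haskell
-- | Replace the non-JSON escape \' with a bare apostrophe.  Every other
-- escape is copied through as a pair, so an escaped backslash that
-- happens to precede an apostrophe is not misread as \'.
unescape :: String -> String
unescape ('\\':'\'':cs) = '\'' : unescape cs
unescape ('\\':c:cs)    = '\\' : c : unescape cs
unescape (c:cs)         = c : unescape cs
unescape []             = []
```

Since it runs over the raw text before parsing, it doesn't need to know anything else about JSON structure: a backslash can only legally appear inside a string literal.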


The Blog has Ears

Paul R. Brown @ 2008-01-26T07:41:50Z

I opened up comments about a week ago on a provisional basis, and after fixing a couple of issues with ordering (and writing some unit tests), it should be good for general consumption.

A few design choices with the motivation behind them:

  • No CAPTCHA, AJAX-only forms, or other silliness to keep out the spammers, since those approaches don't really work and effectively punish the user instead of the spammers. In its place, moderation and a one-off platform that offers security through obscurity will have to do for starters. If things become a problem, I'd prefer a Bayesian approach anyway.
  • Comment formatting is provided through a simplistic macro language that's similar to the kind of markup supported in comments on Reddit. I thought about attempting to sanitize HTML or XHTML, but I wanted rigid limits on the types of formatting available and on the XHTML eventually stored in a comment and served in page views or feeds.
  • Unapproved comments use a separate internal channel and persistence mechanism, so other than request routing, spammers won't impair the experience for legitimate users.

What's Next...

I've got a pipeline of other features that I'd like to add, like backlinks to referrers, draft management workflow, some social chrome for del.icio.us and Reddit, and a JavaScript-free Flickr collage; and those will follow along at the rate of one every few weeks.


Use the Cores, erl

Paul R. Brown @ 2008-01-19T07:26:23Z

In spite of the fact that my last Apple workstation failed rather ingloriously after only a couple of years of use, I went ahead and replaced it with another Apple workstation, an eight-core Mac Pro.

As an experiment, I decided to run the same Erlang benchmark (big.erl) that I ran on the quad-core machine, this time with Erlang R12B. The previous results showed that four schedulers was optimal on the four-core machine. Here are the results of the same test battery on the eight-core machine:

line chart of throughput per number of Erlang schedulers

Two things are odd about this chart:

  1. The running times appear to be about equal to the running times for the benchmark on the quad-core machine. The raw clock speeds aren't that different per core (2.5GHz G5 versus 2.8GHz Xeon), so maybe it's not unreasonable for that to be a draw.
  2. Four schedulers appears to be the optimum (from the set {1,2,4,8,16,32}), where eight would have been the expected value.

It turns out that the optimality of four schedulers in this case doesn't disprove the hypothesis that the optimum number of schedulers equals the number of cores, since the benchmark only appears to be utilizing three of the eight cores:

CPU information showing only 33% active

The question is why the Erlang VM isn't using the available CPU resources. (Two separate VMs running big.erl get utilization up to 85%.) The answer may be buried somewhere inside operating system limits (see, e.g., sysctl(3) and sysctl(8); maybe kern.clockrate?), but it might also be something more interesting. Meanwhile, I'll try to come up with a similar toy benchmark for Haskell to see if it achieves better utilization of the CPUs.


VCs Drive Beaters, Too

Paul R. Brown @ 2008-01-16T20:30:41Z

Back in 2003 or 2004, when FiveSight was thinking about raising VC money, I had an informal lunch meeting with a Chicago-area venture capitalist (pretty sure it was Znex Xbhytrbetr) somewhere in either the Gold Coast or Old Town neighborhoods in Chicago. I honestly don't remember much of the content of the conversation other than that an elevator pitch for a middleware company launching a new product effort neither got him excited nor appeared to spoil his lunch. We finished up and walked out together.

I was driving the same beater that I drove up until the birth of our first child in 2005, a 1991 Acura Integra that I'd bought used when I finished grad school. Along the same lines as the plant in the lobby, I didn't want to take a chance on sending the wrong message to a potential investor, so I was hoping that we weren't parked where he'd walk by my car on the way to his. I was happy when we got to the door and he dispatched the valet to fetch his car, and then I was surprised when the valet pulled up in an even older beater than mine. (A rusty and rust-colored VW Fox, I think.)

And then he bummed $10 off me to pay the valet because he wasn't carrying any cash...


A Little Lesson on Laziness and Unsafety

Paul R. Brown @ 2008-01-15T09:56:12Z

I learned a good lesson about Haskell and unsafe IO today.

As I added comment support to perpubplat, I did a little performance testing on comment submission just to make sure that it would behave decently, and I started picking up sporadic segmentation faults under load with no discernible pattern. The faults were occurring as the new comment was serialized to disk, which, to the imperative programmer's eye, seems impossible, since the code that does the writing is plain vanilla Haskell code:

write_ :: B.Item -> IO ()
write_ i = do { let f = filename (B.internal_id i)
              ; h <- openFile f WriteMode
              ; hPutStr h $ B.to_string i
              ; hClose h }

On #haskell, dcoutts made the helpful suggestion to check for any foreign code, and thereby hangs a tale. (The GHC documentation makes the same suggestion.) Laziness means that a computation isn't performed until its result is needed, so it's not enough to think about what's happening in the write_ function; you also have to think about the computation that created the value that's passed to write_, and on and on. With enough back-tracing, the data passed to write_ turned out to have its roots in fields read from an HTTP request handled by FastCGI, and the root cause was buried in the implementation of the function that reads the form fields: a ByteString being built out of a hunk of memory allocated to a FastCGI structure using... unsafeInterleaveIO. The segmentation fault occurred when the evaluation of the write_ function (on a background thread) tried to read that hunk of memory, which had been freed back when request processing completed.

Copying the data in the request handling resolved the issue, but the FastCGI library should probably change slightly to insulate its clients from this issue.
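
As a rough illustration of that kind of fix (not the actual perpubplat change), the sketch below copies a strict ByteString into memory owned by the Haskell heap and forces the copy while the request is still being handled. The helper name detach is hypothetical; copy and evaluate are the standard functions from Data.ByteString.Char8 and Control.Exception.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString.Char8 as BS

-- | Hypothetical helper: given a field whose bytes may alias memory
-- owned by someone else, return a private copy.  evaluate forces the
-- copy to happen now, inside the request handler, instead of leaving a
-- thunk to be run later when the original buffer may already be freed.
detach :: BS.ByteString -> IO BS.ByteString
detach raw = evaluate (BS.copy raw)
```

Anything handed off to a background thread after a call like this refers only to the copy, so its lifetime no longer depends on when the request's buffers are released.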

