Fatherhood, Take Two

Paul Brown @ 2008-04-09T06:00:00Z

Our son was born today.

(comment bubbles) 4 comments

Conditional GET Support for perpubplat

Paul R. Brown @ 2008-03-05T08:34:12Z

As part of being a good netizen, I added conditional GET support (per section 9.3 of the HTTP/1.1 spec, RFC 2616) to perpubplat. Generated Atom feeds now carry ETag (MD5 of the feed URI and last-modified date) and Last-Modified headers, and requests for Atom feeds honor the corresponding If-None-Match and If-Modified-Since headers with proper precedence. (For precedence, the spec dictates that when an If-None-Match header is present and none of its entity tags match, any If-Modified-Since header is ignored.) A quick curl experiment shows that things appear to work:

$ curl -i -s -o - http://mult.ifario.us/f/t/haskell/atom.xml | \
> egrep \^ETag\|\^Last-Modified\|\^HTTP\/
HTTP/1.1 200 OK
Last-Modified: Sat, 23 Feb 2008 23:55:23 GMT
ETag: 78790a6a7d6bddd10f6f9c412f2aba97
$ curl -H 'If-None-Match: 78790a6a7d6bddd10f6f9c412f2aba97' \
> -i -s -o - http://mult.ifario.us/f/t/haskell/atom.xml | \
> egrep \^HTTP\/
HTTP/1.1 304 Not Modified
$ curl -H 'If-Modified-Since: Sat, 23 Feb 2008 23:55:23 GMT' \
> -i -s -o - http://mult.ifario.us/f/t/haskell/atom.xml | \
> egrep \^HTTP\/
HTTP/1.1 304 Not Modified
$ curl -H 'If-None-Match: foo' -H 'If-Modified-Since: Sat, 23 Feb 2008 23:55:23 GMT' \
> -i -s -o - http://mult.ifario.us/f/t/haskell/atom.xml | \
> egrep \^HTTP\/
HTTP/1.1 200 OK
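
The four curl runs above exercise a small decision table. Here is a minimal sketch of that table as a pure Haskell function. It is an illustration only, not perpubplat's actual code: the names Validators, Outcome, and decide are made up for this example, and dates are compared by exact string equality rather than parsed.

```haskell
-- Illustrative model of conditional-GET precedence; not perpubplat's code.
data Validators = Validators
  { currentETag     :: String  -- ETag the server would send right now
  , currentModified :: String  -- Last-Modified the server would send right now
  }

data Outcome = Full200 | NotModified304
  deriving (Eq, Show)

-- Arguments: current validators, If-None-Match value, If-Modified-Since value.
decide :: Validators -> Maybe String -> Maybe String -> Outcome
decide v ifNoneMatch ifModifiedSince =
  case (ifNoneMatch, ifModifiedSince) of
    (Nothing, Nothing) -> Full200
    (Just t,  Nothing) -> whenTrue (tagMatches t)
    (Nothing, Just d)  -> whenTrue (unchangedSince d)
    (Just t,  Just d)
      | tagMatches t -> whenTrue (unchangedSince d)
      | otherwise    -> Full200  -- no tag matched: If-Modified-Since is ignored
  where
    tagMatches t     = t == currentETag v
    -- Simplification (assumption): string equality instead of date comparison.
    unchangedSince d = d == currentModified v
    whenTrue ok      = if ok then NotModified304 else Full200
```

Loaded into GHCi, decide applied to the feed's validators with (Just "foo") and the matching date gives Full200, mirroring the last curl run.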
(comment bubbles) 0 comments

First Twinklings of a Sense of Humor

Paul R. Brown @ 2008-02-26T22:49:30Z

As she's approaching her third birthday, the kid is showing signs of a sense of humor. We were in the car with the kid, discussing what to have for dinner, and my wife told her that we were going to have lamb, peas, and couscous.

kid: Dad, would you like some hotto-potaddo for dinner?

dad: [Decides to play along.] Sure; that sounds good.

kid: Would you like some hotto-potaddo for dinner, Mom?

mom: [Decides to play along, too.] OK.

kid: Hotto-potaddo for you, Dad, and hotto-potaddo for you, Mom. I will have lamb and peas myself.

I wonder where she learns this stuff.

(comment bubbles) 0 comments

There are Apparently Lots of Haskell Jobs...

Paul R. Brown @ 2008-02-23T23:55:23Z

... and some of them even involve trees.

A while ago, Rod Johnson asserted that Spring had overtaken EJB as a job requirement for Java developers, supported by some charts from indeed.com. (For follow-on analysis, see an article on InfoQ.) I'd call the Spring or EJB decision a false dichotomy in any case.

On a lark, I decided to try a three-way competition between Haskell, Erlang, and OCaml, with Lisp and Pascal thrown in just because, and the results surprised me, since I expected Erlang to beat both Haskell and OCaml:

[Image: chart of Haskell, OCaml, and Erlang job counts]

So, what are these Haskell jobs...?

[Image: haskell job listings]

Ah. A more reasonable query (like "haskell and functional") gives the expected results of a few jobs in financial services. The most interesting thing about the query is that twelve jobs is about 0.00025% of the total number of jobs in the engine, so I'll guess that they have on the order of 4.8 million jobs and thus that there are around 9,600 jobs (0.2%) listed for "spring and java" as requirements.
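
Spelled out, the back-of-the-envelope arithmetic behind those two estimates is as follows, with N standing for the (unknown) total number of listings:

```latex
\[
N \approx \frac{12}{0.00025\%} = \frac{12}{2.5 \times 10^{-6}} = 4.8 \times 10^{6},
\qquad
0.2\% \times N = 0.002 \times 4.8 \times 10^{6} = 9{,}600 .
\]
```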

(comment bubbles) 0 comments

New Features for Perpubplat and Ruminations on Service APIs for the Web

Paul R. Brown @ 2008-02-18T20:02:39Z

I've added some new features to perpubplat, and each one presented a nice exercise in Haskell, working with Haskell libraries, and the design and consumption of web APIs.

Collage of Random Flickr Photos

[Image: Flickr sidebar screenshot]

The first feature is the collage of photos that uses the Flickr JSON API. The collage appears at the bottom of the sidebar under the "Photos" heading.

The implementation of the collage (Blog.Widgets.FlickrCollage; source here) uses a polite (i.e., supports conditional GET) HTTP poller (Blog.BackEnd.HttpPoller; source here) to call flickr.people.getPublicPhotos (docs here) every fifteen minutes and pull down the data for my most recent 500 photos. (I'll discuss the HTTP poller below.) To deal with concurrency — many readers (HTTP requests) and one writer (the polling thread) — an MVar holds the list of photos, with the writer taking the old value and putting the new and the reader taking the old value and then putting it right back. The implementation of MVar ensures that waiters are awakened in FIFO order, so this should (and does) work great.
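
A minimal sketch of that take-and-put discipline, assuming a stand-in Photo type and hypothetical helper names (readPhotos, writePhotos) rather than the actual code in Blog.Widgets.FlickrCollage:

```haskell
import Control.Concurrent.MVar

-- Hypothetical stand-in for the photo record; only the shape matters here.
newtype Photo = Photo String
  deriving (Eq, Show)

-- Reader: take the current list and put it straight back.
readPhotos :: MVar [Photo] -> IO [Photo]
readPhotos box = do
  photos <- takeMVar box
  putMVar box photos
  return photos

-- Writer: take the old list, discard it, and put the new one.
writePhotos :: MVar [Photo] -> [Photo] -> IO ()
writePhotos box new = do
  _ <- takeMVar box
  putMVar box new
```

Because every reader and the writer must take the value before putting one back, at most one thread holds the list at a time. A more defensive version would probably use withMVar and modifyMVar_ from the same module so that an exception between the take and the put cannot leave the MVar empty.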

The JSON parser that I've been using models the different kinds of JSON value with a Haskell algebraic datatype, and this means that you work with wrapped values (a JSON Array wrapped around a list, a JSON String wrapped around a Haskell String, etc.) instead of bare primitive values. To make things a little more ergonomic, I've bundled up some one-line utility functions in Blog.Widgets.JsonUtilities (source here). My favorite of the bunch is </>:

(</>) :: J.Value -> String -> J.Value
(J.Object o) </> s = o M.! s
(J.Array a) </> s = J.Array $ map (flip (</>) $ s) a

This makes it possible to compactly express access to nested JSON objects. For example, from the Flickr integration:

to_photo :: J.Value -> FlickrPhoto
to_photo m = FlickrPhoto { photo_id = uns $ m </> "id"
                         , owner = uns $ m </> "owner"
                         , secret = uns $ m </> "secret"
                         , server = uns $ m </> "server"
                         , photo_title = uns $ m </> "title"
                         , farm = unn $ m </> "farm" }

The uns function pulls the value out of a wrapped JSON String, and the unn function pulls the value out of a wrapped JSON Number. With a bit more thought, someone could probably come up with a nice library for JSON handling along the lines of Jaql or something like Pig Latin.
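
To try the operator without the JSON library, here is a self-contained sketch. The Value type below is a simplified stand-in for J.Value, and the bodies of uns and unn (including the Double result of unn) are guesses at the obvious definitions rather than copies from Blog.Widgets.JsonUtilities.

```haskell
import qualified Data.Map as M

-- Simplified stand-in for the JSON library's wrapped values (an assumption).
data Value = Object (M.Map String Value)
           | Array [Value]
           | String String
           | Number Double
  deriving (Eq, Show)

-- Look up a key in an object, or map the lookup over every array element.
(</>) :: Value -> String -> Value
Object o </> s = o M.! s
Array a  </> s = Array (map (</> s) a)
v        </> s = error ("cannot look up " ++ show s ++ " in " ++ show v)

-- Unwrap a JSON String.
uns :: Value -> String
uns (String s) = s
uns v          = error ("not a string: " ++ show v)

-- Unwrap a JSON Number.
unn :: Value -> Double
unn (Number n) = n
unn v          = error ("not a number: " ++ show v)

-- Made-up sample data, shaped loosely like a list of photos.
sample :: Value
sample = Object (M.fromList
  [ ("photo", Array
      [ Object (M.fromList [("id", String "1"), ("farm", Number 3)])
      , Object (M.fromList [("id", String "2"), ("farm", Number 4)]) ]) ])
```

With these definitions, sample </> "photo" </> "id" evaluates to an Array holding the two wrapped ids, because the second lookup is mapped over each element of the array.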

HTTP Polling

My rough cut at an HTTP polling library built on top of Network.HTTP is Blog.BackEnd.HttpPoller (source here), and it supports the bare minimum of features that I needed:

  • Call a supplied function with signature String -> IO () with the body of a 200 response and ignore others.
  • Use "conditional GET" (RFC 2616, section 9.3) via ETag/If-None-Match and Last-Modified/If-Modified-Since.
  • Support for basic authentication via a header configured on the template request passed to the poller.
  • Tolerant of temporary failures but able to gracefully exit.
  • Detailed-enough logging in case APIs, endpoints, or policies change. (I omitted redirect support on purpose.)
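
The first two bullets can be sketched as a library-free model of a single polling step. Everything here (Validators, Response, conditionalHeaders, step) is hypothetical and only illustrates the bookkeeping; the real Blog.BackEnd.HttpPoller works with Network.HTTP request and response types.

```haskell
-- Validators remembered from the most recent 200 response.
data Validators = Validators
  { etag         :: Maybe String
  , lastModified :: Maybe String
  } deriving (Eq, Show)

-- Just the parts of an HTTP response that the poller cares about.
data Response = Response
  { status      :: Int
  , respETag    :: Maybe String
  , respLastMod :: Maybe String
  , body        :: String
  }

-- Headers to add to the next request so the server can answer 304.
conditionalHeaders :: Validators -> [(String, String)]
conditionalHeaders v =
  [ ("If-None-Match", t)     | Just t <- [etag v] ] ++
  [ ("If-Modified-Since", d) | Just d <- [lastModified v] ]

-- On a 200, hand the body to the callback and remember the new validators;
-- on anything else, ignore the response and keep the old validators.
step :: (String -> IO ()) -> Validators -> Response -> IO Validators
step handler old resp
  | status resp == 200 = do
      handler (body resp)
      return (Validators (respETag resp) (respLastMod resp))
  | otherwise = return old
```

A polling loop would alternate between building a request with conditionalHeaders, sleeping, and feeding each response through step.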

del.icio.us Bookmarks on an Entry

The second feature is integration with del.icio.us bookmarks pointing to an entry via the del.icio.us JSON API, and it shows up as a trailer on entries in the detail view:

[Image: del.icio.us entry trailer screenshot]

I've already blogged about most of the interesting stuff from integrating with the del.icio.us JSON API using Network.HTTP; see Haskell, del.icio.us, and JSON (encodings and non-standard JSON) and A Short Adventure with simpleHTTP (unclosed sockets).

The part I didn't cover was how to schedule queries against del.icio.us, and I'll probably go back to both simplify and enhance it. At present, it's a bit convoluted; three threads interact as follows:

  1. The driver triggers the scheduler on a fixed interval.
  2. The scheduler manages an ordered list of scheduled times and entries. In response to a trigger from the driver, if the head of the list is past due, the scheduler pops the head of the list, refreshes the data about bookmarks for that entry, sends it to the controller, and schedules the next refresh for that entry based on its age in days. The scheduler also receives information about new entries and adds them to the schedule.
  3. The controller manages a Data.Map of data about bookmarks per entry and either updates data in response to the scheduler or returns the current data for rendering a response.
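
The scheduler's reaction to a trigger (step 2) can be sketched as a pure function over the ordered list. The names and, in particular, the refresh intervals below are assumptions made up for illustration; the post only says that the next refresh depends on the entry's age in days.

```haskell
import Data.List (insertBy)
import Data.Ord (comparing)

type Time = Int  -- seconds since some epoch (illustrative)

data Entry = Entry
  { permalink :: String
  , posted    :: Time
  } deriving (Eq, Show)

type Schedule = [(Time, Entry)]  -- kept sorted by due time

-- Assumed policy: the older the entry, the less often it is refreshed.
refreshInterval :: Int -> Time
refreshInterval ageDays
  | ageDays < 7  = hour
  | ageDays < 30 = 6 * hour
  | otherwise    = 24 * hour
  where hour = 3600

-- Insert an entry at its due time, keeping the list ordered.
schedule :: Time -> Entry -> Schedule -> Schedule
schedule due e = insertBy (comparing fst) (due, e)

-- On a trigger at time `now`: if the head is past due, pop it, reschedule
-- it by age, and return it as the entry whose bookmarks need refreshing.
tick :: Time -> Schedule -> (Maybe Entry, Schedule)
tick now ((due, e) : rest)
  | due <= now =
      let ageDays = (now - posted e) `div` 86400
      in (Just e, schedule (now + refreshInterval ageDays) e rest)
tick _ sched = (Nothing, sched)
```

The driver would call tick on its fixed interval, and new entries would go through schedule with an immediate due time.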

The current design is in-memory only, so it gets repopulated each time the service is booted. I intend to add simple file-based persistence along the same lines used for entries and comments. The other major missing features are support for conditional GET and grouping requests into groups of 15, as allowed by the del.icio.us API.

I would have liked to use the delicious API, but Network.HTTP doesn't currently support HTTPS.

Personal Aggregation

[Image: StreamOfConsciousness sidebar screenshot]

The third feature is aggregation of my del.icio.us bookmarks (via RSS feed), Google Reader shared items (via Atom feed), and Twitter "tweets" (via JSON API). The aggregated flotsam, jetsam, dross, and detritus shows up in the sidebar under the "Stream of Consciousness" heading.

The feature is a bit like Movable Type's Action Streams plugin, but the perpubplat implementation benefits from the fact that a Haskell FastCGI application can have background threads (so no crontab hacking).

The implementation is in the Blog.Widgets.StreamOfConsciousness.* modules:

  • Thought is a data structure that represents a tweet, post, shared item, etc. — date, link, content.
  • Twitter, GoogleReader, and DeliciousPosts encapsulate access to the respective services and parsing data into lists of Thoughts. Each worker uses an HTTP poller (same as with the Flickr collage) to poll a feed.
  • Controller manages the aggregate data structure and a pre-rendered HTML fragment.

To handle the multiple writers and multiple readers, I implemented a lightweight version of multi-version concurrency control where readers can always get data but writers may have to repeat a computation if someone else updated the data in the meantime. Here's a fragment from B.W.S.Controller (full source here):

commit :: SoCController -> [Thought] -> IO ()
commit socc new_items =
    do { snap <- get_data socc
       ; let items' = take (max_size snap) $ merge new_items $ items snap
       ; let rendered' = thoughts_to_xhtml items' 
       ; let snap' = snap { items = items'
                          , rendered = rendered' }
       ; ok <- update socc snap'
       ; if ok then
             return ()
         else 
             do { threadDelay collision_delay
                ; commit socc new_items }
       }

loop :: Chan SoCRequest -> Snapshot -> IO ()
loop ch snap = 
    do { req <- readChan ch
       ; snap' <- case req of
                   GetHtmlFragment c ->
                       do { putMVar c $ rendered snap
                          ; return snap }
                   GetData h ->
                       do { putMVar h snap
                          ; return snap }
                   Update ok snap'' ->
                       if (version snap) == (version snap'') then
                           do { putMVar ok True
                              ; let snap' = snap'' { version = (version snap) + 1 }
                              ; return snap' }
                       else
                           do { putMVar ok False
                              ; return snap }
       ; loop ch snap' }

The commit function runs in the HTTP polling thread doing the updating, and it's responsible both for merging the items into the sorted data and for updating the HTML representation that will get handed to the page rendering process.
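
The fragment calls get_data and update without showing them. Here is a self-contained sketch of how such helpers could talk to the loop over a channel, with a stripped-down Snapshot; getData, update, and start are hypothetical names and shapes, not the definitions in B.W.S.Controller.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Concurrent.MVar

-- Stripped-down snapshot: a version counter plus the data it guards.
data Snapshot = Snapshot { version :: Int, items :: [String] }
  deriving (Eq, Show)

data Request
  = GetData (MVar Snapshot)      -- reply slot for the current snapshot
  | Update (MVar Bool) Snapshot  -- reply slot plus the proposed snapshot

-- The owning thread: the only place the current snapshot lives.
loop :: Chan Request -> Snapshot -> IO ()
loop ch snap = do
  req <- readChan ch
  case req of
    GetData reply -> putMVar reply snap >> loop ch snap
    Update ok proposed
      | version proposed == version snap -> do
          putMVar ok True
          loop ch proposed { version = version snap + 1 }
      | otherwise -> putMVar ok False >> loop ch snap

-- Readers always succeed: ask for the snapshot and wait for the reply.
getData :: Chan Request -> IO Snapshot
getData ch = do
  reply <- newEmptyMVar
  writeChan ch (GetData reply)
  takeMVar reply

-- Writers learn whether their proposal was based on the current version.
update :: Chan Request -> Snapshot -> IO Bool
update ch proposed = do
  ok <- newEmptyMVar
  writeChan ch (Update ok proposed)
  takeMVar ok

-- Fork the owning thread with an empty snapshot and return its channel.
start :: IO (Chan Request)
start = do
  ch <- newChan
  _ <- forkIO (loop ch (Snapshot 0 []))
  return ch
```

A writer that gets False back from update has lost a race with another writer, which is the case where commit waits and recomputes from a fresh snapshot.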

The other interesting nut to crack was extracting data from XML using Haskell. I could have used the del.icio.us JSON feed and the JSON feed that the Google Reader shared items JavaScript widget uses, but those lack the timestamps that I need to fold the streams together.

Extracting Data from RSS and Atom

I followed the standard trail for learning HXT, which involves building from source, reading the gentle introduction, and trying some of the practical examples. The only issue I had was with namespace handling.

Here's a code fragment from B.W.S.DeliciousPosts (source here) to read the RSS feed of my del.icio.us bookmarks:

import Text.XML.HXT.Arrow

handle_posts :: SoCController -> String -> IO ()
handle_posts socc body = do { posts <- runX ( readString parse_opts body >>> getItems )
                            ; commit socc posts }

parse_opts = [(a_validate, v_0), (a_check_namespaces,v_1)]
                                
atElemQName qn = deep (isElem >>> hasQName qn)
text = getChildren >>> getText
textOf qn = atElemQName qn >>> text

rdf_uri = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
rdf_RDF = QN "rdf" "RDF" rdf_uri

rss_uri = "http://purl.org/rss/1.0/"
rss_item = QN "rss" "item" rss_uri
rss_title = QN "rss" "title" rss_uri
rss_link = QN "rss" "link" rss_uri

dc_uri = "http://purl.org/dc/elements/1.1/"
dc_date = QN "dc" "date" dc_uri


getItem = atElemQName rss_item >>>
          proc i -> do
            t <- textOf rss_title -< i
            u <- textOf rss_link -< i
            d <- textOf dc_date -< i
            returnA -< Thought Delicious d u t

getItems = atElemQName rdf_RDF >>>
           proc r -> do
             items <- getItem -< r
             returnA -< items

HXT uses arrow notation; the quick and dirty explanation is that proc is like λ (but for arrows instead of functions), the <- binds the output of an arrow to a name (much as it does in monadic do notation), and the -< feeds a value to the expression on the shaft of the arrow.
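
Arrow notation can be tried without HXT, since plain functions are arrows too. The following toy example (describe is a made-up name) uses the same proc, <-, and -< pieces on ordinary functions:

```haskell
{-# LANGUAGE Arrows #-}
import Control.Arrow

-- Plain functions are arrows, so this needs no XML library.
describe :: (Int, Int) -> String
describe = proc pair -> do
  total <- uncurry (+) -< pair  -- feed `pair` to the arrow, bind its output
  prod  <- uncurry (*) -< pair
  returnA -< show total ++ "/" ++ show prod
```

Here describe (3, 4) yields "7/12". In the HXT code the arrows are XML filters that may produce zero or many results, which is why getItem can yield one Thought per matching item.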

The first time I ran this against the RSS from del.icio.us, I got nothing back, so after looking at the XML for the RSS, I switched the prefix for the RSS QNames to the empty string to match the input file, and it worked. Grrr... That means that the (==) for QName is broken, and a quick look at the source in Text.XML.HXT.DOM.TypeDefs showed why:

data QName = QN { namePrefix    :: String  -- prefix part of a qualified name "namePrefix:localPart"
                , localPart     :: String  -- local part of a qualified name "namePrefix:localPart"
                , namespaceUri  :: String  -- namespace uri
                }
             deriving (Eq, Ord, Show, Read, Typeable)

The derived (==) simply ANDs together the (==) for the three components (prefix, local part, URI), but XML QNames are equal if their local parts and namespace URIs (as strings) are equal, regardless of prefix. It's easy to fix by dropping the derivation of Eq and supplying a good version:

-              deriving (Eq, Ord, Show, Read, Typeable)
+              deriving (Ord, Show, Read, Typeable)
+ 
+ instance Eq QName where
+     q1 == q2 = ((localPart q1) == (localPart q2))
+                && ((namespaceUri q1) == (namespaceUri q2))

After which, it works according to my expectations for namespace handling.
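
The effect is easy to check in isolation. This sketch uses a stand-in QName with the same three fields and the same hand-written Eq instance; the sample names (prefixed, unprefixed, other) and the example.org URI are made up for the demonstration.

```haskell
-- Stand-in for HXT's QName with a prefix-insensitive Eq instance.
data QName = QN
  { namePrefix   :: String
  , localPart    :: String
  , namespaceUri :: String
  } deriving (Show)

instance Eq QName where
  q1 == q2 = localPart q1 == localPart q2
          && namespaceUri q1 == namespaceUri q2

rssUri :: String
rssUri = "http://purl.org/rss/1.0/"

prefixed, unprefixed, other :: QName
prefixed   = QN "rss" "item" rssUri                        -- rss:item
unprefixed = QN ""    "item" rssUri                        -- default namespace
other      = QN "rss" "item" "http://example.org/not-rss"  -- same prefix, other URI
```

With this instance, prefixed == unprefixed is True and prefixed == other is False, which matches XML namespace semantics: the prefix is only a local abbreviation for the URI.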

Couldn't You Do All That With JavaScript...?

Yes. I could. I didn't. Here are a few of the reasons that I chose not to:

  • My experiments showed that page loads would be several seconds instead of a fraction of a second. Other people have had the same experience. (It reminds me of the opening scene of I'm Gonna Git You Sucka where Junebug dies of an OG. Don't let your blog die of an OW...)
  • Some of the widgets are just plain fugly, IMHO. I'm looking at you, Google Reader shared item "clip" and Twitter Flash widget, although the availability of JSON for the Google Reader shared item "clip" (look in the JavaScript) and Twitter would allow me to come up with something more pleasing (to me).
  • Even though it's not a good idea — e.g., IE7 is broken, Firefox <3 doesn't do incremental display, etc. — I would like to be able to serve application/xhtml+xml, and document.write doesn't work.
  • The availability of background threads on the server side means that JavaScript on the client side isn't the only option.

Other Integrations and Aggregations

The other two features that I'd like to add are backlinks to other blogs and backlinks to posts on community sites like Reddit and DZone. (I'm on the fence about implementing trackback support; you could twist my arm.)

Nonetheless, I'm on the fence about directing people to comment threads in other locations, i.e., Reddit. (My reasons are similar to Reg Braithwaite's.) It would be a simple matter to sniff referring URLs, deduce where an entry is posted on Reddit, and then integrate the comments together, but Reddit's draconian User Agreement forbids it:

The content, organization, graphics, text, images, video, design, compilation, advertising and all other material on the Website, including without limitation, the "look and feel" of this website, are protected under applicable copyrights and other proprietary (including but not limited to intellectual property) rights and are the property of Website Provider or its licensors. The copying, rearrangement, redistribution, modification, use or publication by you, directly or indirectly, of any such matters or any part of the website, including but not limited to the removal or alteration of advertising, except for the limited rights of use granted hereunder, is strictly prohibited.

Someone should implement a community hub that integrates discussion threads, followup posts, and blog comments on an original entry in a transparent and open fashion...

Postmortem

My first observation from this experiment is that APIs are preferable to feeds are preferable to widgets when it comes to integration of services on the web. (Note that I didn't say web services...) Even listing widgets is somewhat questionable in my opinion, since it's more of a "put my stuff on your page" than a "use my service".

My second observation is nothing new, but I now have experimental evidence — JSON is preferable to XML, whether or not the target client runs in a browser. If I were building a service, I'm not sure that I'd bother with supporting an XML API.

My third observation is that I would use Haskell to build a product or service, and I mean that in the sense that I can see how to train a team and build processes (prototyping, implementation, quality, deployment, support) around Haskell. The language does have a relatively steep learning curve (q.v. Kevin Scaldefarri's post on the subject and the comments that follow or Reg Braithwaite's general ruminations on learning languages), but the real problem is collectively getting through the challenges once. It reminds me of learning spectral sequences as a graduate student; fifteen minutes with my advisor to work an example was better than a week of staring at otherwise inscrutable notation. As a measure of the view from my current location on the learning curve, I coded up a working rough cut of the "stream of consciousness" feature in an evening plus an afternoon cup of coffee, and I wouldn't regard myself as being fully around the curve yet (FFI, custom monads/transformers, etc. await).

(comment bubbles) 1 comment

Mea Culpa on Duplicate Posts

Paul R. Brown @ 2008-02-17T00:31:09Z

Apologies for the duplicate posts that showed up last night and are probably still stuck in some aggregators. I inadvertently deployed my development configuration to my production host, and that's why there were lots of posts with localhost:7007 in their permalinks. Needless to say, better configuration management moved up the list of features to be added to perpubplat.

(comment bubbles) 0 comments

A Short Adventure with simpleHTTP

Paul R. Brown @ 2008-02-08T05:51:28Z

I'll blog separately about adding support for some JavaScript-free del.icio.us and Flickr chrome to perpubplat (look at the sidebar and the bottom of entries in the detail view), but like the experiment with the del.icio.us JSON API, it had an interesting (to me at least) and unexpected turn with the Haskell Network.HTTP library not closing its connections on my development box and laptop but running just fine on a Linux server.

Too Many Open Files?

I fired up a local build of perpubplat and let it run for a bit. The hit rate was a little more aggressive than del.icio.us was happy with (999 response codes), so I tapered it back and let the service run for a while. After a couple of hours, the service crashed with a "too many open files" exception, and the only thing that could mean was that the connections to del.icio.us weren't getting closed properly. A quick restart, a little wait, and there are a bunch of open connections to del.icio.us hanging around; here's a representative pair of open connections:

$ sudo lsof -u _www | grep perpubpla
[...]
perpubpla 52499 _www    6u  IPv4 0xa642e64        0t0      TCP \
  coresaplenty:49364->badges1.del.vip.re1.yahoo.net:http (LAST_ACK)
[...]
perpubpla 52499 _www   22u  IPv4 0x9fa7270        0t0      TCP \
  coresaplenty:49457->badges1.del.vip.re1.yahoo.net:http (TIME_WAIT)
[...]

(My local Apache2 runs as the _www user.) A quick check showed that the number is steadily increasing:

$ while true; do sudo lsof -u _www | grep del.vip | wc -l; sleep 10; done
[...]
80
83
[...]
150
152
[...]

My first thought was that I was being overly lazy in the code that connects to the remote services. I added some strictness annotations in strategic places (enough to ensure that the response body was fully read), tinkered with relevant HTTP headers (e.g., Connection: close), and turned on debugging in the Network.HTTP library, which just reported that it was closing streams.

I polled folks on #haskell, but it didn't appear that others were having the same issue.

Detour into TCP

Rather than suspect the Network.HTTP library (which is where the problem was), the hung connections led me to initially suspect the network layer, and I dug into the TCP exchange. Here's a state chart from RFC 793 that provides a non-normative explanation of proper TCP behavior:

                              +---------+ ---------\      active OPEN  
                              |  CLOSED |            \    -----------  
                              +---------+<---------\   \   create TCB  
                                |     ^              \   \  snd SYN    
                   passive OPEN |     |   CLOSE        \   \           
                   ------------ |     | ----------       \   \         
                    create TCB  |     | delete TCB         \   \       
                                V     |                      \   \     
                              +---------+            CLOSE    |    \   
                              |  LISTEN |          ---------- |     |  
                              +---------+          delete TCB |     |  
                   rcv SYN      |     |     SEND              |     |  
                  -----------   |     |    -------            |     V  
 +---------+      snd SYN,ACK  /       \   snd SYN          +---------+
 |         |<-----------------           ------------------>|         |
 |   SYN   |                    rcv SYN                     |   SYN   |
 |   RCVD  |<-----------------------------------------------|   SENT  |
 |         |                    snd ACK                     |         |
 |         |------------------           -------------------|         |
 +---------+   rcv ACK of SYN  \       /  rcv SYN,ACK       +---------+
   |           --------------   |     |   -----------                  
   |                  x         |     |     snd ACK                    
   |                            V     V                                
   |  CLOSE                   +---------+                              
   | -------                  |  ESTAB  |                              
   | snd FIN                  +---------+                              
   |                   CLOSE    |     |    rcv FIN                     
   V                  -------   |     |    -------                     
 +---------+          snd FIN  /       \   snd ACK          +---------+
 |  FIN    |<-----------------           ------------------>|  CLOSE  |
 | WAIT-1  |------------------                              |   WAIT  |
 +---------+          rcv FIN  \                            +---------+
   | rcv ACK of FIN   -------   |                            CLOSE  |  
   | --------------   snd ACK   |                           ------- |  
   V        x                   V                           snd FIN V  
 +---------+                  +---------+                   +---------+
 |FINWAIT-2|                  | CLOSING |                   | LAST-ACK|
 +---------+                  +---------+                   +---------+
   |                rcv ACK of FIN |                 rcv ACK of FIN |  
   |  rcv FIN       -------------- |    Timeout=2MSL -------------- |  
   |  -------              x       V    ------------        x       V  
    \ snd ACK                 +---------+delete TCB         +---------+
     ------------------------>|TIME WAIT|------------------>| CLOSED  |
                              +---------+                   +---------+

Without any additional digging, the fact that the connections appeared to be hung in either of the final two states (TIME_WAIT or LAST_ACK) should have alerted me to the actual problem, as should the fact that the connection in the TIME_WAIT state failed to time out even after a long period of time (e.g., 30 minutes). But I didn't see it yet, so I kept digging.

With the aid of tcpdump and Wireshark, here is the exchange for the connection that hangs in the LAST_ACK state:

perpubpla 52499 _www    6u  IPv4 0xa642e64        0t0      TCP \
  coresaplenty:49364->badges1.del.vip.re1.yahoo.net:http (LAST_ACK)
|29.402   |         SYN       |                   |Seq = 0 Ack = 3900749235
|         |(49364)  ------------------>  (80)     |
|29.500   |         SYN, ACK  |                   |Seq = 0 Ack = 1
|         |(49364)  <------------------  (80)     |
|29.500   |         ACK       |                   |Seq = 1 Ack = 1
|         |(49364)  ------------------>  (80)     |
|29.501   |         PSH, ACK - Len: 120           |Seq = 1 Ack = 1
|         |(49364)  ------------------>  (80)     |
|29.620   |         PSH, ACK - Len: 373           |Seq = 1 Ack = 121
|         |(49364)  <------------------  (80)     |
|29.620   |         PSH, ACK - Len: 5             |Seq = 374 Ack = 121
|         |(49364)  <------------------  (80)     |
|29.620   |         FIN, ACK  |                   |Seq = 379 Ack = 121
|         |(49364)  <------------------  (80)     |
|29.620   |         ACK       |                   |Seq = 121 Ack = 374
|         |(49364)  ------------------>  (80)     |
|29.620   |         ACK       |                   |Seq = 121 Ack = 379
|         |(49364)  ------------------>  (80)     |
|29.620   |         ACK       |                   |Seq = 121 Ack = 380
|         |(49364)  ------------------>  (80)     |
|29.621   |         FIN, ACK  |                   |Seq = 121 Ack = 380
|         |(49364)  ------------------>  (80)     |
|29.722   |         ACK       |                   |Seq = 380 Ack = 122
|         |(49364)  <------------------  (80)     |

According to the statechart and the RFC, this looks acceptable.

As for the one that hangs in the TIME_WAIT state:

perpubpla 52499 _www   22u  IPv4 0x9fa7270        0t0      TCP \
  coresaplenty:49457->badges1.del.vip.re1.yahoo.net:http (TIME_WAIT)
|189.205  |         SYN       |                   |Seq = 0 Ack = 1062078327
|         |(49457)  ------------------>  (80)     |
|189.308  |         SYN, ACK  |                   |Seq = 0 Ack = 1
|         |(49457)  <------------------  (80)     |
|189.308  |         ACK       |                   |Seq = 1 Ack = 1
|         |(49457)  ------------------>  (80)     |
|189.308  |         PSH, ACK - Len: 120           |Seq = 1 Ack = 1
|         |(49457)  ------------------>  (80)     |
|189.507  |         ACK       |                   |Seq = 1 Ack = 121
|         |(49457)  <------------------  (80)     |
|189.529  |         PSH, ACK - Len: 373           |Seq = 1 Ack = 121
|         |(49457)  <------------------  (80)     |
|189.529  |         PSH, ACK - Len: 5             |Seq = 374 Ack = 121
|         |(49457)  <------------------  (80)     |
|189.529  |         FIN, ACK  |                   |Seq = 379 Ack = 121
|         |(49457)  <------------------  (80)     |
|189.529  |         ACK       |                   |Seq = 121 Ack = 374
|         |(49457)  ------------------>  (80)     |
|189.529  |         ACK       |                   |Seq = 121 Ack = 379
|         |(49457)  ------------------>  (80)     |
|189.529  |         ACK       |                   |Seq = 121 Ack = 379
|         |(49457)  ------------------>  (80)     |
|189.529  |         FIN, ACK  |                   |Seq = 121 Ack = 379
|         |(49457)  ------------------>  (80)     |
|189.529  |         FIN, ACK  |                   |Seq = 121 Ack = 380
|         |(49457)  ------------------>  (80)     |
|189.632  |         ACK       |                   |Seq = 380 Ack = 122
|         |(49457)  <------------------  (80)     |

As with the connection hung in the LAST_ACK state, this also looks to be within the behavior prescribed by the statechart and RFC.

This issue occurs both on my primary development box at home and on my laptop, both Intel hardware with Mac OS X 10.5.1, but on the (virtual) server that hosts this blog (Linux 2.6 kernel), the same code, libraries, and GHC version exhibit no problems.

The Actual Problem

Finger-tracing the code for the simpleHTTP function in the Network.HTTP module leads down into the Network.TCP module, and when I finally read the code, I kicked myself for the detour with packet sniffing. Here is the full text of the close function for the TCP stream; the issue is the Exception.catch wrapped around the call to closeConn:

close ref = 
    do { c <- readIORef (getRef ref)
       ; closeConn c `Exception.catch` (\_ -> return ())
       ; writeIORef (getRef ref) ConnClosed
       }
    where
        -- Be kind to peer & close gracefully.
        closeConn (ConnClosed) = return ()
        closeConn (MkConn sk addr [] _) =
            do { shutdown sk ShutdownSend
               ; suck ref
               ; shutdown sk ShutdownReceive
               ; sClose sk
               }

        suck :: Connection -> IO ()
        suck cn = readLine cn >>= 
                  either (\_ -> return ()) -- catch errors & ignore
                         (\x -> if null x then return () else suck cn)

The Exception.catch will trap any exception that occurs during closeConn, but that doesn't mean that the closeConn function has completed fully and closed the socket with sClose.
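
The difference between one handler around a whole sequence and one handler per step can be shown without any sockets. In this sketch the three steps only append to a log, and the middle one throws. All names are invented for the illustration, and it uses the current Control.Exception API (with SomeException) rather than the one in GHC 6.8.

```haskell
import Control.Exception
import Data.IORef

-- Three cleanup steps; the second one throws.
steps :: IORef [String] -> [IO ()]
steps logRef =
  [ record "shutdown-send"
  , throwIO (userError "Socket is not connected")
  , record "close"
  ]
  where record s = modifyIORef logRef (++ [s])

-- Swallow any exception (broad on purpose, to mirror the code in question).
ignore :: SomeException -> IO ()
ignore _ = return ()

-- One catch around the whole sequence: the throw skips every later step.
catchAround :: IO [String]
catchAround = do
  ref <- newIORef []
  sequence_ (steps ref) `catch` ignore
  readIORef ref

-- One catch per step: a failing step no longer prevents the later ones.
catchEach :: IO [String]
catchEach = do
  ref <- newIORef []
  mapM_ (`catch` ignore) (steps ref)
  readIORef ref
```

Running catchAround leaves only "shutdown-send" in the log, while catchEach also reaches "close", which is the step that corresponds to sClose.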

Here's a quick experiment to confirm my suspicions. First, some finer grained exception trapping and reporting:

close ref = 
    do { c <- readIORef (getRef ref)
       ; closeConn c `Exception.catch` (flag "0")
       ; writeIORef (getRef ref) ConnClosed
       }
    where
        -- Be kind to peer & close gracefully.
        closeConn (ConnClosed) = return ()
        closeConn (MkConn sk addr [] _) =
            do { ( shutdown sk ShutdownSend >> suck ref) `Exception.catch` (flag "1")
               ; shutdown sk ShutdownReceive `Exception.catch` (flag "2")
               ; sClose sk
               }

        flag s e = print $ s ++ ":" ++ show e

        suck :: Connection -> IO ()
        suck cn = readLine cn >>= 
                  either (\_ -> return ()) -- catch errors & ignore
                         (\x -> if null x then return () else suck cn)

And a ghci session to try it out:

$ cd ~/work/haskell-http/HTTP-3001.0.4
$ ghci Network.HTTP
GHCi, version 6.8.2: http://www.haskell.org/ghc/  :? for help
[.. startup stuff ...]
*Network.HTTP> :set prompt "> "
> :m + Data.Maybe Network.URI
> let teh_goog = fromJust $ parseURI "http://www.google.com"
> resp <- simpleHTTP $ Request teh_goog GET [] ""
"2:shutdown: invalid argument (Socket is not connected)"
Right HTTP/1.1 200 OK 
Cache-Control: private
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=[...]:TM=[...]:LM=[...]:S=[...]; \
  expires=Sun, 07-Feb-2010 05:02:22 GMT; path=/; domain=.google.com
Server: gws
Transfer-Encoding: chunked
Date: Fri, 08 Feb 2008 05:02:22 GMT
Connection: Close
Content-Length: 5367


Aha: the first line of output, tagged 2, shows that shutdown sk ShutdownReceive throws, which means the original code was exiting before closing the socket with sClose. Without the change to trap the exceptions within the closeConn function, the connection to Google would have remained open, but now:

> :! lsof | grep ghc | grep goog
>

With a little clean-up, we've got a patch ready to submit:

close ref = 
    do { c <- readIORef (getRef ref)
       ; closeConn c
       ; writeIORef (getRef ref) ConnClosed
       }
    where
        -- Be kind to peer & close gracefully.
        closeConn (ConnClosed) = return ()
        closeConn (MkConn sk addr [] _) =
            mapM_ (flip Exception.catch $ \_ -> return ())
                  [ shutdown sk ShutdownSend
                  , suck ref
                  , shutdown sk ShutdownReceive
                  , sClose sk ]

        suck :: Connection -> IO ()
        suck cn = readLine cn >>= 
                  either (\_ -> return ()) -- catch errors & ignore
                         (\x -> if null x then return () else suck cn)
(comment bubbles) 2 comments
