Conditional GET Support for perpubplat

Paul R. Brown @ 2008-03-05T08:34:12Z

As part of being a good netizen, I added conditional GET support (per section 9.3 of the HTTP 1.1 spec) to perpubplat. Generated Atom feeds now carry ETag (MD5 of the feed URI and last-modified date) and Last-Modified headers, and requests for Atom feeds honor the corresponding If-None-Match and If-Modified-Since headers with proper precedence. (For precedence, the spec dictates that when an If-None-Match header is present and none of its entity tags match, any If-Modified-Since header is ignored.) A quick curl experiment shows that things appear to work:

$ curl -i -s -o - http://mult.ifario.us/f/t/haskell/atom.xml | \
> egrep \^ETag\|\^Last-Modified\|\^HTTP\/
HTTP/1.1 200 OK
Last-Modified: Sat, 23 Feb 2008 23:55:23 GMT
ETag: 78790a6a7d6bddd10f6f9c412f2aba97
$ curl -H 'If-None-Match: 78790a6a7d6bddd10f6f9c412f2aba97' \
> -i -s -o - http://mult.ifario.us/f/t/haskell/atom.xml | \
> egrep \^HTTP\/
HTTP/1.1 304 Not Modified
$ curl -H 'If-Modified-Since: Sat, 23 Feb 2008 23:55:23 GMT' \
> -i -s -o - http://mult.ifario.us/f/t/haskell/atom.xml | \
> egrep \^HTTP\/
HTTP/1.1 304 Not Modified
$ curl -H 'If-None-Match: foo' -H 'If-Modified-Since: Sat, 23 Feb 2008 23:55:23 GMT' \
> -i -s -o - http://mult.ifario.us/f/t/haskell/atom.xml | \
> egrep \^HTTP\/
HTTP/1.1 200 OK
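
The precedence rule exercised by the four requests above can be sketched as a pure function. This is a minimal illustration and not perpubplat's actual code: the Validators and Conditions records, the decide function, and the use of epoch seconds in place of HTTP dates are all assumptions made for the example.

```haskell
-- Hypothetical model of the validators the server holds for a feed.
data Validators = Validators
  { currentETag  :: String   -- e.g. an MD5 hex digest
  , lastModified :: Integer  -- seconds since the epoch (assumption)
  }

-- Hypothetical model of the conditional headers on a request.
data Conditions = Conditions
  { ifNoneMatch     :: Maybe String
  , ifModifiedSince :: Maybe Integer
  }

data Outcome = NotModified | SendFull
  deriving (Eq, Show)

-- A mismatched If-None-Match forces a full response and causes
-- If-Modified-Since to be ignored; otherwise the date decides.
decide :: Validators -> Conditions -> Outcome
decide v c =
  case ifNoneMatch c of
    Just tag
      | tag /= currentETag v -> SendFull
      | otherwise            -> byDate
    Nothing -> case ifModifiedSince c of
                 Nothing -> SendFull
                 Just _  -> byDate
  where
    byDate = case ifModifiedSince c of
      Just since | lastModified v > since -> SendFull
      _                                   -> NotModified
```

With a matching ETag or a current If-Modified-Since date the function yields NotModified (the 304 cases), and with a mismatched ETag it yields SendFull even when the date alone would have allowed a 304.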

A Short Adventure with simpleHTTP

Paul R. Brown @ 2008-02-08T05:51:28Z

I'll blog separately about adding support for some JavaScript-free del.icio.us and Flickr chrome to perpubplat (look at the sidebar and the bottom of entries in the detail view), but like the experiment with the del.icio.us JSON API, it had an interesting (to me at least) and unexpected turn with the Haskell Network.HTTP library not closing its connections on my development box and laptop but running just fine on a Linux server.

Too Many Open Files?

I fired up a local build of perpubplat and let it run for a bit. The hit rate was a little more aggressive than del.icio.us was happy with (999 response codes), so I tapered it back and let the service run for a while. After a couple of hours, the service crashed with a "too many open files" exception, and the only thing that could mean was that the connections to del.icio.us weren't getting closed properly. A quick restart, a little wait, and there are a bunch of open connections to del.icio.us hanging around; here's a representative pair of open connections:

$ sudo lsof -u _www | grep perpubpla
[...]
perpubpla 52499 _www    6u  IPv4 0xa642e64        0t0      TCP \
  coresaplenty:49364->badges1.del.vip.re1.yahoo.net:http (LAST_ACK)
[...]
perpubpla 52499 _www   22u  IPv4 0x9fa7270        0t0      TCP \
  coresaplenty:49457->badges1.del.vip.re1.yahoo.net:http (TIME_WAIT)
[...]

(My local Apache2 runs as the _www user.) A quick check showed that the number is steadily increasing:

$ while true; do sudo lsof -u _www | grep del.vip | wc -l; sleep 10; done
[...]
80
83
[...]
150
152
[...]

My first thought was that I was being overly lazy in the code that connects to the remote services. I added some strictness annotations in strategic places (enough to ensure that the response body was fully read), tinkered with relevant HTTP headers (e.g., Connection: close), and turned on debugging in the Network.HTTP library, which just reported that it was closing streams.

I polled folks on #haskell, but it didn't appear that others were having the same issue.

Detour into TCP

Rather than suspect the Network.HTTP library (which is where the problem was), the hung connections led me to initially suspect the network layer, and I dug into the TCP exchange. Here's a state chart from RFC 793 that provides a non-normative explanation of proper TCP behavior:

                              +---------+ ---------\      active OPEN  
                              |  CLOSED |            \    -----------  
                              +---------+<---------\   \   create TCB  
                                |     ^              \   \  snd SYN    
                   passive OPEN |     |   CLOSE        \   \           
                   ------------ |     | ----------       \   \         
                    create TCB  |     | delete TCB         \   \       
                                V     |                      \   \     
                              +---------+            CLOSE    |    \   
                              |  LISTEN |          ---------- |     |  
                              +---------+          delete TCB |     |  
                   rcv SYN      |     |     SEND              |     |  
                  -----------   |     |    -------            |     V  
 +---------+      snd SYN,ACK  /       \   snd SYN          +---------+
 |         |<-----------------           ------------------>|         |
 |   SYN   |                    rcv SYN                     |   SYN   |
 |   RCVD  |<-----------------------------------------------|   SENT  |
 |         |                    snd ACK                     |         |
 |         |------------------           -------------------|         |
 +---------+   rcv ACK of SYN  \       /  rcv SYN,ACK       +---------+
   |           --------------   |     |   -----------                  
   |                  x         |     |     snd ACK                    
   |                            V     V                                
   |  CLOSE                   +---------+                              
   | -------                  |  ESTAB  |                              
   | snd FIN                  +---------+                              
   |                   CLOSE    |     |    rcv FIN                     
   V                  -------   |     |    -------                     
 +---------+          snd FIN  /       \   snd ACK          +---------+
 |  FIN    |<-----------------           ------------------>|  CLOSE  |
 | WAIT-1  |------------------                              |   WAIT  |
 +---------+          rcv FIN  \                            +---------+
   | rcv ACK of FIN   -------   |                            CLOSE  |  
   | --------------   snd ACK   |                           ------- |  
   V        x                   V                           snd FIN V  
 +---------+                  +---------+                   +---------+
 |FINWAIT-2|                  | CLOSING |                   | LAST-ACK|
 +---------+                  +---------+                   +---------+
   |                rcv ACK of FIN |                 rcv ACK of FIN |  
   |  rcv FIN       -------------- |    Timeout=2MSL -------------- |  
   |  -------              x       V    ------------        x       V  
    \ snd ACK                 +---------+delete TCB         +---------+
     ------------------------>|TIME WAIT|------------------>| CLOSED  |
                              +---------+                   +---------+

Without any additional digging, the fact that the connections appeared to be hung in either of the final two states (TIME_WAIT or LAST_ACK) should have alerted me to the actual problem, as should the fact that the connection in the TIME_WAIT state failed to time out even after a long period of time (e.g., 30 minutes). But I didn't see it yet, so I kept digging.

Here, captured with tcpdump and Wireshark, is the interaction that hangs in the LAST_ACK state:

perpubpla 52499 _www    6u  IPv4 0xa642e64        0t0      TCP \
  coresaplenty:49364->badges1.del.vip.re1.yahoo.net:http (LAST_ACK)
|29.402   |         SYN       |                   |Seq = 0 Ack = 3900749235
|         |(49364)  ------------------>  (80)     |
|29.500   |         SYN, ACK  |                   |Seq = 0 Ack = 1
|         |(49364)  <------------------  (80)     |
|29.500   |         ACK       |                   |Seq = 1 Ack = 1
|         |(49364)  ------------------>  (80)     |
|29.501   |         PSH, ACK - Len: 120           |Seq = 1 Ack = 1
|         |(49364)  ------------------>  (80)     |
|29.620   |         PSH, ACK - Len: 373           |Seq = 1 Ack = 121
|         |(49364)  <------------------  (80)     |
|29.620   |         PSH, ACK - Len: 5             |Seq = 374 Ack = 121
|         |(49364)  <------------------  (80)     |
|29.620   |         FIN, ACK  |                   |Seq = 379 Ack = 121
|         |(49364)  <------------------  (80)     |
|29.620   |         ACK       |                   |Seq = 121 Ack = 374
|         |(49364)  ------------------>  (80)     |
|29.620   |         ACK       |                   |Seq = 121 Ack = 379
|         |(49364)  ------------------>  (80)     |
|29.620   |         ACK       |                   |Seq = 121 Ack = 380
|         |(49364)  ------------------>  (80)     |
|29.621   |         FIN, ACK  |                   |Seq = 121 Ack = 380
|         |(49364)  ------------------>  (80)     |
|29.722   |         ACK       |                   |Seq = 380 Ack = 122
|         |(49364)  <------------------  (80)     |

According to the statechart and the RFC, this looks acceptable.
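
As a cross-check, the teardown half of the state chart can be encoded as a small transition function and fed the events from the capture. This is only a sketch: the State and Event types and the step and run functions are names invented for the example, and it reads the capture as a passive close from the local side (the peer's FIN arrives first).

```haskell
import Control.Monad (foldM)

-- Connection states on the teardown side of the RFC 793 chart.
data State = Estab | FinWait1 | FinWait2 | Closing | TimeWait
           | CloseWait | LastAck | Closed
  deriving (Eq, Show)

-- Events that drive the teardown transitions.
data Event = AppClose | RcvFin | RcvAckOfFin | Timeout2MSL
  deriving (Eq, Show)

-- One transition; Nothing means the chart has no such edge.
step :: State -> Event -> Maybe State
step Estab     AppClose    = Just FinWait1
step Estab     RcvFin      = Just CloseWait
step FinWait1  RcvAckOfFin = Just FinWait2
step FinWait1  RcvFin      = Just Closing
step FinWait2  RcvFin      = Just TimeWait
step Closing   RcvAckOfFin = Just TimeWait
step TimeWait  Timeout2MSL = Just Closed
step CloseWait AppClose    = Just LastAck
step LastAck   RcvAckOfFin = Just Closed
step _         _           = Nothing

-- Replay a sequence of events from a starting state.
run :: State -> [Event] -> Maybe State
run = foldM step
```

Replaying run Estab [RcvFin, AppClose, RcvAckOfFin] ends in Closed, which matches the packets on the wire: the exchange itself completes, so whatever keeps the descriptor listed as LAST_ACK is not a missing packet.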

As for the one that hangs in the TIME_WAIT state:

perpubpla 52499 _www   22u  IPv4 0x9fa7270        0t0      TCP \
  coresaplenty:49457->badges1.del.vip.re1.yahoo.net:http (TIME_WAIT)
|189.205  |         SYN       |                   |Seq = 0 Ack = 1062078327
|         |(49457)  ------------------>  (80)     |
|189.308  |         SYN, ACK  |                   |Seq = 0 Ack = 1
|         |(49457)  <------------------  (80)     |
|189.308  |         ACK       |                   |Seq = 1 Ack = 1
|         |(49457)  ------------------>  (80)     |
|189.308  |         PSH, ACK - Len: 120           |Seq = 1 Ack = 1
|         |(49457)  ------------------>  (80)     |
|189.507  |         ACK       |                   |Seq = 1 Ack = 121
|         |(49457)  <------------------  (80)     |
|189.529  |         PSH, ACK - Len: 373           |Seq = 1 Ack = 121
|         |(49457)  <------------------  (80)     |
|189.529  |         PSH, ACK - Len: 5             |Seq = 374 Ack = 121
|         |(49457)  <------------------  (80)     |
|189.529  |         FIN, ACK  |                   |Seq = 379 Ack = 121
|         |(49457)  <------------------  (80)     |
|189.529  |         ACK       |                   |Seq = 121 Ack = 374
|         |(49457)  ------------------>  (80)     |
|189.529  |         ACK       |                   |Seq = 121 Ack = 379
|         |(49457)  ------------------>  (80)     |
|189.529  |         ACK       |                   |Seq = 121 Ack = 379
|         |(49457)  ------------------>  (80)     |
|189.529  |         FIN, ACK  |                   |Seq = 121 Ack = 379
|         |(49457)  ------------------>  (80)     |
|189.529  |         FIN, ACK  |                   |Seq = 121 Ack = 380
|         |(49457)  ------------------>  (80)     |
|189.632  |         ACK       |                   |Seq = 380 Ack = 122
|         |(49457)  <------------------  (80)     |

As with the connection hung in the LAST_ACK state, this also looks to be within the behavior prescribed by the statechart and RFC.

This issue occurs both on my primary development box at home and on my laptop, both Intel hardware with Mac OS X 10.5.1, but on the (virtual) server that hosts this blog (Linux 2.6 kernel), the same code, libraries, and GHC version exhibit no problems.

The Actual Problem

Finger-tracing the code for the simpleHTTP function in the Network.HTTP module leads down into the Network.TCP module, and when I finally read the code, I kicked myself for the detour with packet sniffing. Here is the full text of the close function for the TCP stream; the issue is the Exception.catch wrapped around the entire closeConn call:

close ref = 
    do { c <- readIORef (getRef ref)
       ; closeConn c `Exception.catch` (\_ -> return ())
       ; writeIORef (getRef ref) ConnClosed
       }
    where
        -- Be kind to peer & close gracefully.
        closeConn (ConnClosed) = return ()
        closeConn (MkConn sk addr [] _) =
            do { shutdown sk ShutdownSend
               ; suck ref
               ; shutdown sk ShutdownReceive
               ; sClose sk
               }

        suck :: Connection -> IO ()
        suck cn = readLine cn >>= 
                  either (\_ -> return ()) -- catch errors & ignore
                         (\x -> if null x then return () else suck cn)

The Exception.catch will trap any exception that occurs during closeConn, but trapping the exception doesn't mean that closeConn ran to completion and closed the socket with sClose: an exception in an earlier step skips every step after it.

Here's a quick experiment to confirm my suspicions. First, some finer grained exception trapping and reporting:

close ref = 
    do { c <- readIORef (getRef ref)
       ; closeConn c `Exception.catch` (flag "0")
       ; writeIORef (getRef ref) ConnClosed
       }
    where
        -- Be kind to peer & close gracefully.
        closeConn (ConnClosed) = return ()
        closeConn (MkConn sk addr [] _) =
            do { ( shutdown sk ShutdownSend >> suck ref) `Exception.catch` (flag "1")
               ; shutdown sk ShutdownReceive `Exception.catch` (flag "2")
               ; sClose sk
               }

        flag s e = print $ s ++ ":" ++ show e

        suck :: Connection -> IO ()
        suck cn = readLine cn >>= 
                  either (\_ -> return ()) -- catch errors & ignore
                         (\x -> if null x then return () else suck cn)

And a ghci session to try it out:

$ cd ~/work/haskell-http/HTTP-3001.0.4
$ ghci Network.HTTP
GHCi, version 6.8.2: http://www.haskell.org/ghc/  :? for help
[.. startup stuff ...]
*Network.HTTP> :set prompt "> "
> :m + Data.Maybe Network.URI
> let teh_goog = fromJust $ parseURI "http://www.google.com"
> resp <- simpleHTTP $ Request teh_goog GET [] ""
"2:shutdown: invalid argument (Socket is not connected)"
Right HTTP/1.1 200 OK 
Cache-Control: private
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=[...]:TM=[...]:LM=[...]:S=[...]; \
  expires=Sun, 07-Feb-2010 05:02:22 GMT; path=/; domain=.google.com
Server: gws
Transfer-Encoding: chunked
Date: Fri, 08 Feb 2008 05:02:22 GMT
Connection: Close
Content-Length: 5367


Aha: the flagged line ("2:shutdown: invalid argument (Socket is not connected)") shows that shutdown sk ShutdownReceive throws, so the original code was exiting before closing the socket with sClose. Without the change to trap the exceptions within the closeConn function, the connection to Google would have remained open, but now:

> :! lsof | grep ghc | grep goog
>

With a little clean-up, we've got a patch ready to submit:

close ref = 
    do { c <- readIORef (getRef ref)
       ; closeConn c
       ; writeIORef (getRef ref) ConnClosed
       }
    where
        -- Be kind to peer & close gracefully.
        closeConn (ConnClosed) = return ()
        closeConn (MkConn sk addr [] _) =
            mapM_ (flip Exception.catch $ \_ -> return ())
                  [ shutdown sk ShutdownSend
                  , suck ref
                  , shutdown sk ShutdownReceive
                  , sClose sk ]

        suck :: Connection -> IO ()
        suck cn = readLine cn >>= 
                  either (\_ -> return ()) -- catch errors & ignore
                         (\x -> if null x then return () else suck cn)
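
The difference between one catch around the whole sequence and one catch per step can be reproduced without any sockets. The sketch below is an illustration only: steps, runWrapped, and runEach are hypothetical names, the four actions are stand-ins for shutdown, suck, shutdown, and sClose, and it is written against the current Control.Exception API, where the handler needs a concrete exception type.

```haskell
import Control.Exception (ErrorCall (..), SomeException, catch, throwIO)
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- Stand-ins for the four teardown steps; the third one throws,
-- as shutdown sk ShutdownReceive did in the ghci session.
steps :: IORef [String] -> [IO ()]
steps logRef =
  [ record "shutdown-send"
  , record "drain"
  , throwIO (ErrorCall "Socket is not connected")
  , record "close"
  ]
  where
    record s = modifyIORef logRef (++ [s])

-- Swallow any exception.
ignore :: SomeException -> IO ()
ignore _ = return ()

-- One handler around the whole sequence: the throw skips "close".
runWrapped :: IO [String]
runWrapped = do
  logRef <- newIORef []
  sequence_ (steps logRef) `catch` ignore
  readIORef logRef

-- One handler per step: "close" still runs after the failure.
runEach :: IO [String]
runEach = do
  logRef <- newIORef []
  mapM_ (`catch` ignore) (steps logRef)
  readIORef logRef
```

runWrapped records only the first two steps, while runEach also records "close". Where later steps need not be attempted individually, Control.Exception's finally or bracket is another way to guarantee that the final close runs.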

Haskell, del.icio.us, and JSON

Paul R. Brown @ 2008-01-27T00:35:15Z

I'd like to add both a sidebar with my bookmarks and some per-entry chrome for posts bookmarked on del.icio.us, but I don't want to use client-side JavaScript to do it. The alternative is to pull, cache, and manage the data on the server side. As a prototype, I whipped up a simple Haskell program that uses the del.icio.us JSON APIs (for posts and for URLs), and it contained a couple of surprising detours.

Some Haskell

First up, some Haskell. After going shopping on Hackage, I installed Network.HTTP, Thomas DuBuisson's pureMD5 package, and the JSON package from Masahiro Sakai and Jun Mukai (cabalized version is here). Like all code that builds on a decent set of libraries, the Haskell code to hit del.icio.us is straightforward; full source is here, so I'll just post some fragments below to give a flavor of the code.

Create a structure to hold the data:

data DeliciousBookmark = DeliciousBookmark { bookmark_url :: String
                                           , description :: String
                                           , tags :: [String] }
                         deriving ( Show, Eq, Ord )

Build the request:

bookmarks_fragment :: String
bookmarks_fragment = "http://del.icio.us/feeds/json/"

request_for_bookmarks :: String -> Request
request_for_bookmarks user = Request ( fromJust . parseURI $
                                       bookmarks_fragment ++ user ++ "?raw" )
                             GET [] ""

Send it:

fetch_bookmarks :: String -> IO [DeliciousBookmark]
fetch_bookmarks user = do { res <- simpleHTTP . request_for_bookmarks $ user
                          ; case res of
                              Right (Response (2,0,0) _ _ body) ->
                                  return $ process_bookmarks_body body
                          }

And then parse and walk through the response body:

parse_crufty_json :: String -> J.Value
parse_crufty_json = parse_json . unescape . utf8_decode
    where
      parse_json = \s -> case (parse J.json "" s) of
                           Left err -> error . show $ err
                           Right v -> v

process_bookmarks_body :: String -> [DeliciousBookmark]
process_bookmarks_body body =
    case parse_crufty_json body of
      J.Array a ->
          map (process_bookmark . uno) a

process_bookmark :: M.Map String J.Value -> DeliciousBookmark
process_bookmark m =
    DeliciousBookmark { bookmark_url = uns $ M.findWithDefault blank "u" m
                      , description = uns $ M.findWithDefault blank "d" m 
                      , tags = map uns $ una $ M.findWithDefault empty_array "t" m }

blank = J.String ""
empty_array = J.Array []
uno (J.Object o) = o
una (J.Array a) = a
uns (J.String s) = s

And that's all there is to it, except that — as might be expected from the parse_crufty_json function — there were a few things that didn't work on the first pass.

Bytes and Characters

The first wrinkle I ran into with the simple del.icio.us client occurred in process_bookmarks_body. The Haskell String that comes from the HTTP response structure is just a straight conversion of the response body from bytes to character ordinals. This is all well and good if the body is encoded in ISO-8859-1, but it's fraught with peril otherwise. The del.icio.us service sends back UTF-8 (and ignores an Accept-Charset header instead of either returning a correctly encoded response or a 406 response code), so any interesting characters will cause problems. In this case, what should be Solutoire.com \8250 Plotr is coming through as Solutoire.com \226\128\186 Plotr. Writing a decoder is no big deal and an opportunity to play a quick round of golf.
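
For illustration, here is one way such a decoder might look. It is a sketch rather than the utf8_decode used in the program above: the name utf8Decode is an assumption, the input is taken to be a String whose characters are byte ordinals (0 to 255), malformed sequences become U+FFFD, and overlong encodings are not rejected.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)

-- Decode a String of byte ordinals as UTF-8.
utf8Decode :: String -> String
utf8Decode [] = []
utf8Decode (c:cs)
  | b < 0x80              = c : utf8Decode cs        -- plain ASCII
  | b >= 0xC0 && b < 0xE0 = multi 1 (b .&. 0x1F)     -- 2-byte sequence
  | b >= 0xE0 && b < 0xF0 = multi 2 (b .&. 0x0F)     -- 3-byte sequence
  | b >= 0xF0 && b < 0xF8 = multi 3 (b .&. 0x07)     -- 4-byte sequence
  | otherwise             = '\xFFFD' : utf8Decode cs -- stray byte
  where
    b = ord c
    -- Consume n continuation bytes and assemble the code point.
    multi n lead =
      let (conts, rest) = splitAt n cs
          cp            = foldl addBits lead conts
      in if length conts == n && all isCont conts && cp <= 0x10FFFF
           then chr cp : utf8Decode rest
           else '\xFFFD' : utf8Decode cs
    isCont x = ord x .&. 0xC0 == 0x80
    addBits acc x = (acc `shiftL` 6) .|. (ord x .&. 0x3F)
```

Applied to "Solutoire.com \226\128\186 Plotr", the three bytes 226, 128, 186 collapse into the single character \8250.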

In terms of making HTTP in Haskell better, there was apparently a Google SoC project proposed to integrate cURL via FFI and Haskell's ByteString API, but it doesn't look like anything's come of it.

RFC-compliant JSON versus Works For Me in JavaScript

The second wrinkle with the simple del.icio.us client is more pernicious. After I resolved the string encoding issues, I started getting errors of the form:

parse error at (line 1, column 1552):
unexpected "'"
expecting "\"", "\\", "/", "b", "f", "n", "r", "t" or "u"

And sure enough, on inspection, there's an escaped apostrophe lurking in the JSON. This probably wouldn't bother a client who simply evaluated the JSON as literal JavaScript (which seems to be the intent of the API), but it's not legal JSON and the parser correctly signals an error.

The JSON grammar (per RFC 4627) permits a few escapes, and apostrophe is not among them. To wit:

         string = quotation-mark *char quotation-mark

         char = unescaped /
                escape (
                    %x22 /          ; "    quotation mark  U+0022
                    %x5C /          ; \    reverse solidus U+005C
                    %x2F /          ; /    solidus         U+002F
                    %x62 /          ; b    backspace       U+0008
                    %x66 /          ; f    form feed       U+000C
                    %x6E /          ; n    line feed       U+000A
                    %x72 /          ; r    carriage return U+000D
                    %x74 /          ; t    tab             U+0009
                    %x75 4HEXDIG )  ; uXXXX                U+XXXX

         escape = %x5C              ; \

         quotation-mark = %x22      ; "

         unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

Apostrophe is U+0027.

As with the UTF-8 issues, it's a quick job to implement a filter to scan for escaped apostrophes and unescape them, but it would be nice if what is advertised as JSON was actually JSON.
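
A filter along those lines might look like the following sketch. The name unescapeApostrophes is an assumption and this is not the unescape function from the program above; it rewrites \' to ' while stepping over \\ pairs, so a legitimately escaped backslash that happens to precede an apostrophe is left alone.

```haskell
-- Replace the non-JSON escape \' with a bare apostrophe.
unescapeApostrophes :: String -> String
unescapeApostrophes ('\\':'\\':rest) = '\\' : '\\' : unescapeApostrophes rest -- keep escaped backslash
unescapeApostrophes ('\\':'\'':rest) = '\'' : unescapeApostrophes rest        -- drop the bogus escape
unescapeApostrophes (c:rest)         = c : unescapeApostrophes rest
unescapeApostrophes []               = []
```

Because every other escape is passed through untouched, the output can go straight to the JSON parser.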


HTTP 100 Tidbit

Paul Brown @ 2007-11-16T18:35:46Z

Before Dan's presentation, I hadn't heard of 100 (Continue), but then I ran into it today. A client had an issue with a .NET client talking to a Jetty instance, and the conversation went something like this:

POST /uri/foo HTTP/1.1
[...]
Expect: 100-continue
Connection: Close

100 Continue
[Jetty closes the connection.]

It turns out that this is an issue with the way that Jetty handles the Connection: Close header. Even though it seems reasonable to drop the connection according to 14.10 (if you think of the 100 Continue as a response), dropping the connection violates 8.2.3:

Upon receiving a request which includes an Expect request-header field with the "100-continue" expectation, an origin server MUST either respond with 100 (Continue) status and continue to read from the input stream, or respond with a final status code.

Experiences like this are the reason that I always smile when someone tells me that they've built their own HTTP client or server implementation; it's not that simple.


Feedburnerizing Typo, Part II

Paul Brown @ 2006-07-03T19:49:34Z

Last year, I wrote a rudimentary sidebar to display Feedburner feed links in Typo, but I didn't really get it to the point I wanted at the time. So, I took another fifteen minutes to rewrite the sidebar to work with the enhanced API, ditch the auto-subscribe chiclets, add links for category feeds, and muck with routes.rb. In routes.rb, I mapped a new set of feed URLs for Feedburner onto the controller that currently serves feeds, switched the existing mappings to a two-line controller that 301's to the Feedburner equivalents, and left holes so that people can subscribe directly to article-specific or tag-specific feeds if they wish. (The bonus in this approach is that autodiscovery gets taken care of for free, because the autodiscovery feed is one that gets 301'd.)

Just for grins, here's the two-line controller implementation:

class FbController < ContentController
  def redirect   
    headers["Status"] = "301 Moved Permanently"
    redirect_to "http://feeds.feedburner.com/Multifarious" +
      params[:type].to_s.capitalize + params[:id].to_s.capitalize
  end
end

Sometimes I think that the cornucopia of methods on some of the Ruby core classes (like capitalize on String) is overkill, and sometimes, it's convenient.

I hope that the enhanced setup is useful to any readers (since Feedburner should ensure QoS), but mostly I hope that it's transparent. (FWIW, NetNewsWire did the Right Thing and changed the feed URL for my self-subscription to the new one in response to the 301.) If for some reason you can't see this, let me know...


WordPress to Typo Migration, Part II

Paul Brown @ 2005-12-10T04:45:00Z

The initial migration (and a subsequent upgrade from 2.6.0 to svn trunk) was pretty much painless, but the database migration didn't take care of mapping permalinks or date queries from the WordPress scheme to the Typo scheme. Enter a little mod_rewrite and Ruby (at which I'm a newbie).

The first step is to grab the query string on the old server. For example, to redirect a WordPress-style permalink like:

http://oldblog/?p=69

to a new entry point in Typo like:

http://newblog/wp/69

the required bit of mod_rewrite script is:

RewriteCond %{QUERY_STRING} p=([^&;]+)
RewriteRule ^/$ http://newblog/wp/%1? [R=301,L]

(The trailing ? drops the query string in the redirected URL, and back references to groups captured in a RewriteCond use % in place of $.) The 301 response code is "moved permanently", so well-behaved clients should get the idea. The same technique applies to the query string-defined syndication protocols that WordPress uses for the RSS and Atom feeds.
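
To make the mapping concrete, here is the same transformation sketched as a pure Haskell function. It is an approximation of the rule, not a replacement for it: splitParams and redirectFor are names invented for the example, and unlike the unanchored RewriteCond pattern, the sketch only accepts a parameter named exactly p.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (listToMaybe, mapMaybe)

-- Split a query string on the separators excluded by [^&;]+.
splitParams :: String -> [String]
splitParams s = case break (`elem` "&;") s of
  (p, [])     -> [p]
  (p, _:rest) -> p : splitParams rest

-- Build the redirect target from the first p= parameter, if any.
-- The query string itself is dropped, as with the trailing ?.
redirectFor :: String -> Maybe String
redirectFor query =
  case listToMaybe (mapMaybe (stripPrefix "p=") (splitParams query)) of
    Just v | not (null v) -> Just ("http://newblog/wp/" ++ v)
    _                     -> Nothing
```

Here redirectFor "p=69" produces the /wp/69 URL shown above, and a query string without a p parameter produces Nothing, which corresponds to the RewriteCond not matching.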

The next bit of work in Ruby is a bit painful because of the way that the database migration script maps the IDs. (If I had been in the mood, I could have modified the migration script to dump an ID cross-reference, but I wasn't and didn't.) The first piece is a new route in config/routes.rb:

map.connect 'wp/:wpid',
  :controller => 'articles', :action => 'wp'

And then a bit of Ruby in app/controllers/articles_controller.rb:

def wp
  begin
    wpid = params[:wpid].to_i
    case wpid
      when (109..109) then wpid = 97
      when (100..101) then wpid -= 15
      # etc.
    end
    # imitate the "read" method here...
  end
end

And that should do it. (So far, the most difficult part of Ruby is not typing a ; at the end of a line...)
