HTTP 100 Tidbit

Paul Brown @ 2007-11-16T18:35:46Z

Before Dan's presentation, I hadn't heard of 100 (Continue), but then I ran into it today. A client had an issue with a .NET client talking to a Jetty instance, and the conversation went something like this:

POST /uri/foo HTTP/1.1
[...]
Expect: 100-continue
Connection: Close

100 Continue
[Jetty closes the connection.]

It turns out that this is an issue with the way that Jetty handles the Connection: Close header. Even though it seems reasonable to drop the connection according to 14.10 (if you think of the 100 Continue as a response), dropping the connection violates 8.2.3:

Upon receiving a request which includes an Expect request-header field with the "100-continue" expectation, an origin server MUST either respond with 100 (Continue) status and continue to read from the input stream, or respond with a final status code.

Experiences like this are the reason that I always smile when someone tells me that they've built their own HTTP client or server implementation; it's not that simple.

(comment bubbles) 0 comments

Tyranny of the Average

Paul Brown @ 2007-11-04T01:00:20Z

Steve Vinoski, building on comments that Steve Jones made about comments from Stefan Tilkov, says:

On the dynamic language front, the worst code I have seen in my career has always, always, always been in compiled imperative languages, most often Java and C++.

From my perspective, I've had the privilege of seeing some really awful code and design. The language has always been C/C++ or Java, since that's what virtually all large systems are built in and where there are tools to help people get things written and built. The root cause for the bad code and design has always been relative ignorance on the part of the developers. The use of Ruby or Python wouldn't have changed things for the better, while the use of Haskell or Erlang might, purely because I doubt that the developers in question would have been able to get the Haskell or Erlang programs to run...

It shouldn't surprise anyone who took statistics in college that once enough people start using a language, crappy software appears, independent of the language. With a few liberties taken with respect to hypotheses, it's a consequence of the central limit theorem, and this tyranny of the merely average is absolute and inescapable. It doesn't mean software is any more or less doomed than the rest of our society; rather, it is exactly as doomed as software's reach broadens.

(comment bubbles) 5 comments

Tuppence Tour of Haskell Concurrency Constructs

Paul Brown @ 2007-10-21T05:04:00Z

One of the little widgets that I need for an experiment is a sequence number generator, and it's a good way to get a tuppence (i.e., less than half a nickel) tour of Haskell's concurrency constructs with a little lesson on laziness thrown in for spice.

Requirements

The generator should produce unique Int values on demand and support concurrent access, and I want to try out a couple of methods, one that uses GHC's baseline concurrency capabilities and one that uses STM. (Also, I'd like a better feel for the concepts in practice, so this makes a good exercise!)

Take One: Asynchronous Channels

The base GHC concurrency packages (Control.Concurrent and its descendents) include a great set of buildings blocks: one-place buffers (MVar), asynchronous channels (Chan) that can be combined into one-to-many broadcast channels, and quantity semaphores (QSem and QSemN).

The design I have in mind uses two asynchronous channels, one for requests and one for responses. All clients of the generator receive responses on the one channel, which means that one might get the number that another one requested, but that's no big deal.

First, one data structure for the requests and one for the client view of the generator:

import Control.Concurrent ( ThreadId, forkIO )
import Control.Concurrent.Chan ( Chan, newChan, writeChan, readChan )

data Request = Get
             | Set { initial_value :: Int }
             | Die

data Generator = Generator { thread_id :: ThreadId,
                             request_channel :: Chan Request, 
                             response_channel :: Chan Int }

Then some functions (whose signatures I'll match with the STM version below) to manipulate the generator:

reset :: Generator -> Int -> IO ()
reset g i = writeChan (request_channel g) (Set i)

get_next :: Generator -> IO Int
get_next g = (writeChan (request_channel g) (Get)) >> 
             readChan (response_channel g)

stop :: Generator -> IO ()
stop g = writeChan (request_channel g) Die

Each client function is implemented in terms of sending the request data structure to the generator on the request_channel and then, in the case of the Get operation, blocking to read a value from the response_channel.

Finally, the request-handling event loop in a separate lightweight thread:

new_counter :: IO Generator
new_counter = do { in_chan <- newChan
                 ; out_chan <- newChan
                 ; tid <- forkIO $ loop in_chan out_chan 0
                 ; return $ Generator tid in_chan out_chan }

loop :: Chan Request -> Chan Int -> Int -> IO ()
loop ic oc i = do { req <- readChan ic
                  ; case req of 
                      Die -> return ()
                      (Set n) ->
                          (loop ic oc n)
                      Get ->
                          do { writeChan oc i
                             ; loop ic oc $! (i+1) } }

A similar pattern (request channel, response channel or one-place buffer (MVar) either pre-set or passed with the request, tail-recursive event loop) works for a wide variety of problems and presents a reasonable API for clients.

The single-threadedness of the loop makes it intuitively easy to conclude that it does the right thing (returns unique values to clients), but it's easy enough to check with some experiments in ghci:

> :m + Control.Concurrent
> g <- new_counter
> get_next g
0
> forkIO $ sequence_ $ replicate 10000000 (get_next g)
ThreadId 111
> forkIO $ sequence_ $ replicate 10000000 (get_next g)
ThreadId 112
> forkIO $ sequence_ $ replicate 10000000 (get_next g)
ThreadId 113
[... wait a while ...]
> get_next g
300000001

Which is the right answer. Another experiment will show how fair the scheduler is in terms of multiple client threads:

> g <- new_counter
> :set -fno-print-bind-result -fglasgow-exts

By the way, [TAB]-completion in ghci means that the above can be obtained by typing:

:set -fno-pr[TAB] -fgl[TAB]

The -fno-print-bind-result prevents ghci from spoiling our attempts to be lazy by printing (and thus evaluating), and the -fglasgow-exts lets us use a type specification to specify what kind of channel we're creating. (Normally, the compiler would figure it out from context, but that won't work in ghci.)

> tid::(Chan (Int,Int)) <- newChan
> mapM_ (\n -> (forkIO $ sequence_ $ replicate 1000 (get_next g >>= (writeChan tid).((,) n)))) [1..10]

The last expression looks dense, but it breaks down simply. In plain English, fork 10 threads numbered 1 through 10, and on each thread, do the following 1000 times: get a sequence number from the generator and then send the ordered pair of the threads number and the sequence number on the tid channel. (The (,) function makes ordered pairs out of its arguments.) To inspect the contents of the channel once the threads are done:

> x <- getChanContents tid

(Without the -fno-print-bind-result above, this would run forever.) And now a couple of quick checks:

> :m + List
> let y = take 10000 x
> length $ nub $ map snd y
10000
> [ length $ filter (((==) j).fst) y | j <- [1..10] ]
[1000,1000,1000,1000,1000,1000,1000,1000,1000,1000]
> let w = [ length $ nub (take 10 (drop k (map fst y))) | k <- [0..9990] ]
> [ length $ filter ((==) j) w | j <- [1..10] ]
[0,0,0,0,0,0,1,1,522,9467]

So, most of the time, in 10 turns, 10 client threads get a chance.

A Quick Word About Laziness

The $! is a piece of Haskell magic that ensures that the value on the right is evaluated. Without it, this happens:

> g <- new_counter
> get_next g
0
> sequence_ $ replicate 100000000 (get_next g)
> get_next g
*** Exception: stack overflow

The reason lies in Haskell's laziness. At the time that the second get_next is evaluated, there are 100,000,000 (i+1) queued-up and waiting to be evaluated because no one ever actually consumed the values passed back on the response channel. This is just the way that Haskell works: You can tell the runtime about a ridiculous computation, but it won't complain until you ask for the result. The $! ensures that the (i+1) value is evaluated each time Get is performed.

Take Two: TVar

Haskell's STM (software transactional memory; read/watch Simon Peyton Jones on the subject) implementation provides another set of building blocks in the form of atomically mutable cells (TVar), asynchronous channels (TChan), one-place buffers (TMVar), and arrays (TArray).

The sequence generator implementation with the same API but using a TVar to hold the current value is shorter and simpler (no backing thread) than the one above that uses channels:

import Control.Concurrent.STM
import Control.Concurrent.STM.TVar
import Control.Monad

data Generator = Generator { counter :: TVar Int }

new_counter :: Int -> IO Generator
new_counter i = (atomically $ newTVar i) >>= (return . Generator)

get_next :: Generator -> IO Int
get_next g = atomically $ do { n <- readTVar $ counter g
                             ; (writeTVar (counter g)) $! (n+1)
                             ; return n }

reset :: Generator -> Int -> IO ()
reset g n = atomically $ writeTVar (counter g) n

Note the $! that appears in get_next; it is there for the same reason as it appears in the Chan version.

The same set of basic verifications as above ends with:

> [ length $ filter ((==) j) w | j <- [1..10] ]
[9901,90,0,0,0,0,0,0,0,0]

Or, the 10 threads took blocks of 1000 numbers from the sequence because the scheduler had no reason to switch. For a slightly different spawning of clients with a yield added, we get more regular results:

> mapM_ (\n -> (forkIO $ sequence_ $ replicate 1000 (get_next g >>= (writeChan tid).((,) n) >> yield))) [1..10]
[... same steps ...]
[0,0,0,0,0,0,0,0,0,9991]
> (map fst y) == (concat $ replicate 1000 [1..10])
True

Conclusions

The STM version is probably the one that I'll use, but I also need some more complicated components where the channel-based design will work well.

(comment bubbles) 2 comments

The Real Lesson in the Sneetches

Paul Brown @ 2007-10-10T00:25:17Z

The Sneetches is the kid's superfavorite book right now, and I've read it at least twice per bedtime (excepting the story about the two Zaxes and the bit about 23 Daves) every night for the past week. The obvious message in the book is about superficial differences and tolerance.

Call me cynical or jaded from doing a tour of duty as an entrepreneur, but I see another message in the book:

Stars, schmars. You want to be Sylvester McMonkey McBean.
(comment bubbles) 0 comments

λ○λcats

Paul Brown @ 2007-10-09T14:21:57Z

This (seen via the Haskell cafe mailing list) nearly caused me to shoot coffee out my nose:

fillBelly = foldr (++) [] fridgeContents

(comment bubbles) 0 comments

Wiring Haskell into a FastCGI Web Server

Paul Brown @ 2007-10-03T01:37:00Z

Herein part six in my hobby project to rewrite my personal publishing software in Haskell. In part five (and its addendum), I roughed-out a persistence and concurrency model for the back-end. The next two pieces are rendering content (which will be done programmatically using the Text.XHtml.Strict module; that's a separate post) and integrating with a web server via FastCGI. This post covers FastCGI integration for Lighttpd and Apache2 in the form of smoke-testing a simple FastCGI handler.

Units of Concurrency

For the concurrency model that I plan to use in the actual application, a single OS process is critically important, as multiple processes wouldn't be aware of who was doing what within the other processes. Multiple active threads within that one process are fine. Most web-based systems use a single process as a concurrency pinch point, but that process is usually the database as opposed to the web application.

Haskell in the form of GHC provides two flavors of concurrency, which I'll refer to as forkIO and forkOS (after the functions forkIO and forkOS, respectively). The forkIO flavor uses Haskell's internally managed, lightweight threads, and the forkOS flavor uses threads from the underlying operating system. (For some perspective on what an OS thread really means, take a look at my post on SMP Erlang on Mac OS.) The FastCGI binding library provides a mechanism to use forkIO, forkOS, or some other mechanism to assign a worker thread to a request, and I want to compare the two fork flavors for performance and stability.

It's worth reading the fine print in the Control.Concurrent documentation. For the present application, every thread does make foreign calls as part of handling a FastCGI request and every request is likely to complete in less than Haskell scheduler's default granularity of 20ms. I'm less interested in performance and more interested in looking for leaks, deadlocks, or other bad behavior.

Building the Right Network.FastCGI

Both the 1.0 and 3000.0.0 versions of the Network.FastCGI appear in the Hackage directory, but the darcs head version (3001.0.0 as of this posting) is the one with the multi-threaded bindings exposed. (For the uninitiated, darcs is a distributed source code management system implemented in Haskell. The darcs codebase is in literate Haskell, so it's an interesting read for that if nothing else.) Darcs is available from the usual package managers; I use MacPorts on the Mac.

Get the latest Network.FastCGI from the darcs repository:

darcs get http://darcs.haskell.org/fastcgi/

The fastcgi.cabal file documents the dependencies, but a GHC 6.6.1 install is sufficient. Then build and install it the usual Cabal way:

cd fastcgi
runghc Setup.hs configure
runghc Setup.hs build
sudo runghc Setup.hs install

One more step is needed to register the package with the compiler:

sudo runghc Setup.hs register

And then to make sure that it worked:

$ ghc-pkg list 
/opt/local/lib/ghc-6.6.1/package.conf:
    Cabal-1.1.6.2, GLUT-2.1.1, HGL-3.1.1, HUnit-1.1.1, OpenGL-2.2.1,
    QuickCheck-1.0.1, X11-1.2.1, base-2.1.1, cgi-3001.1.1,
    fastcgi-3000.0.0, fastcgi-3001.0.0, fgl-5.4.1, filepath-1.0,
    (ghc-6.6.1), haskell-src-1.0.1, haskell98-1.0, html-1.0.1,
    mtl-1.0.1, network-2.0.1, parsec-2.0, readline-1.0,
    regex-base-0.72, regex-compat-0.71, regex-posix-0.71, rts-1.0,
    stm-2.0, template-haskell-2.1, time-1.1.1, unix-2.1, xhtml-3000.0.2

It takes a bit more doing to get it going with the GHC tip (to-be 6.8) because the Data.ByteString modules have been promoted into core GHC packages and rearranged a bit, but no meaningful code changes beyond some of the import statements and the fastcgi.cabal file are required. (I've sent a patch to the maintainer.)

A Simple FastCGI Handler for Process/Thread Information

The following short Haskell program (test_IO.hs) sends back a plain text response with process and thread information:

import Control.Concurrent
import System.Posix.Process (getProcessID)

import Network.FastCGI

test :: CGI CGIResult
test = do setHeader "Content-type" "text/plain"
          pid <- liftIO getProcessID
          threadId <- liftIO myThreadId
          let tid = concat $ drop 1 $ words $ show threadId
          output $ unlines [ "Process ID: " ++ show pid,
                             "Thread ID:  " ++ tid]

main = runFastCGIConcurrent' forkIO 10 test

(This is an adaptation of the printinput.hs example that uses the multi-threaded API.) To build it:

$ ghc -threaded -package fastcgi --make -o test_IO.fcgi test_IO.hs
[1 of 1] Compiling Main             ( test_IO.hs, test_IO.o )
Linking test_IO.fcgi ...

The equivalent program (test_OS.hs) with forkOS in place of forkIO does the job for OS threads.

I can use these two FastCGI handlers with different possible compiler version, web server, and FastCGI module combinations and see how things do under some simulated loads. The only gotcha with this approach is that some HTTP benchmarking tools use response byte counts as an assertion of a correct response, and they will complain as the thread ID goes from one digit to two to three, etc. My current favorite is Jef Pozkanzer's simple http_load with a tiny tweak to show the response code if a byte count comes out off. Using a different tool, e.g., ab or httperf, produces similar results.

The Web Servers: Apache2 and Lighttpd

There are probably other alternatives that I'm overlooking, but I'm going to try the two web servers that I'm familiar with, Lighttpd 1.4.15 and Apache 2.2.4, both on Mac OS X.

Configuring Lighttpd

A Lighttpd configuration file fragment for a FastCGI handler with a single process would be:

fastcgi.server = ( ".fcgi" =>
                   ( "localhost" =>
                     (
                       "socket" => "/tmp/test.sock",
                       "bin-path" => "/path/to/test_OS.fcgi",
                       "min-procs" => 1,
                       "max-procs" => 1
                     )
                   )
                 )

See the Lighttpd FastCGI documentation for the full rundown on parameters.

Also, at least as of Lighttpd 1.4.15, which is the version that MacPorts installed for me, the following configuration change is necessary to avoid a bug:

server.event-handler = "poll"

(The default value is freebsd-kqueue; see the Lighttpd performance documentation.)

After copying the file into place, we can spin-up Lighttpd and hit the URL:

$ lighttpd -f lighttpd.conf
lighttpd -f lighttpd.conf 
$ curl http://localhost:8181/test.fcgi
Process ID: 21139
Thread ID:  4
$ curl http://localhost:8181/test.fcgi
Process ID: 21139
Thread ID:  5

The thread ID changes and the process ID doesn't, so things are good. For a bigger kick:

$ echo 'http://127.0.0.1:8181/test_OS.fcgi' > /tmp/lighttpd_OS
$ http_load -parallel 20 -fetches 1000 /tmp/lighttpd_OS 2>&1 | grep -v 8181
1000 fetches, 20 max parallel, 33908 bytes, in 0.375528 seconds
33.908 mean bytes/connection
2662.92 fetches/sec, 90294.2 bytes/sec
msecs/connect: 0.19518 mean, 1.036 max, 0.09 min
msecs/first-response: 7.26042 mean, 25.558 max, 4.31 min
996 bad byte counts
HTTP response codes:
  code 200 -- 1000

The 996 bad byte count errors are expected, since the responses for thread IDs 10 through 1005 have a different number of bytes than those for thread IDs 6,7,8, and 9. In any case, so far, so good:

$ curl http://localhost:8181/test_OS.fcgi
Process ID: 21139
Thread ID:  1006

Configuring Apache2 with mod_fastcgi

The single-process configuration file fragment for Apache2 with mod_fastcgi is:

LoadModule fastcgi_module modules/mod_fastcgi.so

FastCgiConfig -maxClassProcesses 1 -processSlack 1

<Location /fastcgi>
        SetHandler fastcgi-script
        Options ExecCGI
        allow from all
</Location>

This configuration passes the basic smoke test with no issues. Under load, the forkIO version burns about half the CPU and the same amount of memory as the forkOS version. Both versions use three OS threads most of the time, and as expected based on the comments above about the way that Haskell handles scheduling, the forkOS version never uses more than four OS threads no matter how hard the server is hit.

Configuring Apache2 with mod_fcgid

The configuration fragment for Apache2 with mod_fcgid is:

LoadModule fcgid_module modules/mod_fcgid.so

MaxProcessCount 1

<Location /fcgid>
  SetHandler fcgid-script
  Options ExecCGI
  allow from all
</Location>

With the same smoke testing approach as above (with a redirect to silence the byte count complaints):

$ echo 'http://127.0.0.1:7007/fcgid/test_OS.fcgi' > /tmp/fcgid_OS
$ curl http://127.0.0.1:7007/fcgid/test_OS.fcgi
Process ID: 16854
Thread ID:  4
$ http_load -parallel 20 -fetches 1000 /tmp/fcgid_OS 2>&1 | grep -v fcgid
1000 fetches, 20 max parallel, 34704 bytes, in 1.2484 seconds
34.704 mean bytes/connection
801.028 fetches/sec, 27798.9 bytes/sec
msecs/connect: 0.294162 mean, 2.339 max, 0.051 min
msecs/first-response: 12.8977 mean, 1009.92 max, 2.758 min
986 bad byte counts
HTTP response codes: 
  code 200 -- 998
  code 500 -- 2
$ curl http://127.0.0.1:7007/fcgid/test_OS.fcgi
Process ID: 16869
Thread ID:  7

Fail, since 16854 /= 16869, and based on the mod_fcgid's stated goals of keeping FastCGI handlers "fresh" by killing them at the first sign of an issue, not that surprising.

Aggregated Results and Additional Observations

The results in these tables were generated using http_load. For the "6000/min" test:

$ http_load -rate 100 -seconds 60 url_file 2>&1 | grep -v port

For the "60000/min" test:

$ http_load -rate 1000 -seconds 60 url_file 2>&1 | grep -v port

For the fixed rate tests, the number of nines is determined by the proportion of 200 responses out of the total number of responses (the others being 500 and 503). For the requests per second mark:

$ http_load -parallel 20 -seconds 10 url_file 2>&1 | grep -v port

First, with the current GHC version:

GHC 6.6.1, 4-core G5
ServerFastCGI SupportforkIOforkOS
Lighttpd 1.4.15built-inJust OK
6000/min - all good
60000/min - incomplete
max ~3000 req/sec
JUST OK
6000/min - all good
60000/min - incomplete
max ~2200 req/sec
Apache 2.2.4mod_fastcgiGOOD
6000/min - all good
60000/min - three 9's
max ~2700 req/sec
BEST
6000/min - all good
60000/min - four 9's
max ~2100 req/sec
Apache 2.2.4mod_fcgidFAIL
Process not stable.
FAIL
Process not stable.

None of these really cause the machine to break a sweat, with the web server doing most of the work and the FastCGI handler never consuming more than 60% of a core and a couple megabytes of resident memory. An overnight run showed the mod_fastcgi and forkOS combination to perform flawlessly under moderate load for over 108 requests.

With the latest GHC release candidate used to compile both the FastCGI package and the handlers:

GHC 6.9.20070918 (darcs tip), 4-core G5
ServerFastCGI SupportforkIOforkOS
Lighttpd 1.4.15built-inGOOD
6000/min - all good
60000/min - three 9's
~3300 req/sec
GOOD
6000/min - all good
60000/min - three 9's
~2200 req/sec
Apache 2.2.4mod_fastcgiJUST OK
6000/min - three 9's
60000/min - three 9's
~2500 req/sec
JUST OK
6000/min - three 9's
60000/min - four 9's
~1900 req/sec
Apache 2.2.4mod_fcgidFAIL
Process not stable.
FAIL
Process not stable.

Looks like GHC 6.6.1 and Apache2/mod_fastcgi is the winning combination.

Addendum

I got GHC 6.6.1 installed and configured the forkIO and forkOS handlers on the User Mode Linux server where I have this blog hosted, and it looks like forkIO is a winner there, with process stability and around 100 requests per second sustained throughput. With the forkOS variant, the process IDs do tick up with each hit, but that's a property of fork() on the kernel where one process corresponds to one thread rather than being a result of a restarted FastCGI handler.

(comment bubbles) 2 comments

bash Tidbit of the Day

Paul Brown @ 2007-09-22T22:19:00Z

Along with !! (last command) and !$ (last argument), I can't remember not knowing the ^ quick substitution feature in bash for replacing a string in the last entry:

$ echo foo
foo
$ ^foo^bar
echo bar
bar

But that doesn't handle multiple occurrences:

$ echo foo foo
foo foo
$ ^foo^bar
echo bar foo
bar foo

I read the fine manual looking for something else and found the !!:gs/old/new/ history expansion feature:

$ echo foo foo foo
foo foo foo 
$ !!:gs/foo/bar/
echo bar bar bar
bar bar bar
(comment bubbles) 1 comment

All Posts contains 397 items in 57 pages of 7 items each:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57