More Haskell and Personal Publishing Platform Ramblings

Paul Brown @ 2006-10-18T07:00:00Z

Herein, part two of n, where I think about some basic design decisions and some actual code gets written.

From GET to Function Currying

First up, some thoughts about how to map a request to a response.

A weblog is a list of entries grouped and displayed in different formats according to parameters supplied by client applications, so serving pages is limited to figuring out what format to display (single-entry HTML, multi-entry HTML, Atom) and which entries to include (include/exclude by tag, include/exclude by category, date range, maximum number of entries). Ideally, I'd like a reader to be able to, e.g., exclude only the posts I make about my kid or include only posts about BPEL and Java.

The simple design is to turn the combination of a URL and parameters into a list of functions that are applied to the list of all entries, loosely:

(format url) (filter (filters url) entries)

And that's the whole program, up to details. In mildly abused Haskell notation:

format :: URL -> [Entry] -> String
filters :: URL -> [Entry -> Boolean]

In Haskell, the notation

f :: A -> B

is very much as it is in mathematics with f being a function with domain (things of type) A and codomain (things of type) B. Unraveling expressions with multiple arrows is accomplished by adding parentheses from the right. That is,

f :: A -> B -> C  =  f :: A -> (B -> C)

is essentially a function from A×B to C as in "f is a function that maps (things of type) A to a function that maps (things of type) B to (things of type) C". So, the function format above maps a URL and a list of Entry to a String. This is one of the things that I like about Haskell — a function looks like a function.

Haskell supports currying of functions, which is only appropriate, as the language was named for Haskell Curry. The filters can be implemented by currying arguments onto functions that take two arguments. For example, consider the functions:

include_by_tag :: String -> Entry -> Boolean
include_by_tag s e = s `elem` (tags e)

exclude_by_tag :: String -> Entry -> Boolean
exclude_by_tag s e = s `notElem` (tags e)

(tags is a function that returns the list of tags applied to an entry.) In addition to showing off that Haskell lets you flip back and forth between prefix notation (foo 1 2) and infix notation (1 `foo` 2), the idea would be to map a query string atom like tag=foo to the function (include_by_tag "foo") or a query string atom like tag=-foo to the function (exclude_by_tag "foo").

Playing with the concept in ghci is straightforward. For example, drop the following two lines of Haskell into a text file called curryplus.hs:

plus :: Integer -> Integer -> Integer
plus x y = x + y

And then fire-up ghci:

$ ghci
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base ... linking ... done.
Prelude> :load curryplus.hs
[1 of 1] Compiling Main             ( curryplus.hs, interpreted )
Ok, modules loaded: Main.
*Main> :t plus
plus :: Integer -> Integer -> Integer
*Main> :t (plus 1)
(plus 1) :: Integer -> Integer
*Main> let f = (plus 1)
*Main> f 2
3

The :t command in ghci interrogates the type of the expression passed to it.

Now, where does the list of entries come from?

Storing Entries — Database or Filesystem or...?

Databases are lovely places to store data where reads and writes may overlap, and the filesystem is a good place to store information that either doesn't fit or isn't needed in memory. For a weblog, read/write contention should be light (frequent reads, infrequent writes), with writes limited to posts and comments, and optimistic concurrency is entirely acceptable. (It's of no consequence if someone gets slightly stale content.) However, the total amount of content in my weblog, counting from 2002, is in the hundreds of kilobytes, so there is no reason not to hold the whole thing in memory.

Haskell (and specifically GHC) has a couple of shiny objects that I'm tempted by. The shiniest one is STM or Software Transactional Memory, and like the three-line quicksort implementation is one of the teasers for Haskell, the four-line AtomicInteger.getAndIncrement() implementation is the teaser for STM:

get_and_increment :: TVar Integer -> IO Integer
get_and_increment i = atomically ( do j <- readTVar i
                                      writeTVar i (j+1)
                                      return j )

The whitespace in the above definition is critically important, as it tells Haskell that the lines are all part of the do. I'll come back to the left arrow ("<-") and do notation below.

Experimenting with this in ghci isn't much more complicated than the plus example above. Put the four lines above in a text file called get_and_inc.hs following the import statement:

import Control.Concurrent.STM

And fire up ghci with an extra directive to get it to load the STM package:

$ ghci -package stm
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base ... linking ... done.
Loading package stm-2.0 ... linking ... done.
Prelude> :load get_and_inc.hs
[1 of 1] Compiling Main             ( get_and_inc.hs, interpreted )
Ok, modules loaded: Main.
*Main> x <- atomically ( newTVar 1 )
*Main> :t x
x :: TVar Integer
*Main> get_and_increment x
1
*Main> get_and_increment x
2
*Main> get_and_increment x
3

The original paper on STM is from Microsoft Research back in 2005. The follow-on paper Lock Free Data Structures using STM in Haskell is a good read, wherein the authors construct two implementations of ArrayBlockingQueue in Haskell, one using locks and one using STM, and then compare their performance. (Spoiler: The larger the number of processors, the better STM performs.)

As promised, a quick word about <- and do. Haskell is a purely functional language, and that means that it is side-effect-free. On the one hand it's great to have referential transparency, since it lets the compiler or runtime VM do things like replace common pieces of a complicated expression with values, but on the other hand, something imperatively trivial like console output is a side-effect. The elegant workaround is to allow a function to return an action as a value, where the action is to be performed by an external observer at some point. The construct is called a monad, and <- and do are some of the notation that comes with it. (Defining the nature of the action and the observer is the art of constructing a monad.) The "do" chains up actions, and the "<-" captures a value created by an action. Yet another good paper from Microsoft Research has a thorough tutorial on monads in Haskell.

This brings me to the design for storing and accessing the entries:

  • Use a single multi-threaded FastCGI handler to serve all requests.
  • Maintain an in-memory copy of all content (entries and approved comments), with access and updates managed via STM.
  • Store entry and approved comment content as separate files on disk, to be loaded at start-up time and written when an entry is posted or comment approved.

Data structures and a lo-fi approach to outputting Atom are up next.

(comment bubbles) 6 comments

GHC 6.6 and Mac OS X Readline Quick Fix

Paul Brown @ 2006-10-17T06:59:00Z

If you build vanilla GHC 6.6 from the source tarball on Mac OS X (non-Intel), it will build straight through with no issues, but the ghci commandline interface will drive you crazy because it lacks support for editing. This is due to a damaged readline that ships with Mac OS X, but here's a quick fix. The first step is installing an up-to-date version of GNU readline into /usr/local with the usual recipe:

cd ~/work
mkdir gnu-readline
cd !$
wget ftp://ftp.cwru.edu/pub/bash/readline-5.2.tar.gz
tar xzvf readline-5.2.tat.gz
cd readline-5.2
./configure
make && sudo make install

(I keep a copy of the source for any modules that I build from scratch.) The second step is courtesy of Andy Gill:

cd ~/work
mkdir ghc
cd !$
wget http://www.haskell.org/ghc/dist/6.6/ghc-6.6-src.tar.bz2
wget http://www.haskell.org/ghc/dist/6.6/ghc-6.6-src-extralibs.tar.bz2
tar xjvf ghc-6.6-src.tar.bz2
tar xjvf ghc-6.6-src-extralibs.tar.bz2
cd ghc-6.6
./configure --with-readline-includes=/usr/local/include \
            --with-readline-libraries=/usr/local/lib
make -j && sudo make install

This is one of those little joys of configure: flags are passed down to the libraries from the master build, but available options are not bubbled up to ./configure --help. You'd have no idea about the availability of the readline-related flags unless you looked at the library.

Of course, you can also use Fink or DarwinPorts and pass the appropriate directories to configure instead.

Building from source takes a long time, no matter what kind of machine you've got. If you've got a multi-processor or multi-core machine, make -j (as many concurrent jobs as make feels like running) will help cut the time down. It takes under an hour on my four-way G5 with -j and several hours without. (Don't bother with -j without also cranking up the number of processes, e.g., ulimit -u 512, which beats the anemic default of 100 on Mac OS.)

(comment bubbles) 2 comments

First Steps with Haskell for Web Applications

Paul Brown @ 2006-10-11T06:19:00Z

As I blogged yesterday, I'm planning to build a simplified personal publishing system to host this blog, partially to get around resource consumption issues with the current platform and partially to get some exercise with a new language or two. I thought about Smalltalk, Erlang, and Io, but Haskell gets the initial nod if for no other reason than it's a third side of the coin that Ruby and Java are two sides of — rigorously defined, "purely" functional, lazy, "typeful" and compiles to native code via GHC. (And, of course, the syntax warms the cockles of my mathematician's heart.) Like Ruby with gems, the GHC runtime also has excellent modularity, with a minimal and standard core and good package management via Cabal. (Hello? Java?)

The first question is how to integrate an application written in Haskell into a web container, preferably a web server like lightTPD or Apache via FastCGI. (CGI would be a consideration, too, but that's just too retro for me.) Thankfully, as of the forthcoming 6.6 version, GHC has good CGI support via the Network.CGI module, and Björn Bringert has a FastCGI binding that built on the GHC 6.5 tip with only a little tinkering. (I wanted to use the core Network.CGI module in place of Björn's cgi-compat module.)

A "Hello, World" implementation using the FastCGI binding and then compiled to native code performed well on a basic smoke benchmark. Here's the relevant line from top for an instance of the handler:

  PID COMMAND      %CPU   TIME   #TH #PRTS #MREGS RPRVT  RSHRD  RSIZE  VSIZE
[...]
  234 hello.fcgi   0.0%  0:06.83   1    13    21   692K  1.63M  1.69M  29.0M
[...]

Benchmarking with ab shows that 5 handlers can happily crank through around 4000 requests/second with 99% of the requests requiring <2ms.

For comparison purposes and with an identical FastCGI configuration, the simplest possible Ruby on Rails "Hello, World" implementation (create test controller, edit the .rhtml to return content, wire-up FastCGI) consumes considerably more memory:

  PID COMMAND      %CPU   TIME   #TH #PRTS #MREGS RPRVT  RSHRD  RSIZE  VSIZE
[...]
  537 ruby1.8     12.1%  0:26.49   1    14    94  22.5M  3.35M  24.5M  54.5M
[...]

and only manages around 100 requests/second with ~50ms response time for the 50th percentile and ~400ms at the 99th percentile. (I recognize that I should probably put a sic. after the "only", since 100 requests/second is significantly in excess of the peak throughput that my blog sees on a good day.)

This is far from apples-to-apples, as the RoR version is doing a lot more work under the covers, but it does give me the expectation that I can probably get a Haskell blog implementation that will have a memory footprint smaller than a base irb and provide Slashdottable performance.

Next up, deciding on how to store/represent an entry and how to implement Atom for syndication.

(comment bubbles) 2 comments

Typo + TextDrive != Happy

Paul Brown @ 2006-10-10T06:18:58Z

The logs say that mult.ifario.us throws a fair number of HTTP 500 response codes back at visitors, and that's sad. It is certainly not the impression I want to make on visitors and readers (although subscribers are insulated from failures by FeedBurner's excellent service). In a perfect world, something as simple as a weblog wouldn't throw any 500s, ever. The problems come from running Typo on TextDrive. There isn't anything intrinsically wrong with the Typo engine, with Ruby, or even with TextDrive, as a similar setup runs like a top in my test environment, but TextDrive's resource limits make Typo's design impractical.

This got me thinking about the design of the simplest possible weblog publishing software, a design that would eschew the use of a database and all runtime configuration in favor of a system that is ultra-lightweight and quick to "boot". Almost all of the content in the blog is relatively static — display of an entry, feeds, archives, various paginations and groupings only require lightweight decoration of the XHTML for a given entry. Paginations and groups, e.g., by multiple tags or by tags plus date, require some dynamic behavior on the server, but not that much. A complexity-ectomy doesn't have to come at the expense of chrome and eye candy, as modern browsers make it possible to inject dynamic content (images from Flickr, links from del.icio.us, free-associations from Google AdSense, etc.) into the browser directly in the form of JavaScript.

The one difficult bit (and the only thing that would require a POST) would be comments. Comments don't need a database or use of dynamic content, either, and using email for comment workflow would solve multiple problems. Here's a sketch:

  • Comment is made on the weblog by submitting a form.
  • Server-side executable wraps the comment as an email and sends it to the blog's author.
  • Normal email filtering machinery is applied to the comment, i.e., spam filtering, and the blog content author either chooses to reply to the message, in which case the comment is added to the relevant entry (e.g., via a procmail recipe), or simply ignores it.

Akismet is apparently effective (if, at the same time, a statement about the sad state of the signal-to-noise ratio of the present-day internet), but it makes sense to leverage the filtering technology and massive corpus (~107 messages) of SPAM and ham that I already use for email.

I've experimented with different publishing platforms (Radio Userland, SnipSnap, MT, WordPress, Typo), and they all fell short for me in one way or another.

As the saying goes, if you want something done right... I'm going to embark on a project to replace Typo with something simple, dense/terse, and home-grown. It's also a chance to experiment with a new language or two, so it should be both fun and educational. Java's out due to footprint, but my mind is open otherwise — SmallTalk, Haskell, Lisp, Io, ...?

(comment bubbles) 1 comment

The AdSense Magic 8-Ball

Paul Brown @ 2006-09-28T19:18:00Z

In terms of an hourly wage paid for blogging, AdSense pays me something less than slave wages, but it's fun to shake up the Google magic 8-ball and see what it thinks is relevant to a given page. Sometimes it gets things right (ads for BizTalk and cello for the orchestration tag), and sometimes it gets things wrong (ads for VB for the Erlang tag).

So, let's see... <shake shake shake>...

(comment bubbles) 0 comments

Misunderstanding the BPEL Thing

Paul Brown @ 2006-09-27T06:48:00Z

Over on InfoWorld, David Linthicum has some misdirected and FUD-filled comments (original and follow-up) on the impending release of the WS-BPEL 2.0 standard by OASIS. Bruce Silver, Steve Hoffman, and Fred Holahan are quicker on the draw in responding on the topic, but I'll come out of semi-retirement as a BPEL wonk to comment. (I could never quite get off my kiester to go after David Chappelle when he blogs about BPEL, partially because he has the clever defensive tactic of a broken blog that makes it impossible to link directly to entries...)

First off, if there are any genuine issues with the consistency of the specification, behavioral, semantic, or otherwise, then those should get presented to the OASIS BPEL TC promptly for consideration. I doubt it. The TC discussed literally thousands of issues — at length and sometimes ad nauseum. Of the 308 of those that were worthy of issue status, all have been resolved with the input of several dozen smart folks (and a few dumb ones).

As far as BPEL4WS 1.1 goes, finding a reasonable profile for the language isn't that onerous. (In fact, it's easier to figure out a behavior profile than to decipher the combined licensing terms from the specification's "owners".) Make a choice about variable declarations and fault-handlers, think hard about event-handlers, etc., and you can work it out. Someone else might make different choices, but c'est la vie. Row-level locking works differently in every SQL database, and no one's crying about it. The PXE engine offered side-by-side execution of a working profile for BPEL4WS 1.1 and a spec-tracking version of WS-BPEL 2.0 starting back in 2003 (and was the first engine to do so), with the ability to apply management, deployment, and debugging functionality equally to both 1.1 and 2.0 processes within the same runtime container. (PXE is now part of the ODE project under incubation at Apache and part of Intalio's open-source BPMS.) This was purposeful, up-front future-proof design; there was no planned or inadvertent obsolescence, no subterfuge. Side-by-side support should be the baseline customer expectation.

Before banging on a drum about portability, what would portability mean for an orchestrated process? No one should expect that vendor-specific goodies are portable. I can't run PL/SQL in DB2, and I'm cool with that. Does portability mean demonstrable behavior equivalence, up to non-determinism? A reasonable definition would be that a process written in vanilla WS-BPEL 2.0 should run in any engine that claims to be compliant, up to fussing with deployment descriptors. (Neither the 1.1 nor 2.0 versions of the specification address the way that BPEL ports are bound to concrete ports, so this is different across the various vendors.) For what it's worth, we had reasonable luck taking BPEL4WS 1.1 processes created with other tools and running them, with a modicum of fiddling.

I said it in 2004, Assaf said it recently in a long guest post over on IT|Redux, and it's still true: BPEL is an orchestration execution language, in a role analogous to assembly language. To quote myself:

Assembly Language BPEL
registers communication channels
bytes/bits messages/parts
memory variables
operations on bytes/bits/memory operations on messages/parts/variables
CPU BPEL engine

And that's all you should expect it to be. Diagramming tools? Not in there — I'd recommend taking a look at BPMN. Workflow add-ons? Not in there — check out BPEL4People or just (gasp) model your workflow components as services.

(comment bubbles) 0 comments

Toddler Idiosyncrasies

Paul Brown @ 2006-09-20T05:55:00Z

The kid is really starting to develop some personality, and her growing capabilities make it more fun to be a father. (Don't get me wrong, babies are wonderful blah blah, but toddlers are even better.) She can take multi-step commands, and she understands that Mom and Dad can help with things like hurting teeth or itches. She taught herself to do a somersault, spontaneously stacks objects, colors randomly on place mats at restaurants (whether or not they're intended for that purpose), and skipped right past shape-sorting toys to putting pry-out pieces back into a multi-page book.

So far, so good. The weird thing is that she's also developing quirks. I was giving her a bath tonight and turned my back for a split second to put something away, and she let out a blood curdling scream. She was standing in the tub with a serious expression on her face, arm stuck straight out, pointing to a small fleck of dirt floating in the tub. I fished it out, and she went back to pouring water from one cup into another like nothing had happened. It's very difficult to be a neat freak as a baby, where the difference between food, finger paint, and hair gel is situational, but she's showing a definite trend now. This is just one instance; for example, she behaved similarly after dropping a piece of food on the mat that sits under her highchair. A month ago, the mat was there to protect the floor, but today, she feels that it's important to keep it clean.

Who knows whether the kid will carry this into adulthood or even into next week, but the nascent idiosyncrasies are one of the things that makes her more fun.

(comment bubbles) 1 comment

All Posts contains 397 items in 57 pages of 7 items each:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57