More Haskell and Personal Publishing Platform Ramblings

Paul Brown @ 2006-10-18T07:00:00Z

Herein, part two of n, where I think about some basic design decisions and some actual code gets written.

From GET to Function Currying

First up, some thoughts about how to map a request to a response.

A weblog is a list of entries grouped and displayed in different formats according to parameters supplied by client applications, so serving pages is limited to figuring out what format to display (single-entry HTML, multi-entry HTML, Atom) and which entries to include (include/exclude by tag, include/exclude by category, date range, maximum number of entries). Ideally, I'd like a reader to be able to, e.g., exclude only the posts I make about my kid or include only posts about BPEL and Java.

The simple design is to turn the combination of a URL and parameters into a list of functions that are applied to the list of all entries, loosely:

(format url) (filter (filters url) entries)

And that's the whole program, up to details. In mildly abused Haskell notation:

format :: URL -> [Entry] -> String
filters :: URL -> [Entry -> Boolean]

In Haskell, the notation

f :: A -> B

is very much as it is in mathematics with f being a function with domain (things of type) A and codomain (things of type) B. Unraveling expressions with multiple arrows is accomplished by adding parentheses from the right. That is,

f :: A -> B -> C  =  f :: A -> (B -> C)

is essentially a function from A×B to C as in "f is a function that maps (things of type) A to a function that maps (things of type) B to (things of type) C". So, the function format above maps a URL and a list of Entry to a String. This is one of the things that I like about Haskell — a function looks like a function.

Haskell supports currying of functions, which is only appropriate, as the language was named for Haskell Curry. The filters can be implemented by currying arguments onto functions that take two arguments. For example, consider the functions:

include_by_tag :: String -> Entry -> Boolean
include_by_tag s e = s `elem` (tags e)

exclude_by_tag :: String -> Entry -> Boolean
exclude_by_tag s e = s `notElem` (tags e)

(tags is a function that returns the list of tags applied to an entry.) In addition to showing off that Haskell lets you flip back and forth between prefix notation (foo 1 2) and infix notation (1 `foo` 2), the idea would be to map a query string atom like tag=foo to the function (include_by_tag "foo") or a query string atom like tag=-foo to the function (exclude_by_tag "foo").

Playing with the concept in ghci is straightforward. For example, drop the following two lines of Haskell into a text file called curryplus.hs:

plus :: Integer -> Integer -> Integer
plus x y = x + y

And then fire-up ghci:

$ ghci
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base ... linking ... done.
Prelude> :load curryplus.hs
[1 of 1] Compiling Main             ( curryplus.hs, interpreted )
Ok, modules loaded: Main.
*Main> :t plus
plus :: Integer -> Integer -> Integer
*Main> :t (plus 1)
(plus 1) :: Integer -> Integer
*Main> let f = (plus 1)
*Main> f 2
3

The :t command in ghci interrogates the type of the expression passed to it.

Now, where does the list of entries come from?

Storing Entries — Database or Filesystem or...?

Databases are lovely places to store data where reads and writes may overlap, and the filesystem is a good place to store information that either doesn't fit or isn't needed in memory. For a weblog, read/write contention should be light (frequent reads, infrequent writes), with writes limited to posts and comments, and optimistic concurrency is entirely acceptable. (It's of no consequence if someone gets slightly stale content.) However, the total amount of content in my weblog, counting from 2002, is in the hundreds of kilobytes, so there is no reason not to hold the whole thing in memory.

Haskell (and specifically GHC) has a couple of shiny objects that I'm tempted by. The shiniest one is STM or Software Transactional Memory, and like the three-line quicksort implementation is one of the teasers for Haskell, the four-line AtomicInteger.getAndIncrement() implementation is the teaser for STM:

get_and_increment :: TVar Integer -> IO Integer
get_and_increment i = atomically ( do j <- readTVar i
                                      writeTVar i (j+1)
                                      return j )

The whitespace in the above definition is critically important, as it tells Haskell that the lines are all part of the do. I'll come back to the left arrow ("<-") and do notation below.

Experimenting with this in ghci isn't much more complicated than the plus example above. Put the four lines above in a text file called get_and_inc.hs following the import statement:

import Control.Concurrent.STM

And fire up ghci with an extra directive to get it to load the STM package:

$ ghci -package stm
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base ... linking ... done.
Loading package stm-2.0 ... linking ... done.
Prelude> :load get_and_inc.hs
[1 of 1] Compiling Main             ( get_and_inc.hs, interpreted )
Ok, modules loaded: Main.
*Main> x <- atomically ( newTVar 1 )
*Main> :t x
x :: TVar Integer
*Main> get_and_increment x
1
*Main> get_and_increment x
2
*Main> get_and_increment x
3

The original paper on STM is from Microsoft Research back in 2005. The follow-on paper Lock Free Data Structures using STM in Haskell is a good read, wherein the authors construct two implementations of ArrayBlockingQueue in Haskell, one using locks and one using STM, and then compare their performance. (Spoiler: The larger the number of processors, the better STM performs.)

As promised, a quick word about <- and do. Haskell is a purely functional language, and that means that it is side-effect-free. On the one hand it's great to have referential transparency, since it lets the compiler or runtime VM do things like replace common pieces of a complicated expression with values, but on the other hand, something imperatively trivial like console output is a side-effect. The elegant workaround is to allow a function to return an action as a value, where the action is to be performed by an external observer at some point. The construct is called a monad, and <- and do are some of the notation that comes with it. (Defining the nature of the action and the observer is the art of constructing a monad.) The "do" chains up actions, and the "<-" captures a value created by an action. Yet another good paper from Microsoft Research has a thorough tutorial on monads in Haskell.

This brings me to the design for storing and accessing the entries:

  • Use a single multi-threaded FastCGI handler to serve all requests.
  • Maintain an in-memory copy of all content (entries and approved comments), with access and updates managed via STM.
  • Store entry and approved comment content as separate files on disk, to be loaded at start-up time and written when an entry is posted or comment approved.

Data structures and a lo-fi approach to outputting Atom are up next.

Meta

Tags: (tag) (tag) (tag) (tag) (tag) (tag) (tag)

(comment bubbles) 6 comments
1569 direct views

Comment from Pete @ 2006-10-18T11:25:27Z # permalink

I'm looking for to your blog implemenation as I've been meaning to write one in Haskell for learning purposes as well. And your design goals are very similar to mine.

Comment from Sean Leather @ 2006-10-19T01:49:14Z # permalink

Several comments/questions:

  • You talk about both maintaining an in-memory copy of all content as well as storing files to disk. I don't get exactly what you mean. Are you keeping a data structure with a copy of the text stored in files? If so, why? If you simply open a file, the parts needed get loaded into memory. Your OS's file buffer takes care of keeping in memory what's accessed often.
  • Why do you need STM? As you said, reads are frequent and don't need to block other reads. Writes only ever happen the first time a post or comment occurs. If you were going to update an entry, you could effectively edit the text in a file and save it. That is atomic enough, I think.
  • I like the point on composing functions for generating different types of output from the same set of inputs. That's a nice image of functional programming.

Comment from Paul Brown @ 2006-10-19T03:41:00Z # permalink

@Sean:
I mean a solely in-memory copy would be used to serve content, so there would be no overhead of parsing files each time. (My planned structure for an entry includes tags, URLs to ping, etc.) The filesystem would provide durable storage.

(FWIW, on the subject of filesystem-only, I used blosxom a while back, and it's the ultimate in simple use of the filesystem for posting entries. There were definitely some things to like about it, although the plugin "architecture" gave me the willies every time I looked at it...)

Since everything would be in memory, STM is intended to provide a substitute for good filesystem behavior if I were just editing files. I would hate to have the inconsistent view of the site be the one that Google indexes or an aggregator posts.

I'm still on the fence as to whether I'd want to place files on the machine and hit an "update" URL or whether I'd want to POST content. The relative convenience of WebDAV may lead me down the filesystem path. I'll probably end up seeing what works.

Comment from Einar Karttunen @ 2006-10-19T09:10:10Z # permalink

Using STM is a very nice approach and exactly what HAppS (http://HAppS.org) does underneath. Would be fun to compare our approaches to various design alternatives.

Comment from Sean Leather @ 2006-10-19T14:26:53Z # permalink

Thanks for the reply, Paul.

As for serving content, I can see where you're coming from. The format of files and data in general is one question I've debated with myself; however, I've came at it from a different angle. Since most things are write-once/read-many, I've been thinking about having the processing at the write time instead of the read time. That way, when you update or POST content, you process the data to the desired format (say, HTML or some subset of it). If it's stored in a file, you just include that file whenever the appropriate URL is requested, thus no parsing is needed.

Of course, a problem comes in when you start doing something else that doesn't want straight HTML, e.g. Atom. For that, I believe you have to parse and escape all XML-related characters. So, my solution is not at all ideal.

Comment from Paul Brown @ 2006-10-19T17:34:24Z # permalink

@Einar:
I like the HAppS approach and basic concepts, although citing Prevayler as an inspiration made me a little suspicious. (The project does not have the best reputation in the Java world, but it does embody the class "batch with restart" approach to transaction processing.) I'm still wading through the HAppS implementation to figure out how it works.

There is also the Hope CMS from Björn Bringert (the same guy who has been working on the CGI module), although it takes a more traditional three-tier approach.