Herein, part two of n, where I think about some basic design decisions and some actual code gets written.
From GET to Function Currying
First up, some thoughts about how to map a request to a response.
A weblog is a list of entries grouped and displayed in different formats according to parameters supplied by client applications, so serving pages is limited to figuring out what format to display (single-entry HTML, multi-entry HTML, Atom) and which entries to include (include/exclude by tag, include/exclude by category, date range, maximum number of entries). Ideally, I'd like a reader to be able to, e.g., exclude only the posts I make about my kid or include only posts about BPEL and Java.
The simple design is to turn the combination of a URL and parameters into a list of functions that are applied to the list of all entries, loosely:
(format url) (filter (filters url) entries)
And that's the whole program, up to details. In mildly abused Haskell notation:
format :: URL -> [Entry] -> String filters :: URL -> [Entry -> Boolean]
In Haskell, the notation
f :: A -> B
is very much as it is in mathematics with f being a function with domain (things of type) A and codomain (things of type) B. Unraveling expressions with multiple arrows is accomplished by adding parentheses from the right. That is,
f :: A -> B -> C = f :: A -> (B -> C)
is essentially a function from A×B to C as in "f is a function that maps (things of type) A to a function that maps (things of type) B to (things of type) C". So, the function format above maps a URL and a list of Entry to a String. This is one of the things that I like about Haskell — a function looks like a function.
Haskell supports currying of functions, which is only appropriate, as the language was named for Haskell Curry. The filters can be implemented by currying arguments onto functions that take two arguments. For example, consider the functions:
include_by_tag :: String -> Entry -> Boolean include_by_tag s e = s `elem` (tags e) exclude_by_tag :: String -> Entry -> Boolean exclude_by_tag s e = s `notElem` (tags e)
(tags is a function that returns the list of tags applied to an entry.) In addition to showing off that Haskell lets you flip back and forth between prefix notation (foo 1 2) and infix notation (1 `foo` 2), the idea would be to map a query string atom like tag=foo to the function (include_by_tag "foo") or a query string atom like tag=-foo to the function (exclude_by_tag "foo").
Playing with the concept in ghci is straightforward.
For example, drop the following two lines of Haskell into a text file
called curryplus.hs:
plus :: Integer -> Integer -> Integer plus x y = x + y
And then fire-up ghci:
$ ghci ___ ___ _ / _ \ /\ /\/ __(_) / /_\// /_/ / / | | GHC Interactive, version 6.6, for Haskell 98. / /_\\/ __ / /___| | http://www.haskell.org/ghc/ \____/\/ /_/\____/|_| Type :? for help. Loading package base ... linking ... done. Prelude> :load curryplus.hs [1 of 1] Compiling Main ( curryplus.hs, interpreted ) Ok, modules loaded: Main. *Main> :t plus plus :: Integer -> Integer -> Integer *Main> :t (plus 1) (plus 1) :: Integer -> Integer *Main> let f = (plus 1) *Main> f 2 3
The :t command in ghci interrogates the type of the expression passed to it.
Now, where does the list of entries come from?
Storing Entries — Database or Filesystem or...?
Databases are lovely places to store data where reads and writes may overlap, and the filesystem is a good place to store information that either doesn't fit or isn't needed in memory. For a weblog, read/write contention should be light (frequent reads, infrequent writes), with writes limited to posts and comments, and optimistic concurrency is entirely acceptable. (It's of no consequence if someone gets slightly stale content.) However, the total amount of content in my weblog, counting from 2002, is in the hundreds of kilobytes, so there is no reason not to hold the whole thing in memory.
Haskell (and specifically GHC) has a couple of shiny objects that I'm tempted by. The shiniest one is STM or Software Transactional Memory, and like the three-line quicksort implementation is one of the teasers for Haskell, the four-line AtomicInteger.getAndIncrement() implementation is the teaser for STM:
get_and_increment :: TVar Integer -> IO Integer
get_and_increment i = atomically ( do j <- readTVar i
writeTVar i (j+1)
return j )
The whitespace in the above definition is critically important, as it tells Haskell that the lines are all part of the do. I'll come back to the left arrow ("<-") and do notation below.
Experimenting with this in ghci isn't much more complicated than the plus example above. Put the four lines above in a text file called get_and_inc.hs following the import statement:
import Control.Concurrent.STM
And fire up ghci with an extra directive to get it to load the STM package:
$ ghci -package stm ___ ___ _ / _ \ /\ /\/ __(_) / /_\// /_/ / / | | GHC Interactive, version 6.6, for Haskell 98. / /_\\/ __ / /___| | http://www.haskell.org/ghc/ \____/\/ /_/\____/|_| Type :? for help. Loading package base ... linking ... done. Loading package stm-2.0 ... linking ... done. Prelude> :load get_and_inc.hs [1 of 1] Compiling Main ( get_and_inc.hs, interpreted ) Ok, modules loaded: Main. *Main> x <- atomically ( newTVar 1 ) *Main> :t x x :: TVar Integer *Main> get_and_increment x 1 *Main> get_and_increment x 2 *Main> get_and_increment x 3
The original paper on STM is from Microsoft Research back in 2005. The follow-on paper Lock Free Data Structures using STM in Haskell is a good read, wherein the authors construct two implementations of ArrayBlockingQueue in Haskell, one using locks and one using STM, and then compare their performance. (Spoiler: The larger the number of processors, the better STM performs.)
As promised, a quick word about <- and
do. Haskell is a purely functional language, and that
means that it is side-effect-free. On the one hand it's great to have
referential
transparency, since it lets the compiler or runtime VM do things
like replace common pieces of a complicated expression with values,
but on the other hand, something imperatively trivial like console
output is a side-effect. The elegant workaround is to allow
a function to return an action as a value, where the action is to be
performed by an external observer at some point. The construct is
called a monad,
and <- and do are some of the notation
that comes with it. (Defining the nature of the action and the
observer is the art of constructing a monad.) The
"do" chains up actions, and the
"<-" captures a value created by an action. Yet
another
good paper from Microsoft Research has a thorough tutorial on
monads in Haskell.
This brings me to the design for storing and accessing the entries:
- Use a single multi-threaded FastCGI handler to serve all requests.
- Maintain an in-memory copy of all content (entries and approved comments), with access and updates managed via STM.
- Store entry and approved comment content as separate files on disk, to be loaded at start-up time and written when an entry is posted or comment approved.
Data structures and a lo-fi approach to outputting Atom are up next.












Comment from Pete @ 2006-10-18T11:25:27Z # permalink
Comment from Sean Leather @ 2006-10-19T01:49:14Z # permalink
Several comments/questions:
Comment from Paul Brown @ 2006-10-19T03:41:00Z # permalink
@Sean:
I mean a solely in-memory copy would be used to serve content, so there would be no overhead of parsing files each time. (My planned structure for an entry includes tags, URLs to ping, etc.) The filesystem would provide durable storage.
(FWIW, on the subject of filesystem-only, I used blosxom a while back, and it's the ultimate in simple use of the filesystem for posting entries. There were definitely some things to like about it, although the plugin "architecture" gave me the willies every time I looked at it...)
Since everything would be in memory, STM is intended to provide a substitute for good filesystem behavior if I were just editing files. I would hate to have the inconsistent view of the site be the one that Google indexes or an aggregator posts.
I'm still on the fence as to whether I'd want to place files on the machine and hit an "update" URL or whether I'd want to POST content. The relative convenience of WebDAV may lead me down the filesystem path. I'll probably end up seeing what works.
Comment from Einar Karttunen @ 2006-10-19T09:10:10Z # permalink
Comment from Sean Leather @ 2006-10-19T14:26:53Z # permalink
Thanks for the reply, Paul.
As for serving content, I can see where you're coming from. The format of files and data in general is one question I've debated with myself; however, I've came at it from a different angle. Since most things are write-once/read-many, I've been thinking about having the processing at the write time instead of the read time. That way, when you update or POST content, you process the data to the desired format (say, HTML or some subset of it). If it's stored in a file, you just include that file whenever the appropriate URL is requested, thus no parsing is needed.
Of course, a problem comes in when you start doing something else that doesn't want straight HTML, e.g. Atom. For that, I believe you have to parse and escape all XML-related characters. So, my solution is not at all ideal.
Comment from Paul Brown @ 2006-10-19T17:34:24Z # permalink
@Einar:
I like the HAppS approach and basic concepts, although citing Prevayler as an inspiration made me a little suspicious. (The project does not have the best reputation in the Java world, but it does embody the class "batch with restart" approach to transaction processing.) I'm still wading through the HAppS implementation to figure out how it works.
There is also the Hope CMS from Björn Bringert (the same guy who has been working on the CGI module), although it takes a more traditional three-tier approach.