Da da, More or Less

Paul Brown @ 2006-10-25T00:38:00Z

The kid has started talking in earnest. She started using a few made-up words for things ("mum mum" was food, "boo-ey" was book) as much as a year ago, but now she understands dozens of words and uses a few dozen — car, baby, cookie, etc. — that are close enough to pass for the real thing. This morning, she started saying "da da" to refer to me, which was a nice way to kick off the day.

Tonight, she pointed out a package of Brawny paper towels. The strapping, flannel-clad man on the package is "Dada". (I'll admit a certain resemblance, at least in wardrobe.) Then Paul Allen is on the cover of Seattle magazine, and he's "Dada", too. Who knows how many other Dada's could be lurking around...? I'm sure that we can get a finer point on just who "Da da" is and isn't in a couple of days, and it's good be one her Dadas at any rate.

(comment bubbles) 0 comments

Bye bye TextDrive, Hello Linode

Paul Brown @ 2006-10-21T01:11:27Z

The TextDrive folks are (not unjustifiably) cracking down on people who bump into their resource limits (e.g., a hard limit of 48Mb resident and 80Mb virtual per process), both in the form of a watchdog process that kills off processes that cross the line and in the form of emails out to folks (like me) who use a process nanny to restart their killed processes. Unfortunately, TextDrive doesn't sell more per-process memory with any of their plans short of the $250/month price point, so I packed up and moved over to Linode.

Linode provides user-mode Linux virtual private servers with reasonable specs, and so far so good. I dropped in an Ubuntu 6.0.6 distribution, got things configured, and had the typo instance moved over with a total of about 30 minutes of time invested, most of which was invested fiddling libraries needed by mysql and sqlite gems. (The Debian/Ruby team has an official position about why RubyGems isn't included, but it should be apparent that they've already lost to momentum in the Ruby community.) For $20/month instead of $12/month, Typo has space to run (until I'm done writing its replacement), 500s are a thing of the past, and I've got root access instead of a goofy control panel. Much better.

(comment bubbles) 1 comment

More Haskell and Personal Publishing Platform Ramblings

Paul Brown @ 2006-10-18T07:00:00Z

Herein, part two of n, where I think about some basic design decisions and some actual code gets written.

From GET to Function Currying

First up, some thoughts about how to map a request to a response.

A weblog is a list of entries grouped and displayed in different formats according to parameters supplied by client applications, so serving pages is limited to figuring out what format to display (single-entry HTML, multi-entry HTML, Atom) and which entries to include (include/exclude by tag, include/exclude by category, date range, maximum number of entries). Ideally, I'd like a reader to be able to, e.g., exclude only the posts I make about my kid or include only posts about BPEL and Java.

The simple design is to turn the combination of a URL and parameters into a list of functions that are applied to the list of all entries, loosely:

(format url) (filter (filters url) entries)

And that's the whole program, up to details. In mildly abused Haskell notation:

format :: URL -> [Entry] -> String
filters :: URL -> [Entry -> Boolean]

In Haskell, the notation

f :: A -> B

is very much as it is in mathematics with f being a function with domain (things of type) A and codomain (things of type) B. Unraveling expressions with multiple arrows is accomplished by adding parentheses from the right. That is,

f :: A -> B -> C  =  f :: A -> (B -> C)

is essentially a function from A×B to C as in "f is a function that maps (things of type) A to a function that maps (things of type) B to (things of type) C". So, the function format above maps a URL and a list of Entry to a String. This is one of the things that I like about Haskell — a function looks like a function.

Haskell supports currying of functions, which is only appropriate, as the language was named for Haskell Curry. The filters can be implemented by currying arguments onto functions that take two arguments. For example, consider the functions:

include_by_tag :: String -> Entry -> Boolean
include_by_tag s e = s `elem` (tags e)

exclude_by_tag :: String -> Entry -> Boolean
exclude_by_tag s e = s `notElem` (tags e)

(tags is a function that returns the list of tags applied to an entry.) In addition to showing off that Haskell lets you flip back and forth between prefix notation (foo 1 2) and infix notation (1 `foo` 2), the idea would be to map a query string atom like tag=foo to the function (include_by_tag "foo") or a query string atom like tag=-foo to the function (exclude_by_tag "foo").

Playing with the concept in ghci is straightforward. For example, drop the following two lines of Haskell into a text file called curryplus.hs:

plus :: Integer -> Integer -> Integer
plus x y = x + y

And then fire-up ghci:

$ ghci
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base ... linking ... done.
Prelude> :load curryplus.hs
[1 of 1] Compiling Main             ( curryplus.hs, interpreted )
Ok, modules loaded: Main.
*Main> :t plus
plus :: Integer -> Integer -> Integer
*Main> :t (plus 1)
(plus 1) :: Integer -> Integer
*Main> let f = (plus 1)
*Main> f 2
3

The :t command in ghci interrogates the type of the expression passed to it.

Now, where does the list of entries come from?

Storing Entries — Database or Filesystem or...?

Databases are lovely places to store data where reads and writes may overlap, and the filesystem is a good place to store information that either doesn't fit or isn't needed in memory. For a weblog, read/write contention should be light (frequent reads, infrequent writes), with writes limited to posts and comments, and optimistic concurrency is entirely acceptable. (It's of no consequence if someone gets slightly stale content.) However, the total amount of content in my weblog, counting from 2002, is in the hundreds of kilobytes, so there is no reason not to hold the whole thing in memory.

Haskell (and specifically GHC) has a couple of shiny objects that I'm tempted by. The shiniest one is STM or Software Transactional Memory, and like the three-line quicksort implementation is one of the teasers for Haskell, the four-line AtomicInteger.getAndIncrement() implementation is the teaser for STM:

get_and_increment :: TVar Integer -> IO Integer
get_and_increment i = atomically ( do j <- readTVar i
                                      writeTVar i (j+1)
                                      return j )

The whitespace in the above definition is critically important, as it tells Haskell that the lines are all part of the do. I'll come back to the left arrow ("<-") and do notation below.

Experimenting with this in ghci isn't much more complicated than the plus example above. Put the four lines above in a text file called get_and_inc.hs following the import statement:

import Control.Concurrent.STM

And fire up ghci with an extra directive to get it to load the STM package:

$ ghci -package stm
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base ... linking ... done.
Loading package stm-2.0 ... linking ... done.
Prelude> :load get_and_inc.hs
[1 of 1] Compiling Main             ( get_and_inc.hs, interpreted )
Ok, modules loaded: Main.
*Main> x <- atomically ( newTVar 1 )
*Main> :t x
x :: TVar Integer
*Main> get_and_increment x
1
*Main> get_and_increment x
2
*Main> get_and_increment x
3

The original paper on STM is from Microsoft Research back in 2005. The follow-on paper Lock Free Data Structures using STM in Haskell is a good read, wherein the authors construct two implementations of ArrayBlockingQueue in Haskell, one using locks and one using STM, and then compare their performance. (Spoiler: The larger the number of processors, the better STM performs.)

As promised, a quick word about <- and do. Haskell is a purely functional language, and that means that it is side-effect-free. On the one hand it's great to have referential transparency, since it lets the compiler or runtime VM do things like replace common pieces of a complicated expression with values, but on the other hand, something imperatively trivial like console output is a side-effect. The elegant workaround is to allow a function to return an action as a value, where the action is to be performed by an external observer at some point. The construct is called a monad, and <- and do are some of the notation that comes with it. (Defining the nature of the action and the observer is the art of constructing a monad.) The "do" chains up actions, and the "<-" captures a value created by an action. Yet another good paper from Microsoft Research has a thorough tutorial on monads in Haskell.

This brings me to the design for storing and accessing the entries:

  • Use a single multi-threaded FastCGI handler to serve all requests.
  • Maintain an in-memory copy of all content (entries and approved comments), with access and updates managed via STM.
  • Store entry and approved comment content as separate files on disk, to be loaded at start-up time and written when an entry is posted or comment approved.

Data structures and a lo-fi approach to outputting Atom are up next.

(comment bubbles) 6 comments

GHC 6.6 and Mac OS X Readline Quick Fix

Paul Brown @ 2006-10-17T06:59:00Z

If you build vanilla GHC 6.6 from the source tarball on Mac OS X (non-Intel), it will build straight through with no issues, but the ghci commandline interface will drive you crazy because it lacks support for editing. This is due to a damaged readline that ships with Mac OS X, but here's a quick fix. The first step is installing an up-to-date version of GNU readline into /usr/local with the usual recipe:

cd ~/work
mkdir gnu-readline
cd !$
wget ftp://ftp.cwru.edu/pub/bash/readline-5.2.tar.gz
tar xzvf readline-5.2.tat.gz
cd readline-5.2
./configure
make && sudo make install

(I keep a copy of the source for any modules that I build from scratch.) The second step is courtesy of Andy Gill:

cd ~/work
mkdir ghc
cd !$
wget http://www.haskell.org/ghc/dist/6.6/ghc-6.6-src.tar.bz2
wget http://www.haskell.org/ghc/dist/6.6/ghc-6.6-src-extralibs.tar.bz2
tar xjvf ghc-6.6-src.tar.bz2
tar xjvf ghc-6.6-src-extralibs.tar.bz2
cd ghc-6.6
./configure --with-readline-includes=/usr/local/include \
            --with-readline-libraries=/usr/local/lib
make -j && sudo make install

This is one of those little joys of configure: flags are passed down to the libraries from the master build, but available options are not bubbled up to ./configure --help. You'd have no idea about the availability of the readline-related flags unless you looked at the library.

Of course, you can also use Fink or DarwinPorts and pass the appropriate directories to configure instead.

Building from source takes a long time, no matter what kind of machine you've got. If you've got a multi-processor or multi-core machine, make -j (as many concurrent jobs as make feels like running) will help cut the time down. It takes under an hour on my four-way G5 with -j and several hours without. (Don't bother with -j without also cranking up the number of processes, e.g., ulimit -u 512, which beats the anemic default of 100 on Mac OS.)

(comment bubbles) 2 comments

First Steps with Haskell for Web Applications

Paul Brown @ 2006-10-11T06:19:00Z

As I blogged yesterday, I'm planning to build a simplified personal publishing system to host this blog, partially to get around resource consumption issues with the current platform and partially to get some exercise with a new language or two. I thought about Smalltalk, Erlang, and Io, but Haskell gets the initial nod if for no other reason than it's a third side of the coin that Ruby and Java are two sides of — rigorously defined, "purely" functional, lazy, "typeful" and compiles to native code via GHC. (And, of course, the syntax warms the cockles of my mathematician's heart.) Like Ruby with gems, the GHC runtime also has excellent modularity, with a minimal and standard core and good package management via Cabal. (Hello? Java?)

The first question is how to integrate an application written in Haskell into a web container, preferably a web server like lightTPD or Apache via FastCGI. (CGI would be a consideration, too, but that's just too retro for me.) Thankfully, as of the forthcoming 6.6 version, GHC has good CGI support via the Network.CGI module, and Björn Bringert has a FastCGI binding that built on the GHC 6.5 tip with only a little tinkering. (I wanted to use the core Network.CGI module in place of Björn's cgi-compat module.)

A "Hello, World" implementation using the FastCGI binding and then compiled to native code performed well on a basic smoke benchmark. Here's the relevant line from top for an instance of the handler:

  PID COMMAND      %CPU   TIME   #TH #PRTS #MREGS RPRVT  RSHRD  RSIZE  VSIZE
[...]
  234 hello.fcgi   0.0%  0:06.83   1    13    21   692K  1.63M  1.69M  29.0M
[...]

Benchmarking with ab shows that 5 handlers can happily crank through around 4000 requests/second with 99% of the requests requiring <2ms.

For comparison purposes and with an identical FastCGI configuration, the simplest possible Ruby on Rails "Hello, World" implementation (create test controller, edit the .rhtml to return content, wire-up FastCGI) consumes considerably more memory:

  PID COMMAND      %CPU   TIME   #TH #PRTS #MREGS RPRVT  RSHRD  RSIZE  VSIZE
[...]
  537 ruby1.8     12.1%  0:26.49   1    14    94  22.5M  3.35M  24.5M  54.5M
[...]

and only manages around 100 requests/second with ~50ms response time for the 50th percentile and ~400ms at the 99th percentile. (I recognize that I should probably put a sic. after the "only", since 100 requests/second is significantly in excess of the peak throughput that my blog sees on a good day.)

This is far from apples-to-apples, as the RoR version is doing a lot more work under the covers, but it does give me the expectation that I can probably get a Haskell blog implementation that will have a memory footprint smaller than a base irb and provide Slashdottable performance.

Next up, deciding on how to store/represent an entry and how to implement Atom for syndication.

(comment bubbles) 2 comments

Typo + TextDrive != Happy

Paul Brown @ 2006-10-10T06:18:58Z

The logs say that mult.ifario.us throws a fair number of HTTP 500 response codes back at visitors, and that's sad. It is certainly not the impression I want to make on visitors and readers (although subscribers are insulated from failures by FeedBurner's excellent service). In a perfect world, something as simple as a weblog wouldn't throw any 500s, ever. The problems come from running Typo on TextDrive. There isn't anything intrinsically wrong with the Typo engine, with Ruby, or even with TextDrive, as a similar setup runs like a top in my test environment, but TextDrive's resource limits make Typo's design impractical.

This got me thinking about the design of the simplest possible weblog publishing software, a design that would eschew the use of a database and all runtime configuration in favor of a system that is ultra-lightweight and quick to "boot". Almost all of the content in the blog is relatively static — display of an entry, feeds, archives, various paginations and groupings only require lightweight decoration of the XHTML for a given entry. Paginations and groups, e.g., by multiple tags or by tags plus date, require some dynamic behavior on the server, but not that much. A complexity-ectomy doesn't have to come at the expense of chrome and eye candy, as modern browsers make it possible to inject dynamic content (images from Flickr, links from del.icio.us, free-associations from Google AdSense, etc.) into the browser directly in the form of JavaScript.

The one difficult bit (and the only thing that would require a POST) would be comments. Comments don't need a database or use of dynamic content, either, and using email for comment workflow would solve multiple problems. Here's a sketch:

  • Comment is made on the weblog by submitting a form.
  • Server-side executable wraps the comment as an email and sends it to the blog's author.
  • Normal email filtering machinery is applied to the comment, i.e., spam filtering, and the blog content author either chooses to reply to the message, in which case the comment is added to the relevant entry (e.g., via a procmail recipe), or simply ignores it.

Akismet is apparently effective (if, at the same time, a statement about the sad state of the signal-to-noise ratio of the present-day internet), but it makes sense to leverage the filtering technology and massive corpus (~107 messages) of SPAM and ham that I already use for email.

I've experimented with different publishing platforms (Radio Userland, SnipSnap, MT, WordPress, Typo), and they all fell short for me in one way or another.

As the saying goes, if you want something done right... I'm going to embark on a project to replace Typo with something simple, dense/terse, and home-grown. It's also a chance to experiment with a new language or two, so it should be both fun and educational. Java's out due to footprint, but my mind is open otherwise — SmallTalk, Haskell, Lisp, Io, ...?

(comment bubbles) 1 comment

The AdSense Magic 8-Ball

Paul Brown @ 2006-09-28T19:18:00Z

In terms of an hourly wage paid for blogging, AdSense pays me something less than slave wages, but it's fun to shake up the Google magic 8-ball and see what it thinks is relevant to a given page. Sometimes it gets things right (ads for BizTalk and cello for the orchestration tag), and sometimes it gets things wrong (ads for VB for the Erlang tag).

So, let's see... <shake shake shake>...

(comment bubbles) 0 comments

All Posts contains 399 items in 57 pages of 7 items each:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57