Use the Cores, erl

Paul R. Brown @ 2008-01-19T07:26:23Z

In spite of the fact that my last Apple workstation failed rather ingloriously after only a couple of years of use, I went ahead and replaced it with another Apple workstation, an eight-core Mac Pro.

As an experiment, I decided to run the same Erlang benchmark (big.erl) that I ran on the quad-core machine, this time with Erlang R12B. The previous results showed that four schedulers was optimal on the four-core machine. Here are the results of the same test battery on the eight-core machine:

line chart of throughput per number of Erlang schedulers

Two things are odd about this chart:

  1. The running times appear to be about equal to the running times for the benchmark on the quad-core machine. The raw clock speeds aren't that different per core (2.5GHz G5 versus 2.8GHz Xeon), so maybe it's not unreasonable for that to be a draw.
  2. Four schedulers appears to be the optimum (from the set {1,2,4,8,16,32}), where eight would have been the expected value.

It turns out that the optimality of four schedulers in this case doesn't disprove the hypothesis that the optimum number of schedulers equals the number of cores, since the benchmark only appears to be utilizing three of the eight cores:

CPU information showing only 33% active

The question is why the Erlang VM isn't using the available CPU resources. (Two separate VMs running big.erl get utilization up to 85%.) The answer may be buried somewhere inside operating system limits (see, e.g., sysctl(3) and sysctl(8); maybe kern.clockrate?), but it might also be something more interesting. Meanwhile, I'll try to come up with a similar toy benchmark for Haskell to see if it achieves better utilization of the CPUs.

(comment bubbles) 0 comments

Tuppence Tour of Haskell Concurrency Constructs

Paul Brown @ 2007-10-21T05:04:00Z

One of the little widgets that I need for an experiment is a sequence number generator, and it's a good way to get a tuppence (i.e., less than half a nickel) tour of Haskell's concurrency constructs with a little lesson on laziness thrown in for spice.

Requirements

The generator should produce unique Int values on demand and support concurrent access, and I want to try out a couple of methods, one that uses GHC's baseline concurrency capabilities and one that uses STM. (Also, I'd like a better feel for the concepts in practice, so this makes a good exercise!)

Take One: Asynchronous Channels

The base GHC concurrency packages (Control.Concurrent and its descendents) include a great set of buildings blocks: one-place buffers (MVar), asynchronous channels (Chan) that can be combined into one-to-many broadcast channels, and quantity semaphores (QSem and QSemN).

The design I have in mind uses two asynchronous channels, one for requests and one for responses. All clients of the generator receive responses on the one channel, which means that one might get the number that another one requested, but that's no big deal.

First, one data structure for the requests and one for the client view of the generator:

import Control.Concurrent ( ThreadId, forkIO )
import Control.Concurrent.Chan ( Chan, newChan, writeChan, readChan )

data Request = Get
             | Set { initial_value :: Int }
             | Die

data Generator = Generator { thread_id :: ThreadId,
                             request_channel :: Chan Request, 
                             response_channel :: Chan Int }

Then some functions (whose signatures I'll match with the STM version below) to manipulate the generator:

reset :: Generator -> Int -> IO ()
reset g i = writeChan (request_channel g) (Set i)

get_next :: Generator -> IO Int
get_next g = (writeChan (request_channel g) (Get)) >> 
             readChan (response_channel g)

stop :: Generator -> IO ()
stop g = writeChan (request_channel g) Die

Each client function is implemented in terms of sending the request data structure to the generator on the request_channel and then, in the case of the Get operation, blocking to read a value from the response_channel.

Finally, the request-handling event loop in a separate lightweight thread:

new_counter :: IO Generator
new_counter = do { in_chan <- newChan
                 ; out_chan <- newChan
                 ; tid <- forkIO $ loop in_chan out_chan 0
                 ; return $ Generator tid in_chan out_chan }

loop :: Chan Request -> Chan Int -> Int -> IO ()
loop ic oc i = do { req <- readChan ic
                  ; case req of 
                      Die -> return ()
                      (Set n) ->
                          (loop ic oc n)
                      Get ->
                          do { writeChan oc i
                             ; loop ic oc $! (i+1) } }

A similar pattern (request channel, response channel or one-place buffer (MVar) either pre-set or passed with the request, tail-recursive event loop) works for a wide variety of problems and presents a reasonable API for clients.

The single-threadedness of the loop makes it intuitively easy to conclude that it does the right thing (returns unique values to clients), but it's easy enough to check with some experiments in ghci:

> :m + Control.Concurrent
> g <- new_counter
> get_next g
0
> forkIO $ sequence_ $ replicate 10000000 (get_next g)
ThreadId 111
> forkIO $ sequence_ $ replicate 10000000 (get_next g)
ThreadId 112
> forkIO $ sequence_ $ replicate 10000000 (get_next g)
ThreadId 113
[... wait a while ...]
> get_next g
300000001

Which is the right answer. Another experiment will show how fair the scheduler is in terms of multiple client threads:

> g <- new_counter
> :set -fno-print-bind-result -fglasgow-exts

By the way, [TAB]-completion in ghci means that the above can be obtained by typing:

:set -fno-pr[TAB] -fgl[TAB]

The -fno-print-bind-result prevents ghci from spoiling our attempts to be lazy by printing (and thus evaluating), and the -fglasgow-exts lets us use a type specification to specify what kind of channel we're creating. (Normally, the compiler would figure it out from context, but that won't work in ghci.)

> tid::(Chan (Int,Int)) <- newChan
> mapM_ (\n -> (forkIO $ sequence_ $ replicate 1000 (get_next g >>= (writeChan tid).((,) n)))) [1..10]

The last expression looks dense, but it breaks down simply. In plain English, fork 10 threads numbered 1 through 10, and on each thread, do the following 1000 times: get a sequence number from the generator and then send the ordered pair of the threads number and the sequence number on the tid channel. (The (,) function makes ordered pairs out of its arguments.) To inspect the contents of the channel once the threads are done:

> x <- getChanContents tid

(Without the -fno-print-bind-result above, this would run forever.) And now a couple of quick checks:

> :m + List
> let y = take 10000 x
> length $ nub $ map snd y
10000
> [ length $ filter (((==) j).fst) y | j <- [1..10] ]
[1000,1000,1000,1000,1000,1000,1000,1000,1000,1000]
> let w = [ length $ nub (take 10 (drop k (map fst y))) | k <- [0..9990] ]
> [ length $ filter ((==) j) w | j <- [1..10] ]
[0,0,0,0,0,0,1,1,522,9467]

So, most of the time, in 10 turns, 10 client threads get a chance.

A Quick Word About Laziness

The $! is a piece of Haskell magic that ensures that the value on the right is evaluated. Without it, this happens:

> g <- new_counter
> get_next g
0
> sequence_ $ replicate 100000000 (get_next g)
> get_next g
*** Exception: stack overflow

The reason lies in Haskell's laziness. At the time that the second get_next is evaluated, there are 100,000,000 (i+1) queued-up and waiting to be evaluated because no one ever actually consumed the values passed back on the response channel. This is just the way that Haskell works: You can tell the runtime about a ridiculous computation, but it won't complain until you ask for the result. The $! ensures that the (i+1) value is evaluated each time Get is performed.

Take Two: TVar

Haskell's STM (software transactional memory; read/watch Simon Peyton Jones on the subject) implementation provides another set of building blocks in the form of atomically mutable cells (TVar), asynchronous channels (TChan), one-place buffers (TMVar), and arrays (TArray).

The sequence generator implementation with the same API but using a TVar to hold the current value is shorter and simpler (no backing thread) than the one above that uses channels:

import Control.Concurrent.STM
import Control.Concurrent.STM.TVar
import Control.Monad

data Generator = Generator { counter :: TVar Int }

new_counter :: Int -> IO Generator
new_counter i = (atomically $ newTVar i) >>= (return . Generator)

get_next :: Generator -> IO Int
get_next g = atomically $ do { n <- readTVar $ counter g
                             ; (writeTVar (counter g)) $! (n+1)
                             ; return n }

reset :: Generator -> Int -> IO ()
reset g n = atomically $ writeTVar (counter g) n

Note the $! that appears in get_next; it is there for the same reason as it appears in the Chan version.

The same set of basic verifications as above ends with:

> [ length $ filter ((==) j) w | j <- [1..10] ]
[9901,90,0,0,0,0,0,0,0,0]

Or, the 10 threads took blocks of 1000 numbers from the sequence because the scheduler had no reason to switch. For a slightly different spawning of clients with a yield added, we get more regular results:

> mapM_ (\n -> (forkIO $ sequence_ $ replicate 1000 (get_next g >>= (writeChan tid).((,) n) >> yield))) [1..10]
[... same steps ...]
[0,0,0,0,0,0,0,0,0,9991]
> (map fst y) == (concat $ replicate 1000 [1..10])
True

Conclusions

The STM version is probably the one that I'll use, but I also need some more complicated components where the channel-based design will work well.

(comment bubbles) 2 comments

Not Quite No Transactions

Paul Brown @ 2007-03-20T02:56:00Z

I enjoyed Dan Pritchett's presentation on The EBay Architecture, but now it seems like people are getting carried away, where the intersection of people with my blogroll contains Martin Fowler. Martin is well-known and rightly regarded as a smart guy, but this fragment from his post on "no transactions" isn't quite right:

The rationale for not using transactions was that they harm performance at the sort of scale that eBay deals with. This effect is exacerbated by the fact that eBay heavily partitions its data into many, many physical databases. As a result using transactions would mean using distributed transactions, which is a common thing to be wary of.

(Unfortunately, Martin's bliki doesn't support comments, or I would be posting this over there.) The relevant bullet from Dan's presentation at SDForum (not the one at QCon) is:

Absolutely no client-side transactions.
∗ Single database transactions managed through anonymous PL/SQL blocks.
∗ No distributed transactions.

This is very different than "no transactions", but the game of telephone is already underway.

The reality is that attempts at being "for sure, for sure" (multi-master replication, two-phase commit, locking, aggressive retry, etc.) can and do inhibit scale. There are also plenty of stupid things you can do to hobble an application without leaving a single database, usually by inadvertently abusing locking. (One classic is a SELECT ... FOR UPDATE that includes a join across a code table.)

In terms of things that can be done to scale, it's a good idea to take the shortest path to a working system. It's a good idea to be intimately familiar with the performance characteristics of a system in production, and that means really, truly understanding how all of the pieces (load balancer, web server, application tier, database, physical network, hardware, etc.) work individually and collectively. It's important to be able to disentangle design choices from business constraints, and this is more or less the point that Assaf made in the original round of discussion about Dan Pritchett's presentation. As Dare Obasanjo and Charlie Wood point out, life without transactions isn't necessarily a picnic, either. Thinking about "what can I live without" in the abstract will lead down the path toward something like what Pat Helland describes in his Life Beyond Distributed Transactions paper, but that journey should be taken in pragmatic steps driven by experiment and experience. (And now that you're out of the jungle, Pat, can you hurry up and go back to blogging?)

(comment bubbles) 3 comments

Java Brainteaser on Regular Expressions

Paul Brown @ 2007-03-20T00:32:11Z

Suppose that you have a Java web application where regular expressions are used deep down in the implementation to do some work, but you observe that the an array index exception is occurring sporadically where the regular expressions are being used.

What's causing the exceptions? What's the solution?

(comment bubbles) 1 comment

Where did the Exception go...?

Paul Brown @ 2006-12-20T03:03:23Z

It's easy to get used to throwing away return values in Java. For example, I don't usually need or use the return value from Map#put(Object) or Map#remove(Object). Innocuous for collections, this can have unintended consequences (or, rather lack of consequences) if the return value that you're throwing away is a Future.

For example, suppose that you do something like:

Executors.newSingleThreadExecutor().submit(new Runnable() {
  public void run() {
    throw new UnsupportedOperationException();
  }
});

The exception is lost and gone forever. There are four choices, at least in the case of a ThreadPoolExecutor as above:

  1. Use execute(...) instead of submit(...), in which case the exception will be thrown as though we'd just started a thread to run the Runnable.
  2. Capture the Future returned by submit(...), which is actually a FutureTask in this case, and deal with completion.
  3. Subclass the ThreadPool and supply an afterExecute(Runnable,Throwable) implementation.
  4. Perform any exception handling (logging, retry, etc.) in the body of the run() method.

My preference is for number 4, since it doesn't involve stack traces on the console, a fragile cast, or subclassing, but I'm open to other opinions.

(comment bubbles) 2 comments

Sure, it runs like a top. How does it idle?

Paul Brown @ 2006-11-04T03:59:00Z

A couple of weeks back, I wrote a simple but well-instrumented Java framework to handle SEDA-like use cases (thread pools linked by queues) for a consulting customer. The java.util.concurrent package and friends makes this sort of thing much easier than it used to be, and it was surprisingly easy to crank it out. As a smoke test, I set up a torture test for a simple configuration and left it running over night, and it appeared solid — no memory or thread leaks, no lock-ups.

Someone working on a different problem needed something similar and took the framework for a quick test drive, ending up with an out of memory error after a night of doing nothing! It turns out that there was a bug that meant that the poll(long,TimeUnit) on an empty LinkedBlockingQueue leaks, and I'd never run across it in testing because I hadn't tested what happens when the system has no load for an extended period.

The lesson is that no load doesn't mean that the system is actually doing nothing, and it's an important scenario to add to a test plan.

(comment bubbles) 2 comments

Actors, Scala, and JaCOb

Paul Brown @ 2006-07-13T08:06:04Z

LtU carried an announcement of a paper from Martin Odersky and Phillipp Haller at EPFL called “Event-Based Programming without Inversion of Control”. To quote a snippet:

The central idea is as follows: An actor that waits in a receive statement is not represented by a blocked thread but by a closure that captures the rest of the actors computation. The closure is executed once a message is set to the actor that matches one of the message patterns specified in the receive. The executing closure is “piggy-backed” on the thread of the sender. If the receiving closure terminates, control is returned to the sender as if a procedure returns. If the receiving closure blocks in a second receive, control is returned to the sender by throwing a special exception that unwinds the receiver's call stack.

Scala has had actors in the form of scala.concurrent.Actor since version 1.x, but as extensions of java.lang.Thread they are thread-dependent. Thread-dependent means that a Scala Actor might be preferable to a Java Thread for semantic reasons but that there is no performance advantage. (In fact, a benchmark in the paper shows that it's a disadvantage, but semantics are often more important than raw performance.)

The PXE BPEL engine relies on a little-publicized Java-based actor framework called JaCOb (short for “Java Concurrent Objects”) that makes some different choices than Martin and Phillipp did but that is ultimately based on hooking closures to model the receipt of messages in the future. For a quick tour of JaCOb, there's either the source code or a detailed tutorial written by Matthieu Riou on the Apache Ode incubation wiki.) For one, JaCOb's syntax would be cleaner in Scala, as Java lacks delegates. JaCOb does not use exceptions to break the stack but instead relies on passing communication channels. (Using exceptions for control flow is generally regarded as a bad thing to do. In addition to the philosophical reasons, throwing an exception is involved and, depending on the JVM implementation (e.g., implementing exceptions as signals), can be expensive.) JaCOb doesn't include explicit support for distributed operation as Martin and Phillipp describe for their framework, but a suitable flavor of JaCOb's soup could be implemented.

I'm looking forward to checking out the event-based approach from Martin and Phillipp when it comes out with Scala 2.1.7, and in the meantime, I should finish up the JaCOb tutorial that's been mouldering on my to-do list.

(comment bubbles) 0 comments

Posts tagged ["concurrency"] contains 11 items in 2 pages of 7 items each:
1 2