Use the Cores, erl

Paul R. Brown @ 2008-01-19T07:26:23Z

In spite of the fact that my last Apple workstation failed rather ingloriously after only a couple of years of use, I went ahead and replaced it with another Apple workstation, an eight-core Mac Pro.

As an experiment, I decided to run the same Erlang benchmark (big.erl) that I ran on the quad-core machine, this time with Erlang R12B. The previous results showed that four schedulers was optimal on the four-core machine. Here are the results of the same test battery on the eight-core machine:

[Figure: line chart of throughput per number of Erlang schedulers]

Two things are odd about this chart:

  1. The running times appear to be about equal to the running times for the benchmark on the quad-core machine. The raw clock speeds aren't that different per core (2.5GHz G5 versus 2.8GHz Xeon), so maybe it's not unreasonable for that to be a draw.
  2. Four schedulers appears to be the optimum (from the set {1,2,4,8,16,32}), where eight would have been the expected value.

It turns out that the optimality of four schedulers in this case doesn't disprove the hypothesis that the optimum number of schedulers equals the number of cores, since the benchmark only appears to be utilizing three of the eight cores:

[Figure: CPU information showing only 33% active]

The question is why the Erlang VM isn't using the available CPU resources. (Two separate VMs running big.erl get utilization up to 85%.) The answer may be buried somewhere inside operating system limits (see, e.g., sysctl(3) and sysctl(8); maybe kern.clockrate?), but it might also be something more interesting. Meanwhile, I'll try to come up with a similar toy benchmark for Haskell to see if it achieves better utilization of the CPUs.
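Before blaming the operating system, it is worth confirming how many schedulers the VM actually started and how many processors it believes it has. The following is a minimal sketch, assuming a current Erlang/OTP release (the `schedulers_online` and `logical_processors` keys may not all exist in R12B); the module and function names are made up for illustration.

```erlang
-module(schedcheck).
-export([summary/0, report/0]).

%% Returns {Schedulers, SchedulersOnline, LogicalProcessors}.
%% LogicalProcessors may be the atom 'unknown' if the VM cannot detect it.
summary() ->
    {erlang:system_info(schedulers),
     erlang:system_info(schedulers_online),
     erlang:system_info(logical_processors)}.

%% Prints the same three values on one line.
report() ->
    {Schedulers, Online, Cores} = summary(),
    io:format("schedulers=~w online=~w logical_processors=~w~n",
              [Schedulers, Online, Cores]).
```

Running `schedcheck:report()` in a VM started with `+S8` should show whether eight schedulers were really created; if they were and utilization is still low, the limit is more likely outside the scheduler count itself.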


De gustibus non disputandum est

Paul Brown @ 2007-05-18T16:08:29Z

From Blaine Cook's RubyConf presentation on scaling Twitter:

(The SlideAware guys obviously have a different perspective.) As I've been bouncing around between languages for various projects (Java, Ruby, Erlang, Haskell, etc.), the only thing I've found that really makes me want to wield a fork or other sharp implement is what people do with the language as opposed to the language itself.


New Erlang Book

Paul Brown @ 2007-03-04T01:11:00Z

With the default Erlang book now over a decade old, a new one was sorely needed, and it looks like Joe Armstrong has stepped up (again). (Hat-tip to Bill de Hóra for the pointer to the book, since I haven't been tracking the Erlang space lately.) It's not in the beta PDF yet, but I'm looking forward to what is currently slated for Chapter 19:

How to structure applications for programming multi-core CPUs. Increasing parallelism. Deciding the granularity of concurrency.

Good stuff.


More on Erlang Performance and Threading

Paul Brown @ 2006-05-11T23:34:00Z

After I saw Robert Sayre's results, I thought that I'd give Rickard Green's Erlang exerciser “big.erl” a go on my four-core (two 2.5GHz G5 processors with two cores each) PowerMac (MacOS X 10.4.6) to see the effect of different numbers of schedulers. (Joe Armstrong posted some benchmark information in his blog, but I don't have a means to reproduce them for direct comparison.)

Eye-Grabbing Plots

There's code below, so as an amuse bouche, here are a couple of plots that illustrate the results. (I used HippoDraw to draw the plots.) This first graphic shows the time to execute the benchmark plotted against the number of processes.

# Schedulers   Color
1              orange
2              red
4              green
8              blue
16             magenta

The green plot illustrates that four schedulers breaks even with one or two schedulers at 800 processes and wins from there out. (I did try a 32 scheduler run but ditched it part way through because the performance was so poor.) Here's another plot that provides an alternative visualization.

In the plot, lighter is faster, and as the number of processes increases, it's visually apparent that the four scheduler sequence is superior.

Interpretation

OK — so what gives here?

In comparison with Robert's results (look for the graph), multiple schedulers still beat a single scheduler here, but by a much smaller margin, and performance degraded much more rapidly once the number of schedulers exceeded the optimum. More than likely, the root cause lies deep in the core of the MacOS X kernel. Apple has a technote that explains threading in MacOS X, and a cursory read suggests that the application-level pthread threading model is deeply layered over the low-level kernel threading model. My interpretation would be that Mach is doing extra work to spread load across lower-level threads when relatively few schedulers are used, so it wouldn't be surprising if a single scheduler manages to use slightly more than one of the cores.

In terms of what SMP (a.k.a. “symmetric multi-processing”) means for Erlang, MT (for “multi-threaded”) would be a better term. The current version of Erlang, R10B, uses a single scheduler thread to process a queue of runnables, and Erlang R11B uses multiple scheduler threads to manage the same queue. (See, e.g., this presentation.) Under (naively) ideal circumstances, a thread works so hard that it fully consumes the attention of a processor and then other threads are forced onto other processors (i.e., the number of threads converges to the number of processors), but as this benchmark illustrates, the strength of that convergence is determined by the extent to which the operating system kernel cooperates.
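To make the "several scheduler threads, one queue of runnables" arrangement concrete, here is a toy model written with ordinary Erlang processes. It is only a sketch of the shape of the idea, not how the VM's scheduler is implemented; the module name `sharedq` and all function names are hypothetical.

```erlang
-module(sharedq).
-export([run/2]).

%% Run Jobs (a list of zero-arity funs) with NWorkers workers that all
%% pull from one shared queue. Returns a list of {WorkerId, Result}.
run(Jobs, NWorkers) when is_list(Jobs), NWorkers > 0 ->
    Parent = self(),
    Queue = spawn_link(fun() -> queue_loop(Jobs) end),
    Workers = [spawn_link(fun() -> worker(Id, Queue, Parent) end)
               || Id <- lists:seq(1, NWorkers)],
    Results = collect(length(Jobs), []),
    %% Wait until every worker has seen the empty queue, then stop it.
    [receive {finished, W} -> ok end || W <- Workers],
    Queue ! stop,
    Results.

%% The single shared queue: hands out one job per request.
queue_loop(Jobs) ->
    receive
        {next, From} ->
            case Jobs of
                [Job | Rest] -> From ! {job, Job}, queue_loop(Rest);
                []           -> From ! empty, queue_loop([])
            end;
        stop ->
            ok
    end.

%% A worker plays the role of one scheduler thread: take a job, run it,
%% report the result, and ask for the next one.
worker(Id, Queue, Parent) ->
    Queue ! {next, self()},
    receive
        {job, Job} ->
            Parent ! {result, Id, Job()},
            worker(Id, Queue, Parent);
        empty ->
            Parent ! {finished, self()}
    end.

%% Gather exactly N results, in arrival order.
collect(0, Acc) -> lists:reverse(Acc);
collect(N, Acc) ->
    receive
        {result, Id, R} -> collect(N - 1, [{Id, R} | Acc])
    end.
```

For example, `sharedq:run([fun() -> X * X end || X <- [1,2,3,4]], 2)` returns the four squares tagged with whichever worker happened to run each job. Every worker contends for the same queue process, which hints at why adding workers beyond the number of processors buys nothing in this model.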

Code Snippets

Here's a little snippet of Erlang to make running the benchmark with different numbers of processes easier and dump data in a convenient format:

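%% Benchmark driver: runs big:bang/1 over a range of process counts
%% and prints one "schedulers processes seconds" row per run.
-module(bmark).
-export([go/0]).

%% Scheduler count, read from the first plain argument after "--" on
%% the erl command line (used only to label the output).
n() -> element(1,string:to_integer(
                  lists:nth(1,init:get_plain_arguments()))).

%% Plural suffix for the banner line.
plur(1) -> "";
plur(_) -> "s".

%% Time the benchmark for each process count in the list. big:bang/1 is
%% assumed to return elapsed microseconds; the arithmetic converts that
%% to seconds, truncated to one decimal place.
runbmark([]) -> done;
runbmark([Head|Tail]) ->
    io:format("~4w ~4w ~6.1f~n",
              [n(),
               Head,
               trunc(big:bang(Head)/100000)/10]),
    runbmark(Tail).

%% Entry point: print a banner, sweep 50..1500 processes in steps of
%% 50, then shut the VM down.
go() ->
    N = n(),
    io:format("// Running with ~w scheduler~s.~n",
              [N,plur(N)]),
    runbmark(lists:seq(50,1500,50)),
    io:format("~n",[]),
    halt().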

And here's some bash to run 1, 2, 4, 8, and 16 schedulers in succession:

for ((i=0 ; i<5 ; ++i )); do \
 path/to/otp_src_R11B_2006-05-08/bin/erl -smp +S$((1<<$i)) \
-noshell -eval 'bmark:go()' -- $((1<<$i)); echo; done

Single Threading Good

Paul Brown @ 2006-05-09T05:39:00Z

Perhaps surprisingly, a post from Robert Sayre, who's been playing with Erlang on a SUN Fire T2000 (lucky bum on both counts), doesn't surprise me: best performance is achieved when the number of Erlang schedulers is equal to the number of hardware threads (i.e., “CoolThreads” in the case of the SUN box). Experience has borne out that using a single low-level thread under lightweight higher-level structures to manage concurrency is usually a good plan, and at least intuitively, the pigeonhole principle says that this should apply to additional low-level threads (or cores) if the higher-level structures are designed properly. (More threads than cores would mean that at any point in time one of the cores was supporting multiple threads and thus slower than it would be if it were only supporting one.)
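The pigeonhole argument can be written down as a one-line calculation: with T runnable threads spread over C cores, at least one core must carry at least the ceiling of T/C threads. A minimal sketch (the module and function names are made up for illustration):

```erlang
-module(pigeonhole).
-export([min_busiest/2]).

%% Smallest possible thread count on the busiest core when Threads
%% threads are spread over Cores cores: the ceiling of Threads / Cores,
%% computed with integer arithmetic.
min_busiest(Threads, Cores) when Threads >= 0, Cores > 0 ->
    (Threads + Cores - 1) div Cores.
```

With eight cores, `pigeonhole:min_busiest(8, 8)` is 1, while `pigeonhole:min_busiest(16, 8)` is 2: once threads outnumber cores, some core is necessarily time-slicing.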
