After I saw Robert Sayre's results, I thought that I'd give Rickard Green's Erlang exerciser “big.erl” a go on my four-core (two 2.5GHz G5 processors with two cores each) PowerMac (MacOS X 10.4.6) to see the effect of different numbers of schedulers. (Joe Armstrong posted some benchmark information in his blog, but I don't have the means to reproduce it for direct comparison.)
Eye-Grabbing Plots
There's code below, so as an amuse bouche, here are a couple of plots that illustrate the results. (I used HippoDraw to draw the plots.) This first graphic shows the time to execute the benchmark plotted against the number of processes.
| #Schedulers | Color |
| 1 | orange |
| 2 | red |
| 4 | green |
| 8 | blue |
| 16 | magenta |
The green plot illustrates that four schedulers break even with one or two schedulers at 800 processes and win from there on. (I did try a 32-scheduler run but ditched it partway through because the performance was so poor.) Here's another plot that provides an alternative visualization.

In the plot, lighter is faster, and as the number of processes increases, it's visually apparent that the four-scheduler sequence is superior.
Interpretation
OK — so what gives here?
In comparison with Robert's results (look for the graph), multiple schedulers did outperform a single scheduler here, but much less dramatically, and performance degraded much more rapidly once the number of schedulers exceeded the optimum. More than likely, the root cause lies deep in the core of the MacOS X kernel. Apple has a technote that explains threading in MacOS X, and a cursory read suggests that the application-level pthread threading model is deeply layered over the low-level kernel threading model. My interpretation is that Mach does extra work to spread load across lower-level threads when relatively few schedulers are used, so it wouldn't be surprising if a single scheduler manages to use slightly more than one of the cores.
In terms of what SMP (a.k.a. “symmetric multi-processing”) means for Erlang, MT (for “multi-threaded”) would be a better term. The current version of Erlang, R10B, uses a single scheduler thread to process a queue of runnables, and Erlang R11B uses multiple scheduler threads to manage the same queue. (See, e.g., this presentation.) Under (naively) ideal circumstances, a thread works so hard that it fully consumes the attention of a processor and then other threads are forced onto other processors (i.e., number of threads converges to number of processors), but as this benchmark illustrates, the strength of that convergence is determined by the extent to which the operating system kernel cooperates.
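Since the ideal is one busy scheduler thread per processor, it helps to know how many processors the operating system reports before picking scheduler counts to try. Here's a small bash sketch that isn't part of the benchmark itself: the helper name `ncores` is made up, and it assumes that either `getconf _NPROCESSORS_ONLN` or `sysctl -n hw.ncpu` is available, falling back to 1 if neither works.

```shell
# Hypothetical helper (not from the original benchmark): print the number
# of processors the OS reports. Tries getconf first, then sysctl, and
# falls back to 1 if neither command succeeds.
ncores() {
  getconf _NPROCESSORS_ONLN 2>/dev/null \
    || sysctl -n hw.ncpu 2>/dev/null \
    || echo 1
}

echo "processors reported: $(ncores)"
```

The reported count can then be compared with the value passed to `+S` when starting the emulator.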
Code Snippets
Here's a little snippet of Erlang to make running the benchmark with different numbers of processes easier and dump data in a convenient format:
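-module(bmark).
-export([go/0]).

%% Scheduler count, read from the first plain argument (the value after "--").
n() -> element(1,string:to_integer(
                   lists:nth(1,init:get_plain_arguments()))).

%% Plural suffix for the banner line.
plur(1) -> "";
plur(_) -> "s".

%% Run big:bang/1 for each process count and print one line per run:
%% schedulers, processes, and the result divided by 1e6, truncated to one
%% decimal (seconds, if big:bang/1 returns microseconds).
runbmark([]) -> done;
runbmark([Head|Tail]) ->
    io:format("~4w ~4w ~6.1f~n",
              [n(),
               Head,
               trunc(big:bang(Head)/100000)/10]),
    runbmark(Tail).

%% Entry point: print a banner, sweep 50..1500 processes in steps of 50, exit.
go() ->
    N = n(),
    io:format("// Running with ~w scheduler~s.~n",
              [N,plur(N)]),
    runbmark(lists:seq(50,1500,50)),
    io:format("~n",[]),
    halt().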
And here's some bash to run 1, 2, 4, 8, and 16 schedulers in succession:
for ((i=0 ; i<5 ; ++i )); do \
  path/to/otp_src_R11B_2006-05-08/bin/erl -smp +S$((1<<$i)) \
    -noshell -eval 'bmark:go()' -- $((1<<$i)); echo; done
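To put numbers on the comparison rather than eyeballing plots, the output of that loop can be post-processed. Here's a minimal sketch in bash and awk, assuming the output was saved to a file and that the one-scheduler run comes first (as it does in the loop above). The function name `speedup` and the file name `results.txt` are made up for illustration.

```shell
# Hypothetical helper: read bmark output on stdin and print, for every
# multi-scheduler line, "schedulers processes ratio", where ratio is the
# one-scheduler time divided by the time of that run.
# Assumes the one-scheduler block appears before the others in the input.
speedup() {
  awk '
    /^\/\// || NF != 3 { next }       # skip banner lines and blank lines
    $1 == 1 { base[$2] = $3; next }   # remember the one-scheduler time per process count
    ($2 in base) && $3 > 0 {          # compare against the remembered baseline
      printf "%d %d %.2f\n", $1, $2, base[$2] / $3
    }
  '
}

# Usage (results.txt is a placeholder for wherever the output was saved):
#   speedup < results.txt
```

A ratio above 1.00 means that scheduler count beat the single scheduler at that number of processes, and a ratio below 1.00 means it lost.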