Overview

brahms_gperf is a Matlab function that presents the performance data returned by BRAHMS after executing a system as a graphical report.

Warnings
  • This utility is in its infancy and, in particular, will give meaningless CPU times on current Linux and OSX builds (though the Process times should be accurate). No bug reports, please.

Example

M Source Code (against 995)
% invoke BRAHMS, asking for the execution report
[out, report] = brahms(sys, exe);

% review the performance
brahms_gperf(report);

High overhead

A BRAHMS execution will display high overhead if the amount of computation performed in each process iteration is low. This is because switching between processes (and/or threads), as well as passing state data and results in and out, can end up costing more resources than performing the computation itself. Here, we engineer such a case. We ran brahms_test std with a modified execution, as shown below: the execution time was extended, Encapsulation was disabled, Run-Phase Timing was enabled, and the thread count was limited to 2.

M Source Code (against 995)
% construct exe
exe = brahms_execution;
exe.all = true;
exe.name = test;
exe.stop = 250;
exe.encapsulated = false;
exe.precision = [];
exe.execPars.TimeRunPhase = 1;
exe.execPars.MaxThreadCount = 2;

In the figure below, the large orange blocks indicate CPU time for each of the two worker threads, as measured by the CPU timers. Each is marked with an axis up the side in CPU seconds (this is like Wall Clock time, but for each thread - since the threads run in parallel they can clock up more CPU seconds between them than have passed on the actual Wall Clock). This lets us see that the threads were relatively balanced, but we could do better by juggling the processes between them (there is not currently any support for this). These CPU times have relatively poor resolution (usually on the order of 10ms).

In the two threads, four and three processes ran, respectively; these are represented by the blue blocks. The height of the blue blocks gives the relative execution time taken by each process (as measured by microsecond timers against the Wall Clock; the microsecond timers are much more accurate, but cost CPU time themselves to operate). Thus, the tallest blue block is always as high as the orange block. This lets us see which processes are taking up all our execution time. resample_spikes is the biggest user in thread 0. This information is only available if you turn on TimeRunPhase - if you do not, the blue blocks are not displayed, and the execution runs faster.
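Since the run-phase timers themselves cost CPU time, you may want them off for production runs. The one-liner below is only a sketch: the enabling value 1 appears in the example above, and we assume that 0 disables the timers; simply not setting the parameter should have the same effect.

% disable per-process run-phase timing (assumption: 0 turns it off; only the
% enabling value 1 appears in this page's examples)
exe.execPars.TimeRunPhase = 0;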

The pale orange block towards the bottom of the thread 0 block represents the time this thread spent in Init Phase. Often, the Init and Term Phases are so brief that they do not log any CPU time at all (recall that the CPU timers have poor resolution); they have not appeared at all for thread 1, for instance. Within that block, the lighter blue bars indicate the relative usage of each process in the thread; sub/source_spikes was the main consumer here.

The green/red bar up the side of each thread block indicates the breakdown of CPU usage between User Mode and Kernel Mode. Kernel Mode represents time spent in system operations (particularly, filing system requests and, in this case, context switching).

Along the bottom is the overall execution timeline, as measured by the Invocation Bindings. The first section (pale green) is prepare, and was brief in this case. The second section (dark green) is execute. The third (pale green) is collect and its length represents how long it took to collect the fairly large set of results from this execution and load them into Matlab.

The execute section is further broken down, by the timings returned by the BRAHMS executable, into Init, Run and Term Phases. The discrepancy between the time measured by the executable in execute and that measured by the bindings (around 100ms) represents the time taken by the OS to load the executable and get the timers started, and the time not counted at the end as the Report File is being written and the executable is unloaded. This overhead should remain constant for longer executions. The length of the Term Phase, here, is around 100ms; some of this represents the time taken to terminate the system as a whole, but the bulk of it is the time taken to terminate the Data objects in the System - since these are terminated in the caller thread, the time does not show up in the CPU timings.

Overall, this system used something like 0.3 seconds of User Mode Run Phase time (the sort that actually gets things done) in two seconds of total execution. The latter second was entirely concerned with loading the results into Matlab; even so, this execution has high overhead.
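As a rough sanity check on that claim, the short calculation below plugs in the figures quoted above (about 0.3s of useful User Mode Run Phase time against about 2s of total execution); the exact values will vary from run to run, so treat it as illustrative only.

% rough overhead estimate from the figures quoted above (illustrative only)
useful = 0.3;                   % User Mode Run Phase time, seconds
total = 2.0;                    % total execution time, seconds
utilisation = useful / total    % ~0.15
overhead = 1 - utilisation      % ~0.85, i.e. roughly 85% of the time is overhead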

Low overhead

When the computation performed per iteration is greater, the overhead tends to be negligible. Here, we ran brahms_test resamp, with the system and execution modified as follows.

M Source Code (against 995)
srcdata = repmat(srcdata, 50000, 1);
state.data = srcdata;
state.repeat = true;

...

% exe
exe = brahms_execution;
exe.all = false;
exe.name = test;
exe.stop = 1000;
exe.precision = [];
exe.encapsulated = false;
exe.execPars.TimeRunPhase = 1;
exe.execPars.ShowGUI = 0;

This is a low overhead execution. All the overheads discussed above are greatly reduced or unmeasurable. There is no measurable use of CPU Kernel Time, which indicates that we spent all our time computing rather than servicing system calls. With all logging turned off (exe.all = false), termination time is also almost unmeasurable. The ~100ms load/unload overhead is still present, but looks less significant in this six second execution. Overall, we got about 7 seconds of User Mode Run Phase time (the good stuff) in an execution of just over 6.5 seconds. This isn't a great speed-up (the threads are poorly balanced), but the overhead is no longer significant.

Parallel speed-up

Finally, we engineer an example that takes good advantage of the hardware to parallelise the execution for good speed-up. We took brahms_test resamp again, with the changes made above. Then, we added eight copies of the resampler, and specified 4 threads, to balance the execution rather better on the test hardware (a quad core machine). The results, shown below, are that we got about 77 seconds of User Mode Run Phase time in an execution of about 21 seconds. This is a speed-up of around 3.65, which is pretty good for a four-core machine - the appearance of the process called "src" in thread 0 is putting the execution off perfect balance. We give the full code to generate this plot below the figure.
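The speed-up figure follows directly from the numbers just quoted; the calculation below uses those rounded figures, so it lands slightly above the 3.65 reported from the unrounded timings.

% parallel speed-up from the figures quoted above (illustrative only)
cpu_seconds = 77;               % User Mode Run Phase time summed across threads
wall_seconds = 21;              % total execution time
speedup = cpu_seconds / wall_seconds   % ~3.67; the theoretical maximum on four cores is 4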

Perfect balance can always be achieved by not limiting the thread count - the scheduler will then simply schedule the processes on whichever CPU is free. The cost of this, however, is more context switching, and a compromise position will usually achieve the best overall performance. In any case, for typical largish systems there is usually very little to be gained by attempting to manually balance the execution. For this reason, we do not discuss balancing in the documentation, other than this little play-around here to illustrate the operation of brahms_gperf.
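To make the trade-off concrete, the sketch below contrasts the two configurations. The only execution parameter used is MaxThreadCount, which appears in the examples above; the assumption that leaving it unset means the thread count is not limited is ours, not taken from this page.

% manual balancing: cap the number of worker threads (as in the examples above)
exe.execPars.MaxThreadCount = 4;

% free scheduling: simply do not set MaxThreadCount, and the OS scheduler
% places the processes on whichever CPU is free, at the cost of more
% context switching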

M Source Code (against 995)
% RESAMPLE a numeric of arbitrary type
f1 = 13;
f2 = '513/10';
T1 = 1 / f1;
T2 = 1 / str2num(f2);
r1 = 0:T1:1-1e-8;
r2 = 0:T2:1-1e-8;
srcdata = single(sin(r1*pi*2));

% empty system
sys = sml_system('Resample Test System');

% add process
state = [];
srcdata = repmat(srcdata, 50000, 1);
state.data = srcdata;
state.repeat = true;
sys = sys.addprocess('src', 'std/2009/source/numeric', f1, state);

% add process (x8)
state = [];
state.order = 1;
for n = 1:8
    name = ['resamp' int2str(n)];
    sys = sys.addprocess(name, 'std/2009/resample/numeric', f2, state);
    sys = sys.link('src>out', name);
end

% exe
exe = brahms_execution;
exe.all = false;
exe.name = test;
exe.stop = 1000;
exe.precision = [];
exe.encapsulated = false;
exe.execPars.TimeRunPhase = 1;
exe.execPars.ShowGUI = 0;
exe.execPars.MaxThreadCount = 4;