## Zen of Code Optimization

Michael Abrash


Michael Abrash explores the inner workings of all Intel-based PCs including the hot new Pentium. This is the only book available that provides practical and innovative "right-brain" approaches to writing fast PC software using C/C++ and assembly language. This book is packed with "from the trenches" programming secrets and features "undocumented" Pentium programming tips. Provides hundreds of optimized coding examples.

While answering another Stack Overflow question (this one) I stumbled upon an interesting sub-problem: what is the fastest way to sort an array of 6 ints?

As the question is very low level:

• we can't assume libraries are available (and the call itself has a cost), only plain C
• to avoid emptying the instruction pipeline (which has a very high cost) we should probably minimize branches, jumps, and every other kind of control-flow break (like those hidden behind the sequence points in && or ||)
• room is constrained and minimizing register and memory use is an issue; an in-place sort is probably ideal

Really this question is a kind of golf where the goal is not to minimize source length but execution time. I call it 'Zening' code, after the title of the book Zen of Code Optimization by Michael Abrash and its sequels.

As for why it is interesting, there are several layers:

• the example is simple, easy to understand, and easy to measure; not much C skill is involved
• it shows the effects of choosing a good algorithm for the problem, but also the effects of the compiler and the underlying hardware

Here is my reference (naive, not optimized) implementation and my test set.

```c
#include <stdio.h>

static __inline__ void sort6(int *d)
{
    char j, i, imin;
    int tmp;
    for (j = 0; j < 5; j++) {
        imin = j;
        for (i = j + 1; i < 6; i++) {
            if (d[i] < d[imin])
                imin = i;
        }
        tmp = d[j];
        d[j] = d[imin];
        d[imin] = tmp;
    }
}

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}

int main(int argc, char **argv)
{
    int i;
    int d[6][6] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6; i++) {
        sort6(d[i]);
        /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
         *     d[i][0], d[i][1], d[i][2],
         *     d[i][3], d[i][4], d[i][5]);
         */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %u\n", (unsigned)cycles);
    return 0;
}
```

## Raw results

As the number of variants became large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those shown above, thanks to Kevin Stock. You can compile and execute it in your own environment; I'm quite interested in the behavior on different target architectures/compilers. (OK guys, put it in answers, I will +1 every contributor of a new result set.)

I gave the answer to Daniel Stutzbach (for golfing) one year ago, as he was the source of the fastest solution at that time (sorting networks).

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O2

• Direct call to qsort library function : 689.38
• Naive implementation (insertion sort) : 285.70
• Insertion Sort (Daniel Stutzbach) : 142.12
• Insertion Sort Unrolled : 125.47
• Rank Order : 102.26
• Rank Order with registers : 58.03
• Sorting Networks (Daniel Stutzbach) : 111.68
• Sorting Networks (Paul R) : 66.36
• Sorting Networks 12 with Fast Swap : 58.86
• Sorting Networks 12 reordered Swap : 53.74
• Sorting Networks 12 reordered Simple Swap : 31.54
• Reordered Sorting Network w/ fast swap : 31.54
• Reordered Sorting Network w/ fast swap V2 : 33.63
• Inlined Bubble Sort (Paolo Bonzini) : 48.85
• Unrolled Insertion Sort (Paolo Bonzini) : 75.30

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O1

• Direct call to qsort library function : 705.93
• Naive implementation (insertion sort) : 135.60
• Insertion Sort (Daniel Stutzbach) : 142.11
• Insertion Sort Unrolled : 126.75
• Rank Order : 46.42
• Rank Order with registers : 43.58
• Sorting Networks (Daniel Stutzbach) : 115.57
• Sorting Networks (Paul R) : 64.44
• Sorting Networks 12 with Fast Swap : 61.98
• Sorting Networks 12 reordered Swap : 54.67
• Sorting Networks 12 reordered Simple Swap : 31.54
• Reordered Sorting Network w/ fast swap : 31.24
• Reordered Sorting Network w/ fast swap V2 : 33.07
• Inlined Bubble Sort (Paolo Bonzini) : 45.79
• Unrolled Insertion Sort (Paolo Bonzini) : 80.15

I included both -O1 and -O2 results because, surprisingly, for several programs -O2 is less efficient than -O1. I wonder which specific optimization has this effect?

Insertion Sort (Daniel Stutzbach)

As expected, minimizing branches is indeed a good idea.

Sorting Networks (Daniel Stutzbach)

Better than insertion sort. I wondered if the main effect was not simply gained by avoiding the outer loop. I checked by unrolling the insertion sort, and indeed we get roughly the same figures (code is here).
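For reference, such an unrolled insertion sort might look roughly like this (a sketch; the actual linked code may differ). Only the outer loop is removed, with each insertion step expanded through a macro:

```c
/* Sketch of an unrolled insertion sort for exactly 6 ints (an assumption:
 * the code linked above may differ). Each INSERT(i) inserts d[i] into the
 * already-sorted prefix d[0..i-1]. */
static void sort6_insertion_unrolled(int *d)
{
#define INSERT(i) do {                  \
        int t = d[i], j = i;            \
        while (j > 0 && t < d[j-1]) {   \
            d[j] = d[j-1];              \
            j--;                        \
        }                               \
        d[j] = t;                       \
    } while (0)
    INSERT(1); INSERT(2); INSERT(3); INSERT(4); INSERT(5);
#undef INSERT
}
```

The inner while loops still branch, which is consistent with it performing close to the plain insertion sort rather than the sorting networks.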

Sorting Networks (Paul R)

The best so far. The actual code I used to test is here. I don't know yet why it is nearly twice as fast as the other sorting network implementation. Parameter passing? Fast max?

Sorting Networks 12 with Fast Swap

As suggested by Daniel Stutzbach, I combined his 12-swap sorting network with a branchless fast swap (code is here). It is indeed faster, the best so far by a small margin (roughly 5%), as could be expected from using one less swap.

It is also interesting to notice that on the PPC architecture the branchless swap seems to be much (four times) less efficient than the simple one using if.
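For illustration, one common way to write a branchless compare-and-swap is with an arithmetic mask (a sketch, not necessarily the exact fast swap used in the tests; note that the subtraction can overflow for ints of opposite extreme signs):

```c
/* Branchless compare-and-swap sketch: no if, no jump. The comparison
 * yields 0 or 1; negating it gives an all-zeros or all-ones mask.
 * Caveat: (*a - *b) may overflow for extreme values of opposite sign. */
static void swap_if_greater(int *a, int *b)
{
    int mask  = -(*a > *b);        /* 0x00000000 or 0xFFFFFFFF */
    int delta = (*a - *b) & mask;  /* a - b if out of order, else 0 */
    *a -= delta;
    *b += delta;
}
```

In the sorting network, each `SWAP(x, y)` then becomes `swap_if_greater(&d[x], &d[y])`, trading a conditional branch for a few always-executed arithmetic instructions.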

Calling Library qsort

To give another reference point, I also tried, as suggested, just calling the library qsort (code is here). As expected it is much slower: 10 to 30 times slower. As became obvious with the new test suite, the main problem seems to be the initial load of the library at the first call; otherwise it compares not so poorly with the other versions, just 3 to 20 times slower on my Linux box. On some architectures used for tests by others it even seems to be faster (that one really surprises me, as library qsort uses a more complex API).

Rank order

Rex Kerr proposed another, completely different method: for each item of the array, directly compute its final position. This is efficient because computing the rank order needs no branches. The drawback of this method is that it takes three times the memory of the array (one copy of the array plus variables to store the rank orders). The performance results are very surprising (and interesting). On my reference architecture with a 32-bit OS and an Intel Core2 Quad E8300, the cycle count was slightly below 1000 (like the sorting networks with branching swap). But when compiled and executed on my 64-bit box (Intel Core2 Duo) it performed much better: it became the fastest so far. I finally found out the true reason: my 32-bit box uses gcc 4.4.1 and my 64-bit box gcc 4.4.3, and the latter seems much better at optimizing this particular code (there was very little difference for the other proposals).
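A sketch of the rank-order idea (my reconstruction, not Rex Kerr's actual code) could look like this; bitwise | and & replace || and && to avoid the branches hidden behind their sequence points:

```c
/* Rank-order sort sketch for 6 ints: compute each element's final
 * position directly by counting how many elements must precede it.
 * Ties are broken by original index so equal values get distinct ranks.
 * Needs a second array (the extra memory cost of this method). */
static void sort6_rank_order(int *d)
{
    int e[6];
    int i, j;
    for (i = 0; i < 6; i++) {
        int rank = 0;
        for (j = 0; j < 6; j++)
            /* bitwise | and & evaluate both sides: no short-circuit branch */
            rank += (d[j] < d[i]) | ((d[j] == d[i]) & (j < i));
        e[rank] = d[i];
    }
    for (i = 0; i < 6; i++)
        d[i] = e[i];
}
```

Every element's rank is computed independently of the others, which is exactly the property that gives the superscalar scheduler so much instruction-level parallelism to exploit.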

Update:

As the figures published above show, this effect was further amplified by later versions of gcc, and Rank Order became consistently twice as fast as any other alternative.

Sorting Networks 12 with reordered Swap

The amazing efficiency of Rex Kerr's proposal with gcc 4.4.3 made me wonder: how could a program with three times the memory usage be faster than the branchless sorting networks? My hypothesis was that it had fewer read-after-write dependencies, allowing better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder the swaps to minimize read-after-write dependencies. More simply put: when you do `SWAP(1, 2); SWAP(0, 2);` you have to wait for the first swap to finish before performing the second one, because both access a common memory cell. When you do `SWAP(1, 2); SWAP(4, 5);` the processor can execute both in parallel. I tried it and it works as expected: the sorting network runs about 10% faster.

Sorting Networks 12 with Simple Swap

One year after the original post, Steinar H. Gunderson suggested that we should not try to outsmart the compiler and should keep the swap code simple. It's indeed a good idea, as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly that can spare a few more cycles. The most surprising part (it speaks volumes about programmer psychology) is that one year ago none of us had tried that version of the swap. The code I used to test is here. Others suggested other ways to write a fast C swap, but they yield the same performance as the simple one with a decent compiler.

The "best" code is now as follow:

```c
static inline void sort6_sorting_network_simple_swap(int *d)
{
#define min(x, y) ((x) < (y) ? (x) : (y))
#define max(x, y) ((x) < (y) ? (y) : (x))
#define SWAP(x, y) do {                  \
        const int a = min(d[x], d[y]);   \
        const int b = max(d[x], d[y]);   \
        d[x] = a;                        \
        d[y] = b;                        \
    } while (0)
    SWAP(1, 2);
    SWAP(4, 5);
    SWAP(0, 2);
    SWAP(3, 5);
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(1, 4);
    SWAP(0, 3);
    SWAP(2, 5);
    SWAP(1, 3);
    SWAP(2, 4);
    SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}
```

If we believe our test set (and yes, it is quite poor; its sole merit is being short, simple, and making it easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That puts each swap at an average of 4 cycles. I call that amazingly fast. Any other improvements possible?

For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.

Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted data, so it will be the better choice if there's an above-average chance of almost-sorted data.

The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.

Here's an insertion sort implementation:

```c
static __inline__ void sort6(int *d)
{
    int i, j;
    for (i = 1; i < 6; i++) {
        int tmp = d[i];
        for (j = i; j >= 1 && tmp < d[j-1]; j--)
            d[j] = d[j-1];
        d[j] = tmp;
    }
}
```

Here's how I'd build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:

```c
static __inline__ void sort6(int *d)
{
#define SWAP(x, y) do { if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; } } while (0)
    SWAP(1, 2);
    SWAP(0, 2);
    SWAP(0, 1);
    SWAP(4, 5);
    SWAP(3, 5);
    SWAP(3, 4);
    SWAP(0, 3);
    SWAP(1, 4);
    SWAP(2, 5);
    SWAP(2, 4);
    SWAP(1, 3);
    SWAP(2, 3);
#undef SWAP
}
```

I'm looking for a fast and memory efficient approach for implementing Conway's Game of Life.

Constraints: a 96x128 board, approximately 2kB RAM available and 52MHz processor (see the tech specs here: http://www.getinpulse.com/features).

My current naive solution that represents each cell as a single bit in a matrix (96*128/8=1,536 bytes) works but is too slow. What tricks can be used to improve performance?

Storing the coordinates of live cells (for example in this implementation http://dotat.at/prog/life/life.html) would use too much memory.

Look at the chapter on the Game of Life in "The Zen of Code Optimization" by Michael Abrash. There was an implementation that coded the current and previous states of 3 cells into a single word, and used tricks with lookup tables and carry bits to generate the next states. It was blazingly fast.
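In the same spirit (a sketch of the general bit-parallel technique, not Abrash's actual code), each 32-bit word can hold 32 cells, and full-adder logic on whole words counts all their neighbours at once; neighbours that cross word boundaries are ignored here for brevity:

```c
#include <stdint.h>

typedef uint32_t row_t;  /* 32 cells per word; a 128-wide row would use 4 words */

/* Compute the next generation of one 32-cell word, given the word above,
 * the word itself, and the word below. Horizontal neighbours that cross
 * the word boundary are treated as dead to keep the sketch short. */
static row_t life_word(row_t up, row_t cur, row_t down)
{
    /* The eight neighbour bit-planes of every cell in 'cur'. */
    row_t n[8] = {
        up << 1,   up,   up >> 1,
        cur << 1,        cur >> 1,
        down << 1, down, down >> 1
    };

    /* Bit-sliced neighbour counter: c0/c1 hold bits 0/1 of each cell's
     * count; c2 saturates for any count >= 4 (all of which mean death). */
    row_t c0 = 0, c1 = 0, c2 = 0;
    for (int i = 0; i < 8; i++) {
        row_t carry0 = c0 & n[i];  /* full adder, one plane at a time */
        c0 ^= n[i];
        row_t carry1 = c1 & carry0;
        c1 ^= carry0;
        c2 |= carry1;
    }

    /* Rule: alive next iff count == 3 (c2=0,c1=1,c0=1),
     * or currently alive and count == 2 (c2=0,c1=1,c0=0). */
    return ~c2 & c1 & (c0 | cur);
}
```

One call updates 32 cells with a few dozen word operations and no per-cell branching, which is why this family of tricks fits a 52 MHz CPU with 2 kB of RAM: the board stays in its 1,536-byte bit-matrix form and only a couple of row buffers are needed on top.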

Does assembly language have a steep learning curve for someone who doesn't have a CS background (my major is applied math)?

Usually, how long does it take somebody who is familiar with a higher-level programming language like C/C++ to write asm programs that can at least do some elementary math computations, like linear regression or 1D orthogonal polynomial computation?

Btw, does modern asm support some features like:

• Math ops with SIMD instructions

And usually is there much performance-gain by doing this in asm instead of C/C++ with some latest optimized compiler?

> And usually is there much performance-gain by doing this in asm instead of C/C++ with some latest optimized compiler?

No.

It's more difficult and more time-consuming to write asm: because each asm instruction is simpler than a HLL statement, you need more asm instructions to implement the same functionality. Writing in assembly has additional complications, e.g. that there's only a small number of registers.

Additionally, there are many ways to write assembly that achieve the same thing: for example, `mov eax,0` and `xor eax,eax` both write zero into the `eax` register, but one is faster than the other. Furthermore, the sequence of asm instructions makes a difference to how fast the code runs: a non-obvious or longer sequence of instructions may be faster if it keeps several CPU pipelines busy.

Writing optimal assembly is compute-intensive: working out how long various possibilities might take, based on cycle-counting, accounting for pipelining, etc. Such computation is done by an optimizing compiler, which can probably do it better (faster and more accurately) than you can.

http://www.amazon.com/Zen-Code-Optimization-Ultimate-Software/dp/1883577039 is a very interesting read, but by the end of the book, where it's starting to talk about Pentium-class CPUs, you can see that writing optimal assembly is getting complicated.

If you're not happy with your compiler's output, you might consider getting a better compiler: for example, Intel's compiler produces (or at least used to produce) more optimized code than many other compilers, and it included better support for specific CPUs, unusual opcodes, and perhaps better run-time libraries.