Inner Loops

Rick Booth


With practical advice and a broad sampling of important algorithms, Inner Loops shows how to design programs that extend the edge of the envelope of PC performance. It provides a thorough review of 32-bit code optimization for the 486, Pentium and Pentium Pro, as well as design tips for writing fast 32-bit software.


Mentioned in questions and answers.

I'm writing a JIT compiler with an x86 backend and learning x86 assembler and machine code as I go. I used ARM assembler about 20 years ago and am surprised by the difference in cost models between these architectures.

Specifically, memory accesses and branches are expensive on ARM but the equivalent stack operations and jumps are cheap on x86. I believe modern x86 CPUs do far more dynamic optimizations than ARM cores do and I find it difficult to anticipate their effects.

What is a good cost model to bear in mind when writing x86 assembler? Which combinations of instructions are cheap and which are expensive?

For example, my compiler would be simpler if it always generated the long form for loading integers or for jump offsets, even when the integers were small or the offsets close, but would this impact performance?
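For what it's worth, the "always use the long form" idea can be made concrete with a sketch (the `emit_jmp` helper and its signature are mine, not from any real JIT): x86 has a 2-byte short jump `EB rel8` and a 5-byte long jump `E9 rel32`, and the long form can encode every offset the short form can, so always emitting it is correct. The cost is code size and instruction-cache footprint rather than a slower jump.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical JIT helper: emit a relative jump into buf and return the
 * number of bytes written. Short form: EB rel8 (2 bytes); long form:
 * E9 rel32 (5 bytes). Both take the offset from the end of the
 * instruction; the long form is always valid, just larger. */
static size_t emit_jmp(uint8_t *buf, int32_t rel, int force_long) {
    if (!force_long && rel >= -128 && rel <= 127) {
        buf[0] = 0xEB;            /* jmp rel8 */
        buf[1] = (uint8_t)rel;
        return 2;
    }
    buf[0] = 0xE9;                /* jmp rel32 */
    memcpy(buf + 1, &rel, 4);     /* x86 stores the offset little-endian */
    return 5;
}
```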

I haven't done any floating point yet but I'd like to get on to it soon. Is there anything not obvious about the interaction between normal and float code?

I know there are lots of references (e.g. Michael Abrash) on x86 optimization, but I have a hunch that anything more than a few years old will not apply to modern x86 CPUs because they have changed so much lately. Am I correct?

For what it's worth, there used to be an amazing book called "Inner Loops" by Rick Booth that described in great detail how to manually micro-optimize x86 assembly code for Intel's 80486, Pentium, Pentium Pro, and Pentium MMX processors, with lots of useful real-world code examples (hashing, moving memory, random number generation, Huffman and JPEG compression, matrix multiplication).

Unfortunately, the book has never been updated for newer processors and CPU architectures since its first publication in 1997. Nevertheless, I would still recommend it as a gentle introduction to topics such as:

  • which instructions are very cheap, which are merely cheap, and which aren't
  • which registers are the most versatile (i.e. have no special meaning / aren't the default register of some instructions)
  • how to pair instructions so that they are executed in parallel without stalling one pipeline
  • different kinds of stalls
  • branch prediction
  • what to keep in mind with regard to processor caches

I would like to learn more about low level code optimization, and how to take advantage of the underlying machine architecture. I am looking for good pointers on where to read about this topic.

More details:

I am interested in optimization in the context of scientific computing (which is a lot of number crunching but not only) in low level languages such as C/C++. I am in particular interested in optimization methods that are not obvious unless one has a good understanding of how the machine works (which I don't---yet).

For example, it's clear that a better algorithm is faster, without knowing anything about the machine it's run on. It's not at all obvious that it matters if one loops through the columns or the rows of a matrix first. (It's better to loop through the matrix so that elements that are stored at adjacent locations are read successively.)
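To make the matrix-traversal point concrete, here is a minimal C sketch (the function names are mine) for a row-major C array. Both functions compute the same sum; only the loop order differs:

```c
#include <stddef.h>

#define N 1024  /* at this size the matrix far exceeds typical L1/L2 caches */

/* Row-major traversal: consecutive iterations read adjacent addresses,
 * so every byte of each fetched cache line is used before moving on. */
double sum_rows_first(double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Same arithmetic, loops swapped: successive reads are N * sizeof(double)
 * bytes apart, so almost every access touches a new cache line. The result
 * is identical, but on typical hardware this version runs several times
 * slower once the matrix no longer fits in cache. */
double sum_cols_first(double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Timing the two on a matrix larger than the cache is a quick way to see the effect for yourself.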

Basic advice on the topic or pointers to articles are most welcome.


Got answers with lots of great pointers, a lot more than I'll ever have time to read. Here's a list of all of them:

I'll need a bit of skim time to decide which one to use (not having time for all).

An interesting book about bit manipulation and smart ways of doing low-level things is Hacker's Delight.

This is definitely worth a read for everyone interested in low-level coding.
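To give a taste of what the book covers, here is a small sampler of standard bit-twiddling identities of the kind Hacker's Delight derives (a sketch; the function names are mine):

```c
#include <stdint.h>

/* Isolate the lowest set bit: e.g. 0b01101000 -> 0b00001000. */
static uint32_t lowest_set_bit(uint32_t x) { return x & (~x + 1u); }

/* Power-of-two test: x & (x - 1) clears the lowest set bit, so the
 * result is zero exactly when x had a single bit set. */
static int is_pow2(uint32_t x) { return x != 0 && (x & (x - 1)) == 0; }

/* Branch-free population count: sum bits in parallel in ever-wider fields. */
static uint32_t popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums  */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums  */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums  */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}
```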

Drepper's What Every Programmer Should Know About Memory [pdf] is a good reference to one aspect of low-level optimisation.

I learned a lot from the book Inner Loops. It's ancient now, in computer terms, but it's very well written and Rick Booth is so enthusiastic about his subject I would still say it's worth looking at to see the kind of mindset you need to make a CPU fly.

It's been a few years since I read it, but Write Great Code, Volume 2: Thinking Low-Level, Writing High-Level by Randall Hyde was quite good. It gives good examples of how C/C++ code translates into assembly, e.g. what really happens when you have a big switch statement.
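As an illustration of the switch-statement point (a sketch, not an excerpt from the book): a dense switch typically compiles to one bounds check plus an indirect jump through a table of code addresses. This function-pointer version does by hand roughly what the compiler generates:

```c
/* A dense switch such as
 *     switch (op) { case 0: return a+b; case 1: return a-b;
 *                   case 2: return a*b; case 3: return -a; }
 * usually becomes a range check plus one indirect jump. */
static int op_add(int a, int b) { return a + b; }
static int op_sub(int a, int b) { return a - b; }
static int op_mul(int a, int b) { return a * b; }
static int op_neg(int a, int b) { (void)b; return -a; }

static int (*const jump_table[4])(int, int) = { op_add, op_sub, op_mul, op_neg };

static int dispatch(unsigned op, int a, int b) {
    if (op >= 4) return 0;           /* the range check the compiler also emits */
    return jump_table[op](a, b);     /* the indirect jump through the table */
}
```

A sparse switch (cases 1, 100, 10000) can't use this layout, which is one reason the generated code for two superficially similar switches can look completely different.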

Also, is focused on game development, but the programming articles might give you some ideas.

How can programming in assembly help in achieving optimization

There used to be a very good book about this subject, called Inner Loops by Rick Booth. I still think it's a very valuable book, in that it can raise your awareness of what to look out for when optimizing assembler code (categorization of instructions as very fast, fast, slow; when two instructions can execute in parallel; memory alignment, cache misses, stalls, penalties, etc.) However, the book only covered Intel processors up to the Pentium Pro / Pentium MMX, and with the newer hardware architectures that are available today, the book is now fairly out-of-date.

This is exactly the problem of optimizing assembly language: You need to know very well the architecture which you're targeting; contrast this with an optimizing compiler (e.g. for the C language) that can target different platforms and will apply optimizations accordingly. Much knowledge and work has gone into the optimization stage of compilers, and you will have to study a particular architecture quite a bit before you can beat a good compiler.

I'm developing a non-interactive, CPU-bound application that does only computations and almost no I/O. Currently it runs too long, and while I'm working on improving the algorithm, I'm also wondering whether switching languages or platforms could bring any benefit. At the moment it is C++ (no OOP, so it is almost C) on Windows, compiled with the Intel C++ compiler. Can switching to ASM help, and by how much? Can switching to Linux and GCC help?

I suggest you rethink your algorithm, or maybe even better, your approach. On the other hand, maybe what you are trying to calculate just takes a lot of computing time. Have you considered making it distributed so it can run on a cluster of some sort? If you want to focus on pure code optimization by introducing assembler for your inner loops, that can often be very beneficial (if you know what you're doing).

I've heard that it should be possible to do a lossless rotation of a JPEG image. That means you do the rotation in the frequency domain without an IDCT. I've tried to Google it but haven't found anything. Could someone shed some light on this?

What I mean by lossless is that I don't lose any additional information in the rotation. And of course that's probably only possible when rotating multiples of 90 degrees.

Disclaimer: :-)

Admittedly I know the JPEG compression algorithm only at a very superficial level. What I know comes from Rick Booth's slightly dated, but excellent book Inner Loops, chapter 17: JPEG.

I do not have a complete answer to your question, rather I have a vague idea of what the solution might be. Maybe that will already be helpful to you. To be honest, I would actually be somewhat surprised to see that I have it correct.

Lossless rotation of a JPEG image seems possible only if you don't have to decode it first using an IDCT and then re-encode it again once you've rotated the image, since those are two computational steps where information loss may happen.

This seems feasible at all because an image encoded as JPEG is already in the frequency domain, as a DCT (Discrete Cosine Transform) has already been performed on it. Let me cite one short passage from the above book (p. 325):

Usually referred to as the DCT […]. Conceptually, what happens is that the 8 × 8 piece of the image gets multiplied by two other 8 × 8 matrices to produce a derivative 8 × 8 matrix. […]

Ordinarily, two 8 × 8 matrix multiplications would require 1,024 (64 × 8 × 2) multiplication steps. Part of the magic of the DCT is that the very special matrices chosen for this step have a lot of internal symmetries, so there is a way to execute the math with only 80 multiplication steps. It is this symmetry that saves the day for JPEG and keeps the algorithm fairly fast. — (emphasis added by me.)
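The arithmetic in that passage is easy to check with a naive 8 × 8 multiply: each product fills 64 output entries at 8 multiplications apiece, and the DCT does two such products. A sketch with an invented multiplication counter (none of this is from the book's code):

```c
enum { DIM = 8 };

static long mults;  /* global counter of scalar multiplications */

/* c = a * b for 8x8 matrices, the naive way: 64 entries x 8 multiplies
 * each = 512 per product, so the two products of the DCT cost
 * 64 x 8 x 2 = 1,024 total, versus roughly 80 for the symmetry-exploiting
 * fast version the book describes. */
static void matmul8(double c[DIM][DIM], double a[DIM][DIM], double b[DIM][DIM]) {
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            double s = 0.0;
            for (int k = 0; k < DIM; k++) {
                s += a[i][k] * b[k][j];
                mults++;
            }
            c[i][j] = s;
        }
}
```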

I could imagine that the symmetries in the DCT transformation matrices make it possible to later rotate the transformed 8 × 8 matrices at some very specific angles without perceptually altering the image (apart from the fact that it's rotated, of course). At a conceptual level, what I mean is this: Let's say you had transformed the original 8 × 8 pixel blocks using matrices such as the following:

                * . . . . . . *
                . * . . . . * .
                . . * . . * . .
                . . . * * . . .
                . . . * * . . .
                . . * . . * . .
                . * . . . . * .
                * . . . . . . *

(I'm using symbols instead of actual numeric values since I only want to show the symmetry of this matrix.)

Such a transformation matrix might allow you to rotate the transformed matrices by a multiple of 90 degrees in any direction, since the transformation matrix itself would look identical if rotated by the same angles.
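That invariance claim can be checked mechanically (a sketch; `rot90` and the character encoding of the pattern are mine): rotating the doubly symmetric pattern by 90 degrees maps each diagonal onto the other, leaving the pattern unchanged.

```c
#include <string.h>

enum { B = 8 };

/* Rotate an 8x8 block 90 degrees clockwise: entry (i, j) moves to (j, B-1-i). */
static void rot90(char out[B][B], const char in[B][B]) {
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++)
            out[j][B - 1 - i] = in[i][j];
}

/* The doubly symmetric pattern from above: stars on both diagonals.
 * (In C, a string literal of exactly 8 characters may initialize a
 * char[8]; the terminating NUL is dropped.) */
static const char pattern[B][B] = {
    "*......*",
    ".*....*.",
    "..*..*..",
    "...**...",
    "...**...",
    "..*..*..",
    ".*....*.",
    "*......*",
};
```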

If indeed this is what you read about, it would mean that lossless rotation won't work for arbitrary rotation angles. The angles which guarantee no loss would depend on the matrices used during the JPEG encoding.