Computer Systems

Randal E. Bryant, David R. O'Hallaron


For Computer Systems, Computer Organization and Architecture courses in CS, EE, and ECE departments. Few students studying computer science or computer engineering will ever have the opportunity to build a computer system. On the other hand, most students will be required to use and program computers on a near daily basis. Computer Systems: A Programmer's Perspective introduces the important and enduring concepts that underlie computer systems by showing how these ideas affect the correctness, performance, and utility of application programs. The text's hands-on approach (including a comprehensive set of labs) helps students understand the "under-the-hood" operation of a modern computer system and prepares them for future courses in systems topics such as compilers, computer architecture, operating systems, and networking. Visit the CS:APP web page for more information and resources.


Mentioned in questions and answers.

Here is a piece of C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
            if (data[c] >= 128)
                sum += data[c];

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}

  • Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
  • With the sorted data, the code runs in 1.93 seconds.

Initially, I thought this might be just a language or compiler anomaly. So I tried it in Java.

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
            // Primary loop
            for (int c = 0; c < arraySize; ++c)
                if (data[c] >= 128)
                    sum += data[c];

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

With a somewhat similar but less extreme result.

My first thought was that sorting brings the data into the cache, but then I thought how silly that is because the array was just generated.

  • What is going on?
  • Why is a sorted array faster to process than an unsorted array?
  • The code is summing up some independent terms, and the order should not matter.

The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answer.

Now, if we look at the code

if (data[c] >= 128)
    sum += data[c];

we can find that the meaning of this particular if... else... branch is to add something when a condition is satisfied. This type of branch can be easily transformed into a conditional move statement, which would be compiled into a conditional move instruction: cmovl, in an x86 system. The branch and thus the potential branch prediction penalty is removed.

In C, and thus C++, the statement that compiles directly (without any optimization) into the conditional move instruction on x86 is the ternary operator ... ? ... : .... So we rewrite the above statement into an equivalent one:

sum += data[c] >= 128 ? data[c] : 0;

While maintaining readability, we can check the speedup factor.

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):


//  Branch - Random
seconds = 8.885

//  Branch - Sorted
seconds = 1.528

//  Branchless - Random
seconds = 3.716

//  Branchless - Sorted
seconds = 3.71


//  Branch - Random
seconds = 11.302

//  Branch - Sorted
seconds = 1.830

//  Branchless - Random
seconds = 2.736

//  Branchless - Sorted
seconds = 2.737

The result is robust in multiple tests. We get a great speedup when the branch result is unpredictable, but we suffer a little bit when it is predictable. In fact, when using a conditional move, the performance is the same regardless of the data pattern.

Now let's look more closely by investigating the x86 assembly they generate. For simplicity, we use two functions max1 and max2.

max1 uses the conditional branch if... else ...:

int max1(int a, int b) {
    if (a > b)
        return a;
    else
        return b;
}

max2 uses the ternary operator ... ? ... : ...:

int max2(int a, int b) {
    return a > b ? a : b;
}

On an x86-64 machine, GCC -S generates the assembly below (max1 first, then max2).

max1:
    movl    %edi, -4(%rbp)
    movl    %esi, -8(%rbp)
    movl    -4(%rbp), %eax
    cmpl    -8(%rbp), %eax
    jle     .L2
    movl    -4(%rbp), %eax
    movl    %eax, -12(%rbp)
    jmp     .L4
.L2:
    movl    -8(%rbp), %eax
    movl    %eax, -12(%rbp)
.L4:
    movl    -12(%rbp), %eax

max2:
    movl    %edi, -4(%rbp)
    movl    %esi, -8(%rbp)
    movl    -4(%rbp), %eax
    cmpl    %eax, -8(%rbp)
    cmovge  -8(%rbp), %eax

max2 uses much less code thanks to the cmovge instruction. The real gain, though, is that max2 involves no branch jumps (jmp), which carry a significant performance penalty when the prediction is wrong.

So why does a conditional move perform better?

In a typical x86 processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining.

With a branch, the next instruction to execute depends on the outcome of the preceding one, so we cannot keep the pipeline full. We have to either wait or predict.

In a conditional move case, the execution of the conditional move instruction is also divided into several stages, but the earlier stages like Fetch and Decode do not depend on the result of the previous instruction; only the later stages need the result. Thus, we wait only a fraction of one instruction's execution time. This is also why the conditional move version is slower than the branch when prediction is easy.

The book Computer Systems: A Programmer's Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for a special treatment for Branch Prediction and Misprediction Penalties.

Sometimes modern compilers can optimize our code into assembly with better performance, and sometimes they can't (the code in question used Visual Studio's native compiler). Knowing the performance difference between a branch and a conditional move when the outcome is unpredictable can help us write better-performing code when the scenario gets so complex that the compiler cannot optimize it automatically.

I've tried to find an answer to this using SO. There are a number of questions that list the various pros and cons of building a header-only library in C++, but I haven't been able to find one that does so in quantifiable terms.

So, in quantifiable terms, what's different between using traditionally separated C++ header and implementation files versus header-only?

For simplicity, I'm assuming that templates are not used (because they effectively require header-only).

To elaborate, I've listed what I have seen from the articles to be the pros and cons. Obviously, some are not easily quantifiable (such as ease of use), and are therefore useless for a quantifiable comparison. I'll mark those for which I expect quantifiable metrics with (quantifiable).

Pros for header-only

  1. It's easier to include, since you don't need to specify linker options in your build system.
  2. You always compile all the library code with the same compiler (options) as the rest of your code, since the library's functions get inlined in your code.
  3. It may be a lot faster. (quantifiable)
  4. May give compiler/linker better opportunities for optimization (explanation/quantifiable, if possible)
  5. Is required if you use templates anyways.

Cons for header-only

  1. It bloats the code. (quantifiable) (how does that affect both execution time and the memory footprint)
  2. Longer compile times. (quantifiable)
  3. Loss of separation of interface and implementation.
  4. Sometimes leads to hard-to-resolve circular dependencies.
  5. Prevents binary compatibility of shared libraries/DLLs.
  6. It may aggravate co-workers who prefer the traditional ways of using C++.

Any examples that you can use from larger, open source projects (comparing similarly-sized codebases) would be very much appreciated. Or, if you know of a project that can switch between header-only and separated versions (using a third file that includes both), that would be ideal. Anecdotal numbers are useful too because they give me a ballpark with which I can gain some insight.

sources for pros and cons:

Thanks in advance...


For anyone that may be reading this later and is interested in getting a bit of background information on linking and compiling, I found these resources useful:

UPDATE: (in response to the comments below)

Just because answers may vary doesn't mean that measurement is useless. You have to start measuring at some point. And the more measurements you have, the clearer the picture is. What I'm asking for in this question is not the whole story, but a glimpse of the picture. Sure, anyone can use numbers to skew an argument if they want to unethically promote their bias. However, if someone is curious about the differences between two options and publishes those results, I think that information is useful.

Has no one been curious about this topic, enough to measure it?

I love the shootout project. We could start by removing most of those variables. Only use one version of gcc on one version of linux. Only use the same hardware for all benchmarks. Do not compile with multiple threads.

Then, we can measure:

  • executable size
  • runtime
  • memory footprint
  • compile time (for both entire project and by changing one file)
  • link time

Summary (notable points):

  • Two packages benchmarked (one with 78 compilation units, one with 301 compilation units)
  • Traditional Compiling (Multi Unit Compilation) resulted in a 7% faster application (in the 78 unit package); no change in application runtime in the 301 unit package.
  • Both Traditional Compiling and Header-only benchmarks used the same amount of memory when running (in both packages).
  • Header-only Compiling (Single Unit Compilation) resulted in an executable size that was 10% smaller in the 301 unit package (only 1% smaller in the 78 unit package).
  • Traditional Compiling used about a third of the memory to build over both packages.
  • Traditional Compiling took three times as long to compile (on the first compilation) but only 4% of the time on recompiles (as header-only has to recompile all the sources).
  • Traditional Compiling took longer to link on both the first compilation and subsequent compilations.

Box2D benchmark, data:


Botan benchmark, data:


Box2D SUMMARY (78 Units)

Botan SUMMARY (301 Units)

Box2D executable size:

Box2D compile/link/build/run time:

Box2D compile/link/build/run max memory usage:

Botan executable size:

Botan compile/link/build/run time:

Botan compile/link/build/run max memory usage:

Benchmark Details


The projects tested, Box2D and Botan, were chosen because they are potentially computationally expensive, contain a good number of units, and actually had few or no errors compiling as a single unit. Many other projects were attempted but consumed too much time to "fix" into compiling as one unit. The memory footprint is measured by polling the memory footprint at regular intervals and taking the maximum, and thus might not be fully accurate.

Also, this benchmark does not do automatic header dependency generation (to detect header changes). In a project using a different build system, this may add time to all benchmarks.

There are 3 compilers in the benchmark, each with 5 configurations.


  • gcc
  • icc
  • clang

Compiler configurations:

  • Default - default compiler options
  • Optimized native - -O3 -march=native
  • Size optimized - -Os
  • LTO/IPO native - -O3 -flto -march=native with clang and gcc, -O3 -ipo -march=native with icpc/icc
  • Zero optimization - -O0

I think these each can have different bearings on the comparisons between single-unit and multi-unit builds. I included LTO/IPO so we might see how the "proper" way to achieve single-unit-effectiveness compares.

Explanation of csv fields:

  • Test Name - name of the benchmark. Examples: Botan, Box2D.
  • Test Configuration - name a particular configuration of this test (special cxx flags etc.). Usually the same as Test Name.
  • Compiler - name of the compiler used. Examples: gcc, icc, clang.
  • Compiler Configuration - name of a configuration of compiler options used. Example: gcc opt native
  • Compiler Version String - first line of output of compiler version from the compiler itself. Example: g++ --version produces g++ (GCC) 4.6.1 on my system.
  • Header only - a value of True if this test case was built as a single unit, False if it was built as a multi-unit project.
  • Units - number of units in the test case, even if it is built as a single unit.
  • Compile Time,Link Time,Build Time,Run Time - as it sounds.
  • Re-compile Time AVG,Re-compile Time MAX,Re-link Time AVG,Re-link Time MAX,Re-build Time AVG,Re-build Time MAX - the times across rebuilding the project after touching a single file. Each unit is touched, and for each, the project is rebuilt. The maximum times, and average times are recorded in these fields.
  • Compile Memory,Link Memory,Build Memory,Run Memory,Executable Size - as they sound.

To reproduce the benchmarks:

  • The bullwork is
  • Requires psutil (for memory footprint measurements).
  • Requires GNUMake.
  • As it is, requires gcc, clang, icc/icpc in the path. Can be modified to remove any of these of course.
  • Each benchmark should have a data-file that lists the units of that benchmark. The script will then create two test cases, one with each unit compiled separately, and one with all units compiled together. The file format is defined as a JSON string, containing a dictionary with the following keys
    • "units" - a list of c/cpp/cc files that make up the units of this project
    • "executable" - A name of the executable to be compiled.
    • "link_libs" - A space separated list of installed libraries to link to.
    • "include_directores" - A list of directories to include in the project.
    • "command" - optional. special command to execute to run the benchmark. For example, "command": "botan_test --benchmark"
  • This cannot be done easily with every C++ project; there must be no conflicts/ambiguities in the single unit.
  • To add a project to the test cases, modify the list test_base_cases in with the information for the project, including the data file name.
  • If everything runs well, the output file data.csv should contain the benchmark results.

To produce the bar charts:

  • You should start with a data.csv file produced by the benchmark.
  • Requires matplotlib.
  • Adjust the fields list to decide which graphs to produce.
  • Run python data.csv.
  • A file, test.png should now contain the result.


  • Box2D was used from svn as is, revision 251.
  • The benchmark was taken from here, modified here and might not be representative of a good Box2D benchmark, and it might not use enough of Box2D to do this compiler benchmark justice.
  • The file was manually written, by finding all the .cpp units.


  • Using Botan-1.10.3.
  • Data file:
  • First ran ./ --disable-asm --with-openssl --enable-modules=asn1,benchmark,block,cms,engine,entropy,filters,hash,kdf,mac,bigint,ec_gfp,mp_generic,numbertheory,mutex,rng,ssl,stream,cvc, this generates the header files and Makefile.
  • I disabled assembly, because assembly might interfere with optimizations that can occur when function boundaries do not block optimization. However, this is conjecture and might be totally wrong.
  • Then ran commands like grep -o "\./src.*cpp" Makefile and grep -o "\./checks.*" Makefile to obtain the .cpp units and put them into file.
  • Modified /checks/checks.cpp to not call the x509 unit tests, and removed the x509 check, because of a conflict between a Botan typedef and OpenSSL.
  • The benchmark included in the Botan source was used.

System specs:

  • OpenSuse 11.4, 32-bit
  • 4GB RAM
  • Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz

Going through chapter 3 of the book Computer Systems: A Programmer's Perspective, it is stated that an implementation like

testl %eax, %eax
cmovne (%eax), %edx

is invalid, because the memory read through (%eax) happens even when the condition does not hold, so we could end up dereferencing a null pointer. It is also stated that we should use branching code instead.

Still, wouldn't using conditional jumps lead to the same result? For example:

jmp *%eax

testl %eax, %eax
jne .L1

Is it possible to trick gcc to output something like that for an x86-32? Suppose I have an array of pointers to functions of which some are valid and some aren't and I call each one that's not NULL.

I just had an interesting idea. I was using objdump to dump a simple binary and I see many functions in the binary. Is it possible to create another C program that links with these functions, assuming I know the parameters for input and output?

Some more information: file1:test.c

#include <stdio.h>

int add(int x, int y)
{
    return x + y;
}

int main(int argc, const char *argv[])
{
    return 0;
}

file2: test1.c

#include <stdio.h>

int main(int argc, const char *argv[])
{
    return 0;
}

gcc test.c -o test.exe
gcc test1.c test.exe -o test1.exe

The second command fails with:

ld: in test.exe, can't link with a main executable
collect2: ld returned 1 exit status

I'm afraid not.

The compiled binary file has been processed through the relocation phase by the linker, which associates each symbol reference in the code with a run-time address.

You can do a simple experiment to find out the differences. Here is a program which outputs 'Hello World':

// main.c
#include <stdio.h>

int main()
{
    printf("Hello World!");
    return 0;
}

Using gcc -c you can compile the source code into a relocatable object:

$ gcc -c main.c

$ readelf -s main.o

Symbol table '.symtab' contains 10 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
     0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 00000000     0 FILE    LOCAL  DEFAULT  ABS main.c
     2: 00000000     0 SECTION LOCAL  DEFAULT    1 
     3: 00000000     0 SECTION LOCAL  DEFAULT    3 
     4: 00000000     0 SECTION LOCAL  DEFAULT    4 
     5: 00000000     0 SECTION LOCAL  DEFAULT    5 
     6: 00000000     0 SECTION LOCAL  DEFAULT    7 
     7: 00000000     0 SECTION LOCAL  DEFAULT    6 
     8: 00000000    29 FUNC    GLOBAL DEFAULT    1 main
     9: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND printf

You can see from here the value of function main is 0x0, which means it has not yet been relocated and can be linked with others.

But when you compile the file with the gcc command to generate an executable:

$ gcc main.c
$ readelf -s a.out | grep main
     2: 00000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@GLIBC_2.0 (2)
    39: 00000000     0 FILE    LOCAL  DEFAULT  ABS main.c
    51: 00000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@@GLIBC_
    62: 080483c4    29 FUNC    GLOBAL DEFAULT   13 main

Now you can see the address of function main has been relocated to 0x80483c4, which is the runtime address of the function code. The generated a.out can no longer be linked with others, since doing so could cause run-time address violations.

Generally speaking, the relocation phase cannot be reverted, because some symbol information gets lost during the phase.

For more details I suggest you read the Linking chapter in the book Computer Systems: A Programmer's Perspective, which covers linking and relocation in depth.

I am currently reading Computer Systems: A Programmer's Perspective and trying to understand ways to thwart buffer overflow attacks.

I understand why we need NOP sleds when address randomization is used and how to write exploits but i have trouble in understanding the address calculation related to NOP sleds which is given in the book. I will quote it here:-

(Assume the starting addresses of a program on the stack have a range of 2^23 on a 32 bit system and 2^32 on a 64 bit system)

"If we set up a 256 byte NOP sled, then the randomization over n=2^23 can be cracked by enumerating 2^15 starting addresses which is entirely feasible for a determined attacker. For the 64 bit case, trying to enumerate 2^24 addresses is more daunting."

How did the authors come up with the figures 2^15 and 2^24 for the 32 and 64 bit case respectively? An explanation would be really helpful.

They are just assuming a maximum range of 8388608 possible starting addresses on a 32-bit system, which is where they came up with the 2^23,

and the 2^15 is calculated by the following:

(8388608 / 256) = 32768 == 2^15

So in other words:

total_memory_size / NOP_sled_length = total_iterations_to_find_NOP_sled_in_memory

They calculated that based on the assumption that our NOP sled could be anywhere within a range starting from 0x0 all the way to 0x800000 (which is 8388608, or 2^23). Because our NOP sled is 256 bytes long, we don't need to try every single address; instead we can step through memory 256 bytes at a time, giving us the result from the equation above: 0x800000 / 256 = 32768 == 2^15. So we only need to brute-force 32768 possible addresses, because one of those addresses will land us inside the NOP sled and slide all the way down to our payload.

If our NOP sled was 500 bytes (assuming our exploit allowed us to fit a NOP sled that large) the equation would be:

0x800000 / 500 ≈ 16777 iterations to find the starting address of our NOP sled.

The reason why this approach is not so great for 64bit systems is because of the following equation:

2^32 / 256 = 16,777,216 (over 16 million possible addresses that our NOP sled could start at! And even if your NOP sled was 500 bytes and you divided by 500 you will still have over 8.5 million addresses where your NOP sled could start!)


If you imagine the above is my stack I have a total memory size of 40. I have a NOP sled (the N's) of 12 bytes and a payload (the P's) of 12 bytes. So my equation for this exploitable scenario would be:

40 / 12 = 3, giving me 3 possible addresses where my NOP sled could be found (12, 24, 32, or in hex 0x0c, 0x18 and 0x20) in as few tries as possible.

So if my exploit just starts at the beginning of the stack and counts in increments of 12 then it will find my NOP sled on the first try (landing 4 bytes into my NOP sled).

Update based on comments:

The idea behind a NOP sled in terms of a bypass technique for address randomization is not to guess the start of a NOP sled - it's to calculate the least amount of addresses that will guarantee that you will land inside your NOP sled with as few address guesses/brute forces as possible. Sure, if you want to find the beginning of your NOP sled you could use the following equation:

total_mem_size / NOP_size = least_amount_of_guesses_to_land_inside_payload + 1

But keep in mind that by adding the extra address to try, you are no longer calculating the least number of addresses to guess before you get payload execution (which is what both I and the book you are reading are calculating, because that is the idea behind using a NOP sled).

If we revisit my small "stack" above it is true that there could be 4 total addresses at which the NOP sled could start but the equation calculates the 3 addresses that are guaranteed to find the NOP sled (in as few guesses as possible being the key). To make it more clear you could say that the exploit developer will try to make this number as small as possible by increasing the size of the NOP sled ( where possible ) so they won't be worrying about finding the start of the NOP sled - they just want to land inside the NOP sled.

The guess of index 12 will land you 4 bytes into your NOP sled giving you only 8 NOPs to execute before reaching the payload. The guess of index 24 will land you a few bytes into the payload causing a crash and the guess of index 32 will land you past the payload also causing a crash.

Let's use your approach (of using all 4 addresses) to demonstrate why the extra address is not taken into consideration in the equation:


Let's add 1 to our equation giving us 4 possible addresses with the same stack layout as before:

40 / 12 = 3 + 1 = 4

So now we have 4 addresses to brute force to land in our NOP sled: 0, 12, 24, 32. Because we have a NOP sled of 12 bytes, there is still only 1 address (the address at index 12, which the original equation found) that will land us in our NOP sled and let shellcode execution start at the beginning of the shellcode. Index 0 lands us on stack data we don't control, before our NOP sled. So adding 1 more address to the mix doesn't help us find the NOP sled; it just increases the number of addresses to try/brute-force/guess.

And yes, you are correct: I am just incrementing addresses in the way that is most intuitive for the example; "counting" or "incrementing" addresses on the stack during actual execution will start at the high addresses.

I want to cover the equivalent of a typical CS undergrad course in material, so I'm making a list of books to cover the typical topics. I've split the list into topics that, from the research I did, I think are compulsory and optional. I would like some help to confirm if the topics are split correctly, and if the books are of the correct level. Also, please let me know if I left out any important topics, or if any are beyond undergrad level.

Thank you for your time!

Edit regarding the on hold status: I do not believe this question is off-topic as I am not asking for a recommendation for books - I am asking if the topics I have listed are indicative of a typical CS course, and if any important topics are missing. The books links are only there in case the books I have chosen are not correct for the topic, and can be removed if necessary.


Operating Systems: Operating System Concepts

Networks: Computer Networking: A Top-Down Approach

Discrete Mathematics: Concrete Mathematics

Data Structures and Algorithms: Introduction to Algorithms

Computer Architecture: Computer Systems: A Programmer's Perspective

Automata Theory: Introduction to the Theory of Computation

Compilers: Engineering a Compiler was recommended to me over the dragon book.

Database Theory: An Introduction to Database Systems

Programming Language Concepts and Design: Programming Language Pragmatics


Cryptography: Cryptography Engineering: Design Principles and Practical Applications

Functional Programming: Learn You a Haskell for Great Good!

Artificial Intelligence: Artificial Intelligence: A Modern Approach

Computer Graphics: Real-Time Rendering

Your list is very good on the subjects directly related to computer science. However, it is light on math. In my own B.Sc. in Computer Science I also had a ton of calculus, linear algebra, algebra (groups, rings, etc), statistics, analytic geometry and numerical analysis. Some applications of computer science rely heavily on those:

  • Machine learning depends on lots of linear algebra, calculus, and statistics;
  • Computer graphics depends a lot on analytic geometry and linear algebra;
  • Scientific computation relies on calculus and numerical analysis.

I never used much of the ton of Algebra I had, but I hear it is important for cryptography. :-)

For a programmer developing more regular applications your list is very good, but for those interested in these more specialized areas (which are still pretty important), these subjects are vital.

I'm reading the book Computer Systems: A Programmer's Perspective (2nd Edition), and Practice Problem 3.23 has me a little confused:

A function fun_b has the following overall structure:

int fun_b(unsigned x) {
   int val = 0;
   int i;
   for ( ____; ____; ____ ) {
      ____;
   }
   return val;
}

The gcc C compiler generates the following assembly code:

x at %ebp+8
1  movl 8(%ebp), %ebx
2  movl $0, %eax
3  movl $0, %ecx
4  .L13:
5  leal (%eax,%eax), %edx
6  movl %ebx, %eax
7  andl $1, %eax
8  orl  %edx, %eax
9  shrl %ebx   Shift right by 1
10 addl $1, %ecx
11 cmpl $32, %ecx
12 jne .L13

Reverse engineer the operation of this code and then do the following:

A. Use the assembly-code version to fill in the missing parts of the C code.

My solution.

int fun_b(unsigned x) {
   int val = 0;
   int i;
   for (i = 0; i < 32; i++) {
      val += val; // because leal (%eax,%eax), %edx --> %edx = %eax + %eax
      val = val | (x & 0x1);
      x >>= 1;
   }
   return val;
}

Book's solution.

int fun_b(unsigned x) {
  int val = 0;
  int i;
  for (i = 0; i < 32; i++) {
    val = (val << 1) | (x & 0x1);
    x >>= 1;
  }
  return val;
}

Please explain to me why leal shows non-typical behavior in this function. And I don't understand how this assembly code yields the statement val = (val << 1) | (x & 0x1).

In your code:

val  += val;
val = val | x & 0x1;

Here, val += val is equivalent to val * 2, which is effectively val shifted left by 1.

But I think your solution is correct only if the assembly code was something like:

x at %ebp+8
1  movl 8(%ebp), %ebx
2  movl $0, %eax
3  movl $0, %ecx
4  .L13:
5  addl %eax, %eax
6  movl %ebx, %edx
7  andl $1, %edx
8  orl  %edx, %eax
9  shrl %ebx   # shift right by 1
10 addl $1, %ecx
11 cmpl $32, %ecx
12 jne .L13

Because if val + val were a separate statement, the compiler would usually place it in the eax register rather than in edx (I'm not sure this is always the case). So, for the code you were given, the possible solutions are:

val = (val << 1) | (x & 0x1);

or

val = (val + val) | (x & 0x1);

or

val = (val * 2) | (x & 0x1);

I encountered this nicely styled ASCII table. Of course I can store it in a file named ascii and use cat ascii to display its content, but I want to make it behave more like a command.

(While reading CS:APP I discovered that I needn't have bothered storing it in a file and using other commands: just run man ascii.)

If your shell supports aliases, you can do:

alias ascii='cat ~/ascii'

Then just type ascii et voila!

If you're using bash, put the above line in your .bashrc to persist it across logins. Other shells have similar features.

I am attempting to learn assembly language, but I need help learning what each instruction's purpose is. Following is a program where they are used.

 push   %ebp             # save the caller's frame pointer
 mov    %esp,%ebp        # set up this function's frame pointer
 sub    $0x10,%esp       # reserve 16 bytes on the stack for locals
 mov    0x8(%ebp),%eax   # load the first argument into %eax
 add    $0x1,%eax        # add 1 to it
 mov    %eax,-0x4(%ebp)  # store the result into a local variable
 mov    -0x4(%ebp),%eax  # load the local back into %eax (the return value)

This is a pretty good book for your purpose:

Computer Systems: A Programmer's Perspective