Computer Architecture: CD-ROM

John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau


Mentioned in questions and answers.

I'm curious, why did Sun decide to make the JVM stack-based and Google decide to make the DalvikVM register-based?

I suppose the JVM can't really assume that a certain number of registers are available on the target platform, since it is supposed to be platform independent. Therefore it just postpones register allocation, etc., to the JIT compiler. (Correct me if I'm wrong.)

So the Android guys thought, "hey, that's inefficient, let's go for a register-based VM right away..."? But wait, there are multiple different Android devices; what number of registers did Dalvik target? Are the Dalvik opcodes hardcoded for a certain number of registers?

Do all current Android devices on the market have about the same number of registers? Or is there a register reallocation performed during dex loading? How does all this fit together?

I don't know why Sun decided to make the JVM stack-based. Erlang's virtual machine, BEAM, is register-based for performance reasons, and Dalvik also seems to be register-based for performance reasons.

From Pro Android 2:

Dalvik uses registers primarily as units of data storage instead of the stack. Google is hoping to accomplish 30 percent fewer instructions as a result.

And regarding the code size:

The Dalvik VM takes the generated Java class files and combines them into one or more Dalvik Executable (.dex) files. It reuses duplicate information from multiple class files, effectively reducing the space requirement (uncompressed) by half compared with a traditional .jar file. For example, the .dex file of the web browser app in Android is about 200k, whereas the equivalent uncompressed .jar version is about 500k. The .dex file of the alarm clock app is about 50k, and roughly twice that size in its .jar version.

And as I remember, Computer Architecture: A Quantitative Approach also concludes that a register machine performs better than a stack-based machine.

I know that every running process has pages associated with it in virtual memory, and that only some of them are loaded into main memory as required. I also know that a program has a stack and also a heap for allocating dynamic memory. Here are my questions.

  1. Is the stack also part of some page in main memory?
  2. What happens when the program is moved to the waiting state? Where are the stack pointer, program counter and other info stored?
  3. Why does the stack grow down and the heap grow up?
  4. Can the L1 and L2 caches contain only one chunk of contiguous memory, or can they hold some part of the stack and some of the heap?

Can you recommend any good book that covers these things?

  1. Yes - the stack lives in ordinary pages of virtual memory like everything else. It is typically placed near the "top" (high addresses) of the process's address space and grows downward, while the heap sits at lower addresses and grows upward toward the stack. (A small sketch after this list illustrates the layout.)

  2. The O/S stores a "context" per running process. The operation of saving and restoring process state is called a "context switch."

  3. Just a convention, as far as I know. The stack doesn't really "grow"; it has a fixed allocation, and the stack pointer simply moves down within that reserved region.

  4. Caches simply contain snapshots of parts of RAM that have been used (either recently or nearby). At any moment in time they can have memory from any part of the address space in them. What shows up where depends heavily on the structural parameters of the cache (block length, associativity, total size, etc.).
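To make points 1 and 4 concrete, here is a minimal C++ sketch (my illustration, not part of the original answer) that prints the addresses of a static, a heap, and a stack variable. The exact values and their ordering depend on the OS, the architecture, and address-space randomization, but all three are ordinary virtual addresses backed by pages, and all three end up in the same L1/L2 caches when accessed.

    #include <cstdio>
    #include <cstdlib>

    int globalVar = 0;  // static data segment

    int main() {
        int stackVar = 0;                                // lives on the stack
        int* heapVar = (int*)std::malloc(sizeof(int));   // lives on the heap

        // All three are plain virtual addresses; on a typical Linux/x86-64
        // layout the stack address is the highest and the heap sits above
        // the static data, but this is platform-specific.
        std::printf("static: %p\n", (void*)&globalVar);
        std::printf("heap:   %p\n", (void*)heapVar);
        std::printf("stack:  %p\n", (void*)&stackVar);

        std::free(heapVar);
        return 0;
    }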

I would suggest Computer Architecture: A Quantitative Approach as a good reference on the underlying hardware and any book on Operating Systems for how the hardware is "managed."

Folks, I've been programming high speed software for over 20 years and know virtually every trick in the book - micro-benchmarking, profiling, cooperative user-mode multitasking, tail recursion, you name it - for very high performance stuff on Linux, Windows, and more.

The problem is that I find myself befuddled by what happens when multiple threads of CPU-intensive work run on a multi-core processor.

The results from micro-benchmarks of various ways of sharing data between threads (on different cores) don't seem to follow logic.

It's clear that there is some "hidden interaction" between the cores which isn't obvious from my own code. I hear about the L1 cache and other issues, but those are opaque to me.

The question is: where can I learn this stuff? I am looking for an in-depth book on how multi-core processors work and how to program to take advantage of their memory caches and other hardware architecture instead of being punished by them.

Any advice or great websites or books? After much Googling, I'm coming up empty.

Sincerely, Wayne

This book taught me a lot about these sorts of issues and why raw CPU power is not necessarily the only thing to pay attention to. I used it in grad school years ago, but I think all of the principles still apply:

http://www.amazon.com/Computer-Architecture-Quantitative-Approach-4th/dp/0123704901

And essentially, a major issue in multi-processor configurations is synchronizing access to main memory; if you don't do this right, it can be a real performance bottleneck. It gets quite complex with the caches, which have to be kept coherent.
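One concrete example of the "hidden interaction" the questioner describes is false sharing: two threads update different variables that happen to live in the same cache line, so the coherence protocol keeps bouncing the line between the cores even though the threads never touch each other's data. Here is a hedged C++11 sketch (my addition, with illustrative names, assuming a 64-byte cache line, which is typical but not universal):

    #include <atomic>
    #include <thread>

    // Two counters that share a cache line: every increment by one thread
    // invalidates the other core's copy of the line ("false sharing").
    struct SharedLine {
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };

    // The same counters forced onto separate (assumed 64-byte) cache lines,
    // so the two cores no longer interfere with each other.
    struct SeparateLines {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <typename Counters>
    void hammer(Counters& c) {
        std::thread t1([&] { for (long i = 0; i < 10000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (long i = 0; i < 10000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join();
        t2.join();
    }

    int main() {
        SharedLine shared;        // typically much slower when timed
        SeparateLines separate;   // typically scales as expected
        hammer(shared);
        hammer(separate);
        return 0;
    }

Timing the two cases on a multi-core machine usually shows a large difference, which is exactly the kind of effect the book's chapters on memory hierarchy and coherence explain.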

my own question, with answer, on stackoverflow's sister site: http://programmers.stackexchange.com/questions/126986/where-can-i-find-an-overview-of-known-multithreading-design-patterns/126993#126993

I will copy the answer to avoid the need for click-through:

Quoting Boris:

Parallel Programming with Microsoft .NET: Design Patterns for Decomposition and Coordination on Multicore Architectures https://amazon.com/dp/0735651590/?tag=devbookscom20-20

This is a book I recommend wholeheartedly.

It is:

New - published last year, which means you are not reading somewhat outdated practices.

Short - about 200+ pages, dense with information. These days there is too much to read and too little time to read 1000+ page books.

Easy to read - not only is it very well written, but it introduces hard-to-grasp concepts in a really simple, readable way.

Intended to teach - each chapter gives exercises to do. I know it is always beneficial to do these, but I rarely do. This book gives very compelling and interesting tasks; surprisingly, I did most of them and enjoyed doing them.

Additionally, if you wish to learn more of the low-level details, this is the best resource I have found: "The Art of Multiprocessor Programming". It's written using Java for its code samples, which plays nicely with my C# background.

PS: I have about 5 years of "hard core" parallel programming experience (albeit using C#), so I hope you can trust me when I say that "The Art of Multiprocessor Programming" rocks.

I have 2 arrays of 16 elements (chars) that I need to "compare" and see how many elements are equal between the two.

This routine is going to be used millions of times (a usual run is about 60 or 70 million times), so I need it to be as fast as possible. I'm working in C++ (C++Builder 2007, for the record).

Right now, I have a simple:

matches += array1[0] == array2[0];

repeated 16 times (profiling shows this to be about 30% faster than doing it with a for loop).

Is there any other way that could work faster?

Some data about the environment and the data itself:

  • I'm using C++Builder, which doesn't have any speed optimizations to take into account. I will eventually try another compiler, but right now I'm stuck with this one.
  • The data will be different most of the time. 100% equal data is usually very, very rare (maybe less than 1%).

If you have the ability to control the location of the arrays, for instance by putting one right after the other in memory, it might cause them to be loaded into the CPU's cache on the first access.

It depends on the CPU and its cache structure and will vary from one machine to another.
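As a hedged illustration of the placement idea (my sketch, with illustrative names, not the answerer's code), putting both 16-byte arrays inside one struct lays them out back to back, so the first access is likely to bring both into the same or adjacent cache lines; whether they actually share a line depends on the line size and the struct's alignment:

    #include <cstddef>

    // Both arrays live back to back in one struct, so they occupy 32
    // contiguous bytes and are likely to be fetched together.
    struct ComparePair {
        char array1[16];
        char array2[16];
    };

    int countMatches(const ComparePair& p) {
        int matches = 0;
        // Fixed trip count: the compiler can fully unroll this, so it does
        // the same work as the hand-unrolled version in the question.
        for (std::size_t i = 0; i < 16; ++i)
            matches += (p.array1[i] == p.array2[i]);
        return matches;
    }

On compilers and CPUs with SSE2 support, the same count can also be computed with a few vector intrinsics (_mm_cmpeq_epi8 followed by a population count of _mm_movemask_epi8), but that goes beyond what this answer suggests.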

You can read about the memory hierarchy and caches in Hennessy and Patterson's Computer Architecture: A Quantitative Approach.

I am trying to study for a test where I have to know something about MIPS and assembly code. Can you help me, please? I will try to write what I think is a correct answer for each question, but I'm not sure if I'm right.

1) Can a direct (immediate) 32-bit operand in MIPS contain any 32-bit value? I think "no, never", because the first 16 bits are reserved for the opcode and the source and destination registers. Is that right, or are there some instructions that can contain any 32-bit value?

2) We have times for the pipeline stages (IF = 400ps, ID = 500ps, EX = 450ps, MEM = 500ps, WB = 150ps). What is the clock cycle time for a) a processor without a pipeline? b) a processor with a pipeline?

I think that a) is 2000ps (the sum of all the times) and b) is 500ps (the largest time in the table), but again, I'm not sure.

3) I have the following assembly code:

0x0000        addi $t0, $0, 5
0x0004  loop: beq  $t0, $0, done
0x0008        nop
0x000C        lw   $t1, 0x4($0)
0x0010        lw   $t2, 0x24($0)
0x0014        addi $t0, $t0, -1
0x0018        j    loop
0x001C        nop
0x0020  done:

I am not 100% sure what it does (because I don't fully understand what the result of the 0x4($0) operand in the load is). I know that there is a for loop (for $t0 = 5; $t0 > 0; $t0--). The question is: what are the hit rate and miss rate of the cache, and how do you calculate them?

If you could answer at least the first two questions, it would be great. Thanks very much...

  1. If you are talking about 32-bit MIPS, then of course it is not possible for I-type instructions to contain a 32-bit immediate. The layout of such an instruction is (opcode, rs, rt, IMM), with field widths of (6, 5, 5, 16) bits, so the immediate value is only 16 bits wide. A full 32-bit constant therefore has to be built with two instructions, typically lui followed by ori. (See the decoding sketch after this list.)

  2. I assume you are referring to the latency of instruction execution. As you point out, if there is no pipeline you add up the times for all the stages. If the processor uses a pipeline, the clock must be set to match the slowest stage; in your case that is 500ps, for both the decode and memory stages.

  3. lw $t1, 0x4($0) loads a word from memory address 0x4 ($0 refers to register 0, which always contains zero) and stores the value in $t1. So if you look carefully at the code, you will see that it always loads data from addresses 0x4 and 0x24. Assuming the cache is empty at the beginning, you will have two misses in the first iteration and no other misses during the following ones. Therefore the miss rate will be (1*2) / (5*2) = 2/10 = 1/5. You must take into account, however, whether the cache transfers data in blocks. In that case the first load may bring in a large block, for instance 64 bytes, which would mean the second load does not miss and the miss rate drops to 1/10. But I do not think that is the case with this simple processor.
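To make the field layout in point 1 concrete, here is a small C++ sketch (my illustration, not part of the original answer) that pulls the fields out of a 32-bit MIPS I-type instruction word:

    #include <cstdint>
    #include <cstdio>

    // MIPS I-type layout: | opcode (6) | rs (5) | rt (5) | immediate (16) |
    void decodeIType(std::uint32_t instr) {
        unsigned opcode = (instr >> 26) & 0x3F;
        unsigned rs     = (instr >> 21) & 0x1F;
        unsigned rt     = (instr >> 16) & 0x1F;
        int      imm    = static_cast<std::int16_t>(instr & 0xFFFF);  // sign-extended, as addi/lw do

        std::printf("opcode=%u rs=%u rt=%u imm=%d\n", opcode, rs, rt, imm);
    }

    int main() {
        // addi $t0, $0, 5 encodes as 0x20080005:
        // opcode 0x08, rs = 0 ($0), rt = 8 ($t0), imm = 5
        decodeIType(0x20080005);
        return 0;
    }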

FYI, here you have lots of information about MIPS architecture and ISA. You may also be interested in a classic book on computer architecture: Computer Architecture: A Quantitative Approach

I need to write an essay on multi-core processing; are there any good books you can recommend? Thanks! :)

Computer Architecture: A Quantitative Approach, 5th Edition by John L. Hennessy & David A. Patterson or Computer Architecture: A Quantitative Approach 4th Edition

covers things at a high level, and might be in a library nearby.

Also Computer Organization and Design, Revised Fourth Edition: The Hardware/Software Interface by the same authors.

But the Wikipedia article Multi-core processor would get you started, with lots of references.

Multicore Processors – A Necessity looked like a relatively straightforward starting point.

Background

I've been reading through various books and articles to learn about processor caches, cache consistency, and memory barriers in the context of concurrent execution. So far though, I have been unable to determine whether a common coding practice of mine is safe in the strictest sense.

Assumptions

  1. The following pseudo-code is executed on a two-processor machine:

    int sharedVar = 0;
    
    myThread()
    {
        print(sharedVar);
    }
    
    main()
    {
        sharedVar = 1;
        spawnThread(myThread);
        sleep(-1);
    }
    
  2. main() executes on processor 1 (P1), while myThread() executes on P2.

  3. Initially, sharedVar exists in the caches of both P1 and P2 with the initial value of 0 (due to some "warm-up code" that isn't shown above.)

Question

Strictly speaking – preferably without assuming any particular type of CPU – is myThread() guaranteed to print 1?

With my newfound knowledge of processor caches, it seems entirely possible that at the time of the print() statement, P2 may not have received the invalidation request for sharedVar caused by P1's assignment in main(). Therefore, it seems possible that myThread() could print 0.

References

These are the related articles and books I've been reading:

  1. Shared Memory Consistency Models: A Tutorial
  2. Memory Barriers: a Hardware View for Software Hackers
  3. Linux Kernel Memory Barriers
  4. Computer Architecture: A Quantitative Approach

Strictly speaking – preferably without assuming any particular type of CPU – is myThread() guaranteed to print 1?

Theoretically, it can print either 0 or 1, even on x86, since stores can move after loads on almost any architecture.

In practice, it would be hard to make myThread() print 0.
Spawning a thread will most likely function as an implicit store/release memory barrier, since it would probably:
- have at least one instruction along the execution path that results in a memory barrier - interlocked instructions, explicit memory barrier instructions, etc.,
- or the store would simply be retired/drained from the store buffer by the time myThread() is called, since setting up a new thread results in executing many instructions - among them, many stores.
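For comparison, here is a hedged C++11 sketch of the pseudo-code above (my addition, not part of the original answer): at the language level, the completion of the std::thread constructor synchronizes with the start of the thread function, so the store to sharedVar made before spawning is guaranteed to be visible in the new thread without any explicit barrier, regardless of the underlying CPU.

    #include <cstdio>
    #include <thread>

    int sharedVar = 0;

    void myThread() {
        // Guaranteed to print 1: the store in main() happens-before the
        // std::thread constructor completes, which synchronizes-with the
        // start of this function.
        std::printf("%d\n", sharedVar);
    }

    int main() {
        sharedVar = 1;               // written before the thread is spawned
        std::thread t(myThread);
        t.join();                    // join instead of sleeping forever
        return 0;
    }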