ARM System Developer's Guide

Andrew N. Sloss, Dominic Symes, Chris Wright

Mentioned 8

This book provides a comprehensive description of the operation of the ARM core from a developer's perspective with a clear emphasis on software. It demonstrates not only how to write efficient ARM software in C and assembly but also how to optimize code. Example code throughout the book can be integrated into commercial products or used as templates to enable quick creation of productive software.

Mentioned in questions and answers.

I'm familiar with the x86[-64] architecture and assembly. I want to start developing for an ARM processor, but unlike with desktop processors, I don't have an actual ARM processor. I think I need an ARM simulator. One source says:

An ARM assembly compiler will be required, the most accessible is the ARMulator.

I thought of downloading ARMulator but found that:

It's not sold separately. But you can download an eval of RVDS - which includes RVISS/ARMulator

I've downloaded and installed RVDS, but it looks very complex. I can't figure out what I need to do to write ARM assembly and run it.

Do you have any better suggestions?

Options for environments

  • Install Linux in the QEMU system emulator. It can emulate a variety of ARM-based chipsets.
  • Get an emulator for a specific ARM-chipset like a game handheld. Gameboy Advance is fun to play with. NoCash GBA and VisualBoy Advance are two great GBA emulators.


You will need a toolchain. A toolchain is a collection of low-level tools like an assembler, a linker, a compiler, an archiver and a bunch of other useful stuff. More specifically, you want a cross-toolchain, which means that the toolchain runs on one system but builds executables for another architecture. This way you can build applications for ARM devices on your x86-based PC. It's faster and more convenient.

If you run Windows, DevkitPro is a fairly good choice. For Unix/Linux/BSD variants, you have CodeSourcery's free toolchains and the GCC toolchain. There are several others, but you don't need more options.
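One quick way to confirm that a cross-toolchain really targets ARM is to compile a file that reports the architecture it was built for. The sketch below relies on the macros `__arm__`, `__thumb__`, `__aarch64__`, `__x86_64__` and `__i386__` that GCC-compatible compilers predefine. The function names `target_arch` and `print_target` are made up for illustration.

```c
#include <stdio.h>

/* Returns a string naming the architecture this file was compiled for,
 * based on macros predefined by GCC-compatible compilers. */
const char *target_arch(void)
{
#if defined(__aarch64__)
    return "ARM (64-bit)";
#elif defined(__arm__)
#  if defined(__thumb__)
    return "ARM (32-bit, Thumb state)";
#  else
    return "ARM (32-bit, ARM state)";
#  endif
#elif defined(__x86_64__)
    return "x86-64";
#elif defined(__i386__)
    return "x86 (32-bit)";
#else
    return "unknown";
#endif
}

/* Prints the target; call this from main() in a hosted build. */
void print_target(void)
{
    printf("compiled for: %s\n", target_arch());
}
```

Built with the native compiler on a PC, this should report an x86 target. Built with an ARM cross-compiler and run under an emulator, it should report ARM.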


Get the specification for your ARM CPU of choice. One reference you need no matter the CPU is the ARM Architecture Reference Manual (often abbreviated ARMARM). I'm hosting an older version here, which covers the ARM architecture and instruction set versions up to ARMv4T, but current and later versions are available as well. If you go for the GBA, note that its CPU is an ARM7TDMI, which implements instruction set version ARMv4T.

The ARMARM contains tips and examples for usual nitty-gritty system coding, tips on how to proceed on certain design issues, as well as a reference of both the ARM instruction set, Thumb instruction set and co-processors like MMUs, MPUs, DSPs and FPUs.

If you stick with QEMU, that's pretty much all you need, since the Linux kernel handles everything. QEMU also has user-mode emulation (with a C-library stub). If you go for one of the GBA emulators, here's a nice reference over the GBA hardware and hardware registers: CowBiteSpec. Also make sure to check out

Nintendo DS is probably an option as well, but I don't know of any decent emulators for that handheld yet. Good luck to you :-)

EDIT: Here's a trivial example of some GBA code I wrote years ago: GBA Color fill 240x160 16-bit example
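The linked example is not reproduced here, so the following is an independent minimal sketch of the same idea, not the original code. It assumes the GBA's bitmap mode 3, where the screen is a 240x160 array of 16-bit pixels holding 15-bit colour. The names `rgb15` and `fill_screen` are hypothetical, and the function takes the framebuffer as a parameter so it can also be tried on a PC.

```c
#include <stddef.h>
#include <stdint.h>

#define GBA_WIDTH   240
#define GBA_HEIGHT  160

/* Pack 5-bit red, green and blue components into the GBA's 15-bit
 * colour format (red in the low bits). */
static inline uint16_t rgb15(unsigned r, unsigned g, unsigned b)
{
    return (uint16_t)((r & 31u) | ((g & 31u) << 5) | ((b & 31u) << 10));
}

/* Fill a 240x160 16-bit framebuffer with one colour, using a
 * count-down do-while loop. */
void fill_screen(volatile uint16_t *vram, uint16_t color)
{
    size_t n = (size_t)GBA_WIDTH * GBA_HEIGHT;
    do {
        *vram++ = color;
    } while (--n != 0);
}
```

On real hardware you would first write 0x0403 (mode 3 with background 2 enabled) to the display control register at 0x04000000, then pass the VRAM base address 0x06000000, cast to `volatile uint16_t *`, as the first argument.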

I'm looking to learn about embedded programming (in C mainly, but I hope to brush up on my ASM as well) and I was wondering what the best platform would be. I have some experience using Atmel AVRs and programming them with the STK500, and found that to be relatively easy. I especially like AVR Studio and the debugger that lets you view the state of the registers.

However, If I was to take the time to learn, I would rather learn about something that is prevalent in industry. I am thinking ARM, that is unless someone has a better suggestion.

I would also be looking for some reference material, I have found the books section on the ARM website and if one is a technically better book than another I would appreciate a heads up.

The last thing I would be looking for is a prototyping/programming board like the STK500 that has some buttons and so forth.

Thanks =]

You should try to learn from the developer kits provided by Embedded Artists. After you get the kit, check their instructional videos and the videos provided by NXP, which are not as detailed as they could be but cover a lot of ground. The problems with learning ARM as your first architecture while trying to do something practical are:

  • You need to buy a dev kit.
  • You need a good book to learn ARM assembly, because sooner or later you will come across ARM startup code, which is a lot for a beginner to deal with. The book I mentioned also covers some C programming.
  • Combine the book mentioned above with a user guide for your specific processor like this one. Make sure you get this, as studying it in combination with the above book is the only way to learn your ARM processor in detail.
  • If you want to move from ARM assembly to C programming, you will need to read this book, which covers a different ARM processor but is easier for a C beginner. The downside is that it doesn't explain any ARM assembly, but that is why you need the first book.

There is no easy way.

I'm starting to develop an application on an embedded ARM board. I'm a newbie at developing embedded applications. I would like resources such as books and online guides that will get me started developing applications on embedded ARM. I was planning to use Linux as the OS.

At some point you'll need to understand some level of ARM Assembly language. "ARM System Developer's Guide" by Andrew Sloss, et al is a really good book for ARM assembly.

I have to choose a thesis topic soon and I was considering implementing an operating system for an architecture that is not x86 (I'm leaning towards ARM or AVR). The reason I am avoiding x86 is because I would like to gain some experience with embedded platforms and I (possibly incorrectly) believe that the task may be easier when carried out on a smaller scale. Does anyone have any pointers to websites or resources where there are some examples of this. I have read through most if not all of the OSDev questions on stack overflow, and I am also aware of AvrFreaks and OSDev. Additionally if anyone has had experience in this area and wanted to offer some advice in regards to approach or platform it would be much appreciated.


Seems like you should get a copy of Jean Labrosse's book MicroC/OS.

It looks like he may have just updated it too.

This is a well documented book describing the inner workings of an RTOS written in C and ported to many embedded processors. You could also run it on a x86, and then cross compile to another processor.

If you choose ARM, pick up a copy of the ARM System Developer's Guide (Sloss, Symes, Wright).

Chapter 11 discusses the implementation of a simple embedded operating system, with great explanations and sample code.

I'm debugging some odd ARM exceptions in an embedded system using the IAR workbench toolchain. Sometimes, when an exception is trapped the SVC_STACK is reported as out of range (very out of range!) Is this relevant, or just an artifact of the J-Link JTAG debugger? What is the SVC_STACK used for? It is set to 0x1000 size, but when it is out of range, it is way up in our heap area. Thanks!

ARM's SVC (supervisor) mode is entered on reset and when a software interrupt (SWI/SVC) instruction is executed. IRQ and FIQ (fast IRQ) have their own modes, as do aborts and undefined-instruction exceptions. SVC mode can also be entered directly by code executing in a privileged mode by writing the mode bits of the CPSR register, but I think this is uncommon except when initializing the system.

On entry to SVC mode, the processor switches to the SVC stack pointer, which has to be set up very early in the initialization of the system. I'm guessing that your initialization code is not properly setting up the SVC stack, or it's possible that one of the exception handlers is not coded properly and is trashing the stack.
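Which banked stack pointer is active depends on the mode bits (the low five bits) of the CPSR, so decoding a CPSR value captured in the debugger shows which stack a handler was actually running on. The sketch below uses the mode encodings of the classic ARM architecture (ARMv4 through ARMv7-A/R). The function name `cpsr_mode_name` is hypothetical.

```c
#include <stdint.h>

#define CPSR_MODE_MASK 0x1Fu   /* low five bits of the CPSR select the mode */

/* Translate the mode field of a CPSR value into a readable name. */
const char *cpsr_mode_name(uint32_t cpsr)
{
    switch (cpsr & CPSR_MODE_MASK) {
    case 0x10: return "User";
    case 0x11: return "FIQ";
    case 0x12: return "IRQ";
    case 0x13: return "Supervisor";
    case 0x17: return "Abort";
    case 0x1B: return "Undefined";
    case 0x1F: return "System";
    default:   return "unknown";
    }
}
```

For example, a CPSR whose low byte is 0xD3 decodes as Supervisor mode with IRQ and FIQ masked, which is the state the core is in after reset.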

A third possibility is that you're using an RTOS that sets up the ARM stacks the way it wants (basically overriding the SVC stack that the IAR's initialization code might set up). If this is the case, it's possible that everything is OK, but the IAR debugger thinks the SVC stack is out of range - the debugger will get its information from the linker config file - but if something changes the stack to another area of memory, then the debugger will get confused.

This happened to me all the time with the user mode stack in IAR when using an RTOS - the stacks were allocated based on task control blocks which were not in the CSTACK segment the debugger thought it should be in, and the debugger would issue irritating warnings. There was some project configuration setting that could be used to quiet the warnings, but I don't recall off the top of my head what it was - we rarely bothered with it, and just lived with the noise.

You'll need to verify that the stack 'way up in the heap' area is valid - if you don't have some bit of code explicitly doing this, it's likely that it's wrong (or maybe you'll need to ask your RTOS vendor).
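A check like this can also be done in code, for instance from an exception handler or a debug build. The sketch below is only an illustration: `sp_in_stack` is a hypothetical helper, and the base address and size would come from your linker configuration or from the RTOS task control block.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if sp lies inside [base, base + size]. A full-descending stack
 * starts with sp == base + size and grows down toward base. */
bool sp_in_stack(uintptr_t sp, uintptr_t base, uintptr_t size)
{
    return sp >= base && sp - base <= size;
}
```

With the 0x1000-byte SVC stack from the question, a stack pointer reported far above `base + 0x1000` fails this test, which matches the debugger's out-of-range warning.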

The ARM Architecture Reference Manual (ARM ARM) is freely available and goes into excruciating detail about how the ARM stacks work. Another good reference is the ARM System Developer's Guide by Andrew Sloss, et al.

I have a C Function which tries to copy a framebuffer to FSMC RAM.

The function drags the frame rate of the game loop down to 10 FPS. I would like to know how to analyze the disassembled function: should I count the cycles of each instruction? I want to know where the CPU spends its time. I'm sure that the algorithm is also a problem, because it's O(N^2).

The C Function is:

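void LCD_Flip()
{
    u8  i,j;

    LCD_SetCursor(0x00, 0x0000);
    LCD_WriteRegister(0x0050,0x00);//GRAM horizontal start position
    LCD_WriteRegister(0x0051,239);//GRAM horizontal end position
    LCD_WriteRegister(0x0052,0);//Vertical GRAM Start position
    LCD_WriteRegister(0x0053,319);//Vertical GRAM end position
    LCD_WriteIndex(0x0022);

    // Loops and calls below are reconstructed from the disassembly that follows
    for (j = 0; j < 100; j++)
    {
        for (i = 0; i < 240; i++)
        {
            u16 color = frameBuffer[i+j*fbWidth];
            LCD_WriteData(color);
        }
    }
}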

Disassembled function:

08000fd0 <LCD_Flip>:
 8000fd0:   b580        push    {r7, lr}
 8000fd2:   b082        sub sp, #8
 8000fd4:   af00        add r7, sp, #0
 8000fd6:   2000        movs    r0, #0
 8000fd8:   2100        movs    r1, #0
 8000fda:   f7ff fde9   bl  8000bb0 <LCD_SetCursor>
 8000fde:   2050        movs    r0, #80 ; 0x50
 8000fe0:   2100        movs    r1, #0
 8000fe2:   f7ff feb5   bl  8000d50 <LCD_WriteRegister>
 8000fe6:   2051        movs    r0, #81 ; 0x51
 8000fe8:   21ef        movs    r1, #239    ; 0xef
 8000fea:   f7ff feb1   bl  8000d50 <LCD_WriteRegister>
 8000fee:   2052        movs    r0, #82 ; 0x52
 8000ff0:   2100        movs    r1, #0
 8000ff2:   f7ff fead   bl  8000d50 <LCD_WriteRegister>
 8000ff6:   2053        movs    r0, #83 ; 0x53
 8000ff8:   f240 113f   movw    r1, #319    ; 0x13f
 8000ffc:   f7ff fea8   bl  8000d50 <LCD_WriteRegister>
 8001000:   2022        movs    r0, #34 ; 0x22
 8001002:   f7ff fe87   bl  8000d14 <LCD_WriteIndex>
 8001006:   2300        movs    r3, #0
 8001008:   71bb        strb    r3, [r7, #6]
 800100a:   e01b        b.n 8001044 <LCD_Flip+0x74>
 800100c:   2300        movs    r3, #0
 800100e:   71fb        strb    r3, [r7, #7]
 8001010:   e012        b.n 8001038 <LCD_Flip+0x68>
 8001012:   79f9        ldrb    r1, [r7, #7]
 8001014:   79ba        ldrb    r2, [r7, #6]
 8001016:   4613        mov r3, r2
 8001018:   011b        lsls    r3, r3, #4
 800101a:   1a9b        subs    r3, r3, r2
 800101c:   011b        lsls    r3, r3, #4
 800101e:   1a9b        subs    r3, r3, r2
 8001020:   18ca        adds    r2, r1, r3
 8001022:   4b0b        ldr r3, [pc, #44]   ; (8001050 <LCD_Flip+0x80>)
 8001024:   f833 3012   ldrh.w  r3, [r3, r2, lsl #1]
 8001028:   80bb        strh    r3, [r7, #4]
 800102a:   88bb        ldrh    r3, [r7, #4]
 800102c:   4618        mov r0, r3
 800102e:   f7ff fe7f   bl  8000d30 <LCD_WriteData>
 8001032:   79fb        ldrb    r3, [r7, #7]
 8001034:   3301        adds    r3, #1
 8001036:   71fb        strb    r3, [r7, #7]
 8001038:   79fb        ldrb    r3, [r7, #7]
 800103a:   2bef        cmp r3, #239    ; 0xef
 800103c:   d9e9        bls.n   8001012 <LCD_Flip+0x42>
 800103e:   79bb        ldrb    r3, [r7, #6]
 8001040:   3301        adds    r3, #1
 8001042:   71bb        strb    r3, [r7, #6]
 8001044:   79bb        ldrb    r3, [r7, #6]
 8001046:   2b63        cmp r3, #99 ; 0x63
 8001048:   d9e0        bls.n   800100c <LCD_Flip+0x3c>
 800104a:   3708        adds    r7, #8
 800104c:   46bd        mov sp, r7
 800104e:   bd80        pop {r7, pc}

Not exactly answering your question, but I see you aspire for fast execution of the loops.

Here are some tips from the book:

ARM System Developer's Guide: Designing and Optimizing System Software (The Morgan Kaufmann Series in Computer Architecture and Design)

Chapter 5 contains section named 'C looping structures'. Here is the summary of the section:

Writing Loops Efficiently

  • Use loops that count down to zero. Then the compiler does not need to allocate a register to hold the termination value, and the comparison with zero is free.
  • Use unsigned loop counters by default and the continuation condition i!=0 rather than i>0. This will ensure that the loop overhead is only two instructions.
  • Use do-while loops rather than for loops when you know the loop will iterate at least once. This saves the compiler checking to see if the loop count is zero.
  • Unroll important loops to reduce the loop overhead. Do not overunroll. If the loop overhead is small as a proportion of the total, then unrolling will increase code size and hurt the performance of the cache.
  • Try to arrange that the number of elements in arrays are multiples of four or eight. You can then unroll loops easily by two, four, or eight times without worrying about the leftover array elements.

Based on the summary, your inner loop might look like this:

unsigned int i = 240;     // Use unsigned loop counters by default
                          // and the continuation condition i!=0
                          // (240 pixels per row, four per pass)
do {
    // Unroll important loops to reduce the loop overhead
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
} while ( i != 0 );  // Use do-while loops rather than for
                     // loops when you know the loop will
                     // iterate at least once

You might also want to experiment with pragmas, e.g.:

#pragma Otime

#pragma unroll(n)

Not everything may be applicable in your application (for example, the loop above fills the buffer in reverse order). I just wanted to draw your attention to the book and to possible points for optimization.
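The disassembly in the question reloads both loop counters from the stack and recomputes `j*fbWidth` for every pixel, which suggests an unoptimized build. Here is a hedged sketch of the same transfer in forward order, with the row offset hoisted out of the inner loop and both loops counting down to zero. `flip_rows`, `pixel_sink` and `record_pixel` are hypothetical names. `record_pixel` stands in for `LCD_WriteData` so the sketch can be tried on a PC.

```c
#include <stdint.h>

typedef void (*pixel_sink)(uint16_t color);

/* Host-side stand-in for LCD_WriteData: records what was sent. */
static uint16_t sent[16];
static unsigned sent_count;

static void record_pixel(uint16_t color)
{
    if (sent_count < sizeof sent / sizeof sent[0])
        sent[sent_count] = color;
    sent_count++;
}

/* Send rows x cols pixels from fb (stride pixels per row) to sink.
 * rows and cols must both be non-zero. */
void flip_rows(const uint16_t *fb, unsigned stride,
               unsigned rows, unsigned cols, pixel_sink sink)
{
    const uint16_t *row = fb;        /* start of the current row */
    unsigned j = rows;
    do {
        const uint16_t *p = row;     /* walks along the row */
        unsigned i = cols;
        do {
            sink(*p++);
        } while (--i != 0);          /* count down to zero */
        row += stride;               /* one add replaces j*fbWidth */
    } while (--j != 0);
}
```

On the target, the call would be something like `flip_rows(frameBuffer, fbWidth, 100, 240, LCD_WriteData)`, assuming `LCD_WriteData` takes a single 16-bit argument as the disassembly suggests. Compiling with optimization enabled is likely to matter at least as much as the source-level changes.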

int readint(__packed int *data)
{
    return *data;
}

I have seen the __packed attribute in struct declarations to avoid padding. However, what is the benefit of using the __packed attribute in function arguments?

The author says that he has used __packed to tell the compiler that the integer may possibly not be aligned. What does that mean?

Edit: Will the following work with the gcc compiler?

int readint(__attribute__((packed)) int *data)
{
    return *data;
}

The __packed qualifier is a compiler-specific feature of the armcc C compiler, published by ARM. A full explanation is present in their documentation, but in brief, it indicates that no padding for alignment should be inserted into the qualified object, and that pointers with this qualifier should be accessed as if they might be misaligned. (This may cause slower code to be generated for some processors, so it should not be used gratuitously.)

Note that this is not the same as the GCC packed attribute, which only applies to struct and union type definitions.
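With GCC or any other standard C compiler, a common portable way to read a possibly misaligned integer is to copy its bytes with `memcpy`. The sketch below shows that alternative. It is not what the book's armcc example does, and the name `readint_unaligned` is made up for illustration.

```c
#include <string.h>

/* Read an int from an address that may not be suitably aligned.
 * memcpy copies byte by byte in the abstract machine, so no
 * misaligned word access is ever required by the C code itself. */
int readint_unaligned(const void *data)
{
    int value;
    memcpy(&value, data, sizeof value);
    return value;
}
```

Optimizing compilers commonly replace the `memcpy` call with whatever load sequence is safe on the target, so this is usually no slower than a compiler-specific qualifier.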

I am just recently trying to get into embedded programming and am looking for a few resources. I've done quite a bit of programming in higher level languages but have always been fascinated by how hardware actually works. As a forcing function to get myself to finally learn about hardware I recently purchased a BeagleBoard XM with the goal of programming it bare metal with assembly.

I've spent a week or so reading through the TRM in my spare time as well as searching the web for sample code. I've found a few resources which provide good examples for displaying data through the serial port but nothing much beyond that. I had hoped to find a few examples of people making use of interrupts and sdma but have yet to find any. My goal as a starter project is to write a very simple program which would take a character input from the serial port and echo it back to the screen. I would like to make it such that it made use of interrupts/sdma. Reading through the TRM it isn't apparent how to make this happen. Being completely new to this subject it is incredibly difficult to know exactly what I even need to look for in order to make sense of the documentation. I wondered if there are any experts out there who might be able to provide any sample asm code which makes use of a few of the hardware features of the BeagleBoard. After all, no amount of documentation can ever substitute a good concrete example of code.

Your problem is two-fold: understanding bare-metal and OS programming, and understanding the BeagleBoard hardware. For the latter, I recommend looking at other people's code alongside the datasheets. Reading only the datasheets is very time consuming. Start with u-boot code for the beagleboard:

Some other baremetal projects that are not BB-XM but I have found useful:

Your second problem is to understand low-level programming on ARM. I recommend these books. Note, however, that they were written for older architectures. Nevertheless, they should still be very useful to you:

The latter even has a chapter on writing your own small OS.
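As a first step toward the echo program described in the question, here is a hedged sketch of a polled (not interrupt-driven) echo for a 16550-style UART. The register layout is an assumption: registers spaced 4 bytes apart, the receive/transmit holding register at offset 0x00 and the line status register at offset 0x14. The base address and offsets must be confirmed against the TRM for your board. `uart_echo_once` is a hypothetical name.

```c
#include <stdint.h>

/* Word indices into a 16550-style register block with 32-bit spacing
 * (assumed layout: byte offsets 0x00 and 0x14). */
enum { UART_RHR_THR = 0, UART_LSR = 5 };

#define LSR_RX_READY   (1u << 0)   /* at least one received byte waiting */
#define LSR_THR_EMPTY  (1u << 5)   /* transmitter can accept a byte */

/* Echo one character if available. Returns the character, or -1 if
 * nothing has been received. */
int uart_echo_once(volatile uint32_t *uart)
{
    if ((uart[UART_LSR] & LSR_RX_READY) == 0)
        return -1;

    uint32_t ch = uart[UART_RHR_THR] & 0xFFu;        /* read received byte */

    while ((uart[UART_LSR] & LSR_THR_EMPTY) == 0)
        ;                                            /* wait for transmitter */

    uart[UART_RHR_THR] = ch;                         /* send it back */
    return (int)ch;
}
```

Calling this in a loop from the reset handler gives a working echo before any interrupt controller setup. The same read and write steps can later move into a UART interrupt handler.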