A Retargetable C Compiler

Christopher W. Fraser, David R. Hanson


This book examines the implementation of lcc, a production-quality, research-oriented retargetable compiler for the ANSI C programming language, designed at AT&T Bell Laboratories. It is aimed at professionals who want a detailed examination of a real-world compiler. It gives a thorough and accurate picture of lcc, and a line-by-line explanation of the code shows how the compiler is built. The accompanying disk holds the full source code for the lcc compiler, the three back ends and the code-generator.


Mentioned in questions and answers.

After over a decade of C/C++ coding, I've noticed the following pattern - very good programmers tend to have detailed knowledge of the innards of the compiler.

I'm a reasonably good programmer, and I have an ad-hoc collection of compiler "superstitions", so I'd like to reboot my knowledge and start from the basics.

Can anyone recommend links to online resources or favorite books? I'm particularly interested in C/C++ compiling, optimization, GCC and LLVM.

BTW: At university I was interested in compilers, so I signed up for a subject called "Parsing and Translation". Bad move. It ended up being the professor's vehicle for testing his advanced parsing and grammar theory textbook on unsuspecting 3rd year Comp Sci students. I was left with no practical knowledge, and even more confused than before.

As noted by Pete Eddy, Jack Crenshaw's tutorial is excellent for newbies. But if you want to see how a real, production C compiler works (one designed by brilliant engineers rather than created by throwing code at the wall until something stuck), get yourself a copy of Fraser and Hanson's A Retargetable C Compiler: Design and Implementation, which contains the source code to the very clean lcc compiler. Explanations of the design and implementation are mixed in with the code. It is not a first book for a beginner, but it will repay careful study, and you can get a used copy for $35.

For a longer blurb about lcc, see Compile C Faster on Linux.

The lcc web page also has links to a number of good textbooks. I don't know of an intro text that I really like, however.

P.S. Sorry you got ripped off at Uni.

If you want a dead-tree edition, try The Art of Compiler Design: Theory and Practice.

The @encode directive returns a const char * which is a coded type descriptor of the various elements of the datatype that was passed in. Example follows:

struct test
{ int ti ;
  char tc ;
} ;

printf( "%s", @encode(struct test) ) ;
// returns "{test=ic}"

I could see using sizeof() to determine primitive types, and if it was a full object, I could use the class methods to do introspection.

However, how does it determine each element of an opaque struct?

@Lothar's answer might be "cynical", but it's pretty close to the mark, unfortunately. In order to implement something like @encode(), you need a full-blown parser to extract the type information. Well, at least for anything other than "trivial" @encode() statements (i.e., @encode(char *)). Modern compilers generally have either two or three main components:

  • The front end.
  • The intermediate end (for some compilers).
  • The back end.

The front end must parse all the source code and convert the source text into an internal, "machine useable" form.

The back end translates the internal, "machine useable" form into executable code.

Compilers that have an "intermediate end" typically have one for a reason: they support multiple "front ends", possibly for completely different languages. Another reason is to simplify optimization: all the optimization passes work on the same intermediate representation. The gcc compiler suite is an example of a "three stage" compiler. llvm could be considered an "intermediate and back end" stage compiler: the "low level virtual machine" is the intermediate representation, and all the optimization takes place in this form. llvm is also able to keep the code in this intermediate representation right up until the last second, which allows for "link time optimization". The clang compiler is really a "front end" that (effectively) outputs llvm intermediate representation.
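To make the front end's role concrete for the @encode() question, here is a minimal sketch in C. It assumes the front end has already parsed the declaration into a small type tree; the `struct type` representation, the `append` helper, and `encode_type` are hypothetical names invented for this example and are not taken from gcc, clang, or lcc. Only the encoding characters (`i`, `c`, `*`, `^`, `{tag=...}`) follow the documented Objective-C type encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified type tree such as a front end might build. */
enum type_kind { T_INT, T_CHAR, T_POINTER, T_STRUCT };

struct type {
    enum type_kind kind;
    const char *tag;                   /* struct tag, T_STRUCT only */
    const struct type *target;         /* pointee type, T_POINTER only */
    const struct type *const *fields;  /* member types, T_STRUCT only */
    size_t nfields;
};

/* Append s to the string in buf (capacity cap); 0 on success, -1 if it does not fit. */
static int append(char *buf, size_t cap, const char *s)
{
    size_t used = strlen(buf), need = strlen(s);
    if (used + need + 1 > cap)
        return -1;
    memcpy(buf + used, s, need + 1);
    return 0;
}

/* Walk the type tree and append an @encode-style descriptor to buf. */
int encode_type(const struct type *t, char *buf, size_t cap)
{
    assert(t != NULL && buf != NULL);
    switch (t->kind) {
    case T_INT:
        return append(buf, cap, "i");
    case T_CHAR:
        return append(buf, cap, "c");
    case T_POINTER:
        /* char * is encoded as "*"; other pointers as "^" plus the pointee */
        if (t->target->kind == T_CHAR)
            return append(buf, cap, "*");
        if (append(buf, cap, "^") != 0)
            return -1;
        return encode_type(t->target, buf, cap);
    case T_STRUCT:
        if (append(buf, cap, "{") || append(buf, cap, t->tag) || append(buf, cap, "="))
            return -1;
        for (size_t i = 0; i < t->nfields; i++)
            if (encode_type(t->fields[i], buf, cap) != 0)
                return -1;
        return append(buf, cap, "}");
    }
    return -1;
}

/* The type tree a front end could have built for: struct test { int ti; char tc; }; */
static const struct type int_type  = { T_INT,  NULL, NULL, NULL, 0 };
static const struct type char_type = { T_CHAR, NULL, NULL, NULL, 0 };
static const struct type *const test_fields[] = { &int_type, &char_type };
static const struct type test_type = { T_STRUCT, "test", NULL, test_fields, 2 };
```

Calling `encode_type(&test_type, buf, sizeof buf)` on an empty string buffer fills it with `{test=ic}`. The walk itself is trivial; the hard part, and the reason a full parser is needed, is building the type tree from arbitrary C declarations in the first place.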

So, if you want to add @encode() functionality to an 'existing' compiler, you'd probably have to do it as a "source to source" compiler or preprocessor. This was how the original Objective-C and C++ compilers were written: they parsed the input source text and converted it to "plain C", which was then fed into the standard C compiler. There are a few ways to do this:

Roll your own

  • Use yacc and lex to put together an ANSI C parser. You'll need a grammar; ANSI C grammar (Yacc) is a good start. Actually, to be clear, when I say yacc, I really mean bison and flex, and also, loosely, the various other yacc- and lex-like C-based tools: lemon, dparser, etc.
  • Use perl with Yapp or EYapp, which are pseudo-yacc clones in perl. Probably better for quickly prototyping an idea than C-based yacc and lex. It's perl after all: regular expressions, associative arrays, no memory management, etc.
  • Build your parser with Antlr. I don't have any experience with this tool chain, but it's another "compiler compiler" tool that seems to be geared more towards Java developers. Freely available C and Objective-C grammars appear to exist.

Hack another tool

Note: I have no personal experience using any of these tools to do anything like adding @encode(), but I suspect they would be a big help.

  • CIL - No personal experience with this tool, but it is designed for parsing C source code and then "doing stuff" with it. From what I can glean from the docs, this tool should allow you to extract the type information you'd need.
  • Sparse - Worth looking at, but not sure.
  • clang - Haven't used it for this purpose, but allegedly one of the goals was to make it "easily hackable" for just this sort of stuff. In particular (and again, no personal experience), it should do the "heavy lifting" of all the parsing, letting you concentrate on the "interesting" part, which in this case would be extracting context and syntax sensitive type information and then converting that into a plain C string.
  • gcc Plugins - Plugins are a gcc 4.5 (which is the current alpha/beta version of the compiler) feature and "might" allow you to easily hook in to the compiler to extract the type information you'd need. No idea if the plugin architecture allows for this kind of thing.

Others

  • Coccinelle - Bookmarked this recently to "look at later". This "might" be able to do what you want, and "might" be able to do it without much effort.
  • MetaC - Bookmarked this one recently too. No idea how useful this would be.
  • mygcc - "Might" do what you want. It's an interesting idea, but it's not directly applicable to what you want. From the web page: "Mygcc allows programmers to add their own checks that take into account syntax, control flow, and data flow information."

Links.

Edit #1, the bonus links.

@Lothar makes a good point in his comment. I had actually intended to include lcc, but it looks like it got lost along the way.

  • lcc - The lcc C compiler. This is a C compiler that is particularly small, at least in terms of source code size. It also has a book, which I highly recommend.
  • tcc - The tcc C compiler. Not quite as pedagogical as lcc, but definitely still worth looking at.
  • poc - The poc Objective-C compiler. This is a "source to source" Objective-C compiler. It parses the Objective-C source code and emits C source code, which it then passes to gcc (well, usually gcc). Has a number of Objective-C extensions / features that aren't available in gcc. Definitely worth looking at.

It is a university task in my group to write a compiler for a C-like language. Of course I am going to implement a small part of our beloved C++.
The exact task is absolutely stupid, and the lecturer told us it needs to be self-compilable (it should be able to compile itself), so he meant that we should not use libraries such as Boost and the STL.
He also does not want us to use templates because they are hard to implement.
The question is: is it realistic for me, since I'm going to write this project on my own with a deadline between the end of May and the middle of June (this year), to implement not only templates but also nested classes, namespaces, and virtual function tables at the level of syntax analysis?
PS I am not a newbie in C++

I would like to stress a few points already mentioned and give a few references.

1) STICK TO THE 1989 ANSI C STANDARD WITH NO OPTIMIZATION.

2) Don't worry, with proper guidance, good organization and a fair amount of hard work this is doable.

3) Read The C Programming Language cover to cover.

4) Understand important concepts of compiler development from the Dragon Book.

5) Take a look at lcc, both the code and the book.

6) Take a look at Lex and Yacc (or Flex and Bison).

7) Writing a C compiler (up to the point where it can compile itself) is a rite of passage among programmers. Enjoy it.

I have a set of *.C files (embedded related). Could anyone please explain the steps and internal processes involved in compiling and then linking them to create the final executable? I need to know what a preprocessor and a compiler generally do to C source code.

Also, I just want to get an idea of the general structure of the final executable (e.g. headers followed by symbol tables, etc.).

Also, please let me know if anyone has already discussed this topic.

__Kanu

That's probably too in-depth for an SO question. If you really need to know how it all works, I suggest you read A Retargetable C Compiler. It goes through all the steps of building a C compiler (I believe this book covers the lcc compiler).

I'm working on a little compiler project. A prof. told me that burg or iburg is a good starting point.

Next semester I have to use it in the compiler construction course anyway, so I thought it would be good to start with iburg now. But there aren't any tutorials on how to start.

Where can I find good sources besides the paper linked in the readme of the zip file on http://code.google.com/p/iburg/ ?

You might want to find a copy of A Retargetable C Compiler: Design and Implementation by Fraser and Hanson. The book discusses lburg (a variant of iburg) in some detail with examples for x86, SPARC, and MIPS.
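For a feel of what an iburg-style matcher does before opening the book, here is a minimal hand-written sketch in C of bottom-up tree labeling with costs, the dynamic-programming idea that iburg-generated matchers apply at compile time. The operators, nonterminals, rule numbers, costs, and the `label` function are all made up for illustration; this is not lburg's or iburg's specification syntax or generated code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical toy grammar (rule number, pattern, cost):
 *   1  reg: REG            0
 *   2  con: CNST           0
 *   3  reg: ADD(reg, reg)  1
 *   4  reg: ADD(reg, con)  1   (add-immediate)
 *   5  reg: con            1   (chain rule: load constant into a register)
 */
enum op { OP_REG, OP_CNST, OP_ADD };
enum nonterm { NT_REG, NT_CON, NT_COUNT };

#define INF (INT_MAX / 4)   /* "no match"; small enough that sums cannot overflow */

struct node {
    enum op op;
    struct node *kid[2];
    int cost[NT_COUNT];     /* cheapest cost to derive each nonterminal */
    int rule[NT_COUNT];     /* rule achieving that cost, 0 if none */
};

/* Keep the rule only if it is cheaper than what was recorded so far. */
static void record(struct node *n, enum nonterm nt, int cost, int rule)
{
    if (cost < n->cost[nt]) {
        n->cost[nt] = cost;
        n->rule[nt] = rule;
    }
}

/* Label the tree bottom-up: children first, then every rule matching this node. */
void label(struct node *n)
{
    assert(n != NULL);
    for (int nt = 0; nt < NT_COUNT; nt++) {
        n->cost[nt] = INF;
        n->rule[nt] = 0;
    }
    switch (n->op) {
    case OP_REG:
        record(n, NT_REG, 0, 1);
        break;
    case OP_CNST:
        record(n, NT_CON, 0, 2);
        break;
    case OP_ADD:
        label(n->kid[0]);
        label(n->kid[1]);
        record(n, NT_REG, 1 + n->kid[0]->cost[NT_REG] + n->kid[1]->cost[NT_REG], 3);
        record(n, NT_REG, 1 + n->kid[0]->cost[NT_REG] + n->kid[1]->cost[NT_CON], 4);
        break;
    }
    /* Chain rule 5 applies to any node that can be derived as con. */
    record(n, NT_REG, 1 + n->cost[NT_CON], 5);
}
```

For the tree ADD(REG, CNST), labeling picks rule 4 for the root at total cost 1, because the add-immediate pattern avoids first loading the constant into a register (rule 3 would cost 2). A real generated matcher would follow the labeling pass with a top-down reduce pass that emits code for the chosen rules.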