Alfred V. Aho


Compilers: Principles, Techniques, and Tools, known to professors, students, and developers worldwide as the "Dragon Book," is available in a new edition. Every chapter has been completely revised to reflect developments in software engineering, programming languages, and computer architecture that have occurred since 1986, when the last edition was published. The authors, recognizing that few readers will ever go on to construct a compiler, retain their focus on the broader set of problems faced in software design and software development.



I am looking for some helpful books/tutorials on how to write your own compiler simply for educational purposes. I am most familiar with C/C++, Java, and Ruby, so I prefer resources that involve one of those three, but any good resource is acceptable.

The Dragon Book is definitely the "building compilers" book, but if your language isn't quite as complicated as the current generation of languages, you may want to look at the Interpreter pattern from Design Patterns.

The example in the book designs a regular-expression-like language and is well thought through, but, as the authors say, it's good for thinking through the process and really effective only on small languages. However, it is much faster to write an interpreter for a small language with this pattern than to learn about all the different types of parsers, yacc and lex, et cetera.

I think Modern Compiler Implementation in ML is the best introductory compiler-writing text. There's a Java version and a C version too, either of which might be more accessible given your language background. The book packs a lot of useful basic material (scanning and parsing, semantic analysis, activation records, instruction selection, RISC and x86 native code generation) and various "advanced" topics (compiling OO and functional languages, polymorphism, garbage collection, optimization, and static single assignment form) into relatively little space (~500 pages).

I prefer Modern Compiler Implementation to the Dragon Book because Modern Compiler Implementation surveys less of the field; instead it has really solid coverage of all the topics you would need to write a serious, decent compiler. After you work through this book you'll be ready to tackle research papers directly for more depth if you need it.

I must confess I have a serious soft spot for Niklaus Wirth's Compiler Construction. It is available online as a PDF. I find Wirth's programming aesthetic simply beautiful, though some people find his style too minimal (for example, Wirth favors recursive descent parsers while most CS courses focus on parser generator tools, and Wirth's language designs are fairly conservative). Compiler Construction is a very succinct distillation of Wirth's basic ideas, so whether you like his style or not, I highly recommend reading this book.

I concur with the Dragon Book reference; IMO, it is the definitive guide to compiler construction. Get ready for some hardcore theory, though.

If you want a book that is lighter on theory, Game Scripting Mastery might be a better book for you. If you are a total newbie at compiler theory, it provides a gentler introduction. It doesn't cover more practical parsing methods (opting for non-predictive recursive descent without discussing LL or LR parsing), and as I recall, it doesn't even discuss any sort of optimization theory. Plus, instead of compiling to machine code, it compiles to a bytecode that is supposed to run on a VM that you also write.

It's still a decent read, particularly if you can pick it up for cheap on Amazon. If you only want an easy introduction into compilers, Game Scripting Mastery is not a bad way to go. If you want to go hardcore up front, then you should settle for nothing less than the Dragon Book.


This is a pretty vague question, I think, just because of the depth of the topic involved. A compiler can, however, be decomposed into two separate parts: a top half and a bottom half. The top half generally takes the source language and converts it into an intermediate representation, and the bottom half takes care of the platform-specific code generation.

Nonetheless, one idea for an easy way to approach this topic (the one we used in my compilers class, at least) is to build the compiler in the two pieces described above. Specifically, you'll get a good idea of the entire process by just building the top-half.

Just doing the top half lets you get the experience of writing the lexical analyzer and the parser and go on to generating some "code" (that intermediate representation I mentioned). So it will take your source program, convert it to another representation, and do some optimization (if you want), which is the heart of a compiler. The bottom half will then take that intermediate representation and generate the bytes needed to run the program on a specific architecture. For example, the bottom half will take your intermediate representation and generate a PE executable.

A book on this topic that I found particularly helpful was Compilers: Principles, Techniques, and Tools (or the Dragon Book, due to the cute dragon on the cover). It's got some great theory and definitely covers context-free grammars in a really accessible manner. Also, for building the lexical analyzer and parser, you'll probably use the *nix tools lex and yacc. And, interestingly enough, the book called "lex and yacc" picked up where the Dragon Book left off for this part.

The quickest approach is through two books:

1990 version of An Introduction to Compiling Techniques: A First Course Using ANSI C, LeX, and YaCC by JP Bennett - a perfect balance of example code, parsing theory, and design; it contains a complete compiler written in C, lex, and yacc for a simple grammar

Dragon Book (older version) - mostly a detailed reference for the features not covered in the former book

What concepts in Computer Science do you think have made you a better programmer?

My degree was in Mechanical Engineering so having ended up as a programmer, I'm a bit lacking in the basics. There are a few standard CS concepts which I've learnt recently that have given me a much deeper understanding of what I'm doing, specifically:

Language Features

  • Pointers & Recursion (Thanks Joel!)

Data Structures

  • Linked Lists
  • Hashtables


Algorithms

  • Bubble Sorts

Obviously, the list is a little short at the moment so I was hoping for suggestions as to:

  1. What concepts I should understand,
  2. Any good resources for properly understanding them (as Wikipedia can be a bit dense and academic sometimes).

I find it a little funny that you're looking for computer science subjects but find Wikipedia too academic :D

Anyway, here goes, in no particular order:

As a recent graduate from a computer science degree I'd recommend the following:

Some of the OS concepts

 ( memory, I/O, scheduling, processes/threads, multithreading )

[a good book "Modern Operating Systems, 2nd Edition, Andrew S. Tanenbaum"]

Basic knowledge of Computer networks

[a good book by Tanenbaum: Computer Networks]

OOP concepts

Finite automata

A programming language (I learnt C first, then C++)

Algorithms ( time/space complexity, sorting, searching, trees, linked lists, stacks, queues )

[a good book Introduction to Algorithms]

So I found out that C(++) programs actually don't compile to plain "binary" (I may have gotten some things wrong here; in that case I'm sorry :D) but to a range of things (symbol table, OS-related stuff, ...), but...

  • Does assembler "compile" to pure binary? That means no extra stuff besides resources like predefined strings, etc.

  • If C compiles to something other than plain binary, how can that small assembler bootloader just copy the instructions from the HDD to memory and execute them? I mean, if the OS kernel, which is probably written in C, compiles to something other than plain binary, how does the bootloader handle it?

edit: I know that assembler doesn't "compile" because it only has your machine's instruction set - I didn't find a good word for what assembler "assembles" to. If you have one, leave it here as comment and I'll change it.

Let's take a C program.

When you run 'gcc' or 'cl' on the c program, it will go through these stages:

  1. Preprocessing (#include, #ifdef, trigraph analysis, encoding translations, comment removal, macro expansion...)
  2. Lexical analysis (producing tokens and lexical errors).
  3. Syntactic analysis (producing a parse tree and syntax errors).
  4. Semantic analysis (producing a symbol table, scoping information, and scoping/typing errors).
  5. Output into assembly (or another intermediate format).
  6. Optimization of the assembly (as above), probably still as ASM strings.
  7. Assembling of the assembly into some binary object format.
  8. Linking of the object code against whatever static libraries are needed, plus relocation if needed.
  9. Output of the final executable in ELF or COFF format.

In practice, some of these steps may be done at the same time, but this is the logical order.

Note that there's a 'container' of ELF or COFF format around the actual executable binary.

You will find that a book on compilers (I recommend the Dragon Book, the standard introductory book in the field) will have all the information you need and more.

As Marco commented, linking and loading is a large area, and the Dragon Book more or less stops at the output of the executable binary. Actually going from there to running on an operating system is a decently complex process, which Levine covers in Linkers and Loaders.

I've wiki'd this answer to let people tweak any errors/add information.

Possible Duplicate:
Learning to write a compiler

I looked around trying to find out more about programming language development, but couldn't find a whole lot online. I have found some tutorial videos, but not much for text guides, FAQs, advice etc. I am really curious about how to build my own programming language. It brings me to SO to ask:

How can you go about making your own programming language?

I would like to build a very basic language. I don't plan on having a very good language, nor do I think it will be used by anyone. I simply want to make my own language to learn more about operating systems, programming, and become better at everything.

Where does one start? Building the syntax? Building a compiler? What skills are needed? A lot of assembly and understanding of the operating system? What languages are most compilers and languages built in? I assume C.

I'd say that before you begin you might want to take a look at the Dragon Book and/or Programming Language Pragmatics. That will ground you in the theory of programming languages. The books cover compilation, and interpretation, and will enable you to build all the tools that would be needed to make a basic programming language.

I don't know how much assembly language you know, but unless you're rather comfortable with some dialect of assembly language programming, I'd advise against trying to write a compiler that compiles down to assembly code, as it's quite a challenge. You mentioned earlier that you're familiar with both C and C++, so perhaps you can write a compiler that compiles down to C or C++ and then use gcc/g++ or any other C/C++ compiler to convert the code to a native executable. This is what the Vala programming language does (it converts Vala syntax to C code that uses the GObject library).

As for what you can use to write the compiler, you have a lot of options. You could write it by hand in C or C++, or, to simplify development, you could use a higher-level language so that you can focus on writing the compiler rather than on the memory allocations and such that are needed for working with strings in C.

You could simply write the grammars and have Flex and Bison generate the lexical analyser and parser. This is really useful as it allows you to do iterative development and quickly get to a working compiler.

Another option you have is to use ANTLR to generate your parser, the advantage to this is that you get lots of target languages that ANTLR can compile to. I've never used this but I've heard a lot about it.

Furthermore if you'd like a better grounding on the models that are used so frequently in programming language compiler/scanner/parser construction you should get a book on the Models of Computation. I'd recommend Introduction to the Theory of Computation.

You also seem to show an interest in gaining an understanding of operating systems. This, I would say, is something separate from programming language design and should be pursued separately. The book Principles of Modern Operating Systems is a pretty good starting place for learning about that. You could start with small projects like creating a shell, or writing a program that emulates the ls command, and then move into more low-level things, depending on how thorough you are with the system calls in C.

I hope that helps you.

EDIT: I've learnt a lot since I wrote this answer. I was taking the online course on programming languages that Brown University was offering when I saw this answer featured there. The professor very rightly points out that this answer talks a lot about parsers but is light on just about everything else. I'd really suggest going through the course videos and exercises if you'd like to get a better idea of how to create a programming language.

I know that one of the differences between classes and structs is that struct instances get stored on the stack while class instances (objects) are stored on the heap.

Since classes and structs are very similar, does anybody know the reason for this particular distinction?

How the compiler and run-time environment handle memory management has grown up over a long period of time. The stack-memory vs. heap-memory allocation decision had a lot to do with what could be known at compile time and what could be known at run time. This was before managed runtimes.

In general, the compiler has very good control of what's on the stack, it gets to decide what is cleaned up and when based on calling conventions. The heap on the other hand, was more like the wild west. The compiler did not have good control of when things came and went. By placing function arguments on the stack, the compiler is able to make a scope -- that scope can be controlled over the lifetime of the call. This is a natural place to put value types, because they are easy to control as opposed to reference types that can hand out memory locations (pointers) to just about anyone they want.

Modern memory management changes a lot of this. The .NET runtime can take control of reference types and the managed heap through complex garbage collection and memory management algorithms. This is also a very, very deep subject.

I recommend you check out some texts on compilers -- I grew up on Aho, so I recommend that. You can also learn a lot about the subject by reading Gosling.

I'm reading through the Dragon Book and trying to solve an exercise that is stated as follows:

Write regular definitions for the following languages:

  • All strings of digits with no repeated digits. Hint: Try this problem first with a few digits, such as { 0, 1, 2 }.

Despite having tried to solve it for hours, I can't imagine a solution, beside the extremely wordy

d0 -> 0?
d1 -> 1?
d2 -> 2?
d3 -> 3?
d4 -> 4?
d5 -> 5?
d6 -> 6?
d7 -> 7?
d8 -> 8?
d9 -> 9?
d10 -> d0d1d2d3d4d5d6d7d8d9 | d0d1d2d3d4d5d6d7d9d8 | ...

Hence having to write 10! alternatives in d10. Since we shall write this regular definition, I doubt that this is a proper solution. Can you help me please?

So the question didn't necessarily ask you to write a regular expression; it asked you to provide a regular definition, which I interpret to include NFAs. It turns out it doesn't matter which you use, as all NFAs can be shown to be mathematically equivalent to regular expressions.

Using the digits 0, 1, and 2, a valid NFA would be the following (sorry for the crummy diagram):

(NFA diagram omitted: one state per last-scanned digit, with transitions only between states for different digits.)

Each state represents the last digit scanned in the input, and there are no loops on any of the nodes, therefore this is an accurate representation of a string with no repeated digits from the set {0,1,2}. Extending this is trivial (although it requires a large whiteboard :) ).

NOTE: I am making the assumption that the string "0102" IS valid, but the string "0012" is not.

This can be converted to a regular expression (although it will be painful) by using the algorithm described here.

I've been given a job of 'translating' one language into another. The source is too flexible (complex) for a simple line by line approach with regex. Where can I go to learn more about lexical analysis and parsers?

After taking (quite) a few compilers classes, I've used both The Dragon Book and C&T. I think C&T does a far better job of making compiler construction digestible. Not to take anything away from The Dragon Book, but I think C&T is a far more practical book.

Also, if you like writing in Java, I recommend using JFlex and BYACC/J for your lexing and parsing needs.

Yet another textbook to consider is Programming Language Pragmatics. I prefer it over the Dragon book, but YMMV.

If you're using Perl, yet another tool to consider is Parse::RecDescent.

If you just need to do this translation once and don't know anything about compiler technology, I would suggest that you get as far as you can with some fairly simplistic translations and then fix it up by hand. Yes, it is a lot of work. But it is less work than learning a complex subject and coding up the right solution for one job. That said, you should still learn the subject, but don't let not knowing it be a roadblock to finishing your current project.

How do recursive ascent parsers work? I have written a recursive descent parser myself but I don't understand LR parsers all that well. What I found on Wikipedia has only added to my confusion.

Another question is why recursive ascent parsers aren't used more than their table-based counterparts. It seems that recursive ascent parsers have greater performance overall.

The classical Dragon Book explains very well how LR parsers work. There is also Parsing Techniques: A Practical Guide, where you can read about them, if I remember well. The Wikipedia article (at least the introduction) is not right. LR parsers were created by Donald Knuth, who introduced them in his 1965 paper "On the Translation of Languages from Left to Right". If you understand Spanish, there is a complete list of books here, posted by me. Not all of those books are in Spanish, either.

Before trying to understand how they work, you must understand a few concepts like FIRST, FOLLOW, and lookahead. Also, I really recommend you understand the concepts behind LL (top-down) parsers before trying to understand LR (bottom-up) parsers.

There is a family of LR parsers, notably LR(k), SLR(k), and LALR(k), where k is how many tokens of lookahead they need. Yacc supports LALR(1) parsers, but you can make tweaks, not theory-based, to make it work with more powerful kinds of grammars.

About performance, it depends on the grammar being analyzed. They execute in linear time, but how much space they need depends on how many states you build for the final parser.

I'm interested in writing an x86 assembler for a hobby project.

At first it seemed fairly straightforward to me, but the more I read into it, the more unanswered questions I find myself having. I'm not totally inexperienced: I've used MIPS assembly a fair amount and I've written a toy compiler for a subset of C in school.

My goal is to write a simple, but functional x86 assembler. I'm not looking to make a commercially viable assembler, but simply a hobby project to strengthen my knowledge in certain areas. So I don't mind if I don't implement every available feature and operation.

I have many questions such as: Should I use a one-pass or two-pass method? Should I use ad-hoc parsing or define formal grammars and use a parser-generator for my instructions? At what stage, and how do I resolve the addresses of my symbols?

Given my requirements, can anyone suggest some general guidelines for the methods I should be using in my pet-project assembler?

You may find the Dragon Book to be helpful.

The actual title is Compilers: Principles, Techniques, and Tools.

Check out the Intel Architectures Software Developer's Manuals for the complete documentation of the IA-32 and IA-64 instruction sets.

AMD's architecture technical documents are available on its website as well.

Linkers and Loaders is a good introduction to object formats and linking issues. (The unedited original manuscript is also available online.)

Suppose I crafted a set of classes to abstract something and now I worry whether my C++ compiler will be able to peel off those wrappings and emit really clean, concise and fast code. How do I find out what the compiler decided to do?

The only way I know is to inspect the disassembly. This works well for simple code, but there are two drawbacks: the compiler might do it differently when it compiles the same code again, and machine-code analysis is not trivial, so it takes effort.

How else can I find how the compiler decided to implement what I coded in C++?

You want to know if the compiler produced "clean, concise and fast code".

"Clean" has little meaning here. Clean code is code which promotes readability and maintainability -- by human beings. Thus, this property relates to what the programmer sees, i.e. the source code. There is no notion of cleanliness for binary code produced by a compiler that will be looked at by the CPU only. If you wrote a nice set of classes to abstract your problem, then your code is as clean as it can get.

"Concise code" has two meanings. For source code, this is about saving the scarce programmer eye and brain resources, but, as I pointed out above, this does not apply to compiler output, since there is no human involved at that point. The other meaning is about code which is compact, thus having lower storage cost. This can have an impact on execution speed, because RAM is slow, and thus you really want the innermost loops of your code to fit in the CPU level 1 cache. The size of the functions produced by the compiler can be obtained with some developer tools; on systems which use GNU binutils, you can use the size command to get the total code and data sizes in an object file (a compiled .o), and objdump to get more information. In particular, objdump -x will give the size of each individual function.

"Fast" is something to be measured. If you want to know whether your code is fast or not, then benchmark it. If the code turns out to be too slow for your problem at hand (this does not happen often) and you have some compelling theoretical reason to believe that the hardware could do much better (e.g. because you estimated the number of involved operations, delved into the CPU manuals, and mastered all the memory bandwidth and cache issues), then (and only then) is it time to have a look at what the compiler did with your code. Barring these conditions, cleanliness of source code is a much more important issue.

All that being said, it can help quite a lot if you have a priori notions of what a compiler can do. This requires some training. I suggest that you have a look at the classic Dragon Book; otherwise you will have to spend some time compiling example code and looking at the assembly output. C++ is not the easiest language for this; you may want to begin with plain C. Ideally, once you know enough to be able to write your own compiler, you know what a compiler can do, and you can guess what it will do with a given piece of code.

I am given two functions for finding the product of two matrices:

 void MultiplyMatrices_1(int **a, int **b, int **c, int n){
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++)
              for (int k = 0; k < n; k++)
                  c[i][j] = c[i][j] + a[i][k]*b[k][j];
 }

 void MultiplyMatrices_2(int **a, int **b, int **c, int n){
      for (int i = 0; i < n; i++)
          for (int k = 0; k < n; k++)
              for (int j = 0; j < n; j++)
                  c[i][j] = c[i][j] + a[i][k]*b[k][j];
 }

I ran and profiled two executables using gprof, each with identical code except for this function. The second of these is significantly (about 5 times) faster for matrices of size 2048 x 2048. Any ideas as to why?

I believe that what you're looking at is the effects of locality of reference in the computer's memory hierarchy.

Typically, computer memory is segregated into different types that have different performance characteristics (this is often called the memory hierarchy). The fastest memory is in the processor's registers, which can (usually) be accessed and read in a single clock cycle. However, there are usually only a handful of these registers (usually no more than 1KB). The computer's main memory, on the other hand, is huge (say, 8GB), but is much slower to access. In order to improve performance, the computer is usually physically constructed to have several levels of caches in-between the processor and main memory. These caches are slower than registers but much faster than main memory, so if you do a memory access that looks something up in the cache it tends to be a lot faster than if you have to go to main memory (typically, between 5-25x faster). When accessing memory, the processor first checks the memory cache for that value before going back to main memory to read the value in. If you consistently access values in the cache, you will end up with much better performance than if you're skipping around memory, randomly accessing values.

Most programs are written in a way where, if a single byte in memory is read, the program later reads multiple values from around that memory region as well. Consequently, these caches are typically designed so that when you read a single value from memory, a small block of memory around that value (a cache line, commonly around 64 bytes) is also pulled into the cache. That way, if your program reads the nearby values, they're already in the cache and you don't have to go to main memory.

Now, one last detail - in C/C++, arrays are stored in row-major order, which means that all of the values in a single row of a matrix are stored next to each other. Thus in memory the array looks like the first row, then the second row, then the third row, etc.

Given this, let's look at your code. The first version looks like this:

  for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
          for (int k = 0; k < n; k++)
              c[i][j] = c[i][j] + a[i][k]*b[k][j];

Now, let's look at that innermost line of code. On each iteration, the value of k is increasing. This means that when running the innermost loop, each iteration of the loop is likely to have a cache miss when loading the value of b[k][j]. The reason is that because the matrix is stored in row-major order, each time you increment k you're skipping over an entire row of the matrix and jumping much further into memory, possibly far past the values you've cached. However, you won't have a miss when looking up c[i][j] (since i and j don't change), nor will you probably miss a[i][k], because the values are in row-major order and, if the value of a[i][k] is cached from the previous iteration, the value of a[i][k] read on this iteration is in an adjacent memory location. Consequently, on each iteration of the innermost loop, you are likely to have one cache miss.

But consider this second version:

  for (int i = 0; i < n; i++)
      for (int k = 0; k < n; k++)
          for (int j = 0; j < n; j++)
              c[i][j] = c[i][j] + a[i][k]*b[k][j];

Now, since you're increasing j on each iteration, let's think about how many cache misses you'll likely have on the innermost statement. Because the values are in row-major order, the value of c[i][j] is likely to be in-cache, because the value of c[i][j] from the previous iteration is likely cached as well and ready to be read. Similarly, b[k][j] is probably cached, and since i and k aren't changing, chances are a[i][k] is cached as well. This means that on each iteration of the inner loop, you're likely to have no cache misses.

Overall, this means that the second version of the code is unlikely to have cache misses on each iteration of the loop, while the first version almost certainly will. Consequently, the second loop is likely to be faster than the first, as you've seen.

Interestingly, many compilers are starting to have prototype support for detecting that the second version of the code is faster than the first. Some will try to automatically rewrite the code to maximize parallelism. If you have a copy of the Purple Dragon Book, Chapter 11 discusses how these compilers work.

Additionally, you can optimize the performance of this loop even further using more complex loops. A technique called blocking, for example, can be used to notably increase performance by splitting the array into subregions that can be held in cache longer, then using multiple operations on these blocks to compute the overall result.

Hope this helps!

Recently, I was going around looking for ideas on what I can build using C this summer and I came across this post: Interesting project to learn C?

Implement a programming language. This doesn't have to be terribly hard - I did the language that must not be named - but it will force you to learn a lot of the important parts of C. If you don't want to write a lexer and/or parser yourself, you can use lex/flex and yacc/bison, but if you plan on that you might want to start with a somewhat smaller project.

I was kinda intrigued about the implementing a programming language answer and I'm wondering how do I go about starting this? I've gone through the whole K&R book and I've done some of the exercises as well. I also have a bit of experience in C++ and Java if that matters. Any tips? Thanks!

Well, I think something like that is really hard to do, but it would also be a great pet project. You should have notions of parsers, lexers, flow control, paradigms (imperative, functional, OO), and many other things.

Many people say the Dragon Book is one of the best books for this. Maybe you can take a look at it :)

Good Luck!

This is just a question out of curiosity, since I have been needing to get more and more into parsing and using regexes lately. For questions I come across in my searches relating to parsing of some sort, someone always ends up saying, when asked something relating to regexes, "regex isn't good for that, use such-and-such parser instead". As I have come to better understand regexes, I think most things are possible with them; it's just rather complex and time-consuming, since you have to account for many different possibilities, and of course they have to be combined with conditional statements and loops to build any sort of parser. So I'm wondering whether regexes are what is used to build most parsers, or whether some other method is used. I ask because I may need to build some fairly complex custom parsers where there isn't necessarily an existing one to use.

Thanks for any info, as I can't seem to find a direct answer to this.

Well, building a parser is pretty complex, and you can use regexes, but they're not the only thing you use. I suggest reading the Dragon Book.

These days, in my opinion, you should use a parser generator: you can do it all from scratch, but it's neither simple nor quick. You have to consider, generally speaking, regexes and finite state automata for lexical analysis; context-free grammars, LL parsers, bottom-up parsers, and LR parsers for syntax analysis; etc.

I'm preparing for an exam concerning languages, grammars, parsing and compilers. It's not really my cup of tea and most resources I find use the language of mathematics to define the different terms of the trade and explain the different concepts I need to know rather than stick with English or French, which I would very much prefer. Therefore, I'm having some trouble both with finding the motivation to continue studying and with simply understanding the theory. So here is my question: Do any of you know where I could find a "fun" way of learning all this? Or at the very least, maybe a more "concrete" and less "mathematical" way of handling this subject.

I need to cover the following so anything on these subjects is welcome!

  • Parsing (LR, LL, ...)
  • Grammars (Context-free, deterministic, ...)
  • Syntax analysis
  • Static flow analysis
  • Impact analysis concerning software maintenance and dependency on user interfaces
  • Dynamic analysis

Here are some resources which could be considered "fun" (with an emphasis on the quotation marks) ways to learn about a technical subject, just to get a sense of what I'm looking for.

How long do you have to prepare? The "best" way to learn compilers is to dig into them, and the best way to do that is to use the best book on compilers EVER WRITTEN: the Dragon Book. It's old, but awesome. It's not cheap, but it is, quite possibly, the most concrete and least mathematical way to learn about the magical compiler.

It doesn't have any flashing lights and it isn't set in an awesome font like the Ruby guide, but it's in the top 10 Books Every Programmer Should Read.

I am trying to modify the value of local variable through another called function, but I am not able to figure out what all values pushed onto the stack are.

#include <stdio.h>
#include <string.h>

void fun()
{
  int i;
  int *p = &i;
  int j;
  /* Stack Frame size is j int pointers. */
  for (j = 0; ; j++)
    if (*(p + j) == 10) { *(p + j) = 20; break; }
}

int main()
{
  int i = 10;
  fun();
  printf("\n %d \n", i);
}

How exactly does j in fun() end up equal to 12? I am trying to understand what values are pushed onto the stack. More specifically, can we change the value of i in main() without using a for loop in fun(), and is it possible to predict the value of j inside fun()?

When you have to access local variables from other function calls, I think you had better redesign your code.

Theoretically, you can directly modify the i of main() in fun() if you completely understand how the compiler lays out the activation records of function calls on the run-time stack. You can read "Compilers: Principles, Techniques, and Tools" for details.


The value of j depends on the run-time stack distance between the int i; in fun() and the int i = 10; in main(). In this case, when fun() is called, their relative distance on the stack is just 12 ints; that is why j comes out as 12, and why *(p + j) = 20; actually changes the i of main(). If you change your code by adding int a = 14; as follows, you will find the value of j changes, because the activation record layout on the run-time stack has changed.

#include <stdio.h>

void fun()
{
  int i;
  int *p = &i;
  int j;
  /* Stack Frame size is j int pointers. */
  for (j = 0; ; j++)
    if (*(p + j) == 10) { *(p + j) = 20; break; }
}

int main()
{
  int i = 10;
  int a = 14;  /* added to change the relative address of i in main and i in fun */
  fun();
  printf("\n %d \n", i);
}

I would like to write an SQL Parser. I was considering a high level language (probably Python).

Is there a good starting point for some theoretical concepts, a tutorial or something more generic on writing parsers ?

ANTLR can be a good choice. Or GOLD Parsing System with SQL grammar. Dragon book has good theoretical background if you want to build your own parser.

Is there any way to interpret Reverse Polish Notation into "normal" mathematical notation when using either C++ or C#? I work for an engineering firm, so they use RPN occasionally and we need a way to convert it. Any suggestions?

One approach is to take the example from the second chapter of the dragon book which explains how to write a parser to convert from infix to postfix notation, and reverse it.
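For the direction the question actually asks about (RPN back to parenthesized infix), a stack suffices: pop two operands per operator and rebuild a string. An illustrative Python sketch, assuming single-token operands and binary operators only:

```python
def rpn_to_infix(tokens):
    """Convert a list of RPN tokens to a fully parenthesized infix string."""
    stack = []
    for tok in tokens:
        if tok in ("+", "-", "*", "/"):
            right = stack.pop()   # top of stack is the right operand
            left = stack.pop()
            stack.append(f"({left} {tok} {right})")
        else:
            stack.append(tok)
    assert len(stack) == 1, "malformed RPN expression"
    return stack[0]

print(rpn_to_infix(["3", "4", "+", "2", "*"]))   # ((3 + 4) * 2)
```

The full parenthesization is deliberate: it sidesteps precedence questions entirely, at the cost of more parentheses than a human would write.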

Possible Duplicates:
Methodologies for designing a simple programming language
Learning to write a compiler

I would like to write a programming language with a syntax similar to QBasic but even simpler. I want it to be for beginning programmers. Its simplicity will encourage aspiring programmers not to give up and get them interested in programming. For example: Instead of QBasic's PRINT "Hello World!"

I would use

Write "Hello World!"

or a little more like VB

Write ("Hello World")

How would I go about adapting the basic syntax to make my language?

This is not a simple task. Language parsing and compiler theory are pretty hefty subjects. Lots o' math. You also have to decide what platform you want to target, which will also determine whether your language is fully compiled (e.g. C/C++, Pascal), compiled into bytecode (e.g. Python, Java), or interpreted at runtime (e.g. VBScript, JavaScript). For specifying the language itself, brush up on Backus-Naur Form (BNF).

To help you along, there are several robust parser generators out there, including:

  • Lex/Yacc (Flex/Bison are the GNU versions) - The old-school industry standard, for developing a compiler in C/C++
  • ANTLR - If you're interested in creating a compiler using Java
  • Boost.Spirit - A different approach, allowing specification of the language using C++ itself.

And many more. A comparison can be found here, while yet another list can be found here

If you're really interested in the full theory, you want to check out The Dragon Book.

But I must reiterate: This is a big subject. There are many, many tools to help you along the way, but the rabbit hole goes pretty deep.

Possible Duplicate:
Learning to write a compiler

I know this is a broad question to ask, but where could I start learning how compilers actually work and how programming languages are made? I mean not how you use Java or Ruby, but how people actually make them. I will not try to replicate these languages in any way, but I want to understand the concepts and theory behind them. So what I need is either some directions on what I should search for, or, even better and more appreciated, book recommendations.


Jonathan Nash.

You could take a look at the Dragon Book:

Remember, this is using Python. Well, I was fiddling around with an app I made called Pyline today. It is a command-line-like interface with some cool features. However, I had an idea while making it: since it's like an "OS", won't it have its own language?

Well, I have seen some articles online on how to make an interpreter, a parser, and a compiler, but they weren't really readable for me. All I saw was a crapload of code. I am one of those guys who need comments or a readme or SOME form of communication towards the user beyond the code itself, so I think that Stack Overflow would be great for a teenager like me. Can I get some help?

You need some grounding first in order to actually create a programming language. I strongly suggest picking up a copy of Programming Language Pragmatics, which is quite readable (much more so than the Dragon book) and suitable for self study.

Once you are ready to start messing with parsers, ANTLR is the "gold" standard for parser generators in terms of usability (though flex+bison/yacc are quite capable).

The project: I want to build a LaTeX-to-MathML translator in PHP. Why? Because I'm a mathematician, and I want to publish math on my Drupal site. It doesn't have to translate all of LaTeX, since the basic document-level stuff is ably handled by the CMS and wouldn't be written in LaTeX to begin with; it just has to translate math written in LaTeX into math written in MathML. Although I feel as though I've done my due diligence, this doesn't seem to exist already. Maybe I'm wrong---if you know of something that would serve this purpose, by all means let me know, and thank you in advance. But assuming it doesn't exist, I guess I have to go write it myself.

Here's the thing, though: I've never done anything this ambitious. I don't really know where to begin. I've used PHP for years, but just to do the standard "build a CMS with PHP and MySQL"-type of stuff. I've never attempted anything as seemingly sophisticated as translation from one language to another.

I'm just dumb enough to consider doing it with regex---after all, LaTeX is a much more formal language, and it doesn't allow for nearly the kinds of pathological edge-cases, as say, HTML. But on the other hand, I'm just smart enough to realize this is probably a terrible idea: now I have two problems, and I sure don't want to end up like this guy.

So if that's not the way to go (right?), what is? How should I start thinking about this problem? Am I essentially writing a LaTeX compiler in PHP, and if so, what do I need to know to do that (like, should I just go read the Purple Dragon book first?)?

I'm both really excited and pretty intimidated by the prospect of this project, but hey, this is how we all learn to be programmers, right? If something we need doesn't exist, we go and build it, necessity is the mother of... you get the point. Tremendous thanks to everyone in advance for any and all guidance you can offer.

Possible Duplicate:
Learning to write a compiler

I need to come up with a dummy SQL like language which has very limited features. I have never done any compiler or parsing stuff before. Can anyone let me know a good point to start may be a link or a example of the same. I am so clueless.

I will be using this dummy language with C/C++ as my primary language.


I did a compiler construction course last year and we used the book

Compiler Construction by Kenneth C. Louden

It is very detailed with a good theoretical background. At the same time the author gives enough examples and uses very informative figures, so that you're never lost while learning. Eventually a compiler in C for a toy language is listed in the later chapters.

I really liked it!

The Dragon Book is often considered a good starting point. However, I will also recommend the ANTLR book

Does anyone know where to find good online resources with examples of how to make grammars and parse trees? Preferably introductory materials. Info that is n00b friendly, haven't found anything good with Google myself.

Edit: I'm thinking about theory, not a specific parser software.

Not online, but maybe you should take a look at Compilers: Principles, Techniques, and Tools (2nd Edition) by Aho et al. This is a standard text that has been evolving for 30 years (if you count the 1st Dragon Book, published in 1977).

This project is for educational use and I am very well aware that excellent compilers already exist.

I am currently fighting my way through the famous Dragon Book and just started to implement my own Lexer. It works surprisingly well, except for literals. I do not understand how to handle literals using symbol (lookup) tables, and the book doesn't seem to cover that very well:

In the following code 60 is a numeric literal:

int myIdentifier = 60;

The Dragon Book says:

Technically speaking, for the lexeme 60 we should make up a token like (number,4), where 4 points to the symbol table for the internal representation of integer 60 [...]

Understood - I created the following Token:

<enum TokenType, int lookupIndex> 
//TokenType could be 'number' and lookupIndex could be any int

And stored the literal in a dictionary like this:

Dictionary<int literal, int index> 
//literal could be '60' and index could be anything

Since the literal itself is the key in the Dictionary, that allows me to quickly check if future literals have already been added to the symbol table (or not).

The Parser then receives the Tokens from the Lexer and should be able to identify the literals in the symbol table.


  1. Why should my Token contain a lookup-index instead of containing the literal itself?
    Wouldn't that be quicker...
  2. How should the Parser be able to quickly find the literal values inside the symbol-table when the lookup-index is the value of the dictionary?
    (I cannot make the lookup-index the dictionary key, because then the Lexer would have to search against the values of the dictionary, which is not very performant either)
    Could a multi-indexed-dictionary be a solution? I guess not...
  3. Must I create a symbol-table for every type of literal then?
    E.g.: Dictionary<int literal, int index>
    and Dictionary<double literal, int index>
    and Dictionary<char literal, int index> etc.
  4. Maybe I am completly on the wrong track with literals. Feel free to post any better solutions.
  1. Why should my Token contain a lookup-index instead of containing the literal itself? Wouldn't that be quicker?

    Sure, it would probably be quicker. But then every literal would be a different value. Now, most programmers have the expectation that if they use, for example, "this longish string" twice in the same program, the compiler will be clever enough to only emit a single copy of that string in the final executable. And it would also be, shall we say, surprising if when you decompiled the code, you found 273 different storage locations for the constant 1, because every time the compiler saw a += 1, it created a new constant.

    The easiest way to ensure that constant literals are only emitted once is to keep them in an associative container indexed by the value of the literal.

    As @sepp2k points out in a comment, most hardware allows the use of small integer constants as direct operands, and sometimes even not-so-small constants. So the statement about the constant 1 above is a bit of an exaggeration. You might well be able to handle integers differently, but it might not be worth the trouble.

  2. How should the Parser be able to quickly find the literal values inside the symbol-table when the lookup-index is the value of the dictionary?

    That depends a lot on the precise datastructure you use for literal tables (which I don't like to call symbol tables, but admittedly the concepts are related.) In many languages, you will find that your standard library containers are not a perfect match for the problem, so you will either need to adapt them to the purpose or write replacements.

    Still, it's not terribly complicated. One possibility is to use the combination of a map<literalType, int> and a vector<literalType>. Here the map associates literal values with indices into the vector. When you find a new literal value, you enter it into the map associated with the current size of the vector, and then push the value onto the vector (which will make its index correspond to the index you just inserted into the map.)

    That's not entirely ideal for large constants like strings because between the key in the map and the value in the vector, the constant is stored twice. When you're starting, I'd recommend just suppressing your annoyance about this duplication; later, if it proves to be a problem, you can find a solution.

    If you were using C++, you could use an (unordered) set instead of a map, and use a reference (pointer) to the newly-added element instead of an index. But I don't think that feature is available in many languages, and also pointers are sometimes awkward in comparison to indices. In some languages you could put all the values into the vector and then keep a set whose keys were indices into the vector. This requires that a lookup of the set can be done with something other than the key type; for some reason, this feature is available in very few datastructure libraries.

    And, yes, a doubly-indexed datastructure could be used, if you have one of those handy. (In effect, the map+vector solution is a doubly-indexed datastructure.)

  3. Must I create a symbol-table for every type of literal then?

    Maybe. How many kinds of literals do you have?

    You'll probably end up using type-tagged enumerations ("discriminated unions"), both for variables and for constants. (Again, not all languages have discriminated unions in their standard library, which is truly sad; if your implementation language lacks this basic feature, you'll need to implement it.) It should certainly be possible for a discriminated union instance to be used as a key in an associative data structure, so there is nothing stopping you, in principle, from keeping all your literals in a single data structure. If you have appropriate types, that's definitely what I'd recommend, at least when starting.

    Note that when you are ultimately emitting the literals as object code, you're really more interested in their bit representation and alignment than their semantics. If two constants of completely different types happen to have the same bit representation, then you could use the same storage location for both of them. If you have multiple widths of integer datatypes, then you'd probably want to keep all of them in a single literal table, precisely to take advantage of this optimization. No need to store a 1 of every width :). Occasionally you will find other cases where two literals of different types have the same representation, but it's probably not common enough to go out of your way to deal with it. (However, on IEEE hardware, floating point and integer zeros have the same representation, and that is usually the same representation as a NULL pointer, so you might want to special case zeros.)

    All in all, it's a judgement call. How complicated is it to use a discriminated union as a key? How much storage could you save by having associative containers with specific key types, and does it matter? Will you want to iterate over all literal constants of the same type (answer: probably) and how easy is that to do with your datastructures?

    If you use a well-designed internal API, you will be able to change your mind about the internal representation of your literal tables. So I'd use this experiment as an opportunity to try good API design.

  4. Anything else?

    Good luck with your project. Learn and enjoy!
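The map-plus-vector scheme from point 2 is easy to prototype. Here is an illustrative Python sketch (the class name and API are made up for the example; a dict plays the map, a list plays the vector, and a (type, value) tuple stands in for the discriminated union from point 3):

```python
class LiteralTable:
    """Interns literal constants: equal literals share one index."""
    def __init__(self):
        self._index = {}    # (type_tag, value) -> index into self._values
        self._values = []   # index -> (type_tag, value)

    def intern(self, type_tag, value):
        """Return the existing index for this literal, or assign a new one."""
        key = (type_tag, value)
        if key not in self._index:
            self._index[key] = len(self._values)
            self._values.append(key)
        return self._index[key]

    def lookup(self, index):
        """O(1) reverse lookup, for the parser and later phases."""
        return self._values[index]

table = LiteralTable()
i = table.intern("int", 60)
j = table.intern("int", 60)   # same literal -> same index
assert i == j
print(table.lookup(i))         # ('int', 60)
```

Because the tuple key carries the type tag, one table serves every literal type, which matches the single-data-structure recommendation in point 3.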

I want to write an interpreter for a scripting language in javascript. Something that could run this script:

set myVariable to "Hello World"
repeat 5 times with x
    set myVariable to myVariable plus " " plus x
popup "myVariable is: " plus myVariable

The equivalent javascript of the above would be:

var myVariable = "Hello World";
for (var x=1; x<=5; x++) {
    myVariable += " " + x;
}
alert("myVariable is: " + myVariable);

I don't want to translate from one to the other, I want to write a javascript program to interprete and execute the script directly. How can this be done?


I'm looking for a tutorial (preferably in javascript but C would do) that will walk me through this. I guess I'm looking for one that does not use any external tools as the tools seem to be my issue. I don't want to use something that calls libraries and a bunch of pre-built code. I want to see the whole thing, done from scratch.

OK, I'll actually try to tackle this question somewhat... although there is no way I could possibly distill everything you need to know into a few sentences or even paragraphs.

First, you should gain an understanding / familiarity with what's involved in building a compiler. You say you want to "interpret" the code - but, I think what you really want is to compile the code to Javascript (and in Javascript as well).

Wikipedia has a great page on the topic:

The gist of the thing is:

1.) Convert the text (source code) into some sort of in-memory data structure (abstract syntax tree - AST) that actually lets you reason about the structure of the program you've been given.

2.) Given that structure, produce your output (Javascript, in this case).

To break down step 1 a bit further: define your grammar, i.e., what is valid syntax in this new language of yours and what is not. Typically, it's best to reason about this sort of thing with BNF on paper (or whatever syntax the tools you use prefer, although (E)BNF is the standard). The challenging part about this step is not only doing the grunt work of parsing the source code, but also making sure you've come up with a grammar that is unambiguous and readily parsable. Those two requirements are actually somewhat more difficult to nail down than you might think.

I've built an LALR parser generator in C#, and I can tell you: unless you've built one before, it's not a trivial task. Beyond that, there are so many good ones that, unless you really want to know how it works for the fun of it or because you're into that kind of thing, it makes a whole lot more sense to use a parser generator someone else wrote. The great thing about a parser generator is that it will take the syntax definition you've come up with and convert it into a program that will spit out an AST at the other end. That's a HUGE amount of work that was just done for you. And, in fact, there are a few for JavaScript:

PEG.js – Parser Generator for JavaScript

JS/CC Parser Generator Project Homepage

On to step 2. This step can be very basic for something like infix expressions, or it can get very complex. But the idea is: given the AST, "convert" it into your output format (JavaScript). Typically you need to check for things that aren't caught by the "simple" syntax checking that occurs in the parser. For example, even in your sample code there are a number of things that could go wrong. In the part where you say plus x, what would happen if the developer never defined x? Should this be an error? Should x default to some value? This is where your language really comes to life. And, to back-track for a minute: your time needs to be spent on this step, not on the parser. Use a tool for that, seriously. You're talking about starting a large and challenging project; don't make it even harder for yourself. To add to all this, there is often a need to make multiple "passes" through the AST. For example, the first pass may look for and set up "module" definitions, the second pass may look for and set up "namespaces", another pass may set up classes, etc. These further refinements of the structure of the final application are used in later steps to determine whether a reference to a particular class/variable/module/etc. is valid (it actually exists or can be referenced).
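To make steps 1 and 2 concrete at toy scale, here is an illustrative Python sketch (the grammar and AST shape are invented for the example): it parses additive expressions into an AST, then a second pass walks the AST and emits JavaScript source:

```python
import re

def parse(src):
    """Step 1: parse 'a + b + c' style additive expressions into a tiny AST."""
    tokens = re.findall(r"\w+|\+", src)
    pos = 0
    def atom():
        nonlocal pos
        tok = tokens[pos]; pos += 1
        return ("num", int(tok)) if tok.isdigit() else ("var", tok)
    node = atom()
    while pos < len(tokens) and tokens[pos] == "+":
        pos += 1
        node = ("add", node, atom())   # left-associative
    return node

def emit_js(node):
    """Step 2: walk the AST and produce JavaScript source text."""
    kind = node[0]
    if kind == "num":
        return str(node[1])
    if kind == "var":
        return node[1]
    return f"({emit_js(node[1])} + {emit_js(node[2])})"

print(emit_js(parse("1 + 2 + x")))   # ((1 + 2) + x)
```

A real project would replace the hand-rolled `parse` with PEG.js or JS/CC output, exactly as the answer recommends; the `emit_js` walk is where the semantic checks (is `x` defined?) would live.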

There are a few really great books on compilers. The infamous "dragon book" is one.

Say I define, instantiate, and use an adder functor like so:

class SomeAdder {
public:
        SomeAdder(int init_x): x(init_x) {}
        void operator()(int num) { cout << x + num << endl; }
        int x;
};

SomeAdder a = SomeAdder(3);
a(5); //Prints 8

SomeAdder b(5);
b(5); //Prints 10

The constructor and the overloaded () operator are both called using double parenthesis and have the same types of parameters. How would the compiler determine which function to use during the instantiations of SomeAdder and the "function calls", as to implement the correct behavior? The answer seems like it would be obvious on the surface, but I just can't wrap my head around this thought.

Thanks for your time!

C++ has a grammar, and from that the compiler will know (gross simplification) when a type is being instantiated, and therefore a constructor should be called, versus when an overloaded operator() is being called on an instance of a class.

How the grammar is used to determine this probably requires a course on compilers, for which the Dragon Book is the standard text. If you are curious, you can also check out the C++ Grandmaster Certification, whose goal is to build a C++ compiler.

I am working on a project to automatically convert a custom language to Java and have been asked to do some basic optimizations of the code during the conversion process. For example, the custom code may have something like:

if someFunction(a, b) > x:
    do something
    return someFunction(a, b) + y

in this instance, someFunction is called multiple times with the same inputs, so additional performance can be obtained by caching the value of someFunction() and only calling it once. Thus, an "optimized" version of the above code may look something like:

var1 = someFunction(a, b)

if var1 > x:
    do something
    return var1 + y

Currently, this is done by hand during the conversion process. I run a program to convert the code in the custom language to Java and then manually examine the converted code to see what can be optimized. I want to automate the optimization process since these problems creep up again and again. The people who are writing the code in the custom language do not want to worry about such things, so I can't ask them to just make sure that the code they give me is already optimized.

What are some tutorials, papers, etc... that details how such things are done in modern compilers? I don't want to have to re-invent the wheel too much. Thanks in advance.

Edit 1:

It can be assumed that the function is pure.

This is known as Common subexpression elimination.

Normally, this would require you pretty much implement a full compiler in order to do the data flow analysis. An algorithm is given in Dragon Book, "6.1.2 The Value-Number Method for Constructing DAG's" (for the local CSE at least).
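As a heavily simplified illustration of the value-number idea (a real implementation numbers operands too, and must verify that calls are pure, as the edit to the question assumes), here is a Python sketch over a made-up three-address instruction format:

```python
def local_value_numbering(instructions):
    """Eliminate repeated pure computations within one basic block.
    Each instruction is (dest, op, arg1, arg2); all ops are assumed pure."""
    numbering = {}   # (op, arg1, arg2) -> name of the dest already holding it
    out = []
    for dest, op, a1, a2 in instructions:
        key = (op, a1, a2)
        if key in numbering:
            # Seen this exact computation before: reuse the cached result.
            out.append((dest, "copy", numbering[key], None))
        else:
            numbering[key] = dest
            out.append((dest, op, a1, a2))
    return out

block = [
    ("t1", "call_someFunction", "a", "b"),
    ("t2", "call_someFunction", "a", "b"),   # same pure call -> reuse t1
]
print(local_value_numbering(block))
# [('t1', 'call_someFunction', 'a', 'b'), ('t2', 'copy', 't1', None)]
```

This only works within a basic block; extending it across the `if` in the question's example is exactly where the data-flow analysis the answer mentions comes in.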

I'm looking for a good explanation of the definitions of the FIRST, FOLLOW, and PREDICT sets of a RDP when given a grammar.

You can automatically calculate first, follow, and predict sets using Calculate Predict, First, and Follow Sets from BNF (Backus Naur Form) Grammar Specification without having to download anything. It's a good way to verify answers or automate the tedium.

If you want to do it manually, the Dragon Book (2nd ed) covers it on pages 221-222.
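If it helps to see the manual procedure spelled out, here is an illustrative Python sketch of the standard fixed-point computation of FIRST sets (the grammar encoding and the "eps" marker for the empty string are assumptions of the example):

```python
def first_sets(grammar, terminals):
    """Fixed-point FIRST-set computation for a CFG given as
    {nonterminal: [list of right-hand sides]}; 'eps' marks the empty string."""
    first = {t: {t} for t in terminals}
    for nt in grammar:
        first[nt] = set()
    changed = True
    while changed:
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                for sym in rhs:
                    before = len(first[nt])
                    first[nt] |= first[sym] - {"eps"}
                    if len(first[nt]) != before:
                        changed = True
                    if "eps" not in first[sym]:
                        break   # sym is not nullable; stop scanning this rhs
                else:
                    # every symbol in rhs was nullable, so nt derives eps
                    if "eps" not in first[nt]:
                        first[nt].add("eps")
                        changed = True
    return first

# Hypothetical toy grammar: S -> A b ;  A -> a | eps
g = {"S": [["A", "b"]], "A": [["a"], ["eps"]]}
f = first_sets(g, terminals={"a", "b", "eps"})
print(f["S"])   # {'a', 'b'}
print(f["A"])   # {'a', 'eps'}
```

FOLLOW and PREDICT are built the same way: iterate rules to a fixed point, with FOLLOW seeded from what can appear after each nonterminal, and PREDICT(A -> α) being FIRST(α) plus FOLLOW(A) when α is nullable.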

Possible Duplicate:
What are the stages of compilation of a C++ program?

I find that understanding how a given software language is compiled can be key to understanding best practices and getting the most out of that language. This seems to be doubly true with C++. Is there a good primer or document (for mortals) that describes C++ from the point of view of the compiler? (Obviously every compiler is a little different.)

I thought there may be something along those lines in the beginning of Stroustrup's book.

Does anyone have references to documents and research specific on the inner workings of shader compilers/graphics drivers compilers?

There's no big difference between writing an ordinary C compiler and writing a shader compiler. The standard book on writing compilers is the so called "Dragon Book":

Guys, what are the best online resources for learning compiler design?

Would Perl be a viable language in which to write a compiler?

Rather than online, as mentioned in the above answer, grab yourself a copy of the Dragon book Compilers: Principles, Techniques, and Tools. A copy of the first edition shouldn't set you back too much.

Not sure about Perl as a language of choice for implementing a compiler though.

It's been rattling around in my brain for a while.

I've done some investigation into compilers/Flex/Bison and such, but I never found a good reference that talked in detail about the "parsing stack" or how to go about implementing one.

Does anyone know of good references where I could catch up?

Edit: I do appreciate all the compiler references, and I'm going to get some of the books listed, but my main focus was on the Parsing itself and not what you do with it after.

The Dragon book! I used it quite recently to write a compiler (in PHP!) for a processing language for template files written in RTF...

My MIPS Assembly class required me to read in an expression of unknown size into a Parse Tree. I've never had to deal with trees, so this is how I went around storing values:

Lets say the user entered the expression 1 + 3 - 4 (each operand could only be a digit 1-9)

My leftmost child node would be the starting point and contain 2 pieces of data

1. The operand
2. Pointer to the next node (operator)

This is how I constructed the tree. I would point from operand to operator to next operand to next operator until there were no more values left to be read in.

My next task was to traverse the tree recursively and output the values in infix/prefix/postfix notation.

Infix traversal was no problem considering how I constructed my tree.

I'm stuck on the prefix. Firstly, I don't fully understand it.

When I output our expression (1 + 3 - 4) in prefix should it come out - + 1 3 4? I'm having trouble following the online examples.

Also do you think my tree is constructed properly? I mean, I have no way to go to a previous node from the current node which means I always have to begin traversal from the leftmost child node which instinctively doesn't sound right even though my TA said that was the way to go.

Thank you for any help.

This is an instance of the general problem of compiling, which is a solved problem. If you do a google on compiling techniques, you will find out all kinds of information relating to your problem.

Your library should have a copy of Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman. If it doesn't have it, request it for purchase (it's the standard work in the field). The first part of it should help you.

Third edition link
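For the prefix question: yes, for (1 + 3) - 4 the prefix form is - + 1 3 4. With operators as internal nodes and operands as leaves (rather than the linked-list layout described), all three orders fall out of the same recursion. An illustrative Python sketch using nested tuples as tree nodes:

```python
# Each node is either a number (leaf) or (operator, left, right).
tree = ("-", ("+", 1, 3), 4)        # represents (1 + 3) - 4

def prefix(node):
    """Root, then left subtree, then right subtree."""
    if not isinstance(node, tuple):
        return [str(node)]
    op, left, right = node
    return [op] + prefix(left) + prefix(right)

def postfix(node):
    """Left subtree, then right subtree, then root."""
    if not isinstance(node, tuple):
        return [str(node)]
    op, left, right = node
    return postfix(left) + postfix(right) + [op]

print(" ".join(prefix(tree)))    # - + 1 3 4
print(" ".join(postfix(tree)))   # 1 3 + 4 -
```

Note there is no parent pointer here either: each traversal starts at the root and recursion handles the rest, which is why the TA's "always start from one end" advice is workable.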

My teacher told me that if I wanted to get the best grade in our programming class, I should code a Simple Source Code Converter.

Python to Ruby (the simplest he said)

Now my question to you: how hard is it to code a simple source code converter from Python to Ruby? (It should convert file handling, control statements, etc.)

Do you have any tips for me?

Which language should I use to code the converter (C#, Python or Ruby)?

There is a name for a program which converts one type of code to another. It's called a compiler (even if the target language is not in fact machine or byte code).

Compilers are not the easiest part of computer science, and this is a project that, if it were to be anything more than a toy implementation of a converter, would be a massive undertaking. Certainly larger than what one would normally do for a class project in most university courses. (Even many/most compilers courses have fairly modest project assignments.)

As to what language to use? Well, whichever one you know best is probably the answer. Though if you want to learn something new, Haskell would be a good choice, with its pattern matching features. (Disclaimer: I'm new to haskell.) (Yacc could also be used, if you're really serious about getting into compilers.)

You'll also want to consult: The Dragon Compiler Book, which is worth studying even if you don't plan to write compilers.

Compilation generally occurs in several stages: lexical analysis, syntax analysis, etc. Say, in C, I wrote

a = 24;

without declaring a as int. Now, at what stage of compilation is the error detected? The syntax analysis stage? If that is the case, then what does the lexical analyzer do? Just tokenize the source code?

Speaking of compilers in general, the error will be detected in the syntax analysis phase, when the parser looks the symbol up among the symbol table entries and fails to find it; the subsequent phases run only if compilation continues after recovering from the error.

The Dragon Book also states this clearly; it is mentioned on the page where the types of errors are listed. The topic to study thoroughly to understand this issue is section 4.1.3, Syntax Error Handling.

a = 24;   // without declaring a as an int type variable.

Here, the work of the lexical phase is simply to read characters and form tokens, then pass them on to the subsequent phases, i.e., to the parser in the syntax analysis phase, etc.

I'm trying to optimize my simple C interpreter that I made just for fun. I am doing parsing like this: first I parse the file into tokens inside a doubly linked list, then I do syntax and semantic analysis.
I want to optimize a function with this prototype:

bool parsed_keyword(struct token *, char dictionary[][]);

Inside the function I basically call strcmp against all keywords and edit the token type. This of course leads to 20 strcmp calls for (almost) every string being parsed.

I was thinking Rabin-Karp would be best, but it sounds to me like it isn't well suited for this job (matching one word against a small dictionary). What would be the best algorithm for this? Thanks for any suggestions.

A hash table would probably be my choice for this particular problem. It will provide O(1) lookup for a table of your size. A trie would also be a good choice though.

But, the simplest to implement would be to place your words in an array alphabetically, and then use bsearch from the C library. It should be almost as fast as a hash or trie, since you are only dealing with 30 some words. It might actually turn out to be faster than a hash table, since you won't have to compute a hash value.

Steve Jessop's idea is a good one: lay out your strings end to end in identically sized char arrays.

const char keywords[][MAX_KEYWORD_LEN+1] = {
    "auto", "break", "case", /* ... */, "while"
};

#define NUM_KEYWORDS (sizeof(keywords)/sizeof(keywords[0]))

int keyword_cmp (const void *a, const void *b) {
    return strcmp(a, b);
}

const char *kw = bsearch(word, keywords, NUM_KEYWORDS, sizeof(keywords[0]),
                         keyword_cmp);

int kw_index = (kw ? (const char (*)[MAX_KEYWORD_LEN+1])kw - keywords : -1);

If you don't already have it, you should consider acquiring a copy of Compilers: Principles, Techniques, and Tools. Because of its cover, it is often referred to as The Dragon Book.

I want to parse text with JavaScript. The syntax I want to parse is a markup language with 2 main kinds of markup:


means the following characters will be of color #F56; the color applies until the next $ followed by 3 hex characters.


means the text is italic until the following $z (closing tag). There are other one-letter tags.

So basically this language is composed of 3-character hex tags for color and one-letter tags.

I can craft something ugly to parse my text, storing the char position and the current status of tags (formatting and color), but I'd like to learn proper parsing. Could you give me a few tips/principles to make a clean parser for this language?

If you really want to learn about parsing, pick up this book: Compilers: Principles, Techniques, and Tools aka The Dragon book. It is very dense, but offers the most complete take on parsing.
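Before (or alongside) the book, a language this small can be tokenized by hand. Here is a sketch in Python (the exact tag rules are my guesses from the description: `$` plus 3 hex digits is a color tag, `$` plus one letter is a style tag, everything else is text):

```python
import re

# Tokenizer for the tag language described above (rules assumed, see lead-in).
# Note the language is ambiguous: "$fab" could be a color or a style tag
# followed by text; here the color interpretation wins because it is tried first.
TOKEN_RE = re.compile(r'\$[0-9a-fA-F]{3}|\$[a-zA-Z]|[^$]+')

def tokenize(text):
    tokens = []
    for match in TOKEN_RE.finditer(text):
        lexeme = match.group()
        if re.fullmatch(r'\$[0-9a-fA-F]{3}', lexeme):
            tokens.append(('COLOR', lexeme[1:]))   # e.g. ('COLOR', 'F56')
        elif lexeme.startswith('$'):
            tokens.append(('TAG', lexeme[1]))      # e.g. ('TAG', 'i') or ('TAG', 'z')
        else:
            tokens.append(('TEXT', lexeme))
    return tokens
```

A second pass over the token list (tracking the currently open color/style, closing on `$z`) then gives you the clean parser the question asks about.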

I've heard good things about ANTLR (mentioned above) but have not used it. I have used Bison though, which worked pretty well for me to define the grammar.

I want to write a program that scans javascript code, and replace variable names with short, meaningless identifiers, without breaking the code.

I know YUI compresser and google's closure compiler can do this. I am wondering how can I implement this?

Is it necessary to build the abstract syntax tree? If not, how can I find the candidate variables for renaming?

Most modern javascript compressors are actually compilers. They parse javascript input into an abstract syntax tree, perform operations on the tree (some safe, some not) and then use the syntax tree to print out code. Both Uglify and Closure-Compiler are true compilers.

Implementing your own compiler is a large project and requires a great knowledge of computing theory. The dragon book is a great resource from which to get started.

You may be able to leverage existing work. I recommend starting from a non-optimizing compiler for reference.

Let's say I have a programming language where I can write: x = f(g(1), h(1)) in this case the directed acyclic graph will show the dependencies of calculation like in a spreadsheet (assuming non recursive expressions):

f
| \
g  h
 \ /
  1

This is a simple example, but it gets interesting when trying to "compress" more complex expressions into a DAG. The goal here is optimizing the number of recalculations based on the dependencies.

What algorithms and papers are available for dealing with this problem?

A bit more specifically: it's Local Common Subexpression Elimination. An algorithm is given in the Dragon Book, section 6.1.2, "The Value-Number Method for Constructing DAG's".
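A minimal sketch of that value-number idea in Python (names are mine, not the book's):

```python
# A value-numbering table: each distinct (op, children) combination is
# created once, so repeated subexpressions share a single DAG node.
def value_number(op, children, table, nodes):
    """Return the node id for (op, children), creating the node only once."""
    key = (op, tuple(children))
    if key not in table:
        nodes.append(key)
        table[key] = len(nodes) - 1
    return table[key]

# Build the DAG for x = f(g(1), h(1)): the leaf 1 is shared by g and h.
table, nodes = {}, []
one = value_number('1', [], table, nodes)
g = value_number('g', [one], table, nodes)
h = value_number('h', [one], table, nodes)
f = value_number('f', [g, h], table, nodes)  # 4 nodes total, not 6
```

Asking for `value_number('g', [one], ...)` again returns the existing node instead of creating a new one, which is exactly the "avoid recalculation" property the question is after.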

There is a question on stackoverflow about Learning to write a compiler. I have looked at it and I think it's an undertaking I want to tackle. I think (like many others) the knowledge will be invaluable to a programmer. However, my skills are primarily in C++ and I would say I am very comfortable with the syntax of the language and some basic algorithms and design concepts, but I am by no means a seasoned programmer. My only programming experience comes from academic textbooks at the college level and the completion of introductory/intermediate courses (300 level classes). Hence my question.

With only a general knowledge of the C++ language and no Assembly knowledge, would a book aimed at the theories and workings of a compiler and the implementation of those theories, such as the book Compilers: Principles, Techniques, and Tools (2nd Edition), be difficult for me to understand?

I would recommend you start with an interpreter first, as you don't need proprietary hardware knowledge to implement it. The key concepts are usually how the language is defined, what makes a statement, building parse trees, etc. The hardware knowledge, to me, is secondary to actually understanding how to read and evaluate the statements.

What I did when learning was write a small interpreter for a Pascal like language and started small with simple statements and variable storage and slowly added different things to it as I got better.

I am looking to write an interpreted language in C#, where should I start? I know how I would do it using fun string parsing, but what is the correct way?

Check out the Phoenix compiler framework from Microsoft. It will provide many of the tools you will need to build a compiler targeting native or managed environments. Among these tools is an optimizing back end.

I second Cycnus' suggestion on reading Aho Sethi and Ullman's "Dragon Book" (Wikipedia, Amazon).


Assume that we are given input file in the following form:

12 a -5 
T 23 -1 34 R K s 3 4 r  
a a 34 12 -12 y 

Now, we need to read the entire file and print the following:

number of integers
number of lowercase char
number of uppercase char
sum of all integers

Questions like this have always been a thorn in my flesh and I want to get this over with once and for all.

You need to parse the file:

1) separate raw text into tokens, then

2) "decide" if the token is a string, an integer, or "something else".

3) Count the results (#/integers, #/strings, etc)

Here's an example: Parse A Text File In C++

Here's the canonical textbook on the subject: The Dragon Book
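The three steps above can be sketched like this (Python for brevity, though the question is presumably C/C++; classification is done per whitespace-separated token, which matches the sample input):

```python
def classify(text):
    """Count integer tokens, lowercase tokens, uppercase tokens; sum the integers."""
    counts = {'integers': 0, 'lower': 0, 'upper': 0, 'sum': 0}
    for token in text.split():          # step 1: separate raw text into tokens
        try:                            # step 2: decide what each token is
            value = int(token)
            counts['integers'] += 1
            counts['sum'] += value      # step 3: accumulate the results
        except ValueError:
            if token.islower():
                counts['lower'] += 1
            elif token.isupper():
                counts['upper'] += 1
    return counts
```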

Can lex and yacc be used for making a programming language? And any recommendations for some books?

Any references?

So far I have found:

Build code with lex and yacc, Part 1: Introduction

Yes, you can certainly use lex and yacc to build a compiler/translator for a programming or scripting language. There are GNU variants of these tools, called flex and bison. John Levine's lex & yacc was for many years the gold standard for books about these tools. That book may be out of print, but I expect that the successor book, Flex & Bison, is just as good. To dig deeper into building a compiler, start with Aho et al., Compilers: Principles, Techniques, and Tools, 2/e. (Again, my recommendation is based on the first edition of this book.)

I am confused about the relationship between Syntax Directed Translation and a parser written using Bison. (The main confusion is whether a parser written in Bison already constitutes a syntax directed translator.) To rephrase: how does Bison implement Syntax Directed Translation? Is it by attaching actions, e.g. $$ = $1 + $3?

This link says that

The C code in an action can refer to the semantic values of the components matched by the rule with the construct $n, which stands for the value of the nth component. The semantic value for the grouping being constructed is $$. (Bison translates both of these constructs into array element references when it copies the actions into the parser file.)

And also in chapter 5 (Syntax Directed Analysis) of the book says

Grammar + Semantic rules = Syntax Directed Translation

E → E₁ + F    { E.code = E₁.code ∥ F.code ∥ '+' }

When looking at the following snippet of translation rules for a simple parser from the book Flex and Bison

E : F                       /* default action: $$ = $1 */
  | E ADD F { $$ = $1 + $3; }
  | E SUB F { $$ = $1 - $3; }
  ;

Is the .code equivalent to $$? I am so confused. Is syntax directed analysis the same as semantic analysis? The more I read, the more I get confused. Someone please help me sort this out.

Your understanding seems correct, but is confused by the fact that your example from the Dragon book and example parser are doing two different things -- the Dragon book is translating the expression into code, while the simple parser is evaluating the expression, not translating (so this is really syntax directed evaluation, not syntax directed translation).

In the semantic rules described in the Dragon book, symbols can have multiple attributes, both synthesized and inherited. That's what the .code suffix means -- it's an attribute of the symbol it is applied to. Bison, on the other hand, allows each symbol to have a single synthesized attribute -- no more, and no inherited attributes. If you want multiple attributes, you can gather them together into a struct and use that as your attribute (requires some careful management). If you want inherited attributes you can use $0 and even more careful management, or you can use globals to get the same effect.

The bison snippet that would correspond to your Dragon book example snippet would be something like:

E : E ADD F { $$ = AppendCode($1, $3, PLUS); }

using the single bison attribute for the .code attribute and doing the append operation for the code being generated as a function.

I am new to Visual C++ and I am using Microsoft Visual C++ 6.0 to build an application. The application for now has to generate a .cpp file from a proprietory .cfg file. Can anyone please guide how this can be achieved. Any help or guidance is much appreciated.

Thanks, Viren

Your question is a little vague; however, it sounds like you need to develop some kind of parser to read in the cfg files and translate them into some form of intermediate language or object graph, optimize it, and then output it as C++. Sounds to me like a job for a home-grown compiler.

If you aren't familiar with the different phases of a compiler I would highly recommend you check out the infamous dragon book

Then again, if this is for an important project with a deadline you probably don't have a lot of time to spend in the world of compiler theory. Instead you might want to check out antlr. It is really useful for creating a lexer and parser for you based on grammar rules that you define from the syntax of the cfg files. You can use the antlr parser to translate the cfg files into an AST or some other form of object graph. At that point you are going to be responsible for manipulating, optimizing and outputting the C++ syntax to a new file.

I haven't read it yet but this is supposed to be an excellent book for novice and experienced antlr users

plus there a plenty of antlr tutorials and examples online that I've used to help learn it. Hope that helps put you in the right direction.

I am trying to create a lexical analyzer program using Java. The program must have the concept of tokenization. I have beginner-level knowledge of compiler programming. I know there are lots of lexical generators on the internet, and I can use them to test my own lexical analyzer's output, but I need to write my own. Can anyone please give me some good references, articles, or ideas to start my coding?

I would try taking a look at the source code for some of the better ones out there. I have used SableCC in the past. If you go to this page describing how to set up your environment, there is a link to the source code for it. Antlr is also a really commonly used one. Here is the source code for it.

Also, The Dragon Book is really good.

As suggested by SK-logic, I am adding Modern Compiler Implementation as another option.

If I have a String as follows

( (a || b) && c) || (d && e)

How can I split it into different strings based on the brackets and form a tree like this?

         ( (a || b) && c) || (d && e)  ---> Root

               /                \
              /                  \
          ( (a || b) && c)      (d && e)
           /           \             /  \
          /             \            /   \
         (a || b)        c           d    e

The problem you are describing falls into the computer science field of parsers and formal languages.

A parser program based on an arbitrary grammar for an arbitrary string can be generated with tools like lex & yacc.

Lex is a lexical analyzer program generation tool, which takes as input a text file that defines the lexical rules of your grammar as regexp, and outputs a program capable of recognize tokens from an arbitrary input string as you defined them in the rules.

Yacc is a syntax parser program generation tool, which takes as input a lexer, a text file that represents the grammar of your language (in your case, that would be an expression-like grammar), and outputs a program called parser which will be able to transform your expression string into a tree as you mention (i.e. parse the string into a parse-tree).

Yacc and lex can be easily used together to generate a parser program that creates a parse-tree based on so-called semantic-actions with which you instruct the parser to build the tree in the way you want.
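Before reaching for yacc and lex, note that a grammar this small can also be parsed by hand. Here is a sketch of a recursive-descent parser in Python that builds the tree as nested tuples (the grammar layering and tuple representation are my own choices):

```python
import re

# Grammar (precedence: && binds tighter than ||):
#   expr : term ('||' term)*
#   term : atom ('&&' atom)*
#   atom : IDENT | '(' expr ')'
def parse(text):
    tokens = re.findall(r'\|\||&&|\(|\)|\w+', text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expr():
        left = term()
        while peek() == '||':
            eat()
            left = ('||', left, term())
        return left

    def term():
        left = atom()
        while peek() == '&&':
            eat()
            left = ('&&', left, atom())
        return left

    def atom():
        if peek() == '(':
            eat()
            inner = expr()
            eat()          # consume ')'
            return inner
        return eat()       # an identifier leaf

    return expr()
```

Each nested tuple is one internal node of the tree from the question, with the operator at the root and the operands as subtrees.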

I suggest you the following as an introductory reading:

If you are interested in the matter, a more challenging reading would be:

Yacc and Lex are made only for the C language; equivalent tools exist for Java. My favorite parser-generator tool in Java would be:

I have a list of segments (15000+ segments), and I want to find the occurrences of these segments in a given string. A segment can be a single word or multiword; I cannot assume space is a delimiter in the string.

e.g. String "How can I download codec from internet for facebook, Professional programmer support"

[the string above may not make any sense but I am using it for illustration purpose]

segment list

  • Microsoft word
  • Microsoft excel
  • Professional Programmer.
  • Google
  • Facebook
  • Download codec from internet.

Output:

  • Download codec from internet
  • facebook
  • Professional programmer

Basically I am trying to do a query reduction.

I want to achieve this in less than O(list length + string length) time. As my list has more than 15000 segments, it would be time consuming to search the entire list against the string. The segments are prepared manually and placed in a txt file.



What you're basically asking is how to write a custom lexer/parser.

Some good background on the subject would be the Dragon Book or something on lex and yacc (flex and bison).

Take a look at this question:

Poor man's lexer for C#

Now of course, a lot of people are going to say "just use regular expressions". Perhaps. The deal with using regex in this situation is that your execution time will grow linearly as a function of the number of tokens you are matching against. So, if you end up needing to "segment" more phrases, your execution time will get longer and longer.

What you need to do is have a single pass, popping words on to a stack and checking if they are valid tokens after adding each one. If they aren't, then you need to continue (disregard the token like a compiler disregards comments).
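One way to sketch that single pass is with a trie of words (a variation on the stack idea above; structure and normalization rules are my own, chosen to match the sample data):

```python
END = object()  # sentinel: "a segment ends at this trie node"

def build_trie(segments):
    """Load segments into a trie keyed by lowercased words."""
    root = {}
    for segment in segments:
        node = root
        for word in segment.lower().rstrip('.').split():
            node = node.setdefault(word, {})
        node[END] = segment            # remember the original segment text
    return root

def find_segments(text, root):
    """Single pass over the input words, extending matches via the trie."""
    words = [w.strip(',.') for w in text.lower().split()]
    found = []
    for start in range(len(words)):
        node = root
        for word in words[start:]:
            if word not in node:
                break                  # no segment continues with this word
            node = node[word]
            if END in node:
                found.append(node[END])
    return found
```

The inner loop runs only as long as the trie allows, so the cost is roughly O(input length × longest segment), independent of the 15000-segment list size.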

Hope this helps.

I'm reading Compilers: Principles, Techniques, and Tools (2nd Edition) and I'm trying to compute the FOLLOW() sets of the following grammar:

S  →  iEtSS' | a
S' →  eS | ε
E  →  b

where S, S', E are non-terminal symbols, S is the start symbol, i, t, a, e, b are terminal symbols, and ε is the empty string.

What I've done so far

FOLLOW(S) = {$} ∪ FOLLOW(S')
FOLLOW(E) = FIRST(tSS') - {ε} = FIRST(t) - {ε} = {t} - {ε} = {t}

where $ is the input right endmaker.


$ ∈ FOLLOW(S), since S is the start symbol. We also know that S' → eS, so everything in FOLLOW(S') is in FOLLOW(S). Therefore, FOLLOW(S) = {$} ∪ FOLLOW(S').

We also know that S → iEtSS', so everything in FOLLOW(S) is in FOLLOW(S'). Therefore, FOLLOW(S') = FOLLOW(S).

The problem is that I can't compute FOLLOW(S), since I don't know FOLLOW(S'). Any ideas?

The simple algorithm, described in the text, is a least fixed-point computation. You basically cycle through the nonterminals, putting terminals into the follow sets, until you get through an entire cycle without changes.

Since nothing is ever removed from any follow set, and the number of terminals is finite, the algorithm must terminate. It usually only takes a few cycles.
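A sketch of that least fixed-point computation for the grammar above (FIRST sets are hand-coded here, with '' standing for ε):

```python
# Grammar from the question: S -> iEtSS' | a ; S' -> eS | eps ; E -> b
GRAMMAR = {
    'S':  [['i', 'E', 't', 'S', "S'"], ['a']],
    "S'": [['e', 'S'], []],
    'E':  [['b']],
}
FIRST = {'S': {'i', 'a'}, "S'": {'e', ''}, 'E': {'b'}}

def first_of_seq(seq):
    """FIRST of a string of grammar symbols."""
    result = set()
    for sym in seq:
        f = FIRST.get(sym, {sym})      # a terminal's FIRST is itself
        result |= f - {''}
        if '' not in f:
            return result
    result.add('')                     # every symbol was nullable
    return result

def follow_sets(start='S'):
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add('$')
    changed = True
    while changed:                     # cycle until no follow set grows
        changed = False
        for head, bodies in GRAMMAR.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in GRAMMAR:
                        continue       # terminals have no follow set
                    suffix = first_of_seq(body[i + 1:])
                    new = suffix - {''}
                    if '' in suffix:   # rest nullable: add FOLLOW(head)
                        new |= follow[head]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow
```

This yields FOLLOW(S) = FOLLOW(S') = {e, $} and FOLLOW(E) = {t}. The mutual inclusions FOLLOW(S) ⊆ FOLLOW(S') and FOLLOW(S') ⊆ FOLLOW(S) don't block the computation; they simply force the two sets to converge to the same value.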

In my project I have a view where I write words in some text fields; when I press a button these strings must be stored in a CSV file, as in this example (example with 5 text fields):


this is an example of the result that I want. How can I do this?

Edited to add:

code for the string

NSMutableString *csvString = [NSMutableString stringWithString:textfield1.text];
[csvString appendString:@"#"];
[csvString appendString:textfield2.text];
[csvString appendString:@"#"];
[csvString appendString:textfield3.text];
[csvString appendString:@"#"];
[csvString appendString:textfield4.text];
[csvString appendString:@"#"];
[csvString appendString:textfield5.text];
[csvString appendString:@"#"];
[csvString appendString:textfield6.text];
[csvString appendString:@"#"];
[csvString appendString:textfield7.text];
[csvString appendString:@"#"];
if (uiswitch.on) { //switch
    [csvString appendString:@"1"];
} else {
    [csvString appendString:@"0"];
}
[csvString appendString:@";"];

finally csvString

NSLog(@"string = %@", csvString);

is exactly my string

Just as noted in my answer to another almost identical question of yours from earlier today: Do NOT do that.

As soon as a user enters a "#" or ";" into one of the text fields your csv file (or rather: what you call a CSV file, but actually isn't one at all) will get corrupted and crash your code once read in again (or at least result in malformed data).

Again: Do NOT do that.

Instead: stick with real CSV and a parser/writer written by a professional.

Generally speaking: Unless you have very good knowledge of Chomsky's hierarchy of formal languages and experience in writing language/file-format parsers do NOT (as in NEVER!) attempt to write one. Neither for your personal projects, let alone public ones. (Do the latter and I'll hunt you down! ;) )
Languages/Formats such as CSV look trivial at first glance but aren't by any means (as in type-2-language).
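To illustrate why a professionally written CSV writer matters: it escapes delimiters and quotes automatically, so round-tripping user input is lossless. A Python sketch (the question is iOS/Objective-C; this merely demonstrates the behavior any proper CSV library gives you):

```python
import csv
import io

# Fields deliberately containing the delimiter and quote characters:
# a real CSV writer quotes/escapes them, so the file cannot be corrupted.
fields = ['plain', 'has,comma', 'has "quotes"']

buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
line = buffer.getvalue()

# Reading the line back recovers the original fields exactly.
restored = next(csv.reader(io.StringIO(line)))
```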

I searched the internet to find an answer, but I couldn't find one. Is there anyone who can help me?

expr        -> term moreterms

moreterms   -> + term {print('+')} moreterms
             | - term {print('-')} moreterms
             | ε

term        -> factor morefactors

morefactors -> * factor {print('*')} morefactors
             | / factor {print('/')} morefactors
             | ε

factor      -> ( expr )
             | id  {print(id)}
             | num {print(num)}

I will use this code for a very basic calculator compiler and interpreter.

How can I convert this grammar into C++ or Java?

There are many tools that take grammars and generate parsers, ranging from Yacc to boost spirit.

The art of writing parsers has been widely studied. It isn't trivial. One approach is to determine if you can make your BNF into an LR(1) grammar and write a LR parser for it.

An easy way to parse is to split your parsing into tokenizing (where you bundle things into identifiers), and syntax tree generation.

Wikipedia has a cursory description of LR parsing. Knuth's Canonical LR(1) parser is also worth looking at.

Teaching how to write an LR(1) parser (with whatever restrictions, let alone an LR(k) parser) is a matter of a short college course or a book chapter, not a stack overflow post.

But the general idea is you read from left to right. You look ahead k tokens (typically 1) to determine which rule to apply to the next token you encounter. You build the parse tree from the bottom-up.

There are lots of technical details, techniques, quirks and problems. Not every BNF grammar can be turned into an LR(1) grammar, let alone the restricted ones that many parser generators can handle.

As mentioned by @UnholySheep, The Dragon Book is the book most people learn these techniques from.
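For what it's worth, the translation scheme in the question is small enough to hand-code without any parser generator. A sketch of a recursive-descent translator in Python rather than C++ or Java (the tokenizing is my own simplification; the structure carries over directly): each {print(...)} action becomes an append to the output, yielding postfix.

```python
import re

def to_postfix(text):
    tokens = re.findall(r'\d+|[a-zA-Z_]\w*|[-+*/()]', text)
    pos = 0
    out = []

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():                       # expr -> term moreterms
        term()
        while peek() in ('+', '-'):   # moreterms -> + term {print('+')} ... | eps
            op = peek(); advance()
            term()
            out.append(op)

    def term():                       # term -> factor morefactors
        factor()
        while peek() in ('*', '/'):
            op = peek(); advance()
            factor()
            out.append(op)

    def factor():                     # factor -> (expr) | id {print(id)} | num {print(num)}
        if peek() == '(':
            advance(); expr(); advance()   # consume ')'
        else:
            out.append(peek()); advance()

    expr()
    return ''.join(out)
```

Each nonterminal becomes one function, and the ε alternatives of moreterms/morefactors are the loop exits.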

It might sound stupid, but I decided to take on the challenge of programming the translation algorithm with OOP in NetBeans (Java), having only basic knowledge of Java and only the theory of the translation algorithm (compiler).

I am here to ask for your assistance. If any of you have done something like translation from one programming language into another, I would be happy if you could provide links to the information you used, or set me in the right direction so I can start correctly!

Thank you in Advance

Best Armani

Theory of compilation is a huge field of research that, among other things, includes formal languages, graph theory, low-level optimizations, and more.

A good place to start learning about it is the Dragon Book.

If you are using java, a useful tool that helps you do most of the front-end tasks of a compiler is JavaCC

I want to experiment with Programming Language Design.
The feature set that I imagined would be doable in C++, meaning you could rewrite anything from "MyLang" in C++.

I thought it would be great to have a two-way-converter, from MyLang to C++ and the other way around. This way I can avoid writing a Compiler/Optimizer/Linker/VirtualMachine/whatever and just use all the good stuff which is available for C++.

In my preliminary search I came across LLVM/Clang and thought that using its underlying parsing and AST generation would make the work much easier. But a closer look has shown me that it is a gigantic beast of a project where getting started is not easy. My current point of entry into Clang is the clang-modernizer, since it looks nice, small enough, and pluggable, but I imagine it would break as soon as I diverge anywhere from C++ syntax. I want to stay at a higher level than LLVM IR, since MyLang would be very similar to C++ at a high level.

An example of conversion would be something that takes a my.cpp and a my.hpp file and combines them into a my.lang file, which at this stage may be 100% valid C++. Later the my.lang file would be converted back, splitting the definitions and inline methods into the my.hpp file and the non-inline methods into my.cpp again. Later on I plan to add more deviations from C++ syntax, but this might be a good start.

The Questions:

  1. Do You know of a Project/Framework/Toolkit that supplies exactly such a two-way converter, which is Open Source or maybe configurable enough to allow what I want?
  2. Do You think LLVM/Clang is the best option for creating a MyLanguage to C++ converter? Do You have good Alternatives?
  3. Any (web-)literature that helps getting my foot in the right spot in the door for a Framework/Clang/YourAlternative?

The Not-Questions:

  • I know what an AST is.
  • I know C++ is a beast itself, I'm not really aiming at 100% C++ code compatibility right now.
  • This is just research, I don't want to get anything done with it. ;-)

Thank you for your time!
Please be gentle, this is my first Question here.

Do You know of a Project/Framework/Toolkit that supplies exactly such a two-way converter, which is Open Source or maybe configurable enough to allow what I want?

I believe LLVM can do just what you want. However, I can't guarantee the resulting translation would be human readable.

I would create a front-end that compiles to LLVM IR. The IR can then be converted to C++ with the LLVM static compiler, by targeting the C++ backend (llc -march=cpp).

If you just want your new language to execute, there is no reason to convert it to C++ and then recompile it. You can JIT/Interpret utilizing the LLVM framework.

If you want any LLVM IR to be able to convert to your language, you can create a compiler target, that handles the generation.

Do You think LLVM/Clang is the best option for creating a MyLanguage to C++ converter? Do You have good Alternatives?

I believe the LLVM framework is the way to go. If all you want to do is focus on the compiler front-end, you can do just that. You will get all the back-end optimizations and all the targets included in the framework. This is nice to scope your focus.

In terms of developing the front-end for your language, you can take advantage of the ANTLR parser generator. This will help you develop to an AST. In addition, perform any optimizations and validations that you can do to an AST. After you have your AST you can create a visitor that navigates the AST to generate LLVM IR. There already exists a grammar file for C++ to start with here.

Any (web-)literature that helps getting my foot in the right spot in the door for a Framework/Clang/YourAlternative?

Compilers are awesome and extremely complex. I suggest you at least have the purple dragon book. To get going on LLVM I would go through their tutorial. You go through the development of a language, from the front-end all the way to JITing.

I'd like to learn more about the LLVM system, as I use the compiler a lot. I have no background in compiler technology. Is the Dragon Book still a must read in order to understand LLVM or is it outdated? Is there anything better (and shorter) at this moment?

The Dragon book is arguably THE book for compiler concepts. The level of familiarity with compiler concepts that you should have before digging into LLVM depends on what exactly you want to achieve and where you want to contribute.

For example, to build a new LLVM front-end you should probably first be familiar with the concepts of lexical and semantic analysis. Further, to implement optimizations and/or instrumentation you should probably be familiar with the concepts of data-flow analysis, to apply them to LLVM IR.

I need a simple lexical analyzer that reports for-loop errors in C/C++.

For purely lexical analysis, you could use regular expressions, or any of dozens of scanner generators (flex/lex, ANTLR). For syntactic analysis, on the other hand, you'd probably need a parser generator that can read a context-free grammar. However, from what I understand, most C++ parsers are hand-written. I'm not sure if even an LALR parser would do the trick; you might need to bring out the big guns and use something like Bison's GLR support. Also, for a ton more information on lexical/syntactic analysis, I'd recommend 'The Dragon Book'. Good luck!

Recently I have been extremely interested in language development. I've got multiple working front ends and have had various systems for executing the code. I've decided I would like to try to develop a virtual machine type system. (Kind of like the JVM, but much simpler of course.) So I've managed to create a basic working instruction set with a stack and registers, but I'm just curious how some things should be implemented.

In Java, for example, after you've written a program you compile it with the Java compiler and it creates a binary (.class) for the JVM to execute. I don't understand how this is done: how does the JVM interpret this binary, what's the transition from human-readable instructions to this binary, and how could I create something similar?

Thanks for any help/suggestions!

Alright, I'll bite on this generic question.

Implementing a compiler/assembler/VM combo is a tall order, especially if you're doing it by yourself. That being said: if you keep your language specification simple enough, it is quite doable, even by yourself.

Basically, to create a binary, the following is done (this is a tad bit simplified):

1) Input source is read, lexed, and tokenized

2) The program logic is analyzed for semantic correctness.

E.g. while the following C++ would parse & tokenize, it would fail semantic analysis

float int* double = const (_identifier >><<) operator& * 

3) Build an Abstract Syntax Tree to represent the statements

4) Build symbol tables and resolve identifiers

5) Optional: Optimization of code

6) Generate code in an output format of your choice; for example binary opcodes/operands, string tables. Whatever format suits your needs best. Alternatively, you could create bytecode for an existing VM, or for a native CPU.

EDIT If you want to devise your own bytecode format, you can write, for example:

1) File Header
DWORD filesize
DWORD checksum
BYTE  endianness;
DWORD entrypoint <-- Entry point for first instruction in main() or whatever
2) String table
DWORD numstrings
DWORD stringlen
<string bytes/words>

3) Instructions
DWORD numinstructions
DWORD opcode
DWORD numops <--- or deduce from opcode
DWORD op1_type <--- stack index, integer literal, index to string table, etc
DWORD operand1
DWORD op1_type
DWORD operand2


Overall, the steps are manageable, but, as always, the devil is in the details.
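As a toy illustration of the "readable instructions -> binary -> execute" round trip the question asks about, here is a sketch in Python (the opcodes and format are invented for this example, loosely following the DWORD layout above):

```python
import struct

# Made-up opcodes for this sketch: 1 = PUSH <int>, 2 = ADD, 3 = HALT.
PUSH, ADD, HALT = 1, 2, 3

def assemble(program):
    """Pack a flat list of opcodes/operands into little-endian DWORDs."""
    binary = b''
    for word in program:
        binary += struct.pack('<I', word)
    return binary

def run(binary):
    """Decode the DWORD stream and execute it on a tiny stack machine."""
    words = struct.unpack('<%dI' % (len(binary) // 4), binary)
    stack, ip = [], 0
    while True:
        op = words[ip]; ip += 1
        if op == PUSH:
            stack.append(words[ip]); ip += 1   # operand follows the opcode
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == HALT:
            return stack.pop()

binary = assemble([PUSH, 2, PUSH, 3, ADD, HALT])
```

A real format would add the header, checksum, and string table sketched above, but the decode-dispatch loop stays the same shape.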

Some good references are:

The Dragon Book - This is heavy on theory, so it's a dry read, but worthwhile

Game Scripting Mastery - Guides you along while developing all three components in a more practical manner. However, the example code is rife with security issues, memory leaks, and overall lousy coding style (imho). Still, you can take a lot of concepts away from this book, and it's worth a read.

The Art of Compiler Design - I have not read this one personally, but heard positive things about it.

If you decide to go down this road, be sure you know what you're getting yourself into. This is not something for the faint of heart, or for someone new to programming. It requires a lot of conceptual thinking and prior planning. It is, however, quite rewarding and fun.

I have been thinking about building my own compiler for a while and a few days ago I finally started on it. My compiler works like this:

  • Parse code from my own file. (With a .exe file made with c++)
  • Create assembly code
  • Create a file containing those assembly code
  • Compile that assembly file if it is made (done with a vbs script)
  • Link the .obj file
  • And we have our .exe file

Now I am having difficulties finding the best way to parse my code. I haven't really made this yet, but I will put my ideas here.

  • Find all variables and declare them. Variables will be preceded by 'var ' (for now). Uninitialized variables will be put in the .data? section and initialized ones in the .data section.
  • Find the main procedure and start executing the functions and operations.

Now I was simply wondering if someone can improve my ideas. Or if someone has a better idea to make some kind of compiler and your own programming language.

Get yourself a copy of A. V. Aho, M. S. Lam, R. Sethi, J. D. Ullman: Compilers: Principles, Techniques, and Tools and start studying.

The book covers the necessary theoretical background, especially:

  • Context-free grammars
  • Recursive-descent, LL, LR parsing
  • Symbol handling
  • Intermediate representation