Compilers

Alfred V. Aho, Ravi Sethi


This book provides the foundation for understanding the theory and practice of compilers. Revised and updated, it reflects the current state of compilation. Every chapter has been completely revised to reflect developments in software engineering, programming languages, and computer architecture that have occurred since 1986, when the last edition was published. The authors, recognizing that few readers will ever go on to construct a compiler, retain their focus on the broader set of problems faced in software design and software development. The book is aimed at computer scientists, developers, and aspiring students who want to learn how to build, maintain, and execute a compiler for a major programming language.

More on Amazon.com

Mentioned in questions and answers.

I'm having issues 'describing each step' when creating an NFA from a regular expression. The question is as follows:

Convert the following regular expression to a non-deterministic finite-state automaton (NFA), clearly describing the steps of the algorithm that you use: (b|a)*b(a|b)

I've made a simple 3-state machine but it's very much from intuition. This is a question from a past exam written by my lecturer, who also wrote the following explanation of Thompson's algorithm: http://www.cs.may.ie/staff/jpower/Courses/Previous/parsing/node5.html

Can anyone clear up how to 'describe each step clearly'? It just seems like a set of basic rules rather than an algorithm with steps to follow.

Maybe there's an algorithm I've glossed over somewhere but so far I've just created them with an educated guess.

Short version for general approach.
There's an algo out there called the Thompson-McNaughton-Yamada Construction Algorithm or sometimes just "Thompson Construction." One builds intermediate NFAs, filling in the pieces along the way, while respecting operator precedence: first parentheses, then Kleene Star (e.g., a*), then concatenation (e.g., ab), followed by alternation (e.g., a|b).

Here's an in-depth walkthrough for building (b|a)*b(a|b)'s NFA

Building the top level

  1. Handle parentheses. Note: In an actual implementation, it can make sense to handle parentheses via a recursive call on their contents. For the sake of clarity, I'll defer evaluation of anything inside of parens.

  2. Kleene Stars: only one * there, so we build a placeholder Kleene Star machine called P (which will later contain b|a). Intermediate result:
    Non-Deterministic Finite Automata for P*

  3. Concatenation: Attach P to b, and attach b to a placeholder machine called Q (which will contain a|b). Intermediate result:
    Non-Deterministic Finite Automata for P*bQ

  4. There's no alternation outside of parentheses, so we skip it.

Now we're sitting on a P*bQ machine. (Note that our placeholders P and Q are just concatenation machines.) We replace the P edge with the NFA for b|a, and replace the Q edge with the NFA for a|b via recursive application of the above steps.


Building P

  1. Skip. No parens.

  2. Skip. No Kleene stars.

  3. Skip. No concatenation.

  4. Build the alternation machine for b|a. Intermediate result:
    NFA for b or a


Integrating P

Next, we go back to that P*bQ machine and we tear out the P edge. We have the source of the P edge serve as the starting state for the P machine, and the destination of the P edge serve as the destination state for the P machine. We also make that state reject (take away its property of being an accept state). The result looks like this:
    Non-Deterministic Finite Automata for P*bQ with the P edge replaced by the b|a machine


Building Q

  1. Skip. No parens.

  2. Skip. No Kleene stars.

  3. Skip. No concatenation.

  4. Build the alternation machine for a|b. Incidentally, alternation is commutative, so a|b is logically equivalent to b|a. (Read: skipping this minor footnote diagram out of laziness.)


Integrating Q

We do what we did with P above, except replacing the Q edge with the intermediate a|b machine (which, as noted, has the same shape as the b|a machine we constructed). This is the result:
    Non-Deterministic Finite Automata for the complete expression, with the Q edge also replaced

Tada! Er, I mean, QED.


Want to know more?

All the images above were generated using an online tool for automatically converting regular expressions to non-deterministic finite automata. You can find its source code for the Thompson-McNaughton-Yamada Construction algorithm online.

The algorithm is also addressed in Aho's Compilers: Principles, Techniques, and Tools, though its explanation is sparse on implementation details. You can also learn from an implementation of the Thompson Construction in C by the excellent Russ Cox, who described it in some detail in a popular article about regular expression matching.
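
For a concrete feel of how the construction can be implemented, here is a rough sketch in C (this is not the online tool's source and not Cox's code; the State/Frag types and function names are invented for illustration). Each operation from the walkthrough - literal, concatenation, alternation, Kleene star - becomes a small function that wires NFA fragments together with epsilon edges:

#include <stdlib.h>

enum { EPS = -1 };                       /* marker for an epsilon transition */

typedef struct State State;
struct State {
    int symbol;                          /* input symbol, or EPS */
    State *out1, *out2;                  /* at most two outgoing edges */
    int accept;
};

typedef struct { State *start, *end; } Frag;   /* one intermediate NFA */

static State *new_state(int symbol) {
    State *s = calloc(1, sizeof *s);
    s->symbol = symbol;
    return s;
}

/* Literal c:  start --c--> end */
static Frag literal(int c) {
    Frag f = { new_state(c), new_state(EPS) };
    f.start->out1 = f.end;
    return f;
}

/* Concatenation ab: glue a's end to b's start with an epsilon edge */
static Frag concat(Frag a, Frag b) {
    a.end->out1 = b.start;
    return (Frag){ a.start, b.end };
}

/* Alternation a|b: fresh start/end states with epsilon edges around both */
static Frag alternate(Frag a, Frag b) {
    Frag f = { new_state(EPS), new_state(EPS) };
    f.start->out1 = a.start;  f.start->out2 = b.start;
    a.end->out1 = f.end;      b.end->out1 = f.end;
    return f;
}

/* Kleene star a*: allow skipping a entirely, and looping back through it */
static Frag star(Frag a) {
    Frag f = { new_state(EPS), new_state(EPS) };
    f.start->out1 = a.start;  f.start->out2 = f.end;
    a.end->out1 = a.start;    a.end->out2 = f.end;
    return f;
}

int main(void) {
    /* (b|a)*b(a|b), built inside-out in the same order as the walkthrough */
    Frag re = concat(concat(star(alternate(literal('b'), literal('a'))),
                            literal('b')),
                     alternate(literal('a'), literal('b')));
    re.end->accept = 1;                  /* the final state is the accept state */
    return 0;
}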

This question may sound cliched, but I am in a situation here.

I am trying to implement a finite state automaton to parse a certain string in C. As I started writing the code, I realised the code may be more readable if I used labels to mark the different states and use goto to jump from one state to another as the case comes.

Using the standard breaks and flag variables is quite cumbersome in this case and makes it hard to keep track of the state.

What approach is better? More than anything else I am worried it may leave a bad impression on my boss, as I am on an internship.

I would recommend the "Dragon Book": Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman. (It is rather expensive to buy, but you will certainly find it in a library.) There you will find everything you need to parse strings and build finite automata. I could not find a single goto in it. Usually the states are a data table and the transitions are functions like accept_space().
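
To make the data-table idea concrete, here is a small illustrative sketch in C (the states, character classes, and the toy "words and numbers" input language are invented for this example, not taken from the book):

#include <stdio.h>
#include <ctype.h>

enum state { ST_START, ST_WORD, ST_NUMBER, ST_ERROR, NUM_STATES };
enum cls   { CL_ALPHA, CL_DIGIT, CL_SPACE, CL_OTHER, NUM_CLASSES };

/* transition[current state][character class] -> next state */
static const enum state transition[NUM_STATES][NUM_CLASSES] = {
    /* ALPHA      DIGIT      SPACE     OTHER    */
    { ST_WORD,   ST_NUMBER, ST_START, ST_ERROR },  /* START  */
    { ST_WORD,   ST_WORD,   ST_START, ST_ERROR },  /* WORD   */
    { ST_ERROR,  ST_NUMBER, ST_START, ST_ERROR },  /* NUMBER */
    { ST_ERROR,  ST_ERROR,  ST_ERROR, ST_ERROR },  /* ERROR  */
};

static enum cls classify(int c) {
    if (isalpha(c)) return CL_ALPHA;
    if (isdigit(c)) return CL_DIGIT;
    if (isspace(c)) return CL_SPACE;
    return CL_OTHER;
}

int main(void) {
    const char *input = "abc 123 x1";
    enum state s = ST_START;

    for (const char *p = input; *p; p++) {
        s = transition[s][classify((unsigned char)*p)];
        if (s == ST_ERROR) { puts("reject"); return 1; }
    }
    puts("accept");
    return 0;
}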

I've been programming for years (mainly Python), but I don't understand what happens behind the scenes when I compile or execute my code.

In the vein of a question I asked earlier about operating systems, I am looking for a gentle introduction to programming language engineering. I want to be able to define and understand the basics of terms like compiler, interpreter, native code, managed code, virtual machine, and so on. What would be a fun and interactive way to learn about this?

If you want to know how one goes from source code to something that actually runs on a target machine, you should get a copy of the famous Red Dragon Book. I've used it for building parsers and lexical analyzers. While it dates back to 1986, and I'm sure there's been progress in the interim, as far as I can tell, it hasn't been surpassed as a text.

It appears that Addison-Wesley has done a reprint of its predecessor, the Green Dragon Book, and is passing it off as something recent, so be careful to get the genuine article.

I know there are a few questions floating around here on the subject, but it was hard to find anything useful to what I'm after...

I also know it will probably end up being quite the task to complete, but I really want to make a simple scripting language for gaming engines... I want to use it in C++ and my Android Java game engines... but I don't know where to start... I've tried looking for tutorials online, but a lot require converting things to byte code, virtual machines and such...

I really just want to create a simple scripting language which can be read from the engine, have some simple "if/else" logic... simple functions that can be called from other scripts and so on... Maybe even simpler for early versions... I really don't know where to start, but I do know this is something I need to start studying and understanding.

If anyone could point me in the right direction and point out some links to very simple "making a simple scripting language for games" kind of tutorial or even point out some key concepts that I should look into... I'd be really thankful.

I'd prefer a minimalist C based scripting language, but I guess the specifics will come into it once I've actually learnt more about it.

Thanks for any help anyone can give.

Part of my job is maintaining a proprietary game scripting language. I'm not aware of any "How-to" books on the subject in its narrow sense. There is an interesting article by Bruce Wilcox on writing such languages. It doesn't discuss implementation detail, but goes into the design process somewhat.

http://www.gamasutra.com/view/feature/1570/reflections_on_building_three_.php

Writing a scripting language is like writing any language and involves all the same principles and design questions. You need to think about what concepts your language should revolve around, you need to define a grammar, then you need to write a compiler or translator and/or an interpreter. Which you choose and the details of their implementations are entirely up to you, and there is no one or best way to accomplish these things.

There are standard ways of thinking when it comes to parsing syntax and defining language grammars. Regular expressions are your friend here. Thankfully, C++11 includes the <regex> library (originally from boost). It might help to pick up a book on compilers to get you started on important concepts, if you really want to get into the subject in depth. When I was in university taking a compilers course, this was my textbook, which I've kept with me to this day, so I recommend it.

http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886/ref=sr_1_2?ie=UTF8&qid=1326618572&sr=8-2

Writing a language is a wonderful exercise in computer science. However, if this is for a real project or product, then like others who have already commented, my professional advice is: Don't do it if you don't really have to. This is a huge potential time investment that you're considering, and you have to ask yourself: What benefits will I get out of my own language that I can't find in existing free-to-use languages like Lua and Python, and are those benefits worth the months of extra time it's going to take to implement?

And don't forget the tools. If you're writing a language that you intend others to use, you need lots of documentation and tools, tools, tools. You will almost certainly want a source level debugger at least, and writing a good debugger can take as long as writing the language itself!

If all you want is a language for fast-prototyping game logic, then Lua is likely more than suitable for your needs, and comes with lots of existing documentation, literature and tools.

I am looking for a very short working example of flex and bison with an accompanying Makefile which makes use of the builtin rules. I've tried several google results that were messy, wouldn't build, or were in C++ which isn't acceptable. Good online resources and short sample code is appreciated.


ADDITIONAL

     # Makefile example -- scanner and parser.
     # Creates "myprogram" from "scan.l", "parse.y", and "myprogram.c"
     #
     LEX     = flex
     YACC    = bison -y
     YFLAGS  = -d
     objects = scan.o parse.o myprogram.o

     myprogram: $(objects)
     scan.o: scan.l parse.c
     parse.o: parse.y
     myprogram.o: myprogram.c

I would like a Makefile which looks approximately like this with attached source files which do something arbitrarily simple.
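
A rough, untested sketch of what such "arbitrarily simple" attached sources might look like - a one-line adding calculator. The grammar, token names, and actions are invented here; scan.l includes y.tab.h because the Makefile runs bison -y -d, which generates that header:

scan.l:

%{
#include <stdlib.h>
#include "y.tab.h"        /* token definitions generated by bison -y -d */
%}
%option noyywrap
%%
[0-9]+      { yylval = atoi(yytext); return NUMBER; }
[ \t]       ;             /* skip blanks */
\n          { return NEWLINE; }
.           { return yytext[0]; }
%%

parse.y:

%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token NUMBER NEWLINE
%%
input : /* empty */
      | input line
      ;
line  : NEWLINE
      | exp NEWLINE        { printf("= %d\n", $1); }
      ;
exp   : NUMBER             { $$ = $1; }
      | exp '+' NUMBER     { $$ = $1 + $3; }
      ;
%%

myprogram.c:

int yyparse(void);
int main(void) { return yyparse(); }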

Compilers: Principles, Techniques, and Tools, Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman

Many (maybe all?) programming languages are ultimately implemented in assembly language.

How is Lisp implemented in assembly language?

Is there any good reference, manual, tutorial, or keyword for Google?

Is there any official rule/convention for building your own Lisp implementation?

For example, should tail recursion follow some particular implementation rule, or something like that?

Thanks

Although the other comments and posts are right, this question is overly vague and maybe a bit confused, I can't help but share some recommendations. I've collected a number of links and books on implementing Lisp as I have recently developed a bit of a fascination with language implementation. It's a deep topic, of course, but reading the stuff related to Lisp is especially compelling because you can skip a lot of the intense reading on parsing if you implement a Lisp compiler or interpreter in Lisp, and just use read. This allows the authors to get to the meat of compilation or interpretation quickly. These recommendations are books I've read or started or am reading, and mostly deal with Scheme, not Common Lisp, but still might be of some interest.

If you've got no background in language implementation, and have never had the pleasure of reading about the classic Lisp and Scheme "metacircular evaluators", I would recommend Structure and Interpretation of Computer Programs. If you've seen Lisp-in-Lisp (or Scheme-in-Scheme...) you can skip ahead. In the last two chapters of SICP, the authors present a few different interpreters for Lisp/Scheme and a few variants, as well as a byte-code compiler and a virtual machine. It's simply a brilliant book, and free.

If you don't have time to read SICP, or don't want to slog through it just to get to the interpretation and compilation chapters, I would recommend The Little Schemer. Even though it's very short and intended for newcomers to Lisp and Scheme, if you've never seen a Lisp interpreter written in Lisp, they do present one, and it's quite a delightful book, but might not be for everyone due to the cute style.

There's another free book on Scheme similar to SICP, called An Introduction to Scheme and its Implementation, which I haven't read but used as a reference for a few bits. There are sections on interpreters and compilers in there, and it seems to go a bit deeper than SICP, dealing with hairier things like parsing too. It maybe needed an editor, but it's an impressive offering nonetheless.

With a decent idea of how to do Lisp in Lisp, you can approach implementing Lisp in something lower level.

Lisp in Small Pieces is frequently recommended. I've read most of it, and can say it's definitely a great book, full of nitty gritty stuff. I'm going back over it with a fine comb, because it's easy to skim when you don't understand stuff. I also struggled with getting the code from the author's site to run; if you get it, I recommend using Gambit Scheme and running the code that relies on Meroonet with Meroon, from this distribution. Lisp in Small Pieces presents a number of interpreters written in Scheme as well as a byte-code compiler and a compiler-to-C.

Lisp in Small Pieces moves fast, and it's quite dense. If it's too much for you, perhaps start with The Essentials of Programming Languages. I've read some of it and it's quite good, but it's more focused on interpreters. Apparently one of the older editions (1st? I'm not sure...) included a compiler. There seems to be a lot of change between the 3 editions, but the first is super cheap on Amazon, so check it out.

As for compilation to C, this is kind of a gross subject with lots of hairy bits. Compilation to C brings up all these weird corner issues, like how to optimize tail-calls and handle closures, first-class continuations and garbage collection, but it's pretty interesting, and a lot of "real" implementations of Scheme go this route. Marc Feeley's presentation on this is pretty interesting, titled The 90 Minute Scheme to C compiler.

I have fewer resources on compiling all the way down to assembly, but there is an often-recommended paper which introduces compilation of Scheme to x86, called An Incremental Approach to Compiler Construction. It assumes little of the reader; however, I found that it simply goes too fast and doesn't fill in enough details. Maybe you'll have better luck.

A lot of the above recommendation comes from this monster comment on Hacker News from over a year ago, from mahmud. It references a number of ML resources, and compilation using continuations. I haven't gotten that far in my study, so I can't say what's good or not. But it's an incredibly valuable comment. The referenced works include Andrew Appel's "Compiling with Continuations" and Paul Wilson's "Uniprocessor Garbage Collection Techniques" paper.

Good luck!

That's a big question to answer well.

Short answer: JIT.

Big answer: Dragon book.

Can anybody give me a clear and concise explanation of the last two chapters of SICP (Structure and Interpretation of Computer Programs), Chapter 4 (Metalinguistic Abstraction) and Chapter 5 (Computing with Register Machines)?

I'd also like to know if (and how) these two chapters differ in content from a standard undergraduate compiler course.

Long Disclaimer, actual answer below

If you really want information about how to build a compiler from scratch and need to be up to speed with all the related practical parsing, compilation, generation and optimization techniques, then the dragon book would be the better option.

If you want to build a clean programming language from scratch using an interpreter, I would recommend Friedman's EPL book.

If what you are after with your bachelor paper is a deeper understanding of all the fundamental issues in both previous books, then read on to my answer below. SICP is an educational work; it tries to convey the essential concepts in as clear a language as possible. It will not go into details about left-recursive parsers, common sub-expression elimination, x86 SSE extensions, and so forth.

SICP CH4-5

The technique used to explain the complex concepts involved is by constructing a series of computer languages from scratch before your eyes.

Chapter 4 starts by building a meta-circular Scheme interpreter: a small Scheme interpreter written in Scheme itself. This will give you the basics of a recursive interpreter, and forms a basis for the construction of a series of mini-languages in the rest of ch4-5. It answers the question of how to represent your parsed code, what datastructures are involved, how to separate the host from the base language, etc. Importantly, it shows you that a language interpreter is itself just another computer program. The rest of Chapter 4 shows you how you can vary the flavor of your language by modifying the previous interpreter. The two big ones are lazy evaluation and logic programming.

Chapter 5 makes a rough model of 'register machines', to represent your present-day computer at an abstract level. They construct a little register-machine language that acts as an assembly language for all intents and purposes. They introduce all the data structures and control flow constructs needed to do the next bit: building a Scheme interpreter in this machine language, still somewhat similar to the metacircular interpreter. Afterwards they jump off the deep end and construct a Scheme compiler in their register-machine language, complete with assembly step, tail-recursion optimization, garbage collection, lexical addressing, tracing, etc.

Although SICP constructs toy interpreters and compilers, these are conceptually complete enough to get you up to speed in no time with current practical techniques. GCC's intermediate code looks a lot like SICP's register machine code for example and these guys implemented SICP's interpreter for ARM micro-controllers straight in assembly.

I am reading this bison introduction.

I have two questions and it will be great if someone can help me understand:

  1. What does the term context-free grammar mean?

  2. From the link above: Not all context-free languages can be handled by Bison, only those that are LALR(1). In brief, this means that it must be possible to tell how to parse any portion of an input string with just a single token of look-ahead. What does it mean by "possible to tell how to parse any portion of an input string with just a single token of look-ahead"?

1) In simple terms -- it means that the formatting and context of the code are not important for the individual parts, and you don't have to see how it is formatted to understand the meaning. As an example, you could write your C program in one line, or with every word on a different line, and the program would mean the same. Wikipedia has a more formal definition.

2) What LALR(1) means is more complicated, but in simple terms I would describe it as understanding the meaning incrementally by reading one word at a time -- only seeing the next word/symbol. Some spoken languages are famous for not being LALR(1) -- where you only understand whether the sentence is a question or a statement when you get to the end of the sentence. Some programming languages could be constructed that way as well, but I'm not aware of any, as for practical reasons they all conform to an LALR(1) syntax. I do however think there are exceptions where C/C++ needs a 2-symbol look-ahead to correctly parse the statements (hence being LALR(2)), but since I cannot remember off the top of my head what they are, I hope somebody will point that out in a comment.
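
To make the "single token of look-ahead" idea concrete, consider this tiny illustrative grammar (my own example, not from the linked introduction):

expr   : expr '+' term | term ;
term   : term '*' factor | factor ;
factor : NUMBER ;

After the parser has read a NUMBER and reduced it to a term, one look-ahead token is enough to decide the next action: if the next token is '*' it keeps shifting to build a bigger term, and if it is '+' or the end of the input it reduces the term to an expr. Roughly speaking, a grammar is LALR(1) when every such shift/reduce decision can be settled by a single upcoming token, which is what Bison requires.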

In any case, this book is the bible when it comes to understanding compilers and parsers.

Possible Duplicate:
Learning to Write a Compiler

How do I design a simple compiler? I want to design a compiler for an MCA project.

In my ongoing effort to quench my undying thirst for more programming knowledge, I have come up with the idea of attempting to write a (at least for now) simple programming language that compiles into bytecode. The problem is I don't know the first thing about language design. Does anyone have any advice on a methodology to build a parser, and on what basic features every language should have? What reading would you recommend for language design? How high-level should I be shooting for? Is it unrealistic to hope to be able to include a feature to allow one to inline bytecode in a way similar to gcc allowing inline assembler? Seeing as I primarily code in C and Java, which would be better for compiler writing?

You might want to read a book on compilers first.

For really understanding what's going on, you'll likely want to write your code in C.

Java wouldn't be a bad choice if you wanted to write an interpreted language, such as Jython. But since it sounds like you want to compile down to machine code, it might be easier in C.

I've been parsing poker hand histories for the past year and have learned quite a deal about parsing in general.

We started with regexes but quickly realized that wouldn't scale easily. We skipped between languages from Ruby to C++ and finally came to grips with the fact that it was the algorithm that had to change.

We picked up Boost::Spirit and watched our speed dramatically rise to more than 10 times our original speed. We then skipped over to Java and are currently using ANTLR to create grammars for each site. This is definitely the fastest method yet, and it's very thorough, which is nice because you know exactly where you stand in terms of a 'complete' grammar. Unfortunately, I have spent incredible amounts of time working with these grammars -- they work pretty damn well but not perfectly yet.

Anyways, enough with the background on to the question at hand -- are there any 'exotic' or less well known techniques to parsing that I'm not aware of? I only know of lexing/parsing a grammar and the other inferior regex/loop method.

For those of you who are not familiar with poker hand histories I'll post one so you can tell what the structure is.

Full Tilt Poker Game #12037626529: Table durrrr (heads up, deep) - $500/$1000 -
Pot Limit Omaha Hi - 2:00:48 ET - 2009/05/05
Seat 1: durrrr ($196,456.50)
Seat 2: Gus Hansen ($65,499)
durrrr posts the small blind of $500
Gus Hansen posts the big blind of $1,000
The button is in seat #1
*** HOLE CARDS ***
durrrr raises to $3,000
Gus Hansen raises to $9,000
durrrr calls $6,000
*** FLOP *** [3d 4d 7d]
Gus Hansen has 15 seconds left to act
Gus Hansen checks
durrrr checks
*** TURN *** [3d 4d 7d] [Jh]
Gus Hansen checks
durrrr checks
*** RIVER *** [3d 4d 7d Jh] [Ah]
Gus Hansen has 15 seconds left to act
Gus Hansen checks
durrrr has 15 seconds left to act
123stayfree (Observer): GUS I NOW BRING U LUCK
durrrr bets $7,600
Gus Hansen has 15 seconds left to act
Gus Hansen has requested TIME
Hernandez777 (Observer): Gus has the super-duper nuts
Gus Hansen calls $7,600
Podobed45 (Observer): fluuuuuuuuuush
*** SHOW DOWN ***
durrrr shows [Kc 3s Qd As] two pair, Aces and Threes
Gus Hansen mucks
durrrr wins the pot ($33,199.50) with two pair, Aces and Threes
*** SUMMARY ***
Total pot $33,200 | Rake $0.50
Board: [3d 4d 7d Jh Ah]
Seat 1: durrrr (small blind) collected ($33,199.50)
Seat 2: Gus Hansen (big blind) mucked

I'm well aware of other methods of collecting the information (such as screen-scraping and dll injection) but the need to transform the hand history into structured data is still there so I'm looking only at methods that grab the info such as regex/grammars...

I think if I don't find something I'm going to rewrite our grammars with ocamllex/ocamlyacc.

update

fyi: regexen speed was ~60 hands/sec while the grammars were processing 600+ hands/sec... the entire hand is transformed into XML after the data is all sorted out... there are between 20-30 regexes needed (at last count) for EACH site you want to parse... each site on the grammar side has its own grammar with ungodly amounts of lexer/parser rules (but it's still a smaller code size)

I do have the dragon book and have been reading through it -- which has spurred my interest in using ocamllex/ocamlyacc... speed is the name of the game here..

Read the Dragon Book: http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886

It covers lexical and syntactical analysis in depth (among other topics). You can use this to help you understand the "language" you are trying to parse to determine the best way to go about it.

I'm designing a programming language for my personal use and education. The first versions of the reference compiler will compile the source to some other language like C. What things should I keep in mind to ensure the compile times will be fast for both compiling to another source and binary executable? Any other things that are good to know?

Even though I talk about compile speeds the main focus of the question is the language, not the compiler.

If your focus is on learning compiler design, I think compile speed won't be your priority. I suggest you first implement a top-down parser manually using the recursive descent technique, which is easy and straightforward. Also use a hand-written lexer designed as a finite state machine, which is also kind of easy, but very rich conceptually.

While defining the syntax of your language, you must ensure it's consistent and unambiguous. If you are familiar with Prolog, you could use Definite Clause Grammars (DCG) to play with your language before implementing it. I find it quite useful.

Then you can move on to implementing a lexer and parser using tools that generate them (like Bison, ANTLR, Lemon, Yacc, etc.).

Another piece of advice is to keep it simple. You can implement a subset of the language you want to build and polish it until you feel satisfied with it, then expand it by implementing new features, and so on. You will find yourself growing your language and your compiler and seeing how it gets bigger and more complete, and as it gets better and better you will feel more confident and satisfied. It's a nice and beautiful challenge after all.

Have fun. Learning compiler design is a fun and very interesting subject!

PS. Read the dragon book

It is a university task in my group to write a compiler for a C-like language. Of course I am going to implement a small part of our beloved C++.
The exact task is absolutely stupid, and the lecturer told us it needs to be self-compilable (should be able to compile itself) - so he meant not to use libraries such as Boost and STL.
He also does not want us to use templates because they are hard to implement.
The question is: is it realistic for me, as I'm going to write this project on my own, with the deadline at the end of May to the middle of June (this year), to implement not only templates, but also nested classes, namespaces, and virtual function tables at the level of syntax analysis?
PS: I am not a noobie in C++

I will like to stress a few points already mentioned and give a few references.

1) STICK TO THE 1989 ANSI C STANDARD WITH NO OPTIMIZATION.

2) Don't worry, with proper guidance, good organization and a fair amount of hard work this is doable.

3) Read the The C Programming Language cover to cover.

4) Understand important concepts of compiler development from the Dragon Book.

5) Take a look at lcc, both the code and the book.

6) Take a look at Lex and Yacc (or Flex and Bison)

7) Writing a C compiler (up to the point it can self compile) is a rite of passage ritual among programmers. Enjoy it.

I don't know if this is the right SE site for this question, if not, I would appreciate anyone who could point me in the right direction.

For my college project (UK college ~ high school) I want to design a basic programming language. It won't have all the necessary functionality, but enough to write some basic programs on the Console. I want to make it interpreted as I've heard how horribly complex compiled languages are; object-oriented because I only know VB.NET and am most comfortable with OOP; and my goal is to create a simple language which is easily learned by non-programmers.

I've been looking around but struggling to find helpful resources that explain about creating programming languages in any good detail. I'd really appreciate any online resources you can suggest - they have to be free - if there are similar StackOverflow questions that I missed, in-depth online articles or tutorials, extracts from free online textbooks... anything you think might be useful.

Aho and Ullman have an excellent (and very deep) book on compilers. http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886

However, if you want a quick recipe for writing a simple compiler, it might be too deep. Still, it may be good to have it for reference.

Look no further than SICP - this book will enlighten you about the principles of programming and programming languages, and in its last two chapters it'll teach you how to build an interpreter and a compiler for the Scheme programming language - written in Scheme.

I can assure you, the material in the book will profoundly change the way you think about computation. Coupled with the DrRacket IDE, you'll have a great environment to learn how to create your own programming language starting from first principles.

Another recommended book would be Essentials of Programming Languages, although the material covered there is a bit more advanced. It'll also show you how to implement feature-rich languages in Scheme, this time including typed languages and an OOP language.

I realize that this question will raise some eyebrows, and I realize that JavaScript is traditionally an interpreted language, please let me explain:

I am a software engineer specializing in web applications (.NET stack specifically). As a hobby, I enjoy creating RC UAVs that run on Arduino-based components.

There are several other things I would like to do with Arduino as well but, frankly, C / C++ is not my strongest language and I don't want to spend limited spare time reading books on C.

It has occurred to me--and many others I am sure--that the Arduino / embedded ecosystem would be far richer if the language used to interface with it were a more common one. JavaScript seems like a good candidate to me because most software developers know it and the culture of building open source frameworks and plugins is very strong in the JavaScript world.

So, back to my first question: If I wanted to be able to write and compile code for my Arduino in JavaScript, how would I get started? I am envisioning an open-source project of course, but I need some help getting traction. I have never written a compiler, so any help is appreciated.

This is quite an ask: the microcontroller in the Arduino UNO, LEO, etc. is the ATmega328p, which has 32K of flash for program storage, 2K of RAM, and 2K of EEPROM (for persistent storage). That is quite tight for a language like JavaScript.

Now someone has written a Javascript compiler for the ATmega128, which you will find in the Arduino Mega, which has 4K of RAM and much more flash.

If you move up to the Arduino DUE, the Arduino Zero, or the Teensy 3.x — all of which are ARM based — then you can look into Espruino which is a version of JavaScript for ARM, but you will still have to port it to the Arduino hardware.

So if what you really want is an embedded board that can run JavaScript, then I would just look at the Espruino board itself.

Finally if you are still set on JavaScript for the ATmega328p, then you should look into writing a JavaScript to C++ translator for a subset of the JavaScript language. The scope of doing this is well outside of a SO reply, so I would suggest starting with the famous Dragon Book as it is still probably the best resource for learning how to write compilers.

Is it possible to create an AST for any arbitrary programming language or IR using C or C++ alone (without the help of tools like YACC and LEX)?

If so, how do I implement the lexical and syntactic analysis?

If not, what tools have to be added to C or C++ to successfully create an AST?

I hope I made my doubt clear. If my question looks vague or out of context, please indicate what is required.

P.S.: I am actually trying to create the AST for LLVM's .ll format of IR representation. I do know that .ll is derived from an AST, but I am trying out static analysis practices, so I am looking at creating the AST.

The most straight-forward methodology for creating the parser without a parser-generator is recursive descent. It is very well documented - the standard book in the field is The Dragon Book.

A scanner, that takes text as input and produces a string of tokens as output, can be written using standard string manipulation techniques.
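
As a rough sketch of the idea (the grammar, which only covers sums and parentheses, and all function names are invented for illustration; a real front end would build and return AST nodes instead of evaluating), each grammar rule becomes one C function, with a tiny scanner folded in as character tests:

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

/* grammar: expr -> term (('+'|'-') term)* ;  term -> NUMBER | '(' expr ')' */

static const char *p;                 /* scanner position */

static void skip_ws(void) { while (isspace((unsigned char)*p)) p++; }

static long expr(void);               /* one function per grammar rule */

static long term(void) {
    skip_ws();
    if (*p == '(') {                  /* term -> '(' expr ')' */
        p++;
        long v = expr();
        skip_ws();
        if (*p != ')') { fprintf(stderr, "expected ')'\n"); exit(1); }
        p++;
        return v;
    }
    if (isdigit((unsigned char)*p))   /* term -> NUMBER */
        return strtol(p, (char **)&p, 10);
    fprintf(stderr, "unexpected character '%c'\n", *p);
    exit(1);
}

static long expr(void) {              /* expr -> term (('+'|'-') term)* */
    long v = term();
    for (;;) {
        skip_ws();
        if (*p == '+')      { p++; v += term(); }
        else if (*p == '-') { p++; v -= term(); }
        else return v;
    }
}

int main(void) {
    p = "1 + (2 - 3) + 40";
    printf("%ld\n", expr());          /* prints 40 */
    return 0;
}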

Does a configuration parsing library exist already that will read the following style of file:

Keyword Label Value;

With nesting by { } replacing Values; optional Labels; support for "Include" would be nice.

An example configuration file might look like:

Listen Inside 127.0.0.1:1000;
Listen Outside {
    IP 1.2.3.4;
    Port 1000;
    TLS {
        CertFile /path/to/file;
    };
};
ACL default_acl {
    IP 192.168.0.0/24;
    IP 10.0.0.0/24;
};

What programming languages are you familiar with? My impression from your question is C.

It looks like the tokens of your configuration language are regular expressions:

  • Listen
  • 127.0.0.1:1000
  • 1000
  • ;
  • {
  • }
  • etc.

Almost all modern programming languages have some form of support for those.

If the implementation is C, I'd probably use flex. It generates a function which will apply a set of regular expressions, put the matched text into a C string, and return the type of that regular expression (just an int, which you choose). The function is a 'lexical analyser' or 'tokeniser'. It chops up streams of characters into handy units that match your needs, one regular expression at a time.

Flex is pretty easy to use. It has several advantages over lex. One is that you can have multiple lexical analysers functions, so if you need to do something odd for an include file, then you could have a second lexical analyser function for that job.

Your language looks simple. Bison/Yacc are very powerful tools, and "with great power comes great responsibility" :-)

I think it is sufficiently simple that I might just write a parser by hand. It might only be a few functions to handle its structure. A very straightforward technique is called recursive descent parsing. Have you got a CS degree, or do you understand this stuff?

Lots of people will (at this stage) tell you to get the 'Dragon Book' or one of its newer versions, often because that is what they had at college. The Dragon book is great, but it is like telling someone to read all of Wikipedia to find out about whales. Great if you have the time, and you'll learn a lot.

A reasonable start is the Wikipedia Recursive Descent parser article. Recursive descent is very popular because it is relatively straightforward to understand. The thing that makes it straightforward is to have a proper grammar which is cast into a form which is easy for recursive descent to parse. Then you literally write a function for every rule, with a simple error handling mechanism (that's why I asked about this). There are probably tools to generate them, but you might find it quicker to just write it. A first cut might take a day, then you'd be in a good position to decide.

A very nifty lex/flex feature is that any characters which are not matched are just echoed to standard output. So you can see what your regular expressions are matching, and can add them incrementally. When the output 'dries up', everything is being matched.

Pontification alert: IMHO, more C programmers should learn to use flex. It is relatively easy to use, and very powerful for text handling. IMHO lots are put off because they are also told to use yacc/bison which are much more powerful, subtle and complex tools. end Pontification.

If you need a bit of help with the grammar, please ask. If there is a nice grammar (might not be the case, but so far your examples look okay) then implementation is straightforward.

I found two links to stackoverflow answers which look helpful:

Here is an example of using flex.

Flex takes a 'script', and generates a C function called yylex(). This is the input script.

Remember that all of the regular expressions are being matched within that yylex function, so though the script looks weird, it is really an ordinary C function. To tell the caller, which will be your recursive descent parser, what type of regular expression is matched, it returns an integer value that you choose, just like any ordinary C function.

If there is nothing to tell the parser about, like white space, and probably some form of comment, it doesn't return. It 'silently' consumes those characters. If the syntax needs to use newline, then that would be recognised as a token, and a suitable token value returned to the parser. It is sometimes easier to let it be more free form, so this example consumes and ignores all white space.

Effectively the yylex function is everything from the first %% to the second %%. It behaves like a big switch() statement.
The regular expressions are like (very exotic) case: labels.
The code inside the { ... } is ordinary C. It can contain any C statements, and must be properly nested within the { ... }

The stuff before the first %% is the place to put flex definitions, and a few 'instructions' to flex. The stuff inside %{ ... %} is ordinary C, and can include any headers needed by the C in the file, or even define global variables.

The stuff after the second %% is ordinary C, with no need for extra syntax, so no %{ ... %}.

/* scanner for a configuration files */ 
%{ 
    /* Put headers in here */ 
#include <config.h>

%} 
%%
[0-9]+                  { return TOK_NUMBER; } 
[0-9]+"."[0-9]+"."[0-9]+"."[0-9]+":"[0-9]+ { return TOK_IP_PORT; } 
[0-9]+"."[0-9]+"."[0-9]+"."[0-9]+"/"[0-9]+ { return TOK_IP_RANGE; } 
"Listen"                { return TOK_KEYWORD_LISTEN; } 
[A-Za-z][A-Za-z0-9_]*   { return TOK_IDENTIFIER; }
"{"                     { return TOK_OPEN_BRACE; }
"}"                     { return TOK_CLOSE_BRACE; }
";"                     { return TOK_SEMICOLON; }
[ \t\n]+        /* eat up whitespace, do nothing */ 
.           { fprintf(stderr,  "Unrecognized character: %s\n", yytext ); 
              exit(1); 
            }
%%

/* -------- A simple test ----------- */
int main(int argc, char *argv[])
{ 
    int tok;
    yyin = stdin;
    while ((tok = yylex())) {
        fprintf(stderr, "%d %s\n", tok, yytext);
    }
    return 0;
}

That has a minimal, dummy main, which calls the yylex() function to get the next token (enum) value. yytext is the string matched by the regular expression, so main just prints it.

WARNING, this is barely tested, little more than:

flex config.l
gcc lex.yy.c -ll
./a.out <tinytest

The values are just integers, so an enum in a header:

#ifndef _CONFIG_H_
#define _CONFIG_H_

enum TOKENS { 
 TOK_KEYWORD_LISTEN = 256, 
 TOK_IDENTIFIER = 257, 
 TOK_OPEN_BRACE = 258, 
 TOK_CLOSE_BRACE = 259, 
 TOK_SEMICOLON = 260, 
 TOK_IP_PORT = 261, 
 TOK_IP_RANGE = 262, 
 TOK_NUMBER = 263, 
};
#endif /* _CONFIG_H_ */

In your parser, call yylex when you need the next value. You'll probably wrap yylex in something which copies yytext before handing the token type value back to the parser.

You will need to be comfortable handling memory. If this were a large file, maybe use malloc to allocate space. But for small files, and to make it easy to get started and debug, it might make sense to write your own 'dumb' allocator. A 'dumb' memory management system can make debugging much easier. Initially just have a big char array statically allocated and a mymalloc() handing out pieces. I can imagine the configuration data never gets free()'d. Everything can be held in C strings initially, so it is straightforward to debug because the exact sequence of input is in the char array. An improved version might 'stat' a file and allocate a piece big enough.
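
As an illustration of that idea (a sketch only; mymalloc is the name suggested above, and the pool size is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char pool[64 * 1024];     /* statically allocated backing store */
static size_t used;

static void *mymalloc(size_t n)
{
    void *p;
    n = (n + 7) & ~(size_t)7;    /* keep everything 8-byte aligned */
    if (used + n > sizeof pool) {
        fprintf(stderr, "config: memory pool exhausted\n");
        exit(1);
    }
    p = pool + used;
    used += n;
    return p;                    /* never freed - fine for config data */
}

/* copy yytext-style strings into the pool so they outlive the next token */
static char *mystrdup(const char *s)
{
    char *p = mymalloc(strlen(s) + 1);
    return strcpy(p, s);
}

int main(void)
{
    char *keyword = mystrdup("Listen");
    printf("stored '%s' at offset %zu\n", keyword, (size_t)(keyword - pool));
    return 0;
}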

How you deal with the actual configuration values is a bit beyond what I can describe. Text strings might be all that is needed, or maybe there is already a mechanism for that. Often there is no need to store the text value of 'Keywords', because the parser has recognised what it means, and the program might convert other values, e.g. IP addresses, into some internal representation.

How do I fetch the row and column number of an error (i.e., which part of the string does not follow the grammar rules)? I am using a yacc parser to check the grammar. Thank you.

You'd better read the Dragon Book and the Aho book, which explain and show examples of how to write a lex/yacc-based compiler.

In order to get the line/column of the error, you will need to make your lexer keep track of the line and column. So in your lexer, you have to declare two globals, SourceLine and SourceCol (of course you can use better, non-camel-cased names).

In each token production, you have to calculate the column of the produced token, for that purpose I use a macro as follows:

#define Return(a, b, c)    \
{\
    SourceCol = (SourceCol + yyleng) * c; \
    DPRINT ("## Source line: %d, returned token: "a".\n", SourceLine); \
    return b; \
}

and the token production, with that macro, is:

"for" { Return("FOR", FOR, 1);

then to keep lines, for each token that makes a new line, I'm using:

{NEWLINES}     {
    BEGIN(INITIAL);
    SourceLine += yyleng;
    Return("LINE", LINE, 0);
}

Then in your parser, you can get SourceCol and SourceLine if you declare those as extern globals:

extern unsigned int SourceCol;
extern unsigned int SourceLine;

and now in your parse_error grammar production, you can do:

parse_error : LEXERROR
{ 
    printf("OMG! Your code sucks at line %u and col %u!", SourceLine, SourceCol); 
}

Of course you may want to add yytext, handle a more verbose error message, etc. But all that's up to you!

I am designing a compiler in C. I want to know which technique I should use, top-down or bottom up? I have only implemented operator precedence using bottom up. I have applied the following rules:

E:=E+E
E:=E-E
E:=E/E
E:=E*E
E:=E^E

I want to know whether I am going the right way. If I want to include if-else, loops, arrays, and functions, do I need to implement parsing? If yes, how do I implement it? I have only implemented token collection and operator precedence so far. What are the next steps?

Lex & Yacc is your answer. Or Flex and Bison which are branched version of original tools.

They are free, they are the real standard for writing lexers and parsers in C, and they are used everywhere.

In addition, O'Reilly has released a small pearl of 300 pages: Flex & Bison. I bought it and it really explains how to write a good parser for a programming language and handle all the subtle things (error recovery, conflicts, scopes and so on). It will also answer your question about how you are parsing expressions: your approach is right for a top-down parser, but you'll discover that it is not enough to handle operator precedence.
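
For example, the ambiguous expression grammar from the question can be kept as-is in Bison by declaring operator precedence and associativity up front; a minimal illustrative fragment (not a complete parser):

%token NUMBER
%left '+' '-'
%left '*' '/'
%right '^'
%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | expr '^' expr
     | '(' expr ')'
     | NUMBER
     ;

The %left/%right declarations tell the generator how to resolve the shift/reduce conflicts that the ambiguous rules would otherwise cause, so 1+2*3 parses as 1+(2*3).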

Of course, for a hobby, you could write your own lexer and parser, but it would be just an academic effort that is nice for understanding how FSMs and parsers work, but not so much fun :)

If you are, instead, interested in programming language design or complex implementations, I suggest this book: Programming Language Pragmatics, which is not as famous as the Dragon Book but really explains why and how various features can and should be implemented in a compiler. The Dragon Book is a bible too, and it covers at a really low level how to write a parser... but it can be sort of boring, I warn you.

Your question is quite vague and hard to answer without a more specific, detailed question. The "Dragon Book" is an excellent reference, though, for someone seeking to implement a compiler from scratch, or, as others have pointed out, so are Lex and Yacc.

I basically need to make a compiler for bibtex files, such that a given bibtex database can be queried. Now I'm familiar with certain aspects of theory, like automata, grammars, SLR,LR(1) and LALR parsing. However, I still find all of that theoretical and abstract, since I've never applied it. It would help a lot if someone could outline the solid steps required to build a compiler. I'll be using flex and bison/yacc probably, so if you could let me know how exactly the design process goes, what files are generated at what stage, what is the output at each stage, and in general how things tie together, I can probably get a more practical view of how things are done...

Thanks!

I am not a compiler expert but I do know that this book is actually regarded as necessary reading for anyone that wants to write a compiler. Yes the cover is antiquated but from what I have read it still has many good patterns relevant to code compilation:

http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886

I'm wanting to play around with creating an LR parser generator. Does anyone know of a good (free) resource describing how to create a state machine table from a grammar?

For a free resource, consider the Python source code for PLY - a full implementation of Lex and Yacc in Python.

I would recommend a book, however, and you can hardly do better than get The Dragon Book.

The obvious answer is the standard compiler text by Aho/Sethi/Ullman, Compilers: Principles, Techniques, and Tools

It has complete chapters on parsing. It isn't free, but it is worth every last penny, and if you are going to build parsers or other compiler-like tools, you are simply crazy if you don't have this book.

My application should read a C# code sample that is unindented, then indent the code programmatically. The way I am doing it may not be correct, but I could still achieve partial results.

I could add whitespace when a { is found, then continue with the same amount of space for the rest of the lines being read. When another { is found, I again add spaces and continue with this new indentation for the rest of the lines. For that, this is what I did:

    private void btn_format_Click(object sender, EventArgs e)
    {
        string lineInfo = "";
        string fl = "";
        string ctab= char.ConvertFromUtf32(32)+char.ConvertFromUtf32(32)+char.ConvertFromUtf32(32);
        foreach (string line in txt_codepage.Lines) // text_codepage is a textbox with code
        {
            if (line.Contains("{"))
            {
                string l = line.Replace("{", ctab+"{");
                lineInfo = lineInfo + (l + "\n");
                fl = fl + ctab;
                ctab = ctab + ctab;
            }
            else
            {
                lineInfo = lineInfo + (char.ConvertFromUtf32(32)+fl+ line + "\n");
            }

I could achieve the proper indentation that I want till here. Now when I find a } I should do the reverse process but unfortunately that is not possible with strings. The reverse process that I meant is this:

            if (line.Contains("}"))
            {
                string l = line.Replace(ctab + "}", "}");
                lineInfo = lineInfo + (l + "\n");
                fl = fl - ctab;
                ctab = ctab - ctab;
            }
            else
            {
                lineInfo = lineInfo - (char.ConvertFromUtf32(32) + fl + line + "\n");

            }
        }
        MessageBox.Show(lineInfo.ToString());

I know the above part of the code is a complete blunder, but let me know how to achieve it in the correct way.

If you want to parse a string, you should use StringBuilder instead of string concatenation (concatenation is too slow). I wrote some code to demonstrate how you can parse C# or other code. It is not a full example, just the basic concepts.

If you want learn more about parsers you can read Compilers: Principles, Techniques, and Tools.

public static string IndentCSharpCode(string code)
{
    const string INDENT_STEP = "    ";

    if (string.IsNullOrWhiteSpace(code))
    {
        return code;
    }

    var result = new StringBuilder();
    var indent = string.Empty;
    var lineContent = false;
    var stringDefinition = false;

    for (var i = 0; i < code.Length; i++)
    {
        var ch = code[i];

        if (ch == '"' && !stringDefinition)
        {
            result.Append(ch);
            stringDefinition = true;
            continue;
        }

        if (ch == '"' && stringDefinition)
        {
            result.Append(ch);
            stringDefinition = false;
            continue;
        }

        if (stringDefinition)
        {
            result.Append(ch);
            continue;
        }

        if (ch == '{' && !stringDefinition)
        {
            if (lineContent)
            {
                result.AppendLine();
            }

            result.Append(indent).Append("{");

            if (lineContent)
            {
                result.AppendLine();
            }

            indent += INDENT_STEP;
            lineContent = false;

            continue;
        }

        if (ch == '}' && !stringDefinition)
        {
            if (indent.Length != 0)
            {
                indent = indent.Substring(0, indent.Length - INDENT_STEP.Length);
            }

            if (lineContent)
            {
                result.AppendLine();
            }

            result.Append(indent).Append("}");

            if (lineContent)
            {
                result.AppendLine();
            }


            lineContent = false;

            continue;
        }

        if (ch == '\r')
        {
            continue;
        }

        if ((ch == ' ' || ch == '\t') && !lineContent)
        {
            continue;
        }

        if (ch == '\n')
        {
            lineContent = false;
            result.AppendLine();

            continue;
        }

        if (!lineContent)
        {
            result.Append(indent);
            lineContent = true;
        }

        result.Append(ch);
    }

    return result.ToString();
}

Say I have a very basic rule like the one below

if a<b
  then set a=25
else set a=-25

I need to evaluate this using Java. What is the best approach for this kind of parsing? Using regular expressions could be very difficult and would be very hard to maintain, right?

Can someone please suggest an approach? Please let me know if you need any further information.

If your syntax is regular, you can use regular expressions. If it is not, you need to use more complicated tools such as lexers and parsers, e.g., ANTLR or Yacc/Lex/Bison.

If you don't know much about language design, I suggest you read a book on the subject before you dive in.

Consider the following mark-up input:

* Line 1
* Line 2
:* Line 2.1
:* Line 2.2
* Line 3

This is typically coded as:

  <ul>
    <li>Line 1</li>
    <li>Line 2</li>
    <ul>
      <li>Line 2.1</li>
      <li>Line 2.2</li>
    </ul>
    <li>Line 3</li>
  </ul>

My questions:

  • What would be a good representation for the same input using a single line?
  • What is the regular expression to generate the corresponding XHTML?

For example, the single line input format could be:

> Line 1 > Line 2 >> Line 2.1 >> Line 2.2 > Line 3

With > being unordered list item delimiter. I chose > because the text might include typical punctuation marks. Using » (or other such non-104-key keys) would be fun, but not as easy to type.

The line input format could also be:

[Line 1][Line 2 [Line 2.1][Line 2.2]][Line 3]

Update #1 - The problem is a little simpler. The number of nests can be limited to three. A general solution for n-levels deep would still be cool.

Update #2 - XHTML, not HTML.

Update #3 - Another possible input format.

Update #4 - Java solutions (or pure regex) are most welcome.

Update #5

Revised code:

String in = " * Line 1 * Line 2 > * Line 2.1 * Line 2.2 < * Line 3";

String sub = "<ul>" + in.replace( " > ", "<ul>" ) + "</ul>";

sub = sub.replace( " < ", "</ul>" );

sub = sub.replaceAll( "( | >)\\* ([^*<>]*)", "<li>$2</li>" );

System.out.println( "Result: " + sub );

Prints the following:

Result: <ul><li>Line 1 </li>* Line 2<ul>* Line 2.1<li>Line 2.2</li></ul>* Line 3

I would not recommend using regular expressions as a parsing and transformation tool. Regular expressions tend to have high overhead, and are not the most efficient means of parsing a language...which is what you are really asking it to do. You have created a language, simple as it is, and you should treat it as such. I recommend writing an actual, dedicated parser for your WIKI-style formatting code. Since you can target the parser specifically to your language, it should be more efficient. In addition, you won't have to create some frightening monstrosity that is a regex to parse your language and handle all of its nuances. In the long run, you gain the benefits of clearer code, better maintainability, etc.

I suggest the following resources:

I have created a very simple SQL parser, but during fuzz testing I've come across this situation:

SELECT   123     +      ,
K_SELECT INTEGER T_PLUS T_COMMA

Of course this is a syntax error, but I don't know how to "catch" it.

How does it decide between "next_column_expression came too early" and "binary_expression didn't finish"? I've worked with ANTLR3 a fair bit on a Java project, but this is totally different.

Here are the skeleton parser rules:

/* be more versbose about error messages */
%error-verbose

/* keywords */
%token K_CREATE
%token K_FROM
%token K_INTEGER
%token K_SELECT
%token K_TABLE
%token K_TEXT
%token K_WHERE
%token K_VALUES
%token K_INSERT
%token K_INTO

/* variable tokens */
%token IDENTIFIER
%token INTEGER

/* fixed tokens */
%token T_ASTERISK
%token T_PLUS
%token T_EQUALS
%token T_END ";"
%token T_COMMA
%token T_BRACKET_OPEN
%token T_BRACKET_CLOSE

%token END 0 "end of file"

%%

input:
    statement {
    }
    END
;

statement:
    select_statement {
    }
    |
    create_table_statement {
    }
    |
    insert_statement {
    }
;

keyword:
    K_CREATE | K_FROM | K_INTEGER | K_SELECT | K_TABLE | K_TEXT | K_WHERE | K_VALUES | K_INSERT | K_INTO
;

table_name:
    error {
        // "Expected table name"
    }
    |
    keyword {
        // "You cannot use a keyword for a table name."
    }
    |
    IDENTIFIER {
    }
;

select_statement:
    K_SELECT column_expression_list {
        // "Expected FROM after column list."
    }
    error
    |
    K_SELECT error {
        // "Expected column list after SELECT."
    }
    |
    K_SELECT column_expression_list {
    }
    K_FROM table_name {
    }
;

column_expression_list:
    column_expression {
    }
    next_column_expression
;

column_expression:
    T_ASTERISK {
    }
    |
    expression {
    }
;

next_column_expression:
    |
    T_COMMA column_expression {
    }
    next_column_expression
;

binary_expression:
    value {
    }
    operator {
    }
    value {
    }
;

expression:
    value
    |
    binary_expression
;

operator:
    T_PLUS {
    }
    |
    T_EQUALS {
    }
;

value:
    INTEGER {
    }
    |
    IDENTIFIER {
    }
;

%%

You need to understand LR (shift-reduce) parsing, and you need to understand how yacc recovers from errors, using error rules in the grammar. The former is a big question, and there are a number of books that cover the theory and practice of PDAs and shift-reduce parsing (the classics, Hopcroft & Ullman and Aho, Sethi & Ullman, are complete if rather dense).

Once you understand shift-reduce parsing, yacc error recovery is reasonably straightforward. Basically, whenever the parser gets into a state where it can't shift or reduce on the current token, it takes a simple sequence of steps to try to recover:

  1. It pops states until it gets to one that can shift the special error token. This might be zero pops if the current state can shift error.

  2. It shifts the error token, and then does any default reductions in the target state.

  3. It throws away input tokens until it finds one that can be handled in the current state. As with the state popping, that might be zero discards if the state after shifting error can handle the next token.

and that's it.

So if we look at what happens with your current grammar and the example erroneous input, we see that it:

  1. shifts the SELECT token going into a state select_statement: K_SELECT ...
  2. shifts the 123 token, reduces it to a value and shifts to a state *expr: value ...
  3. shifts the + token, reduces it to an operator and shifts to a state binary_expression: value operator ...
  4. Sees the token , and can't shift or reduce in the current state, so issues a syntax error.
  5. Pops states looking for one that can handle error. The top two states (from steps 3 and 2 above) can't, so they are discarded. The next state can, so we end up in a state select_statement: K_SELECT error
  6. That is a default reduction state, so it is reduced to select_statement, which is then reduced to statement, which shifts to a state input: statement END
  7. It starts throwing away input tokens until it finds one the current state can handle, which is only END. So it throws away everything until it gets to END or eof.

Now your question seems to be, "How do I do something different?"

If you want a 'binary expression not complete' recovery, you could add a rule like:

binary_expression: value error

This would end up as part of the *expr: value state above, so error recovery would stop popping there and shift the error token, ending up in a state that can shift the , token.

Whenever you're trying to untangle the states in a large grammar and understand what error recovery will do, it helps tremendously to run yacc/bison with the -v flag to produce a .output file with all the states in it.

I have this loop

for (it = someCollection.iterator(); it.hasNext(); )
{
    //some code here
}

I changed it to:

for (it = someCollection.iterator(); ; )
{
    if (!it.hasNext())
        break;
    //some code here
}

The second version ran a little bit faster in JUnit tests in Eclipse. Is the second loop actually faster? I'm asking because the times given by JUnit are not very exact; they only give an approximate value.

When looking into this sort of problem, it's useful to think about the generated bytecode in terms of a block control-flow graph, where a block is a sequence of bytecode instructions that can only be entered at its first instruction and only left after its last instruction (leaving exits aside to simplify the problem).

Using this example code:

    for (Iterator it = c.iterator(); it.hasNext(); ) {
        System.out.println(it.next());
    }
    System.out.println("Out");

You would get the following block control-flow graph. I've translated the bytecode back into equivalent source for readability, but all the instructions generated by System.out.println(it.next()); belong to a single block, since you can't jump into the middle of it or leave it partway through.

Loop Control Flow Diagram

If you check a compiler book, you'll find that it.hasNext() dominates System.out.println(it.next()), because you have to go through it.hasNext() to reach System.out.println(it.next()). The edge from System.out.println(it.next()) back to it.hasNext() is called a back edge, because it links a node to one of its dominators; this is what formally defines the loop. The first statement in the for loop (Iterator it = c.iterator()) doesn't actually belong to the loop. It is no different from a while loop preceded by that statement, except for the scope of the declared variable, and that doesn't matter once compiled. The first block (it.hasNext()) is the loop header.

A second example like this would produce the same graph:

    for (Iterator it = c.iterator();; ) {
        if (!it.hasNext()) {
           break;
        }
        System.out.println(it.next());
    }
    System.out.println("Out");

The main difference is that there may be some useless goto statements depending on the compiler strategy.

If you look at the generated bytecode using javap -c for these two examples, you get this (this was compiled with javac, you may get something slightly different if you compile with the Eclipse compiler, for example):

public void loop1();
  Code:
   0:   new #2; //class java/util/ArrayList
   3:   dup
   4:   invokespecial   #3; //Method java/util/ArrayList."<init>":()V
   7:   astore_1
   8:   aload_1
   9:   invokevirtual   #4; //Method java/util/ArrayList.iterator:()Ljava/util/Iterator;
   12:  astore_2
   13:  aload_2
   14:  invokeinterface #5,  1; //InterfaceMethod java/util/Iterator.hasNext:()Z
   19:  ifeq    37
   22:  getstatic   #6; //Field java/lang/System.out:Ljava/io/PrintStream;
   25:  aload_2
   26:  invokeinterface #7,  1; //InterfaceMethod java/util/Iterator.next:()Ljava/lang/Object;
   31:  invokevirtual   #8; //Method java/io/PrintStream.println:(Ljava/lang/Object;)V
   34:  goto    13
   37:  getstatic   #6; //Field java/lang/System.out:Ljava/io/PrintStream;
   40:  ldc #9; //String Out
   42:  invokevirtual   #10; //Method java/io/PrintStream.println:(Ljava/lang/String;)V
   45:  return

public void loop2();
  Code:
   0:   new #2; //class java/util/ArrayList
   3:   dup
   4:   invokespecial   #3; //Method java/util/ArrayList."<init>":()V
   7:   astore_1
   8:   aload_1
   9:   invokevirtual   #4; //Method java/util/ArrayList.iterator:()Ljava/util/Iterator;
   12:  astore_2
   13:  aload_2
   14:  invokeinterface #5,  1; //InterfaceMethod java/util/Iterator.hasNext:()Z
   19:  ifne    25
   22:  goto    40
   25:  getstatic   #6; //Field java/lang/System.out:Ljava/io/PrintStream;
   28:  aload_2
   29:  invokeinterface #7,  1; //InterfaceMethod java/util/Iterator.next:()Ljava/lang/Object;
   34:  invokevirtual   #8; //Method java/io/PrintStream.println:(Ljava/lang/Object;)V
   37:  goto    13
   40:  getstatic   #6; //Field java/lang/System.out:Ljava/io/PrintStream;
   43:  ldc #9; //String Out
   45:  invokevirtual   #10; //Method java/io/PrintStream.println:(Ljava/lang/String;)V
   48:  return

The only difference is that the first one uses ifeq 37 to go straight to the end or proceed with the next block (22), whereas the other one uses ifne to go to the block after goto (25, equivalent to 22 in the other) and uses a goto to go to the end otherwise. This is effectively equivalent and a modern JIT compiler should optimise this little difference without trouble. Apart from this, your two loops are exactly the same.

I'm not sure how you've made your measurements, but you should also be aware that just because System.nanoTime() gives you a result in nanoseconds doesn't mean it has a resolution of that order; far from it. High-resolution timers are quite hard to implement and depend on the hardware and OS. See the JavaDoc:

This method provides nanosecond precision, but not necessarily nanosecond resolution (that is, how frequently the value changes) - no guarantees are made except that the resolution is at least as good as that of currentTimeMillis().

If the difference you measure isn't large enough, it's likely not significant compared to that resolution.
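If you still want to compare the two loop shapes yourself, a rough hand-rolled harness along the following lines (warm-up runs, many iterations, timing whole batches) gives a more meaningful number than a single JUnit timing; a dedicated tool such as JMH would be better still. The class, method names and iteration counts here are purely illustrative.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class LoopBenchmark {

    // Runs the given body a fixed number of times and returns the elapsed time in nanoseconds.
    static long runBatch(Runnable body, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            body.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            data.add(i);
        }

        long[] sink = new long[1]; // keep the loop's work observable so it isn't optimised away

        Runnable loop1 = () -> {
            for (Iterator<Integer> it = data.iterator(); it.hasNext(); ) {
                sink[0] += it.next();
            }
        };
        Runnable loop2 = () -> {
            for (Iterator<Integer> it = data.iterator(); ; ) {
                if (!it.hasNext()) {
                    break;
                }
                sink[0] += it.next();
            }
        };

        // Warm up so the JIT has compiled both bodies before we measure.
        runBatch(loop1, 1_000);
        runBatch(loop2, 1_000);

        System.out.println("loop1: " + runBatch(loop1, 1_000) + " ns");
        System.out.println("loop2: " + runBatch(loop2, 1_000) + " ns");
        System.out.println(sink[0]); // prevent dead-code elimination of the sums
    }
}

Even with a harness like this, expect the two versions to land within measurement noise of each other, for the reasons explained above.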

I'm unsure how to finish the left recursion removal algorithm for this grammar.

S ::= a B | B S b | S a | S B A | b
B ::= S b A | B B | A S B | a
D ::= b a | S b
A ::= b S A | b | a b

Here is my working.

using the order S, B, D, A.

S ::= a B M | B S b M | b M
M ::= a M | B A M | ε

B ::= a B M b A | B S b M b A | b M b A | B B | A S B | a

B ::= a B M b A N | b M b A N | A S B b A N | a N
N ::= S b M N | B N | ε

How should I progress from here?

From the Dragon Book.

Given the following rule:

A → Aα1 | ... | Aαm | β1 | ... | βn

where the βi are the non-left-recursive right-hand sides, rewrite it as:

A → β1 A' | ... | βn A'
A' → α1 A' | ... | αm A' | ε

To remove all left recursion, assign a number to each nonterminal, A1...An, and apply the following algorithm (a small code sketch of the single-nonterminal rewrite is given after the pseudocode):

for (int i = 1; i <= n; i++) {
    for (int j = 1; j < i; j++) {
        foreach (production Ai → Aj α, where Aj → β1 | ... | βn)
            replace it with Ai → β1 α | ... | βn α
    }
    remove the immediate left recursion from the Ai productions
}
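To make that inner rewrite step concrete, here is a small Java sketch of eliminating the immediate left recursion in a single nonterminal, i.e. turning A → Aα1 | ... | Aαm | β1 | ... | βn into A → β1 A' | ... | βn A' and A' → α1 A' | ... | αm A' | ε. The grammar representation (productions as lists of symbol strings) and the names used are just for illustration.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ImmediateLeftRecursion {

    // Rewrites A -> A a1 | ... | A am | b1 | ... | bn
    // into     A -> b1 A' | ... | bn A'  and  A' -> a1 A' | ... | am A' | epsilon.
    // A production is a list of symbol strings; epsilon is the empty list.
    static Map<String, List<List<String>>> eliminate(String a, List<List<String>> productions) {
        List<List<String>> alphas = new ArrayList<>(); // tails of the left-recursive productions
        List<List<String>> betas = new ArrayList<>();  // the non-left-recursive productions
        for (List<String> p : productions) {
            if (!p.isEmpty() && p.get(0).equals(a)) {
                alphas.add(p.subList(1, p.size()));
            } else {
                betas.add(p);
            }
        }

        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        if (alphas.isEmpty()) {           // no immediate left recursion: nothing to do
            result.put(a, productions);
            return result;
        }

        String aPrime = a + "'";
        List<List<String>> newA = new ArrayList<>();
        for (List<String> beta : betas) {
            List<String> rhs = new ArrayList<>(beta);
            rhs.add(aPrime);              // A -> beta A'
            newA.add(rhs);
        }
        List<List<String>> newAPrime = new ArrayList<>();
        for (List<String> alpha : alphas) {
            List<String> rhs = new ArrayList<>(alpha);
            rhs.add(aPrime);              // A' -> alpha A'
            newAPrime.add(rhs);
        }
        newAPrime.add(new ArrayList<>()); // A' -> epsilon

        result.put(a, newA);
        result.put(aPrime, newAPrime);
        return result;
    }

    public static void main(String[] args) {
        // S ::= S a | a B | b
        List<List<String>> productions = List.of(
                List.of("S", "a"), List.of("a", "B"), List.of("b"));
        System.out.println(eliminate("S", productions));
        // prints {S=[[a, B, S'], [b, S']], S'=[[a, S'], []]}
    }
}

Running main on S ::= S a | a B | b prints S ::= a B S' | b S' and S' ::= a S' | ε (the empty list stands for ε). The full algorithm above simply interleaves the substitution step with this rewrite for each nonterminal in order.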

I'm writing a recursive method that checks each letter of the string to compare them. I'm having trouble making the "*" character match anything and act as as many letters as needed (making it a wildcard).

I was wondering if someone can give me a hint on the algorithm that would be used?

Here is what I have so far.

public static boolean match(String x, String y) {
    return match_loop(x, y, 0, 1);
}

public static boolean match_loop(String a, String b, int i, int s) {
    try {
        if (a == b) {
            return true;
        }

        if (i >= a.length() && i >= b.length()) {
            return true;
        }

        if (a.charAt(i) == b.charAt(i)) {
            return match_loop(a, b, i + 1, s);
        }


        //(((...A bunch of if statements for my other recursion requirements

        return false;

    } catch (java.lang.StringIndexOutOfBoundsException e) {
        return false;
    }
}

public static void main(String[] args) {
    System.out.println(match("test", "t*t")); // should return true
}

What I was thinking of doing is adding another argument to the method, an int that will act as a letter back-counter. Basically, if a or b at charAt(i - s) (s originally being 1) is a *, recall the recursion with s + 1, and then add a few more if statements to fix the bugs. However, this method seems really long and repetitive. Are there any other algorithms I can use?

This book will tell you exactly how to do it: http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886

Here's a simple Java implementation that might get you on track: http://matt.might.net/articles/implementation-of-nfas-and-regular-expressions-in-java/

Basically, the industrial-strength implementation is a state machine. You deconstruct the regular expression (the string with the '*' in it) and build a graph for it, then you search that graph, for example with a breadth-first traversal.

Here's some discussion of different ways to do it, that will help illustrate the approach: http://swtch.com/~rsc/regexp/regexp1.html
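That said, if '*' is the only metacharacter you need (as in the homework example above), you don't strictly need the NFA machinery: a plain recursive matcher is enough. Below is a minimal sketch of that simpler approach; it is exponential in the worst case, and the class and method names are purely illustrative.

public class WildcardMatch {

    public static boolean match(String text, String pattern) {
        return match(text, 0, pattern, 0);
    }

    private static boolean match(String text, int i, String pattern, int j) {
        // Pattern exhausted: we match only if the text is exhausted too.
        if (j == pattern.length()) {
            return i == text.length();
        }
        if (pattern.charAt(j) == '*') {
            // Either the '*' matches the empty string ...
            if (match(text, i, pattern, j + 1)) {
                return true;
            }
            // ... or it consumes one more character of the text and we try again.
            return i < text.length() && match(text, i + 1, pattern, j);
        }
        // Ordinary character: it must match exactly, then both indices advance.
        return i < text.length()
                && text.charAt(i) == pattern.charAt(j)
                && match(text, i + 1, pattern, j + 1);
    }

    public static void main(String[] args) {
        System.out.println(match("test", "t*t")); // true
        System.out.println(match("test", "t*x")); // false
    }
}

The key idea is the branch at '*': either the star matches nothing and you move past it in the pattern, or it consumes one more character of the text and you stay on it.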

Hi, I'm a parsing newbie and I intend to learn it for my project. Can anyone suggest good books or tutorials? I know a little bit about context-free grammars, but that is all the exposure I have.

This book teaches many things, including parsing. It's considered a classic in compilers.

http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886