Mastering Regular Expressions

Jeffrey Friedl

Mentioned 88

Regular expressions are an extremely powerful tool for manipulating text and data. They are now standard features in a wide range of languages and popular tools, including Perl, Python, Ruby, Java, VB.NET and C# (and any language using the .NET Framework), PHP, and MySQL. If you don't use regular expressions yet, you will discover in this book a whole new world of mastery over your data. If you already use them, you'll appreciate this book's unprecedented detail and breadth of coverage. If you think you know all you need to know about regular expressions, this book is a stunning eye-opener. As this book shows, a command of regular expressions is an invaluable skill. Regular expressions allow you to code complex and subtle text processing that you never imagined could be automated. Regular expressions can save you time and aggravation. They can be used to craft elegant solutions to a wide range of problems. Once you've mastered regular expressions, they'll become an invaluable part of your toolkit. You will wonder how you ever got by without them. Yet despite their wide availability, flexibility, and unparalleled power, regular expressions are frequently underutilized. Yet what is power in the hands of an expert can be fraught with peril for the unwary. Mastering Regular Expressions will help you navigate the minefield to becoming an expert and help you optimize your use of regular expressions. Mastering Regular Expressions, Third Edition, now includes a full chapter devoted to PHP and its powerful and expressive suite of regular expression functions, in addition to enhanced PHP coverage in the central "core" chapters. 
Furthermore, this edition has been updated throughout to reflect advances in other languages, including expanded in-depth coverage of Sun's java.util.regex package, which has emerged as the standard Java regex implementation.

Topics include:

A comparison of features among different versions of many languages and tools
How the regular expression engine works
Optimization (major savings available here!)
Matching just what you want, but not what you don't want
Sections and chapters on individual languages

Written in the lucid, entertaining tone that makes a complex, dry topic become crystal-clear to programmers, and sprinkled with solutions to complex real-world problems, Mastering Regular Expressions, Third Edition offers a wealth of information that you can put to immediate use.

Reviews of this new edition and the second edition:

"There isn't a better (or more useful) book available on regular expressions." --Zak Greant, Managing Director, eZ Systems

"A real tour-de-force of a book which not only covers the mechanics of regexes in extraordinary detail but also talks about efficiency and the use of regexes in Perl, Java, and .NET...If you use regular expressions as part of your professional work (even if you already have a good book on whatever language you're programming in) I would strongly recommend this book to you." --Dr. Chris Brown, Linux Format

"The author does an outstanding job leading the reader from regex novice to master. The book is extremely easy to read and chock full of useful and relevant examples...Regular expressions are valuable tools that every developer should have in their toolbox. Mastering Regular Expressions is the definitive guide to the subject, and an outstanding resource that belongs on every programmer's bookshelf. Ten out of Ten Horseshoes." --Jason Menard, Java Ranch
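
As a small illustration of the "matching just what you want, but not what you don't want" topic mentioned above, here is a minimal sketch using Python's standard re module. It is not taken from the book; the sample text and variable names are made up for this example.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = 'say "hello" and then "goodbye"'

# Greedy: .* takes as much as it can, so the single match runs
# from the first quote all the way to the last quote.
greedy = re.findall(r'"(.*)"', text)

# Negated character class: [^"]* cannot cross a quote, so each
# quoted string is matched on its own.
precise = re.findall(r'"([^"]*)"', text)

print(greedy)
print(precise)
```

The second pattern states what must not be matched (a quote character) instead of relying on the engine to stop at the right place, which is one common way to keep a match from overreaching.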

Mentioned in questions and answers.

If you could go back in time and tell yourself to read a specific book at the beginning of your career as a developer, which book would it be?

I expect this list to be varied and to cover a wide range of things.

To search: Use the search box in the upper-right corner. To search the answers of the current question, use inquestion:this. For example:

inquestion:this "Code Complete"

Applying UML and Patterns by Craig Larman.

The title of the book is slightly misleading; it does deal with UML and patterns, but it covers so much more. The subtitle of the book tells you a bit more: An Introduction to Object-Oriented Analysis and Design and Iterative Development.

Masters of Doom. As far as motivation and love for your profession go, it won't get any better than what's described in this book. A truly inspiring story!

Beginning C# 3.0: An Introduction to Object Oriented Programming

This is the book for those who want to understand the whys and hows of OOP using C# 3.0. You don't want to miss it.

Mastery: The Keys to Success and Long-Term Fulfillment, by George Leonard

It's about what mindsets are required to reach mastery in any skill, and why. It's just awesome, and an easy read too.

Pro Spring is a superb introduction to the world of Inversion of Control and Dependency Injection. If you're not aware of these practices and their implications - the balance of topics and technical detail in Pro Spring is excellent. It builds a great case and consequent personal foundation.

Another book I'd suggest would be Robert Martin's Agile Software Development (ASD). Code smells, agile techniques, test driven dev, principles ... a well-written balance of many different programming facets.

More traditional classics would include the infamous GoF Design Patterns, Bertrand Meyer's Object Oriented Software Construction, Booch's Object Oriented Analysis and Design, Scott Meyers' "Effective C++" series and a lesser known book I enjoyed by Gunderloy, Coder to Developer.

And while books are nice ... don't forget radio!

... let me add one more thing. If you haven't already discovered safari - take a look. It is more addictive than stack overflow :-) I've found that with my google type habits - I need the more expensive subscription so I can look at any book at any time - but I'd recommend the trial to anyone even remotely interested.

(ah yes, a little obj-C today, cocoa tomorrow, patterns? soa? what was that example in that cookbook? What did Steve say in the second edition? Should I buy this book? ... a subscription like this is great if you'd like some continuity and context to what you're googling ...)

Database System Concepts is one of the best books you can read on understanding good database design principles.

Algorithms in C++ was invaluable to me in learning Big O notation and the ins and outs of the various sort algorithms. This was published before Sedgewick decided he could make more money by dividing it into 5 different books.

C++ FAQs is an amazing book that really shows you what you should and shouldn't be doing in C++. The backward compatibility of C++ leaves a lot of landmines about and this book helps one carefully avoid them while at the same time being a good introduction into OO design and intent.

Here are two I haven't seen mentioned:
I wish I had read "Ruminations on C++" by Koenig and Moo much sooner. That was the book that made OO concepts really click for me.
And I recommend Michael Abrash's "Zen of Code Optimization" for anyone else planning on starting a programming career in the mid 90s.

Perfect Software: And Other Illusions about Testing by Gerald M. Weinberg

ISBN-10: 0932633692

ISBN-13: 978-0932633699

Rapid Development by McConnell

The most influential programming book for me was Enough Rope to Shoot Yourself in the Foot by Allen Holub.

O, well, how long ago it was.

I have a few good books that strongly influenced me that I've not seen on this list so far:

The Psychology of Everyday Things by Donald Norman. The general principles of design for other people. This may seem to be mostly good for UI but if you think about it, it has applications almost anywhere there is an interface that someone besides the original developer has to work with; e.g. an API and designing the interface in such a way that other developers form the correct mental model and get appropriate feedback from the API itself.

The Art of Software Testing by Glen Myers. A good, general introduction to testing software; good for programmers to read to help them think like a tester, i.e. think of what may go wrong and prepare for it.

By the way, I realize the question was the "Single Most Influential Book" but the discussion seems to have changed to listing good books for developers to read so I hope I can be forgiven for listing two good books rather than just one.

C++ How to Program. It is good for beginners. This is an excellent, complete book of about 1500 pages.

Effective C++ and More Effective C++ by Scott Meyers.

Inside the C++ Object Model by Stanley Lippman

I bought this when I was a complete newbie, and it took me from only knowing that Java existed to being a reliable team member in a short time.

Not a programming book, but still a very important book every programmer should read:

Orbiting the Giant Hairball by Gordon MacKenzie

The Pragmatic Programmer was pretty good. However, one that really made an impact when I was starting out was:

Windows 95 System Programming Secrets

I know - it sounds and looks a bit cheesy on the outside and has probably dated a bit - but this was an awesome explanation of the internals of Win95 based on the author's (Matt Pietrek's) investigations using his own tools - the code for which came with the book. Bear in mind this was before the whole open source thing and Microsoft was still pretty cagey about releasing documentation of internals - let alone source. There was some quote in there like "If you are working through some problem and hit some sticking point then you need to stop and really look deeply into that piece and really understand how it works". I've found this to be pretty good advice - particularly these days when you often have the source for a library and can go take a look. It also inspired me to enjoy diving into the internals of how systems work, something that has proven invaluable over the course of my career.

Oh and I'd also throw in effective .net - great internals explanation of .Net from Don Box.

I recently read Dreaming in Code and found it to be an interesting read. Perhaps more so since the day I started reading it Chandler 1.0 was released. Reading about the growing pains and mistakes of a project team of talented people trying to "change the world" gives you a lot to learn from. Also Scott brings up a lot of programmer lore and wisdom in between that's just an entertaining read.

Beautiful Code had one or two things that made me think differently, particularly the chapter on top down operator precedence.

K&R

@Juan: I know Juan, I know - but there are some things that can only be learned by actually getting down to the task at hand. Speaking in abstract ideals all day simply makes you into an academic. It's in the application of the abstract that we truly grok the reason for their existence. :P

@Keith: Great mention of "The Inmates are Running the Asylum" by Alan Cooper - an eye opener for certain, any developer that has worked with me since I read that book has heard me mention the ideas it espouses. +1

I found The Algorithm Design Manual to be a very beneficial read. I also highly recommend Programming Pearls.

This one isn't really a book for the beginning programmer, but if you're looking for SOA design books, then SOA in Practice: The Art of Distributed System Design is for you.

For me it was Design Patterns Explained. It provided an 'Oh, that's how it works' moment for me in regards to design patterns and has been very useful when teaching design patterns to others.

Code Craft by Pete Goodliffe is a good read!

The first book that made a real impact on me was Mastering Turbo Assembler by Tom Swan.

Other books that have had an impact were Just For Fun by Linus Torvalds and David Diamond and of course The Pragmatic Programmer by Andrew Hunt and David Thomas.

In addition to other people's suggestions, I'd recommend either acquiring a copy of SICP, or reading it online. It's one of the few books that I've read that I feel greatly increased my skill in designing software, particularly in creating good abstraction layers.

A book that is not directly related to programming, but is also a good read for programmers (IMO) is Concrete Mathematics. Most, if not all of the topics in it are useful for programmers to know about, and it does a better job of explaining things than any other math book I've read to date.

For me "Memory as a Programming Concept in C and C++" really opened my eyes to how memory management really works. If you're a C or C++ developer I consider it a must read. You will definitely learn something or remember things you might have forgotten along the way.

http://www.amazon.com/Memory-Programming-Concept-C/dp/0521520436

Agile Software Development with Scrum by Ken Schwaber and Mike Beedle.

I used this book as the starting point to understanding Agile development.

Systemantics: How Systems Work and Especially How They Fail. Get it used cheap. But you might not get the humor until you've worked on a few failed projects.

The beauty of the book is the copyright year.

Probably the most profound takeaway "law" presented in the book:

The Fundamental Failure-Mode Theorem (F.F.T.): Complex systems usually operate in failure mode.

The idea being that there are failing parts in any given piece of software that are masked by failures in other parts or by validations in other parts. See a real-world example at the Therac-25 radiation machine, whose software flaws were masked by hardware failsafes. When the hardware failsafes were removed, the software race condition that had gone undetected all those years resulted in the machine killing 3 people.

It seems most people have already touched on some very good books. One which really helped me out was Effective C#: 50 Ways to Improve your C#. I'd be remiss if I didn't mention The Tao of Pooh. Philosophy books can be good for the soul, and the code.

Discrete Mathematics For Computer Scientists by J.K. Truss.

While this doesn't teach you programming, it teaches you fundamental mathematics that every programmer should know. You may remember this stuff from university, but really, doing predicate logic will improve your programming skills, and you need to learn set theory if you want to program using collections.

There really is a lot of interesting information in here that can get you thinking about problems in different ways. It's handy to have, just to pick up once in a while to learn something new.
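
To make the point about set theory and predicate logic concrete, here is a minimal sketch in Python. It is not from the book; the names and data are hypothetical and only show how the mathematical notions map onto everyday collection code.

```python
# Hypothetical data, invented for this illustration.
reviewers = {"ana", "ben", "chen"}
committers = {"ben", "chen", "dee"}

# Set theory: intersection, union, and difference map directly
# onto operators of Python's built-in set type.
both = reviewers & committers
either = reviewers | committers
only_reviewers = reviewers - committers

# Predicate logic: "for all" corresponds to all(), and
# "there exists" corresponds to any().
every_committer_reviews = all(p in reviewers for p in committers)
some_committer_reviews = any(p in reviewers for p in committers)

print(sorted(both), sorted(only_reviewers))
print(every_committer_reviews, some_committer_reviews)
```

Here the "for all" statement is false because one committer ("dee") is not a reviewer, while the "there exists" statement is true.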

I saw a review of Software Factories: Assembling Applications with Patterns, Models, Frameworks, and Tools on a blog that also talked about XI-Factory. I read it and I must say this book is a must read. Although not specifically targeted at programmers, it explains very clearly what is happening in the programming world right now with Model-Driven Architecture and so on.

Solid Code: Optimizing the Software Development Life Cycle

Although the book is only 300 pages and favors Microsoft technologies it still offers some good language agnostic tidbits.

Managing Gigabytes is an instant classic for thinking about the heavy lifting of information.

My vote is "How to Think Like a Computer Scientist: Learning With Python". It's available both as a book and as a free e-book.

It really helped me to understand the basics of not just Python but programming in general. Although it uses Python to demonstrate concepts, they apply to most, if not all, programming languages. Also: IT'S FREE!

Object-Oriented Programming in Turbo C++. Not super popular, but it was the one that got me started, and was the first book that really helped me grok what an object was. Read this one waaaay back in high school. It sort of brings a tear to my eye...

My high school math teacher lent me a copy of Are Your Lights On? How to Figure Out What the Problem Really Is, which I have re-read many times. It has been invaluable, as a developer, and in life generally.

I'm now reading Agile Software Development, Principles, Patterns and Practices. For those interested in XP and Object-Oriented Design, this is a classic.

Kernighan & Plauger's Elements of Programming Style. It illustrates the difference between gimmicky-clever and elegant-clever.

To get advanced in Prolog, I like these two books:

The Art of Prolog

The Craft of Prolog

They really open the mind for logic programming and recursion schemes.

Here's an excellent book that is not as widely applauded, but is full of deep insight: Agile Software Development: The Cooperative Game, by Alistair Cockburn.

What's so special about it? Well, clearly everyone has heard the term "Agile", and it seems most are believers these days. Whether you believe or not, though, there are some deep principles behind why the Agile movement exists. This book uncovers and articulates these principles in a precise, scientific way. Some of the principles are (btw, these are my words, not Alistair's):

  1. The hardest thing about team software development is getting everyone's brains to have the same understanding. We are building huge, elaborate, complex systems which are invisible in the tangible world. The better you are at getting more people's brains to share deeper understanding, the more effective your team will be at software development. This is the underlying reason that pair programming makes sense. Most people dismiss it (and I did too initially), but with this principle in mind I highly recommend that you give it another shot. You wind up with TWO people who deeply understand the subsystem you just built ... there aren't many other ways to get such a deep information transfer so quickly. It is like a Vulcan mind meld.
  2. You don't always need words to communicate deep understanding quickly. And a corollary: too many words, and you exceed the listener/reader's capacity, meaning the understanding transfer you're attempting does not happen. Consider that children learn how to speak language by being "immersed" and "absorbing". Not just language either ... he gives the example of some kids playing with trains on the floor. Along comes another kid who has never even SEEN a train before ... but by watching the other kids, he picks up the gist of the game and plays right along. This happens all the time between humans. This along with the corollary about too many words helps you see how misguided it was in the old "waterfall" days to try to write 700 page detailed requirements specifications.

There is so much more in there too. I'll shut up now, but I HIGHLY recommend this book!

The Back of the Napkin, by Dan Roam.

A great book about visual thinking techniques. There is also an expanded edition now. I can't speak to that version, as I do not own it yet.

Agile Software Development by Alistair Cockburn

Do users ever touch your code? If you're not doing solely back-end work, I recommend About Face: The Essentials of User Interface Design, now in its third edition. I used to think my users were stupid because they didn't "get" my interfaces. I was, of course, wrong. About Face turned me around.

"Writing Solid Code: Microsoft's Techniques for Developing Bug-Free C Programs (Microsoft Programming Series)" by Steve Maguire.

Interesting what a large proportion of the books mentioned here are C/C++ books.

While not strictly a software development book, I would highly recommend that Don't Make Me Think! be considered in this list.

As so many people have listed Head First Design Patterns, which I agree is a very good book, I would like to see if as many people are aware of a title called Design Patterns Explained: A New Perspective on Object-Oriented Design.

This title deals with design patterns excellently. The first half of the book is very accessible and the remaining chapters require only a firm grasp of the content already covered. The reason I feel the second half of the book is less accessible is that it covers patterns that I, as a young developer admittedly lacking in experience, have not used much.

This title also introduces the concept behind design patterns, covering everything from Christopher Alexander's initial work in architecture to the GoF first implementing and documenting patterns in Smalltalk.

I think that anyone who enjoyed Head First Design Patterns but still finds the GoF very dry, should look into Design Patterns Explained as a much more readable (although not quite as comprehensive) alternative.

Even though I've never programmed a game, this book helped me understand a lot of things in a fun way.

How influential a book is often depends on the reader and where they were in their career when they read the book. I have to give a shout-out to Head First Design Patterns. Great book and the very creative way it's written should be used as an example for other tech book writers. I.e. it's written in order to facilitate learning and internalizing the concepts.

97 Things Every Programmer Should Know

This book pools together the collective experiences of some of the world's best programmers. It is a must read.

Extreme Programming Explained: Embrace Change by Kent Beck. While I don't advocate a hardcore XP-or-the-highway take on software development, I wish I had been introduced to the principles in this book much earlier in my career. Unit testing, refactoring, simplicity, continuous integration, cost/time/quality/scope - these changed the way I looked at development. Before Agile, it was all about the debugger and fear of change requests. After Agile, those demons did not loom as large.

One of my personal favorites is Hacker's Delight, because it was as much fun to read as it was educational.

I hope the second edition will be released soon!

You.Next(): Move Your Software Development Career to the Leadership Track by Michael C. Finley and Honza Fedák

I've been around a while, so most books that I have found influential don't necessarily apply today. I do believe it is universally important to understand the platform that you are developing for (both hardware and OS). I also think it's important to learn from other people's mistakes. So two books I would recommend are:

Computing Calamities and In Search of Stupidity: Over Twenty Years of High Tech Marketing Disasters

Working Effectively with Legacy Code is a really amazing book that goes into great detail about how to properly unit test your code and what the true benefit of it is. It really opened my eyes.

It wasn't that long ago that I was a beginning coder, trying to find good books/tutorials on languages I wanted to learn. Even still, there are times I need to pick up a language relatively quickly for a new project I am working on. The point of this post is to document some of the best tutorials and books for these languages. I will start the list with the best I can find, but hope you guys out there can help with better suggestions/new languages. Here is what I found:

Since this is now wiki editable, I am giving control up to the community. If you have a suggestion, please put it in this section. I decided to also add a section for general "be a better programmer" books and online references as well. Once again, all recommendations are welcome.

General Programming

Online Tutorials
Foundations of Programming By Karl Seguin - From Codebetter. It's C#-based but the ideas ring true across the board; can't believe no one's posted this yet, actually.
How to Write Unmaintainable Code - An anti-manual that teaches you how to write code in the most unmaintainable way possible. It would be funny if a lot of these suggestions didn't ring so true.
The Programming Section of Wiki Books - suggested by Jim Robert as having a large amount of books/tutorials on multiple languages in various stages of completion
Just the Basics To get a feel for a language.

Books
Code Complete - This book goes without saying; it is truly brilliant in too many ways to mention.
The Pragmatic Programmer - The next best thing to working with a master coder, teaching you everything they know.
Mastering Regular Expressions - Regular Expressions are an essential tool in every programmer's toolbox. This book, recommended by Patrick Lozzi, is a great way to learn what they are capable of.
Algorithms in C, C++, and Java - A great way to learn all the classic algorithms if you find Knuth's books a bit too in depth.

C

Online Tutorials
This tutorial seems to be pretty concise and thorough. I looked over the material and it seems to be pretty good. Not sure how friendly it would be to new programmers though.
Books
K&R C - a classic for sure. It might be argued that all programmers should read it.
C Primer Plus - Suggested by Imran as being the ultimate C book for beginning programmers.
C: A Reference Manual - A great reference recommended by Patrick Lozzi.

C++

Online Tutorials
The tutorial on cplusplus.com seems to be the most complete. I found another tutorial here but it doesn't include topics like polymorphism, which I believe is essential. If you are coming from C, this tutorial might be the best for you.

Another useful tutorial is C++ Annotation. On the Ubuntu family you can get the ebook in multiple formats (PDF, txt, PostScript, and LaTeX) by installing the c++-annotation package from Synaptic (the installed package can be found in /usr/share/doc/c++-annotation/).

Books
The C++ Programming Language - crucial for any C++ programmer.
C++ Primer Plus - Originally added as a typo, but the Amazon reviews are so good, I am going to keep it here until someone says it is a dud.
Effective C++ - Ways to improve your C++ programs.
More Effective C++ - Continuation of Effective C++.
Effective STL - Ways to improve your use of the STL.
Thinking in C++ - Great book, both volumes. Written by Bruce Eckel and Chuck Allison.
Programming: Principles and Practice Using C++ - Stroustrup's introduction to C++.
Accelerated C++ - Andy Koenig and Barbara Moo - An excellent introduction to C++ that doesn't treat C++ as "C with extra bits bolted on", in fact you dive straight in and start using STL early on.

Forth

Books
FORTH, a text and reference. Mahlon G. Kelly and Nicholas Spies. ISBN 0-13-326349-5 / ISBN 0-13-326331-2. 1986 Prentice-Hall. Leo Brodie's books are good but this book is even better. For instance it covers defining words and the interpreter in depth.

Java

Online Tutorials
Sun's Java Tutorials - An official tutorial that seems thorough, but I am not a Java expert. You guys know of any better ones?
Books
Head First Java - Recommended as a great introductory text by Patrick Lozzi.
Effective Java - Recommended by pek as a great intermediate text.
Core Java Volume 1 and Core Java Volume 2 - Suggested by FreeMemory as some of the best java references available.
Java Concurrency in Practice - Recommended by MDC as great resource for concurrent programming in Java.

The Java Programming Language

Python

Online Tutorials
Python.org - The online documentation for this language is pretty good. If you know of any better let me know.
Dive Into Python - Suggested by Nickola. Seems to be a python book online.

Perl

Online Tutorials
perldoc perl - This is how I personally got started with the language, and I don't think you will be able to beat it.
Books
Learning Perl - a great way to introduce yourself to the language.
Programming Perl - widely referred to as the Perl Bible. Essential reference for any serious Perl programmer.
Perl Cookbook - A great book that has solutions to many common problems.
Modern Perl Programming - newly released, contains the latest wisdom on modern techniques and tools, including Moose and DBIx::Class.

Ruby

Online Tutorials
Adam Mika suggested Why's (Poignant) Guide to Ruby but after taking a look at it, I don't know if it is for everyone. Found this site which seems to offer several tutorials for Ruby on Rails.
Books
Programming Ruby - suggested as a great reference for all things ruby.

Visual Basic

Online Tutorials
Found this site which seems to devote itself to visual basic tutorials. Not sure how good they are though.

PHP

Online Tutorials
The main PHP site - A simple tutorial that allows user comments for each page, which I really like.
PHPFreaks Tutorials - Various tutorials of different difficulty levels.
Quakenet/PHP tutorials - PHP tutorial that will guide you from ground up.

JavaScript

Online Tutorials
Found a decent tutorial here geared toward non-programmers. Found another more advanced one here. Nickolay suggested A reintroduction to javascript as a good read here.

Books
Head First JavaScript
JavaScript: The Good Parts (with a Google Tech Talk video by the author)

C#

Online Tutorials
C# Station Tutorial - Seems to be a decent tutorial that I dug up, but I am not a C# guy.
C# Language Specification - Suggested by tamberg. Not really a tutorial, but a great reference on all the elements of C#
Books
C# to the point - suggested by tamberg as a short text that explains the language in amazing depth

ocaml

Books
nlucaroni suggested the following:
OCaml for Scientists
Introduction to OCaml
Using, Understanding, and Unraveling the OCaml Language: From Practice to Theory and Vice Versa
Developing Applications with Objective Caml - O'Reilly
The Objective Caml System - Official Manual

Haskell

Online Tutorials
nlucaroni suggested the following:
Explore functional programming with Haskell
Books
Real World Haskell
Total Functional Programming

LISP/Scheme

Books
wfarr suggested the following:
The Little Schemer - Introduction to Scheme and functional programming in general
The Seasoned Schemer - Followup to Little Schemer.
Structure and Interpretation of Computer Programs - The definitive book on Lisp (also available online).
Practical Common Lisp - A good introduction to Lisp with several examples of practical use.
On Lisp - Advanced Topics in Lisp
How to Design Programs - An Introduction to Computing and Programming
Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp - an approach to high quality Lisp programming

What about you guys? Am I totally off on some of these? Did I leave out your favorite language? I will take the best comments and modify the question with the suggestions.

Java: SCJP for Java 6. I still use it as a reference.

Haskell:

O'Reilly Book:

  1. Real World Haskell, a great tutorial-oriented book on Haskell, available online and in print.

My favorite general, less academic online tutorials:

  1. The Haskell wikibook which contains all of the excellent Yet Another Haskell Tutorial. (This tutorial helps with specifics of setting up a Haskell distro and running example programs, for example.)
  2. Learn you a Haskell for Great Good, in the spirit of Why's Poignant Guide to Ruby but more to the point.
  3. Write yourself a Scheme in 48 hours. Get your hands dirty learning Haskell with a real project.

Books on Functional Programming with Haskell:

  1. Lambda calculus, combinators, more theoretical, but in a very down to earth manner: Davie's Introduction to Functional Programming Systems Using Haskell
  2. Laziness and program correctness, thinking functionally: Bird's Introduction to Functional Programming Using Haskell

Some books on Java I'd recommend:

For Beginners: Head First Java is an excellent introduction to the language. And I must also mention Head First Design Patterns which is a great resource for learners to grasp what can be quite challenging concepts. The easy-going fun style of these books is ideal for people new to programming.

A really thorough, comprehensive book on Java SE is Bruce Eckel's Thinking In Java v4. (At just under 1500 pages it's good for weight-training as well!) For those of us not on fat bank-bonuses there are older versions available for free download.

Of course, as many people have already mentioned, Josh Bloch's Effective Java v2 is an essential part of any Java developer's library.

Let's not forget Head First Java, which could be considered the essential first step in this language or maybe the step after the online tutorials by Sun. It's great for the purpose of grasping the language concisely, while adding a bit of fun, serving as a stepping stone for the more in-depth books already mentioned.

Sedgewick offers great series on Algorithms which are a must-have if you find Knuth's books to be too in-depth. Knuth aside, Sedgewick brings a solid approach to the field and he offers his books in C, C++ and Java. The C++ books could be used backwardly on C since he doesn't make a very large distinction between the two languages in his presentation.

Whenever I'm working on C, C: A Reference Manual, by Harbison and Steele, goes with me everywhere. It's concise and efficient while being extremely thorough, making it priceless (to me anyways).

Languages aside, and if this thread is to become a go-to for references, which I think it's heading towards given the number of solid contributions, please include Mastering Regular Expressions, for reasons I think most of us are aware of... some would also say that regex can be considered a language in its own right. Further, its usefulness in a wide array of languages makes it invaluable.

C: “Programming in C”, Stephen G. Kochan, Developer's Library.

Organized, clear, elaborate, beautiful.

C++

The first one is good for beginners and the second one requires more advanced level in C++.

I know this is a cross post from here... but, I think one of the best Java books is Java Concurrency in Practice by Brian Goetz. A rather advanced book - but, it will wear well on your concurrent code and Java development in general.

C#

C# to the Point by Hanspeter Mössenböck. On a mere 200 pages he explains C# in astonishing depth, focusing on underlying concepts and concise examples rather than hand waving and Visual Studio screenshots.

For additional information on specific language features, check the C# language specification ECMA-334.

Framework Design Guidelines, a book by Krzysztof Cwalina and Brad Abrams from Microsoft, provides further insight into the main design decisions behind the .NET library.

For Lisp and Scheme (hell, functional programming in general), there are few things that provide a more solid foundation than The Little Schemer and The Seasoned Schemer. Both provide a very simple and intuitive introduction to both Scheme and functional programming that proves far simpler for new students or hobbyists than any of the typical volumes that rub off like a nonfiction rendition of War & Peace.

Once they've moved beyond the Schemer series, SICP and On Lisp are both fantastic choices.

For C++ I am a big fan of C++ Common Knowledge: Essential Intermediate Programming. I like that it is organized into small sections (usually less than 5 pages per topic), so it is easy for me to grab it and read up on concepts that I need to review.

It is a must read for me the night before and on the plane to a job interview.

C Primer Plus, 5th Edition - The C book to get if you're learning C without any prior programming experience. It's a personal favorite of mine as I learned to program from this book. It has all the qualities a beginner friendly book should have:

  • Doesn't assume any prior exposure to programming
  • Enjoyable to read (without becoming annoying like For Dummies /
  • Doesn't oversimplify

For Javascript:

For PHP:

For OO design & programming, patterns:

For Refactoring:

For SQL/MySQL:

  • C - The C Programming Language - Obviously I had to reference K&R, one of the best programming books out there full stop.
  • C++ - Accelerated C++ - This clear, well written introduction to C++ goes straight to using the STL and gives nice, clear, practical examples. Lives up to its name.
  • C# - Pro C# 2008 and the .NET 3.5 Platform - Bit of a mouthful but wonderfully written and huge depth.
  • F# - Expert F# - Designed to take experienced programmers from zero to expert in F#. Very well written, one of the authors invented F# so you can't go far wrong!
  • Scheme - The Little Schemer - Really unique approach to teaching a programming language done really well.
  • Ruby - Programming Ruby - Affectionately known as the 'pick axe' book, this is THE de facto introduction to Ruby. Very well written, clear and detailed.

Are there any good books for a relatively new but not totally new *nix user to get a bit more in depth knowledge (so no "Linux for dummies")? For the most part, I'm not looking for something to read through from start to finish. Rather, I'd rather have something that I can pick up and read in chunks when I need to know how to do something or whenever I have one of those "how do I do that again?" moments. Some areas that I'd like to see are:

  • command line administration
  • bash scripting
  • programming (although I'd like something that isn't just relevant for C programmers)

I'd like this to be as platform-independent as possible (meaning it has info that's relevant for any linux distro as well as BSD, Solaris, OS X, etc), but the unix systems that I use the most are OS X and Debian/Ubuntu. So if I would benefit the most from having a more platform-dependent book, those are the platforms to target.

If I can get all this in one book, great, but I'd rather have a bit more in-depth material than coverage of everything. So if there are any books that cover just one of these areas, post it. Hell, post it even if it's not relevant to any of those areas and you think it's something that a person in my position should know about.

I've wiki'd this post - could those with sufficient rep add in items to it.

System administration, general usage books

Programming:

Specific tools (e.g. Sendmail)

Various of the books from O'Reilly and other publishers cover specific topics. Some of the key ones are:

Some of these books have been in print for quite a while and are still relevant. Consequently they are also often available secondhand at much less than list price. Amazon marketplace is a good place to look for such items. It's quite a good way to do a shotgun approach to topics like this for not much money.

As an example, in New Zealand technical books are usuriously expensive due to a weak kiwi peso (as the $NZ is affectionately known in expat circles) and a tortuously long supply chain. You could spend 20% of a week's after-tax pay for a starting graduate on a single book. When I was living there just out of university I used this type of market a lot, often buying books for 1/4 of their list price - including the cost of shipping to New Zealand. If you're not living in a location with tier-1 incomes I recommend this.

E-Books and on-line resources (thanks to israkir for reminding me):

  • The Linux Documentation project (www.tldp.org), has many specific topic guides known as HowTos that also often concern third party OSS tools and will be relevant to other Unix variants. It also has a series of FAQ's and guides.

  • Unix Guru's Universe is a collection of unix resources with a somewhat more old-school flavour.

  • Google. There are many, many unix and linux resources on the web. Search strings like unix commands or learn unix will turn up any amount of online resources.

  • Safari. This is a subscription service, but you can search the texts of quite a large number of books. I can recommend this as I've used it. They also do site licences for corporate customers.

Some of the philosophy of Unix:

I recommend the Armadillo book from O'Reilly for command line administration and shell scripting.


Jason,

Unix Programming Environment by Kernighan and Pike will give you solid foundations on all things Unix and should cover most of your questions regarding shell command line scripting etc.

The Armadillo book by O'Reilly will add the administration angle. It has served me well!

Good luck!

The aforementioned Unix Power Tools is a must. Other classics are sed&awk and Mastering Regular Expressions. I also like some books from the O'Reilly "Cookbook" series:

Is there a way in PHP to compile a regular expression, so that it can then be compared to multiple strings without repeating the compilation process? Other major languages can do this -- Java, C#, Python, Javascript, etc.

I'm not positive that you can. If you check out Mastering Regular Expressions, some PHP specific optimization techniques are discussed in Chapter 10: PHP, specifically the use of the S pattern modifier to cause the regex engine to "Study" the regular expression before it applies it. Depending on your pattern and your text, this could give you some speed improvements.

Edit: you can take a peek at the contents of the book using books.google.com.

It used to be considered beneficial to include the 'o' modifier at the end of Perl regular expressions. The current Perl documentation does not even seem to list it, certainly not at the modifiers section of perlre.

Does it provide any benefit now?

It is still accepted, for reasons of backwards compatibility if nothing else.


As noted by J A Faucett and brian d foy, the 'o' modifier is still documented, if you find the right places to look (one of which is not the perlre documentation). It is mentioned in the perlop pages. It is also found in the perlreref pages.

As noted by Alan M in the accepted answer, the better modern technique is usually to use the qr// (quoted regex) operator.

I'm sure it's still supported, but it's pretty much obsolete. If you want the regex to be compiled only once, you're better off using a regex object, like so:

my $reg = qr/foo$bar/;

The interpolation of $bar is done when the variable is initialized, so you will always be using the cached, compiled regex from then on within the enclosing scope. But sometimes you want the regex to be recompiled, because you want it to use the variable's new value. Here's the example Friedl used in The Book:

sub CheckLogfileForToday()
{
  my $today = (qw<Sun Mon Tue Wed Thu Fri Sat>)[(localtime)[6]];

  my $today_regex = qr/^$today:/i; # compiles once per function call

  while (<LOGFILE>) {
    if ($_ =~ $today_regex) {
      ...
    }
  }
}

Within the scope of the function, the value of $today_regex stays the same. But the next time the function is called, the regex will be recompiled with the new value of $today. If he had just used

if ($_ =~ m/^$today:/io)

...the regex would never be updated. So, with the object form you have the efficiency of /o without sacrificing flexibility.
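
For comparison, the same compile-once-per-call idea can be sketched in Python. This is only an illustration of the pattern, not Friedl's code: the function name, the log format and the sample lines are all made up.

```python
import re

def count_lines_for_day(lines, day):
    """Count log lines that start with '<day>:' (case-insensitive)."""
    # Compiled once per call, like qr// inside the Perl sub: the pattern
    # picks up the current value of `day` and is then reused for every line.
    day_regex = re.compile(r"^" + re.escape(day) + r":", re.IGNORECASE)
    return sum(1 for line in lines if day_regex.search(line))

# Hypothetical log lines, invented for this example.
log = ["Mon: disk check ok", "tue: backup done", "MON: reboot", "Monday: no match"]
print(count_lines_for_day(log, "Mon"))
print(count_lines_for_day(log, "Tue"))
```

Because the compiled object is rebuilt on each call, a later call with a different day gets a fresh pattern, which is the flexibility that /o gives up.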

In an attempt to be a better programmer, I am planning to read a lot of books and learn at least one new language (which I think is going to be python) during the 3-month long holiday that I am going to have.

The list of books that I am planning to read --

Ebooks:

The list of things that I want to do --

  • Start using Linux (probably starting with Ubuntu).
  • Learning to use the bunch of tools mentioned here.
  • Setup a blog (hopefully).

I enjoy watching lectures as well, so, along with Introduction to Algorithms I am going to watch a bunch of Stanford Courses.

A little background: I am a 17 year old guy and I really enjoy programming and every aspect related to it. I have been programming in C and C++ for a while now, and I am stronger in the former than the latter. I was hoping to utilize the time that I have got on hand thoroughly. Any changes needed with the plan or any additions?

EDIT: Did not mention programming projects.

  1. Making a game using Allegro.
  2. Using QT4 to create a GUI-based database for my school.

Do not just passively read all that information, instead practice after every book chapter or lecture. Practice by writing those algorithms or regexes for yourself, check your previous code in the light of what Code Complete has taught you and so on.

If that doesn't give you enough time to do it all, doesn't matter, you better learn properly instead of swallowing material as fast as you can. And all of that is probably too much to do it in only three months anyway, so prioritize by your interests. Take one book and a set of lectures and go at them until finished, then pickup the next, never forgetting to put in practice the concepts shown.

Along with the excellent books you and others listed here I recommend The Little Schemer which will give you a perspective of functional programming. Do not lock yourself into imperative languages (C, C++, C#, Pascal, Java,.. ) while you can dive into different paradigms easily. 17 is a great age :-) Enjoy your journey!

These probably won't fit within your three month plan, but should you wish to master C and Unix, long have I been glad that I spent the time to follow The Loginataka. After that, I found Lions' Commentary on UNIX 6th Edition to be deeply enlightening.

I would add the following two books if you haven't read them already:

Programming Pearls by Jon Bentley

Code by Charles Petzold

Even as it stands, you're going to have a very busy, hopefully productive break. Good luck!

Response to question edit: If you're interested in learning about databases, then I recommend Database in Depth by Chris Date. I hope by "create a GUI-based database" you mean implementing a front-end application for an existing database back end. There are plenty of database solutions out there, and it will be well worth it for your future career to learn a few of them.

I looked through related questions before posting this and I couldn't modify any relevant answers to work with my method (not good at regex).

Basically, here are my existing lines:

$code = preg_replace_callback( '/"(.*?)"/', array( &$this, '_getPHPString' ), $code );

$code = preg_replace_callback( "#'(.*?)'#", array( &$this, '_getPHPString' ), $code );

They both match strings contained between '' and "". I need the regex to ignore escaped quotes contained between themselves. So data between '' will ignore \' and data between "" will ignore \".

Any help would be greatly appreciated.

For most strings, you need to allow escaped anything (not just escaped quotes). e.g. you most likely need to allow escaped characters like "\n" and "\t" and of course, the escaped-escape: "\\".

This is a frequently asked question, and one which was solved (and optimized) long ago. Jeffrey Friedl covers this question in depth (as an example) in his classic work: Mastering Regular Expressions (3rd Edition). Here is the regex you are looking for:

Good:

"([^"\\]|\\.)*"
Version 1: Works correctly but is not terribly efficient.

Better:

"([^"\\]++|\\.)*" or "((?>[^"\\]+)|\\.)*"
Version 2: More efficient if you have possessive quantifiers or atomic groups (See: sin's correct answer which uses the atomic group method).

Best:

"[^"\\]*(?:\\.[^"\\]*)*"
Version 3: More efficient still. Implements Friedl's: "unrolling-the-loop" technique. Does not require possessive or atomic groups (i.e. this can be used in Javascript and other less-featured regex engines.)

Here are the recommended regexes in PHP syntax for both double and single quoted sub-strings:

$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';
$re_sq = "/'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'/s";
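
As a rough Python counterpart to the preg_replace_callback calls in the question, here is a minimal sketch that uses the same unrolled-loop patterns with a callback. The helper name mask_strings and the placeholder format are assumptions made up for illustration; they are not part of the original PHP code.

```python
import re

# Unrolled-loop patterns from the answer above, written as Python raw strings.
RE_DQ = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"', re.DOTALL)
RE_SQ = re.compile(r"'[^'\\]*(?:\\.[^'\\]*)*'", re.DOTALL)

def mask_strings(code):
    """Replace every quoted string literal with a numbered placeholder."""
    found = []

    def stash(match):
        # Callback: remember the literal, return its placeholder.
        found.append(match.group(0))
        return "<STR%d>" % (len(found) - 1)

    code = RE_DQ.sub(stash, code)   # double-quoted strings first
    code = RE_SQ.sub(stash, code)   # then single-quoted strings
    return code, found

masked, strings = mask_strings(r'''echo "say \"hi\"" . 'it\'s';''')
print(masked)
print(strings)
```

Like the two sequential calls in the question, this runs the double-quote pass before the single-quote pass, so a double quote sitting inside a single-quoted string would still confuse it; handling both kinds in one pass would need a combined pattern.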

This arose from a discussion on formalizing regular expressions syntax. I've seen this behavior with several regular expression parsers, hence I tagged it language-agnostic.

Take the following expression (adjust it for your favorite language):

replace("input", "(.*)*", "$1")

it will return an empty string. Why?

More curiously even, the expression replace("input", "(.*)*", "A$1B") will return the string ABAB. Why the double empty match?

Disclaimer: I know about backtracking and greedy matches, but the rules laid out by Jeffrey Friedl seem to dictate that .* matches everything and that no further backtracking or matching is done. Then why is $1 empty?

Note: compare with (.+)*, which returns the input string. However, http://regexhero.com shows that there are still two matches, which seems odd for the same reasons as above.

Let's see what happens:

  1. (.*) matches "input".
  2. "input" is captured into group 1.
  3. The regex engine is now positioned at the end of the string. But since (.*) is repeated, another match attempt is made:
  4. (.*) matches the empty string after "input".
  5. The empty string is captured into group 1, overwriting "input".
  6. $1 now contains the empty string.

A good question from the comments:

Then why does replace("input", "(input)*", "A$1B") return "AinputBAB"?

  1. (input)* matches "input". It is replaced by "AinputB".
  2. (input)* matches the empty string. It is replaced by "AB" ($1 is empty because it didn't participate in the match).
  3. Result: "AinputBAB"
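
The two effects described above (a repeated group keeps only its last iteration, and a second, empty match happens at the end of the string) can be tried out in Python. This is a small sketch assuming Python 3.7 or later, where re.sub also replaces an empty match adjacent to a previous non-empty one. The result for (.*)* is printed rather than stated, since engines differ in how they handle an empty iteration of a repeated group.

```python
import re

# A repeated capture group keeps only the text of its last iteration.
last_iteration = re.match(r"(\w)*", "input").group(1)

# After consuming "input", the pattern matches once more, emptily, at the
# end of the string. The group did not participate in that second match,
# so \1 expands to the empty string there.
with_literal = re.sub(r"(input)*", r"A\1B", "input")

# The pattern from the question, shown without a claimed result.
from_question = re.sub(r"(.*)*", r"A\1B", "input")

print(last_iteration)
print(with_literal)
print(from_question)
```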

I'm reading Douglas Crockford's JavaScript: The Good Parts, and I just finished the regular expressions chapter. In this chapter he calls JavaScript's \b, positive lookahead (?=) and negative lookahead (?!) "not a good part".

He explains the reason for \b being not good (it uses \w for word boundary finding, and \w fails for any language that uses unicode characters), and that looks like a very good reason to me.

Unfortunately, the reason for positive and negative lookahead being not good is left out, and I cannot come up with one. Mastering Regular Expressions showed me the power that comes with lookahead (and of course explains the issues it brings with it), but I can't really think of anything that would qualify it as "not a good part".

Can anyone explain why JavaScript (positive|negative) lookahead or (positive|negative) lookahead in general should be considered "not good"?

It seems I'm not the only one with this question: one and two.

Maybe it's because of Internet Explorer's perpetually buggy implementation of lookaheads. For anyone authoring a book about JavaScript, any feature that doesn't work in IE might as well not exist.

I searched for a question about regexp testing/learning tools, but people usually suggest Windows-based solutions. I found one for Ubuntu: redit. However, I'm wondering if there are better tools for the job. So, without further ado

Q: What is the best tool for testing/learning regular expressions
   for linux/ubuntu?

Sorry if this is a superuser kind of question. Thx

Best "tool" for learning regex?

Easy: Mastering Regular Expressions (3rd Edition)

After reading a pretty good article on regex optimization in java I was wondering what are the other good tips for creating fast and efficient regular expressions?

I suggest you read the book Mastering Regular Expressions by Jeffrey Friedl.

I am reading the PCRE doc, and it refers to possessive quantifiers, but does not explicitly or specifically define them. I know what a greedy quantifier is, and I know what a lazy quantifier is. But possessive?

The PCRE man page seems to be cheating when it uses the term without defining it. The man page specifically states that the term possessive quantifiers was first defined in Friedl's book. Well, that's great, but I don't have Friedl's book, and in reading the man page, between the lines, I cannot figure out what distinguishes possessive quantifiers from greedy ones.

  • ? = zero or one, greedy
  • ?? = zero or one, lazy
  • ?+ = zero or one, possessive
  • '+' = one or more, greedy
  • +? = one or more, lazy
  • ++ = one or more, possessive

Perhaps the best place to start is Regex Tutorial - Possessive Quantifiers:

When discussing the repetition operators or quantifiers, I explained the difference between greedy and lazy repetition. Greediness and laziness determine the order in which the regex engine tries the possible permutations of the regex pattern. A greedy quantifier will first try to repeat the token as many times as possible, and gradually give up matches as the engine backtracks to find an overall match. A lazy quantifier will first repeat the token as few times as required, and gradually expand the match as the engine backtracks through the regex to find an overall match.


Possessive quantifiers are a way to prevent the regex engine from trying all permutations. This is primarily useful for performance reasons. You can also use possessive quantifiers to eliminate certain matches.
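
A short Python sketch of the three behaviours may help. Greedy and lazy use the standard syntax. Python's re module only gained ++ and (?>...) in version 3.11, so the possessive case is emulated here with a lookahead plus a backreference. That emulation is my own substitution for illustration, not something the quoted tutorial prescribes, and the sample strings are made up.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

greedy = re.search(r"<.+>", text).group(0)   # grabs as much as possible
lazy = re.search(r"<.+?>", text).group(0)    # grabs as little as possible

# Ordinary greedy \w+ gives back the final "d" so the literal d can match.
backtracking = re.search(r"\w+d", "word")

# Possessive-like behaviour, emulated: the lookahead captures the longest
# run of word characters, \1 consumes exactly that text, and the engine
# cannot backtrack into the finished lookahead to give a character back.
possessive_like = re.search(r"(?=(\w+))\1d", "word")

print(greedy)
print(lazy)
print(backtracking.group(0))
print(possessive_like)
```

The last search fails outright, which is the "eliminate certain matches" effect: once the run of word characters has been taken, nothing is left for the trailing d.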

Is it possible to emulate possessive quantifiers (.NET doesn’t support them) using atomic grouping (or in another way)?

Note. I found that (x+x+)++y can be replaced with (?>(x+x+)+)y, but this is just an example and I don’t know whether {something}@+ is always equivalent to (?>{something}@) (where @ is a quantifier).

Yup. May I quote the master himself, Jeffrey Friedl, from page 142 of his classic Mastering Regular Expressions (3rd Edition):

"In one sense, possessive quantifiers are just syntactic sugar, as they can be mimicked with atomic grouping. Something like .++ has exactly the same result as (?>.+), although a smart implementation can optimize possessive quantifiers more than atomic grouping."

I would like to be able to match a string literal with the option of escaped quotations. For instance, I'd like to be able to search "this is a 'test with escaped\' values' ok" and have it properly recognize the backslash as an escape character. I've tried solutions like the following:

import re
regexc = re.compile(r"\'(.*?)(?<!\\)\'")
match = regexc.search(r""" Example: 'Foo \' Bar'  End. """)
print match.groups() 
# I want ("Foo \' Bar") to be printed above

After looking at this, there is a simple problem that the escape character being used, "\", can't be escaped itself. I can't figure out how to do that. I wanted a solution like the following, but negative lookbehind assertions need to be fixed length:

# ...
re.compile(r"\'(.*?)(?<!\\(\\\\)*)\'")
# ...

Any regex gurus able to tackle this problem? Thanks.

re_single_quote = r"'[^'\\]*(?:\\.[^'\\]*)*'"

First note that MizardX's answer is 100% accurate. I'd just like to add some additional recommendations regarding efficiency. Secondly, I'd like to note that this problem was solved and optimized long ago - See: Mastering Regular Expressions (3rd Edition), (which covers this specific problem in great detail - highly recommended).

First let's look at the sub-expression to match a single quoted string which may contain escaped single quotes. If you are going to allow escaped single quotes, you had better at least allow escaped-escapes as well (which is what Douglas Leeder's answer does). But as long as you're at it, it's just as easy to allow escaped-anything-else. With these requirements, MizardX is the only one who got the expression right. Here it is in both short and long format (and I've taken the liberty to write this in VERBOSE mode, with lots of descriptive comments - which you should always do for non-trivial regexes):

# MizardX's correct regex to match single quoted string:
re_sq_short = r"'((?:\\.|[^\\'])*)'"
re_sq_long = r"""
    '           # Literal opening quote
    (           # Capture group $1: Contents.
      (?:       # Group for contents alternatives
        \\.     # Either escaped anything
      | [^\\']  # or one non-quote, non-escape.
      )*        # Zero or more contents alternatives.
    )           # End $1: Contents.
    '
    """

This works and correctly matches all the following string test cases:

test01 = r"out1 'escaped-escape:        \\ ' out2"
test02 = r"out1 'escaped-quote:         \' ' out2"
test03 = r"out1 'escaped-anything:      \X ' out2"
test04 = r"out1 'two escaped escapes: \\\\ ' out2"
test05 = r"out1 'escaped-quote at end:   \'' out2"
test06 = r"out1 'escaped-escape at end:  \\' out2"

Ok, now let's begin to improve on this. First, the order of the alternatives makes a difference and one should always put the most likely alternative first. In this case, non escaped characters are more likely than escaped ones, so reversing the order will improve the regex's efficiency slightly like so:

# Better regex to match single quoted string:
re_sq_short = r"'((?:[^\\']|\\.)*)'"
re_sq_long = r"""
    '           # Literal opening quote
    (           # $1: Contents.
      (?:       # Group for contents alternatives
        [^\\']  # Either a non-quote, non-escape,
      | \\.     # or an escaped anything.
      )*        # Zero or more contents alternatives.
    )           # End $1: Contents.
    '
    """

"Unrolling-the-Loop":

This is a little better, but can be further improved (significantly) by applying Jeffrey Friedl's "unrolling-the-loop" efficiency technique (from MRE3). The above regex is not optimal because it must painstakingly apply the star quantifier to the non-capture group of two alternatives, each of which consume only one or two characters at a time. This alternation can be eliminated entirely by recognizing that a similar pattern is repeated over and over, and that an equivalent expression can be crafted to do the same thing without alternation. Here is an optimized expression to match a single quoted string and capture its contents into group $1:

# Best regex to match single quoted string (unrolled loop):
re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"
re_sq_long = r"""
    '            # Literal opening quote
    (            # $1: Contents.
      [^'\\]*    # {normal*} Zero or more non-', non-escapes.
      (?:        # Group for {(special normal*)*} construct.
        \\.      # {special} Escaped anything.
        [^'\\]*  # More {normal*}.
      )*         # Finish up {(special normal*)*} construct.
    )            # End $1: Contents.
    '
    """

This expression gobbles up all non-quote, non-backslashes (the vast majority of most strings), in one "gulp", which drastically reduces the amount of work that the regex engine must perform. How much better you ask? Well, I entered each of the regexes presented from this question into RegexBuddy and measured how many steps it took the regex engine to complete a match on the following string (which all solutions correctly match):

'This is an example string which contains one \'internally quoted\' string.'

Here are the benchmark results on the above test string:

r"""
AUTHOR            SINGLE-QUOTE REGEX   STEPS TO: MATCH  NON-MATCH
Evan Fosmark      '(.*?)(?<!\\)'                  374     376
Douglas Leeder    '(([^\\']|\\'|\\\\)*)'          154     444
cletus/PEZ        '((?:\\'|[^'])*)(?<!\\)'        223     527
MizardX           '((?:\\.|[^\\'])*)'             221     369
MizardX(improved) '((?:[^\\']|\\.)*)'             153     369
Jeffrey Friedl    '([^\\']*(?:\\.[^\\']*)*)'       13      19
"""

These steps are the number of steps required to match the test string using the RegexBuddy debugger function. The "NON-MATCH" column is the number of steps required to declare match failure when the closing quote is removed from the test string. As you can see, the difference is significant for both the matching and non-matching cases. Note also that these efficiency improvements are only applicable to an NFA engine which uses backtracking (i.e. Perl, PHP, Java, Python, Javascript, .NET, Ruby and most others.) A DFA engine will not see any performance boost by this technique (See: Regular Expression Matching Can Be Simple And Fast).
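
Before trusting the faster version, it is worth confirming that the three expressions really capture the same text. The sketch below does only that. It does not reproduce the RegexBuddy step counts, and the dictionary labels and helper name are invented for this example.

```python
import re

# The three single-quote regexes discussed above.
VERSIONS = {
    "alternation, escape first": r"'((?:\\.|[^\\'])*)'",
    "alternation, normal first": r"'((?:[^\\']|\\.)*)'",
    "unrolled loop":             r"'([^'\\]*(?:\\.[^'\\]*)*)'",
}

SAMPLES = [
    r"out1 'escaped-escape:        \\ ' out2",
    r"out1 'escaped-quote:         \' ' out2",
    r"out1 'escaped-quote at end:   \'' out2",
    r"out1 'escaped-escape at end:  \\' out2",
    r"'This is an example string which contains one \'internally quoted\' string.'",
]

def contents(pattern, text):
    """Return the captured string contents, or None when nothing matches."""
    m = re.search(pattern, text, re.DOTALL)
    return m.group(1) if m else None

results = {name: [contents(p, s) for s in SAMPLES] for name, p in VERSIONS.items()}

# True when every version captured identical contents for every sample.
all_agree = len({tuple(r) for r in results.values()}) == 1
print(all_agree)
```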

On to the complete solution:

The goal of the original question (my interpretation), is to pick out single quoted sub-strings (which may contain escaped quotes) from a larger string. If it is known that the text outside the quoted sub-strings will never contain escaped-single-quotes, the regex above will do the job. However, correctly matching single-quoted sub-strings within a sea of text swimming with escaped-quotes and escaped-escapes and escaped-anything-elses (which is my interpretation of what the author is after) takes more care. I originally thought it required parsing from the beginning of the string, but it doesn't: it can be achieved using MizardX's very clever (?<!\\)(?:\\\\)* expression. Here are some test strings to exercise the various solutions:

test01 = r"out1 'escaped-escape:        \\ ' out2"
test02 = r"out1 'escaped-quote:         \' ' out2"
test03 = r"out1 'escaped-anything:      \X ' out2"
test04 = r"out1 'two escaped escapes: \\\\ ' out2"
test05 = r"out1 'escaped-quote at end:   \'' out2"
test06 = r"out1 'escaped-escape at end:  \\' out2"
test07 = r"out1           'str1' out2 'str2' out2"
test08 = r"out1 \'        'str1' out2 'str2' out2"
test09 = r"out1 \\\'      'str1' out2 'str2' out2"
test10 = r"out1 \\        'str1' out2 'str2' out2"
test11 = r"out1 \\\\      'str1' out2 'str2' out2"
test12 = r"out1         \\'str1' out2 'str2' out2"
test13 = r"out1       \\\\'str1' out2 'str2' out2"
test14 = r"out1           'str1''str2''str3' out2"

Given this test data let's see how the various solutions fare ('p'==pass, 'XX'==fail):

r"""
AUTHOR/REGEX     01  02  03  04  05  06  07  08  09  10  11  12  13  14
Douglas Leeder    p   p  XX   p   p   p   p   p   p   p   p  XX  XX  XX
  r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'"
cletus/PEZ        p   p   p   p   p  XX   p   p   p   p   p  XX  XX  XX
  r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'"
MizardX           p   p   p   p   p   p   p   p   p   p   p   p   p   p
  r"(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'"
ridgerunner       p   p   p   p   p   p   p   p   p   p   p   p   p   p
  r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'"
"""

A working test script:

import re
data_list = [
    r"out1 'escaped-escape:        \\ ' out2",
    r"out1 'escaped-quote:         \' ' out2",
    r"out1 'escaped-anything:      \X ' out2",
    r"out1 'two escaped escapes: \\\\ ' out2",
    r"out1 'escaped-quote at end:   \'' out2",
    r"out1 'escaped-escape at end:  \\' out2",
    r"out1           'str1' out2 'str2' out2",
    r"out1 \'        'str1' out2 'str2' out2",
    r"out1 \\\'      'str1' out2 'str2' out2",
    r"out1 \\        'str1' out2 'str2' out2",
    r"out1 \\\\      'str1' out2 'str2' out2",
    r"out1         \\'str1' out2 'str2' out2",
    r"out1       \\\\'str1' out2 'str2' out2",
    r"out1           'str1''str2''str3' out2",
    ]

regex = re.compile(
    r"""(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'""",
    re.DOTALL)

data_cnt = 0
for data in data_list:
    data_cnt += 1
    print ("\nData string %d" % (data_cnt))
    m_cnt = 0
    for match in regex.finditer(data):
        m_cnt += 1
        if (match.group(1)):
            print("  quoted sub-string%3d = \"%s\"" %
                (m_cnt, match.group(1)))

Phew!

p.s. Thanks to MizardX for the very cool (?<!\\)(?:\\\\)* expression. Learn something new every day!

Currently I am reading a book, Regular Expressions. The book is very, very detailed. Although it gives examples while explaining the subjects, it is hard to learn without doing a good amount of exercises/practice.

So, can you suggest a site, a book, a place for regex exercises, so that I can solve them and help myself to absorb regexes while reading the book?

Take a look at Zed Shaw's Learn Regex the Hard Way. It's free and the author provides a hands on way to incrementally learn the topic through exercises.

After that, you can go back to your current reading or move on to Mastering Regular Expressions if you want to dive into more details, or the Regular Expressions Cookbook for problems and solutions.

In http://llvm.org/svn/llvm-project/libcxx/trunk/test/re/re.alg/re.alg.match/ecma.pass.cpp, the following test exists:

    std::cmatch m;
    const char s[] = "tournament";
    assert(!std::regex_match(s, m, std::regex("tour|to|tournament")));
    assert(m.size() == 0);

Why should this match fail?

On VC++2012 and boost, the match succeeds.
On Javascript of Chrome and Firefox, "tournament".match(/^(?:tour|to|tournament)$/) succeeds.

Only on libc++, the match fails.

I believe the test is correct. It is instructive to search for "tournament" in all of the libc++ tests under re.alg, and compare how the different engines treat the regex("tour|to|tournament"), and how regex_search differs from regex_match.

Let's start with regex_search:

awk, egrep, extended:

regex_search("tournament", m, regex("tour|to|tournament"))

matches the entire input string: "tournament".

ECMAScript:

regex_search("tournament", m, regex("tour|to|tournament"))

matches only part of the input string: "tour".

grep, basic:

regex_search("tournament", m, regex("tour|to|tournament"))

Doesn't match at all. The '|' character is not special.

awk, egrep and extended will match as much as they can with alternation. However the ECMAScript alternation is "ordered". This is specified in ECMA-262. Once ECMAScript matches a branch in the alternation, it quits searching. The standard includes this example:

/a|ab/.exec("abc")

returns the result "a" and not "ab".

<plug>

This is also discussed in depth in Mastering Regular Expressions by Jeffrey E.F. Friedl. I couldn't have implemented <regex> without this book. And I will freely admit that there is still much more that I don't know about regular expressions, than what I know.

At the end of the chapter on alternation the author states:

If you understood everything in this chapter the first time you read it, you probably didn't read it in the first place.

Believe it!

</plug>

Anyway, ECMAScript matches only "tour". The regex_match algorithm returns success only if the entire input string is matched. Since only the first 4 characters of the input string are matched, then unlike awk, egrep and extended, ECMAScript returns false with a zero-sized cmatch.
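Ordered alternation is easy to observe first-hand. The following is a minimal sketch using Python's re module, which is not one of the engines discussed above; it is used here only as an assumption-free stand-in for a backtracking engine with ordered alternation. The variable names are illustrative.

```python
import re

# Ordered alternation: the first alternative that succeeds wins.
first = re.search(r"a|ab", "abc").group()

# The same effect with the pattern from the libc++ test: a prefix match
# stops at "tour" because that alternative is listed first.
prefix = re.match(r"tour|to|tournament", "tournament").group()

# Listing the longest alternative first imitates leftmost-longest behaviour.
longest = re.match(r"tournament|tour|to", "tournament").group()

# Python's fullmatch requires the end of the string, so it backtracks into
# the alternation until an alternative consumes the whole input.
whole = re.fullmatch(r"tour|to|tournament", "tournament").group()

print(first, prefix, longest, whole)
```

Note that Python's fullmatch therefore produces the same outcome the question reports for VC++2012 and boost, not the libc++ outcome.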

Is there a good tutorial on using regular expressions, especially with grep? I tried googling for some, but most tutorials are too basic and cover things I already know.

Here is an introductory article. Here are some other tutorials.

The definitive book is Mastering Regular Expressions by J. Friedl.

I wrote a small, naive regular expression that was supposed to find text inside parentheses:

re.search(r'\((.|\s)*\)', name)

I know this is not the best way to do it for a few reasons, but it was working just fine. What I am looking for is simply an explanation as to why for some strings this expression starts taking exponentially longer and then never finishes. Last night, after running this code for months, one of our servers suddenly got stuck matching a string similar to the following:

x (y)                                            z

I've experimented with it and determined that the time taken doubles for every space in between the 'y' and 'z':

In [62]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (22 * ' ') + 'z')
1 loops, best of 3: 1.23 s per loop

In [63]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (23 * ' ') + 'z')
1 loops, best of 3: 2.46 s per loop

In [64]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (24 * ' ') + 'z')
1 loops, best of 3: 4.91 s per loop

But also that characters other than a space do not have the same effect:

In [65]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (24 * 'a') + 'z')
100000 loops, best of 3: 5.23 us per loop

Note: I am not looking for a better regular expression or another solution to this problem. We are no longer using it.

Catastrophic Backtracking

As CaffGeek's answer correctly implies, the problem is one form of catastrophic backtracking. Both alternatives match a space (or tab), and the group is repeated greedily an unlimited number of times. Additionally, the dot matches the closing parenthesis, so once the opening paren is matched, the expression always runs to the very end of the string before it must painstakingly backtrack to find the closing paren. During this backtracking, the other alternative is tried at each location (and it also succeeds for spaces or tabs). Thus every possible combination of alternatives must be tried before the engine can backtrack one position, which is why the time doubles with each additional space after the closing paren. The specific problem for the case where there is a matching close paren can be solved by simply making the star quantifier lazy (i.e. r'\((.|\s)*?\)'), but the runaway regex problem still exists for the non-matching case where there is an opening paren with no matching close paren in the subject string.

The original regex is really, really bad! (It also does not correctly match up closing parens when there is more than one pair.)

The correct expression to match innermost parentheses (which is very fast for both matching and non-matching cases), is of course:

re_innermostparens = re.compile(r"""
    \(        # Literal open paren.
    [^()]*    # Zero or more non-parens.
    \)        # Literal close paren.
    """, re.VERBOSE)

All regex authors should read MRE3!

This is all explained in great detail (with thorough examples and recommended best practices) in Jeffrey Friedl's must-read-for-regex-authors: Mastering Regular Expressions (3rd Edition). I can honestly say that this is the most useful book I've ever read. Regular expressions are a very powerful tool but, like a loaded weapon, must be applied with great care and precision (or you will shoot yourself in the foot!).

Given a file of text, where the characters I want to match are delimited by single quotes but might contain zero or one escaped single quote, as well as zero or more tabs and newline characters (not escaped), I want to match the text only. Example:

menu_item = 'casserole';
menu_item = 'meat 
            loaf';
menu_item = 'Tony\'s magic pizza';
menu_item = 'hamburger';
menu_item = 'Dave\'s famous pizza';
menu_item = 'Dave\'s lesser-known
    gyro';

I want to grab only the text (and spaces), ignoring the tabs/newlines - and I don't actually care if the escaped quote appears in the results, as long as it doesn't affect the match:

casserole
meat loaf
Tonys magic pizza
hamburger
Daves famous pizza
Dave\'s lesser-known gyro # quote is okay if necessary.

I have managed to create a regex that almost does it. It handles the escaped quotes, but not the newlines:

menuPat = r"menu_item = \'(.*)(\\\')?(\t|\n)*(.*)\'"
for line in inFP.readlines():
    m = re.search(menuPat, line)
    if m is not None:
        print m.group()

There are definitely a ton of regular expression questions out there - but most are using Perl, and if there's one that does what I want, I couldn't figure it out :) And since I'm using Python, I don't care if it is spread across multiple groups, it's easy to recombine them.

Some Answers have said to just go with code for parsing the text. While I'm sure I could do that - I'm so close to having a working regex :) And it seems like it should be doable.

Update: I just realized that I am doing a Python readlines() to get each line, which obviously is breaking up the lines getting passed to the regex. I'm looking at re-writing it, but any suggestions on that part would also be very helpful.

This tested script should do the trick:

import re
re_sq_long = r"""
    # Match single quoted string with escaped stuff.
    '            # Opening literal quote
    (            # $1: Capture string contents
      [^'\\]*    # Zero or more non-', non-backslash
      (?:        # "unroll-the-loop"!
        \\.      # Allow escaped anything.
        [^'\\]*  # Zero or more non-', non-backslash
      )*         # Finish {(special normal*)*} construct.
    )            # End $1: String contents.
    '            # Closing literal quote
    """
re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"

data = r'''
        menu_item = 'casserole';
        menu_item = 'meat 
                    loaf';
        menu_item = 'Tony\'s magic pizza';
        menu_item = 'hamburger';
        menu_item = 'Dave\'s famous pizza';
        menu_item = 'Dave\'s lesser-known
            gyro';'''
matches = re.findall(re_sq_long, data, re.DOTALL | re.VERBOSE)
menu_items = []
for match in matches:
    match = re.sub(r'\s+', ' ', match) # Clean whitespace (raw string avoids an invalid-escape warning)
    match = re.sub(r'\\', '', match)  # remove escapes
    menu_items.append(match)          # Add to menu list

print (menu_items)

Here is the short version of the regex:

'([^'\\]*(?:\\.[^'\\]*)*)'

This regex is optimized using Jeffrey Friedl's "unrolling-the-loop" efficiency technique. (See Mastering Regular Expressions (3rd Edition) for details.)

Note that the above regex is equivalent to the following one (which is more commonly seen but is much slower on most NFA regex implementations):

'((?:[^'\\]|\\.)*)'
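The equivalence of the two forms can be checked directly. The sketch below runs both patterns over a shortened copy of the sample data (the shortened sample and the variable names are assumptions for illustration) and then applies the same clean-up steps as the script above.

```python
import re

# Unrolled form: normal* (special normal*)*
unrolled = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)
# Alternation form: (normal | special)*
alternation = re.compile(r"'((?:[^'\\]|\\.)*)'", re.DOTALL)

sample = r"""
menu_item = 'casserole';
menu_item = 'Tony\'s magic pizza';
menu_item = 'meat
    loaf';
"""

unrolled_hits = unrolled.findall(sample)
alternation_hits = alternation.findall(sample)

# Collapse whitespace runs and drop the escaping backslashes.
cleaned = [re.sub(r"\s+", " ", hit).replace("\\", "") for hit in unrolled_hits]
print(cleaned)
```

Both patterns return the same captures; the difference discussed above is only in how much work a backtracking engine does to get there.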

I have worked for 5 years mainly in java desktop applications accessing Oracle databases and I have never used regular expressions. Now I enter Stack Overflow and I see a lot of questions about them; I feel like I missed something.

For what do you use regular expressions?

P.S. sorry for my bad english

If you want to learn about regular expressions, I recommend Mastering Regular Expressions. It goes from the very basic concepts all the way up to how different engines work underneath. The last four chapters are each dedicated to one language: PHP, .NET, Perl, and Java. I learned a lot from it, and still use it as a reference.

I'm searching for the best book or tutorials, recommendations, and ways to gain a concise knowledge of regular expressions (regex).

I'll be very happy to try your suggestions,

Thank you

My answer got closed! "We expect answers to be supported by facts, references, or specific expertise"

What if a Stanford professor or teacher has the "specific expertise" to recommend a way to learn, "supporting" the answer with many years of teaching, conversations with colleagues, and "references"?

I'll rephrase:

I'm searching for the best book or tutorials, recommendations, and ways to gain a concise knowledge of regular expressions (regex).

I would be very happy to try your answers which I'll assume to be supported by facts, references, or specific expertise.

Thank you

Mastering Regular Expressions is the canonical text. It is how I learned regular expressions and it changed my [programming] life.

When you need to quickly look something up for reference, Regular-Expressions.info is an invaluable resource.

Finally, when you are trying to work out a regular expression, use an online tool that is tailored to the "flavor" of regex that you are using (.NET, JavaScript, Ruby, etc.), or you can purchase standalone apps, such as RegexBuddy (which also helps with learning, though I have not used it personally).

Is there any resource out there that covers regular expressions at a very basic starter level? I have looked at http://www.regular-expressions.info/ but it is not exactly 'quick reference' friendly. Are there any other sources?

I mostly use regular expressions in PHP, so something PHP-friendly would be helpful.

When I was starting out with regular expressions I got recommended this book: Mastering Regular Expressions. It really helped me understand the basic concept and how regular expressions work. But also I can still learn from it today because it also covers more advanced topics.

There are some PDF versions of the book out there, but I don't think they are legal. I would really recommend buying this book if you want to actually understand regexps rather than just copy ready-made ones and only roughly understand how they work.

NOTE: The question is a bit long, as it includes a section from the book.

I was reading about atomic groups in Mastering Regular Expressions.

The book says that atomic groups lead to faster failure. Quoting that particular section from the book:

Faster failures with atomic grouping. Consider ^\w+: applied to Subject. We can see, just by looking at it, that it will fail because the text doesn’t have a colon in it, but the regex engine won’t reach that conclusion until it actually goes through the motions of checking.

So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states — one skip me state for each match of \w by the plus (except the first, since plus requires one match). When then checked at the end of the string, : fails, so the regex engine backtracks to the most recently saved state:

at which point the : fails again, this time trying to match t. This backtrack-test fail cycle happens all the way back to the oldest state:

After the attempt from the final state fails, overall failure can finally be announced. All that backtracking is a lot of work that after just a glance we know to be unnecessary. If the colon can’t match after the last letter, it certainly can’t match one of the letters the + is forced to give up!

So, knowing that none of the states left by \w+, once it’s finished, could possibly lead to a match, we can save the regex engine the trouble of checking them: ^(?>\w+):. By adding the atomic grouping, we use our global knowledge of the regex to enhance the local working of \w+ by having its saved states (which we know to be useless) thrown away. If there is a match, the atomic grouping won’t have mattered, but if there’s not to be a match, having thrown away the useless states lets the regex come to that conclusion more quickly.


I tried these regexes here. ^\w+: took 4 steps and ^(?>\w+): took 6 steps (with internal engine optimization disabled).


My Questions

  1. In the second paragraph from above section, it is mentioned that

So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states — one skip me state for each match of \w by the plus (except the first, since plus requires one match).When then checked at the end of the string, : fails, so the regex engine backtracks to the most recently saved state:

at which point the : fails again, this time trying to match t. This backtrack-test fail cycle happens all the way back to the oldest state:

but on this site, I see no backtracking. Why?

Is there some optimization going on inside (even after it is disabled)?

  2. Can the number of steps a regex takes be used to decide whether it performs better than another regex?

The debugger on that site seems to gloss over the details of backtracking. RegexBuddy does a better job. Here's what it shows for ^\w+:

Normal (greedy)

After \w+ consumes all the letters, it tries to match : and fails. Then it gives back one character, tries the : again, and fails again. And so on, until there's nothing left to give back. Fifteen steps total. Now look at the atomic version (^(?>\w+):):

Atomic

After failing to match the : the first time, it gives back all the letters at once, as if they were one character. A total of five steps, and two of those are entering and leaving the group. And using a possessive quantifier (^\w++:) eliminates even those:

Possessive

As for your second question, yes, the number-of-steps metric from regex debuggers is useful, especially if you're just learning regexes. Every regex flavor has at least a few optimizations that allow even badly written regexes to perform adequately, but a debugger (especially a flavor-neutral one like RegexBuddy's) makes it obvious when you're doing something wrong.
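The effect of atomic grouping can also be observed without a debugger. The sketch below uses Python and emulates the atomic group with a lookahead plus a backreference, which works on versions that predate native support (Python 3.11 added (?>...) and possessive quantifiers to the re module). The pattern ^\w+t is an assumption added purely to make the difference visible in the results rather than only in the step count.

```python
import re

greedy = re.compile(r"^\w+:")
# (?=(\w+))\1 behaves like the atomic group (?>\w+): once the lookahead has
# succeeded it is never re-entered, and \1 consumes exactly what it captured.
atomic_like = re.compile(r"^(?=(\w+))\1:")

# Both versions reject "Subject" and accept "Subject:"; only the amount of
# backtracking needed to reach the failure differs.
no_colon = (greedy.match("Subject"), atomic_like.match("Subject"))
with_colon = (greedy.match("Subject:").group(),
              atomic_like.match("Subject:").group())

# The difference shows up when giving characters back would help:
gives_back = re.match(r"^\w+t", "Subject")        # \w+ releases the final "t"
refuses = re.match(r"^(?=(\w+))\1t", "Subject")   # nothing is released

print(no_colon, with_colon, gives_back.group(), refuses)
```

This matches the book's point: when a match exists the atomic group makes no difference, and when it does not, the saved states are simply unavailable to retry.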

I have the following string:

[{names: {en: "US 30 - 5 Minute Level", es: "US 30 - 5 Minute Level"}, status: "A", displayed: "Y", start_time: "2011-05-20 00:00:00", start_time_xls: {en: "20th of May 2011  00:00 am", es: "20 May 2011 00:00 am"}, suspend_at: "2011-05-20 16:53:48", is_off: "Y", score_home: "", score_away: "", bids_status: "", period_id: "", curr_period_start_time: "", score_extra_info: "", settled: "N", ev_id: 2688484, ev_type_id: 10745, num_mkts: 5, venues: {en: "", es: ""}, disporder: 2040, ev_stream_available: false}]

I need to surround all the variable names with quotation marks so this will validate as JSON. I was doing the following, but it's also splitting up the dates.

Regex.Replace(input, @"(\w+:)", "\"$0\":", RegexOptions.None);

Output after Regex.Replace:

[{"names" {"en" "US 30 - 5 Minute Level", "es" "US 30 - 5 Minute Level"}, "status" "A", "displayed" "Y", "start_time" "2011-05-20 "00""00"00", "start_time_xls" {"en" "20th of May 2011 "00"00 am", "es" "20 May 2011 "00"00 am"}, "suspend_at" "2011-05-20 "16""53"48", "is_off" "Y", "score_home" "", "score_away" "", "bids_status" "", "period_id" "", "curr_period_start_time" "", "score_extra_info" "", "settled" "N", "ev_id" 2688484, "ev_type_id" 10745, "num_mkts" 5, "venues" {"en" "", "es" ""}, "disporder" 2040, "ev_stream_available" false}]

How can I change this to ignore them? Also, what's a good web-based resource to get to the bottom of Regular Expressions once and for all?!

Thanks.

Try this pattern:

string pattern = @"\b([A-Za-z_]+)\b(?=:)";
string replace = "\"$0\"";
string result = Regex.Replace(input, pattern, replace);
Console.WriteLine(result);

The [A-Za-z_]+ matches one or more upper- or lower-case letters or underscores. This works fine if none of the JSON names contain digits. The \b metacharacter matches on a word boundary, and (?=:) matches, but doesn't capture, a colon. You'll notice that the replace pattern doesn't include a colon.

Similarly, this pattern would work: @"\b([^\d\s]+)\b(?=:)" since it matches everything that is not a digit or whitespace.

Learning regex takes a lot of practice to understand the concepts, try out examples, and wrap your head around how things work. I suggest grabbing a tool that lets you try them out along with some tutorials. To get you started:
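For comparison, here is a hedged Python sketch of the same lookahead idea. It is a translation for illustration, not the C# code above; the input is a shortened, assumed excerpt of the string in the question, and json.loads is used only to confirm that the result parses.

```python
import json
import re

# Shortened excerpt of the input from the question (assumption for brevity).
raw = ('[{names: {en: "US 30 - 5 Minute Level"}, '
       'start_time: "2011-05-20 00:00:00", ev_id: 2688484, '
       'ev_stream_available: false}]')

# A run of letters/underscores that is immediately followed by a colon.
# The lookahead checks for the colon without consuming it.
key_pattern = re.compile(r"\b([A-Za-z_]+)\b(?=:)")

quoted = key_pattern.sub(r'"\1"', raw)
parsed = json.loads(quoted)

print(quoted)
```

The time stamps survive because their colon-separated parts are digits, which the character class does not match.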

  • The 30 Minute Regex Tutorial
  • Regular Expressions Info - great resource, explains different concepts and highlights differences between regex engines in various languages.
  • Expresso - this is a free tool, you just need to provide an email to register it. It also includes the 30 min regex tutorial I linked to above.

That should get you started. If you really want to dive deeper then two good books to check out are:

I have a string s with nested brackets: s = "AX(p>q)&E((-p)Ur)"

I want to remove all characters between all pairs of brackets and store in a new string like this: new_string = AX&E

I tried doing this:

p = re.compile("\(.*?\)", re.DOTALL)
new_string = p.sub("", s)

It gives output: AX&EUr)

Is there any way to correct this, rather than iterating each element in the string?

Nested brackets (or tags, ...) cannot be handled in a general way using regex. See http://www.amazon.de/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=sr_1_1?ie=UTF8&s=gateway&qid=1304230523&sr=8-1-spell for details on why. You would need a real parser.

It's possible to construct a regex that handles two levels of nesting, but it is already ugly, and one for three levels is quite long. And you don't want to think about four levels. ;-)
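One way around the nesting limit, offered here as an alternative sketch rather than as the parser this answer has in mind, is to apply a simple innermost-pair regex repeatedly until nothing changes. The function name is hypothetical, and the loop assumes the brackets are balanced.

```python
import re

def strip_parenthesized(text):
    """Remove every parenthesized group, however deeply nested."""
    innermost = re.compile(r"\([^()]*\)")   # a pair with no parens inside
    previous = None
    while previous != text:                 # stop once a pass changes nothing
        previous = text
        text = innermost.sub("", text)
    return text

new_string = strip_parenthesized("AX(p>q)&E((-p)Ur)")
print(new_string)
```

Each pass peels off one level of nesting, so the number of passes grows with the nesting depth rather than with the pattern's complexity.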

A little regex help please.

Why are these different?

Regex.Replace("(999) 555-0000 /x ext123", "/x.*|[^0-9]", String.Empty)
"9995550000"


Regex.Replace("(999) 555-0000 /x ext123", "[^0-9]|/x.*", String.Empty)
"9995550000123"

I thought the pipe operator did not care about order... or maybe there is something else that can explain this?

I think you've got the wrong idea about alternation (i.e., the pipe). In a pure DFA regex implementation, it's true that alternation favors the longest match no matter how the alternatives are ordered. In other words, the whole regex, whether it contains alternation or not, always returns the earliest and longest possible match--the "leftmost-longest" rule.

However, the regex implementations in most of today's popular programming languages, including .NET, are what Friedl calls Traditional NFA engines. One of the most important differences between them and DFA engines is that alternation is not greedy; it attempts the alternatives in the order they're listed and stops as soon as one of them matches. The only thing that will cause it to change its mind is if the match fails at a later point in the regex, forcing it to backtrack into the alternation.

Note that if you change the [^0-9] to [^0-9]+ in both regexes you'll get the same result from both--but not the one you want. (I'm assuming the /x.* alternative is supposed to match--and remove--the rest of the string, including the extension number.) I'd suggest something like this:

"[^0-9/]+|/x.*$"

That way, neither alternative can even start to match what the other one matches. Not only will that prevent the kind of confusion you're experiencing, it avoids potential performance bottlenecks. One of the other major differences between DFAs and NFAs is that badly written NFA regexes are prone to serious (even catastrophic) performance problems, and sloppy alternations are one of the easiest ways to trigger them.
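The same behaviour can be reproduced in any backtracking engine. Here is a minimal sketch in Python (chosen for illustration; the question itself uses .NET) showing both orderings from the question and the mutually exclusive version suggested above.

```python
import re

phone = "(999) 555-0000 /x ext123"

# "/x.*" is tried first, so at the slash it swallows the rest of the string.
slash_first = re.sub(r"/x.*|[^0-9]", "", phone)

# "[^0-9]" is tried first, so the slash is removed as a single character and
# "/x.*" never gets a chance; the digits of the extension survive.
class_first = re.sub(r"[^0-9]|/x.*", "", phone)

# With mutually exclusive alternatives, the order no longer matters.
exclusive = re.sub(r"[^0-9/]+|/x.*$", "", phone)
exclusive_swapped = re.sub(r"/x.*$|[^0-9/]+", "", phone)

print(slash_first, class_first, exclusive, exclusive_swapped)
```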

I am trying to find a fast way in Python to check if a list of terms can be matched against strings ranging in size from 50 to 50,000 characters.

A term can be:

  • A word, eg. 'apple'
  • A phrase, eg. 'cherry pie'
  • Boolean ANDing of words and phrases, eg. 'sweet pie AND savoury pie AND meringue'

A match is where a word or phrase exists around word boundaries, so:

match(term='apple', string='An apple a day.') # True
match(term='berry pie', string='A delicious berry pie.') # True
match(term='berry pie', string='A delicious blueberry pie.') # False

I currently have about 40 terms, most of them are simple words. The number of terms will increase over time, but I wouldn't expect it to get beyond 400.

I'm not interested in which term(s) a string matches, or where in the string it matches, I just need a true/false value for a match against each string - it is much more likely that no terms will match the string, so for the 1 in 500 where it does match, I can store the string for further processing.

Speed is the most important criterion, and I'd like to leverage the existing code of those smarter than me rather than trying to implement a white-paper. :)

So far the speediest solution I've come up with is:

def data():
    return [
        "The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family (Rosaceae).",
        "This resulted in early armies adopting the style of hunter-foraging.",
        "Beef pie fillings are popular in Australia. Chicken pie fillings are too."
    ]

def boolean_and(terms):
    return '(%s)' % (''.join(['(?=.*\\b%s\\b)' % (term) for term in terms]))

def run():
    words_and_phrases = ['apple', 'cherry pie']
    booleans = [boolean_and(terms) for terms in [['sweet pie', 'savoury pie', 'meringue'], ['chicken pie', 'beef pie']]]
    regex = re.compile(r'(?i)(\b(%s)\b|%s)' % ('|'.join(words_and_phrases), '|'.join(booleans)))
    matched_data = list()
    for d in data():
        if regex.search(d):
            matched_data.append(d)

The regex winds up as:

(?i)(\b(apple|cherry pie)\b|((?=.*\bsweet pie\b)(?=.*\bsavoury pie\b)(?=.*\bmeringue\b))|((?=.*\bchicken pie\b)(?=.*\bbeef pie\b)))

So all the terms are ORed together, case is ignored, the words/phrases are wrapped in \b for word boundaries, the boolean ANDs use lookaheads so that all the terms are matched, but they do not have to match in a particular order.

Timeit results:

 print timeit.Timer('run()', 'from __main__ import run').timeit(number=10000)
 1.41534304619

Without the lookaheads (ie. the boolean ANDs) this is really quick, but once they're added the speed slows down considerably.

Does anybody have ideas on how this could be improved? Is there a way to optimise the lookahead, or maybe an entirely different approach? I don't think stemming will work, as it tends to be a bit greedy with what it matches.

The boolean AND regex with the multiple lookahead assertions can be sped up considerably by anchoring them to the beginning of the string. Or better yet, use two regexes: one for the ORed list of terms using the re.search method, and a second regex with the boolean ANDed list using the re.match method like so:

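import re

def boolean_and_new(terms):
    # One lazy lookahead per term; all of them must succeed at the same position.
    return ''.join([r'(?=.*?\b%s\b)' % (term) for term in terms])

def run_new():
    words_and_phrases = ['apple', 'cherry pie']
    booleans = [boolean_and_new(terms) for terms in [
        ['sweet pie', 'savoury pie', 'meringue'],
        ['chicken pie', 'beef pie']]]
    # regex1: ORed words/phrases, searched anywhere in the string.
    regex1 = re.compile(r'(?i)\b(?:%s)\b' % ('|'.join(words_and_phrases)))
    # regex2: ANDed terms, applied only at the start of the string via match().
    regex2 = re.compile(r'(?i)%s' % ('|'.join(booleans)))
    matched_data = list()
    for d in data():  # data() is the function defined in the question
        if regex1.search(d) or regex2.match(d):
            matched_data.append(d)
    return matched_data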

The effective regexes for this data set are:

regex1 = r'(?i)\b(?:apple|cherry pie)\b'
regex2 = r'(?i)(?=.*?\bsweet pie\b)(?=.*?\bsavoury pie\b)(?=.*?\bmeringue\b)|(?=.*?\bchicken pie\b)(?=.*?\bbeef pie\b)'

Note that the second regex effectively has a ^ anchor at the start since it's being used with the re.match method. This also includes a few extra (minor) tweaks: removing unnecessary capture groups and changing the greedy dot-star to lazy. This solution runs nearly 10 times faster than the original on my Win32 box running Python 3.0.1.

Additional: So why is it faster? Let's look at a simple example which describes how the NFA regex "engine" works. (Note that the following description derives from the classic work on the subject: Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl (MRE3), which explains all this efficiency stuff in great detail - highly recommended.) Let's say you have a target string containing one word and you want a regex to see if that word is: "apple". Here are two regexes one might compose to do the job:
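To confirm that the two-regex approach selects the right strings, here is a small self-contained check. The sample strings are shortened versions of those in the question, and the helper names (and_lookaheads, is_match) are hypothetical; re.escape is added as a precaution that the code above does not take.

```python
import re

SAMPLES = [
    "The apple is the pomaceous fruit of the apple tree.",
    "This resulted in early armies adopting the style of hunter-foraging.",
    "Beef pie fillings are popular in Australia. Chicken pie fillings are too.",
]

def and_lookaheads(terms):
    # One lazy lookahead per term; every term must appear somewhere ahead.
    return "".join(r"(?=.*?\b%s\b)" % re.escape(term) for term in terms)

or_regex = re.compile(r"(?i)\b(?:apple|cherry pie)\b")
and_regex = re.compile("(?i)" + "|".join(
    and_lookaheads(terms) for terms in [
        ["sweet pie", "savoury pie", "meringue"],
        ["chicken pie", "beef pie"],
    ]))

def is_match(text):
    # search() scans the whole string; match() tries only position 0, which
    # is what keeps the lookaheads from being retried at every offset.
    return bool(or_regex.search(text) or and_regex.match(text))

results = [is_match(sample) for sample in SAMPLES]
print(results)
```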

re1 = r'^apple'
re2 = r'apple'
s = r'orange'

If your target string is: apple (or applesauce or apple-pie etc.), then both regexes will successfully match very quickly. But if your target string is say: orange, the situation is different. An NFA regex engine must try all possible permutations of the regex on the target string before it can declare match failure. The way that the regex "engine" works is that it keeps an internal pointer to its current location within the target string, and a second pointer to a location within the regex pattern, and advances these pointers as it goes about its business. Note that these pointers point to locations between the characters and, to begin with, the target string pointer is set to the location before the first letter of the target string.

re1: The first token in the regex is the ^ start of string anchor. This "anchor" is one of the special "assertion" expressions which matches a location in a target string and does not actually match any characters. (Lookahead and lookbehind and the \b word boundary expressions are also assertions which match a location and do not "consume" any characters.) Ok, with the target string pointer initialized to the location before the first letter of the word orange, the regex engine checks if the ^ anchor matches, and it does (because this location is, in fact, the beginning of the string). So the pattern pointer is advanced to the next token in the regex, the letter a. (The target string pointer is not advanced). It then checks to see if the regex literal a matches the target string character o. It does not match. At this point, the regex engine is smart enough to know that the regex can never succeed at any other locations within the target string (because the ^ can never match anywhere but at the start). Thus it can declare match failure.

re2: In this case the engine begins by checking if the first pattern char a matches the first target char 'o'. Once again, it does not. However, in this case the regex engine is not done! It has determined that the pattern will not match at the first location, but it must try (and fail) at ALL locations within the target string before it can declare match failure. Thus, the engine advances the target string pointer to the next location (Friedl refers to this as the transmission "bump-along"). For each "bump along", it resets the pattern pointer back to the beginning. Thus it checks the first token in the pattern a against the second char in the string: r. This also does not match, so the transmission bumps along again to the next location within the string. At this point it tests the first char of the pattern a against the third char of the target: a, which does match. The engine advances both pointers and checks the second char in the regex p against the fourth character in the target n. This fails. At this point the engine declares failure at the location before the a in orange and then bumps along again to the n. It goes on like this until it fails at every location in the target string, at which point it can declare overall match failure.

For long subject strings, this extra unnecessary work can take a lot of time. Crafting an accurate and efficient regex is equal parts art and science. And to craft a really great regex, one must understand exactly how the engine actually works under the hood. Gaining this knowledge requires time and effort, but the time spent will (in my experience) pay for itself many times over. And there really is only one good place to effectively learn these skills and that is to sit down and study Mastering Regular Expressions (3rd Edition) and then practice the techniques learned. I can honestly say that this is, hands down, the most useful book I have ever read. (It's even funny!)
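The bump-along described above can be modelled with a few lines of Python. This is a deliberately naive toy, not how any real engine is implemented (real engines apply optimizations that skip hopeless start positions); the function name and its return shape are assumptions made for the illustration.

```python
def naive_search(literal, text, anchored=False):
    """Try a literal pattern at successive start positions.

    Returns (matched, start_positions_tried). With anchored=True only
    position 0 is tried, which models a leading ^ in the pattern.
    """
    last_start = 0 if anchored else len(text)
    tried = 0
    for start in range(last_start + 1):      # the transmission "bump-along"
        tried += 1
        if text.startswith(literal, start):  # attempt the pattern at this spot
            return True, tried
    return False, tried

anchored_fail = naive_search("apple", "orange", anchored=True)
unanchored_fail = naive_search("apple", "orange")
quick_hit = naive_search("apple", "applesauce")

print(anchored_fail, unanchored_fail, quick_hit)
```

In this model the anchored pattern fails after a single attempt, while the unanchored one is attempted at every location in "orange" (including the one after the final letter) before failure is declared.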

Hope this helps! 8^)

I can never remember the differences in regular expression syntax used by tools like grep and AWK, or languages like Python and PHP. Generally, Perl has the most expansive syntax, but I'm often hamstrung by the limitations of even egrep ("extended" grep).

Is there a site that lists the differences in a concise and easy-to-read fashion?

Mastering Regular Expressions devotes its last four chapters to Java, PHP, Perl, and .NET, one chapter for each. From what I know, the pocket edition contains just those final four chapters.

I have a little knowledge of RegEx, but at the moment this is far beyond my abilities.

I need help finding the text/expression immediately after the last open parenthesis that doesn't have a matching close parenthesis.

It is for the CallTip feature of an open source program (Object Pascal) in development.

Below some examples:

------------------------------------
Text                  I need
------------------------------------
aaa(xxx               xxx
aaa(xxx,              xxx
aaa(xxx, yyy          xxx
aaa(y=bbb(xxx)        y=bbb(xxx)
aaa(y <- bbb(xxx)     y <- bbb(xxx)
aaa(bbb(ccc(xxx       xxx
aaa(bbb(x), ccc(xxx   xxx
aaa(bbb(x), ccc(x)    bbb(x)
aaa(bbb(x), ccc(x),   bbb(x)
aaa(?, bbb(??         ??
aaa(bbb(x), ccc(x))   ''
aaa(x)                ''
aaa(bbb(              ''
------------------------------------

For all text above the RegEx proposed by @Bohemian
(?<=\()(?=([^()]*\([^()]*\))*[^()]*$).*?(?=[ ,]|$)(?! <-)(?<! <-)
matches all cases.

It does not match the cases below, which I found when implementing the RegEx in the software:
------------------------------------
New text              I need
------------------------------------
aaa(bbb(x, y)         bbb(x, y)
aaa(bbb(x, y, z)      bbb(x, y, z)
------------------------------------

Is it possible to write a RegEx (PCRE) for these situations?

In a previous post (RegEx: Word immediately before the last opened parenthesis), Alan Moore (many thanks again) helped me find the text immediately before the last open parenthesis with the RegEx below:

\w+(?=\((?:[^()]*\([^()]*\))*[^()]*$)

However, I was not able to make an appropriate adjustment to match immediately after.

Can anyone help, please?

This is similar to this problem. And since you are using PCRE, using the recursion syntax, there is actually a solution.

/
(?(DEFINE)                # define a named capture for later convenience
  (?P<parenthesized>      # define the group "parenthesized" which matches a
                          # substring which contains correctly nested
                          # parentheses (it does not have to be enclosed in
                          # parentheses though)
    [^()]*                # match arbitrarily many non-parenthesis characters
    (?:                   # start non capturing group
      [(]                 # match a literal opening (
      (?P>parenthesized)  # recursively call this "parenthesized" subpattern
                          # i.e. make sure that the contents of these literal ()
                          # are also correctly parenthesized
      [)]                 # match a literal closing )
      [^()]*              # match more non-parenthesis characters
    )*                    # repeat
  )                       # end of "parenthesized" pattern
)                         # end of DEFINE sequence

# Now the actual pattern begins

(?<=[(])                  # ensure that there is a literal ( left of the start
                          # of the match
(?P>parenthesized)?       # match correctly parenthesized substring
$                         # ensure that we've reached the end of the input
/x                        # activate free-spacing mode

The gist of this pattern is obviously the parenthesized subpattern. I should maybe elaborate a bit more on that. Its structure is this:

(normal* (?:special normal*)*)

Where normal is [^()] and special is [(](?P>parenthesized)[)]. This technique is called "unrolling-the-loop". It's used to match anything that has the structure

nnnsnnsnnnnsnnsnn

Where n is matched by normal and s is matched by special.
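As a smaller illustration of the same normal* (special normal*)* shape, here is a minimal Python sketch that matches double-quoted strings containing backslash escapes. It is not part of the pattern above: the choice of normal as [^"\\] and special as \\. is an assumption made for this example, and the names QUOTED and quoted_strings are hypothetical.

```python
import re

# normal  = [^"\\]  (any character except a quote or a backslash)
# special = \\.     (a backslash followed by any one character)
# Overall shape: " normal* (?: special normal* )* "
QUOTED = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')


def quoted_strings(text):
    """Return every double-quoted string in text, escapes included."""
    return QUOTED.findall(text)


if __name__ == "__main__":
    print(quoted_strings(r'say "he said \"hi\" twice" and "bye"'))
```

Because normal and special can never match at the same position, the engine has only one way to walk through the string, which is where the technique gets its speed.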

In this particular case, things are a bit more complicated though, because we are also using recursion. (?P>parenthesized) recursively uses the parenthesized pattern (which it is part of). You can view the (?P>...) syntax a bit like a backreference, except that the engine does not try to match what the group ... matched, but instead applies its subpattern again.

Also note that my pattern will not give you an empty string for correctly parenthesized input; it will simply fail to match. You could fix this by leaving out the lookbehind. The lookbehind is actually not necessary, because the engine will always return the left-most match.

EDIT: Judging by two of your examples, you don't actually want everything after the last unmatched parenthesis, but only everything until the first comma. You can use my result and split on , or try Bohemian's answer.

EDIT: I noticed that you mentioned in your question that you are using Object Pascal. In that case you are probably not actually using PCRE, which means there is no support for recursion. In that case there can be no full regex solution to the problem. If we impose a limitation like "there can only be one more nesting level after the last unmatched parenthesis" (as in all your examples), then we can come up with a solution. Again, I'll use "unrolling-the-loop" to match substrings of the form xxx(xxx)xxx(xxx)xxx.

(?<=[(])         # make sure we start after an opening (
(?=              # lookahead checks that the parenthesis is not matched
  [^()]*([(][^()]*[)][^()]*)*
                 # this matches an arbitrarily long chain of parenthesized
                 # substring, but allows only one nesting level
  $              # make sure we can reach the end of the string like this
)                # end of lookahead
[^(),]*([(][^()]*[)][^(),]*)*
                 # now actually match the desired part. this is the same
                 # as the lookahead, except we do not allow for commas
                 # outside of parentheses now, so that you only get the
                 # first comma-separated part

If you ever add an input example like aaa(xxx(yyy()) where you want to match xxx(yyy()) then this approach will not match it. In fact, no regex that does not use recursion can handle arbitrary nesting levels.

Since your regex flavor doesn't support recursion, you are probably better off without using regex at all. Even if my last regex matches all your current input examples, it's really convoluted and maybe not worth the trouble. How about this instead: walk the string character by character and maintain a stack of parenthesis positions. Then the following pseudocode gives you everything after the last unmatched (:

while you can read another character from the string
    if that character is "(", push the current position onto the stack
    if that character is ")", pop a position from the stack
# you've reached the end of the string now
if the stack is empty, there is no match
else the top of the stack is the position of the last unmatched parenthesis;
     take a substring from there to the end of the string

To then obtain everything up to the first unnested comma, you can walk that result again:

nestingLevel = 0
while you can read another character from the string
    if that character is "," and nestingLevel == 0, stop
    if that character is "(" increment nestingLevel
    if that character is ")" decrement nestingLevel
take a substring from the beginning of the string to the position at which
  you left the loop

These two short loops will be much easier for anyone else to understand in the future and are a lot more flexible than a regex solution (at least one without recursion).
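The two loops above can be sketched in Python roughly as follows. This is only an illustration of the pseudocode, not Object Pascal; the function names are hypothetical, and ignoring a stray ")" that has no opening partner is an assumption the pseudocode does not spell out.

```python
def after_last_unmatched_paren(text):
    """Return the text after the last unmatched "(", or None if there is none."""
    stack = []  # positions of "(" characters that are still open
    for pos, ch in enumerate(text):
        if ch == "(":
            stack.append(pos)
        elif ch == ")" and stack:  # assumption: a stray ")" is ignored
            stack.pop()
    if not stack:
        return None  # every "(" was closed: no match
    return text[stack[-1] + 1:]


def up_to_first_unnested_comma(text):
    """Return text up to (not including) the first comma outside parentheses."""
    nesting_level = 0
    for pos, ch in enumerate(text):
        if ch == "," and nesting_level == 0:
            return text[:pos]
        if ch == "(":
            nesting_level += 1
        elif ch == ")":
            nesting_level -= 1
    return text  # no unnested comma: keep everything


if __name__ == "__main__":
    for sample in ["aaa(xxx, yyy", "aaa(bbb(x, y, z)", "aaa(bbb(x), ccc(x)", "aaa(x)"]:
        tail = after_last_unmatched_paren(sample)
        result = None if tail is None else up_to_first_unnested_comma(tail)
        print(repr(sample), "->", repr(result))
```

Returning None for fully balanced input mirrors the "no match" branch of the pseudocode; a caller that prefers an empty string can substitute one.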

I'm trying to find all the occurrences of "Arrows" in text, so in

"<----=====><==->>"

the arrows are:

"<----", "=====>", "<==", "->", ">"

This works:

 String[] patterns = {"<=*", "<-*", "=*>", "-*>"};
    for (String p : patterns) {
      Matcher A = Pattern.compile(p).matcher(s);
       while (A.find()) {
        System.out.println(A.group());
      }         
    }

but this doesn't:

      String p = "<=*|<-*|=*>|-*>";
      Matcher A = Pattern.compile(p).matcher(s);
       while (A.find()) {
        System.out.println(A.group());
      }         

No idea why. It often reports "<" instead of "<====" or similar.

What is wrong?

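For <======= you need <=+ as the regex. <=* matches zero or more = characters, so it also succeeds on a bare <. Alternation tries the alternatives from left to right, so at a < that is followed by - the first alternative <=* wins with zero = characters and the match is just <. The same goes for the other cases you have. You should read up a bit on regexes. This book is FANTASTIC: Mastering Regular Expressions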
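The effect is easy to reproduce outside Java. Here is a small Python sketch (Python's re module also tries alternatives left to right); the names ORIGINAL, FIXED and find_arrows are made up for this example.

```python
import re

ORIGINAL = r"<=*|<-*|=*>|-*>"   # pattern from the question
FIXED = r"<=+|<-+|=*>|-*>"      # <=+ and <-+ as suggested above


def find_arrows(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return re.findall(pattern, text)


if __name__ == "__main__":
    s = "<----=====><==->>"
    print(find_arrows(ORIGINAL, s))
    print(find_arrows(FIXED, s))
```

With the original pattern the first match is the bare "<" and the run of "-" after it is skipped; with the + quantifiers the first match is "<----". Note that the fixed pattern no longer matches a lone "<", which may or may not be what you want.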

Could you expand on why the output of Console.WriteLine(m.Groups[1]); would be Contoso, Inc? And could you also elaborate on the matching steps of this example? Thanks.

PS: I'm not familiar with the concept of groups

string input = "Company Name: Contoso, Inc."; 
Match m = Regex.Match(input, @"Company Name: (.*$)"); 
Console.WriteLine(m.Groups[1]);

If you are interested in learning the ins and outs of Regular Expressions (along with how the different implementations vary based on language and platform), then I definitely recommend Mastering Regular Expressions from O'Reilly.
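As for the groups part of the question, a rough Python equivalent of the C# snippet may help; this is a sketch, not the .NET API, and the variable names are only illustrative. Group 0 is the whole match, and group 1 is whatever the first pair of parentheses, (.*$), captured.

```python
import re

text = "Company Name: Contoso, Inc."
match = re.search(r"Company Name: (.*$)", text)

# Matching steps:
#   1. The literal text "Company Name: " must appear as written.
#   2. .* then consumes every remaining character on the line.
#   3. $ asserts that the end of the input has been reached.
# The parentheses around .*$ remember what steps 2 and 3 matched.
whole_match = match.group(0)   # everything the pattern matched
company = match.group(1)       # only the parenthesized part

if __name__ == "__main__":
    print(whole_match)
    print(company)
```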

if i have a string like this "Hello - World - Hello World"

I want to replace the characters PRECEDING the FIRST instance of the substring " - "

e.g. so replacing the above with "SUPERDOOPER" would leave: "SUPERDOOPER - World - Hello World"

So far I got this: "^[^-]* - "

But this INCLUDES the " - " which is wrong.

how to do this with regex please?

You can try using a regex with positive lookahead: "^[^-]*(?= - )". As far as I know, C# supports it. This regex will match exactly what you want. You can find out more about lookahead, lookbehind and other advanced regex techniques in the famous book "Mastering Regular Expressions".
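The same lookahead works in other flavors too. Here is a minimal Python sketch of the replacement (the function name replace_prefix is hypothetical); the lookahead checks that " - " follows but does not consume it, so it survives the substitution.

```python
import re

# ^[^-]*   : characters from the start of the string that are not "-"
# (?= - )  : must be followed by " - ", which is not part of the match
PREFIX = re.compile(r"^[^-]*(?= - )")


def replace_prefix(text, replacement):
    """Replace whatever precedes the first " - " with replacement."""
    return PREFIX.sub(replacement, text, count=1)


if __name__ == "__main__":
    print(replace_prefix("Hello - World - Hello World", "SUPERDOOPER"))
```

If the input contains no " - " at all, the pattern does not match and the string comes back unchanged.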

I have a file loaded into a stream reader. The file contains IP addresses scattered about. Before and after each IP address there is a "\", if this helps. Also, the first "\" on each line ALWAYS comes before the IP address; there are no other "\" before this first one.

I already know that I should use a while loop to cycle through each line, but I don't know the rest of the procedure :<

For example:

Powerd by Stormix.de\93.190.64.150\7777\False

Cupserver\85.236.100.100\8178\False

Euro Server\217.163.26.20\7778\False

in the first example i would need "93.190.64.150" in the second example i would need "85.236.100.100" in the third example i would need "217.163.26.20"

I really struggle with parsing/splicing/dicing :s

thanks in advance

*** I require to keep the IP in a string a bool return is not sufficient for what i want to do.

using System.Text.RegularExpressions;

…

var sourceString = "put your string here";
var match = Regex.Match(sourceString, @"\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b");
if (match.Success) Console.WriteLine(match.Captures[0]);

This will match any IP address, but also 999.999.999.999. If you need more exactness, see details here: http://www.regular-expressions.info/examples.html

The site has lots of great info on regular expressions, which are a domain-specific language used within most popular programming languages for text pattern matching. Actually, I think the site was put together by the author of Mastering Regular Expressions.

update

I modified the code above to capture the IP address, as you requested (by adding parentheses around the IP address pattern). Now we check to make sure there was a match using the Success property, and then you can get the IP address using Captures[0] (because we only have one capture group, we know to use the first index, 0).
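If stricter checking is wanted, one option is to let the regex find candidates and leave the range check to a library. The Python sketch below swaps in the standard ipaddress module for that second step; it is not the C# approach above, and the names CANDIDATE and first_ip are made up for this example.

```python
import ipaddress
import re

# Same shape as the pattern above: four dot-separated groups of 1 to 3 digits.
CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def first_ip(line):
    """Return the first valid IPv4 address found in line, or None."""
    for match in CANDIDATE.finditer(line):
        try:
            ipaddress.IPv4Address(match.group(0))  # rejects e.g. 999.999.999.999
        except ValueError:
            continue
        return match.group(0)
    return None


if __name__ == "__main__":
    for line in [r"Powerd by Stormix.de\93.190.64.150\7777\False",
                 r"Cupserver\85.236.100.100\8178\False",
                 r"Euro Server\217.163.26.20\7778\False"]:
        print(first_ip(line))
```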

I'm trying to match a div block that has a particular id. Here's my regex code:

<div\s+[^>]*\s*id\s*=\s*["|']content["|']\s*>[^/div]+

I want the regex to match the whole div block. So I put [^/div]+ in my regex, assuming that it would match the remaining characters until it reaches the end of the div, but it fails to get that far because the [^] expression takes it to mean that I don't want to match any of the individual characters /, d, i or v. What I want is for the whole thing to be treated as a unit. Putting in a [^()] doesn't help either.

So please tell me how I should code this.

<div id="content">
    <noscript></noscript>
    <a href="blabla.com">
    <h1>
       <a href="blablac.com">Blablabla</a>
    </h1>
</div>

DISCLAIMER: First, I agree that, in general, regex is not the best tool for parsing HTML. However, in the right hands, (and with a few caveats), Philip Hazel's powerful (and most assuredly non-REGULAR) PCRE library, (used by PHP's preg_*() family of functions), does allow solving non-trivial data scraping problems such as this one (with some limitations and caveats - see below). The problem stated above is particularly complex to solve using regex alone, and regex solutions such as the one presented below are not for everyone and should never be attempted by a regex novice. To properly understand the answer below requires fairly deep comprehension of several advanced regex constructs and techniques.

Won't someone please think of the Children! Yes, I have read bobince's legendary answer and I know this is a touchy subject around here (to say the least). But please, if you are tempted to immediately click the down-vote arrow, because I am '/(?:actual|brave|stupid)ly/' using the words: REGEX and: HTML in the same breath (and on a non-trivial problem no-less), I would humbly ask you to refrain long enough to read this entire post and to actually try this solution out for yourself.

With that in mind, if you would like to see how an advanced regex can be crafted to solve this problem, (for all but a few (unlikely) special cases - see below for examples), read on...

AN ADVANCED RECURSIVE REGEX SOLUTION: As Wes Hardaker correctly points out, DIVs can be (and frequently are) nested. However, he is not 100% correct when he says "you can't construct one that will match up until the correct </div>". The truth is, with PHP, you can! (with some limitations - see below). Like Perl, the PCRE regex engine in PHP provides recursive expressions (i.e. (?R), (?1), (?2), etc) which allow matching nested structures to any arbitrary depth (limited only by memory). (.NET has no (?R) construct, but offers balancing groups for the same purpose.) For example, you can easily match balanced nested parentheses with this expression: '/\((?:[^()]++|(?R))*+\)/'. Run this simple test if you have any doubts:

$text = 'zero(one(two)one(two(three)two)one)zero';
if (preg_match('/\((?:[^()]++|(?R))*+\)/', $text, $matches)) {
    print_r($matches);
}

So if we can all agree that a PHP regex can, indeed, match nested structures, let's move on to the problem at hand. This particular problem is complicated by the fact that the outermost DIV must have the id="content" attribute, but any nested DIVs may or may not. Thus, we can't use the (?R) recursively-match-the-whole-expression construct, because the subexpression to match the outer DIV is not the same as the one needed to match the inner DIVs. In this case, we need to have a capture group (in this case group 2), that will serve as a "recursive subroutine", which matches inner, nested DIV's. So here is a tested PHP code snippet, sporting an advanced not-for-the-faint-of-heart-but-fully-commented-so-that-you-might-actually-be-able-to-make-some-sense-out-of-it regex, which correctly matches (in most cases - see below), a DIV having id="content", which may itself contain nested DIVs:

$re = '% # Match a DIV element having id="content".
    <div\b             # Start of outer DIV start tag.
    [^>]*?             # Lazily match up to id attrib.
    \bid\s*+=\s*+      # id attribute name and =
    ([\'"]?+)          # $1: Optional quote delimiter.
    \bcontent\b        # specific ID to be matched.
    (?(1)\1)           # If open quote, match same closing quote
    [^>]*+>            # remaining outer DIV start tag.
    (                  # $2: DIV contents. (may be called recursively!)
      (?:              # Non-capture group for DIV contents alternatives.
      # DIV contents option 1: All non-DIV, non-comment stuff...
        [^<]++         # One or more non-tag, non-comment characters.
      # DIV contents option 2: Start of a non-DIV tag...
      | <              # Match a "<", but only if it
        (?!            # is not the beginning of either
          /?div\b      # a DIV start or end tag,
        | !--          # or an HTML comment.
        )              # Ok, that < was not a DIV or comment.
      # DIV contents Option 3: an HTML comment.
      | <!--.*?-->     # A non-SGML compliant HTML comment.
      # DIV contents Option 4: a nested DIV element!
      | <div\b[^>]*+>  # Inner DIV element start tag.
        (?2)           # Recurse group 2 as a nested subroutine.
        </div\s*>      # Inner DIV element end tag.
      )*+              # Zero or more of these contents alternatives.
    )                  # End $2: DIV contents.
    </div\s*>          # Outer DIV end tag.
    %isx';
if (preg_match($re, $text, $matches)) {
    printf("Match found:\n%s\n", $matches[0]);
}

As I said, this regex is quite complex, but rest assured, it does work, with the exception of some unlikely cases noted below (and probably a few more, which I would be very grateful to hear about if you find them). Try it out and see for yourself!

Should I use this? Would it be appropriate to use this regex solution in a production environment where hundreds or thousands of documents must be parsed with 100% reliability and accuracy? Of course not. Could it be useful for a limited one time run of some HTML files? (e.g. possibly the person who asked this question?) Possibly. It depends on how comfortable one is with advanced regexes. If the regex above looks like it was written in a foreign language (it is), and/or scares the dickens out of you, the answer is probably no.

It works? Yes. For example, given the following test data, the regex above correctly picks out the DIV having the id="content" (or id='content' or id=content for that matter):

<!DOCTYPE HTML SYSTEM>
<html>
<head><title>Test Page</title></head>
<body>
<div id="non-content-div">
    <h1>PCRE does recursion!</h1>
    <div id='content'>
        <h2>First level matched</h2>
        <!-- this comment </div> is tricky -->
        <div id="one-deep">
            <h3>Second level matched</h3>
            <div id=two-deep>
                <h4>Third level matched</h4>
                <div id=three-deep>
                    <h4>Fourth level matched</h4>
                </div>
                <p>stuff</p>
            </div>
            <!-- this comment <div> is tricky -->
            <p>stuff</p>
        </div>
        <p>stuff</p>
    </div>
    <p>stuff</p>
</div>
<p>stuff</p>
</body></html>

CAVEATS: So what are some scenarios where this solution does not work? Well, DIV start tags may NOT have any angle brackets in any of their attributes (it is possible to remove this limitation, but this adds quite a bit more to the code). And the following CDATA spans, which contain the specific DIV start tag we are looking for (highly unlikely), will cause the regex to fail:

<style type="text/css">
p:before {
    content: 'Unlikely CSS string with <div id=content> in it.';
}
</style>
<p title="Unlikely attribute with a <div id=content> in it">stuff</p>
<script type="text/javascript">
    alert("evil script with <div id=content> in it">");
</script>
<!-- Comment with <div id="content"> in it -->
<![CDATA[ a CDATA section with <div id="content"> in it ]]>

I would very much like to know of any others.

GO READ MRE3 As I said before, to truly grasp what is going on here requires a pretty deep understanding of several advanced techniques. These techniques are not obvious or intuitive. There is only one way that I know of to gain these skills and that is to sit down and study: Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl (MRE3). (You will be glad you did!)

I can honestly say that this is the most useful book I have read in my entire life!

Cheers!

EDIT 2013-04-30 Fixed Regex. It previously disallowed a non-DIV tag which immediately followed the DIV start tag.

Is there a simple and lightweight program to search over a text file and replace a string with regex?

sed or awk. I recommend the book sed & awk to master the subject, or the booklet sed & awk Pocket Reference for a quick reference. Of course, Mastering Regular Expressions is a must...

I would gladly appreciate some help in understanding what's going on in the code below. It's just not clicking. So, the snippet (taken from this book)

s/(?<=\d)(?=(\d\d\d)+$)/,/g

converts a number 123456789 to 123,456,789. (g is the global flag). Now, say we have the number 1234. From my understanding (?<=\d) will place us in front of 1 like so 1|234. Then, (?=(\d\d\d)+$) picks up where the look behind left off and evaluate the remaining digits. Since 234 matches the pattern (3 digit and one end line), our substitution takes place (1,234). I hope I got this right.

Now, I'm confused when I make my numbers bigger say 1234567. When I put this into a regex tester I get 1|234|567 but in my mind I expected 1234|567. So...why ? Why does the look ahead for 234 evaluate to true when 4 is not terminated by an end line ? Does this have anything to do with the global flag? Thanks.

The lookahead looks for multiples of three digits: (\d\d\d)+ matches 3, 6, 9, ... digits, and therefore it matches before 234567.

And yes, the global flag has to do with the regex matching twice (although without it, as you can easily test, the result would be 1|234567).

Let's see what happens when we go through the string "1234567":

1.  1234567
   ^ (?<=\d) doesn't match - regex fails.
2. 1 234567
    ^ (?<=\d) matches "1", (?=(\d\d\d)+$) matches "234567"! MATCH!
3. 12 34567
     ^ (?<=\d) matches "2", (?=(\d\d\d)+$) doesn't match.
4. 123 4567
      ^ (?<=\d) matches "3", (?=(\d\d\d)+$) doesn't match.
5. 1234 567
       ^ (?<=\d) matches "4", (?=(\d\d\d)+$) matches "567"! MATCH!
6. 12345 67
        ^ (?<=\d) matches "5", (?=(\d\d\d)+$) doesn't match.
7. 123456 7
         ^ (?<=\d) matches "6", (?=(\d\d\d)+$) doesn't match.
8. 1234567
          ^ (?<=\d) matches "7", (?=(\d\d\d)+$) doesn't match.
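The walk-through can be checked with a few lines of Python, whose re module supports both lookarounds used here. This is just a sketch to reproduce the steps; the names PATTERN, add_commas and match_positions are hypothetical.

```python
import re

# (?<=\d)        : there is a digit immediately to the left
# (?=(\d\d\d)+$) : the digits to the right form complete groups of three
PATTERN = re.compile(r"(?<=\d)(?=(\d\d\d)+$)")


def add_commas(digits):
    """Insert a comma at every position where the pattern matches."""
    return PATTERN.sub(",", digits)


def match_positions(digits):
    """Return the offsets at which the zero-width pattern matches."""
    return [m.start() for m in PATTERN.finditer(digits)]


if __name__ == "__main__":
    print(match_positions("1234567"))  # offsets of steps 2 and 5 above
    print(add_commas("1234567"))
```

The pattern consumes no characters, so each match is an empty string at a position between two digits, and the substitution simply inserts a comma there.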

Edit: I'm really just curious as to how I can get this regex to work. Please don't tell me there are easier ways to do it. That's obvious! :P

I'm writing a regular expression (using Python) to parse lines in a configuration file. Lines could look like this:

someoption1 = some value # some comment
# this line is only a comment
someoption2 = some value with an escaped \# hash
someoption3 = some value with a \# hash # some comment

The idea is that anything after a hash symbol is considered to be a comment, except if the hash is escaped with a slash.

I'm trying to use a regex to break each line into its individual pieces: leading whitespace, left side of the assignment, right side of the assignment, and comment. For the first line in the example, the breakdown would be:

  • Whitespace: ""
  • Assignment left: "someoption1 ="
  • Assignment right: " some value "
  • Comment "# some comment"

This is the regex I have so far:

^(\s)?(\S+\s?=)?(([^\#]*(\\\#)*)*)?(\#.*)?$

I'm terrible with regex, so feel free to tear it apart!

Using Python's re.findall(), this is returning:

  • 0th index: the whitespace, as it should be
  • 1st index: the left side of the assignment
  • 2nd index: The right side of the assignment, up to the first hash, whether escaped or not (which is incorrect)
  • 5th index: The first hash, whether escaped or not, and anything after it (which is incorrect)

There's probably something fundamental about regular expressions that I'm missing. If somebody can solve this I'll be forever grateful...

Of the 5 solutions presented thus far, only Gumbo's actually works. Here is my solution, which also works and is heavily commented:

import re

def fn(line):
    match = re.search(
        r"""^          # Anchor to start of line
        (\s*)          # $1: Zero or more leading ws chars
        (?:            # Begin group for optional var=value.
          (\S+)        # $2: Variable name. One or more non-spaces.
          (\s*=\s*)    # $3: Assignment operator, optional ws
          (            # $4: Everything up to comment or EOL.
            [^#\\]*    # Unrolling the loop 1st normal*.
            (?:        # Begin (special normal*)* construct.
              \\.      # special is backslash-anything.
              [^#\\]*  # More normal*.
            )*         # End (special normal*)* construct.
          )            # End $4: Value.
        )?             # End group for optional var=value.
        ((?:\#.*)?)    # $5: Optional comment.
        $              # Anchor to end of line""", 
        line, re.MULTILINE | re.VERBOSE)
    return match.groups()

print (fn(r" # just a comment"))
print (fn(r" option1 = value"))
print (fn(r" option2 = value # no escape == IS a comment"))
print (fn(r" option3 = value \# 1 escape == NOT a comment"))
print (fn(r" option4 = value \\# 2 escapes == IS a comment"))
print (fn(r" option5 = value \\\# 3 escapes == NOT a comment"))
print (fn(r" option6 = value \\\\# 4 escapes == IS a comment"))

The above script produces the following (correct) output: (tested with Python 3.0.1)

(' ', None, None, None, '# just a comment')
(' ', 'option1', ' = ', 'value', '')
(' ', 'option2', ' = ', 'value ', '# no escape == IS a comment')
(' ', 'option3', ' = ', 'value \\# 1 escape == NOT a comment', '')
(' ', 'option4', ' = ', 'value \\\\', '# 2 escapes == IS a comment')
(' ', 'option5', ' = ', 'value \\\\\\# 3 escapes == NOT a comment', '')
(' ', 'option6', ' = ', 'value \\\\\\\\', '# 4 escapes == IS a comment')

Note that this solution uses Jeffrey Friedl's "Unrolling the loop" efficiency technique (which eliminates slow alternation). It also uses no lookaround at all and is very fast. Mastering Regular Expressions (3rd edition) is a must read for anyone claiming to "know" regular expressions. (And when I say "know", I mean in the Neo "I know Kung-Fu!" sense :)

Greetings.

I've been tasked with debugging part of an application that involves a Regex -- but, I have never dealt with Regex before. Two questions:

1) I know that the regexes are supposed to be testing whether or not two strings are equivalent, but what specifically do the two regex statements, below, mean in plain English?

2) Does anyone have a recommendation on websites / sources where I can learn more about Regexes? (preferably in C#)

if (Regex.IsMatch(testString, @"^(\s*?)(" + tag + @")(\s*?),", RegexOptions.IgnoreCase))
                {
                    result = true;
                }
else if (Regex.IsMatch(testString, @",(\s*?)(" + tag + @")(\s*?),", RegexOptions.IgnoreCase))
                {
                    result = true;
                }

I'm not C# savvy, but I can recommend an awesome guide to regular expressions that I use for Bash and Java programming. It applies to pretty much all languages:

http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=tmm_pap_title_0

It is totally worth $30 to own this book. It is VERY thorough and helped my fundamental understanding of Regex a lot.

-Ryan

I am using regular expressions to validate user input. The following code collects matches accessible with theMatch.Groups["identifier"]. How can I get a list of sub-strings that did not match in each group?

#region Using directives

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

#endregion

namespace RegExGroup
{

   class Test
   {
      public static void Main( )
      {
         string string1 = "04:03:27 127.0.0.0 comcom.com";

         // group time = one or more digits or colons followed by space
         Regex theReg = new Regex( @"(?<time>(\d|\:)+)\s" +
         // ip address = one or more digits or dots followed by  space
         @"(?<ip>(\d|\.)+)\s" +
         // site = one or more characters
         @"(?<site>\S+)" );

         // get the collection of matches
         MatchCollection theMatches = theReg.Matches( string1 );

         // iterate through the collection
         foreach ( Match theMatch in theMatches )
         {
           if ( theMatch.Length != 0 )
           {
            Console.WriteLine( "\ntheMatch: {0}",
               theMatch.ToString( ) );
            Console.WriteLine( "time: {0}",
               theMatch.Groups["time"] );
            Console.WriteLine( "ip: {0}",
               theMatch.Groups["ip"] );
            Console.WriteLine( "site: {0}",
               theMatch.Groups["site"] );
           }
         }
      }
   }
}

so if user inputs 0xx:03:27 127.0.0.0 ?.com
I want to output

 time:  0xx:03:27
 site:  ?.com

Also, does anyone have good references for using regexes in C#?
Thanks, any help appreciated.

Are you asking how to determine which specific capture group failed to match? To my knowledge once the match fails you won't be able to extract such information. If it fails it fails; no partial matching attempt info can be retrieved. What you could do is apply the entire regex as is to check the pattern in the desired order. Then, if it fails, try each part of your regex separately and tell the user which one failed (time, ip, site). This approach may make sense in this scenario, but might not work for all types of patterns.
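The "try each part separately" idea might look roughly like this in Python. It is a sketch under assumptions: the input is taken to be three whitespace-separated fields, and the per-field patterns below are stricter stand-ins invented for the example, not the patterns from the question (the question's \S+ would accept ?.com).

```python
import re

# Hypothetical per-field patterns, checked in the order the fields appear.
FIELD_PATTERNS = [
    ("time", re.compile(r"\d{2}:\d{2}:\d{2}")),
    ("ip", re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")),
    ("site", re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")),
]


def failed_fields(line):
    """Return (field name, offending text) for every field that fails."""
    values = line.split()
    failures = []
    for index, (name, pattern) in enumerate(FIELD_PATTERNS):
        value = values[index] if index < len(values) else ""
        if not pattern.fullmatch(value):
            failures.append((name, value))
    return failures


if __name__ == "__main__":
    for name, value in failed_fields("0xx:03:27 127.0.0.0 ?.com"):
        print(f"{name}:  {value}")
```

Splitting first and validating each piece on its own is what makes the per-field report possible; a single combined regex only tells you that the whole line failed.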

Regarding references: if you're looking for a good book, the most popular is Jeffrey Friedl's Mastering Regular Expressions. A recent book that has positive ratings is the Regular Expressions Cookbook.

I have a list of domain names that I have pulled out of an httpd.conf file. These domain names are formatted as:

tld.com
cname.tld.com

Very few of these domain names share a TLD.

I am trying to output a list of just the TLD for each domain name but can't seem to figure out how to manipulate the STDOUT to show just tld.com for each string given.

For example. Let's say my list has:

site1.com
site2.com
mail.site2.com
www.site3.com
site4.com

The result I need from this list would reflect:

site1.com
site2.com
site2.com
site3.com
site4.com

Any thoughts on how I can do this?

When you’re trying to find patterns in text, think regular expressions.

This Wikipedia page lists the most common symbols you’ll use in regular expressions.

Now, a TLD is a series (*) of non-dots ([^.]), a dot (\.), another series of non-dots ([^.]*), and then the end of the line ($). The regular expression for this is:

[^.]*\.[^.]*$

Which you can use like this:

$ cat foo
site1.com
site2.com
mail.site2.com
www.site3.com
site4.com
$ grep -o '[^.]*\.[^.]*$' foo
site1.com
site2.com
site2.com
site3.com
site4.com
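If grep is not at hand, the same expression can be applied from a script. A minimal Python sketch, assuming one domain name per string; the names LAST_TWO_LABELS and base_domain are hypothetical.

```python
import re

# Same expression as above: non-dots, a dot, non-dots, end of string.
LAST_TWO_LABELS = re.compile(r"[^.]*\.[^.]*$")


def base_domain(name):
    """Return the last two dot-separated labels of name."""
    match = LAST_TWO_LABELS.search(name)
    return match.group(0) if match else name


if __name__ == "__main__":
    for name in ["site1.com", "site2.com", "mail.site2.com", "www.site3.com", "site4.com"]:
        print(base_domain(name))
```

Like the grep version, this keeps exactly two labels, so it would shorten a name such as example.co.uk to co.uk.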

I'm trying to remove square-bracket tags (bbcode style) using JavaScript, in order to strip unwanted bbcode. I tried this:

theString .replace(/\[quote[^\/]+\]*\[\/quote\]/, "")

it works with this string sample:

theString = "[quote=MyName;225]Test 123[/quote]";

but it fails on this sample:

theString = "[quote=MyName;225]Test [quote]inside quotes[/quote]123[/quote]";

A solution that doesn't use regex is fine too.

The other 2 solutions simply do not work (see my comments). To solve this problem you first need to craft a regex which matches the innermost matching quote elements (which contain neither [QUOTE..] nor [/QUOTE]). Next, you need to iterate, applying this regex over and over until there are no more QUOTE elements left. This tested function does what you want:

function filterQuotes(text)
{ // Regex matches inner [QUOTE]non-quote-stuff[/quote] tag.
    var re = /\[quote[^\[]+(?:(?!\[\/?quote\b)\[[^\[]*)*\[\/quote\]/ig;
    while (text.search(re) !== -1)
    { // Need to iterate removing QUOTEs from inside out.
        text = text.replace(re, "");
    }
    return text;
}

Note that this regex employs Jeffrey Friedl's "Unrolling the loop" efficiency technique and is not only accurate, but is quite fast to boot.

See: Mastering Regular Expressions (3rd Edition) (highly recommended).
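For readers outside JavaScript, the same inside-out loop can be sketched in Python. This is a port made for illustration, not the tested function above; the regex is carried over unchanged apart from dropping the escapes on "/", and the names INNER_QUOTE and filter_quotes are hypothetical.

```python
import re

# Matches an innermost [quote...]...[/quote] element, i.e. one whose body
# contains neither another [quote...] nor a [/quote].
INNER_QUOTE = re.compile(
    r"\[quote[^\[]+(?:(?!\[/?quote\b)\[[^\[]*)*\[/quote\]",
    re.IGNORECASE,
)


def filter_quotes(text):
    """Remove all quote elements, working from the innermost outwards."""
    while True:
        text, count = INNER_QUOTE.subn("", text)
        if count == 0:  # nothing left to remove
            return text


if __name__ == "__main__":
    sample = "[quote=MyName;225]Test [quote]inside quotes[/quote]123[/quote]"
    print(repr(filter_quotes(sample)))
```

Each pass strips the current innermost elements, which turns their parents into innermost elements for the next pass.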

I need help from the Regex wizards out there. I am trying to write a simple parser that can tokenize the options list of a Snort rule (Snort, the IDS/IPS software). Problem is, I can't seem to find a workable formula that breaks apart the individual rule options based on their terminating semi-colon. The formulas that I have cooked up grab all options between parentheses into a single capture group.

I am using the excellent RegExr tool at the GSkinner site with some of the below sample rule options from Emerging Threats (I parsed off the rule header -- that's easy to tokenize):

(msg:"ET DELETED Majestic-12 Spider Bot User-Agent (MJ12bot)"; flow:to_server,established; content:"|0d 0a|User-Agent\: MJ12bot|0d 0a|"; classtype:trojan-activity; reference:url,www.majestic12.co.uk/; reference:url,doc.emergingthreats.net/2003409; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_Majestic-12; sid:2003409; rev:4;)
(msg:"ET DELETED Majestic-12 Spider Bot User-Agent Inbound (MJ12bot)"; flow:to_server,established; content:"|0d 0a|User-Agent\: MJ12bot"; classtype:trojan-activity; reference:url,www.majestic12.co.uk/; reference:url,doc.emergingthreats.net/2007762; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_Majestic-12; sid:2007762; rev:4;)
(msg:"ET POLICY McAfee Update User Agent (McAfee AutoUpdate)"; flow:to_server,established; content:"User-Agent|3a| "; http_header; nocase; content:"McAfee AutoUpdate"; http_header; pcre:"/User-Agent\x3a[^\n]+McAfee AutoUpdate/i"; classtype:not-suspicious; reference:url,doc.emergingthreats.net/2003381; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_McAffee; sid:2003381; rev:6;)
(msg:"ET DELETED Metacafe.com family filter off"; flow:established,to_server; content:"POST"; http_method; content:"Host|3a| www.metacafe.com"; http_header; fast_pattern:6,16; content:"submit=Continue+-+I%27m+over+18"; classtype:policy-violation; reference:url,doc.emergingthreats.net/2006367; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_Metacafe; sid:2006367; rev:7;)

And this is the formula:

([a-zA-Z0-9_:]+(?:[\w\s.,\-/=<>+!\[\]\(\)\{\}\"|\\;'?`~@#$%^&*])+;)

The problem is, it doesn't handle colons. So two of the rules above will not have their 'content' options properly parsed. But on RegExr, each option will be highlighted in blue, including the terminating semi-colon, but NOT the space after the semi-colon. If I fed this into .NET, I should be able to do a Regex.Split and break apart all the tokens correctly.

If I add the colon to the character list, then on RegExr, the entire set of rules will get tokenized as a single blob of text, which is not what I want. Further attempts to tweak the formula result in Adobe Flash crashing, indicating I'm hitting a bug in either Flash or RegExr.

I've not ruled out writing my own string tokenizer, but I was hoping regex could save me from dealing with things like counting my open quotations, escaped characters, whitespace, etc.

Snort rule options typically come in the following format:

option:value;
option:"string value";
option:!"negated string value";
option:>num;
option:param1,param2,param3;

But several options tend to have more 'exotic' formats for their value, like byte_test. And then there is everyone's favourite, 'pcre', which is basically an option for performing Perl-compatible regexes. So any such tokenizer has to avoid getting confused if it runs into the 'pcre' keyword with a regex in it.

Thoughts?


Edit: This below is REALLY close:

([\w]+:?(?:[\x20]|)?(?:[\x00-\xff])*?;)

But, according to RegExr, it gets messed by pcre syntax:

(msg:"ET WEB_SPECIFIC_APPS Horde 3.0.9-3.1.0 Help Viewer Remote PHP Exploit"; flow:established,to_server; content:"/services/help/"; nocase; http_uri; pcre:"/module=[^\;]*\;.*\"/UGi"; classtype:web-application-attack; reference:url,www.milw0rm.com/exploits/1660; reference:cve,2006-1491; reference:bugtraq,17292; reference:url,doc.emergingthreats.net/2002867; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/WEB_SPECIFIC_APPS/WEB_Horde; sid:2002867; rev:9; http_method;)

In the above, every single option is highlighted as a distinct grouping, except ]*\;.*\"/. I would think that \x00-\xff would get it all, but it appears that I am using a lazy match. A greedy match gets everything, including all the spaces between options, which I do not want. So I need to somehow modify the regex to handle tokenizing pcre text.


Edit2:This does the trick:

([\w]+:?(?:[\x20]|)?(?<!\\)\"?.*?(?<!\\)\"?;)

I had to play with a few example regexes that work with quoted strings. I finally realized that what I was staring at were negative look-behinds that skip quotes that are escaped. This seems to handle any other escaped character, too, because escaped characters only appear inside unescaped quotes.

No need for lookaround. Just carefully write the regex to precisely match what you need. This is made much clearer (and easier to maintain) by writing this in verbose free-spacing mode like so: (Although VB.NET syntax makes it awkward to do so)

Dim RegexObj As New Regex(
    "# Match set of Snort rules enclosed within parentheses." & chr(10) & _
    "\(                              # Literal opening parenthesis." & chr(10) & _
    "(?:                             # Group for one or more rules." & chr(10) & _
    "  \w+                           # Required rule name." & chr(10) & _
    "  (?:                           # Group for optional rule value." & chr(10) & _
    "    :                           # Rule name/values separated by :" & chr(10) & _
    "    (?:                         # Group for rule value alternatives." & chr(10) & _
    "      ""                        # Either a double quoted string," & chr(10) & _
    "      [^""\\]*                  # {normal} Use ""Unrolling the Loop""." & chr(10) & _
    "      (?:                       # Begin {(special normal*)*} construct." & chr(10) & _
    "        \\.                     # {special} == escaped anything." & chr(10) & _
    "        [^""\\]*                # More {normal*} non-quote, non-escapes." & chr(10) & _
    "      )*                        # Finish {(special normal*)*} construct." & chr(10) & _
    "      ""                        # Closing quote." & chr(10) & _
    "    | '[^'\\]*(?:\\.[^'\\]*)*'  # or a single quoted string," & chr(10) & _
    "    | [^;]+                     # or one or more non semi-colons." & chr(10) & _
    "    )                           # End group for rule value options." & chr(10) & _
    "  )?                            # Rule value is optional." & chr(10) & _
    "  ; \s*                         # Rule ends with ;, optional ws." & chr(10) & _
    ")+                              # One or more rules." & chr(10) & _
    "\)                              # Literal closing parenthesis.",
    RegexOptions.IgnorePatternWhitespace)
Dim MatchResults As Match = RegexObj.Match(SubjectString)
While MatchResults.Success
    ' matched text: MatchResults.Value
    ' match start: MatchResults.Index
    ' match length: MatchResults.Length
    MatchResults = MatchResults.NextMatch()
End While

This regex demonstrates use of Jeffrey Friedl's "Unrolling the Loop" efficiency technique for correctly matching quoted strings which may contain escaped characters. (See: MRE3)
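The same unrolled quoted-string idea can also be used to pull out the individual options rather than matching the whole parenthesized body at once. The following is only a rough sketch in Python, not part of the answer above: the names OPTION and tokenize_options are made up for illustration, and it assumes the text passed in is the body between the outer parentheses.

```python
import re

# Hypothetical helper, for illustration only.
OPTION = re.compile(r'''
    (\w+)                                  # group 1: option name
    (?:
        \s* : \s*                          # name/value separator
        (                                  # group 2: the value, either
            !? " [^"\\]* (?: \\. [^"\\]* )* "   # a (possibly negated) quoted string, unrolled
          | [^;]+                          # or a run of non-semicolons
        )
    )?                                     # the value is optional (e.g. nocase;)
    \s* ; \s*                              # terminating semicolon, optional whitespace
''', re.VERBOSE)

def tokenize_options(body):
    """Return (name, value) pairs; value is None for bare keywords."""
    return [(m.group(1), m.group(2)) for m in OPTION.finditer(body)]

rule_body = r'msg:"ET WEB"; flow:established,to_server; nocase; pcre:"/module=[^\;]*\;.*\"/UGi"; sid:2002867;'
for name, value in tokenize_options(rule_body):
    print(name, value)
```

Because escaped characters are consumed as a pair (a backslash plus whatever follows), the escaped semicolons and the escaped quote inside the pcre value do not end the token early.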

Oh yeah, one more thing... Icarus has found you!

Possible Duplicate:
Useful Regular Expression Tutorial

Hello,

I recently started coding in javascript.

I came across "Regular Expressions" while searching (googling) for ways to validate forms (name, email IDs, etc.)

Can someone help this newbie coder by explaining:

  1. What are regular expressions?
  2. How are they useful in programming?
  3. Are they simple to understand?
  4. Where can I get some good reference to learn these?

Any help is much appreciated.

Thank you!

Your best bet is to read up on them somewhere like Wikipedia and use an online tool to try them out. I've not been using regex that long, but I find that places like Stack Overflow will come to your rescue if you get stuck. Most of the time it comes down to knowing whether a regex is the right tool for the job, then finding an example of what you are after and making it work for you.

Online regex testers:- http://www.regular-expressions.info/javascriptexample.html or http://www.fileformat.info/tool/regex.htm or google will show you loads more.

If you're on Windows, a good tool for regex is Expresso. It has built-in examples to try out, and it also shows you a breakdown of the regex so you can see what it's matching. Another nice feature, which a colleague showed me recently, is that it can generate source code (C# or VB) for the regex you enter, which is a real time saver! It may also be worth buying a book on the subject: I've heard that this book is good, and there are loads of others that come well recommended.

Again a regex question.

What's more efficient? Cascading a lot of Regex.Replace with each one a specific pattern to search for OR only one Regex.Replace with an or'ed pattern (pattern1|pattern2|...)?

Thanks in advance, Fabian

It depends on how big your text is and how many matches you expect. If at all possible, put a text literal or anchor (e.g. ^) at the front of the Regex. The .NET Regex engine will optimize this so that it searches for that text using a fast Boyer-Moore algorithm (which can skip characters) rather than a standard IndexOf that looks at each character. In the case that you have several patterns with literal text at the front, there is an optimization to create a set of possible start characters. All others are ignored quickly.

In general, you might want to consider reading Mastering Regular Expressions which highlights general optimizations to get an idea for better performance (especially chapter 6).

I'd say you might get faster perf if you put everything in one Regex, but put the most likely option first, followed by the second most likely, etc. The number one thing to watch out for is backtracking. If you do something like

".*"

to match a quoted string, realize that once it finds the first " then it will always go to the end of the line by default and then start backing up until it finds another ".

The Mastering Regular Expressions book goes into great depth on how to avoid this.
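To make the backtracking point concrete, here is a small sketch in Python (the sample line is made up). The greedy dot-star runs to the end of the line and backs up to the last quote, while a negated character class stops at the first closing quote without any backing up:

```python
import re

line = 'say "hello" and "goodbye" now'

# Greedy dot-star: overshoots to the end of the line, then backtracks to the LAST quote.
greedy = re.search(r'".*"', line).group(0)

# Negated class: can never run past a quote, so it stops at the FIRST closing quote.
negated = re.search(r'"[^"]*"', line).group(0)

print(greedy)   # "hello" and "goodbye"
print(negated)  # "hello"
```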

I am searching for text inside of website resources (html and javascript), and need to identify 3 regular expressions that will locate this text under certain circumstances:

  1. some string of text when it is contained inside of a javascript single-quoted string
  2. some string of text when it is contained inside of a javascript double-quoted string
  3. some string of text when it is not contained inside of a javascript string

Here are some scenarios that are likely to occur (searching for the string "somestring"):

document.write("here is a bunch of text and somestring is inside of it");
var thing = 'here is a bunch of text and somestring is inside of it';
document.write("some text and 'quote' and then somestring here");
document.write('some text and "quote" and then somestring here');
var thing = "some text and '" + quotedVar + "' and then somestring here");
document.write('some text and "' + quotedVar + '" and then ' + " more " + "somestring here");
this string is outside javascript and here is a 'quoted' line and somestring is after it
this string is outside javascript and here is a "quoted" line and somestring is after it

These examples might all appear inside the same file, and so the regular expressions should not assume single-case scenarios.

I have tried the following for finding single-quoted and double-quoted strings, but alas I have failed miserably:

single quotes:

([=|(|\+]\s*?'[^']*?(?:'[^']*?'[^']*?)*?somestring)

double quotes:

([=|(|\+]\s*?"[^"]*?(?:"[^"]*?"[^"]*?)*?somestring)

These work when assuming the right conditions, but there are many real-world scenarios that I have tried (read, real javascript files) where they fail. Any help is greatly appreciated!

Edit: For clarification, I am looking for 3 regular expression for each of the conditions listed above, not one that covers all cases.

Be careful what you ask for!

The first two of your three objectives can be done fairly well using regex, but it is not trivial and is not 100% reliable - (see caveats below).

Picking strings out of JavaScript

First, let's look at how to pick out single and double quoted sub-strings from a longer string of purely Javascript code (not HTML). Note that to do this correctly, a regex must not only match both types of quoted strings, it must also match both single and multi-line comments. This is because quotes may appear inside comments (e.g. /* I can't take it! */), and these quotes must be ignored. Also, comment delimiters may appear inside quoted strings (e.g. var str = "This: /* can cause trouble too!";), so all four constructs must be parsed out in one pass. Here is a regex which matches both types of comments and both types of quoted strings. It is presented in commented, verbose mode (using PHP single quoted syntax):

$re = '%# Parse comments and quoted strings from javascript code.
      /\*[^*]*\*+(?:[^*/][^*]*\*+)*/             # A multi-line comment, or
    | (\'[^\'\\\\]*(?:\\\\[\S\s][^\'\\\\]*)*\')  # $1: Single quoted string, or
    | ("[^"\\\\]*(?:\\\\[\S\s][^"\\\\]*)*")      # $2: Double quoted string, or
    | //.*                                       # A single line comment.
    %x';

This regex captures single quoted strings into group $1 and double quoted strings into group $2 (and either type of string may contain escaped characters, e.g. 'That\'s cool!'). Both types of comments are captured by the overall match when neither the $1 nor the $2 capture group matches. Note also that this regex implements Jeffrey Friedl's "unrolling-the-loop" efficiency technique (See: Mastering Regular Expressions (3rd Edition)), so it is quite fast.

process_js()
The following Javascript function: process_js(), implements the above regex (in non-verbose, native Javascript RegExp literal syntax). It performs a global (repetitive) replace using an anonymous function which processes the single and double quoted strings independently, and preserves all comments. Two additional functions: process_sq() and process_dq() perform the processing on the matched single and double quoted strings respectively:

function process_js(text) { 
    // Process single and double quoted strings outside comments.
    var re = /\/\*[^*]*\*+(?:[^*\/][^*]*\*+)*\/|('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|\/\/.*/g;
    return text.replace(re,
        function(m0, m1, m2){
            if (m1) return process_sq(m1);  // 'single-quoted'.
            if (m2) return process_dq(m2);  // "double-quoted".
            return m0;                      // preserve comments.
        });
}
function process_sq(text) {
    return text.replace(/\bsomestring\b/g, 'SOMESTRING_SQ');
}
function process_dq(text) {
    return text.replace(/\bsomestring\b/g, 'SOMESTRING_DQ');
}

Note that the two quoted-string handling functions merely replace the keyword somestring with SOMESTRING_SQ and SOMESTRING_DQ, so that the results of the processing will be evident. These functions are designed to be modified by the user as needed. Let's see how this performs with a string of Javascript (similar to the example provided in the OP):

Test data input:

// comment with  foo somestring bar  in it
// comment with "foo somestring bar" in it
// comment with 'foo somestring bar' in it
/* comment with  foo somestring bar  in it */
/* comment with "foo somestring bar" in it */
/* comment with 'foo somestring bar' in it */

document.write(" with  foo somestring bar  in it ");
document.write(' with  foo somestring bar  in it ');
document.write(' with "foo somestring bar" in it ');
document.write(" with 'foo somestring bar' in it ");

var str = " with  foo somestring bar  in it ";
var str = ' with  foo somestring bar  in it ';
var str = ' with "foo somestring bar" in it ';
var str = " with 'foo somestring bar' in it ";

Test data output from process_js() :

// comment with  foo somestring bar  in it
// comment with "foo somestring bar" in it
// comment with 'foo somestring bar' in it
/* comment with  foo somestring bar  in it */
/* comment with "foo somestring bar" in it */
/* comment with 'foo somestring bar' in it */

document.write(" with  foo SOMESTRING_DQ bar  in it ");
document.write(' with  foo SOMESTRING_SQ bar  in it ');
document.write(' with "foo SOMESTRING_SQ bar" in it ');
document.write(" with 'foo SOMESTRING_DQ bar' in it ");

var str = " with  foo SOMESTRING_DQ bar  in it ";
var str = ' with  foo SOMESTRING_SQ bar  in it ';
var str = ' with "foo SOMESTRING_SQ bar" in it ';
var str = " with 'foo SOMESTRING_DQ bar' in it ";

Notice that somestring has been processed only within valid Javascript strings and has been ignored within comments. For pure Javascript, this function works pretty darn good!
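For readers working outside the browser, the same one-pass approach ports directly to other regex flavors. The following is a rough Python sketch of the process_js() idea, using the same alternation in verbose mode; the name JS_TOKEN and the sample input are assumptions made for this illustration, and the caveats about pure-Javascript input still apply.

```python
import re

# Comments and quoted strings are matched in one pass so that quotes inside
# comments (and comment delimiters inside strings) cannot confuse the scan.
JS_TOKEN = re.compile(r"""
      /\*[^*]*\*+(?:[^*/][^*]*\*+)*/         # a multi-line comment, or
    | ('[^'\\]*(?:\\[\S\s][^'\\]*)*')        # group 1: single-quoted string, or
    | ("[^"\\]*(?:\\[\S\s][^"\\]*)*")        # group 2: double-quoted string, or
    | //.*                                   # a single-line comment
""", re.VERBOSE)

def process_js(text):
    def repl(m):
        if m.group(1):  # 'single-quoted'
            return re.sub(r"\bsomestring\b", "SOMESTRING_SQ", m.group(1))
        if m.group(2):  # "double-quoted"
            return re.sub(r"\bsomestring\b", "SOMESTRING_DQ", m.group(2))
        return m.group(0)  # comments are preserved unchanged
    return JS_TOKEN.sub(repl, text)

source = (
    "// note 'somestring'\n"
    "var a = 'x somestring y';\n"
    "var b = \"it's somestring\"; /* \"somestring\" */"
)
print(process_js(source))
```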

Picking Javascript out of HTML

Parsing Javascript from HTML using regex is not recommended (see caveats below). However, a reasonably good job can be done if you are comfortable using a complex regex, and are happy with its limitations (once again, see caveats below). That said, here are the requirements for our regex solution: In HTML, Javascript can occur inside <SCRIPT> elements, and within onclick event tag attributes (and all the other HTML 4.01 events: ondblclick, onmousedown, onmouseup, onmouseover, onmousemove, onmouseout, onkeypress, onkeydown and onkeyup). Javascript can also occur within javascript:-pseudo-URLs, but IMHO, that is really bad practice, so this solution does not attempt to match these. HTML is a complex language and our scraping regex needs to ignore comments and CDATA sections. A multi-global-alternative regex (similar to the previous one), matches each of these structures. Here is the "pluck-js-from-html" regex in commented, verbose mode (once again presented in PHP single quoted syntax):

$re = '%# (Unreliably) parse out javascript from HTML.
    # Either... Option 1: SCRIPT element.
      (<script\b[^>]*>)         # $1: SCRIPT open tag.
      ([\S\s]*?)                # $2: SCRIPT contents.
      (<\/script\s*>)           # $3: SCRIPT close tag.
    # Or... Options 2 and 3: onclick=quoted-js-code.
    | (                         # $4: onXXXX = 
        \bon                    # All HTML 4.01 events...
        (?:click|dblclick|mousedown|mouseup|mouseover|
           mousemove|mouseout|keypress|keydown|keyup
        )\s*=\s*                # with = and optional ws.
      )                         # End $4:
      (?:                       # value alternatives are either
        "([^"]*)"               #    $5: Double-quoted-js,
      | \'([^\']*)\'            # or $6: Single-quoted-js.
      )                         # End group of alternatives.
    # Or other HTML stuff that we should not mess with.
    | <!--[\S\s]*?-->           # HTML (non-SGML) comment.
    | <!\[CDATA\[[\S\s]*?\]\]>  # or CDATA section.
    %ix';

In this regex, we capture SCRIPT javascript content in groups: $1 (open tag), $2 (contents) and $3 (closing tag) and onXXX event handler code in groups: $4 (event attribute name), $5 (double-quoted value contents) and $6 (single-quoted value contents). Comments and CDATA sections are captured by the overall match (when none of the capture groups match). Note that this regex does not make use of the "unrolling-the-loop" technique (although it certainly could), because that would add too much complexity for most readers. (All three of the lazy-dot-star expressions i.e. [\S\s]*?, can be unrolled to speed this up.)

process_html()
The following Javascript function: process_html(), implements the above regex (in non-verbose, native Javascript RegExp literal syntax). It performs a global (repetitive) replace using an anonymous function which processes the three different sources of javascript code. It then calls the previously described process_js() function to process the captured js code. Here it is:

function process_html(text) {
    // Pick out javascript from HTML event attributes and SCRIPT elements.
    var re = /(<script\b[^>]*>)([\S\s]*?)(<\/script\s*>)|(\bon(?:click|dblclick|mousedown|mouseup|mouseover|mousemove|mouseout|keypress|keydown|keyup)\s*=\s*)(?:"([^"]*)"|'([^']*)')|<!--[\S\s]*?-->|<!\[CDATA\[[\S\s]*?\]\]>/g;
    // Regex to match <script> element
    return text.replace(re,
        function(m0, m1, m2, m3, m4, m5, m6) {
            if (m1) { // Case 1: <script> element.
                m2 = process_js(m2);
                return m1 + m2 + m3;
            }
            if (m4) { // Case 2: onXXX event attribute.
                if (m5) {  // Case 2a: double quoted.
                    m5 = process_js(m5);
                    return m4 + '"' + m5 + '"';
                }
                if (m6) {  // Case 2b: single quoted.
                    m6 = process_js(m6);
                    return m4 + "'" + m6 + "'";
                }
            }
            return m0; // Else return other non-js matches unchanged.
        });
}

Test data input:

<script>
/* comment with 'foo somestring bar' in it */
document.write(" with 'foo somestring bar' in it ");
var str = " with 'foo somestring bar' in it ";
</script>

<!-- with  foo somestring bar  in it -->
<!-- with "foo somestring bar" in it -->
<!-- with 'foo somestring bar' in it -->

<![CDATA[ with  foo somestring bar  in it ]]>
<![CDATA[ with "foo somestring bar" in it ]]>
<![CDATA[ with 'foo somestring bar' in it ]]>

<p>non-js with  foo somestring bar  in it non-js</p>
<p>non-js with "foo somestring bar" in it non-js</p>
<p>non-js with 'foo somestring bar' in it non-js</p>

<p onclick="with  foo somestring bar  in it">stuff</p>
<p onclick="with 'foo somestring bar' in it">stuff</p>
<p onclick='with  foo somestring bar  in it'>stuff</p>
<p onclick='with "foo somestring bar" in it'>stuff</p>

Test data output from process_html() :

<script>
/* comment with 'foo somestring bar' in it */
document.write(" with 'foo SOMESTRING_DQ bar' in it ");
var str = " with 'foo SOMESTRING_DQ bar' in it ";
</script>

<!-- with  foo somestring bar  in it -->
<!-- with "foo somestring bar" in it -->
<!-- with 'foo somestring bar' in it -->

<![CDATA[ with  foo somestring bar  in it ]]>
<![CDATA[ with "foo somestring bar" in it ]]>
<![CDATA[ with 'foo somestring bar' in it ]]>

<p>non-js with  foo somestring bar  in it non-js</p>
<p>non-js with "foo somestring bar" in it non-js</p>
<p>non-js with 'foo somestring bar' in it non-js</p>

<p onclick="with  foo somestring bar  in it">stuff</p>
<p onclick="with 'foo SOMESTRING_SQ bar' in it">stuff</p>
<p onclick='with  foo somestring bar  in it'>stuff</p>
<p onclick='with "foo SOMESTRING_DQ bar" in it'>stuff</p>

As you can see, this works pretty darn good and correctly modifies only quoted strings within javascript within HTML.

Caveats: To correctly and reliably extract Javascript from HTML, (i.e. parse it) you must use a parser. Although the above algorithm does a pretty decent job, there are certainly cases where it will fail. For example the following non-javascript code will be matched:

<p title="Title with onclick='fake code erroneously matched here!'">stuff</p>
<p title='onclick="alert('> and somestring here too </p><p title=');"'>stuff</p>
<p title='<script>alert("Bad medicine!");</script>'>stuff</p>

Phew!

I am trying to use a regular expression in C# to break up a string into up to 3 distinct parts: Left, Middle, Right. The expression pattern is built dynamically using input parameters to set the Left and Right quantifiers. In most cases where the quantifier is one or higher it works fine; however, if the left and right quantifiers are set to zero, the behavior is different on .NET 3.5 on Windows and on Mono 2.10.9 on Suse.

For example, using the following match string to test the string "1412":

^(?<left>.{0})(?<mid>.+)(?<right>.{0})

On Windows (.NET 3.5) the match groups show, as expected:

left:
mid:   1412
right:

On Suse (Mono 2.10.9) the match groups are:

left:   141
mid:    2
right:

Playing around with this, if I change the left and right quantifiers in the pattern to be non-greedy, I get the same (expected) result on both platforms:

^(?<left>.{0}?)(?<mid>.+)(?<right>.{0}?)

left:
mid:    1412
right:

While this seems to solve the problem, this section of code is of critical importance to our application so I would like to understand why the behavior is different in the original pattern.

There are lazy and greedy quantifiers in the regex, and as per Mastering Regular Expressions (taken from here),

In situations where the decision is between “make an attempt” and “skip an attempt,” as with items governed by quantifiers, the engine always chooses to first make the attempt for greedy quantifiers, and to first skip the attempt for lazy (non-greedy) ones.

For some reason, Mono's regex engine appears to make an attempt for the greedy .{0} anyway, whereas the .NET Framework applies the quantifier exactly as written, so the group matches nothing.
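As a point of comparison only, here is how another backtracking engine treats the two patterns. This sketch uses Python's re module, so it says nothing about Mono's internals; it merely shows that a {0} quantifier is expected to match the empty string whether it is written greedy or lazy. The helper name split_parts is made up for illustration.

```python
import re

greedy = re.compile(r"^(?P<left>.{0})(?P<mid>.+)(?P<right>.{0})")
lazy = re.compile(r"^(?P<left>.{0}?)(?P<mid>.+)(?P<right>.{0}?)")

def split_parts(pattern, text):
    """Return the named groups as a dict, or None when there is no match."""
    m = pattern.match(text)
    return m.groupdict() if m else None

print(split_parts(greedy, "1412"))  # {'left': '', 'mid': '1412', 'right': ''}
print(split_parts(lazy, "1412"))    # {'left': '', 'mid': '1412', 'right': ''}
```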

I don't know how to work with '(', ')', and '*' that can be in comment. Comments are multiline.

Regarding the handling of nested comments, although it is true that you cannot use a Java regex to match an outermost comment, you can craft one which will match an innermost comment (with some notable exceptions - see caveats below). (Note that the: \(\*(.*?)\*\) expression will NOT work in this case as it does not correctly match an innermost comment.) The following is a tested java program which uses a (heavily commented) regex which matches only innermost comments, and applies this in an iterative manner to correctly strip nested comments:

public class TEST {
    public static void main(String[] args) {
        String subjectString = "out1 (* c1 *) out2 (* c2 (* c3 *) c2 *) out3";
        String regex = "" +
            "# Match an innermost pascal '(*...*)' style comment.\n" +
            "\\(\\*      # Comment opening literal delimiter.\n" +
            "[^(*]*      # {normal*} Zero or more non'(', non-'*'.\n" +
            "(?:         # Begin {(special normal*)*} construct.\n" +
            "  (?!       # If we are not at the start of either...\n" +
            "    \\(\\*  # a nested comment\n" +
            "  | \\*\\)  # or the end of this comment,\n" +
            "  ) [(*]    # then ok to match a '(' or '*'.\n" +
            "  [^(*]*    # more {normal*}.\n" +
            ")*          # end {(special normal*)*} construct.\n" +
            "\\*\\)      # Comment closing literal delimiter.";
        String resultString = subjectString; // Start from the input so text without comments is returned unchanged.
        java.util.regex.Pattern p = java.util.regex.Pattern.compile(
                    regex,
                    java.util.regex.Pattern.COMMENTS);
        java.util.regex.Matcher m = p.matcher(subjectString);
        while (m.find())
        { // Iterate until there are no more "(* comments *)".
            resultString = m.replaceAll("");
            m = p.matcher(resultString);
        }
        System.out.println(resultString);
    }
}

Here is the short version of the regex (in native regex format):

\(\*[^(*]*(?:(?!\(\*|\*\))[(*][^(*]*)*\*\)

Note that this regex implements Jeffrey Friedl's "Unrolling-the-loop" efficiency technique and is quite fast. (See: Mastering Regular Expressions (3rd Edition)).

Caveats: This will certainly NOT work correctly if any comment delimiter (i.e. (* or *)) appears within a string literal and thus, should NOT be used for general parsing. But a regex like this one is handy to use from time to time - for quick and dirty searching within an editor for example.

See also my answer to a similar question for someone wanting to handle nested C-style comments.
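The iterative "strip innermost comments until none remain" idea is not specific to Java. As a rough sketch, here is the same short-form regex in Python; the names INNERMOST and strip_comments are made up for this illustration, and the caveat about comment delimiters inside string literals applies here too.

```python
import re

# Matches only an innermost (* ... *) comment: the body may contain '(' or '*'
# as long as they do not begin a nested comment or end the current one.
INNERMOST = re.compile(r"\(\*[^(*]*(?:(?!\(\*|\*\))[(*][^(*]*)*\*\)")

def strip_comments(text):
    """Repeatedly remove innermost comments until no comment is left."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            return text

print(strip_comments("out1 (* c1 *) out2 (* c2 (* c3 *) c2 *) out3"))
```

Each pass peels off one nesting level, so the loop runs once per level of nesting plus one final pass that finds nothing.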

I have a system that only accepts regular expressions for matching. Since I've never written one before, I went online to look up some tutorials but got really confused, so I'm asking here.

The regular expression needs to match the following:

File.f
File-1.f

In both cases it has to return what comes before the . (or before the - in the second case), i.e. File.

I appreciate the help.

Thanks

I don't know what language you're using, but they all work mostly the same. In C# we would do something like the following:

List<string> files = new List<string>() {
    "File.f",
    "File-1.f"
};
Regex r = new Regex(@"^(?<name>[^\.\-]+)");
foreach(string file in files) {
    Match m = r.Match(file);
    Console.WriteLine(m.Groups["name"]);
}

The named group allows you to easily extract the prefix that you are seeking. The above prints

File
File

on the console.

I strongly encourage you to pick up the book Mastering Regular Expressions. Every programmer should be comfortable with regular expressions and Friedl's book is by far the best on the subject. It has chapters pertinent to Perl, Java, .NET and PHP, depending on your language choice.

I have a password field which should match the following conditions

  1. should have at least one letter and one digit
  2. should have minimum length of 5 and maximum length of 20

Also, I would like to know if regular expressions are the same for all languages?

Links to good tutorials to get started with Regular Expressions (assuming I have just a basic understanding) would also be nice.

Thank you

Regular expressions are a very good tool for password validation. Multiple rules can be applied using lookahead assertions (which work using AND logic) applied from the beginning of the string like so:

$re = '/
    # Match password with 5-20 chars with letters and digits
    ^                # Anchor to start of string.
    (?=.*?[A-Za-z])  # Assert there is at least one letter, AND
    (?=.*?[0-9])     # Assert there is at least one digit, AND
    (?=.{5,20}\z)    # Assert the length is from 5 to 20 chars.
    /x';
if (preg_match($re, $text)) {
    // Good password
}

Here's the Javascript equivalent:

var re = /^(?=.*?[A-Za-z])(?=.*?[0-9])(?=.{5,20}$)/;
if (re.test(text)) {
  // Good password
}

A good article regarding password validation using regex is: Password Strength Validation with Regular Expressions. (Although his final expressions include an erroneous dot-star at the beginning - see my comment on his blog).

Also note that regex syntax does vary from language to language (but most are converging on the Perl syntax). If you really want to know regex (in the Neo: "I know Kung-Fu" sense), then there is no better way than to sit down and read: Mastering Regular Expressions (3rd Edition) By Jeffrey Friedl.

Additional: A good argument can be made that password validation should be split up into multiple tests which allows the code to give specific error messages for each type of validation error. The answer provided here is meant to demonstrate one correct way to validate multiple rules using just one regular expression.
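A minimal sketch of that split-up approach, written in Python for illustration (the function name password_errors and the message texts are made up). Each rule gets its own test, so the caller can report exactly which rule failed; the single-regex form is included alongside for comparison.

```python
import re

# Single-regex form, equivalent in spirit to the lookahead pattern above.
COMBINED = re.compile(r"^(?=.*?[A-Za-z])(?=.*?[0-9])(?=.{5,20}\Z)")

def password_errors(password):
    """Return a list of human-readable rule violations (empty if valid)."""
    errors = []
    if not 5 <= len(password) <= 20:
        errors.append("length must be between 5 and 20 characters")
    if not re.search(r"[A-Za-z]", password):
        errors.append("must contain at least one letter")
    if not re.search(r"[0-9]", password):
        errors.append("must contain at least one digit")
    return errors

print(password_errors("abc12"))  # []
print(password_errors("abcd"))   # two messages: length and missing digit
```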

Happy regexing!

I have a regular expression which parses a (very small) subset of the Razor template language. Recently, I added a few more rules to the regex which dramatically slowed its execution. I'm wondering: are there certain regex constructs that are known to be slow? Is there a restructuring of the pattern I'm using that would maintain readability and yet improve performance? Note: I've confirmed that this performance hit occurs post-compilation.

Here's the pattern:

new Regex(
              @"  (?<escape> \@\@ )"
            + @"| (?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )"
            + @"| (?<using> \@using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"

            // captures expressions of the form "foreach (var [var] in [expression]) { <text>" 
/* ---> */      + @"| (?<foreach> \@foreach \s* \( \s* var \s+ (?<var> \w+ ) \s+ in \s+ (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"

            // captures expressions of the form "if ([expression]) { <text>" 
/* ---> */      + @"| (?<if> \@if \s* \( \s* (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"  

            // captures the close of a razor text block
            + @"| (?<endBlock> </text> \s* \} )"

            // an expression of the form @([(int)] a.b.c)
            + @"| (?<parenAtExpression> \@\( \s* (?<castToInt> \(int\)\s* )? (?<expressionValue> [\w\.]+ ) \s* \) )"
            + @"| (?<atExpression> \@ (?<expressionValue> [\w\.]+ ) )"
/* ---> */      + @"| (?<literal> ([^\@<]+|[^\@]) )",
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);

/* ---> */ indicates the new "rules" that caused the slowdown.

As others have mentioned, you can improve the readability by removing unnecessary escapes (such as escaping @ or escaping characters aside from \ inside a character class; for example, using [^*] instead of [^\*]).

Here are some ideas for improving performance:

Order your different alternatives so that the most likely ones come first.

The regex engine will attempt to match each alternative in the order that they appear in the regex. If you put the ones that are more likely up front, then the engine will not have to waste time attempting to match against unlikely alternatives for the majority of cases.

Remove unnecessary backtracking

Note the ending of your "using" alternative: @"| (?<using> \@using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"

If for some reason you have a large amount of whitespace, but no closing ; at the end of a using line, the regex engine must backtrack through each whitespace character until it finally decides that it can't match (\s*;). In your case, (\s*;)? can be replaced with \s*;? to prevent backtracking in these scenarios.

In addition, you could use atomic groups (?>...) to prevent backtracking through quantifiers (e.g. * and +). This really helps improve performance when you don't find a match. For example, your "foreach" alternative contains \s* \( \s*. If you find the text "foreach var...", the "foreach" alternative will greedily match all of the whitespace after foreach, and then fail when it doesn't find an opening (. It will then backtrack, one whitespace-character at a time, and try to match ( at the previous position until it confirms that it cannot match that line. Using an atomic group (?>\s*)\( will cause the regex engine to not backtrack through \s* if it matches, allowing the regex to fail more quickly.

Be careful when using them though, as they can cause unintended failures when used in the wrong place (for instance, (?>.*); will never match anything, because the greedy .* matches all characters (including ;), and the atomic grouping (?>...) prevents the regex engine from backtracking one character to match the ending ;).

"Unroll the loop" on some of your alternatives, such as your "comment" alternative (also useful if you plan on adding an alternative for strings).

For example: @"| (?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )"

Could be replaced with @"| (?<comment> @\* [^*]* (\*+[^*@][^*]*)* \*+@ )"

The new regex boils down to:

  1. @\*: Find the beginning of a comment @*
  2. [^*]*: Read all "normal characters" (anything that's not a *, because that could signify the end of the comment)
  3. (\*+[^*@][^*]*)*: Include any non-terminal * inside the comment
    • (\*+[^*@]: If we find a *, ensure that the run of *s is not followed by a @ (excluding * here as well keeps the loop from splitting a run such as **@ and reading past the real end of the comment)
    • [^*]*: Go back to reading all "normal characters"
    • )*: Loop back to the beginning if we find another *
  4. \*+@: Finally, grab the end of the comment *@, being careful to include any extra *
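The steps above can be tried out with a small Python sketch. It is only an illustration under a few assumptions: the sample text is invented, the group is made non-capturing so findall returns whole comments, and the "special" case is written as [^*@] so that a run of stars directly before the closing @ cannot be consumed by the loop.

```python
import re

comment = re.compile(r"""
    @\*                 # opening delimiter @*
    [^*]*               # normal characters: anything that is not a *
    (?:                 # loop over stars that do not end the comment
      \*+ [^*@]         #   a run of stars followed by neither * nor @
      [^*]*             #   back to normal characters
    )*
    \*+ @               # closing delimiter, including any extra stars
""", re.VERBOSE)

sample = "before @* a * b **@ middle @* second *@ after"
print(comment.findall(sample))
```

Each comment is matched separately, and the text between the two comments is left alone.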

You can find many more ideas for improving the performance of your regular expressions from Jeffrey Friedl's Mastering Regular Expressions (3rd Edition).

I have some text like this, it's written in a custom markdown style format. For example:

[Lorem ipsum] 
Dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. 

[Ut wisi] 
[Enim ad minim veniam](a), quis nostrud exerci tation ullamcorper. 
suscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat. 
Vel illum dolore eu feugiat nulla facilisis at vero.
[Ros et accumsan et iusto odio dignissim](b) qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. 

[[Nam liber]](c)
Tempor cum soluta nobis eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum.

As you can see there are square brackets surrounding headings, and there are square brackets followed by parentheses containing a letter, which is what I am trying to match with a regex. The regex I'm trying to use is this:

preg_match_all("#\[(.*?)\]\(([a-z]+)\)#is",$html,$matches)

The problem with this ^ one is it matches from [Lorem ipsum] down to the end of (a).

I could also use the following, however I need to be able to include headings with their square brackets so this doesn't work correctly:

preg_match_all("#\[([^]]+)\]\(([a-z]+)\)#is",$html,$matches)

After some reading up, I suspect what I need is a lookahead, however I've not been able to get my head around them. Any help much appreciated.


Clarification

I'm basically looking to be able to wrap any part of some text with the square brackets/parenthesis combination and then be able to match them with regex without existing square brackets anywhere causing conflicts. Example text:

[[Lorem ipsum]](a)
Dolor sit amet, [consectetuer adipiscing elit](b), sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. 

Desired matches:

[[Lorem ipsum]](a)
[consectetuer adipiscing elit](b)

Or... more complex

[[Lorem ipsum]
Dolor sit amet, sed diam nonummy nibh euismod](a) tincidunt ut laoreet dolore magna aliquam erat volutpat. 

Desired match:

[[Lorem ipsum]
Dolor sit amet, sed diam nonummy nibh euismod](a)

Is it possible?

m.buettner's answer is excellent. It is both accurate and well documented (it got my up-vote and deserves to remain the selected answer). I really like the fact that the regex is self documented in free-spacing mode. However, for the sake of completeness, (and as a demonstration of another commenting style) here is an equivalent (but slightly more efficient) regex solution:

preg_match_all('/
    # Match a "[...[...]...[...]...](...)" structure.
    \[               # Literal open square bracket.
    (                # $1: Square bracket contents.
      [^[\]]*        # {normal*} Zero or more non-[].
      (?:            # Begin {(special normal*)*}.
        \[[^[\]]*\]  # {special} Nested matching [] pair.
        [^[\]]*      # More {normal*} Zero or more non-[].
      )*             # End {(special normal*)*}.
    )                # $1: Square bracket contents.
    \]               # Literal close square bracket.
    (?:              # Optional matching parentheses.
      \(             # Literal open parentheses.
      ([A-Za-z]+)    # $2: Parentheses contents.
      \)             # Literal close parentheses.
    )?               # Optional matching parentheses.
    /x',
    $input,
    $matches);

Improvements (mostly cosmetic/stylistic):

  • The regex is enclosed within 'single quotes' rather than "double quotes". With PHP, there is an extra level of interpretation with double quoted strings and there are many more character escape sequences to be dealt with (the "$" character in particular can cause mischief). Bottom line: with PHP, it's best to enclose regex patterns within single quoted strings (i.e. less backslash soup).
  • The expression logic which matches the [nested [square bracket] structure] was re-arranged to implement Friedl's "Unrolling-the-Loop" efficiency technique. This results in less backtracking for the case where the outer square brackets have no nested square brackets.
  • The capture groups' open and close parentheses (which span more than one line) are indented to the same level (i.e. are vertically aligned) to ease visually matching.
  • The capture group number is included in the comments on the lines with the open and close parentheses.
  • The s single line modifier is removed (no need - there are no dots!).
  • The i ignore case modifier is removed and the affected character class [a-z] was changed to [A-Za-z] to compensate. (Some regex engines run a wee bit faster when in case sensitive mode.)
  • The literal "]" closing square bracket metacharacter is explicitly escaped, i.e. to: "\]". (although this is not required, it is good practice IMHO).
  • Capture group $2 is consolidated onto one line.
  • A full width header comment is added at the top of the regex describing the overall regex purpose.
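For anyone who wants to poke at the pattern outside PHP, here is a sketch of the same expression ported to Python's verbose mode. The port (and the escaping of [ inside the character classes) is my own assumption about how to carry it over; the sample text comes from the question.

```python
import re

bracket_link = re.compile(r"""
    \[                        # literal open square bracket
    (                         # group 1: square bracket contents
      [^\[\]]*                #   {normal*} zero or more non-[]
      (?:                     #   begin {(special normal*)*}
        \[ [^\[\]]* \]        #     {special} nested matching [] pair
        [^\[\]]*              #     more {normal*}
      )*                      #   end {(special normal*)*}
    )
    \]                        # literal close square bracket
    (?: \( ([A-Za-z]+) \) )?  # group 2: optional parentheses contents
""", re.VERBOSE)

text = ("[[Lorem ipsum]](a)\n"
        "Dolor sit amet, [consectetuer adipiscing elit](b), sed diam")
pairs = [(m.group(1), m.group(2)) for m in bracket_link.finditer(text)]
print(pairs)
```

Because the parentheses part is optional, a plain heading such as [Ut wisi] also matches, with group 2 left unset.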

I know that I really need to read one of these books (1, 2) to learn regular expressions but in the meantime I have a small question for the people that already have the knowledge. I want to write a snippet for sublime text which leaves the inner spaces for parentheses if I start typing but deletes everything inside the parentheses if I delete the selection.

Triggered:

( ${1:anything could be typed here} )

Typed:

( I_wrote_that )

Deleted:

()

I do not ask for someone to write it for me, but a clear explanation on conditional regular expressions would be much appreciated. Thanks !

NB: I am referring to the conditional syntax in regular expressions. NB2: Here is an example with a C/C++ printf.

Snippet:

printf( "${1:%s}\\n" ${1/([^%]|%%)*(%.)?.*/(?2:,:\);)/} $2 ${1/([^%]|%%)*(%.)?.*/(?2:\);))/}

Gives:

printf( "%s\n" ,  );

Or:

printf( "\n" );  

As Qtax already showed you can use conditionals in regexp in such way:

(?(condition)then|else)

or

(?(?=pattern)then|else)
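As a small illustration of the first form, here is a sketch in Python. Note the assumption: Python's re module supports only the group-reference condition, (?(1)then|else), and not a lookaround as the condition, so that is the form shown. The example (optional surrounding parentheses around a number) is invented and is not Sublime Text snippet syntax.

```python
import re

# (\()?       group 1 optionally captures an opening parenthesis
# \d+         the payload
# (?(1)\))    if group 1 took part in the match, require a closing parenthesis
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

for candidate in ["123", "(123)", "(123", "123)"]:
    print(candidate, bool(balanced.match(candidate)))
```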

Regular expressions are a way to find patterns and similarities in input, not a way to express logic (otherwise, I guess, they would be called logical expressions too). If your program needs to put logic into the regexp itself, that may be an early warning sign of a design flaw.

UPDATE+

Also, I don't understand

Anything could be typed inside the parentheses. But if the content is deleted, we also remove the inner padding. – Athanase

Are you talking about dynamic regexp? It looks like you need some event-driven regexp or command line which will analyze your regexp while you're typing.

It's also possible you're talking about Sublime Text features rather than pure regexp (which can also vary depending on the implementation).

I'm having trouble with regex and have never fully understood it. The real question is: does anybody know a good site that explains what each individual expression does, instead of just posting stuff like

$regexp = "/^[^0-9][A-z0-9_]+([.][A-z0-9_]+)*[@][A-z0-9_]+([.][A-z0-9_]+)*[.][A-z]{2,4}$/";

then prattling off what that line as a whole will do, rather than what each expression will do. I've tried googling many different versions of preg_replace and regex tutorial but they all seem to assume that we already know what stuff like \b[^>]* will do.

Secondary. The reason i am trying to do this: i want to turn

<span style="color: #000000">*ANY NUMBER*</span>

into

<span style="color: #0000ff">*ANY NUMBER*</span>

A few variations that I have already tried; some just didn't work, and some make the script crap out.

$data = preg_replace("/<span style=\"color: #000000\">([0-9])</span>/", "<span style=\"color: #FFCC00\">$1</span>", $data);//just tried to match atleast 0-9

$data = preg_replace("/<span style=\"color: #000000\"\b[^>]*>(.*?)</span>/", "<span style=\"color: #FFCC00\">$1</span>", $data);

$data = preg_replace("/<span style=\"color: #000000\"\b[^>]*>([0-9])</span>/", "<span style=\"color: #FFCC00\">$1</span>", $data);

The answer to this specific problem is not nearly as important to me as a site, so the check goes to that. I've tried a lot of different sites and I am pretty sure it's not above my comprehension; I just cannot find a good one among all the bad tutorial/example farms. The normal fallbacks of w3 and phpdotnet don't have what I need this time.

EDIT1 For those of you who end up in here looking for a similar answer:

$data = preg_replace("/<span style=\"color: #000000\">([0-9]{1,})<\/span>/", "<span style=\"color: #FFCC00\">$1</span>", $data);

Did what it needed to. Sadly it was one of the first things I tried, but because I didn't put <\/span> instead of </span> it was not working. I do not know if "[0-9]{1,}" is the MOST appropriate way of matching any number (it tells the engine to match any digit 0-9 with [0-9], at least once and as many times as it can with {1,}), but it still fit the purpose.
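For comparison, here is a minimal sketch of the same replacement in Python rather than PHP. The sample markup is invented, and the target color is the #0000ff from the original goal. Python patterns have no /.../ delimiters, so the slash in the closing tag needs no escaping there.

```python
import re

data = '<td><span style="color: #000000">12345</span></td>'

# [0-9]+ means the same as [0-9]{1,}: one or more digits.
pattern = re.compile(r'<span style="color: #000000">([0-9]+)</span>')

# \1 in the replacement re-inserts the captured digits.
recolored = pattern.sub(r'<span style="color: #0000ff">\1</span>', data)
print(recolored)
```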

ROY Finley posted: http://www.macronimous.com/resources/writing_regular_expression_with_php.asp It's a good site with a list of expression definitions and a good example workup below.

Also: regular-expressions.info/tutorial.html was posted a few times. It's a slower, more in-depth walkthrough, but if you are stuck like I am it's good.

Will pop in about regex101 and the parsers after I have a chance to play with them.

EDIT2 DWright posted a book link below, "Mastering Regular Expressions". If you look at regex and cannot make heads or tails of the convolution of characters, it is DEFINITELY worth picking up. It took about an hour and a half to read about half, but that is no time compared to the hours spent on Google and the messed-up workarounds used to avoid it.

Also the html parse linked below would be right for this particular problem.

Try this website, perhaps. Personally, I'd say if you are really interested in regexes, it'd be worth getting a book like this one.

I'm trying to match expressions contained within [%___%] in a string, before // (comments), excluding any // that is in quotations (inside a string)
so for example
[%tag%] = "a" + "//" + [%tag2%]; //[%tag3%]
should match [%tag%] and [%tag2%]

The closest I can get is ^(?:(?:\[%([^%\]\[]*)%\])|[^"]|"[^"]*")*?(?://)

So the problems I'm having are that this doesn't match any strings which don't end in //
In fact, it aggregates lines until it can conclude in one that contains //
I've tried to remedy this problem with ?.*?$ at the end, to signify that // is not necessary and to go to the first endline, but it doesn't really work.

And Secondly, it only captures the second tag. This isn't because of the "//" since even with [%1%] [%2%] it won't capture the first

I'm using C# and Regex.Matches with the RegexOptions.Multiline option and this is my escaped string

"^(?:(?:\\[%([^%\\]\\[]*)%\\])|[^\"]|\"[^\"]*\")*?(?://)"

First off, let me just say that I love regexes. I read Friedl's Mastering Regular Expressions years ago and never looked back. That being said, do not use one giant regex to solve this problem. Use your programming language. You'll end up with more readable and maintainable code. It looks like you're trying to parse a language here where different rules apply in different contexts. Your pattern could appear in a quoted string. Quoted strings might have quotes inside them which need to be escaped. Capturing all the subtleties in one regex would be a nightmare. I recommend iterating through the string character by character, building tokens along the way, looking for the quotes, and keeping track of whether or not you're in a quoted string. When you encounter a token that matches your criteria (you can use a regex for this part), and you're not within a string, add it to your list. When you hit the end of a statement and encounter the beginning of a comment, discard the remaining characters until the end of the comment.
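Here is a rough sketch of that character-by-character approach. It is written in Python rather than C# just to keep it short, and it makes assumptions the question does not state: strings are delimited by double quotes, a backslash escapes the next character inside a string, and a comment runs from // to the end of the line. The function name is hypothetical.

```python
import re

# A small regex is still handy for recognising a single tag token.
TAG = re.compile(r"\[%([^%\[\]]*)%\]")

def tags_before_comment(line):
    """Collect [%...%] tags that are outside quoted strings and before //."""
    tags = []
    i = 0
    in_string = False
    while i < len(line):
        ch = line[i]
        if in_string:
            if ch == "\\":          # assumed escape convention: skip next char
                i += 2
                continue
            if ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif line.startswith("//", i):
            break                   # comment starts: ignore the rest
        else:
            m = TAG.match(line, i)
            if m:
                tags.append(m.group(0))
                i = m.end()
                continue
        i += 1
    return tags

print(tags_before_comment('[%tag%] = "a" + "//" + [%tag2%]; //[%tag3%]'))
```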

I'm trying to parse a fairly complicated, but structured file using c++.

011 FistName MiddleName LastName age(int) date(4/6/2001) position status ...
012 FistName MiddleName LastName age(int) date(4/6/2001) position status ...
...

That's what the file format looks like. I'm trying to store the data as individual field of a struct but the first middle last name are of variable size and may not have the middle name in them, so how would you distinguish that?

For example,

014 Jon Smith ...
015 Jon J Smith, Jr. ...

I want to store the whole name in a name field rather than separate them. Say we have

struct{
    std::string name;
    int id;
    int age;
    std::string position;
    ...

}

How would i go about parsing everything?

For your purposes, if you're using C++11, you could adapt the std::regex match example to accomplish what you want.

If you're not, you should use boost::regex to accomplish what you want.

Here's an example of a regular expression you could use:

^\d+ (\w+) ?(\w*) (\w+),? ?(\w+\.)? age\((\d+)\) date\((\d\/\d\/\d+)\) (\w+) (\w+)

To find out what that regular expression means and how it matches things, check out this link.

To learn more about regular expressions, I'd highly recommend this book by Jeffrey Friedl.

It would match the following:

014 Jon Smith age(32) date(4/6/2001) position status
014 Jon J Smith, Jr. age(16) date(4/6/2001) position status
015 FistName MiddleName LastName, Title. age(45) date(4/6/2001) position status
016 FistName MiddleName LastName age(7) date(4/6/2001) position status
039 FistName MiddleName LastName age(100) date(4/6/2001) position status
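To show how the captured groups could be turned into the struct from the question, here is a sketch in Python rather than C++ (the group handling carries over to std::regex or boost::regex in spirit). Two things are my own additions and should be treated as assumptions: the leading \d+ is wrapped in a group so the id is captured, and the name parts are simply joined with spaces, which drops the comma before a suffix.

```python
import re
from dataclasses import dataclass

@dataclass
class Record:
    id: int
    name: str
    age: int
    date: str
    position: str
    status: str

# Same expression as above, plus a capture group around the id.
# (Python needs no backslash before "/" in the date part.)
LINE = re.compile(
    r"^(\d+) (\w+) ?(\w*) (\w+),? ?(\w+\.)? "
    r"age\((\d+)\) date\((\d/\d/\d+)\) (\w+) (\w+)"
)

def parse_line(line):
    m = LINE.match(line)
    if m is None:
        return None
    rec_id, first, middle, last, suffix, age, date, position, status = m.groups()
    # Skip the parts that are empty or did not take part in the match.
    name = " ".join(part for part in (first, middle, last, suffix) if part)
    return Record(int(rec_id), name, int(age), date, position, status)

print(parse_line("015 Jon J Smith, Jr. age(16) date(4/6/2001) position status"))
```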

I need a regex or something to remove this kind of comments.

/*!
 * Foo Bar
 */

I tried with /(\/*!.**\/)/m but fails. Any Suggestion?

To do it accurately and efficiently, there is a better regex:

regexp = /\/\*![^*]*\*+(?:[^*\/][^*]*\*+)*\//
result = subject.gsub(regexp, '')

Jeffrey Friedl covers this specific problem at length (using C-comments as an example) in his classic work: Mastering Regular Expressions (3rd Edition). Here is a breakdown of the regex which illustrates the "Unrolling-the-Loop" efficiency technique.

regexp_long = / # Match she-bang style C-comment
    \/\*!       # Opening delimiter.
    [^*]*\*+    # {normal*} Zero or more non-*, one or more *
    (?:         # Begin {(special normal*)*} construct.
      [^*\/]    # {special} a non-*, non-\/ following star.
      [^*]*\*+  # More {normal*}
    )*          # Finish "Unrolling-the-Loop"
    \/          # Closing delimiter.
    /x
result = subject.gsub(regexp_long, '')

Note that this regex does not need Ruby's 'm' dot-matches-all modifier because it does not use the dot!

Additional: So how much more efficient is this regex over the simpler /\/\*!.*?\*\//m expression? Well, using the RegexBuddy debugger, I measured how many steps each regex took to match a comment. Here are the results for both matching and non-matching: (For the non-matching case I simply removed the last / from the comment)

/*!
 * This is the example comment
 * Foo Bar Foo Bar Foo Bar Foo Bar Foo Bar Foo Bar Foo Bar Foo Bar
 * Foo Bar Foo Bar Foo Bar Foo Bar Foo Bar Foo Bar Foo Bar Foo Bar
 * Foo Bar Foo Bar Foo Bar Foo Bar Foo Bar Foo Bar Foo Bar Foo Bar
 */

REGEX                        STEPS TO: MATCH  NON-MATCH
/\/\*!.*?\*\//m                        488      491
/\/\*![^*]*\*+(?:[^*\/][^*]*\*+)*\//    23       29

As you can see, the lazy-dot solution (which must backtrack once for each and every character in the comment) is much less efficient. Note also that the efficiency difference is even more pronounced with longer and longer comments.

CAVEAT Note that this regex will fail if the opening delimiter occurs inside a literal string, e.g. "This string has a /*! in it!". To do this correctly with 100% accuracy, you will need to fully parse the script.
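If you want to check that the two expressions remove the same text, here is a sketch in Python instead of Ruby (re.DOTALL plays the role of Ruby's 'm' modifier for the lazy-dot version). The sample script text is invented, and the sketch only compares results; it does not measure steps.

```python
import re

unrolled = re.compile(r"/\*![^*]*\*+(?:[^*/][^*]*\*+)*/")
lazy_dot = re.compile(r"/\*!.*?\*/", re.DOTALL)

subject = "/*!\n * Foo Bar\n */\nvar x = 1; /* kept */\n"

# Only the /*! ... */ comment is removed; the ordinary comment stays.
print(repr(unrolled.sub("", subject)))
print(unrolled.sub("", subject) == lazy_dot.sub("", subject))
```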

I want to ensure that the user input doesn't contain characters like <, > or &#, whether it is text input or textarea. My pattern:

var pattern = /^((?!&#|<|>).)*$/m;

The problem is, that it still matches multiline strings from a textarea like

this text matches

though this should not, because of this character <

EDIT:

To be more clear, I need exclude &# combination only, not & or #.

Please suggest the solution. Very grateful.

Alternate answer to specific question:

anubhava's solution works accurately, but is slow because it must perform a negative lookahead at each and every character position in the string. A simpler approach is to use reverse logic, i.e. instead of verifying that /^((?!&#|<|>)[\s\S])*$/ does match, verify that /[<>]|&#/ does NOT match. To illustrate this, let's create a function, hasSpecial(), which tests if a string has one of the special chars. Here are two versions; the first uses anubhava's second regex:

function hasSpecial_1(text) {
    // If regex matches, then string does NOT contain special chars.
    return /^((?!&#|<|>)[\s\S])*$/.test(text) ? false : true;
}
function hasSpecial_2(text) {
    // If regex matches, then string contains (at least) one special char.
    return /[<>]|&#/.test(text) ? true : false;
}

These two functions are functionally equivalent, but the second one is probably quite a bit faster.
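The reverse-logic idea is not specific to JavaScript. As a sketch, the same check in Python looks like this (the function name and sample strings are just for illustration):

```python
import re

SPECIAL = re.compile(r"[<>]|&#")

def has_special(text):
    # search() scans the whole string, newlines included, and stops at the
    # first offending sequence it finds.
    return SPECIAL.search(text) is not None

print(has_special("this text matches"))
print(has_special("this text matches\nthough this should not, because of this character <"))
```

A lone & or # is still accepted; only the &# combination and the two angle brackets are rejected.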

Note that when I originally read this question, I misinterpreted it to really want to exclude HTML special chars (including HTML entities). If that were the case, then the following solution will do just that.

Test if a string contains HTML special Chars:

It appears that the OP wants to ensure a string does not contain any special HTML characters including: <, >, as well as decimal and hex HTML entities such as: &#160;, &#xA0;, etc. If this is the case then the solution should probably also exclude the other (named) type of HTML entities such as: &amp;, &lt;, etc. The solution below excludes all three forms of HTML entities as well as the <> tag delimiters.

Here are two approaches: (Note that both approaches do allow the sequence: &# if it is not part of a valid HTML entity.)

FALSE test using positive regex:

function hasHtmlSpecial_1(text) {
    /* Commented regex:
        # Match string having no special HTML chars.
        ^                  # Anchor to start of string.
        [^<>&]*            # Zero or more non-[<>&] (normal*).
        (?:                # Unroll the loop. ((special normal*)*)
          &                # Allow a & but only if
          (?!              # not an HTML entity (3 valid types).
            (?:            # One from 3 types of HTML entities.
              [a-z\d]+     # either a named entity,
            | \#\d+        # or a decimal entity,
            | \#x[a-f\d]+  # or a hex entity.
            )              # End group of HTML entity types.
            ;              # All entities end with ";".
          )                # End negative lookahead.
          [^<>&]*          # More (normal*).
        )*                 # End unroll the loop.
        $                  # Anchor to end of string.
    */
    var re = /^[^<>&]*(?:&(?!(?:[a-z\d]+|#\d+|#x[a-f\d]+);)[^<>&]*)*$/i;
    // If regex matches, then string does NOT contain HTML special chars.
    return re.test(text) ? false : true;
}

Note that the above regex utilizes Jeffrey Friedl's "Unrolling-the-Loop" efficiency technique and will run very quickly for both matching and non-matching cases. (See his regex masterpiece: Mastering Regular Expressions (3rd Edition))

TRUE test using negative regex:

function hasHtmlSpecial_2(text) {
    /* Commented regex:
        # Match string having one special HTML char.
          [<>]           # Either a tag delimiter
        | &              # or a & if start of
          (?:            # one of 3 types of HTML entities.
            [a-z\d]+     # either a named entity,
          | \#\d+        # or a decimal entity,
          | \#x[a-f\d]+  # or a hex entity.
          )              # End group of HTML entity types.
          ;              # All entities end with ";".
    */
    var re = /[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);/i;
    // If regex matches, then string contains (at least) one special HTML char.
    return re.test(text) ? true : false;
}

Note also that I have included a commented version of each of these (non-trivial) regexes in the form of a JavaScript comment.
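As a quick sanity check that the positive and negative formulations agree, here is a sketch that ports both regexes to Python and runs them over a handful of invented sample strings. The helper names are hypothetical.

```python
import re

NO_SPECIAL = re.compile(
    r"^[^<>&]*(?:&(?!(?:[a-z\d]+|#\d+|#x[a-f\d]+);)[^<>&]*)*$",
    re.IGNORECASE)
ANY_SPECIAL = re.compile(
    r"[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);",
    re.IGNORECASE)

def has_html_special_1(text):
    # If the regex matches, the string does NOT contain HTML special chars.
    return NO_SPECIAL.search(text) is None

def has_html_special_2(text):
    # If the regex matches, the string contains at least one special char.
    return ANY_SPECIAL.search(text) is not None

samples = ["plain text", "a & b", "a &amp; b", "&#160;", "&#xA0;", "&# alone", "<b>"]
for s in samples:
    print(repr(s), has_html_special_1(s), has_html_special_2(s))
```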

my question is how would i be able to go through a string and take out only the links and erase all the rest? I thought about using some type of delimiter, but wouldn't know how to go about using it in Java. an example of what i am trying to do:

this is my String:

String myString = "The file is http: // www.   .com/hello.txt and the second file is "
                     + "http: // www.   .com/hello2.dat";

I would want the output to be:

"http: // www.   .com/hello.txt http: // www.   .com/hello2.dat"

or each could be added to an array, separately. I just want some ideas, id like to write the code myself but am having trouble on how to do it. Any help would be awesome.

Regular Expressions, or Regex, is built for this kind of work. It is like another mini-language to learn. The best book out there some would say is Mastering Regular Expressions

The Javadoc for Pattern and Matcher can only serve as a reference. It completely ignores the subtleties involved in regex.
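To give an idea of what such a regex could look like, here is a minimal sketch. It is in Python so the pattern can be tried quickly; in Java the same pattern text would go through Pattern and Matcher. The URLs are placeholders on example.com (the ones in the question are blanked out), and "https?://\S+" is a deliberately naive assumption about what counts as a link.

```python
import re

my_string = ("The file is http://www.example.com/hello.txt and the second "
             "file is http://www.example.com/hello2.dat")

# A scheme, "://", then everything up to the next whitespace character.
links = re.findall(r"https?://\S+", my_string)

print(" ".join(links))
```

The list in links can be used directly if you would rather keep each link separately.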

So it's easy to match for at least one character or to match a character of a certain length in a string, but I'm trying to match at least one character in a string of a certain length. Another way to think of it would be the negation of having no character x for a sequence of N letters, which I think would just be /^[^x]{N}$/. I'm using this in a query, and though could use the negation regex would prefer the former. I'm also just curious as to how one would do it.

Patterns with no string length requirement:

At least one "X":

if (/X/.test(text)) {
     // At least one X.
}

Exactly one "X":

if (/^[^X]*X[^X]*$/.test(text)) {
     // Exactly one X.
}

3 or more "X":

if (/^(?:[^X]*X){3}/.test(text)) {
    // 3 or more X.
}

From 3 to 5 "X" (no more, no less):

if (/^[^X]*(?:X[^X]*){3,5}$/.test(text)) {
    // From 3 to 5 X.
}

Adding a minimum string length requirement:

At least one "X" and length >= 9:

if (/^(?=[\S\s]{9})[^X]*X/.test(text)) {
     // At least one X and length >= 9.
}

Exactly one "X" and length >= 9:

if (/^(?=[\S\s]{9})[^X]*X[^X]*$/.test(text)) {
     // Exactly one X and length >= 9.
}

3 or more "X" and length >= 9:

if (/^(?=[\S\s]{9})(?:[^X]*X){3}/.test(text)) {
    // 3 or more X and length >= 9.
}

From 3 to 5 "X" (no more, no less) and length >= 9:

if (/^(?=[\S\s]{9})[^X]*(?:X[^X]*){3,5}$/.test(text)) {
    // From 3 to 5 X and length >= 9.
}

Adding a string length range requirement:

From 3 to 5 "X" (no more, no less) and length from 9 to 15:

if (/^(?=[\S\s]{9,15}$)[^X]*(?:X[^X]*){3,5}$/.test(text)) {
    // From 3 to 5 X and length from 9 to 15.
}

Hopefully you get the idea and can take it from here.
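If the limits vary, the pattern can also be assembled from parameters. Here is a sketch of that idea in Python; the function name and its defaults are hypothetical, and \Z is used in place of $ because Python's $ would also accept a position just before a trailing newline.

```python
import re

def x_count_ok(text, min_x=3, max_x=5, min_len=9, max_len=15):
    # The lookahead checks the total length; the rest counts the X's.
    pattern = (r"^(?=[\S\s]{%d,%d}\Z)[^X]*(?:X[^X]*){%d,%d}\Z"
               % (min_len, max_len, min_x, max_x))
    return re.match(pattern, text) is not None

for candidate in ["aXbXcXdef", "XXX", "aXbXcXdXeXfX", "XXXXX" + "a" * 12]:
    print(candidate, x_count_ok(candidate))
```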

Note also that there is more than one way to do this, but as you can see, regex can easily handle this type of chore (assuming you "speak" the regex language itself). Note also that the DOT wildcard is not used anywhere in these expressions (although most folks use it a lot, it is rarely needed). However, the length requirement subpatterns (e.g. ^(?=[\S\s]{9,15}$)) do make use of [\S\s], which under JavaScript is equivalent to a dot in dot-matches-newline mode.

Mastering Regular Expressions

To thoroughly understand these patterns I highly recommend reading: Mastering Regular Expressions (3rd Edition) Once you have read and studied this book, problems like this become child's play!

Happy regexing!

Some context

I'm usually using the website http://regex101.com to test my regex, which provides a "debugger" feature in PCRE that lets you see what the regex engine is doing step by step.

When matching a random string with .*, this debugger tells me the engine follows the constant number of 3 steps.

When matching with (?:.)*, it announces a number depending on the length: 66 steps for something like 0123456789012345678901234567899.

Is (?:.)* really more costly than .*?

It seems that on the latter case, entering the group is considered each time to be a new step, whereas on the former the .* is applied at once.

Is that some sort of "improvement" the website is doing (trying to avoid showing useless cases), or does it match a real internal regex mechanism ? And if so, what's the idea behind the scene?

I'm not an expert on the subject, but from what I can tell, yes, /(?:.)*/ and /(.)*/ are more costly than /.*/.

According to the Perl documentation on backtracking,

A fundamental feature of regular expression matching involves the notion called backtracking, which is currently used (when needed) by all regular non-possessive expression quantifiers, namely * , *? , + , +?, {n,m}, and {n,m}?. Backtracking is often optimized internally, but the general principle outlined here is valid.

So basically, .* is optimized internally, but I can't find a source that says how.

I also found another source, a blog post by the author of Mastering Regular Expressions, Jeffrey Friedl.

By the way, I guess I should make one mention about how Perl sometimes optimizes how it deals with regular expressions. Sometimes it might actually perform fewer tests than what I've described. Or it perhaps does some tests more efficiently than others (for example, /x*/ is internally optimized such that it is more efficient than /(x)/ or /(?:x)/). Sometimes Perl can even decide that a regex can never match the particular string in question, so will bypass the test altogether.

If anyone else can explain the optimizations Perl makes in more detail, that would be useful!

Can somebody point me to a good regular expression resource (for PHP, if it matters)? I am looking for a book on Amazon right now but don't know which one is better. It would be great to find something simple to understand that makes learning fast and interesting.

Mastering Regular Expressions by: Jeffrey E.F. Friedl is considered the bible of Regular Expressions books.

Also, as Artefacto mentioned: http://www.regular-expressions.info/ is a terrific resource with clear and simple explanations.

But the best way to learn is to play with them using a regex tool like Reggy (Mac tool)

These are the only regex books you'll ever need:

Mastering Regular Expressions

Regular Expressions Cookbook

Both are great tools for learning regexes in general, but they both have lots of information specific to PHP as well.

Does anyone know how to write a regular expression in Python to get everything in between quotation marks?

For example, text: "some text here".... text: "more text in here!"... text:"and some numbers - 2343- here too"

The text are of different length, and some contain punctuation and numbers as well. How do I write a regular expression to extract all the information?

what I would like to see in the compiler:

some text here more text in here and some numbers - 2343 - here too

If the quoted sub-strings to be matched do NOT contain escaped characters, then both Karl Barker's and Pierce's answers will match correctly. However, of the two, Pierce's expression is more efficient:

reobj = re.compile(r"""
    # Match double quoted substring (no escaped chars).
    "                   # Match opening quote.
    (                   # $1: Quoted substring contents.
      [^"]*             # Zero or more non-".
    )                   # End $1: Quoted substring contents.
    "                   # Match closing quote.
    """, re.VERBOSE)

But if the quoted sub-string to be matched DOES contain escaped characters, (e.g. "She said: \"Hi\" to me.\n"), then you'll need a different expression:

reobj = re.compile(r"""
    # Match double quoted substring (allow escaped chars).
    "                   # Match opening quote.
    (                   # $1: Quoted substring contents.
      [^"\\]*           # {normal} Zero or more non-", non-\.
      (?:               # Begin {(special normal*)*} construct.
        \\.             # {special} Escaped anything.
        [^"\\]*         # more {normal} Zero or more non-", non-\.
      )*                # End {(special normal*)*} construct.
    )                   # End $1: Quoted substring contents.
    "                   # Match closing quote.
    """, re.DOTALL | re.VERBOSE)

There are several expressions I'm aware of that will do the trick, but the one above (taken from MRE3) is the most efficient of the bunch. See my answer to a similar question where these various, functionally identical expressions are compared.
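A short usage sketch may help show the difference between the two expressions. The patterns are the same as above, written on one line; the first sample text is the one from the question, and the second sample sentence is invented.

```python
import re

simple = re.compile(r'"([^"]*)"')
escaped = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"', re.DOTALL)

text = ('text: "some text here".... text: "more text in here!"... '
        'text:"and some numbers - 2343- here too"')
print(" ".join(simple.findall(text)))

# With escaped quotes inside the string, only the second pattern keeps the
# quoted sub-string in one piece.
tricky = r'He wrote: "She said: \"Hi\" to me.\n" and left.'
print(escaped.findall(tricky))
print(simple.findall(tricky))
```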

Here's my regex problem: How can I select all results with a certain ID, e.g. "...ID=99", but exclude the results continuing with an additional digit, like "ID=990" or "ID=9923"? However, if the string continues with a non-digit character ("&"), e.g. "...ID=99&PARAM=9290", it should also be included.

I am totally confused turning this into a regex selection. I would appreciate any idea on this very much!

(by the way if you are really into regex. how did you learn it? any recommended resources, books, tutorials?)

Note: I use this to filter my search results in google analytics as you can actually use regex in the "Filter Page" form. Maybe this is useful information to you.

ID=99(?!\d)

(?!\d) is a negative lookahead; it asserts that either the next character is not a digit or there is no next character. You didn't say what regex flavor you're using, but most of the so-called Perl-compatible flavors support lookaheads.

As for learning resources, the tutorial at regular-expressions.info is a great place to start. For advanced study, the Goyvaerts-Levithan book recommended by others is excellent, but Mastering Regular Expressions is still the best. Get both if you can afford them; you won't regret it.

EDIT: To be on the safe side, you might want to use \bID=99(?!\d) to avoid matching something like FOO_ID=99.
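Here is a small sketch for trying the expression against a few sample page paths. It uses Python, which is only an assumption for testing purposes (check that your Google Analytics filter accepts lookaheads before relying on it), and the paths are made up.

```python
import re

id_99 = re.compile(r"\bID=99(?!\d)")

pages = [
    "/page?ID=99",
    "/page?ID=990",
    "/page?ID=9923",
    "/page?ID=99&PARAM=9290",
    "/page?FOO_ID=99",
]

# Keep only the pages where ID is exactly 99.
kept = [p for p in pages if id_99.search(p)]
print(kept)
```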

Answering the second part of your question:

As far as books are concerned, Regular Expressions Cookbook by O'Reilly is fantastic. The first few chapters pretty much cover all the basics, and then some. The rest of the book is concrete examples.

How can I use IndexOf with SubString to pick a specific Character when there are more than one of them? Here's my issue. I want to take the path "C:\Users\Jim\AppData\Local\Temp\" and remove the "Temp\" part. Leaving just "C:\Users\Jim\AppData\Local\" I have solved my problem with the code below but this assumes that the "Temp" folder is actually called "Temp". Is there a better way? Thanks

if (Path.GetTempPath() != null) // Is it there?
{
    tempDir = Path.GetTempPath(); // Make a string out of it.
    int iLastPos = tempDir.LastIndexOf(@"\");
    if (Directory.Exists(tempDir) && iLastPos > tempDir.IndexOf(@"\"))
    {
        // Take the position of the last "\" and subtract 4.
        // 4 is the length of the word "Temp".
        tempDir = tempDir.Substring(0, iLastPos - 4);
    }
}

The other answerers have shown the best way to accomplish your goal. In the interest of expanding your knowledge further, I suggest that you look at regular expressions for your string matching and replacement needs, in general.

I spent the first couple of years of my self-taught programming career doing the most convoluted string manipulation imaginable before I realized that someone else had already solved all these problems, and I picked up a copy of Mastering Regular Expressions. I strongly recommend it.

One way to strip off the last directory is with the following regular expression:

tempDir = Regex.Match(tempDir, @".*(?=\\[^\\]+)\\?").Value;

It may look cryptic, but this will actually remove the last item from the path, regardless of its name, and regardless of whether there is another \ at the end.
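If you want to experiment with the pattern outside of .NET, the same expression can be tried in Python. This is only a sketch; the function name strip_last_dir is made up, and the fallback of returning the path unchanged when nothing matches is my own assumption:

```python
import re

def strip_last_dir(path):
    """Drop the last path component, keeping the trailing backslash."""
    # Greedy .* backs off until the lookahead sees "\name"; \\? then
    # consumes the backslash that precedes the removed component.
    m = re.match(r".*(?=\\[^\\]+)\\?", path)
    return m.group(0) if m else path

print(strip_last_dir("C:\\Users\\Jim\\AppData\\Local\\Temp\\"))
print(strip_last_dir("C:\\Users\\Jim\\AppData\\Local\\Temp"))
```

Both calls print the same parent directory, with or without the trailing backslash on the input.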

I am having trouble understanding how to keep the Norwegian letters "æ ø å" in this preg_replace function I got for turning forum titles into SEO URLs. My website is rendered in "iso-8859-1".

How I want it: someurl.com/read=kjøp_og_salg

Currently looks like this: someurl.com/read=kj_p_og_salg

//----- The seo url function ------//
    public function make_seo_name($title){
    $title = preg_replace('/[\'"]/', '', $title);
    $title = preg_replace('/[^a-zA-Z0-9]+/', '_', $title);
    $title = strtolower(trim($title, '_'));
    return $title;
}

I tried to utf8_encode/decode the $title before and after the preg_replace, but that didn't work.

Thank you for your time!

EDIT:

Solved. I fixed it with some help from "One Trick Pony" and ended up with this function:

public function make_seo_name($title){
  $title = utf8_encode($title);
  $title = preg_replace('/[\'"]/', '', $title);
  $title = preg_replace('/[^a-zA-Z0-9\ø\å\æ]+/', '_', $title);
  $title = strtolower(trim($title, '_'));
return $title;
}

Note: I did NOT need to change my header from "iso-8859-1" to "UTF-8".

The '/[^a-zA-Z0-9]+/' bit is a regular expression that matches any run of one or more characters other than a through z, A through Z, or 0 through 9. The basic syntax is on Wikipedia.

preg_replace then replaces such characters with underscores.

You can add the extra characters you want to allow to this list:

$title = preg_replace('/[^a-zA-Z0-9æøå]+/', '_', $title);
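The same idea (a negated character class that lists every character you want to keep) can be sketched in Python. Python 3 strings are Unicode, so the byte-encoding question from the PHP version does not arise here; adding the uppercase letters to the class is my own addition, not part of the original function:

```python
import re

def make_seo_name(title):
    """Turn a title into a URL slug, keeping the Norwegian letters."""
    title = re.sub(r"['\"]", "", title)                 # drop quotes
    title = re.sub(r"[^a-zA-Z0-9æøåÆØÅ]+", "_", title)  # everything else becomes "_"
    return title.strip("_").lower()

print(make_seo_name("Kjøp og salg!"))
```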

I am very weak with Regex and need help. Input looks like the following:

<span> 10/28 &nbsp;&nbsp;Currency:&nbsp;USD

Desired output is 10/28.

I need to get all text between the <span> and "Currency:" that are numbers, a "/" character, or a ":" character. No spaces.

Can you help? Thanks.

Here's a good place to start. Using others' code is fine at first, but if you don't learn this stuff you're going to be eternally doomed to asking questions every time you need a new regex.

Mastering Regular Expressions

Regular Expressions Cookbook

Online tutorial

Spend some time, learn the basics, and pretty soon you'll be helping us with our regex problems.

I feel super fail asking this... but I've been trying to make sure my values are alphanumeric, and I can't do it! if(!preg_match("^[0-9]+:[a-zA-Z]+$/", $subuser)){ $form->setError($field, "* Username not alphanumeric"); }

I did a search here and couldn't find anything... probably because this is so rudimentary =__=

Also, does anyone know of a resource (aside from PHP.net) that has a list of operators for Preg_match, and what they mean?

Thanks!

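If you just want to make sure they are alphanumeric,

if (preg_match('/\W/', $subuser)) {
  $form->setError($field, "* Username not alphanumeric");
}

will work (it matches any entry containing a non-word character; note that \W treats the underscore as a word character, so use [^a-zA-Z0-9] instead if underscores should be rejected too). But it looks like your username might require some structure.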
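For comparison, here is the stricter "letters and digits only" check as a Python sketch. The function name is made up, and the choice to reject empty strings and underscores is an assumption about what "alphanumeric" should mean here:

```python
import re

def is_alphanumeric(name):
    """True if name is non-empty and holds only ASCII letters and digits."""
    return re.fullmatch(r"[A-Za-z0-9]+", name) is not None

for sample in ["abc123", "abc_123", "a b", ""]:
    print(repr(sample), is_alphanumeric(sample))
```

Matching the whole string against an allowed class, rather than searching for a forbidden character, makes it explicit which characters are accepted.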

I thought "Mastering Regular Expressions" was one of the better programming books I've read, and I've read at least a couple hundred by now.

http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=sr_1_1?ie=UTF8&qid=1312676428&sr=8-1

I asked this question some time ago (Regular expression that does not contain quote but can contain escaped quote) and got a response, but somehow I am not able to make it work in Java.

Basically I need to write a regular expression that matches a valid string beginning and ending with quotes, which can have quotes in between provided they are escaped.

In the code below, I essentially want to match all three strings and print true, but cannot.

What should be the correct regex?

Thanks

public static void main(String[] args) {

    String[] arr = new String[] 
            { 
                "\"tuco\"", 
                "\"tuco  \" ABC\"",
                "\"tuco \" ABC \" DEF\"" 
            };

    Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");

    for (String str : arr) {
        Matcher matcher = pattern.matcher(str);
        System.out.println(matcher.matches());
    }

}

The problem is not so much your regex, but rather your test strings. The single backslash before each internal quote in your second and third example strings is consumed when the string literal is parsed. The string being passed to the regex engine has no backslash before the quote. (Try printing it out.) Here is a tested version of your function which works as expected:

import java.util.regex.*;
public class TEST
{
    public static void main(String[] args) {

        String[] arr = new String[] 
                { 
                    "\"tuco\"", 
                    "\"tuco  \\\" ABC\"",
                    "\"tuco \\\" ABC \\\" DEF\"" 
                };

//old:  Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
        Pattern pattern = Pattern.compile(
            "# Match double quoted substring allowing escaped chars.     \n" +
            "\"              # Match opening quote.                      \n" +
            "(               # $1: Quoted substring contents.            \n" +
            "  [^\"\\\\]*    # {normal} Zero or more non-quote, non-\\.  \n" +
            "  (?:           # Begin {(special normal*)*} construct.     \n" +
            "    \\\\.       # {special} Escaped anything.               \n" +
            "    [^\"\\\\]*  # more {normal} non-quote, non-\\.          \n" +
            "  )*            # End {(special normal*)*} construct.       \n" +
            ")               # End $1: Quoted substring contents.        \n" +
            "\"              # Match closing quote.                        ", 
            Pattern.DOTALL | Pattern.COMMENTS);

        for (String str : arr) {
            Matcher matcher = pattern.matcher(str);
            System.out.println(matcher.matches());
        }
    }
}

I've replaced your regex with an improved version (taken from MRE3). Note that this question gets asked a lot. Please see this answer where I compare several functionally equivalent expressions.
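The same "normal* (special normal*)*" shape can be checked quickly in Python, where the string-literal escaping is a little easier to read. This is a sketch for experimentation only; the helper name is made up and the fourth sample is my own addition to show a string that should be rejected:

```python
import re

# normal = [^"\\] (not a quote, not a backslash); special = \\. (escaped char)
quoted = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"', re.DOTALL)

def is_quoted_string(s):
    """True if s is one complete double-quoted string."""
    return quoted.fullmatch(s) is not None

samples = [
    '"tuco"',
    '"tuco  \\" ABC"',         # a real backslash precedes the inner quote
    '"tuco \\" ABC \\" DEF"',
    '"tuco " ABC"',            # unescaped inner quote
]
for s in samples:
    print(s, is_quoted_string(s))
```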

I have a regex that looks basically like this:

<(title|head)>(.*?)(String)(.*?<\/\1>

I am trying to use name groups to identify the parts

(?P<TITLE>(<(title|head)>))(.*?)(?P<NAME>(String))(.*?<\/\1>

This works when I don't use the TITLE name group:

(<(title|head)>)(.*?)(?P<NAME>(String))(.*?<\/\1>

but when I use the TITLE name group while I don't get errors I suddenly lose my match. Any ideas on how to capture the part of the regex using <>? I've tried escaping the > as well:

(?P<TITLE>(\<(title|head)\>))(.*?)(?P<NAME>(String))(.*?<\/\1>

to no avail

Numbering of mixed named and numbered capturing groups

The regex tools which support named capture (Python, .NET, PCRE/PHP, Perl 5.10, etc.) handle the numbering of mixed named and numbered capturing groups in different ways. The .NET flavor first numbers all the numbered (non-named) groups from left to right, then goes back and numbers the named groups. The PCRE/PHP flavor, however, counts the named and numbered capturing groups in one pass from left to right. Here is your regex (fixed by adding the missing closing parentheses), fully commented in both flavors to show how the mixed capture groups are numbered:

PCRE/PHP Mixed captures numbering example:

$re_php = '%
    # PCRE/PHP mixed capture numbering example.
    (?P<TITLE>          # $1: = $TITLE:
      (                 # $2:
        <(title|head)>  # $3:
      )                 # End $2:
    )                   # End $1: = $TITLE:
    (.*?)               # $4:
    (?P<NAME>           # $5: = $NAME:
      (String)          # $6:
    )                   # End $5: = $NAME:
    (.*?)               # $7:
    </\1>               # Error! Should be "\3".
    %x';

.NET Mixed captures numbering example:

Regex re_csharp = new Regex(@"
    # .NET mixed capture numbering example.
    (?<TITLE>           # $TITLE: = $6:
      (                 # $1:
        <(title|head)>  # $2:
      )                 # End $1:
    )                   # End $TITLE: = $6:
    (.*?)               # $3:
    (?<NAME>            # $NAME: = $7:
      (String)          # $4:
    )                   # End $NAME: = $7:
    (.*?)               # $5:
    </\1>               # Error! Should be '\2'.
    ", RegexOptions.IgnorePatternWhitespace);

As Tim alluded to, there are other issues with your regex as well, but I will not address them here.

Bottom line:

It's best to just not mix the two types of capturing groups. Use either all named or all numbered capturing groups. Your life will be better!
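If you want to check the numbering for yourself, Python's re module exposes it directly. Python counts named and unnamed groups together in a single left-to-right pass, so the backreference has to be \3. The test string below is made up for illustration:

```python
import re

# The asker's pattern in Python syntax, with the backreference corrected to \3.
pattern = re.compile(
    r"(?P<TITLE>(<(title|head)>))(.*?)(?P<NAME>(String))(.*?)</\3>"
)

print(pattern.groups)            # total number of capturing groups
print(dict(pattern.groupindex))  # group name -> group number

m = pattern.search("<title>My String here</title>")
print(m.group("TITLE"), m.group(3), m.group("NAME"))
```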

That said, I highly recommend reading: Mastering Regular Expressions (3rd Edition) which is where I gleaned the above information. (Hands down, the most useful book I've ever read.)

Happy Regexing!

I have this HTML document

{*
<h2 class="block_title bg0">ahmooooooooooooooooooooooooooooooooooooooooooodi</h2>
<div class="block_content padding bg0">{welc_msg}</div>
<br/>
    {*
    hii<br /><span>5
    *}

    {*
    hii<br /><span>5

    *}
*}

I want to remove it, so I want to remove anything between {* *}

I wrote regex pattern:

preg_replace("#(\{\*(.*?)\*\})+#isx",'',$html);

and it works, but it doesn't work ideally 100%, it's leaving *} at the end.

Can you give me the true pattern?

If your regex engine supports matching nested structures, (and PHP does), then you can remove the (possibly nested) elements in one pass like so:

Recursive regex applied with one pass:

function stripNestedElementsRecursive($text) {
    return preg_replace('/
        # Match outermost (nestable) "{*...*}" element.
        \{\*        # Element start tag sequence.
        (?:         # Group zero or more element contents alternatives.
          [^{*]++   # Either one or more non-start-of-tag chars.
        | \{(?!\*)  # or "{" that is not beginning of a start tag.
        | \*(?!\})  # or "*" that is not beginning of an end tag.
        | (?R)      # or a valid nested matching tag element.
        )*          # Zero or more element contents alternatives.
        \*\}        # Element end tag sequence.
        /x', '', $text);
}

The above recursive regex matches the outermost {*...*} element, which may contain nested elements.

However, if your regex engine does not support matching nested structures, you can still get the job done, but you cannot do it in one pass. A regex can be crafted that matches an innermost {*...*} element (i.e. one that does NOT contain any nested elements). This regex can be applied repeatedly until there are no more elements in the text, like so:

Non-recursive regex applied repeatedly:

function stripNestedElementsNonRecursive($text) {
    $re = '/
        # Match innermost (not nested) "{*...*}" element.
        \{\*        # Element start tag sequence.
        (?:         # Group zero or more element contents alternatives.
          [^{*]++   # Either one or more non-start-of-tag chars.
        | \{(?!\*)  # or "{" that is not beginning of a start tag.
        | \*(?!\})  # or "*" that is not beginning of an end tag.
        )*          # Zero or more element contents alternatives.
        \*\}        # Element end tag sequence.
        /x';
    while (preg_match($re, $text)) {
        $text = preg_replace($re, '', $text);
    }
    return $text;
}
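The inside-out approach carries over to engines without recursion. Here is a Python sketch; since possessive quantifiers are not available in every Python version, I have written the contents in the unrolled "normal* (special normal*)*" form instead, which is my own variation on the pattern above:

```python
import re

INNERMOST = re.compile(r"""
    \{\*              # element start sequence
    [^{*]*            # normal: anything but an open brace or a star
    (?:
        (?: \{(?!\*)  # special: an open brace that does not begin a start tag
          | \*(?!\})  #      or a star that does not begin an end tag
        )
        [^{*]*        # more normal
    )*
    \*\}              # element end sequence
""", re.VERBOSE)

def strip_nested_elements(text):
    """Remove innermost elements repeatedly until none are left."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            return text

print(repr(strip_nested_elements("a {* x {* y *} z *} b")))
```

Using subn means each pass both replaces and reports whether anything was found, so the pattern is applied once per pass.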

Dealing with nested structures with regex is an advanced topic and one must tread carefully! If one really wants to use regex for advanced applications such as this, I would highly recommend reading the classic work on this subject: Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl. I can honestly say that this is the most useful book that I have ever read.

Happy Regexing!

I have this .NET regex:

^(?<prefix>("[^"]*"))\s(?<attrgroup>(\([^\)]*\)))\s(?<suffix>("[^"]*"))$

It properly matches the following strings:

"some prefix" ("attribute 1" "value 1") "some suffix"
"some prefix" ("attribute 1" "value 1" "attribute 2" "value 2") "some suffix"

It fails on...

"some prefix" ("attribute 1" "value (fail) 1") "some suffix"

...due to the right paren after "fail".

How can I modify my regex so that the attrgroup match group will end up containing "("attribute 1" "value (fail) 1")"? I've been looking at it for too long and need some fresh eyes. Thanks!

Edit: attrgroup won't ever contain anything other than pairs of double-quoted strings.

My untested guess:

^(?<prefix>("[^"]*"))\s(?<attrgroup>(\(("[^"]*")(\s("[^"]*")*)*\)))\s(?<suffix>("[^"]*"))$

Here I've replaced

[^\)]*

with

("[^"]*")(\s("[^"]*")*)*

I assumed everything within the parentheses is either between double quotes or is whitespace.

If you want to know how I came up with this, read Mastering Regular Expressions.

P.S. If I'm correct, then this will also validate the attribute group as pairs of quoted strings.
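Since the guess above is untested, here is a quick way to try the idea in Python. Note the assumptions: Python spells named groups (?P<name>...), and I have simplified the inner part to one quoted string followed by zero or more whitespace-separated quoted strings, so this is a variation on the .NET pattern, not a copy of it:

```python
import re

pattern = re.compile(
    r'^(?P<prefix>"[^"]*")\s'
    r'(?P<attrgroup>\("[^"]*"(?:\s"[^"]*")*\))\s'
    r'(?P<suffix>"[^"]*")$'
)

def attrgroup(line):
    """Return the parenthesized attribute group, or None if no match."""
    m = pattern.match(line)
    return m.group("attrgroup") if m else None

print(attrgroup('"some prefix" ("attribute 1" "value (fail) 1") "some suffix"'))
```

Because the group contents are matched as quoted strings, a parenthesis inside a quoted value no longer ends the group early.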

I have this piece of code:

var s_1 = 'blabla [size=42]the answer[/size] bla bla blupblub';
var s_2 = 'blabla [size=42]the answer[/size] bla bla blupblub [size=32] 32 [/size]';

alert('Test-String:\n' + s_1 + '\n\nReplaced:\n' + size(s_1));
alert('Test-String:\n' + s_2 + '\n\nReplaced:\n' + size(s_2));


function size(s) {
    var reg = /\[size=(\d{1,2})\]([\u0000-\uFFFF]+)\[\/size\]/gi;
    s = s.replace(reg, function(match, p1, p2) {
        return '<span style="font-size: ' + ((parseInt(p1) > 48) ? '48' : p1) + 'px;">' + p2 + '</span>';
    })
    return s;    
}

It's supposed to replace all occurrences of the "[size=nn][/size]" tags, but it only replaces the outer ones. I can't figure out how to replace all of them. (Please don't recommend using a PHP script; I'd like to have a live preview of the BBCode-formatted text.)

Test it

Matching (possibly nested) BBCode tags

A multi-pass approach is required if the elements are nested. This can be accomplished in one of two ways: matching from the inside out (no recursive expression required), or from the outside in (which requires a recursive expression). (See also my answer to a similar question: PHP, nested templates in preg_replace.) However, since the JavaScript regex engine does not support recursive expressions, the only way to (correctly) do this using regex is from the inside out. Below is a tested function which replaces BBCode SIZE tags with SPAN HTML tags from the inside out. Note that the (fast) regex below is complex (for one thing, it implements Jeffrey Friedl's "unrolling-the-loop" efficiency technique; see Mastering Regular Expressions (3rd Edition) for details), and IMO all complex regexes should be thoroughly commented and formatted for readability. Since JavaScript has no free-spacing mode, the regex below is first presented fully commented in PHP free-spacing mode. The uncommented JS regex actually used is identical to the verbose commented one.

Regex to match innermost of (possibly nested) SIZE tags:

// Regular expression in commented (PHP string) format.
$re = '% # Match innermost [size=ddd]...[/size] structure.
    \[size=            # Literal start tag name, =.
    (\d+)\]            # $1: Size number, ending-"]".
    (                  # $2: Element contents.
      # Use Friedl's "Unrolling-the-Loop" technique:
      #   Begin: {normal* (special normal*)*} construct.
      [^[]*            # {normal*} Zero or more non-"[".
      (?:              # Begin {(special normal*)*}.
        \[             # {special} Tag open literal char,
        (?!            # but only if NOT start of
          size=\d+\]   # [size=ddd] open tag
        | \/size\]     # or [/size] close tag.
        )              # End negative lookahead.
        [^[]*          # More {normal*}.
      )*               # Finish {(special normal*)*}.
    )                  # $2: Element contents.
    \[\/size\]         # Literal end tag.
    %ix';

Javascript function: parseSizeBBCode(text)

function parseSizeBBCode(text) {
    // Here is the same regular expression in javascript syntax:
    var re = /\[size=(\d+)\]([^[]*(?:\[(?!size=\d+\]|\/size\])[^[]*)*)\[\/size\]/ig;
    while(text.search(re) !== -1) {
        text = text.replace(re, '<span style="font-size: $1pt">$2</span>');
    }
    return text;
}

Example input:

r'''
[size=10] size 10 stuff
    [size=20] size 20 stuff
        [size=30] size 30 stuff [/size]
    [/size]
[/size]
'''

Example output:

r'''
<span style="font-size: 10pt"> size 10 stuff
    <span style="font-size: 20pt"> size 20 stuff
        <span style="font-size: 30pt"> size 30 stuff </span>
    </span>
</span>
'''
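For readers who want to step through the inside-out loop outside the browser, here is a Python sketch of the same idea. It is a port for experimentation, not the tested JavaScript function above; the function name and the short test string are my own:

```python
import re

# Innermost [size=N]...[/size]: the contents may hold "[" only when it
# starts neither a [size=N] open tag nor a [/size] close tag.
innermost = re.compile(
    r"\[size=(\d+)\]([^\[]*(?:\[(?!size=\d+\]|/size\])[^\[]*)*)\[/size\]",
    re.IGNORECASE,
)

def parse_size_bbcode(text):
    """Replace SIZE tags with SPAN tags, innermost first."""
    while True:
        text, count = innermost.subn(
            r'<span style="font-size: \g<1>pt">\2</span>', text
        )
        if count == 0:
            return text

print(parse_size_bbcode("[size=10]a [size=20]b[/size] c[/size]"))
```

Each pass converts the tags that have no SIZE tags left inside them, so the loop runs once per nesting level plus one final pass that finds nothing.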

Disclaimer - Don't use this solution!

Note that using regex to parse BBCode is fraught with peril! (There are a lot of "gotchas" not mentioned here.) Many would say that it is impossible. However, I would strongly disagree and have in fact written a complete BBCode parser (in PHP) which uses recursive regular expressions and works quite nicely (and is fast). You can see it in action here: New 2011 FluxBB Parser (Note that it uses some very complex regexes not for the faint of heart).

But in general, I would strongly warn against parsing BBCode using regex unless you have a very deep and thorough understanding of regular expressions (which can be gained from careful study and practice of Friedl's masterpiece). In other words, if you are not a master of regex (i.e. a regex guru), steer clear of using them for any but the most trivial of applications.

Impossible-looking task. I have a server output that contains a dump inside a <pre>. Unfortunately I happen to dump a file that contains some HTML tags. I need to convert any inner </pre> to HTML entities so that the structure is not broken when I append the data to the DOM:

   <pre> 
      ...
      echo '<pre>'
      cat gcc.log
      echo '</pre>'
      ...
   </pre>

But there's an obvious rule: there will always be an echo ' before <pre> or </pre>. It might not be exactly echo '</pre> though.

Based on this, I have constructed already quite complex regular expression:

   <pre>   - The beginning tag
       ([\s\S]*?) - Any characters including new lines
           (?:(echo[^\n]+) - Echo and anything but new line
               (<pre>|<\/pre>|<\/xmp>|<xmp>)) - The enclosing tags
           ([\s\S]*?) - More random characters
   <\/pre>

The problem is that as soon as there are two </pre> in the code, the regexp only matches the first and treats the second as random characters - ([\s\S]*?).

How can I make the regex first try to match the explicit characters and only then the .*? stuff?

You can try the thing live at http://regex101.com

Oh, and I can't fix it on server, really

Here is a JavaScript compatible regex that matches a PRE element having exactly one nested PRE element (presented in Python free-spacing form with lots of comments so that you can understand how it works):

re_pre_inside_pre = re.compile(r"""
    # Match PRE element containing one nested PRE element.
    (<pre\b[^>]*>)      # $1: Outer PRE start tag.
    (                   # $2: Outer PRE element contents.
      (                 # $3: Stuff from <PRE> to <PRE>.
        [^<]*           # (normal*) Zero or more non-tags.
        (?:             # Begin ((special normal*)*).
          <             # (special) Any other tags,
          (?!\/?pre\b)  # but not a <PRE> or </PRE>.
          [^<]*         # More (normal*).
        )*              # End ((special normal*)*).
      )                 # End $3: Stuff from <PRE> to <PRE>.
      (<pre\b[^>]*>)    # $4: Inner PRE start tag.
      (                 # $5: Inner PRE element contents.
        [^<]*           # (normal*) Zero or more non-tags.
        (?:             # Begin ((special normal*)*).
          <             # (special) Any other tags,
          (?!\/?pre\b)  # but not a <PRE> or </PRE>.
          [^<]*         # More (normal*).
        )*              # End ((special normal*)*).
      )                 # End $5: Inner PRE element contents.
      (</pre\b\s*>)     # $6: Inner PRE end tag.
      (                 # $7: Stuff from </PRE> to </PRE>.
        [^<]*           # (normal*) Zero or more non-tags.
        (?:             # Begin ((special normal*)*).
          <             # (special) Any other tags,
          (?!\/?pre\b)  # but not a <PRE> or </PRE>.
          [^<]*         # More (normal*).
        )*              # End ((special normal*)*).
      )                 # End $7: Stuff from </PRE> to </PRE>.
    )                   # End $2: Outer PRE element contents.
    (</pre\b\s*>)       # $8: Outer PRE end tag.
    """, re.VERBOSE | re.IGNORECASE)

Note that the: normal* (special normal*)* parts are unrolled loops - an efficiency technique taken from Jeffrey Friedl's Mastering Regular Expressions (3rd Edition).

Note also the capture groups:
$1: - The outer PRE start tag.
$2: - The contents of the outer PRE element.
$3: - Stuff from <PRE> to <PRE>.
$4: - The inner PRE start tag.
$5: - The contents of the inner PRE element.
$6: - The inner PRE end tag.
$7: - Stuff from </PRE> to </PRE>.
$8: - The outer PRE end tag.

Here is a tested JavaScript function which utilizes the above regex to convert all the angle brackets within the outer PRE contents (i.e. the contents of group $2) into HTML entities:

// Process PRE element containing one nested PRE element.
function processNestedPRE(text) {
    var re_pre_inside_pre = /(<pre\b[^>]*>)(([^<]*(?:<(?!\/?pre\b)[^<]*)*)(<pre\b[^>]*>)([^<]*(?:<(?!\/?pre\b)[^<]*)*)(<\/pre\b\s*>)([^<]*(?:<(?!\/?pre\b)[^<]*)*))(<\/pre\b\s*>)/gi;
    return text.replace(re_pre_inside_pre,
        function(m0, m1, m2, m3, m4, m5, m6, m7, m8){
            // m2 has the outer PRE contents.
            // Convert all its <> angle brackets to entities:
            m2 = m2.replace(/[<>]/g,
                // Use a literal object for conversion.
                function(n0){ return {'<': '&lt;', '>': '&gt;'}[n0]; });
            // Put humpty dumpty back together again.
            return m1 + m2 + m8;
        });
}

It is unknown which parts are in need of entification - this is why I've included all the capture groups so that you can modify the replace function to do just those parts which are required.
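The long one-line regex is easier to follow when the repeated unrolled-loop piece is given a name. The Python sketch below builds the same kind of pattern from parts and escapes everything inside the outer PRE. It is an illustration only: it uses three capture groups instead of eight, and the names STUFF and process_nested_pre are made up:

```python
import re

# Unrolled loop: text and tags, but never a <pre> or </pre> tag.
STUFF = r"[^<]*(?:<(?!/?pre\b)[^<]*)*"

NESTED_PRE = re.compile(
    r"(<pre\b[^>]*>)"                  # group 1: outer start tag
    r"(" + STUFF + r"<pre\b[^>]*>"     # group 2: outer contents, holding
    + STUFF + r"</pre\b\s*>" + STUFF   #          exactly one inner PRE
    + r")(</pre\b\s*>)",               # group 3: outer end tag
    re.IGNORECASE,
)

def process_nested_pre(text):
    """Convert angle brackets inside the outer PRE contents to entities."""
    def entify(m):
        inner = m.group(2).replace("<", "&lt;").replace(">", "&gt;")
        return m.group(1) + inner + m.group(3)
    return NESTED_PRE.sub(entify, text)

print(process_nested_pre("<pre>a <pre>b</pre> c</pre>"))
```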

Hope this helps.

I'm teaching myself Perl and Regex by reading Jeffrey Friedl's excellent Mastering Regular Expressions.

While trying to solve the "A Small Mail Utility" exercise starting on page 53, I stumbled upon the problem of not knowing how to save the content of a file into a variable starting from an offset.

So here's my (shortened) script.

my ($body, $line, $subject); 
$body = $line = $subject = "";

open(MYFILE, "king.in") || die("Could not open file!");    
# Read the file's content line by line
while ($line = <MYFILE>)
{   
    # An empty line marks the beginning of the body
    if ($line =~ m/^\s+$/ ) {
        # HERE IS THE ISSUE
        # Save the file content starting from the current line
        # to the end of the file into $body
        last; 
    }

    if ($line =~ m/^subject: (.*)/i) {
        $subject = $1;
    }
    # Parse additional data from the mail header
}
close(MYFILE);

print "Subject: Re: $subject\n";
print "\n" ;
print $body;

I did some online research but couldn't figure out how to put the remainder of the file (i.e., the email body) into the variable $body.

I figured out that I could get the current position within the file in bytes using $pos = tell(MYFILE);

Eventually I ended up with the working but unsatisfactory solution of putting the file's lines first into an array.

How do I save the file content starting from an offset (either as a line number or bytes) into $body?

Edit: My solution -as provided by vstm- is to use $body = join("", <MYFILE>) to read in the rest of the file when encountering the empty line that marks the beginning of the body. The whole script I wrote can be found here.

Although this works great for me now, I would still like to know how to say (elegantly) in Perl "give me lines x to z of this file".

Thanks for your advice everybody.

Instead of stopping immediately, you could just set a flag that says "now I'm reading the body". Like this:

my $inbody = 0;

while ($line = <MYFILE>)
{   
    if($inbody) {
        $body .= $line;
        next;
    }
    # An empty line marks the beginning of the body
    if ($line =~ m/^\s+$/ ) {
        # Switch to the "body" state; every following line is
        # appended to $body by the if($inbody) branch above
        $inbody = 1;
        next;
    }

    if ($line =~ m/^subject: (.*)/i) {
        $subject = $1;
    }
    # Parse additional data from the mail header
}

It's like a mini state machine. First it's in the "header" state; when the first empty line is read it switches to the "body" state and just appends every following line to the variable.

Alternatively you could just slurp the rest of the MYFILE-handle into the body at the end of your original while-loop and before the close:

# This would be your original while loop, (I've just shortened it)
while ($line = <MYFILE>)
{   
    if ($line =~ m/^\s+$/ ) {
        last;
    }
    # Parse additional data from the mail header
}

# The MYFILE-handle is now still valid and at the beginning of the body
$body = join("", <MYFILE>);

# now you can close the handle
close(MYFILE);
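The second approach (stop at the blank line, then read whatever is left on the same handle) looks like this as a Python sketch. The function name and the sample message are made up, and io.StringIO stands in for the real file so the example is self-contained:

```python
import io
import re

def split_mail(handle):
    """Parse header lines up to the first blank line, then slurp the body."""
    subject = ""
    for line in handle:
        if line.strip() == "":        # blank line: the header is over
            break
        m = re.match(r"subject: (.*)", line, re.IGNORECASE)
        if m:
            subject = m.group(1)
    body = handle.read()              # the handle is now at the body
    return subject, body

mail = io.StringIO("From: elvis\nSubject: Be seein' ya\n\nLine 1\nLine 2\n")
subject, body = split_mail(mail)
print("Subject: Re:", subject)
print(body, end="")
```

The point is the same as in the Perl version: leaving the loop does not reset the handle, so the next read continues where the header parsing stopped.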

I'm trying to understand how (?<name>pattern) works in Regex. Is there a good link or can someone offer a simple explanation?

From Mastering Regular Expressions:

Named capture:

\b(?<Area>\d\d\d)-(?<Exch>\d\d\d)-(?<Num>\d\d\d\d)\b

This "fills the names" Area, Exch, and Num with the components of a US phone number. The program can then refer to each matched substring through its name, for example, RegexObj.Groups("Area") in VB.NET and most other .NET languages, RegexObj.Groups["Area"] in C#, RegexObj.group("Area") in Python, and $matches["Area"] in PHP. The result is clearer code.

Within the regular expression itself, the captured text is available via \k<Area> with .NET, and (?P=Area) in Python and PHP.

With Python and .NET (but not with PHP), you can use the same name more than once within the same expression.
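Here is the phone-number example as a runnable Python sketch. Python spells the group (?P<name>...); the sample text and the second pattern, which shows the (?P=Area) backreference, are my own additions:

```python
import re

phone = re.compile(r"\b(?P<Area>\d\d\d)-(?P<Exch>\d\d\d)-(?P<Num>\d\d\d\d)\b")

m = phone.search("Call 555-867-5309 today")
print(m.group("Area"), m.group("Exch"), m.group("Num"))
print(m.groupdict())   # all named groups at once

# (?P=Area) matches the same text that the group named Area captured.
repeat = re.compile(r"(?P<Area>\d\d\d)-(?P=Area)")
print(bool(repeat.search("555-555")), bool(repeat.search("555-556")))
```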

I'm writing a document for a software that has heavy use of regex. Which of the following expressions are acceptable, regarding the use of the word match?

  • The regex a. matches the string aa.
  • The string aa matches the regex a..
  • This function lists all the strings matched by the regex.
  • This function lists all the strings matched with the regex.
  • This function lists all the strings matching the regex.

I've seen many usages of [regex] matches [string]. Are others acceptable?

This question probably belongs to English.stackexchange or English language learners, but I thought this was a bit too technical and decided to ask here. Anyway, I believe this can be considered as a practical, answerable problem that is unique to software development.

Definitely "regex matches the string". This usage is also consistent with the excellent book Mastering Regular Expressions by Jeffrey Friedl. You can take a peek at the book on Amazon to see some examples.

Is there a way to improve this code and maintain functionality? Some of that is result of checking that code and outputs on Windows and Linux, so to be "multi-OS" is necessary.

// Remove tags
$input = strip_tags($input);
// Converts accented to non-accented
$input =  iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
// String to lower
$input = strtolower($input);
// Remove all non-word and non-space chars
$input = preg_replace('/[^\sa-z]/', '', $input);
// Replace enters
$input = preg_replace('/[\r\n]/', ' ', $input);
// Remove stopwords
$input = preg_replace('/\b(' . implode('|', $stopwords) . ')\b/', '', $input);
// Remove individual chars
$input = preg_replace('/\b([a-z])\b/', '', $input);
// Trim it
$input = trim($input);
// Remove multiple spaces
$input = preg_replace("/[[:blank:]]+/", " ", $input);

INPUT:

<doc id="603" url="http://it.wikipedia.org/wiki/Esperanto">
Esperanto.
Esperanto (pôvodne Lingvo Internacia – „medzinárodný jazyk“) je najrozšírenejší <a     href="Medzin%C3%A1rodn%C3%BD_pomocn%C3%BD_jazyk">medzinárodný</a> <a     href="Umel%C3%BD_jazyk">plánový jazyk</a>. Názov je odvodený od <a     href="Pseudonym">pseudonym</a>u, pod ktorým v roku <a href="1887">1887</a> zverejnil lekár     <a href="Ludwik_Lejzer_Zamenhof">L. L. Zamenhof</a> základy tohto jazyka. Zámerom tvorcu     bolo vytvoriť ľahko naučiteľný a použiteľný neutrálny jazyk, vhodný na použitie v     medzinárodnej komunikácii. Cieľom nebolo nahradiť <a href="N%C3%A1rodn%C3%BD_jazyk">národné     jazyky</a>, čo bolo neskôr aj deklarované v <a     href="Boulonsk%C3%A1_deklar%C3%A1cia">Boulonskej deklarácii</a>.
Hoci žiaden <a href="%C5%A1t%C3%A1t">štát</a> neprijal esperanto ako <a href="%C3%BAradn%C3%BD_jazyk">úradný jazyk</a>, používa ho komunita s odhadovaným počtom hovoriacich 100 000 až 2 000 000, z čoho približne 1 000 tvoria rodení hovoriaci. Získalo aj isté medzinárodné uznania, napríklad dve rezolúcie <a href="UNESCO">UNESCO</a> či podporu známych osobností verejného života. V súčasnosti sa esperanto využíva pri <a href="Cestovanie">cestovaní</a>, dopisovaní, medzinárodných stretnutiach a kultúrnych výmenách, <a href="Kongres">kongres</a>och, <a href="Veda">vedeckých</a> diskusiách, v pôvodnej aj prekladovej
</doc>

OUTPUT:

esperanto esperanto povodne lingvo internacia medzinarodny jazyk najrozsirenejsi medzinarodny planovy jazyk nazov odvodeny pseudonymu ktorym roku zverejnil lekar zamenhof zaklady tohto jazyka zamerom tvorcu vytvorit lahko naucitelny pouzitelny neutralny jazyk vhodny pouzitie medzinarodnej komunikacii cielom nebolo nahradit narodne jazyky neskor deklarovane boulonskej deklaracii hoci ziaden stat neprijal esperanto uradny jazyk pouziva komunita odhadovanym poctom hovoriacich coho priblizne tvoria rodeni hovoriaci ziskalo iste medzinarodne uznania napriklad dve rezolucie unesco podporu znamych osobnosti verejneho zivota sucasnosti esperanto vyuziva cestovani dopisovani medzinarodnych stretnutiach kulturnych vymenach kongresoch vedeckych diskusiach povodnej prekladovej

m.buettner's answer is a good start. With that solution my benchmarks measure better than 25% speed improvement.

PCRE 'S' "Study" Modifier

For certain regexes, the PCRE 'S' Study modifier can speed up matching quite a bit. Here is an enhanced version of m.buettner's code:

// Remove tags
$input = strip_tags($input);
// Converts accented to non-accented
$input =  iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
// String to lower
$input = strtolower($input);
// Remove all non-word and non-space chars
$input = preg_replace('/[^\sa-z]+/S', '', $input);
// Replace all (consecutive) whitespace characters with a single space:
$input = preg_replace('/\s+/S', ' ', $input);
// Remove all stopwords and single letters:
$input = preg_replace('/\b('.implode('|', $stopwords).'|[a-z])\b/', '', $input);
// Trim it
$input = trim($input);

This improves it further, for about a 45% speed improvement over the original. Note that the S Study modifier does not help with regexes that begin with literal text or anchors. Depending on how many entries you've got in there, the bottleneck may be the $stopwords statement. (I used a simple array with four elements for my benchmarking: ['one','two','three','four'].) A much larger $stopwords array will probably show less improvement.

There are many more useful efficiency tidbits like this one in the classic: Mastering Regular Expressions (3rd Edition) - a MUST READ for everyone who uses regexes on a regular basis.
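For anyone who wants to experiment with the regex steps of this pipeline outside PHP, here is a Python sketch. It covers only the regex-based steps (no tag stripping or transliteration), Python has no equivalent of the S modifier, and I collapse whitespace last so that the gaps left by removed words are squeezed too; all of that is my own choice, not part of the benchmarked code:

```python
import re

def clean(text, stopwords):
    """Lowercase, keep only a-z and whitespace, drop stopwords and single letters."""
    text = text.lower()
    text = re.sub(r"[^\sa-z]+", "", text)
    # One alternation for all stopwords plus any single letter.
    alternatives = [re.escape(w) for w in stopwords] + ["[a-z]"]
    text = re.sub(r"\b(?:" + "|".join(alternatives) + r")\b", "", text)
    # Collapse every run of whitespace (including newlines) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean("Esperanto  is\na  planned language!", ["is"]))
```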

8^)

I try to get the following using js regex:

string</span>string</span>string</span>theStrngIWant

In one sentence: I am trying to get any characters, including newlines, after the last </span>.

I tried this pattern: /<\/span>((.|\n)*)/i. I know the pattern won't work; I just wanted to show what I want to capture.

So after the last </span> I want to capture anything matched by the . token, and because that does not include newlines I added \n. I think it's OK to use greedy matching without ? because it's the end of the string anyway. Also, just to note, I did try negative lookahead.

If anyone knows a regex for this case I will be very thankful.

Don't parse HTML using Regex

First, you are asking for trouble if you decide to parse HTML using regex for any production/important code.

That said, for non-critical rough editing purposes, HamZa's pattern works just fine. Here is a slightly more complex (but more efficient) pattern in the form of a tested JavaScript function:

function processText(text) {
/*  # Capture in $1, everything following last SPAN element.
    <\/span\s*>        # Last SPAN close tag.
    (                  # $1: Everything after last SPAN.
      [^<]*            # Zero or more non start-of-tag chars.
      (?:              # Zero or more non-SPAN tags.
        <              # Allow start of any HTML tag, but
        (?!\/?span\b)  # only if it does not start a SPAN tag.
        [^<]*          # Zero or more non start-of-tag chars.
      )*               # End zero or more non-SPAN tags.
    )                  # End $1: Everything after last SPAN.
    $                  # Anchor to end of string.
*/
    var re = /<\/span\s*>([^<]*(?:<(?!\/?span\b)[^<]*)*)$/i;
    var m = text.match(re);
    return (m) ? m[1] : '';
}

The regex is also presented (as a multi-line comment) in free-spacing mode with indentation and comments describing each bite-size regex chunk.
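As a rough cross-check, the commented form of the regex can be run directly in a language that supports free-spacing mode. The sketch below uses Python's re.VERBOSE flag (Python is my assumption here; the question is about JavaScript), and the names LAST_SPAN_TAIL and text_after_last_span are made up for illustration. It anchors with \Z rather than $ because Python's $ also matches just before a trailing newline.

```python
import re

# Same pattern as the JavaScript function, written in free-spacing mode.
LAST_SPAN_TAIL = re.compile(r"""
    </span\s*>           # Last SPAN close tag.
    (                    # Group 1: everything after the last SPAN.
      [^<]*              # Zero or more non start-of-tag chars.
      (?:                # Zero or more non-SPAN tags.
        <                # Allow the start of any HTML tag, but
        (?!/?span\b)     # only if it does not start a SPAN tag.
        [^<]*            # Zero or more non start-of-tag chars.
      )*                 # End zero or more non-SPAN tags.
    )                    # End group 1.
    \Z                   # Anchor to the very end of the string.
    """, re.VERBOSE | re.IGNORECASE)


def text_after_last_span(text):
    """Return everything after the last </span>, or '' if there is none."""
    m = LAST_SPAN_TAIL.search(text)
    return m.group(1) if m else ""


print(text_after_last_span("string</span>string</span>string</span>theStrngIWant"))
```

Because the character class [^<] matches newlines as well, no special "dot matches newline" handling is needed.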

Learning Regex

For more info on how to write a good regex, I recommend reading the tutorial at: regular-expressions.info/. If you would like to become a regex guru, then you would be well served by reading: Mastering Regular Expressions (3rd Edition)

I've been using regex for years, I've read several tutorials and references (emacs regex reference is my bible), but I still have problems understanding matching. Is there a good comprehensive tutorial on regex matching with abundant examples? Can anybody give me a link where I can finally deeply understand regex matching?

Example of the problem bothering me:

haystack = "[{one, {one, andahalf}},\n {{two, zero}, two},\n {{threezero}, three},\n {four}]"
pattern = "({.+})"

Result is:

{one, {one, andahalf}}
{{two, zero}, two}
{{threezero}, three}
{four}

Now, what is that exactly? Greedy or nongreedy (it's C# Regex.Matches)?

Why, oh why, isn't the (nongreedy) result:

{one, {one, andahalf}
{{two, zero}
{{threezero}
{four}

(matching first possible pair of {})

Or (greedy):

{one, {one, andahalf}},\n {{two, zero}, two},\n {{threezero}, three},\n {four}

(matching greatest possible pair of {})

Of course, the actual result is exactly what I need, and I'm very happy that regex reads my mind, but I'd rather that I read his mind :-D So, does anybody have any decent tutorial on regex matching which will help me understand how this match did what it did?

This is an easy one... Read: Mastering Regular Expressions (3rd Edition)

This is, hands down, the most useful book I've read in my life. Very clear, accurate and error-free presentation of the material. An entertaining and thorough tutorial to gain a deep understanding of exactly how an NFA regex engine works under the hood and how you can utilize this knowledge to begin crafting accurate and efficient regexes (for just about any language).

When it comes to regex, there are two types of people: those who have read this book, and those who haven't.

(You can spot the ones who haven't by all the .* dot-stars in their expressions.)
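For the concrete example in the question, the short version is that + is greedy, but the dot does not match a newline by default, so each match is confined to a single line. A minimal sketch in Python (my assumption; the question uses C#, where the dot likewise excludes \n unless RegexOptions.Singleline is set) reproduces all three outcomes discussed in the question:

```python
import re

haystack = ("[{one, {one, andahalf}},\n {{two, zero}, two},\n"
            " {{threezero}, three},\n {four}]")

# Greedy, dot stops at newlines: each match runs to the LAST '}' on its line.
greedy = re.findall(r"\{.+\}", haystack)

# Lazy: each match stops at the FIRST '}' after the opening '{'.
lazy = re.findall(r"\{.+?\}", haystack)

# Greedy with "dot matches newline": one match spanning all lines.
greedy_dotall = re.findall(r"\{.+\}", haystack, flags=re.DOTALL)

print(greedy)         # ['{one, {one, andahalf}}', '{{two, zero}, two}', '{{threezero}, three}', '{four}']
print(lazy)           # ['{one, {one, andahalf}', '{{two, zero}', '{{threezero}', '{four}']
print(greedy_dotall)  # a single match from the first '{' to the last '}'
```

So the actual result is the greedy one; it only looks "just right" because every top-level item happens to sit on its own line.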

I want to check the following with a regular expression:

{Today,Format}

Today will remain as it is. In place of Format, we can allow the numbers from 0 to 12.

For example, we have to allow:

{Today,0}
{Today,1}
{Today,2}
...
{Today,12}

and we also have to allow:

{Today,}
{Today,Format}

Please help me and also refer me to some site to develop my regular expression skills.

\{Today,(\d|1[012]|Format)?\}

Meaning:

  • Open curly brace;
  • 'Today,';
  • Optionally one of the following: a digit (0-9), 1 followed by 0, 1 or 2 (10,11,12), 'Format'; and then
  • Close curly brace.

As for resources I can recommend this site on regular expressions and the book Mastering Regular Expressions.
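One thing to keep in mind: the pattern above has no anchors, so for validation it should be applied to the whole string (or wrapped in ^...$). A small sketch in Python (my choice of language, the question does not name one; TODAY and is_valid are illustrative names) checks the accepted and rejected forms:

```python
import re

# The pattern from the answer, unchanged.
TODAY = re.compile(r"\{Today,(\d|1[012]|Format)?\}")


def is_valid(token):
    """True only if the WHOLE token matches the pattern."""
    return TODAY.fullmatch(token) is not None


accepted = ["{Today,}", "{Today,0}", "{Today,9}", "{Today,10}",
            "{Today,12}", "{Today,Format}"]
rejected = ["{Today,13}", "{Today,012}", "{Today,format}", "{Today}"]

for token in accepted + rejected:
    print(token, is_valid(token))
```

Note that the first alternative \d matches the 1 of 12 first; the engine then backtracks to 1[012] when the closing brace fails to match, so the order of the alternatives does not break the two-digit cases.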

The method is supposed to take in the name of a book and return it in proper title case. All of my specs pass (handles non-letter characters, handles upper and mixed cases) except the last one, which is to return special words like McDuff or McComb with a capital 3rd letter. Anyone see what I'm doing wrong? And, is there a way to simplify this, using the tools at hand and not some higher level shortcut?

class String
  define_method(:title_case) do
    sentence_array = self.downcase.split
    no_caps = ["a", "an", "the", "at", "by", "and", "as", "but", "or", "for", "in", "nor", "on", "at", "up", "to", "on", "of", "from", "by"]

    sentence_array.each do |word|
      if no_caps.include?(word)
        word
      else
        word.capitalize!
      end
      sentence_array.first.capitalize!

      #  Manage special words
      if (word.include?("mc"))
        letter_array = word.split!("")    # word with mc changed to an array of letters
          if (letter_array[0] == "m") && (letter_array[1] == "c")  # 1st & 2nd letters
            letter_array[2].capitalize!
            word = letter_array.join
          end
      end  
    end
    sentence_array.join(" ")
  end
end

First of all, please don't monkey patch. This is bad design; just make a helper function that takes the argument you need (a string in your case).

def title_case(string)
    no_caps = %w(a an the at by and as but or for in nor on at up to on of from by)
    no_caps_regex = /\b(#{no_caps.join('|')})\b/i # match separate words from above, case-insensitive

    # you will need ActiveSupport (or Rails) for +String#titleize+ support
    titleized = string.titleize
    handle_special = titleized.gsub(/\b(mc)(.+?)\b/i) do |match| 
        [$1, $2].map(&:capitalize).join 
    end

    no_capsed = handle_special.gsub(no_caps_regex) { |match| match.downcase }
end

title_case('mcdonalds is fast food, but mrmcduff is not')
# => "McDonalds Is Fast Food, but Mrmcduff Is Not"
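If ActiveSupport is not at hand, the same idea can be sketched with a single regex substitution and a callback. The version below is in Python rather than Ruby (my assumption, purely for illustration), and NO_CAPS, WORD and fix_word are made-up names. As a design choice, it keeps the first word capitalized even when that word is in the no-caps list.

```python
import re

NO_CAPS = {"a", "an", "the", "at", "by", "and", "as", "but", "or",
           "for", "in", "nor", "on", "up", "to", "of", "from"}
WORD = re.compile(r"[A-Za-z']+")


def title_case(text):
    first = WORD.search(text)
    first_start = first.start() if first else -1

    def fix_word(match):
        word = match.group(0).lower()
        # Small words stay lowercase unless they open the title.
        if word in NO_CAPS and match.start() != first_start:
            return word
        # Special case: "mcduff" -> "McDuff" (prefix at word start only).
        if word.startswith("mc") and len(word) > 2:
            return "Mc" + word[2:].capitalize()
        return word.capitalize()

    return WORD.sub(fix_word, text)


print(title_case("mcdonalds is fast food, but mrmcduff is not"))
# McDonalds Is Fast Food, but Mrmcduff Is Not
```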

UPDATE: I am sorry about that, it was really bad reading, but I still want to elaborate on the terms you found confusing:

  1. Monkey patching is a technique, available in some dynamic languages (Ruby or JavaScript, for example), where you can change already existing classes (add or remove methods/properties), such as String, Fixnum, DateTime and others. Often this technique is used for "enhancing" core types (exactly like you did in your code, adding the method title_case to String). The problem here is that if any other library developer chooses the same name and adds it to the String class, and you eventually want to try his library in your project, your implementations will clash and whichever one is added later wins (depending on the code loading time, usually yours). This will either break your code or break the library, which is no good either.

    Another similar problem is when you try to "fix" some bug in a third-party library this way. You monkey patch it, everything works and you forget about it. Then 6 months later you decide to upgrade the library to a new version and suddenly everything blows up, because the library code clashes with your changes, and you may not even remember your monkey patch (or it may be another developer, who doesn't even know that your monkey patch exists).

  2. A helper function is just some function that you can add either a) to a separate file, called a helper, or b) to the current controller/model (the place you need it).

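  3. \b is a zero-width assertion that matches at a word boundary, i.e. the position between a word character and a non-word character (or the start or end of the string). For example, the regex /as/ matches the word as and also matches inside the word fast, since it contains as. If you instead use /\bas\b/, only the standalone word as will be matched.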

    Regexes are very powerful; please find some time to learn them, and you'll boost your text processing skills to the next level. Combined with some knowledge of console tools (I mean commands in UNIX terminals, such as ls, ps, find, grep, etc.), they can be very powerful in day-to-day routines such as "do yesterday's logs contain some IP?", "what is the name of the process that is eating all the memory on my machine right now?" or "which files in my project contain this function call?".

    The classic book on this subject is J. Friedl's "Mastering Regular Expressions", highly recommended.
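To make the \b point concrete, here is a tiny sketch in Python (my assumption; the same regexes behave the same way in Ruby). The word list and variable names are made up for illustration.

```python
import re

words = ["as", "fast", "ask", "has"]

# Plain /as/ matches anywhere inside a word.
contains_as = [w for w in words if re.search(r"as", w)]

# /\bas\b/ requires a word boundary on both sides of "as".
whole_word_as = [w for w in words if re.search(r"\bas\b", w)]

print(contains_as)    # ['as', 'fast', 'ask', 'has']
print(whole_word_as)  # ['as']

# In running text, only the standalone occurrences are found.
print(re.findall(r"\bas\b", "as fast as a flash"))  # ['as', 'as']
```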

Have a nice day.

When learning regular expressions, where would you start? I am looking for a good set of things to learn so I can build a good base. I don't expect to know everything from memory, but if I could learn the correct things - and enough of them - I could have a good head start on this.

Please give me your suggestions so I can have an efficient start to learning regular expressions.