Algorithms on Strings, Trees and Sequences

Dan Gusfield


String algorithms are a traditional area of study in computer science. In recent years their importance has grown dramatically with the huge increase of electronically stored text and of molecular sequence data (DNA or protein sequences) produced by various genome projects. This 1997 book is a general text on computer algorithms for string processing. In addition to pure computer science, the book contains extensive discussions on biological problems that are cast as string problems, and on methods developed to solve them. It emphasises the fundamental ideas and techniques central to today's applications. New approaches to this complex material simplify methods that up to now have been for the specialist alone. With over 400 exercises to reinforce the material and develop additional topics, the book is suitable as a text for graduate or advanced undergraduate students in computer science, computational biology, or bio-informatics. Its discussion of current algorithms and techniques also makes it a reference for professionals.


Mentioned in questions and answers.

To build a suffix tree, in the worst case, if all the letters of the string are different, the complexity would be something like

n + (n-1) + (n-2) + ... + 1 = n(n+1)/2

which is O(n^2).

However, according to http://en.wikipedia.org/wiki/Suffix_tree, building a suffix tree takes O(n) time. What am I missing here?

Your intuition behind why the algorithm should be Θ(n^2) is a good one, but most suffix trees are designed in a way that eliminates the need for this time complexity. Intuitively, it would seem that you need Θ(n^2) different nodes to hold all of the different suffixes, because you'd need n + (n - 1) + ... + 1 different nodes. However, suffix trees are typically designed so that there isn't a single node per character in the suffix. Instead, each edge is typically labeled with a sequence of characters that form a substring of the original string.

It still may seem that you'd need Θ(n^2) time to construct this tree, because you'd have to copy the substrings over to these edges, but typically this is avoided by a cute trick: since all the edges are labeled with strings that are substrings of the input, each edge can instead be labeled with a start and an end position, meaning that an edge spanning Θ(n) characters can be constructed in O(1) time and O(1) space.
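To make the index-pair trick concrete, here is a minimal sketch in C of what such an edge representation might look like (the struct and field names are purely illustrative, not taken from any particular suffix tree implementation):

#include <stddef.h>

/* An edge stores indices into the single shared input string instead of a
 * copy of the substring, so labelling an edge is O(1) no matter how many
 * characters the label spans. */
struct Node;

struct Edge {
    size_t start;          /* index of the label's first character */
    size_t end;            /* index one past the label's last character */
    struct Node *child;    /* subtree hanging below this edge */
};

struct Node {
    struct Edge *edges;    /* outgoing edges, at most one per alphabet symbol */
    size_t num_edges;
};

/* The label of edge e within string s is the range s[e->start .. e->end - 1];
 * reading it out costs O(label length), but creating it cost only O(1). */
static size_t edge_label_length(const struct Edge *e)
{
    return e->end - e->start;
}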

That said, constructing suffix trees is still genuinely hard to do. The Θ(n) algorithms referenced in Wikipedia aren't easy. One well-known algorithm that works in linear time is Ukkonen's algorithm, which is commonly described in textbooks on string algorithms (such as Algorithms on Strings, Trees, and Sequences). The original paper is linked in Wikipedia.

Hope this helps!

Problem: I need the length of the LCS between two strings. The size of the strings is at most 100 characters. The alphabet is the usual DNA one, four characters: "ACGT". The dynamic programming approach is not quick enough.

My problem is that I am dealing with lots and lots of pairs (on the order of hundreds of millions, as far as I can see). I believe I have reduced the calls to the LCS_Length function to the minimum possible, so the only other way to make my program run faster is to have a more efficient LCS_Length function.

I started off by implementing the usual dynamic programming approach. That gives the correct answer and is hopefully implemented properly.

#include <string.h>

#define arrayLengthMacro(a) (strlen(a) + 1)
#define MAX_STRING 101

static int MaxLength(int lengthA, int lengthB);

/* 
 * The two strings are compared following the standard dynamic-programming
 * LCS table algorithm. Since we only require the length of the LCS,
 * we can get this rather easily.
 */
int LCS_Length(char *a, char *b)
{
    int lengthA = arrayLengthMacro(a), lengthB = arrayLengthMacro(b),
        LCS = 0, i, j, maxLength, board[MAX_STRING][MAX_STRING];

    maxLength = MaxLength(lengthA, lengthB);

    /* Initialise row 0 and column 0 of the DP table, covering the full
     * border of the longer string. */
    for (i = 0; i < maxLength; i++)
    {
        board[i][0] = 0;
        board[0][i] = 0;
    }

    for (i = 1; i < lengthA; i++)
    {
        for (j = 1; j < lengthB; j++)
        {
            /* On a match, extend the LCS ending at (i-1, j-1) by one. */
            if (a[i - 1] == b[j - 1])
            {
                board[i][j] = board[i-1][j-1] + 1;
                /* Table values grow by at most 1 per step, so this
                 * tracks the maximum value seen so far. */
                if (LCS < board[i][j])
                {
                    LCS++;
                }
            }
            else /* Otherwise carry over the larger neighbouring value. */
            {
                if (board[i-1][j] > board[i][j-1])
                {
                    board[i][j] = board[i-1][j];
                }
                else
                {
                    board[i][j] = board[i][j-1];
                }
            }
        }
    }

    return LCS;
}

static int MaxLength(int lengthA, int lengthB)
{
    return (lengthA > lengthB) ? lengthA : lengthB;
}

That should be O(mn) (hopefully).
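A standard space optimization (sketched here as an aside, under the same includes and MAX_STRING assumption as above, with an illustrative function name): since each row of the table depends only on the previous row, two rows suffice instead of the full table, which also tends to be friendlier to the cache.

/* Two-row variant of the same DP: row i depends only on row i-1, so two
 * arrays of length MAX_STRING suffice instead of the full table. */
int LCS_Length_TwoRows(const char *a, const char *b)
{
    int prev[MAX_STRING] = {0}, curr[MAX_STRING] = {0};
    int lengthA = (int)strlen(a), lengthB = (int)strlen(b), i, j;

    for (i = 1; i <= lengthA; i++)
    {
        for (j = 1; j <= lengthB; j++)
        {
            if (a[i - 1] == b[j - 1])
                curr[j] = prev[j - 1] + 1;
            else
                curr[j] = (prev[j] > curr[j - 1]) ? prev[j] : curr[j - 1];
        }
        memcpy(prev, curr, sizeof(prev));  /* current row becomes previous */
    }
    return prev[lengthB];
}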

Then, in search of speed, I found this post: Longest Common Subsequence, which gave the O(ND) paper by Myers. I tried this, which relates the LCS to the shortest edit script (SES). The relation they give is D = M + N - 2L, where D is the length of the SES, M and N are the lengths of the two strings, and L is the LCS length. But this isn't always correct in my implementation. I give my implementation below (which I think is correct):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define arrayLengthMacro(a) strlen(a)

int LCS_Length (char *a, char *b);
int MinLength (int A, int B);
int Max (int A, int B);
int snake(int k, int max, char *a, char *b, int lengthA, int lengthB);

int main(void)
{
    int L;
    char a[] = "tomato", b[] = "potato"; //must give LCS = 4
    L =  LCS_Length(a, b);
    printf("LCS: %d\n", L );    
    char c[] = "GTCGTTCGGAATGCCGTTGCTCTGTAAA", d[] = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"; // must give LCS = 20
    L =  LCS_Length(c, d);
    printf("LCS: %d\n", L );
    char e[] = "human", f[] = "chimpanzee"; // must give LCS = 4
    L =  LCS_Length(e, f);
    printf("LCS: %d\n", L );
    char g[] = "howareyou", h[] = "whoareyou"; // LCS =8
    L =  LCS_Length(g, h);
    printf("LCS: %d\n", L );
    char i[] = "TTCTTTCGGTAACGCCTACTTTATGAAGAGGTTACATTGCAATCGGGTAAATTAACCAACAAGTAATGGTAGTTCCTAGTAGAGAAACCCTCCCGCTCAC", 
        j[] = "GCACGCGCCTGTTGCTACGCTCTGTCCATCCTTTGTGTGCCGGGTACTCAGACCGGTAACTCGAGTTGCTATCGCGGTTATCAGGATCATTTACGGACTC"; // 61
    L =  LCS_Length(i, j);
    printf("LCS: %d\n", L );


    return 0;
}

int LCS_Length(char *a, char *b)
{
    int D, lengthA = arrayLengthMacro(a), lengthB = arrayLengthMacro(b),
        max, *V_, *V, k, x, y;

    max = lengthA + lengthB;
    /* k ranges over -max..max, so allocate 2*max + 1 slots and centre V
     * on the middle of the buffer. */
    V_ = malloc(sizeof(int) * (2 * max + 1));
    if (V_ == NULL)
    {
        fprintf(stderr, "Failed to allocate memory for LCS\n");
        exit(1);
    }
    V = V_ + max;
    V[1] = 0;

    for (D = 0; D < max; D++)
    {
        for (k = -D; k <= D; k = k + 2)
        {
            if ((k == -D && V[k-1] < V[k+1]) || (k != D && V[k-1] < V[k+1]))
            {
                x = V[k+1];
            }
            else
            {
                x = V[k-1] + 1;
            }
            y = x - k;
            while ((x < lengthB) && (y < lengthA) && (a[x+1] == b[y+1]))
            {
                x++;
                y++;
            }
            V[k] = x;
            if ((x >= lengthB) && (y >= lengthA))
            {
                free(V_);
                return (lengthA + lengthB - D)/2;
            }
        }
    }
    free(V_);
    return (lengthA + lengthB - D)/2;
}

There are examples in the main. E.g., "tomato" and "potato" (don't comment) have an LCS length of 4. The implementation finds that it is 5, since the SES (D in the code) comes out as 2 instead of the 4 I'd expect (delete "t", insert "p", delete "m", insert "t"). I am inclined to think that maybe the O(ND) algorithm counts substitutions as well, but I am not sure about this.

Any approach that is implementable (I don't have immense programming skills) is welcome. (If someone knows how to take advantage of the small alphabet, for example.)

EDIT: I think it would be useful to say, on top of everything else, that I use the LCS function between ANY two strings at ANY time. So it is not only one string, say s1, compared with a few million others. It might be s200 with s1000, then s0 with s10000, and then s250 with s100000. I'm not likely to be able to track the most-used strings either. I require that the LCS length is NOT an approximation, since I am implementing an approximation algorithm and I don't want to add extra error.

EDIT: I just ran callgrind. For an input of 10,000 strings I seem to call the LCS function about 50,000,000 times, on different pairs of strings. (10,000 strings is the smallest input I want to run, and the maximum is 1 million, if that is feasible.)

I'd recommend getting hold of a copy of Gusfield's Algorithms on Strings, Trees and Sequences, which is all about string operations for computational biology.

What are some data structures that should be known by somebody involved in bioinformatics? I guess anyone is supposed to know about lists, hashes, balanced trees, etc., but I expect there are domain-specific data structures. Is there any book devoted to this subject?

In addition to basic familiarity with the structures you mentioned, suffix trees (and suffix arrays), de Bruijn graphs, and interval graphs are used extensively. The Handbook of Computational Molecular Biology is very well written. I've never read the whole thing, but I've used it as a reference.

A lot of introductory books on bioinformatics will cover some of the basic structures you'd use. I'm not sure what the standard textbook is, but I'm sure you can find that. It might also be useful to look at some of the language-specific books; I chose two as examples because they're published by O'Reilly, which, in my experience, publishes good-quality books.

I just so happen to have the Python book on my hard drive, and a great deal of it talks about processing strings for bioinformatics using Python. It doesn't seem like bioinformatics uses any fancy special data structures, just existing ones.

The most fundamental data structure used in bioinformatics is the string. There is also a whole range of different data structures for representing strings, and algorithms such as string matching depend on these efficient representations.

A comprehensive work on this is Dan Gusfield's Algorithms on Strings, Trees and Sequences.

Spatial data structures such as the k-d tree are also often used for nearest-neighbour queries over arbitrary feature vectors, as well as for 3D protein structure analysis.

Best book for your $$ is Understanding Bioinformatics by Zvelebil because it covers everything from sequence analysis to structure comparison.

I'm looking for an algorithm, or even an algorithm space, that deals with the problem of validating that short text (email) matches known templates. Coding will probably be in Python or Perl, but that's flexible.

Here's the problem:

Servers with access to production data need to be able to send out email that will reach the Internet:

Dear John Smith,

We received your last payment for $123.45 on 2/4/13. We'd like you to be aware of the following charges:
      $12.34 Spuznitz, LLC on 4/1
      $43.21 1-800-FLOWERS on 4/2
As always, you can view these transactions in our portal. 
Thank you for your business!

Obviously some of the email contents will vary: the salutation ("John Smith"), "$123.45 on 2/4/13", and the lines with transactions printed out. Other parts ("We received your last payment") are very static. I want to be able to match the static portions of the text and verify that the dynamic portions are within certain reasonable limits (I might know that at most five transaction lines are ever printed, for example).

Because I am concerned about data exfiltration, I want to make sure email that doesn't match this template never goes out - I want to examine email and quarantine anything that doesn't look like what I expect. So I need to automate this template matching and block any email messages that are far enough away from matching.

So the question is: where do I look for a filtering mechanism? Bayesian filtering tries to verify a sufficient similarity between a specific message and a non-specific corpus, which is kind of the opposite problem. Things like Perl's Template module are a tight match, but for output, not for input or comparison. Simple 'diff'-type comparisons won't handle the limited dynamic info very well.

How do I test to see if these outgoing email messages "quack like a duck"?

I would go for the "longest common subsequence". A standard implementation can be found here:

http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Longest_common_subsequence
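Applied to the template question, the idea could look something like the rough sketch below (in C, to match the other code on this page; the function names, the threshold, and the whole scoring policy are illustrative assumptions, not a proven filter). It accepts a message only if the LCS covers almost all of the template, so that only the expected dynamic regions go unmatched:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* LCS length via the standard DP, kept to two heap-allocated rows so long
 * emails don't blow the stack. Returns 0 on allocation failure. */
static size_t lcs_length(const char *s, size_t n, const char *t, size_t m)
{
    size_t *prev = calloc(m + 1, sizeof *prev);
    size_t *curr = calloc(m + 1, sizeof *curr);
    size_t i, j, result = 0;

    if (prev && curr) {
        for (i = 1; i <= n; i++) {
            for (j = 1; j <= m; j++) {
                if (s[i - 1] == t[j - 1])
                    curr[j] = prev[j - 1] + 1;
                else
                    curr[j] = prev[j] > curr[j - 1] ? prev[j] : curr[j - 1];
            }
            memcpy(prev, curr, (m + 1) * sizeof *prev);
        }
        result = prev[m];
    }
    free(prev);
    free(curr);
    return result;
}

/* "Quacks like a duck" if at least min_ratio of the template's characters
 * appear, in order, in the outgoing message; the unmatched remainder is
 * assumed to be the dynamic part. A value like 0.95 is only a starting
 * point to tune against real traffic. */
static bool matches_template(const char *msg, const char *tmpl, double min_ratio)
{
    size_t n = strlen(msg), m = strlen(tmpl);
    if (m == 0)
        return true;
    return (double)lcs_length(msg, n, tmpl, m) / (double)m >= min_ratio;
}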

If you need a better algorithm and/or lots of additional ideas for inexact matching of strings, THE standard reference is this book:

http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198

Don't be fooled by the title. Computational biology is mostly about matching against large databases of long strings (also known as DNA sequences).

There have been numerous posts on string algorithms; however, no general literature was mentioned.

Could anyone recommend a book (or books) that thoroughly explores various string algorithms? A topic of special interest is approximate string matching [things like Google's corrected search-string suggestions :)].

Thanks a lot for advice.

I'm surprised no-one has mentioned Dan Gusfield's excellent book Algorithms on Strings, Trees and Sequences which covers string algorithms in more detail than anyone would probably need. It served me very well for a project on protein sequencing that I was working on a few years ago. After reading this book you will learn:

  • Naive String Matching
  • Preprocessing-based algorithms (Boyer-Moore, Knuth-Morris-Pratt)
  • Regex matching algorithms
  • Karp-Rabin and similar methods
  • Suffix tree methods (Ukkonen's method, etc)
  • Sequence alignment (Levenshtein distance and string similarity, and multiple sequence alignment)
  • Applications to DNA sequencing, gene prediction and other areas.

I am trying to search for the maximal number of substring repetitions inside a string, here are some few examples:

"AQMQMB" => QM (2x)
"AQMPQMB" => <nothing>
"AACABABCABCABCP" => A (2x), AB (2x), ABC (3x)

As you can see, I am searching for consecutive substrings only, and this seems to be a problem: all the compression algorithms I am aware of either don't care about consecutiveness (LZ*) or are too simple to handle consecutive patterns rather than single data items (RLE). I think suffix-tree-related algorithms are also not useful, due to the same problem.

I think there are some bioinformatics algorithms that can do this. Does anyone have an idea about such an algorithm?

Edit: In the second example there might be multiple possibilities of consecutive patterns (thanks to Eugen Rieck for the notice; read the comments below). However, in my use case any of these possibilities is acceptable.

Suffix tree related algorithms are useful here.

One is described in Algorithms on Strings, Trees and Sequences by Dan Gusfield (Chapter 9.6). It uses a combination of a divide-and-conquer approach and suffix trees, and has time complexity O(N log N + Z), where Z is the number of substring repetitions.

The same book describes a simpler O(N^2) algorithm for this problem, also using suffix trees.
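For small inputs, a naive scan is enough to pin down what counts as a repeat and to sanity-check a fast implementation. This rough C sketch (my own, not the book's algorithm; roughly O(N^3) in the worst case) reports, for every start position and period, how many consecutive copies follow:

#include <stdio.h>
#include <string.h>

/* Naive tandem-repeat scan: for every start position i and period p, count
 * how many consecutive copies of s[i..i+p-1] follow. Far too slow for real
 * data, but useful for testing the fast suffix-tree algorithms. */
static void report_repeats(const char *s)
{
    size_t n = strlen(s), i, p, k;

    for (i = 0; i < n; i++) {
        for (p = 1; i + 2 * p <= n; p++) {
            k = 0;
            while (i + (k + 2) * p <= n &&
                   memcmp(s + i, s + i + (k + 1) * p, p) == 0)
                k++;
            if (k > 0)
                printf("%.*s (%zux) at %zu\n", (int)p, s + i, k + 1, i);
        }
    }
}

int main(void)
{
    /* Prints A (2x), AB (2x), ABC (3x) from the question, plus overlapping
     * rotations and sub-repeats such as BCA (2x), CAB (2x), ABC (2x) at
     * later offsets. */
    report_repeats("AACABABCABCABCP");
    return 0;
}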

I have to make a Java program which finds all repeating substrings of length n in a given string. The input string is extremely long and a brute-force approach takes too much time.

What I already tried: presently I am finding each substring separately and checking for repetitions of that substring using the KMP algorithm. This too is taking too much time.

What is a more efficient approach for this problem?

1) You should look at using a suffix tree data structure.

Suffix Tree

This data structure can be built in O(N log N) time (I think even in O(N) time, using Ukkonen's algorithm), where N is the size/length of the input string. It then allows for solving many (otherwise) difficult tasks in O(M) time, where M is the size/length of the pattern.

So even though I didn't try your particular problem, I am pretty sure that if you use a suffix tree and a smart formulation of your problem, then the problem can be solved using a suffix tree (in reasonable O time).
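In fact, since the length n is fixed in your problem, there is also a simpler route that avoids suffix trees entirely (a different technique, sketched here in C rather than Java, with made-up names): slide a window of length n across the string and count occurrences of each window in a hash table, which is expected O(N * n) overall.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1048576u  /* a power of two; sized arbitrarily here */

struct Entry {
    const char *pos;     /* first occurrence of this length-n window */
    size_t count;
    struct Entry *next;
};

/* djb2 hash over one window of length n. */
static unsigned long hash_window(const char *s, size_t n)
{
    unsigned long h = 5381;
    while (n--)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Print every substring of length n that occurs at least twice: one hash
 * table probe per window, expected O(len * n) time in total. */
static void find_repeats(const char *s, size_t n)
{
    size_t len = strlen(s), i;
    struct Entry **table = calloc(TABLE_SIZE, sizeof *table);

    if (!table)
        return;
    for (i = 0; i + n <= len; i++) {
        unsigned long h = hash_window(s + i, n) & (TABLE_SIZE - 1);
        struct Entry *e = table[h];
        while (e && memcmp(e->pos, s + i, n) != 0)
            e = e->next;
        if (e) {
            if (++e->count == 2)        /* report on the second sighting */
                printf("repeated: %.*s\n", (int)n, s + i);
        } else {
            e = malloc(sizeof *e);
            if (!e)
                return;
            e->pos = s + i;
            e->count = 1;
            e->next = table[h];
            table[h] = e;
        }
    }
    /* freeing the table and entries is omitted for brevity */
}

int main(void)
{
    find_repeats("mississippi", 2);  /* prints: is, ss, si */
    return 0;
}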

2) A very good book on these (and related) subjects is this one:

Algorithms on Strings, Trees and Sequences

It's not really easy to read, though, unless you're well-trained in algorithms. But OK, reading such things is the only way to get well-trained ;)

3) I suggest you have a quick look at this algorithm too.

Aho-Corasick Algorithm

I am not sure, but this one might be somewhat off-topic with respect to your particular problem.

Requirement

  1. Currently we have a list containing more than ten thousand keywords or sentences (call this number N).
  2. The input is a long character string, of length L.

Question: check whether the character string contains any keyword or sentence from the given list.

The question can be described as the word filter problem on Wikipedia, but I didn't find any algorithm on that page. The simplest way to solve this problem is to iterate over all the keywords or sentences and each time check whether the long text contains that substring. As we have a large number of keywords, and considering the long text, the performance is very bad: it takes O(NL) time.

It seems a better solution should finish in O(L). Could anyone give some suggestions about this?

There are several approaches to this problem with time complexity O(M + L), where L is the length of the string and M is the combined length of all patterns:

  1. Aho–Corasick string matching algorithm.
  2. Construct a suffix tree for the string using Ukkonen's algorithm, then find the match for each pattern to this suffix tree.
  3. Construct a generalized suffix tree for set of patterns, then find the match between input string and this suffix tree.
  4. Construct a suffix array for the string, and use it to search for each pattern (a rough sketch of this approach follows below). This approach has time complexity O(M + L + N log L), where N is the number of patterns.
  5. Commentz-Walter algorithm.

You can find details for all these algorithms (except Commentz-Walter algorithm) in this book: Algorithms on Strings, Trees and Sequences by Dan Gusfield.
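As an illustration of approach 4, here is a deliberately naive C sketch (my own, using comparison-based sorting, so it does not achieve the bounds quoted above; the book covers the efficient constructions):

#include <stdlib.h>
#include <string.h>

/* The text being indexed; a global only to keep qsort's comparator simple. */
static const char *g_text;

static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(g_text + *(const int *)a, g_text + *(const int *)b);
}

/* Naive suffix array: sort the suffix start positions lexicographically. */
static int *build_suffix_array(const char *text, int len)
{
    int *sa = malloc(len * sizeof *sa);
    int i;

    if (!sa)
        return NULL;
    g_text = text;
    for (i = 0; i < len; i++)
        sa[i] = i;
    qsort(sa, len, sizeof *sa, cmp_suffix);
    return sa;
}

/* Binary search for one pattern as a prefix of some suffix:
 * O(|pattern| * log len) per query. */
static int sa_contains(const char *text, const int *sa, int len, const char *pat)
{
    int lo = 0, hi = len - 1;
    size_t m = strlen(pat);

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int c = strncmp(pat, text + sa[mid], m);
        if (c == 0)
            return 1;       /* pattern is a prefix of this suffix */
        if (c < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return 0;
}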


Several different (simpler) approaches may be used if you can unambiguously extract separate words/sentences from the input string.

  1. Prepare a Bloom filter with size, large enough to guarantee low probability of false positives for N patterns. Add to Bloom filter bits, determined by hash functions of keywords/sentences. Then scan the string, extracting consecutive words/sentences, and check if these words/sentences can be found in the Bloom filter. Only if a word/sentence is present in the Bloom filter, search it in the list of patterns. This algorithm has expected time complexity O(M + L) and is space-efficient.
  2. Insert all patterns into a hash set. Then scan the string, extracting consecutive words/sentences, and check if any of them is in the hash set. This has the same expected time complexity O(M + L), is simpler than the Bloom filter approach, but is not space-efficient.
  3. Insert all patterns into a radix tree (trie), then use it to search for the words/sentences extracted from the string. This is not very different from the generalized suffix tree approach, but it is simpler and performs better. It has worst-case time complexity O(M + L), is probably somewhat slower than the Bloom filter or hash set approaches, and may be made very space-efficient if necessary (a minimal sketch of this approach follows the list).
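Here is a minimal sketch of approach 3 in C (illustrative only: it assumes the patterns are single lowercase words, that the text is split on single spaces, and it skips allocation-failure handling; real input would need normalization first):

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define ALPHABET 26  /* lowercase a-z only, for illustration */

struct TrieNode {
    struct TrieNode *child[ALPHABET];
    bool terminal;            /* a whole pattern ends at this node */
};

static struct TrieNode *node_new(void)
{
    return calloc(1, sizeof(struct TrieNode));
}

static void trie_insert(struct TrieNode *root, const char *word)
{
    for (; *word; word++) {
        int c = *word - 'a';
        if (!root->child[c])
            root->child[c] = node_new();
        root = root->child[c];
    }
    root->terminal = true;
}

/* Walk the trie along one extracted word: O(word length) per lookup. */
static bool trie_contains(const struct TrieNode *root, const char *word,
                          size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        root = root->child[word[i] - 'a'];
        if (!root)
            return false;
    }
    return root->terminal;
}

/* Scan the text word by word: O(L) over the whole string. */
static bool text_contains_keyword(const struct TrieNode *root, const char *text)
{
    while (*text) {
        const char *end = strchr(text, ' ');
        size_t len = end ? (size_t)(end - text) : strlen(text);
        if (trie_contains(root, text, len))
            return true;
        if (!end)
            break;
        text = end + 1;
    }
    return false;
}

Build the trie once with trie_insert for each keyword, then call text_contains_keyword per input string; multi-word sentences would need the Aho-Corasick or generalized suffix tree approaches listed above.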

My platform here is Ruby - a webapp using Rails 3.2 in particular.

I'm trying to match objects (people) based on their ratings for certain items. People may rate all, some, or none of the same items as other people. Ratings are integers between 0 and 5. The number of items available to rate and the number of users can both be considered non-trivial.

A quick illustration:

[Image: data illustration]

The brute-force approach is to iterate through all people, accumulating rating differences for each shared item. In Ruby-flavoured pseudo-code:

matches = Hash.new(0)
(people - [user]).each do |person|
  person.ratings.each do |rating|
    if user.rated?(rating.item)   # both have rated this item
      matches[person.id] += (rating.value - user.rating_for(rating.item)).abs
    end
  end
end
# the lowest values in matches are the best matches for user

The problem here is that as the number of items, ratings, and people increases, this code will take a very significant time to run. Ignoring caching for now, this is code that has to run a lot, since this matching is the primary function of my app.

I'm open to cleverer algorithms and cleverer databases to achieve this, but doing it algorithmically, and thus being able to keep everything in MySQL or PostgreSQL, would make my life a lot easier. The only thing I'd say is that the data does need to persist.

If any more detail would help, please feel free to ask. Any assistance greatly appreciated!

Technically, your task is matching long strings made out of characters from a six-letter alphabet (the ratings 0 to 5). This kind of problem is researched extensively in the area of computational biology (typically with four-letter alphabets). If you do not know the book http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198, then you might want to get hold of a copy. IMHO this is THE standard book on fuzzy matching / scoring of sequences.
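To make that framing concrete, here is a small C sketch of the encoding step (my own illustration; the choice of letters and the sentinel for "not rated" are arbitrary). It turns a ratings vector into a string that any sequence-comparison routine (LCS, edit distance, alignment) can consume:

#include <stdlib.h>

/* Encode integer ratings 0..5 as the characters 'A'..'F', using '-' for
 * items the person has not rated, so two users' histories become strings
 * over a small alphabet that string algorithms can compare directly. */
static char *encode_ratings(const int *ratings, size_t num_items)
{
    char *s = malloc(num_items + 1);
    size_t i;

    if (!s)
        return NULL;
    for (i = 0; i < num_items; i++)
        s[i] = (ratings[i] >= 0 && ratings[i] <= 5)
                   ? (char)('A' + ratings[i])
                   : '-';
    s[num_items] = '\0';
    return s;
}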