Large-scale Genome Sequence Processing

Masahiro Kasahara, Shinichi Morishita

Mentioned 1

Efficient computer programs have made it possible to elucidate and analyze large-scale genomic sequences. Fundamental tasks, such as the assembly of numerous whole-genome shotgun fragments, the alignment of complementary DNA sequences with a long genome, and the design of gene-specific primers or oligomers, require efficient algorithms and state-of-the-art implementation techniques. This textbook emphasizes basic software implementation techniques for processing large-scale genome sequences and provides executable sample programs. Book jacket.

More on Amazon.com

Mentioned in questions and answers.

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?


Preliminary

I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.

At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.

Details

I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).

The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.

Use Cases

Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.

Research

Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.

In particular, there are the following things.

First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn’t trivial).

Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).

Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.

To recap: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?

You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.

Strings comprised of the four nucleotide letters A, C, G, and T can be specially encoded into Integers for much faster processing. Radix sort is among many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.