The Art of Computer Programming: Sorting and Searching

Donald Ervin Knuth


Donald Knuth is Professor Emeritus of The Art of Computer Programming at Stanford University and is known worldwide as the creator of the TeX typesetting system. Here he presents the third volume of his guide to computer programming.


Mentioned in questions and answers.

Note: Please don't interpret this as a homework question. This is just something I'm curious to know :)

The median of five is sometimes used as an exercise in algorithm design and is known to be computable using only 6 comparisons.

What is the best way to implement this "median of five using 6 comparisons" in C#? All of my attempts seem to result in awkward code :( I need nice and readable code while still using only 6 comparisons.

public double medianOfFive(double a, double b, double c, double d, double e){
    //
    // return median
    //
    return c;
}

Note: I think I should provide the "algorithm" here too:

I found myself unable to explain the algorithm as clearly as Azereal did in his forum post, so I will reference his post here. From http://www.ocf.berkeley.edu/~wwu/cgi-bin/yabb/YaBB.cgi?board=riddles_cs;action=display;num=1061827085

Well, I was posed this problem in one of my assignments and I turned to this forum for help, but no help was to be found here. I eventually found out how to do it.

  1. Start as in a mergesort: take the first 4 elements and order each pair (2 comparisons)

  2. Compare the lower elements of the two pairs and eliminate the lower one from the possibilities (3 comparisons)

  3. Pair the 5th number, which was set aside, with the element that lost its partner and compare the two (4 comparisons)

  4. Compare the lower elements of the two new pairs and eliminate the lower one (5 comparisons)

  5. Compare the element left by itself with the lower element of the last pair; the lower of the two is the median (6 comparisons)

    The possible medians are within the parentheses

(54321)

5:4 3:2 2 comparisons

(4<5 2<3 1)

4:2 3 comparisons

2(4<5 3 1)

1:3 4 comparisons

2(4<5 1<3)

4:1 5 comparisons

1,2(4<5 3)

4:3 6 comparisons

1,2(3)4,5

Three is the median

EDIT: As you requested, and to prevent myself from getting more downvotes, this is the C++ code I wrote to find the median of five. Don't mind its awkwardness:

double StageGenerator::MedianOfFive(double n1, double n2, double n3, double n4, double n5){
    double *a = &n1, *b = &n2, *c = &n3, *d = &n4, *e = &n5;
    double *tmp;

    // order each pair: *a <= *b and *c <= *d
    if(*b < *a){
        tmp = a; a = b; b = tmp;
    }

    if(*d < *c){
        tmp = c; c = d; d = tmp;
    }

    // eliminate the lower of the two pair minima (it cannot be the median);
    // afterwards the surviving pair is in (c, d) and the element that lost
    // its partner is in b
    if(*c < *a){
        tmp = b; b = d; d = tmp;
        c = a;
    }

    // bring e in to replace the eliminated element
    a = e;

    // order the new pair: *a <= *b
    if(*b < *a){
        tmp = a; a = b; b = tmp;
    }

    // eliminate the lower of the two pair minima again;
    // remaining candidates: a (lower of the surviving pair) and d (the unpaired element)
    if(*a < *c){
        tmp = b; b = d; d = tmp;
        a = c;
    }

    // the median is the smaller of the two remaining candidates
    if(*d < *a)
        return *d;
    else
        return *a;

}

It could be more compact, couldn't it?

EDIT:

As @pablito pointed out in his answer, the built-in List.Sort() cannot fulfill this requirement, since it uses up to 13 comparisons :]

For completeness, the question is a special case of a sorting network, which Knuth (The Art of Computer Programming, vol. 3) covers in great detail. The classic paper by K. E. Batcher on the subject is brief and worth reading.
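Here is one way the five-step procedure above might look in C#. This is just a sketch of the same idea, not the canonical network from Knuth: it uses exactly six comparisons, value-tuple swaps (C# 7+) to keep the bookkeeping readable, and the signature from the stub in the question.

public static double MedianOfFive(double a, double b, double c, double d, double e)
{
    // 1st and 2nd comparisons: order each of the first two pairs
    if (b < a) (a, b) = (b, a);          // now a <= b
    if (d < c) (c, d) = (d, c);          // now c <= d

    // 3rd comparison: drop the smaller of the two pair minima -- it cannot be the median.
    // Keep the surviving pair in (a, b) and the element that lost its partner in c.
    if (c < a)
        c = d;                           // c is out; d is now the unpaired element
    else
        (a, b, c) = (c, d, b);           // a is out; b is now the unpaired element

    // 4th comparison: pair the unpaired element with e and order that pair
    if (e < c) (c, e) = (e, c);          // now c <= e

    // 5th comparison: drop the smaller of the two pair minima once more.
    // 6th comparison: the median is the smaller of the last pair's minimum
    // and the element left without a partner.
    if (c < a)
        return e < a ? e : a;            // c is out; candidates are a and e
    else
        return b < c ? b : c;            // a is out; candidates are b and c
}

With an older compiler the tuple swaps can be replaced by a temporary variable; the comparison count stays at six.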

So let's say you want to learn some stuff about database internals. What's the best source code to look at? The best books to buy?

I was talking about this with a buddy the other day and he recommended:
Art of Computer Programming, Volume 3: Sorting and Searching

What other books would help me learn about all the file I/O and memory issues, pages, locking, etc.?

A colleague and I got a great deal of information out of Database in Depth: Relational Theory for Practitioners. Very low-level stuff, but it sounds like that is the sort of thing you are looking for.

Take a look at Database Systems: The Complete Book by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. It is specifically about the internals of the DBMS.

The answer by SquareCog also contains sensible suggestions; I've not looked at the two books mentioned (though the Stonebraker "Architecture" book is only 136 pages according to Amazon, which seems a tad lightweight).

Textbook: Database Management Systems by Ramakrishnan and Gehrke.

Or: Architecture of a Database System by Hellerstein, Stonebraker, and Hamilton.

Production Code: PostgreSQL

(I like the PG code better than SQLite; it's far more complete and, I think, better organized. SQLite is awesome for what it does, but there is a lot it doesn't take on.)

Extra Credit: Readings in Database Systems, 4th edition edited by Hellerstein.

Not everybody likes his style, but I find that Joe Celko does a fine job of explaining the set-based logic that drives SQL databases. If you already have a little SQL experience under your belt, you should read SQL for Smarties.

In-depth information about internals is database-specific; here's a source on SQL Server 2008: http://www.amazon.com/Microsoft%C2%AE-SQL-Server%C2%AE-2008-Internals/dp/0735626243

I'm trying to work out how to efficiently sort a huge dataset that won't fit in memory. The obvious answer at a high level is to sort a whole bunch of chunks that do fit in memory using some standard algorithm, write these out to disk, and then merge them. Merging them is the problem.

Let's say the data divides up into C chunks, so I have C files to merge. If I do a C-way merge in one pass, then technically I have an O(N^2) algorithm, though one that only has to perform O(N) writes to disk. If I iteratively merge them into C/2 files, then C/4 files, etc. then I have an O(N log N) algorithm, but one that has to perform O(N log N) writes to disk, and therefore has a huge constant term.

What is the typical solution to this conundrum? Is there any good one?

Why aren't you using the algorithms in http://www.amazon.com/Art-Computer-Programming-Sorting-Searching/dp/0201896850 ?

They're quite good, and carefully explained.
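For what it's worth, the usual way out of this dilemma is a single C-way merge driven by a min-heap (Knuth presents the same idea with selection/loser trees): choosing the next output element costs O(log C) comparisons instead of O(C), so the merge pass is O(N log C) with only O(N) disk writes. A minimal C# sketch, assuming .NET 6+ for PriorityQueue and hiding the chunk-file I/O behind plain IEnumerator<long> cursors:

using System.Collections.Generic;

// Merge C sorted runs in one pass. The heap holds one cursor per run,
// keyed on that run's current head value.
public static IEnumerable<long> Merge(IEnumerable<IEnumerator<long>> runs)
{
    var heap = new PriorityQueue<IEnumerator<long>, long>();

    foreach (var run in runs)
        if (run.MoveNext())                 // prime each run with its first value
            heap.Enqueue(run, run.Current);

    while (heap.Count > 0)
    {
        var run = heap.Dequeue();           // run with the smallest head value
        yield return run.Current;           // emit it (i.e., write it to the output file)
        if (run.MoveNext())                 // refill the heap from the same run
            heap.Enqueue(run, run.Current);
    }
}

In practice each IEnumerator<long> would wrap a buffered reader over one chunk file, so only C read buffers live in memory at once.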

What is the most memory efficient way to remove duplicate lines in a large text file using C++?

Let me clarify: I'm not asking for code, just the best method. The duplicate lines are not guaranteed to be adjacent. I realize that an approach optimized for minimal memory usage would result in slower speeds; however, that is my restriction, as the files are far too large.

Why not just consult Knuth, Sorting and Searching? That will give you a great background for making a balanced decision.
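One Knuth-flavoured way to do it with minimal memory: external-sort the file (sort chunks that fit in RAM, then do a multiway merge like the sketch under the external-sorting question above), after which duplicates are adjacent and can be dropped in a single streaming pass that holds only the previous line in memory. The question asks about C++, but the idea is language-neutral; here is a C# sketch of that final pass, with made-up file paths:

using System.IO;

static void DropAdjacentDuplicates(string sortedPath, string outputPath)
{
    using var reader = new StreamReader(sortedPath);
    using var writer = new StreamWriter(outputPath);

    string previous = null;
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (line != previous)       // duplicates are adjacent after sorting
            writer.WriteLine(line);
        previous = line;
    }
}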

Sometimes interviewers ask how to sort a million/billion 32-bit integers (e.g. here and here). I guess they expect the candidates to compare an O(N log N) sort with radix sort. For a million integers an O(N log N) sort is probably better, but for a billion they are probably about the same. Does that make sense?

I expect they're looking for you to expand on the difference between internal sorting and external sorting. Apparently people don't read Knuth nowadays.
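To make the comparison concrete, this is the kind of thing the question may be fishing for: a least-significant-digit radix sort handles 32-bit keys in four linear counting-sort passes, so its cost grows like N rather than N log N. Whether it actually beats a tuned comparison sort depends on memory bandwidth, and once the data no longer fits in RAM the internal-versus-external question dominates anyway. A C# sketch, assuming unsigned keys:

static void RadixSortUInt32(uint[] keys)
{
    var buffer = new uint[keys.Length];

    for (int shift = 0; shift < 32; shift += 8)     // one pass per byte, 4 passes total
    {
        var count = new int[257];
        foreach (uint x in keys)                    // histogram of the current byte
            count[((x >> shift) & 0xFF) + 1]++;
        for (int i = 1; i < 257; i++)               // prefix sums -> starting offsets
            count[i] += count[i - 1];
        foreach (uint x in keys)                    // stable scatter into the buffer
            buffer[count[(x >> shift) & 0xFF]++] = x;

        (keys, buffer) = (buffer, keys);            // swap roles for the next pass
    }
    // After an even number of passes the fully sorted data is back in the caller's array.
}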

I've been looking for an answer for a while, and I was wondering if anyone knows how to define a hash function explicitly (from scratch) in C#. I understand that there are predefined data structures with list-like abilities, but I'm trying to understand the underlying structure of these objects.

Can anyone help with this? For example, if you have two arrays, how would you be able to create a hash table from this?

The Wikipedia articles on hash tables and hash functions are ridiculously good. You could also check out Volume 3 of TAOCP. Lastly, a peek in Reflector at System.Collections.Hashtable would probably be an edifying experience.

If you have more specific questions, we can probably provide more detailed insight.

To be able to hash two arrays, it sort of depends on what kind of data the arrays are holding. For example, if you mean to hash arrays of strings, you'll need an extra step to encode the strings as integers before hashing them, because what you essentially need is to convert your input into an index into your hash table, given a hash function. Of course, when you are 'converting your input' you have to solve the problem of collisions among your keys. In other words, you have to try to minimize the number of keys that hash to the same value, which is why number theory (specifically, the use of prime numbers) becomes particularly useful.

I assume that when you ask how to 'create a hash table' from two arrays, you mean for the data in the two arrays to be the keys of the table. In that case, I can't see why you would need to refer to two arrays instead of one bigger array containing the elements of both, unless you are dealing with a statically typed programming language and the two arrays have different types. Either way, you will need to come up with a scheme to convert the elements into integers. If you want to convert a string s of length n, where s[i] is the ith character in the string (referring to its ASCII value), into an integer, you can look at how Java's hashCode() function does the job. It essentially evaluates a polynomial with a prime base so as to avoid having different strings hash to the same integer. And the reason the base is 31, other than the fact that it is prime, is that multiplying by 31 is close to multiplying by a power of 2, so it can be done cheaply with a bit shift and a subtraction.
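A C# sketch of that polynomial idea (the shape of Java's hashCode(), not a copy of its implementation):

// h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], evaluated Horner-style
static int PolynomialHash(string s)
{
    int h = 0;
    foreach (char ch in s)
        h = unchecked(h * 31 + ch);   // h * 31 == (h << 5) - h, hence the cheap multiply
    return h;
}

The resulting integer still has to be reduced to a table index, which is where the methods below come in.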

Once you have all your elements in the form of integers, you get to the real meat of the problem. There are basic tricks from which we can build elaborate combinations, provided that the components of the combinations are relatively prime to one another, so that the combination can reach the entire hash table. Two of the basic methods are the division method and the multiplication method. These methods on their own, especially the division method, are insufficient because someone could eventually determine the function that hashed your keys. I would try to explain common ways of constructing hash functions, but I probably wouldn't explain them nearly as well as CLRS. On the other hand, I can give you a list of properties to satisfy:

  • You should minimize collisions
  • Your function should be able to hash a key to any slot (so as not to violate simple uniform hashing)
  • Your function should be hard to reverse-engineer

Bear in mind that in order to satisfy that last property, some component(s) of the hash function should typically make the slots that keys map to appear random. Because of this constraint we will still have a non-zero probability of a collision, so your next problem to solve would be collision resolution.
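To tie it back to the original two-arrays question, here is a deliberately small, purely illustrative C# sketch: it builds a table from two parallel arrays (keys in one, values in the other), reduces each key's hash code to a slot with the division method over a prime table size, and resolves collisions by chaining. The class name, the string/int element types, and the table size 97 are all invented for the example; in real code you would reach for Dictionary<TKey, TValue>.

using System.Collections.Generic;

class SimpleHashTable
{
    private readonly LinkedList<(string, int)>[] buckets;

    // keys and values are assumed to be parallel arrays of equal length
    public SimpleHashTable(string[] keys, int[] values, int size = 97)   // 97 is prime
    {
        buckets = new LinkedList<(string, int)>[size];
        for (int i = 0; i < keys.Length; i++)
            Add(keys[i], values[i]);
    }

    private int Slot(string key)
    {
        int h = key.GetHashCode() & int.MaxValue;   // clear the sign bit
        return h % buckets.Length;                  // division method: hash mod table size
    }

    public void Add(string key, int value)
    {
        int slot = Slot(key);
        buckets[slot] ??= new LinkedList<(string, int)>();
        buckets[slot].AddLast((key, value));        // colliding keys chain in the same bucket
    }

    public bool TryGet(string key, out int value)
    {
        int slot = Slot(key);
        if (buckets[slot] != null)
            foreach (var (k, v) in buckets[slot])
                if (k == key) { value = v; return true; }
        value = 0;
        return false;
    }
}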

Given 10 files, each containing 1 million integers in sorted order, and physical memory with room for 3 million integers, please suggest methods to efficiently extract 1 million integers in sorted form.

If you want to dig into the details, there's a nice book from Knuth: Sorting and Searching. It's the classic text on this topic and gives an understanding of algorithms that are still being used today.

On Stack Overflow, there was a really nice discussion on how the MapReduce algorithm works.

Or this discussion from programmers.stackexchange.com on sorting algorithms with limited resources.
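Concretely, this is the same multiway merge sketched under the external-sorting question above, specialized to this setup: keep one cursor per file in a 10-element min-heap and stop after the first million values have been written, so only ten integers (plus the file read buffers) need to be held at any time. A C# sketch, assuming .NET 6+ for PriorityQueue, a made-up one-integer-per-line file format, and reading the question as asking for the smallest million:

using System;
using System.Collections.Generic;
using System.IO;

static void SmallestMillion(string[] sortedFiles, string outputPath, int take = 1_000_000)
{
    var readers = Array.ConvertAll(sortedFiles, f => new StreamReader(f));
    try
    {
        var heap = new PriorityQueue<StreamReader, int>();
        foreach (var r in readers)                       // prime the heap with each file's head
        {
            string line = r.ReadLine();
            if (line != null) heap.Enqueue(r, int.Parse(line));
        }

        using var writer = new StreamWriter(outputPath);
        while (take-- > 0 && heap.TryDequeue(out var r, out int value))
        {
            writer.WriteLine(value);                     // emit the smallest remaining value
            string line = r.ReadLine();                  // advance that file and re-enter the heap
            if (line != null) heap.Enqueue(r, int.Parse(line));
        }
    }
    finally
    {
        foreach (var r in readers) r.Dispose();
    }
}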