4.2 SORTING
This section under major construction.
Some of this material is adapated directly from Chapter 6, Algorithms in Java, 3rd edition, by Robert Sedgewick, Addison Wesley.
Sorting. The sorting problem is to rearrange a set of items in ascending order. A prime reason why sorting is so useful is that it is much easier to search for an element in a sorted list than an unsorted one. Given someone's first and last name, it is easy to find their phone number using a phone book since the entries are sorted by last name. An iPod organizes MP3 files by artist name or song title, Google display search results in descending order of "importance", a spreadsheet displays columns sorted by a particular field, Matlab sorts the real eigenvalues of a symmetric matrix in descending order. Once a list is sorted, it is also easier to perform other tasks: find the median element, the closest pair, statistical outliers, remove duplicates in a mailing list, or detect double registrations in a voting precinct. Sorting also arises as a critical subroutine in many applications that appear to have nothing to do with sorting at all including: data compression (see the Burrows-Wheeler programming assignment), computer graphics (convex hull, closest pair), computational biology (longest common substring discussed below), supply chain management (schedule jobs to minimize weighted sum of completion times), combinatorial optimization (Kruskal's algorithm), social choice and voting (Kendall's tau distance), Historically, sorting was most important for commercial applications, but sorting also plays a major role in the scientific computing infrastructure. NASA and the fluids mechanics community use sorting to study problems in rarefied flow; these collision detection problems are especially challenging since they involve ten of billions of particles and can only be solved on supercomputers in parallel. Similar sorting techniques are used in some fast N-body simulation codes. Another important scientific application of sorting is for load balancing the processors of a parallel supercomputers. Scientists rely on clever sorting algorithm to perform load-balancing on such systems. For these reasons and others, the quicksort algorithm was honored as one of top 10 algorithms for science and engineering of 20th century.
Rules of the game. We shall begin by considering methods that sort arrays of real numbers, although the same ideas apply to sorting integers, strings, dates, or any other entity where pairwise comparison is a well-defined operation. (See below.) We use two basic abstractions when sorting: less and exch. The less function takes two inputs and return true if the first input is less than the second one, and false otherwise. For floating point numbers, this function is a one-liner. The exch functions takes an array a and two indices i and j and exchanges a[i] with a[j]. This is the classic swapping idiom that we have encountered before.
static boolean less(double x, double y) { return (x
Insertion sort. The method that people often use to sort bridge hands is to consider the cards one at a time, inserting each into its proper place among those already considered (keeping them sorted). In a computer implementation, we need to make space for the element being inserted by moving larger elements one position to the right, and then inserting the element into the vacated position. Program InsertionSort.java implements this method, which is known as insertion sort.
static void insertionSort(double[] a) { int N = a.length; for (int i = 0; i0; j--) if (less(a[j], a[j-1])) exch(a, j, j-1); else break; }
The program maintains the invariant that the elements to the left of the current index i are in sorted order during the sort, but they are not necessarily in their final position, as they may have to be moved to make room for smaller elements encountered later. The array is, however, fully sorted when the index reaches the right end. The figure below illustrates the method in operation on a sample file.
|
Analysis of insertion sort. We first perform an empirical analysis. Then we validate our observations using a mathematical analysis.
- Empirical analysis.
We test out insertion sort by performing some computational experiments.
We analyze how long our program takes to sort N random numbers between 0 and 1.
We modify our program to explicitly count the number of comparisons and
exchanges. We also use the -Xprof option to measure the number
of seconds it takes to execute.
% java -Xprof InsertionSort 50000 [wayne:bolle] /csinfo/www/introcs/42sort> java -Xprof InsertionSort 50000 Comparisons: 625107306 Exchanges: xxx Flat profile of 148.61 secs (4353 total ticks): main Compiled + native Method 99.6% 4335 + 0 InsertionSort.insertionSort 0.0% 0 + 2 java.util.Random.next 99.6% 4335 + 2 Total compiledFrom this output, we conclude that 99% of the time in insertion sort is spent in our less and exch functions. To improve performance, we have two main choices (i) find a way to implement less and exch more efficiently or (ii) find a sorting method that calls less and exch substantially fewer times. Optimizing the time it takes to read in the data will have little effect because this operations does not consume a substantial fraction of the overall running time.
- Mathematical analysis. Insertion sort uses aboult N^2 / 4 comparisons and N^2 / 4 exchanges on average, and twice that many at worst. The unshaded letters correspond in the figure above correspond to comparisons. In the worst case (a reverse sorted input), all of the elements below the diagonal (N^2 / 2) must be compared and exchanged. For a random input, we expect that each element will move halfway to the right, so we only count one-half the elements below the diagonal (roughly N^2 / 4). This supports the hypothesis made from our empirical analysis. If the file is already sorted (or almost sorted), insertion sort runs in linear time. This is important in some applications where we need to re-sort a file, after modifying or inserting a few elements.
Quicksort. Quicksort is a divide-and-conquer method for sorting. It works by partitioning an array of elements into two parts, then sorting the parts independently. As we shall see, the precise position of the partition depends on the initial order of the elements in the input file. The crux of the method is the partitioning process, which rearranges the array to make the following three conditions hold:
- The element a[i] is in its final place in the array for i.
- None of the elements a[left], ..., a[i-1] is greater than a[i].
- None of the elements in a[i+1], ..., a[right] is less than a[i].
We achieve a complete sort by partitioning, then recursively applying the method to the subfiles, as depicted below.
public static void quicksort(double[] a) { quicksort(a, 0, a.length - 1); } private static void quicksort(double[] a, int left, int right) { if (right <= left) return; int i=partition(a, left, right); quicksort(a, left, i-1); quicksort(a, i+1, right); }
We use the following general strategy to implement partitioning. First, we arbitrarily choose a[right] to be the partitioning element - the one that will go into its final position. Next, we scan from the left end of the array until we find an element greater than the partitioning element, and we scan from the right end of the array until we find an element less than the partitioning element. The two elements that stopped the scans are obviously out of place in the final partitioned array, so we exchange them. Continuing in this way, we ensure that no array elements to the left of the left index are greater than the partitioning element, and no array elements to the right of the right index are less than the partitioning element, as depicted in the following diagram.
When the scan indices cross, all that we need to do to complete the partitioning process is to exchange a[right] with the leftmost element of the right subfile (the element pointed to by the left index i). The code below implements this idea.
This PowerPoint demo shows the results of partioning on the L in QUICKSORTISCOOL. The following illustrates the results of partioning on the rightmost E of the input the input ASORTINGEXAMPLE.
static int partition(double[] a, int left, int right) { int i = left - 1; int j = right; while(true) { while (less(a[++i], a[right])) // left pointer if (i == right) break; while (less(a[right], a[--j])) // right pointer if (j == left) break; if (i >= j) break; // pointers cross? exch(a, i, j); // swap left and right } exch(a, i, right); // swap pivot return i; }
Program QuickSort.java takes a command
line input N, reads in N strings from standard input, and quicksorts them.
Here are the results of the full algorithm on the input ASORTINGEXAMPLE.
|
Analysis of quicksort. Quicksort is more difficult to analyze mathematically than the programs we've seen so far because the program is recursive and the running time can vary greatly depending on the input. Nevertheless, the performance of quicksort is well-understood. On average, quicksort uses about 2N ln N comparisons. This analysis assumes that the input has been randomly shuffled prior to sorting. If the input is degenerate (e.g., sorted or reverse sorted) then quicksort can take as many as N^2/2 comparisons.
To gain some intuition as to where the N log N terms comes from, let's assume that N is a power of 2 and that each partitioning step of quicksort divides the file exactly in half. Let T(n) denote the number of comparisions to quicksort an input of size n. Partioning a file of n elements requires n comparisons (possibly n+1 or n+2). By our optimisitc assumption T(n) = 2T(n/2) + n. Clearly T(1) = 0. Now T(2) = 2, T(4) = 8, T(8) = 24, T(16) = 64, T(32) = 160, and so forth. In general T(N) = N lg N. We can visualize this recurrence with a recursion tree. At level i (starting at i = 0), we sort k = 2^i subfiles of size N/k. Thus each level requires N comparisons. The depth of the tree is lg N. This is a total of N lg N.
This intuition undercounts the number of comparisons because we will not get a perfectly even split.
Binary search.
Searching is a fundamental computational problem. For example, suppose we want to find the phone number of a classmate and we have access to a phone book which lists the names of each student in alphabetical order along with their phone numbers. A naive solution would be to examine each entry one at a time, and see if it matches the name of the classmate. A much more efficient solution exploits the phone book's ordering. We can open up to the middle of the phone book. If the name of the classmate appears on that page, we are done. Otherwise, we can eliminate half of the phone book from further consideration and repeat.The recursive phone book strategy is known as binary search. Given a sorted array a and an element key, our task is to locate that element within the array, if it exists. It is based on the idea that if the elements in the table are in ascending order, we can eliminate half of them from consideration by comparing the one that we seek with the one at the middle position of the table. If it is equal, we have a successful search. If it is less, we apply the same method to the left half of the table. If it is greater, we apply the same method to the right half of the table. To implement this strategy, we maintain two indices left and right that mark the range of the array where the element might occur. If the range is empty, we return -1 to indicate an unsuccessful search. Otherwise, we examine the element mid in the middle of the range and compare it to key. Depending on the outcome, we either (i) return the index of the matching element, (ii) update the right endpoint of the range, or (iii) update the left half of the range.
public static int binarySearch(long[] a, long key) { int bot = 0; int top = a.length - 1; while (bot <= top) { int mid=(bot + top) / 2; if (key < a[mid]) top=mid - 1; else if (key> a[mid]) bot = mid + 1; else return mid; } return -1; }
// precondition array a in ascending order public static int binarySearch(long[] a, long key) { int bot = -1; int top = a.length; while (top - bot > 1) { int mid = (bot + top) / 2; if (key > a[mid]) bot = mid; else top = mid; } if (a[top] == key) return top; else return -top - 1; }
Correctness / program verification.
In the early 1970s, Dijkstra championed the approach
of developing a program
along with a proof of correctness.
Deceptively simple program and surprisingly hard to
implement correctly.
Bentley: asked professional programmers to code up the
description of binary search in their favorite language.
After a couple hours, most programmers believed they
had written correct code. After 30 minutes of analysis
and review, 90% had bugs.
Knuth 6.2.1: first binary search algorithm was published in 1946,
but first published binary search without bugs did not
appear until 1962.
Second version: if key appears, top is the index
of the leftmost occurrence; otherwise top is
the index of the leftmost element greater than key.
While loop invariant says top >= bot + 2. This
implies bot Running time.
Now we analyze the running time of the binary seach algorithm.
We choose as our basic operation the number of key comparisons.
Each time through the while loop, we decrease the length
of the interval by roughly a factor of 2.
If the array has N elements,
then after one iteration the range consists of roughly
N/2 elements, after two iterations it is roughly N/4,
and after three it is roughly N/8.
We say "roughly" because the range actually decreases from
n to n/2 where we are using integer division.
This repeated shrinking can happen at most log N times,
since this is the number of iterations it takes to drive N down
to 1.
Since we do only a constant amount of work in the body
of the while loop, we obtain a running time of Θ(log N).
This analysis implies that we can search truly huge data sets. If N = 1 million,
we need only 20 comparisons. If N = 1 billion, we need only 30.
If N = 1 trillion, we need only 40.
Sorting generic objects.
In many applications, we are interested in sorting non-numeric data,
including strings and dates.
We assume that we can compare two elements to see whether
the first is bigger than, smaller than, or equal to the second. To help
enforce this assumption, we use Java's Comparable interface
and engineer our sort method so that it sorts
arrays of Comparable objects.
By implementing the Comparable interface, the data type
promises to implement a method compareTo so that
that a.compareTo(b) returns a negative integer if a b, and zero if they are equal. Although the compiler
does not check, the implementation of compareTo must be
consistent, e.g, if a > b, then b
We use two basic abstractions when sorting: less and exch.
The less function takes two inputs and return true if
the first input is "less" than the second one, and false
otherwise. This function is easy to code since the elements come
with a compareTo method.
The exch functions takes an array a
and two indices i and j and swap a[i] and
a[j]. This is the classic swapping idiom that we have
encountered before.
Using these abstractions, the body of insertion sort is
identical to the code for sorting integers. The only
difference is that the function takes an array of type
Comparable instead of int.
Java library.
The Java library Arrays contains static methods for manipulating
arrays, including sorting and binary search, but not shuffling or reversing.
The function Arrays.sort is applicable when sorting
primitive types or objects.
The algorithm for sorting strings (and other objects)
is a tuned version of mergesort.
It is guaranteed to be stable and run in time proportional to N log N.
Program SystemSort.java illustrates
how to use it.
The algorithm for primitive types is
a variant of quicksort developed by Bentley and McIlroy. It is
extremely efficient for most inputs that arise in practice, including
inputs that are already sorted. However, using a clever techinique
described by M. D. McIlroy in
A Killer Adversary
for Quicksort, it is possible to construct pathological inputs
that make the system sort run in quadratic time. Even worse, it is
blows the function call stack. To see Java's sorting
library break, here are some killer inputs of varying sizes:
10,000,
20,000,
50,000,
100,000,
250,000,
500,000, and
1,000,000.
You can test them out using the program
IntegerSort.java which takes
a command line input N, reads in N integers from standard input,
and sorts them using the system sort.
Stock market symbols.
Program MarketSymbol.java
reads in a database of market symbols associated with equities listed
on the NYSE, NASDAQ, and other exchanges.
Then it prompts the user to enter market symbols and prints out
the corresponding equity using binary search.
It uses the file
mktsymbols.txt, which contains
127,482 entries.
The data file stores the entries in alphabetical order, sorted by
symbol name.
Longest repeated substring.
An application of sorting to computational biology and plagiarism
detection.
Use system sort to suffix sort an input string and find longest repeated substring.
Applications: computational biology, plagiarism detection, SCO lawsuit
that Linux contains copies of their code.
Program LRS.java uses
String.substring to build the array of substrings and
Arrays.sort to sort them. In Java, this is not
terribly inefficient since each
call to substring reuses the characters from the original String.
However, it uses a substantial amount of memory because each of the N
strings formed uses 16 bytes (the character array itself is reused and
consumes 12 + 2N bytes). Exercise XYZ suggests a more memory efficient
solution.
Q. Why doesn't the Java library use a randomized version of quicksort?
A. Good question. At the very least, the library should cutoff to
some guaranteed N log N algorithm if it "realizes" it is in trouble.
Programmers may want their libraries to be deterministic
for debugging. But the library only uses quicksort for primitive types
when stability is not an issue, so the programmer probably wouldn't notice the
randomness, except in running time.
Answer: still correct (because partitioning element
acts as a sentinel) and a little faster.
Answer: array out-of-bounds possible if partitioning
element is smallest.
Answer: avoid directly comparing all n characters
if s and t are references to the same string.
Solution 1: divide the file into N/k pieces of size k, and
sort each piece in O(k log k) time, say using mergesort.
Note that this preserves the property that no element is more
than k elements out of position.
Now, merge each blocks of k elements with the block to its left.
Solution 2: insert the first k elements into a binary
heap. Insert the next element from the array into the heap,
and delete the minimum element from the heap. Repeat.
Next, read in some test points, testing, one at a time, whether or not
each is inside the polygon, using the following method. If a horizontal
line were drawn through the given test point, it would hit one line from the
left part of the polygon and one line from the right part of the polygon.
The point is inside the polygon if and only if it falls between those
two lines. But from the way that the polygon is stored (the y coordinates
of each part are sorted), the two lines can be found in logarithmic
time using binary search. After the two lines are found, it is a simple
matter to test whether the point falls between them.
public interface Comparable {
public int compareTo(Object b);
public class Rational implements Comparable {
private long num, den;
...
public int compareTo(Object o) {
Rational a = this;
Rational b = (Rational) o;
if (a.num * b.den
static boolean less(Comparable x, Comparable y) {
return (x.compareTo(y) <0); } static void exch(comparable[] a, int i, int j) { comparable swap=a[i]; a[i]=a[j]; a[j]=swap; }
We can also run this program on the built-in String data
type since it implements the Comparable interface.
We use the data file
dickens.txt (from Project
Gutenberg).
The file contains most of the writing of Charles Dickens and comprises
over 5.5 million words and 31 million characters. We will see how
long our program takes to read in the first N words and sort them.
static void insertionSort(Comparable[] a) {
int N = a.length;
for (int i = 0; i Q+A
Exercises
if (right <= left) return;
if (i == right) break;
if (j == left) break;
if (i >= j) break;
exch(a, i, right);
public int compareTo(Object x) {
String s = this;
String t = (String) x;
if (s == t) return 0;
int n = Math.min(s.length(), t.length());
for (int i = 0; i
public static String removeDuplicates(String[] duplicates) {
Arrays.sort(duplicates);
int N = 0;
for (int i = 0; i Creative Exercises
java Quicksort
