Computer Science Building, Princeton University


Introduction to CS


1.  A Simple Machine

2.  Java Programming

3.  OOP

4.  Data Structures

    • Performance

    • Sorting

    • Linked Structures

    • Stacks and Queues

    • Multisets

    • Priority Queues

    • Symbol Tables

    • Graphs

5.  A Computing Machine

6.  Building a Computer

7.  Theory of Computation

8.  Systems

9.  Scientific Computation

10.  Perspective


 Lecture Notes

Assignments

FAQ









4.7 SYMBOL TABLES


This section under major construction.

Symbol tables. A symbol table is a data type that associates a value with a key. The client can store (put) an entry by specifying the key-value pair, and can retrieve (get) a value corresponding to a particular key. As an example, we might associate a student's name and grades (the value) with their social security number (the key). The Registrar might access a student's records by specifying their social security number. Or we might associate a web page address (e.g., www.amazon.com) with the corresponding IP address (208.216.181.15). To avoid confusion between the key and the value, we will always assume that the key is a String, whereas the value can be any Object. One way to think of a symbol table is as an associative array. We can store information in a Java array by specifying its index. An associate array is similar, but instead of using an integer for the index, we use a string. This flexibility is useful in many application domains. Since it is not built directly into the Java language, we must create a data type to support it (or learn how to use one of Java's many libraries).

Applications. Telephone numbers and names, DNS lookup, students and social security numbers, spell checking, associative arrays, web page counter, web site log analyzer, caching, Java variables and their types, Java variables and their values.... Google is essentially one big symbol table - user types in query word and Google returns relevant web pages.

As an example, in DNS lookup, we are interested in associating URLs that humans understand (e.g., www.amazon.com) with IP addresses that computer network routers understand (e.g., 208.216.181.15). To insert the keys, our client program will create a new SymbolTable and use its put method. do the following

SymbolTable st = new SymbolTable();
st.put("www.cs.princeton.edu", "128.112.136.11");
st.put("www.princeton.edu",    "128.112.128.15");
st.put("www.yale.edu",         "130.132.143.21");
st.put("www.amazon.com",       "208.216.181.15");
st.put("www.simpsons.com",     "209.052.165.60");

Now, we can lookup various IP addresses by passing the URL to the get method.

System.out.println(st.get("www.cs.princeton.edu"));
System.out.println(st.get("www.harvardsucks.com"));
System.out.println(st.get("www.amazon.com"));

The resulting output is shown below. In the last query, the URL does not exists so the symbol table returns null.

128.112.136.11
null
208.216.181.15

The example above performs all of the put calls before any of the get calls, but this is not required. Also, the client can lookup the same URL several times if they want.

Removing duplicates. One common application of symbol tables is to identify duplicates in a list. Program DeDup.java reads in a list of strings from standard input and prints out the strings, but only prints out the first occurrence of each string.

Unordered array implementation. Program array/SymbolTable.java implements a symbol table with an unordered array. When a key-value pair is inserted (put), it adds it to the next available spot. When the array fills up, we double the size of the array. To search for a key (get), the program scans through the array until it either finds the desired key (in which case it returns the value) or if it exhausts the array (in which case it returns null). Insertion fast (except when re-doubling), search slow.

Linked list implementation. Program link/SymbolTable.java implements a symbol table with a linked list. Insertion fast, search slow.

Sorted array implementation. Program sortedarray/SymbolTable.java implements a symbol table with an ordered array. When a key-value pair is inserted (put), we update the array so that all of the keys are in ascending order. This enables us to use binary search for the correct value. Search fast, insertion slow.

Hashing. The idea of hashing appears to have originated by two independent groups at IBM in the early 1950s. H. P. Luhn proposed the use of chaining. Interestingly, this was one of the first applications of linked lists. An other group at IBM solve the problem with a completely different approach. They used open addressing to resolve collisions. Both approaches are still widely used today, and we will discuss them below.

Hash table with separate chaining. Program hash/SymbolTable.java implements a symbol table with a separate chaining hash table. Insertion fast, search fast. The current version fixes the number of chains at 100,003 (a prime number), but it is not resizable.

hashing with separate chaining

Moby Dick To get a sense of how the hashing function scrambles the data, we hashed the 16,834 distinct words in Moby Dick, after converting to lowercase and removing punctuation. Using a hash table size of size M = 8,191 chains, we expect an average of roughly two words per chain. Here are the first few and last few chains. Here is the complete hash table.

   0: jocularly  seriously
   1: listen
   2:
   3: suburban  untravelled  considerating
   4: liest  heroes  silk
   5: swiftest  intercourse  sill  vast
   6: foregoing
   7: adieu
      ...
8190: browsing

As you can see, some chains are completely empty while others have several words.

Java's HashMap implementation. The Java library HashMap has essentially the same basic funcionality as what we are referring to as a symbol table, except that it offers a number of additional operations which we chose not to implement (e.g., remove). Also the key can be any type of object, not just a string. It is the programmer's responsibility to ensure that the key does not change while it is on the symbol table. Immutable objects like strings make excellent keys. Program HashMapDemo.java illustrates how to use the library HashMap. Program SymbolTable.java illustrates how to implement our symbol table interface using a HashMap. If you want to write code that will work in 10 years, it is often wise not to directly rely on pre-packaged libraries since these libraries might change or cease to exists in a few years. By writing the "wrapper" class SymbolTable and making all of our client programs only depend on SymbolTable, we protect ourselves against having to change any of the clients should the HashMap libray dissappear. We would simply re-implement SymbolTable, compile it, and everything else would work.

Good hash functions. For Social Security numbers, would you prefer to use the first 3 digits or the last 3 digits? Using the first three digits would be a very bad idea since they encode the geographic region. For example, 138 always corresponds to New Jersey. In contrast, the last four digits are assigned in chronological order and are often used as "hash values" by financial institutions. For strings, we would not want a hash function that converts the first few letters to an integer. Instead, we would want to incorporate information from all of the characters. The following hash function is essentially how Java implements hashCode for strings. The + may look a little funny since we are adding an int to a char, but a char is really a 16 bit integer, so resulting sum makes sense.

public int hashCode() {
   int hash = 0;
   for (int i = 0; i 

Program HashCode.java reads in a command line string and prints out its hash code using Java's built in String.hashCode method, and also via the iterative algorithm above.

A bad hash function. Prior to Java 1.2, the string hash function only depended on 16 characters in the string, evenly spaced starting from the first character. This was done in the hopes of computing the hash function more quickly. Indeed, the hash values were computed more quickly, but it became more likely that many strings hashed to the same values. This causes a significantly degrades the performance. This proved to be disastrous as many real-world inputs (e.g., long URLs) all had the same hash value.

Caveats: while this hash function is good in general, there are certain contrived inputs that can make every inserted string hash to the same chain. To see the danger of this, suppose you are maintaining a website and users can add database information. Since Java's string hash function is part of the Java standard, a malicious adversary can pre-compute a list of bad strings and then enter these. For example, it is easy to verify that "BB" and "Aa" hash to the same value (2112). Now, any string of length 2N that is formed by concatenating these two strings together in any order (e.g., BBBBBB, AaAaAa, BBAaBB, AaBBBB) will hash to the same value. There are 2^N such strings. If your algorithm uses Java's hash function, then your program can take a much longer time to run than expected. Reference. A more sophisticated hash scheme known as universal hashing overcomes this flaw by choosing the hash function at random (at runtime) from a specific set of hash functions in a very clever way.

Java's hashtable.

Sun's implementation of HashMap in Java 1.4.2 uses a table size which is a power of 2 (instead of a prime). To guard against some poorly written hash functions, they apply the following scrambling routine to hashCode
static int hash(Object x) {
   int h = x.hashCode();
   h += ~(h <9); h ^=(h>>> 14);
   h +=  (h <4); h ^=(h>>> 10);
   return h;
}

Hash function (e.g., MD5 SHA-1) are also useful for verifying the integrity of a file. Hash the file to a short string, transmit the string with the file, if the hash of the transmitted file differs from the hash value then the data was corrupted.

Hashing with linear probing. (omit?)

hashing with linear probing

Binary search tree. Program BST.java implements a symbol table with a binary search tree.

Successful search for H (top), unsuccessful search for M (middle), insertion of M (bottom).

BST search and insertion

The illustration below shows the BST after the insertion of the elements A S E R C H I N G X M P L into an initially empty tree.

BST construction                          BST construction

Balanced in a BST. Inserting about 200 random keys into an initially empty tree. No search uses more than 12 comparisons. Average cost for a search hit is 7.

BST random

When a BST has equal elements, they can be scattered through the tree as the highlighted A's in the top figure below.

BST duplicates

Worst case inputs for BST.

BST worst case                          BST worst case

A BST can also be used to implement a heap. Program PQTreeSet.java implements a priority queue with a red-black tree, using Java's built in TreeSet data type. It is reasonably efficient, but not quite as fast as using a binary heap. Its main advantage over our binary heap implementation is that it supports deletion of arbitrary elements as well. It also supports both delMax and delMin.

Lessons

  1. Space-time tradeoffs.

Questions and Answers.

  1. What's the difference between Java's Hashtable library and its HashMap library?
  2. Not much. You can insert null values and keys into a HashMap, but not a Hashtable. Also, HashMap is unsynchronzied, whereas Hashtable is synchronized.
  3. Why does Java use 31 in String.hashCode?
  4. It's prime, so that when the user mods out by another number, they have no common factors (unless it's a multiple of 31). 31 is also a Mersenne prime (like 217 or 511) which is a power of 2 minus 1. This means that the mod can be done with one shift and one subtract if the machine's multiply instruction is slow.
  5. How would you extract the bits from a variable of type double for use in hashing?
  6. Double.doubleToLongBits(x) returns a 64-bit long integer whose bit representation is the same as the floating point representation of the double value x.
  7. What's wrong with using (s.hashCode() % M) or Math.abs(s.hashCode()) % M to hash to a value between 0 and M-1?
  8. The % operator returns a non-positive integer if its first argument is negative, and this would create an array index out-of-bounds error. Surprisingly, the absolute value function can even return a negative integer. This happens if its argument is Integer.MIN_VALUE because the resulting positive integer cannot be represented using a 32-bit two's complement integer. This kind of bug would be excruciatingly difficult to track down because it would only occur one time in 4 billion!


Exercises

  1. Write a program that creates a symbol table mapping letter grades to numerical scores, as in the table below. Then write a program to read in a list of letter grades and compute the GPA.
    A+    A     A-    B+    B     B-    C+    C     C-    D     F
    4.33  4.00  3.67  3.33  3.00  2.67  2.33  2.00  1.67  1.00  0.00
    

  2. Implement a method size that returns the number of elements in the hash talbe implementation of a symbol table.
  3. Implement a method contains that takes a string key as input and returns true if that key is in the symbol table, and false otherwise.
  4. Give the contents of the hash table that results when you insert items with the keys E A S Y Q U T I O N in that order into an initially empty hash table of M = 5 chains. Use the hash function 11 k mod M to transform the kth letter of the alphabet into a table index, e.g., hash(I) = hash(9) = 99 % 5 = 4.
  5. Suppose we wish to search a linked list of length N elements, each of which contains a very long string key s and a hash value h(s). How might we take advantage of the hash value when searching the list for an element with a given key t? Solution: compute h(t) and examine the list elements s in order, comparing their hash values to h(t). Only compare the string s and t if (h(s) == h(t)).
  6. Explain why if you implement hashCode that you must also re-implement equals to be consisent. That is, if a.equals(b), then we must also have (a.hashCode() == b.hashCode()).
  7. Draw the BST that results when you insert the items
    E A S Y Q U E S T I O N
    

    in that order into an initially empty tree. Handle duplicates according to the BST code.

  8. Write a function range that takes two Key arguments and prints out all of the items in a BST whose keys are between two values. Your function should not need to traverse most of the tree (unless the range is exceptionally large).
  9. Write a function count that takes a single Key argument and prints out all of the items in a BST whose key is equal to the given key. Your function should not need to traverse most of the tree (unless it contains an exceptional number of matching keys).
  10. Suppose we have numbers between 1 and 1000 in a BST and want to search for 363. Circle the one sequence below that cannot be the sequence of keys examined.
    1. 2, 252, 401, 398, 330, 363
    2. 2, 399, 387, 219, 266, 382, 381, 278, 363
    3. 923, 220, 911, 244, 898, 258, 362, 363
    4. 924, 278, 347, 621, 299, 392, 358, 363
    5. 925, 202, 910, 245, 363
  11. Suppose that the following 31 keys appear (in some order) in a binary search tree of height 5:
    10 15 18 21 23 24 30 30 38 41
    42 45 50 55 59 60 61 63 71 77
    78 83 84 85 86 88 91 92 93 94
    98
    

    Note that a binary tree consisting of one node has height 1. Draw the top 3 nodes of the binary search tree, i.e., the root and its two children. Hint: 31 = 25 - 1.

  12. Given a BST with positive and negative integer keys, write a function that returns the value associated with the key of maximum absolute value.
  13. Given a BST with positive and negative integer keys, write a function that returns the value associated with the key of minimum absolute value.
  14. True or false. Given a BST, let x be a leaf node, and let y be its parent. Then either (i) the key of y is the smallest key in the BST larger than the key of x or (ii) the key of y is the largest key in the BST smaller than the key of x. Answer: true.

Creative Exercises

  1. Existence symbol table. Create an abstract data type ExistenceSymbolTable.java that stores keys (but no corresponding values). It should support an add method which takes one input and inserts the given string into the data structure. It should also support a method contains that takes one string input and returns true if that string is in the table, and false otherwise.

    Note: this corresponds with Java's Set interface. HashSet and TreeSet are two implementations.

  2. Spell checking. Write a program SpellChecker.java that the name of a file containing a dictionary of words in the English language, and then reads string from standard input and prints out any word that is not in the dictionary. Use the existence symbol table ADT from the previous exercise.
  3. Highlighting browser hyperlinks. Browsers typically denote hyperlinks in blue, unless they've already been visited, in which case they're depicted in purple Write a program HyperLinkColorer.java that reads in a list of web addresses from standard input and output blue if it's the first time reading in that string, and purple otherwise.
  4. Web filter. Write a program that reads in a bunch of known objectionable web sites. Then read a sequence of web sites from standard input and print only those web sites that should not be filtered. Use an existence symbol table.
  5. Spam blacklist. Insert known spam email addresses into an existence table and use to blacklist spamm.
  6. Frequency symbol table. Write an abstract data type FrequencyTable.java that supports the following operations: hit(String), and count(String). The hit operation increments the number of times the string appears by one. The count operation returns the number of times the given string appears, possibly 0. Applications: web counter, web log analyzer, RIAA for counding file sharing, music jukebox that counts number of times each song has been played....
  7. Word frequencies. Write a client program that uses FrequencyTable.java to read strings from standard input (e.g., Moby Dick) and print out the number of times each word appears.
  8. 1D range searching. Create a data structure that supports the following operations: insert a date, search for a date, and count the number of dates in the data structure that lie in a particular interval.
  9. Non-overlapping interval search. Given a list of non-overlapping intervals of integers (or dates), write a function that takes an integer argument and determines in which if any interval that values lies, e.g., if the intervals are 1643-2033, 5532-7643, 8999-10332, 5666653-5669321, then the query point 9122 lies in the third interval and 8122 lies in no interval.
  10. Periodic table of elements. Create a program that can lookup the element's name based on its symbol (e.g., C for Carbon, Pu for Plutonium), or the element's symbol given its name. Use the data file elements.txt.
  11. Morse code. Create a program MorseCode.java that converts between letters and Morse code values. For example MorseCode.toMorse("B") should return the string "-..." and MorseCode.fromMorse("-...") should return the string "B". Use the list of Morse code values in morse.txt.
  12. Intersection of two sets. Given two arrays of N 64-bit integers each, possibly with duplicates, print out all the numbers in the first array that are contained in the second array. Your algorithm should run in linear time on average.
  13. Cryptograms. Write a program to read in a cryptogram and solve it. A cryptogram is ancient form of encryption known as a substitution cipher in which each letter in the original message is replaced by another letter. Assuming we use only lowercase letters, there are 26! possibilities, and your goal is to find one that results in a message where each word is a valid word in the dictionary. Use Permutations.java and backtracking.
  14. Hash attack. Write a program that takes a command line parameter N and produces 2N strings of length 2N that all hash to the same value. Perform some computational experiments by first inserting these 2N strings into a hash table, and then performing searches for each of the 2N strings. Compare to what happens when you insert 2N random strings instead. Try N = 10, 15, and 20.
  15. IP lookup by country. Use the data file ip-to-country.csv or GeoIPCountryWhois.csv to determine what country a given IP address is coming from. The data file has five fields (begining of IP address range, ending of IP address range, two character country code, three character country code, and country name. See The IP-to-country website or MaxMind GeoIP. The IP addresses are non-overlapping. Such a database tool can be used for: credit card fraud detection, spam filtering, auto-selection of language on a web site, and web server log analysis.
  16. Remove. Implement a remove method that takes as input a string key and removes the corresponding entry from the symbol table (if it exists). Do this for the linked list, sorted array, and hash table implementations. How efficient is the new operation in each of your implementations?
  17. Hash table iterator. Create an iterator for a hash table.
  18. Binary search tree iterator. Create an iterator for a binary search tree that iterates through the elements in ascending order. Hint: the iterator class should maintain its own stack to keep track of where it is in the tree.
  19. Boggle. Write a recursive program Boggle.java that takes a command line parameter N, generate a random N-by-N board of characters, and print it out. A Boggle word is a sequence of characters that is formed by snaking around from cell to cell either left, right, up, down, or diagonally, without repeating any cell. Read in a dictionary and print out all Boggle words that appear in the dictionary. Test your program out on boggle4.txt, boggle5.txt, and boggle15.txt. Print out the score (6-letter words worth x points, etc.)
  20. Boggle with backtracking. The solution to the previous exercise must examine every conceivable word, and there are an astronomically large number of words to check. Write a program Boggle.java that performs backtracking to avoid most of these possibilities. If you encounter a word, say hellox, and there is no word in the dictionary that begins with that string, you can safely stop the recursive search from here, and avoid trying out new words beginning with hellox. You will need to extend the symbol table ADT to support one new operation: containsPrefix(s) that returns true if the dictionary contains the string s as a prefix of one (or more) words in the dictionary. Consider using a 26-way trie for the symbol table. Use a second symbol table to ensure that you print out each word at most once.
  21. High-scoring boggle board. The Boggle dice look like the following: ACGSTE, ... A 6-letter word is worth x points, etc. Find the best boggle board you can. Assuming the dice are rolled uniformly at random, try estimate the the average number of points.
  22. Inverted index of web. Given a list of web pages, create a symbol table of words contained in the web pages. Associate with each word a list of web pages in which that word appears. Write a program that reads in a list of web pages, creates the symbol table, and support single word queries by returning the list of web pages in which that query word appears.
  23. Inverted index of web. Extend the previous exercise so that it supports multi-word queries. In this case, output the list of web pages that contain at least one occurrence of each of the query words.
  24. Repetition draw in chess. In the game of chess, if a board position is repeated three times with the same side to move, the side to move can declare a draw. Describe how you could test this condition using a computer program.
  25. Registrar scheduling. The Registrar at a prominent northeastern University recently scheduled an instructor to teach two different classes at the same exact time. Help the Registrar prevent future mistakes by describing a method to check for such conflicts. For simplicity, assume all classes run for 50 minutes starting at 9, 10, 11, 1, 2, or 3.
  26. Random element. Add a symbol table function random to a BST that returns a random element. Assume that the nodes of your BST have integer size fields that contain the number of elements in the subtree rooted at that node. The running time should be proportional to the length of the path from the root to the node returned.
  27. Indirect priority queue. Write a program IndirectPQ.java that implements an indirect priority queue. An indirect priority queue consists of key-value pairs, where the keys are strings and the values are of type Comparable. The method insert should insert a key-value pair, the method delMax should delete the item with the biggest key, the method getPriority should return the value corresponding to the given key, and the method changePriority should change the priority of the given key. All operations should be efficient.
  28. Max symbol table. Write a program Top.java that takes a command line parameter N, reads in a sequence of words from standard input, and prints out the top N most frequently occurring words, e.g., in Moby Dick. Hint: use IndirectPQ.java
  29. Most prolific actors and actresses. Write a program to read in data from the movie database and find the actor that appears in the most movies. Hint: use IndirectPQ.java
  30. Markov language model. Create a data type that supports the following two operations: add and random. The add method should insert a new item into the data structure if it doesn't yet exists; if it already exists, it should increase its frequency count by one. The random method should return an element at random, where the probabilities are weighted by the frequency of each element.
  31. Symbol table with no duplicates. Create a data type that is a symbol table, except that a key may only appear on the symbol table at most once at any given time. If a request is made to insert a new key-value pair and the key is already in the symbol table, delete the old key-value pair and overwrite it with the new one. (This is how Java's symbol tables HashMap and TreeMap work.)
  32. Queue with no duplicates. Create a data type that is a queue, except that an element may only appear on the queue at most once at any given time. Ignore requests to insert an item if it is already on the queue.
  33. UniQueue. Create a data type that is a queue, except that an element may only be inserted the queue once. Use an existence symbol table to keep track of all elements that have ever been inserted and ignore requests to re-insert such items.
  34. Bayesian spam filter. Follow the ideas in A Plan for Spam. Here is a place to get test data.
  35. California runoff election for governor. In order to thwart bias against candidates whose names appear toward the end of the alphabet, California will sort the candidates appearing on a ballot in the following orded:
    R W Q O J M V A H B S G Z X N T C I E K U P D Y F L
    

    Write a compareTo method for String objects that respects this order and use it to sort this list of candidates.

  36. Java 1.1 hashCode. Sun used a string hash function like the following in JDK 1.1. Why might this be a bad idea?
    public int hashCode() {
       int hash = 0;
       if (length() <16) { for (int i=0; i < length(); i++) hash=(hash * 37) + charat(i); } else { int skip=length() / 8; for (int i=0; i < length(); i=i + skip) hash=(hash * 37) + charat(i); } return hash; } 

  37. For your poetry class, you would like to tabulate a list of rhyming words. A crude way to accomplish this task is as follows:
    • Read in a dictionary of words into an array of strings.
    • Reverse the letters in each word, e.g., confound becomes dnuofnoc.
    • Sort the resulting array of words.
    • Reverse the letters in each word back to their original state.
    Now the word "confound" will be next to words like "astound" and "compound." Write a program Rhymer.java that reads in a sequence of words from standard input and prints them out in the order specified above.
  38. Checking if a tree is a BST. Write a method isBST that takes a binary tree and returns true if it is a binary search tree, and false otherwise. Hint: This is more difficult than it might seem. Use an overloaded method isBST that takes two additional arguments min and max and returns true if the tree is a BST and all its values are between min and max.
    public static boolean isBST(Tree root) {
        return isBST(root, Integer.MIN_VALUE, Integer.MAX_VALUE);
    }
    
    private boolean isBST(Tree root, int min, int max) {
        if (root == null) return true;
        if (root.item  max) return false;
        return isBST(root.left, min, root.item) &&
               isBST(root.right, root.item, max);
    }
    
  39. Mirroring a web site. Use hashing to figure out which files need to be updated to mirror web site.
  40. Symbol table with random access. Create a data type that supports inserting a key-value pair, searching for a key and returning the associated value, and deleting and returning a random value. Hint: combine a symbol table and a randomized queue.
  41. Password checker. Write a program that reads in a string from the command line and a dictionary of words from standard input, and checks whether it is a "good" password. Here, assume "good" means that it (i) is at least 8 characters long, (ii) is not a word in the dictionary, (iii) is not a word in the dictionary followed by a digit 0-9 (e.g., hello5), (iv) is not two words separated by a digit (e.g., hello2world)
  42. Reverse password checker. Modify the previous problem so that (ii) - (v) are also satisfied for reverses of words in the dictionary (e.g., olleh and olleh2world). Clever solution: insert each word and its reverse into the symbol table.
  43. Random phone numbers. Write a program that takes a comand line input N and prints N random phone numbers of the form (xxx) xxx-xxxx. Use a symbol table to avoid choosing the same number more than once. Use this list of area codes to avoid printing out bogus area codes.
  44. Unique substrings of length L. Write a program that reads in text from standard input and calculate the number of unique substrings of length L that it contains. For example, if the input is cgcgggcgcg then there are 5 unique substrings of length 3: cgc, cgg, gcg, ggc, and ggg. Applications to data compression. Hint: use the string method substring(i, i + L) to extract ith substring and insert into a symbol table. Alternative solution: compute the hash of the i+1st substring using the hash of the ith substring. Test it out on the first million digits of π. or 10 million digits of π.
  45. The great tree-list recursion problem. A binary search tree and a circular doubly linked list are conceptually built from the same type of nodes - a data field and two references to other nodes. Given a binary search tree, rearrange the references so that it becomes a circular doubly-linked list (in sorted order). Nick Parlante describes this as one of the neatest recursive pointer problems ever devised. Hint: create a circularly linked list A from the left subtree, a circularly linked list B from the right subtree, and make the root a one node circularly linked list. Them merge the three lists.
  46. Expandable hash table. Our hash table implementation has a fixed size of 100,003 chains. This wastes memory if the hash table has only a small number of elements. Instead, start the hash table with M = 509 chains. When the number of key-value pairs in the table exceeeds 509, increase it to 1,021. Use the following values for subsequent threshholds: 2,053, 4,093, 8,191, 16,381, 32,771, and 65,521. Each value is a prime number and roughly twice the previous one.
  47. Correcting misspellings. Write a program that reads in text from standard input and replaces any commonly misspelled words with the suggested replacement, and prints the result to standard output. Use this list of common misspellings adapted from Wikipedia.
  48. Hash table with linear probing.
  49. Optimal BST.
  50. Floor and ceiling. Create an ADT that supports insert, floor, and ceil. The method floor takes a string key s as input and returns the element in the ADT that contains the largest key (lexicographically) that is less than or equal to s. The method ceil is defined similarly. Application: best-fit heuristic for bin packing.
  51. Zipf's law. Harvard linguist George Zipf observed that the frequency of the ith most common word in an English text containing N words is roughly proporitional to 1/i, where the constant of proportionality is 1 + 1/2 + 1/3 + ... + 1/N. Test "Zipf's law by reading in a sequence of words from standard input, tabulate their frequencies, and compare against the predicted frequencies.
  52. Typing monkeys and power laws. (Micahel Mitzenmacher) Suppose that a typing monkey creates random words by appending each of 26 possible lettter with probability p to the current word, and finishes the word with probability 1 - 26p. Write a program to estimate the frequency distribution of the lengths of words produced. If "abc" is produced more than once, only count it once.
  53. Typing monkeys and power laws. Repeat the previous exercise, but assume that the letters a-z occur proportional to the following probabilities, which are typical of English text.


    CHAR FREQ   CHAR FREQ   CHAR FREQ   CHAR FREQ   CHAR FREQ
    A 8.04 G 1.96 L 4.14 Q 0.11 V 0.99
    B 1.54 H 5.49 M 2.53 R 6.12 W 1.92
    C 3.06 I 7.26 N 7.09 S 6.54 X 0.19
    D 3.99 J 0.16 O 7.60 T 9.25 Y 1.73
    E 12.51 K 0.67 P 2.00 U 2.71 Z 0.09
    F 2.30


  54. Indexing a book. Write a program that reads in a text file from standard input and compiles an alphabetical index of which words appear on which lines, as in the following input. Ignore case and puncuation.
    It was the best of times,
    it was the worst of times,
    it was the age of wisdom,
    it was the age of foolishness,
    
    age 3-4
    best 1
    foolishness 4
    it 1-4
    of 1-4
    the 1-4
    times 1-2
    was 1-4
    wisdom 4
    worst 2
    
  55. Entropy. We define the relative entropy of a text corpus with N words, k of which are distinct as E = 1 / (N log N) * sum (pi log(k) - log(pi), i = 1..k) where p_i is the fraction of times that word i appears. Write a program that reads in a text corpus and prints out the relative entropy. Convert all letters to lowercase and treat punctuation marks as whitespace.
  56. Find ith element. Write a Java method to return the ith element in a BST. Assume that the subtree sizes are maintained in each node.
  57. Rank query. Given a query string, write a Java method to return the integer i such that s is the ith biggest element in the binary search tree.
  58. Delete ith element. Create an ADT that supports the following operations: isEmpty, insert, and delete(int i), where the deletion operation deletes and returns the ith least recently added object. Use a BST with a counter that records the total number of elements n inserted and give the nth element inserted the key n. Each BST node should also maintain the total number of nodes in the subtree rooted at that node. To find the ith least recently added item, search for the ith smallest element in the BST.
  59. Move-to-front. Encoding: need rank query, delete, and insert. Decoding: need find ith, delete, and insert.
  60. Mutable string. Create an ADT that supports the following operations on a string: get(int i), insert(int i, char c), and delete(int i), where get returns the ith character of the string, insert inserts the character c and makes it the ith character, and delete deletes the ith character. Use a binary search tree.
  61. CRC-32. Another application of hashing is computing checksums to verify the integrity of some data file. To compute the checksum of a string s,
    import java.util.zip.CRC32;
    ...
    CRC32 checksum = new CRC32();
    checksum.update(s.getBytes());
    System.out.println(checksum.getValue());
    
  62. BST reconstruction. Given the preorder traveral of a BST (not including null nodes), reconstruct the tree.
  63. Power law of web links. (Micahel Mitzenmacher) The indegrees and outdegrees of World Wide Web obey a power law. Can be modeled by preferred attachment process. Suppose that each web page has exactly one outgoing link. Each page is created one at a time, starting with a single page that points to itself. With probability p <1, it links to one of the existing pages, chosen uniformly random. with probability 1-p it links to an existing page with probability proportional to the number of incoming links of that page. this rule reflects the common tendency for new web pages to point to popular pages. write a program to simulate this process and plot a histogram of the number of incoming links.