3.4. ENCAPSULATION AND ABSTRACT DATA TYPES
This section under major construction.
Developing abstract models for our data and for the ways in which our programs process those data is an essential ingredient in the process of solving problems with a computer. All computer systems are based on layers of abstraction: We adopt the abstract model of a bit that can take on a binary 0-1 value from certain physical properties of silicon and other materials; then we adopt the abstract model of a machine from dynamic properties of the values of a certain set of bits; then we adopt the abstract model of a programming language that we realize by controlling the machine with a machine-language program; then we adopt the abstract notion of an algorithm implemented as a Java program. We do not need to understand how each layer works in order to use it, only what it does. This approach has two main advantages. First, we can write useful Java programs even though we may not fully understand the physical properties of silicon. Seconds, if someone invents a better way to represent a bit that relies on light instead of silicon, we can simply change the computer, but reuse all of our Java programs. In this section, we consider how to adapt the basic Java class mechanism to achieve this same kind of separation in layers of abstraction built with Java code.
Abstract data types.
Designing reusable data types requires great care. Obviously we expect our data types to work according to their published APIs. The next most important design feature is separating the internal representation from the API, hiding all of the implementation details. An abstract data type (ADT) is a set of values and a set of operations on those values that is accessed only through an interface. The interface does not specify the way the values are represented or the way the operations are implemented. It only specifies which operations are permissible. "A nut and bolt are compatible if they have the same diameter and threads per inch." This is the interface. However, "they may be made of carbon steel, steel, bronze, nylon, titanium, whatever." We refer to a program that uses an ADT as a client and a program that specifies the data type as an implementation. The interface is a contract between the client and the implementation which specifies which operations are supported and what is the required behavior. Encapsulation is the process of separating and compartmentalizing the client from the implementation by hiding the internal representation of the data. By prohibiting clients to access directly the data representation, we are free to change it! Using ADTs enables us to:- Debug and test one module independently from others.
- Improve resiliency of software systems by limiting and localizing the effects of changing the internal representation of a data type.
- Build reusable libraries.
- Substitute different implementations of the same ADT to improve performance, accuracy, or memory footprint.
- Decomposability, composability.
Zip codes and time bombs.
There are many examples of programmers making assumptions about the internal representation of a data type, and later having to pay a big price to fix their mistakes.- ZIP codes. In 1963, The United States Postal Service (USPS) began using a 5 digit ZIP (Zoning Improvement Plan) code to improve the sorting and delivery of mail. Programmers wrotes lots of software that assumed that Lots of software assumed that zip codes would remain 5 digits forever, and represented them in their programs using a single integer. In 1983, the USPS introduced an expanded ZIP code called ZIP+4 which consists of the original 5 digit ZIP code plus 4 extra digits to assist in delivery. This broke hundreds of brittle programs and required millions of dollars to fix. One of the lessons we should learn from this is to use encapsulated data types so that if we must change the data representation, the effects are localized to one ADT and we don't need to search through millions of lines of code to find all of the places where we assumed ZIP codes were 5 digits long.
- Y2K.
The much publicized Y2K bug resulted from many programmers
encoding years using two digits (years since 1900).
As the year 2000 approached, programmers had to scavenge through
millions of lines of code to find all those places that
relied on a two digit year.
Some of these programs controlled nuclear power plants,
financial markets, commercial air traffic, and military
warships. The cost for fixing all of these problems was estimated
in the billions of dollars.
For example, we can internally represent a point in time by storing several small integers: the month (1-12), the day (1-31), and the year sine 1900 (00-99), the hour (1-24), the minute (0-59), and the second (0-59). Alternatively, we can represent it by storing a single integer which counts the number of seconds since January 1, 1970. Programmers have used both of these approaches to write program without anticipating that their programs would outlast the maximum representable value. Making the data type abstract means that the client programs would have no way of knowing or relying on the data type's internal representation. Specifically if we change the internal data representation of the ADT (e.g., to allow 4 digit years instead of 2) the client program will not need to change in any way, provided we continue to implement the interface exactly as before.
- Other time bombs. On September 14, 2004 Los Angeles airport was shut down due to software breakdown of a radio system used by air traffic controllers to communicate with pilots. The program used a Windows API function call GetTickCount() which returns the number of milliseconds since the system was last rebooted. The value is returned as a 32 bit integer, so after approximately 49.7 days it "wraps around." The software developers were aware of the bug, and instituted a policy that a technician would reboot the machine every month so that it would never exceed 31 days of uptime. Oops. LA Times blamed the technician, but the developers are more to blame for shoddy design.
- IPv4 vs. IPv6.
- Vehicle identification numbers. In 1981, The Society of Automotive Engineers established a unique naming scheme for vehicles known as the Vehicle Identification Number or VIN.
- VIN numbers.
In 1981, The Society of Automotive Engineers established a unique naming scheme
for vehicles in the known as the Vehicle Identification Number
or VIN. The VIN describes the make, model, year, and other attributes
of cars, trucks, buses, and other vehicles in the United States.
Automakers expect to run out by 2010. One solution is to increase the
length of the VIN. Another is to reuse existing VINs. Both would require
major logistical changes to supporting software systems, estimates in the
billions of dollars.
Moral: whenever a unique number is needed, don't skimp!
Intuition for standards. It may be easier to understand why writing to an interface or spec is so important by considering other domains. Fax machines, radio, MPEG-4, MP3 files, PDF files, HTML, etc. Simpler to use a common standard. Lack of incompatibilities enables business opportunities that would otherwise be impossible. Connection with patents? There is a tension between software patents and software standards. A patent grants a temporary monopoly, whereas a standard is meant to be used by all. An interesting article about the development of the MPEG standards.
One of the challenges of writing software is making it portable so that it works on a variety of operating systems including Windows, OS X, and Linux. Java Virtual Machine enables portability of Java across platforms.
ADTs in Java. Java's class mechanism directly supports the design of ADTs. We enforce data hiding by using the private and public access modifiers. When we declare an instance variable as private, this means that the client (code written in another module) cannot directly access that instance variable. The client can only access the ADT through the public methods and constructors. All of the data type that we have built use only private instance variables, so they are already ADTs. As a general rule, always use private as the access modifier for your instance variables. If you use public then you will greatly limit any opportunity to modify the class over time. Client programs may rely on your public variable in thousands of places, and you will not be able to remove it without breaking dependent code.
Date.
Give two implementations of a date ADT. For simplicity, we ignore leap years. See Exercise XYZ to deal with leap years. Include a constructor, toString, next, after. Program Date.java represents a date using three integers for the month, day, and year. Program Date2.java represents a date a single integer that counts the number of days since January 1, 1970. By accessing the data only through the interface, our data type can perform range checking (e.g., no such thing as January 0 or February 30). This enable our data type to maintain objects that are always in a consistent state and preserve data type invariants. The time invested in performing routine error-checking more than pays off as we start writing more complex programs. Using encapsulation enables us to isolate to a few places where an object can change state.Generating random numbers.
Different methods to generate a random number from the standard Gaussian distribution. We have seen one method in Section xyz. Trigonometric method is simple, but may be slow due to calling several transcendental functions. More importantly, it suffers from numerical stability problems when x1 is close to 0. Better method is alternate form of Box-Muller method. reference. Both methods require two values from a uniform distribution and produce two values from the Gaussian distribution with mean 0 and standard deviation 1. Can save work by remembering the second value for the next call. Can also use Random in Java's library. Their implementation is the polar method of Box-Muller, saving the second random number for a subsequent call. (See Knuth, ACP, Section 3.4.1 Algorithm C.)ADT for generating a random bit: LFSR.
Statistics.
Encapsulation enables us to easily replace one implementation with another. This is desirable when we anticipate creating an improved implementation with better performance or accuracy properties. We consider a simple, but representative example below. Our goal is to store statistics for a set of real values, say the sample mean x bar and the sample variance s2.
We want our data type to support the following methods: add, size, mean, stddev, variance. OnePass.java stores n, the sum, and the sum of squares. TwoPass.java stores n, the sum, and all of the data values in an array. The one pass algorithm is faster and uses substantially less space, but is susceptible to roundoff error. StableOnePass.java is a well-engineered alternative.
It is numerically stable and does not require an array to store the elements.
public void add(double x) { n++; s = s + 1.0 * (n-1) / n * (x - m) * (x - m); m = m + (x - m) / n; } public double mean() { return m; } public double variance() { return s / (n - 1); }
Complex numbers.
In Section 3.2, we considered program Complex.java, which is an ADT for complex numbers. Internally, it stores a complex number c = x + iy as its real and imaginary parts (x, y). This is called the rectangular representation. We can also represent a complex number as c = r (cos θ + i sin θ). This is known as the polar representation, and we refer to r as the modulus and θ as the angle. This representation is especially useful if we are multiplying or dividing complex numbers. Program PolarComplex.java implements the exact same complex number ADT as Complex.java, but stores the polar representation (r, θ) instead of the the rectangular representation (x, y). Since it has exactly the same public methods and public constructor as the version with rectangular coordinates, we can essentially use it interchangeably with the other version of Complex.java. The main difference to the client is in different performance properties. For example, implementing an exponentiation method becomes much easier with the polar representation, especially if the power is a negative integer or if the power itself is complex! Other operations become more cumbersome to implement and expensive to compute, e.g., addition and subtraction. There is one minor caveat that makes these two implementations not completely interchangeable: floating point precision may influence the two implementations in different ways. For example, the rectangular implementation represents 3 + 4i exactly, whereas the polar implementation must store an approximation to the angle θ since it is an irrational number. Such difference should not effect a well-behaved client, but we should be aware of any places where our ADT leaks information.Signals and waveforms.
Measure sound in time domain (represent as fluctuations in air pressure over time) or frequency domain (represent as freqencies and amplitudes). Basilar membrane of human ear converts from time domain to frequency domain for processing my brain. Explains why we "hear" a collection of frequencies played simultaneously as separate notes. Most common representation = time domain, but most signal analysis techniques are designed to work in frequency domain. (discard phase information) In section XYZ, we'll learn about an elegant algorithm called the Fast Fourier Transform which converts between time and frequency domains.Polynomials.
Data structure representations of degree N polynomial with T terms. Perhaps the most natural representation is to explicitly store the coefficients and exponents.- Array of N+1 coefficients indexed by exponent.
- Linked lists of T coefficient-exponent pairs.
- Array of T coefficient-exponent pairs.
- BST (for fast access to individual coeffcients)
- Coefficients: N+1 coefficients and exponents (as above).
- Point-values: evaluate polynomial at N+1 distinct points.
- Roots: N complex roots plus gain factor.
More scientific applications.
In Section 9.2 we implement a data type to represent polynomial functions, and implement classical operations including addition, multiplication, composition, evaluation, integration, and differentiation. In Section 9.5 we implement a data type to represent real-valued matrices, and implement classical operations including addition, multiplication, and solving Ax = b. For both data types, there are two natural representations, one for sparse polynomials and matrices, and one for dense polynomials and matrices. Performance differences are enormous. For example, airline crew scheduling problems might involves ten of millions of rows and a hundred thousand columns. Solving problems of this size is a formidable challenge, and it could not even be attempted without exploiting sparsity.First class ADTs. A first class ADT is an ADT for which you can create as many instances of the data type as desired. In.java is an object-oriented version of StdIn.java. It supports reading input from standard input, from a file, from a web site, or from a network socket. Unlike StdIn.java, you can open up several input streams in the same program. Similarly Draw.java is an object-oriented version of StdDraw.java. You can create and display several independent drawing windows.
Interface. To see all of the non-private members of a class, type
javap Complex.class
Q + A
Q. How would I add a constructor to the Complex data type that takes r and θ as arguments?
A. It's a problem since you already have one constructor that takes two real arguments. Instead, you should probably create two methods createRect(x, y) and createPolar(r, theta) that create and return new objects. This style will also help remind the client to pass in arguments of the correct form.
Q. The instance variable x in type Complex is declared with access modifier private. How come in the plus method I can access both this.x and b.x?
A. Declaring an instance variable to be private means that it is not directly accessible from any other class. Methods within the Complex class can access (read or write) the instance variables of any object in that class. Perhaps, Java should include a more restrictive access modifier, say superprivate, that makes an instance variable only accessible to the invoking object.
Q. If I don't specify an access modifier, what's the default?
A. It's something between public an private.
Q. Is it possible to defeat Java's type safety features?
A. If there are no bugs in the Java implementation, an untrusted program should not have access to the private data of a trusted program. This guarantee assumes that "the computer faithfully executes its specified instruction set." If a cosmic ray strikes the computer's memory, it can flip a bit, thereby invalidating this axiom. Appel and Govindavajhala showed how to craft a program so that when executed, any single bit error would yield a 70% chance of the program being have to gain unfettered access to the JVM. They also demonstrated that such errors could be induced by shining a 50 watt spotlight bulb at the memory for one minute. Smartcards are particularly susceptible since the attacker has physical access to the computer.
Exercises
- Add a method power to Complex so that a.power(b) returns the result of taking the complex number a and taking it to the complex power b.
-
What is the principle value of ii?
Answer: e-Π/2 = 0.207879576....
- Create a Rectangle ADT that represents a rectangle. Represent a rectangle by two points. Include a constructor, a toString method, a method for computing the area, and a method for drawing using our graphics library.
- Repeat the previous exercise, but this time represent a Rectangle as the lower left endpoint and the width and height.
- Repeat the previous exercise, but this time represent a Rectangle as the center and the width and height.
Creative Exercises
- International bank account (depositDollars, depositEuros, withdrawDollars, withdrawEuros). How to represent money (cents, BigDecimal, precision, etc.)
- Color. Take linear combinations of colors. (RGB vs. CMYK)
- Picture. 2D array of colors, 2D array of rgba ints.
- Turtle location starting at origin (goForward, rotate, draw) Use polar and cartesian coordinates.
- Polar representation of points.
Point.java and
PointPolar.java implement
the following point interface using rectangular and polar coordinates,
respectively.
Point() Point(double, double) double x() double y() double r() double theta() double distance(Point) public String toString()
- Spherical coordinates.
Represent a point in 3D space using Cartesian coordinates (x, y, z)
or spherical coordinates (r, θ φ). To convert from one to
the other, use
- Colors. Can represent in RGB, CMYK, or HSV formats. Natural to have different implementations of same interface.
- Time. Implement an ADT that represents the time of day, say with seconds, minutes, and hours. Representation 1: three fields for hours, minues, and seconds. Representation 2: number of seconds since 12 midnight. Include compareTo, plus.
- ZIP codes. Implement an ADT that represents a USPS ZIP code. Support both the original 5 digit format and the newer (but optional) ZIP+4 format.
- Vector fields. Write a data type for force vectors in two dimensions. You should be able to add vectors using the usual rules from physics using x sin(theta) and x cos (theta)....
- Genome.
Implement a data type to store the genome of an organism. Biologists
often abstract away the genome to a sequence of nucleotides (A, C, G, or T).
The data type should support the method addNucleotide,
nucleotideAt(int i), and doSomeComputation.
Perhaps change to addCodon.
- StringGenome.java has one instance variable of type String. It implements addNucleotide with string concatenation. Each method call takes time proportional to the size of the current genome. Not practical spacewise either for large genomes since nucleotide is stored as a 16-bit char.
- Genome.java implements a genome as an array of characters. The size of the array is doubled when the array fills up. The method addNucleotide is now constant time, but the space consumption is still 16 bits per nucleotide.
- CompactGenome.java implements a genome as boolean array. We need to use two bits per nucleotide since there are 4 different nucleotides. As in the previous implementation, we use a dynamic array with repeated doubling. Now, each nucleotide consumes 2 bits of storage (instead of 16).
Include a method to return the reverse-complemented genome. Include a method for testing for equality.
Codon usage table. Print out summary statistics for each codon, e.g.,
UUU 13.2( 724) UCU 19.6( 1079) UAU 16.5( 909) UGU 12.4( 680) UUC 23.5( 1290) UCC 10.6( 583) UAC 14.7( 808) UGC 8.0( 438) UUA 5.8( 317) UCA 16.1( 884) UAA 0.7( 38) UGA 0.3( 15) UUG 17.6( 965) UCG 11.8( 648) UAG 0.2( 13) UGG 9.5( 522) CUU 21.2( 1166) CCU 10.4( 571) CAU 13.3( 733) CGU 10.5( 578) CUC 13.5( 739) CCC 4.9( 267) CAC 8.2( 448) CGC 4.2( 233) CUA 6.5( 357) CCA 41.0( 2251) CAA 24.9( 1370) CGA 10.7( 588) CUG 10.7( 590) CCG 10.1( 553) CAG 11.4( 625) CGG 3.7( 201) AUU 27.1( 1491) ACU 25.6( 1405) AAU 27.2( 1495) AGU 11.9( 653) AUC 23.3( 1279) ACC 13.3( 728) AAC 21.0( 1151) AGC 6.8( 374) AUA 5.9( 324) ACA 17.1( 940) AAA 32.7( 1794) AGA 14.2( 782) AUG 22.3( 1223) ACG 9.2( 505) AAG 23.9( 1311) AGG 2.8( 156) GUU 25.7( 1412) GCU 24.2( 1328) GAU 49.4( 2714) GGU 11.8( 650) GUC 15.3( 843) GCC 12.6( 691) GAC 22.1( 1212) GGC 7.0( 384) GUA 8.7( 478) GCA 16.8( 922) GAA 39.8( 2185) GGA 47.2( 2592)
