Math 176 - Homework #2

Set Operations using a Hash Table with Linear Probing

Due date: Friday, October 27, Midnight.

This programming assignment is covered by special Academic Integrity Guidelines.

Overview: This homework assignment requires you to implement the basic operations for a hash table with linear probing, including the ability to create an iterator that allows sequential access to the elements stored in the hash table.  You must implement the hash table so that it extends the Java class AbstractSet.  The associated iterator must implement the Java specification of Iterator.  You will also implement non-lazy deletion from the hash table.  You will compare your linear-probing hash table implementation of a set with the built-in Java 1.2 classes for data structures.  A special class CountComparable, which extends Comparable, has been written that lets you easily gather statistics on the number of comparisons performed during the data structure operations, so that you can indirectly compare the performance of your hash table against the built-in Java data structures.  For hash tables, we shall be using the number of collisions as the measure of how good the algorithm is.

These instructions may change somewhat or be augmented: please watch for announcements on this, or check back to this page.

The outline of the homework assignment is as follows:

  1. Write a hash table with linear probing implementation of a set.  This must be a Java class named HashSetLinear and it must extend AbstractSet.  The basic operations it must support are:

    1. a constructor: public HashSetLinear(int, float).  You should write this constructor that sets the initial capacity and loadFactor.  You may write other constructors too if you wish.
    2. an "add" or "insert" method: public boolean add( Object o ).
    3. a "search" or "contains" method: public boolean contains( Object o ).
    4. a size operation:  public int size().
    5. an iterator creator: public Iterator iterator().
    6. a "remove" or "delete" method: public boolean remove( Object o ).

    Your implementation must obey the Java implementation standards for AbstractSet: namely, the return values of add and remove indicate whether the set was changed as a result of the operation.  The add method will not add another copy of an object that is already present.  The iterator must implement hasNext() and next() exactly as specified by the Java 1.2 specifications for Iterator.  For a little extra credit, you may implement the remove() method in the iterator if you wish.
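
To illustrate the return-value convention described above, here is a minimal sketch that uses the library class java.util.HashSet as a stand-in for HashSetLinear.  The class name SetContractDemo and its method names are made up for this example, and the code uses generics from later Java versions rather than the Java 1.2 signatures above.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical demo class; HashSet stands in for HashSetLinear.
class SetContractDemo {
    // Runs a fixed sequence of calls and records what each one returned.
    static boolean[] run(Set<String> s) {
        return new boolean[] {
            s.add("apple"),       // true: the set changed
            s.add("apple"),       // false: duplicate, set unchanged
            s.contains("apple"),  // true
            s.remove("pear"),     // false: not present, set unchanged
            s.remove("apple"),    // true: the set changed
            s.isEmpty()           // true: nothing is left
        };
    }

    // Checks that next() throws once hasNext() has become false.
    static boolean nextThrowsWhenExhausted(Set<String> s) {
        Iterator<String> it = s.iterator();
        while (it.hasNext()) {
            it.next();
        }
        try {
            it.next();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        for (boolean b : run(new HashSet<String>())) {
            System.out.println(b);
        }
        Set<String> s = new HashSet<String>();
        s.add("x");
        System.out.println(nextThrowsWhenExhausted(s));
    }
}
```

Passing your own set to run() in place of the HashSet should give the same six values if the contract is met.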

  2. Use the supplied CountComparable class to wrap objects so as to keep track of the number of comparisons (equals() tests) used when inserting and removing objects from Sets.  Run tests with the supplied data file, first with the Java TreeSet class (which is based on Red-Black trees), then with the supplied class HashSetSeparateChaining, and then with your HashSetLinear implementation of a hash table with linear probing.  Gather statistics and prepare a table reporting the average number of comparisons used per operation.  You will need to run your tests with several different load factors and with large data sets.
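
The idea of a comparison-counting wrapper can be sketched as follows.  This is not the supplied CountComparable, whose actual interface may differ; the class CountingKey, its static counter, and the helper countAdds are all hypothetical names chosen for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the supplied CountComparable; the real API may differ.
class CountingKey implements Comparable<CountingKey> {
    static long comparisons = 0;   // shared counter, reset before each test
    private final String value;

    CountingKey(String value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        comparisons++;
        return o instanceof CountingKey && value.equals(((CountingKey) o).value);
    }

    @Override
    public int compareTo(CountingKey other) {
        comparisons++;
        return value.compareTo(other.value);
    }

    // Must agree with equals(): equal keys need equal hash codes.
    @Override
    public int hashCode() {
        return value.hashCode();
    }

    // Adds each word to the set and returns the number of comparisons used.
    static long countAdds(Set<CountingKey> set, String[] words) {
        comparisons = 0;
        for (String w : words) {
            set.add(new CountingKey(w));
        }
        return comparisons;
    }

    public static void main(String[] args) {
        String[] words = {"d", "b", "a", "c", "b"};
        System.out.println("TreeSet: " + countAdds(new TreeSet<CountingKey>(), words));
        System.out.println("HashSet: " + countAdds(new HashSet<CountingKey>(), words));
    }
}
```

The exact counts depend on the library implementation and version, so treat them as measurements rather than fixed values.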

To do the homework you should do the following steps:

  1. In the directory ../public/ProgHomework2, there is a main program MainHw2.  MainHw2 shows examples of how to use the Java classes HashSet and TreeSet.  You should learn how to use these classes if you are not already familiar with them: good ways to learn this are to look at the appendix in the text book and to read the online Sun Java documentation at www.javasoft.com.  MainHw2 is very similar to the former MainHw1, but with several enhancements to make it a little easier to use.

        Important: You should get the new class CountComparable from the same directory: it overrides hashCode() and must be used for this assignment in place of the old CountComparable class.

        There are classes HashSetSeparateChaining and HashMapSeparateChaining in the same directory.  You should copy these over to your own directory for use in comparing your implementation's efficiency with the efficiency of a hash table based on separate chaining.  For the "SeparateChaining" classes, copy all the .class files (there are six of them) to your own directory.

       Documentation for these classes in HTML format is available from the directory ../public/ProgHomework1, or on the web via ftp, at http://math.ucsd.edu/~sbuss/Math176/ProgHomework1/, or go directly to the following HTML files for documentation:  MainHw2.html, CountComparable.html, HashSetLinear.html, HashSetSeparateChaining.html, and HashMapSeparateChaining.html.

        The program MainHw1 should provide you with a good skeleton for a main program for testing your Hash Table with Linear Probing implementation.

        Later, you will need to read commands from a file to gather statistics on the behavior of your HashSetLinear, and on the Java classes of red-black trees and hash sets with separate chaining.  Look at the program MainHw1IO from programming assignment #1 which shows how to read from files and how to parse an input line into tokens with a StringTokenizer.
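
As a rough illustration of reading lines and splitting them with a StringTokenizer, the following sketch parses command lines in the format described in step 4.  It is not the supplied MainHw1IO; the class CommandReader and its methods are hypothetical, and a StringReader stands in for the data file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.StringTokenizer;

// Hypothetical helper, not part of the supplied MainHw1IO.
class CommandReader {
    // Splits a line such as "A hello" into {"A", "hello"};
    // returns null for blank or malformed lines.
    static String[] parseLine(String line) {
        StringTokenizer st = new StringTokenizer(line);
        if (st.countTokens() < 2) {
            return null;
        }
        String op = st.nextToken();
        String word = st.nextToken();
        return new String[] { op, word };
    }

    // Counts the well-formed commands available from the reader.
    static int countCommands(BufferedReader in) throws IOException {
        int n = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (parseLine(line) != null) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        // For a real run, wrap a java.io.FileReader for the data file instead.
        BufferedReader in = new BufferedReader(
                new StringReader("A cat\nD dog\n\nA cat\n"));
        System.out.println(countCommands(in));  // prints 3
    }
}
```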

  2. Write and debug your HashSetLinear class and iterator.  At first, do not try to implement any remove methods.  The iterator must be implemented as an inner class named HslIterator.

       
    Use the built-in Java method hashCode() (that is, k.hashCode() for a key k) to generate hash codes.  When you rehash a table of size S, make the new table have size 2*S+1.
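
One way the hashing arithmetic could look is sketched below.  Only the use of hashCode() and the 2*S+1 growth rule come from the assignment; the sign-bit mask used to avoid negative indices, the class name ProbeMath, and the table size 11 in the demo are assumptions made for this example.

```java
// Hypothetical helper illustrating index computation and table growth.
class ProbeMath {
    // Maps a key's hashCode() to a slot in [0, capacity).
    // Masking the sign bit avoids negative results from % (an assumed choice).
    static int homeSlot(Object k, int capacity) {
        return (k.hashCode() & 0x7fffffff) % capacity;
    }

    // Linear probing: the next slot to examine, wrapping around at the end.
    static int nextSlot(int slot, int capacity) {
        return (slot + 1) % capacity;
    }

    // Size of the new table when a table of size s is rehashed.
    static int grownCapacity(int s) {
        return 2 * s + 1;
    }

    public static void main(String[] args) {
        int capacity = 11;  // arbitrary size for the demo
        System.out.println(homeSlot("hello", capacity));
        System.out.println(nextSlot(capacity - 1, capacity));  // wraps to 0
        System.out.println(grownCapacity(capacity));           // 23
    }
}
```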

  3. Extend your HashSetLinear class to support the remove operations.  Implement non-lazy deletion as described in class.  The algorithm is also available online.
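
For orientation only, here is a sketch of one common textbook way to delete from a linear-probing table without leaving a marker: empty the slot, then re-insert the keys in the rest of the cluster.  It is not necessarily the algorithm described in class, which you should follow if it differs.  The class EagerDeleteDemo works on a bare array, and it assumes the table always has at least one empty slot.

```java
// Hypothetical sketch on a bare array; the algorithm from class may differ.
class EagerDeleteDemo {
    static int home(Object k, int cap) {
        return (k.hashCode() & 0x7fffffff) % cap;
    }

    // Inserts k by linear probing; assumes a free slot exists and k is absent.
    static void insert(Object[] table, Object k) {
        int i = home(k, table.length);
        while (table[i] != null) {
            i = (i + 1) % table.length;
        }
        table[i] = k;
    }

    // Returns the slot holding k, or -1 if an empty slot is reached first.
    static int find(Object[] table, Object k) {
        int i = home(k, table.length);
        while (table[i] != null) {
            if (table[i].equals(k)) {
                return i;
            }
            i = (i + 1) % table.length;
        }
        return -1;
    }

    // Removes k, then re-inserts every key in the rest of the cluster so
    // that no probe sequence is cut off by the newly emptied slot.
    static boolean remove(Object[] table, Object k) {
        int i = find(table, k);
        if (i < 0) {
            return false;  // not present: table unchanged
        }
        table[i] = null;
        int j = (i + 1) % table.length;
        while (table[j] != null) {
            Object moved = table[j];
            table[j] = null;
            insert(table, moved);
            j = (j + 1) % table.length;
        }
        return true;
    }

    public static void main(String[] args) {
        Object[] table = new Object[7];
        insert(table, Integer.valueOf(0));
        insert(table, Integer.valueOf(7));   // collides with 0
        insert(table, Integer.valueOf(14));  // collides again
        System.out.println(remove(table, Integer.valueOf(7)));
        System.out.println(find(table, Integer.valueOf(14)) >= 0);
    }
}
```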

  4. This step should not be left until the very last minute!  Once you have completed step 3 (or step 2, if you are unable to finish step 3), gather statistics.   There is a file named hw2Data in the same directory ../public/ProgHomework2.  This contains a series of lines with the format: "A xxxxx" or "D xxxxx" where "xxxxx" denotes a string of symbols.  These lines are commands to either add or delete the corresponding string from the set.  (If you have not implemented remove methods, then just skip over the delete commands.)  Sometimes, the delete commands will ask to delete a word that is not present (this happens about 10% of the time): in this case the set is not to be changed.  Sometimes an add command will ask you to add a string that is already present in the set: again, in this case the set is not to be changed, since sets do not support the presence of duplicate objects.
        Run these commands on the data structures (1) HashSetSeparateChaining, (2) TreeSet, and (3) your implementation of HashSetLinear.  Do this for the first N add commands (and the delete commands which appear before the N-th add command), letting N equal 100, then 1000, then 10000, then 100000, then 1000000 --- but stop whenever the algorithms become so slow as to require more than 5-10 minutes of total running time.  You can use larger data sets by increasing the heap size of the Java virtual machine, which is controlled by the -Xmx command line option to java.  (Run java -help and java -X for information on the Java virtual machine command line options.)
        In addition, for the larger size (500,000 or 1,000,000 if possible), run the test for four different load factors --- select load factors that range from about 0.2 to 0.75 or even higher.  Include the results of these tests in your table, or make a second table with the results of these tests.
        Write a short report or tables giving for each test: the number of adds attempted, the number of adds which failed due to trying to insert duplicates, the number of deletes attempted, the number of deletes which failed since the element was not present, the total number of comparisons performed, the average number of comparisons per attempt to add or delete (i.e., per line processed from the file), and the number of collisions in total and the average number of collisions per operation.  You may include additional information in the table if you wish, but you must include at least the items mentioned.  Your tables/report should be prepared as a plain text file.
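
The counts requested for the report can be tallied roughly as in the sketch below.  The class OpStats and its field names are hypothetical; how the comparison and collision totals are obtained (from CountComparable and from your own hash table) is left open here and passed in as a plain number.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical tally class; field names are illustrative.
class OpStats {
    int adds, failedAdds, deletes, failedDeletes;

    // Applies one command ('A' or 'D') to the set and updates the counters.
    void apply(Set<String> set, char op, String word) {
        if (op == 'A') {
            adds++;
            if (!set.add(word)) {
                failedAdds++;      // duplicate: set unchanged
            }
        } else if (op == 'D') {
            deletes++;
            if (!set.remove(word)) {
                failedDeletes++;   // not present: set unchanged
            }
        }
    }

    // Average of a total (comparisons or collisions) per line processed.
    double perOperation(long total) {
        int ops = adds + deletes;
        return ops == 0 ? 0.0 : (double) total / ops;
    }

    public static void main(String[] args) {
        OpStats stats = new OpStats();
        Set<String> set = new TreeSet<String>();
        stats.apply(set, 'A', "cat");
        stats.apply(set, 'A', "cat");
        stats.apply(set, 'D', "dog");
        stats.apply(set, 'D', "cat");
        System.out.println(stats.adds + " adds, " + stats.failedAdds + " failed");
        System.out.println(stats.deletes + " deletes, " + stats.failedDeletes + " failed");
    }
}
```

To stop after the first N add commands, a caller could check stats.adds before applying each further add.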

  5. You must turn in:

    1. A text file, named README, with the results from step 4.  Your report must also include a description (a short paragraph) describing how much of the homework you completed, and any special circumstances regarding your homework solution.
    2. A file HashSetLinear.java with your source code.  This file will be graded by an automated procedure, so it is important that it compiles on the ieng9 machine and that you use exactly the specified interface for your classes and methods.  This one file should contain all your code.  It should not write any diagnostics to stderr or stdout as output.
    3. The "turnin" procedure is as follows.  You must create two files: one named HashSetLinear.java and the other named README.   Both files should be text files.  Lines in the README file should be at most 80 columns.  Place both files in a directory, and from that directory give the command bundleP2.   ("bundleP2" stands for bundle up programming assignment #2).  This command will check that the required files are present and then turn them in.
    4. If you later run bundleP2 again, it will overwrite all of your previously submitted homework.  (So: do not turn in the files one at a time!)
    5. If you get error messages that appear not to be your fault, please email me immediately at sbuss@math.ucsd.edu.
    6. Just in case something goes wrong with the turnin procedures:  Keep your files on ieng until you have received your programming assignment grade.  In addition, do not modify them so that we can verify the last modified dates if necessary.
  6. Testing suggestions.  We will provide you with a program that checks whether your class definitions are correct.  It will be called CheckHslSignature.  You should be able to test your HashSetLinear by running the same operations on it and on a HashSetSeparateChaining dictionary or a TreeSet (red-black tree) from the Java library, and checking whether they give the same results.  However, note that the iterators will return the elements in different orders.  (If you are using large sets, try sorting them.  Alternatively, compare the sums of the hash codes of the objects returned by the iterators.)
        I am looking into the possibility of writing code that will do a more thorough testing of your code (like the code used to auto-grade the first programming assignment).  The code still needs to be more robust than what I have written so far, so it is not yet available.  Watch this space for further announcements.
        It is OK to do your program development on a machine other than ieng9; however, the final version must run on ieng9 and it would behoove you to allow a day or two extra time to make sure it runs there.  It is also OK to report your results in step 4 as run on another machine, but in this case, please report also on the machine type and especially on how much RAM it has.
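
The hash-code-sum comparison suggested in item 6 could be sketched as follows.  The class SetCompare and its method names are made up for this example, and library sets stand in for your own class.  Note that matching sums do not prove that two sets are equal; they only make a mismatch unlikely to go unnoticed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for order-independent comparison of two sets.
class SetCompare {
    // Sums hashCode() over everything the iterator returns.
    static long hashSum(Set<?> s) {
        long sum = 0;
        for (Iterator<?> it = s.iterator(); it.hasNext(); ) {
            sum += it.next().hashCode();
        }
        return sum;
    }

    // Cheap check: same size and same hash-code sum.
    static boolean looksEqual(Set<?> a, Set<?> b) {
        return a.size() == b.size() && hashSum(a) == hashSum(b);
    }

    public static void main(String[] args) {
        Set<String> reference = new TreeSet<String>(Arrays.asList("a", "b", "c"));
        Set<String> candidate = new HashSet<String>(Arrays.asList("c", "a", "b"));
        System.out.println(looksEqual(reference, candidate));
        candidate.remove("b");
        System.out.println(looksEqual(reference, candidate));
    }
}
```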

  7. Extra credit:  For 10 points extra credit, devise a method for implementing remove() in the iterator.  Describe your method in the README file.  Implement the remove.  What is the big-O runtime of your remove() code?  How much extra runtime is needed?  How much extra memory does it use?

  8. All programming work must be your own.  You may get help from TAs, from fellow students, etc., but must do your own work, and especially must "internalize" all advice, i.e., be able to understand everything well enough that you could re-implement it on your own.  In particular, you should not use code that is taken verbatim from any source or that is a straightforward translation of someone else's code.  More information on what kinds of assistance are permitted can be found in the Academic Integrity Guidelines.  If you are not sure what kind of outside assistance is allowed, discuss it with me or a TA.

  9. Grading: The grade for your programming assignment will be based on the following (percentages and categories are preliminary and I reserve the right to change them based on the class performance).

    1. Correctness of the class definitions and method specifications.   No compilation errors.  Programming style (~ 10%).
    2. Table of data in your report is complete and numbers appear to be correct.   (~20%)
    3. Add and iterator and size and contains are correctly implemented. (~50%).
    4. HashSetLinear.remove() is correctly implemented (~20%)
    5. The iterator remove is well-designed, explained and correctly implemented. (~10 %)  Extra credit -- warning: this is a lot more work per point than the non-extra-credit part.  Please note that complicated solutions or inefficient solutions that use too much time/space will not get credit, and should not be implemented (you will get only a portion of the points if you implement such a method.)