Class MainHw4

java.lang.Object
  |
  +--MainHw4

public class MainHw4
extends java.lang.Object

This is a skeleton of a main program for implementing a mini-search engine on text files. The provided code takes care of the following:

For the Math 176 programming assignment you will need to implement the following functions (see the programming assignment web page for more information):


Constructor Summary
MainHw4()
           
 
Method Summary
static boolean getCommand()
          Reads in one of three commands.
static boolean getTwoSearchWords()
          Read two search words.
static void main(java.lang.String[] args)
           
static void prettyPrint(java.lang.String s)
          Prints out a long string in 80 column width.
static void printExtractWithTwoWords(int docNo, int posA, int posB)
          Prints a two-line extract from a file, containing the words at the indicated positions.
static void printFrequentWords()
          THIS IS SOME OLD CODE that I used to gather the common word file.
static void readWordsFromFiles()
          This is a demo routine that shows how to read words one at a time from the files.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

MainHw4

public MainHw4()
Method Detail

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException

readWordsFromFiles

public static void readWordsFromFiles()
                               throws java.io.IOException
This is a demo routine that shows how to read words one at a time from the files. Another demo routine with a little more substance is the next routine, printFrequentWords.
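
As a rough sketch of what reading words one at a time can look like (this is not the skeleton's actual code), the class below pulls words from any Reader. The class name WordReaderSketch, the method nextWord, and the choice to treat a word as a maximal run of letters converted to lower case are all assumptions made for illustration; the skeleton may define words differently.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class WordReaderSketch {
    // Returns the next maximal run of letters, lower-cased,
    // or null when the input is exhausted. (Hypothetical helper.)
    static String nextWord(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isLetter(c)) {
                sb.append(Character.toLowerCase((char) c));
            } else if (sb.length() > 0) {
                break;              // a non-letter ends the current word
            }
            // otherwise: skip non-letters that come before the word
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        Reader in = new StringReader("Call me Ishmael. Some years ago...");
        String word;
        while ((word = nextWord(in)) != null) {
            System.out.println(word);
        }
    }
}
```

For a real file, the StringReader would be replaced by a BufferedReader wrapped around a FileReader, so that the character-by-character reads stay cheap.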

printFrequentWords

public static void printFrequentWords()
                               throws java.io.IOException
THIS IS SOME OLD CODE that I used to gather the common word file. Any word that appeared more than 2000 times out of 5,000,000+ words was stored to a file. I then edited out by hand a lot of non-common words that appeared more than 2000 times.

The result was that 5,000,000 occurrences of words were reduced to 3,500,000+ occurrences of non-common words, about a 1/3 reduction in the words that need to be considered.

Note that I used a HashMap to keep a counter associated with each word. I had to store Integers as values, rather than storing ints as values, since ints are not Objects in Java.
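
The counting idea described above can be sketched as follows. This is an illustration only, not the original printFrequentWords code: the class name WordCountSketch and the methods countWords and frequentWords are hypothetical, and the threshold is passed as a parameter rather than fixed at 2000. Current Java converts between int and Integer automatically, but the values stored in the map are still Integer objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordCountSketch {
    // Builds a map from each word to the number of times it occurs.
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            Integer old = counts.get(w);               // null if w is new
            counts.put(w, old == null ? 1 : old + 1);  // boxed back to Integer
        }
        return counts;
    }

    // Returns, in sorted order, the words that occur more than threshold times.
    static List<String> frequentWords(Map<String, Integer> counts, int threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        String[] words = {"the", "whale", "the", "sea", "the", "whale"};
        Map<String, Integer> counts = countWords(words);
        System.out.println(frequentWords(counts, 1));
    }
}
```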


getTwoSearchWords

public static boolean getTwoSearchWords()
                                 throws java.io.IOException
Read two search words. Rejects common words and words that are too short. As currently coded, it also rejects duplicate words, but you may optionally change this if you wish.

Returns values in wordA and wordB and in wordAwildCard and wordBwildCard.
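
The three rejection rules can be illustrated with a small sketch. It is not the skeleton's implementation: the class SearchWordCheckSketch, the method rejectReason, the minimum length of 3, and the sample common-word set are all assumptions, and wildcard handling is left out entirely.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SearchWordCheckSketch {
    // Hypothetical cutoff; the skeleton's actual minimum length is not stated.
    static final int MIN_LENGTH = 3;

    // Returns null if the pair is acceptable, otherwise the reason it is rejected.
    static String rejectReason(String a, String b, Set<String> commonWords) {
        for (String w : new String[] {a, b}) {
            if (w.length() < MIN_LENGTH) {
                return "too short: " + w;
            }
            if (commonWords.contains(w)) {
                return "common word: " + w;
            }
        }
        if (a.equals(b)) {
            return "duplicate words";   // this check is the optional one
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the", "and", "that"));
        System.out.println(rejectReason("the", "whale", common));
        System.out.println(rejectReason("whale", "whale", common));
        System.out.println(rejectReason("whale", "harpoon", common));
    }
}
```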


getCommand

public static boolean getCommand()
                          throws java.io.IOException
Reads in one of three commands.

prettyPrint

public static void prettyPrint(java.lang.String s)
Prints out a long string in 80 column width. Tries to print as much per line as possible and to break only at spaces. I have used this to print a long list of document numbers separated by spaces (for purposes of using the 'd' command). I also used it for printing extracts from the files (for implementing the 'x' command).
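
A greedy line-filling routine of the kind described above might look like the sketch below. It is an illustration and not the actual prettyPrint code: the class WrapSketch and the method wrap are hypothetical, the width is a parameter instead of a fixed 80, and the lines are returned as a list rather than printed so that the result is easy to inspect. A word longer than the width is placed on a line of its own and allowed to overflow.

```java
import java.util.ArrayList;
import java.util.List;

class WrapSketch {
    // Packs as many words as fit into each line, breaking only at whitespace.
    static List<String> wrap(String s, int width) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : s.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;                       // happens only for blank input
            }
            // Start a new line if adding a space and this word would overflow.
            if (line.length() > 0 && line.length() + 1 + word.length() > width) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(word);
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        StringBuilder docs = new StringBuilder();
        for (int i = 1; i <= 60; i++) {
            docs.append(i).append(' ');         // a long list of document numbers
        }
        for (String l : wrap(docs.toString(), 80)) {
            System.out.println(l);
        }
    }
}
```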

printExtractWithTwoWords

public static void printExtractWithTwoWords(int docNo,
                                            int posA,
                                            int posB)
Prints a two-line extract from a file, containing the words at the indicated positions. The extract is preceded and followed by ellipses (...) and is printed with prettyPrint. The extract will fit into two or three 80-column printed lines.
Parameters:
docNo - The document or file number in which the pair of words appear.
posA - The position of word A in the file (measured in bytes).
posB - The position of word B in the file (measured in bytes).