Class WordGrabber

java.lang.Object
  |
  +--WordGrabber

public class WordGrabber
extends java.lang.Object

This class is responsible for reading words one at a time from a set of files. The tasks it is responsible for include:


Field Summary
static java.io.FileFilter txtFileFilter
          This is the default file filter.
 
Constructor Summary
WordGrabber()
          Start a new WordGrabber object.
WordGrabber(java.lang.String directory, java.lang.String commonWordsFile)
          Start a new WordGrabber, but use a different directory than the default one.
WordGrabber(java.lang.String directory, java.lang.String commonWordsFile, java.io.FileFilter filter)
          Start a new WordGrabber.
 
Method Summary
 java.lang.String getFileInfo(int fileNum)
          Gets identifying information about the file's contents, such as the file name and in some cases, its title and author.
 boolean isCommon(java.lang.String s)
          Tests whether a word is "common", and should be ignored.
 boolean isIgnored(java.lang.String s)
          Tests whether a word will be ignored either because of being too short or because of being a common word.
 java.lang.String nextWord()
          Returns the next non-common word from the file.
 int posNextWord()
          Tests whether the current file has a next word to be processed.
 java.lang.String printableExtract(int fileNum, int start, int end)
          Gets a small portion of a file and formats it as a String suitable for printing.
 java.lang.String printableExtractLen(int fileNum, int mid, int len)
          Gets a small portion of a file and formats it as a String suitable for printing.
 int startNextFile()
          Start reading from a new file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

txtFileFilter

public static final java.io.FileFilter txtFileFilter
This is the default file filter. It is a FileFilter that accepts files with the extension .txt. The extension is also permitted to be in uppercase (.TXT).
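For illustration, a filter with the behavior described above could look roughly like the following. This is a minimal sketch, not the class's actual source: the class name TxtFilterSketch and the constant TXT_FILTER are hypothetical, and the check looks only at the file name.

```java
import java.io.File;
import java.io.FileFilter;

// Hypothetical stand-in for a filter like txtFileFilter (assumption, not the real source).
class TxtFilterSketch {
    // Accepts any path whose name ends in ".txt" or ".TXT".
    static final FileFilter TXT_FILTER = new FileFilter() {
        @Override
        public boolean accept(File f) {
            String name = f.getName();
            return name.endsWith(".txt") || name.endsWith(".TXT");
        }
    };

    public static void main(String[] args) {
        System.out.println(TXT_FILTER.accept(new File("notes.txt"))); // prints true
        System.out.println(TXT_FILTER.accept(new File("NOTES.TXT"))); // prints true
        System.out.println(TXT_FILTER.accept(new File("notes.doc"))); // prints false
    }
}
```

A custom filter written in the same shape can be passed as the third constructor argument.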
Constructor Detail

WordGrabber

public WordGrabber()
            throws java.io.FileNotFoundException,
                   java.io.IOException
Start a new WordGrabber object. It reads every file in the default directory that has the extension .txt or .TXT.

WordGrabber

public WordGrabber(java.lang.String directory,
                   java.lang.String commonWordsFile)
            throws java.io.FileNotFoundException,
                   java.io.IOException
Start a new WordGrabber, but use a different directory than the default one.
Parameters:
directory - the name of a directory which will be recursively searched for files ending with .txt or .TXT. It is also permissible to use the name of a file, in which case only that one file is used.
commonWordsFile - the name of a file containing a list of common words that will be ignored. All words of 1, 2, or 3 symbols are automatically treated as common words. If this parameter is null, no file of common words is used.

WordGrabber

public WordGrabber(java.lang.String directory,
                   java.lang.String commonWordsFile,
                   java.io.FileFilter filter)
            throws java.io.FileNotFoundException,
                   java.io.IOException
Start a new WordGrabber.
Parameters:
directory - the name of a directory which will be recursively searched for files ending with .txt or .TXT. It is also permissible to use the name of a file, in which case only that one file is used. If null, it reverts to the default directory.
filter - a FileFilter which sets which files should be read. (Intended for use by Java experts only.) If null, it reverts to the default file filter.
Method Detail

startNextFile

public int startNextFile()
                  throws java.io.IOException
Start reading from a new file. Throws an IllegalStateException if the previous file (if any) was not completely read.
Returns:
an integer which is the number of the file. This integer can be used to print or read excerpts from the file later on. The returned integer is -1 if there is no next file to read.
Throws:
IllegalStateException - if a previous file was being read and is not yet finished. This condition can be checked with posNextWord.

posNextWord

public int posNextWord()
Tests whether the current file has a next word to be processed. It returns the starting position of the word in the file. The position is measured in characters (bytes), starting with 0. If there is no next word, it returns -1. The file is closed if there is no word left in the file.

To start reading the next file, you should call startNextFile.

Common words are skipped over and are not returned (see nextWord).

Returns:
a non-negative integer if the next call to nextWord will return a word. This integer is the starting position of the word in the file. Otherwise returns -1.

nextWord

public java.lang.String nextWord()
                          throws java.io.IOException
Returns the next non-common word from the file. Common words such as "the", "if", "and", etc., are skipped over. Words of at most three letters are always treated as common. Otherwise, the words are checked against the set of common words read in from the common words file. Hyphenated words are treated as two words.

You should always call posNextWord() before calling nextWord() to get the position of the word.

The returned word is in all lowercase (no matter how it appears in the file).

Returns:
a word.
Throws:
java.util.NoSuchElementException - if there is no next non-common word in the current file. In this case, the file is closed.
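The calling pattern described for posNextWord and nextWord can be sketched as follows. Because WordGrabber itself is not reproduced here, the example uses MiniGrabber, a hypothetical in-memory stand-in that mimics the documented contract over a single string: it skips words of three letters or fewer, splits at hyphens, and lowercases the result. It does not model the common words file or multiple files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in mimicking the documented posNextWord/nextWord contract.
// It is NOT the real WordGrabber and reads from a string instead of files.
class MiniGrabber {
    private final String text;
    private int cursor = 0;      // where scanning resumes
    private int wordStart = -1;  // start of the located, not yet consumed word
    private int wordEnd = -1;    // one past the end of that word

    MiniGrabber(String text) { this.text = text; }

    // Returns the start position of the next word longer than 3 letters, or -1.
    int posNextWord() {
        if (wordStart >= 0) return wordStart;
        int i = cursor;
        while (i < text.length()) {
            while (i < text.length() && !Character.isLetter(text.charAt(i))) i++;
            int s = i;
            while (i < text.length() && Character.isLetter(text.charAt(i))) i++;
            if (i - s > 3) { wordStart = s; wordEnd = i; return s; }
        }
        cursor = i;
        return -1;
    }

    // Returns the located word in lowercase; throws if there is none.
    String nextWord() {
        if (posNextWord() < 0) throw new NoSuchElementException("no next word");
        String w = text.substring(wordStart, wordEnd).toLowerCase();
        cursor = wordEnd;
        wordStart = -1;
        return w;
    }
}

class ReadLoopSketch {
    // Collects "position:word" pairs by calling posNextWord before each nextWord.
    static List<String> collect(String text) {
        MiniGrabber g = new MiniGrabber(text);
        List<String> out = new ArrayList<>();
        int pos;
        while ((pos = g.posNextWord()) != -1) {
            out.add(pos + ":" + g.nextWord());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [4:well, 9:known, 15:whale]
        System.out.println(collect("The well-known Whale is big"));
    }
}
```

The loop compares against -1 rather than testing for a value greater than zero, since positions start at 0 and a word at the very beginning of a file would be reported at position 0.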

isCommon

public boolean isCommon(java.lang.String s)
Tests whether a word is "common", and should be ignored. It does not automatically return true for words of fewer than 4 symbols, even though these words are automatically treated as common words. Use isIgnored() if you want to test that condition.

isIgnored

public boolean isIgnored(java.lang.String s)
Tests whether a word will be ignored either because of being too short or because of being a common word.
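The difference between isCommon and isIgnored can be sketched as below. This is an illustration under stated assumptions, not the class's source: the word list in COMMON is hypothetical (the real list comes from the common words file), isCommon is assumed to check list membership only, and input is assumed to be lowercase already.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class IgnoreRuleSketch {
    // Hypothetical common-word list; the real one is read from the common words file.
    static final Set<String> COMMON =
            new HashSet<>(Arrays.asList("which", "there", "about"));

    // Assumed behavior of isCommon: list membership only, no length rule.
    static boolean isCommon(String s) {
        return COMMON.contains(s);
    }

    // Assumed behavior of isIgnored: 3 symbols or fewer, or on the common list.
    static boolean isIgnored(String s) {
        return s.length() <= 3 || isCommon(s);
    }

    public static void main(String[] args) {
        System.out.println(isCommon("cat"));    // prints false: short, but not on the list
        System.out.println(isIgnored("cat"));   // prints true: too short
        System.out.println(isIgnored("which")); // prints true: on the list
        System.out.println(isIgnored("whale")); // prints false
    }
}
```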

getFileInfo

public java.lang.String getFileInfo(int fileNum)
Gets identifying information about the file's contents, such as the file name and in some cases, its title and author.

Usual format for the returned information is:

File: the file name.
"Title" by Author

The second line of the string will be missing if title and author are unknown.


printableExtract

public java.lang.String printableExtract(int fileNum,
                                         int start,
                                         int end)
                                  throws java.lang.IllegalArgumentException
Gets a small portion of a file and formats it as a String suitable for printing. The usual usage of this method is as follows:

prettyPrint (printableExtract(fileNum, start, end));

Parameters:
fileNum - the file number, as obtained from startNextFile
start - the start position in the file, measured in characters.
end - the end position in the file, measured in characters.
Returns:
The extract from the file. Carriage returns, line feeds and other control characters are replaced by spaces.
Throws:
java.lang.IllegalArgumentException - if the requested portion is more than 256 characters long, or if the fileNum is invalid.
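The formatting rules described for printableExtract can be illustrated with a small sketch that works on an in-memory string instead of a numbered file. The class name ExtractSketch and the contents parameter are hypothetical. Clamping out-of-range positions to the bounds of the text is an assumption made for this sketch; the documentation does not say how such positions are handled.

```java
class ExtractSketch {
    static final int MAX_LEN = 256; // documented upper bound on the extract length

    // Returns contents[start, end) with control characters replaced by spaces.
    static String printableExtract(String contents, int start, int end) {
        if (end - start > MAX_LEN) {
            throw new IllegalArgumentException(
                    "requested portion is longer than " + MAX_LEN + " characters");
        }
        // Assumption: positions outside the text are clamped rather than rejected.
        int s = Math.max(0, start);
        int e = Math.min(contents.length(), end);
        StringBuilder sb = new StringBuilder();
        for (int i = s; i < e; i++) {
            char c = contents.charAt(i);
            // Carriage returns, line feeds, tabs and other control characters become spaces.
            sb.append(Character.isISOControl(c) ? ' ' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "ab  cd ef" (the \r\n pair becomes two spaces, the tab one space)
        System.out.println(printableExtract("ab\r\ncd\tef", 0, 9));
    }
}
```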

printableExtractLen

public java.lang.String printableExtractLen(int fileNum,
                                            int mid,
                                            int len)
Gets a small portion of a file and formats it as a String suitable for printing. This is merely a helper function which invokes printableExtract(fileNum, start, end) where

start = mid-len/2
end = start+len

Parameters:
mid - the middle position of the string to extract
len - the length of string to extract
fileNum - the file number, as obtained from startNextFile
Returns:
The extract from the file. Carriage returns, line feeds and other control characters are replaced by spaces.
Throws:
java.lang.IllegalArgumentException - if the requested portion is more than 256 characters long, or if the fileNum is invalid.
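As a worked example of the start and end formulas given for printableExtractLen, the sketch below computes the window for a few inputs. The class and method names are hypothetical. It assumes Java int arithmetic, so len/2 is rounded down for odd lengths.

```java
class ExtractLenSketch {
    // Returns {start, end} from the documented formulas: start = mid - len/2, end = start + len.
    static int[] window(int mid, int len) {
        int start = mid - len / 2; // integer division
        int end = start + len;
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        int[] a = window(100, 40);
        System.out.println(a[0] + ".." + a[1]); // prints 80..120
        int[] b = window(100, 41);
        System.out.println(b[0] + ".." + b[1]); // prints 80..121
        int[] c = window(5, 40);
        System.out.println(c[0] + ".." + c[1]); // prints -15..25
    }
}
```

The last case shows that a mid position near the beginning of a file yields a negative start. The documentation does not state how printableExtract treats such a value, so callers may want to choose mid and len so that start is not negative.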