This programming project will require you to write a spell checker with some "correction" features. It will be assigned in various phases throughout the semester. Each phase requires the completion of the previous phases including correction of any errors reported during the grading process. The idea is to build up a significant program through an incremental process. The requirements for various phases will be added to this page as the semester progresses.
Throughout the entire assignment, it is expected that you will work alone, i.e. the assignment falls under the "Individual Assignment, No Collaboration" model as outlined in the department's Academic Policies and Procedures document. As you are maturing programmers, your assignment will be graded not only on correctness and documentation, but also on structure and style. You are encouraged to discuss programming details with the instructor.
In its early stages, your program will have two files: the document being checked, and the dictionary. The document will obviously vary as different files are checked, but the dictionary will be constant throughout. In this phase, you are to open each of the files and add the various words to a TreeSet. You may assume that the dictionary file has a constant name, but you must prompt the user for the name of the document to be checked. In the case of the dictionary file, you may assume that each word is on its own line, but in the case of the document being checked, you may make no such assumption.
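As a rough illustration of the reading step described above, here is a minimal sketch. The class name WordLoader, the method names, and the file name dictionary.txt are placeholders chosen for this example and are not part of the assignment; it also assumes Java 11 or later.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeSet;

class WordLoader {
    // Dictionary file: we may assume one word per line.
    static Set<String> readDictionary(Path file) throws IOException {
        Set<String> words = new TreeSet<>();
        for (String line : Files.readAllLines(file)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Document file: no layout assumption, so read whitespace-separated tokens.
    static Set<String> readDocument(Path file) throws IOException {
        Set<String> words = new TreeSet<>();
        try (Scanner in = new Scanner(file)) {
            while (in.hasNext()) {
                words.add(in.next());
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // "dictionary.txt" is a hypothetical constant name for the dictionary file.
        Set<String> dictionary = readDictionary(Path.of("dictionary.txt"));
        try (Scanner console = new Scanner(System.in)) {
            System.out.print("Document to check: ");
            Set<String> document = readDocument(Path.of(console.nextLine().trim()));
            System.out.println(dictionary.size() + " dictionary words, "
                    + document.size() + " distinct document words");
        }
    }
}
```

Because a TreeSet keeps its elements sorted and discards duplicates, a word that appears many times in the document is stored only once.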
Upon completion, you are to email your file (or files) to dlevine@cs.sbu.edu. The assignment is due Monday, February 9, at noon. Be sure to get the email address correct; Outlook tends to leave off the "cs." Assignments mailed to the wrong address risk being considered late; I will consider them to arrive only when I actually check the other address. I do not do so continually.
In this phase, you are to add some more features to your spell checker. In particular, it should now "process" any words from the document file before considering them. By "process", we mean that the words should be converted to lower case and should have all leading and trailing punctuation removed. [Hint: the use of a separate function to do this processing is strongly encouraged.] Also, words should only be placed into the document set if they are misspelled, i.e. if they are not in the dictionary. Finally, you should print the list of misspelled words, one per line, on the console at the end of the program.
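One possible shape for the processing function is sketched below. It assumes that "punctuation" means any character that is neither a letter nor a digit, and it leaves punctuation inside a word (such as an apostrophe) untouched; both choices are assumptions of this example. The names WordCleaner, process, and misspelled are hypothetical.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class WordCleaner {
    // Lower-case the token and trim leading/trailing non-alphanumeric characters.
    static String process(String raw) {
        String word = raw.toLowerCase(Locale.ROOT);
        int start = 0;
        int end = word.length();
        while (start < end && !Character.isLetterOrDigit(word.charAt(start))) {
            start++;
        }
        while (end > start && !Character.isLetterOrDigit(word.charAt(end - 1))) {
            end--;
        }
        return word.substring(start, end);
    }

    // Collect the processed words that are not in the dictionary.
    static Set<String> misspelled(Iterable<String> rawWords, Set<String> dictionary) {
        Set<String> result = new TreeSet<>();
        for (String raw : rawWords) {
            String word = process(raw);
            // A token made only of punctuation processes to "" and is skipped.
            if (!word.isEmpty() && !dictionary.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }
}
```

Printing the result one per line is then a simple loop over the returned set, e.g. `for (String w : misspelled(words, dictionary)) System.out.println(w);`.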
Upon completion, you are to email your file (or files) to dlevine@cs.sbu.edu. The assignment is due Wednesday, February 16, at noon. Be sure to get the email address correct; Outlook tends to leave off the "cs." Assignments mailed to the wrong address will be returned with the request that you send them to the correct address. They will count as submitted ONLY when received at the correct address.
In this phase, you are to add the beginnings of corrections to your spell checker. First, you must write a method that converts a word to its "canonical root form" (see below). Then, you are to build your dictionary as a Map, with the keys in the map being canonical forms and the values being Sets of words that share that canonical form. This change will require that you modify two sections of code: that which builds the dictionary AND that which checks to see if a word is in the dictionary. Finally, for each misspelled word, you are to print out the word and all possible corrections for it. The line below suggests one format:
Misspelled: dag Correct to: {dig, dog, doge, dug}
If no corrections exist, an appropriate message should be printed.
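The Map-of-Sets dictionary described above might be organized roughly as follows. This sketch deliberately takes the canonical-form method as a parameter rather than implementing it. The class and method names are hypothetical, and the vowel-stripping lambda in main is only a stand-in used to make the demonstration run; it is not the canonical root form defined in this assignment.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

class CanonicalDictionary {
    // Key: canonical form. Value: all dictionary words sharing that form.
    private final Map<String, Set<String>> byForm = new TreeMap<>();
    private final UnaryOperator<String> canonical;

    CanonicalDictionary(UnaryOperator<String> canonical) {
        this.canonical = canonical;
    }

    void add(String word) {
        byForm.computeIfAbsent(canonical.apply(word), k -> new TreeSet<>()).add(word);
    }

    // A word is correctly spelled only if it appears in the set for its own form.
    boolean contains(String word) {
        return corrections(word).contains(word);
    }

    Set<String> corrections(String word) {
        return byForm.getOrDefault(canonical.apply(word), Collections.emptySet());
    }

    String report(String word) {
        Set<String> found = corrections(word);
        if (found.isEmpty()) {
            return "Misspelled: " + word + " No corrections found";
        }
        return "Misspelled: " + word + " Correct to: {" + String.join(", ", found) + "}";
    }

    public static void main(String[] args) {
        // Stand-in canonicalizer (an assumption for this demo only): drop vowels.
        CanonicalDictionary dict = new CanonicalDictionary(w -> w.replaceAll("[aeiou]", ""));
        for (String w : new String[] {"dig", "dog", "doge", "dug", "cat"}) {
            dict.add(w);
        }
        System.out.println(dict.report("dag"));
        // prints: Misspelled: dag Correct to: {dig, dog, doge, dug}
    }
}
```

Note that both building the dictionary (add) and checking membership (contains) go through the same canonical-form function, which is why two sections of the earlier code need to change.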
The canonical root form of a word is determined by doing the following:
Note that the order is important. The crf of "people" is "ppl", not "pl". For extra credit, you may apply additional rules based upon phonetic formation, e.g. "ph"->"f", but you must state that you have done so in both your email message and in your program documentation.
Upon completion, you are to email your file (or files) to dlevine@cs.sbu.edu. The assignment is due Tuesday, March 16, at noon. See the warning above about the email address. An additional 10% credit will be given to any assignment received by Monday, March 8 at noon.
In this phase, you are to expand the list of possible corrections to a misspelled word. From the user's point of view, nothing changes about the program except that the list of "possible corrections" is larger than before.
You are to expand the list through the inclusion of words that are "off-by-one" from the canonical form. In other words, if the user types word W (not in the dictionary), then if word X has a canonical form that is "off-by-one" from the canonical form of W, word X will be one of the suggested corrections for W.
You may choose what constitutes "off-by-one" for canonical forms. One possibility is to consider that two canonical forms are off-by-one if they are the same except that two adjacent letters have been swapped. Another is to consider that two forms are "off-by-one" if they are identical except for one letter. Yet a third is to consider that one form is "off-by-one" from another if it can be turned into the other through the insertion or deletion of a single letter. A fourth way is to modify the definition of canonical forms to include "equivalent phonemes" as discussed in class. Your code MUST implement at least one of these mechanisms (or some other one that is approved by your instructor), although it may implement more than one, thereby causing a longer list of corrections to be given.
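The first three definitions above can each be written as a small test on two canonical forms. The following is one possible sketch; the class and method names are made up for this example, and the phoneme-based option is not covered since it depends on material discussed in class.

```java
class OffByOne {
    // True if b equals a with exactly one pair of adjacent letters swapped.
    static boolean adjacentSwap(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int i = 0;
        while (i < a.length() && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        if (i >= a.length() - 1) {
            return false; // identical, or only the last letter differs
        }
        return a.charAt(i) == b.charAt(i + 1)
                && a.charAt(i + 1) == b.charAt(i)
                && a.substring(i + 2).equals(b.substring(i + 2));
    }

    // True if a and b have the same length and differ in exactly one position.
    static boolean oneSubstitution(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int diffs = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diffs++;
            }
        }
        return diffs == 1;
    }

    // True if one form becomes the other by inserting or deleting one letter.
    static boolean oneInsertOrDelete(String a, String b) {
        String shorter = a.length() <= b.length() ? a : b;
        String longer = a.length() <= b.length() ? b : a;
        if (longer.length() - shorter.length() != 1) {
            return false;
        }
        int i = 0;
        while (i < shorter.length() && shorter.charAt(i) == longer.charAt(i)) {
            i++;
        }
        // Skip the extra letter in the longer form; the remainders must match.
        return shorter.substring(i).equals(longer.substring(i + 1));
    }
}
```

For instance, `OffByOne.adjacentSwap("dg", "gd")` and `OffByOne.oneInsertOrDelete("dg", "dgs")` are both true, while none of the three tests considers a form off-by-one from itself.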
You may choose how to effect these changes. One good idea, no matter what you choose, is to keep a set of corrections that is initially empty and to add words to it as you find them. The addAll() method may be particularly helpful here. It may also be worth your while to build, for a given canonical form, a map that connects that form to the set of forms that are "off-by-one" from it. This map may take a while to initialize, but can then be used quite efficiently to produce corrections.
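The "start with an empty set and addAll() as you go" idea could look something like the sketch below. It assumes the dictionary is already a Map from canonical forms to Sets of words, and that the chosen off-by-one rule is supplied as a BiPredicate; the names CorrectionFinder and corrections are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BiPredicate;

class CorrectionFinder {
    // form: canonical form of the misspelled word.
    // dictionary: canonical form -> words sharing that form.
    // offByOne: whichever off-by-one rule has been chosen.
    static Set<String> corrections(String form,
                                   Map<String, Set<String>> dictionary,
                                   BiPredicate<String, String> offByOne) {
        Set<String> found = new TreeSet<>(); // initially empty

        // Words with exactly the same canonical form.
        found.addAll(dictionary.getOrDefault(form, Collections.emptySet()));

        // Words whose canonical form is off-by-one from this one.
        for (Map.Entry<String, Set<String>> entry : dictionary.entrySet()) {
            if (offByOne.test(form, entry.getKey())) {
                found.addAll(entry.getValue());
            }
        }
        return found;
    }
}
```

This version scans every canonical form in the dictionary for each misspelled word. The precomputed map of off-by-one forms mentioned above would replace that scan with a lookup, at the cost of a slower start-up.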
Upon completion, you are to email your file (or files) to dlevine@cs.sbu.edu. Due to the late posting of this file, the assignment is due Saturday, May 8, at 4 p.m. See the warning above about the email address. An additional 10% credit will be given to any assignment received by Wednesday, May 5 at noon.