Linguistics 570

Homework 5

 

For this project, you will build a spam filter.  The task is similar to the one in Project 2, in that you need to categorize a set of documents into two classes.  There are some differences, however:  the e-mails will be a little noisier and less formally written, and the differences between the two categories will not be as clear.  All required files can be found in ~/dropbox/08-09/570/hw5.

 

The Documents

 

There are approximately 1,000 documents, split into training and testing folders.  Each of these is further split into an email folder and a spam folder, where the former contains legitimate e-mails and the latter contains spam.  You may not look at the test documents; use them only to test and evaluate your model.
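
The folder layout described above can be read with a few lines of Python.  This is only a sketch under stated assumptions:  the function name load_labeled_documents is hypothetical, the subfolder names are assumed to be exactly email and spam, and the files are decoded as Latin-1 simply because that never fails on noisy mail.  The folder name is kept only as the gold label for training and evaluation, never as a feature.

```python
import os


def load_labeled_documents(split_dir):
    """Return a list of (file_name, label, text) tuples for one split.

    Assumes split_dir contains two subfolders named 'email' and 'spam'.
    The label is the gold answer only; it must not be used as a feature.
    """
    documents = []
    for label in ("email", "spam"):
        class_dir = os.path.join(split_dir, label)
        for file_name in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, file_name)
            if not os.path.isfile(path):
                continue  # skip nested folders, if any
            # Latin-1 maps every byte to a character, so decoding cannot fail.
            with open(path, encoding="latin-1") as handle:
                text = handle.read()
            documents.append((file_name, label, text))
    return documents
```

Call it on the training folder while building the model, and on the test folder only when evaluating.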

 

The Task

 

Train your model on the training data.  You may use whatever techniques you wish:  you may recycle the model you built in Project 2 (with modifications), or you may build a new model.  Any features may be used to build your model, except those relating directly to the directory structure (e.g., you can’t use the folder labels email and spam as features) or to the file names.  Once you have trained your model, test it against the testing data, and produce precision and recall numbers showing your model’s success at eliminating spam documents.  The goal is to eliminate spam, so precision and recall are measured over the documents you classify as legitimate e-mail:  100% precision means that every document you classified as legitimate e-mail really is legitimate (no spam among them), and 100% recall means that you identified all of the legitimate e-mails (none was discarded as spam).
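
As an illustration of the precision and recall definitions above, here is a minimal sketch with legitimate e-mail as the positive class.  The function name precision_recall and the dictionary-based inputs (document name mapped to the label string email or spam) are assumptions made for the example, not a required interface.

```python
def precision_recall(gold_labels, predicted_labels, positive="email"):
    """Compute precision and recall for the positive class.

    Both arguments map a document name to a label string.
    Returns (precision, recall) as floats between 0.0 and 1.0.
    """
    true_pos = false_pos = false_neg = 0
    for name, gold in gold_labels.items():
        predicted = predicted_labels[name]
        if predicted == positive and gold == positive:
            true_pos += 1          # legitimate e-mail kept
        elif predicted == positive:
            false_pos += 1         # spam that slipped through
        elif gold == positive:
            false_neg += 1         # legitimate e-mail thrown away
    kept = true_pos + false_pos
    legitimate = true_pos + false_neg
    precision = true_pos / kept if kept else 0.0
    recall = true_pos / legitimate if legitimate else 0.0
    return precision, recall
```

For example, if there are two legitimate e-mails and two spam documents, and the model keeps one e-mail and one spam document, then precision and recall are both 0.5.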

 

Due Date and What To Submit

 

Submit your code and output.  For the output, provide your precision and recall numbers, and give a list, sorted by name, of the documents you classified as legitimate e-mail, one document per line, together with any scores or thresholds showing their relevance.  In a separate file, give the list of all documents that you classified as spam, again with relevance scores.  Your evaluation and output should cover only the test documents.  Include an overview of your observations and commentary on your methods.  Note:  all documents have unique names, irrespective of whether they’re in training or test, email or spam.
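
One possible way to produce the two sorted lists is sketched below.  The function name format_results, the tab-separated layout, and the four-decimal score format are all assumptions; the assignment only requires one document per line, sorted by name, with its score.

```python
def format_results(scored_documents, label):
    """Return the output text for all documents classified as `label`.

    scored_documents maps a document name to (predicted_label, score).
    Lines are sorted by document name, one document per line.
    """
    lines = []
    for name, (predicted, score) in sorted(scored_documents.items()):
        if predicted == label:
            lines.append(f"{name}\t{score:.4f}\n")
    return "".join(lines)
```

Calling it once with the label email and once with the label spam gives the contents of the two list files.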

 

Due date:  11:45 p.m., Tuesday, December 2nd

Submit minimally the following (in one zip or tar file):

  1. A shell script named hw5.sh, which should run your application and generate the output.  This script should take only one parameter, the path to the test data.  In other words, your model should already be trained and included with your code.
  2. Your code, written in Python, Java, Perl, or C.
  3. hw5-email.txt (containing the sorted list of e-mail documents and thresholds)
  4. hw5-spam.txt (containing the sorted list of spam documents and thresholds)
  5. hw5-pr.txt (containing your precision and recall numbers)
  6. comments.txt – Commentary as described