Linguistics 570
Homework 5
For this project, you will build a spam filter. The task is similar to the task in Project 2, in that you need to categorize a set of documents into two classes. There are some differences, however. For instance, the e-mails will be a little noisier and less formally written, and the differences between the two categories will not be as clear. All required files can be found in ~/dropbox/08-09/570/hw5.
The Documents
There are approximately 1,000 documents, split into the training and testing folders. Each of these is further split into an email folder and a spam folder: the email folder contains legitimate e-mails and the spam folder contains spam. You are not allowed to look at the test documents; you may use them only to test and evaluate your model.
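As a starting point, the folder layout described above could be read with something like the following. This is only a sketch: the folder names (training, testing, email, spam) are taken from the description above but their exact spelling on disk is an assumption, as are the function name load_split and the choice of latin-1 decoding. The folder label is returned only as the gold label for training and evaluation, not as a feature.

```python
from pathlib import Path

def load_split(root, split):
    """Return a list of (file_name, text, gold_label) for one split.

    root  -- directory holding the split folders (assumed layout)
    split -- "training" or "testing" (assumed folder names)
    """
    docs = []
    for label in ("email", "spam"):
        folder = Path(root) / split / label
        for path in sorted(folder.iterdir()):
            if path.is_file():
                # latin-1 maps every byte to a character, so noisy
                # e-mails will not raise a decoding error.
                text = path.read_text(encoding="latin-1")
                docs.append((path.name, text, label))
    return docs
```

For example, load_split(root, "training") would give you the labeled training documents, where root is wherever you copied the homework files.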
The Task
Train your model on the training data. You may use whatever techniques you wish, and may recycle the model you built in Project 2 (with modifications), or you may build a new model. Any features may be used to build your model, except any relating directly to the directory structure (e.g., you can’t use the folder labels email and spam as features) or file names.
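To make the idea of a model with a relevance score concrete, here is a minimal sketch of one possible approach: a bag-of-words Naive Bayes classifier with add-one smoothing. It is not the required method, and every name in it (tokenize, train_naive_bayes, email_score) is hypothetical. It uses only the document text, and it assumes the training data contains at least one document of each class.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train_naive_bayes(docs):
    """docs: iterable of (text, label) pairs, label is 'email' or 'spam'."""
    word_counts = {"email": Counter(), "spam": Counter()}
    doc_counts = Counter()
    for text, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["email"]) | set(word_counts["spam"])
    return word_counts, doc_counts, vocab

def email_score(text, model):
    """Log-odds that text is legitimate e-mail; positive favors 'email'."""
    word_counts, doc_counts, vocab = model
    # Class prior (assumes both classes occur in the training data).
    score = math.log(doc_counts["email"] / doc_counts["spam"])
    totals = {c: sum(word_counts[c].values()) for c in word_counts}
    for tok in tokenize(text):
        if tok not in vocab:
            continue  # skip words never seen in training
        # Add-one (Laplace) smoothed word likelihoods per class.
        p_email = (word_counts["email"][tok] + 1) / (totals["email"] + len(vocab))
        p_spam = (word_counts["spam"][tok] + 1) / (totals["spam"] + len(vocab))
        score += math.log(p_email / p_spam)
    return score
```

A document would then be classified as legitimate e-mail when its score is at or above a threshold you choose (0 is the natural default for log-odds), and the score itself can serve as the relevance score.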
Once you have trained your model, test it against the testing data, and produce precision and recall numbers showing your model’s success at eliminating spam documents. The goal is to eliminate spam, so precision and recall are measured over the documents you classify as legitimate e-mail: 100% precision would mean that every document you identified as legitimate e-mail really is legitimate (no spam), and 100% recall would mean that you identified all of the legitimate e-mails.
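The precision and recall definitions above treat legitimate e-mail as the positive class. A minimal sketch of that computation follows; the function name precision_recall and the dictionary-based input format are assumptions made for illustration.

```python
def precision_recall(gold, predicted, positive="email"):
    """Compute precision and recall for the positive class.

    gold, predicted -- dicts mapping document name -> 'email' or 'spam'
    """
    # Documents called legitimate that really are legitimate.
    true_pos = sum(
        1 for name, label in predicted.items()
        if label == positive and gold[name] == positive
    )
    predicted_pos = sum(1 for label in predicted.values() if label == positive)
    actual_pos = sum(1 for label in gold.values() if label == positive)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Toy example: three legitimate e-mails, two spam documents.
gold = {"a": "email", "b": "email", "c": "email", "d": "spam", "e": "spam"}
predicted = {"a": "email", "b": "email", "c": "spam", "d": "spam", "e": "spam"}
precision, recall = precision_recall(gold, predicted)
```

In the toy example, both documents classified as e-mail are legitimate, so precision is 1.0, but one of the three legitimate e-mails was missed, so recall is 2/3.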
Due Date and What To Submit
Submit your code and output. For the output, provide your precision and recall numbers, and give a list, sorted by name, of the documents you classified as legitimate e-mail, one document per line, along with any threshold numbers showing their relevance scores. In a separate file, give the list of all documents that you classified as spam, again with relevance scores. Your evaluations and output should cover only the test documents. Include an overview of your observations and commentary on your methods. Note: all documents have unique names, irrespective of whether they’re in training or test, email or spam.
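The two output lists described above could be produced along these lines. This is a sketch only: the function name write_results, the tab-separated line format, and the convention that a higher score means "more likely legitimate" are all assumptions, and you may pick whatever file names you like.

```python
def write_results(scored, threshold, email_path, spam_path):
    """Write one 'name<TAB>score' line per test document, sorted by name.

    scored    -- dict mapping document name -> relevance score
                 (assumed: higher means more likely legitimate)
    threshold -- documents scoring at or above this count as e-mail
    """
    email_lines, spam_lines = [], []
    for name in sorted(scored):
        line = f"{name}\t{scored[name]:.4f}\n"
        if scored[name] >= threshold:
            email_lines.append(line)
        else:
            spam_lines.append(line)
    with open(email_path, "w") as f:
        f.writelines(email_lines)
    with open(spam_path, "w") as f:
        f.writelines(spam_lines)
```

Because all document names are unique, sorting by name alone gives a stable, unambiguous order in each file.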
Due date: 11:45 p.m., Tuesday, December 2nd
At a minimum, submit the following (in one zip or tar file):