Linguistics 570
Project 2
For this project, you will build
a vector space model over a set of documents.
For the first part of the project, your model should separate linguistic
from non-linguistic documents. For the
second part, your model should facilitate search over the set of linguistic
documents, and return those documents most relevant to a query. All required files can be found in ~/dropbox/08-09/570/project2.
The Documents
The data consists of a set of approximately 710 text documents, all of which were converted from PDF documents downloaded from the Web. The set is a mix of linguistic and non-linguistic documents, where the linguistic documents are mostly scholarly documents discussing language data (and sometimes containing analyzed language data). The vocabulary used in the linguistic documents tends to be distinctive for particular sub-domains of linguistics, such as phonology, syntax, morphology, etc. For example, documents that analyze a language’s morphology may use terms such as morpheme, prefix, suffix, clitic, inflection, derivation, etc. More general linguistic terms are also likely to be used. (Note: (1) There is a little noise in some of the documents, due to the PDF conversion. Your methods will likely ignore this noise, since it will be low frequency across the set of documents. (2) Although most documents are in English, some are not. You may treat any of the non-English documents as non-linguistic.)
Vectors and Distance Measures
There are multiple ways to measure the distance between vectors in a multidimensional space, but the easiest to implement is cosine similarity. You are free, however, to try other methods.
Because of the very large number of words that the set of documents will
contain, you will need to filter out irrelevant words using a stop list. Even this may not be adequate, and you are
free to implement other strategies to reduce dimensionality (e.g., stemming),
to adjust weights, or to classify the documents.
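As a rough illustration of the ideas above, here is a minimal Python sketch of cosine similarity over stop-word-filtered term counts. The stop list, the tokenizer, and the function names are all assumptions made for this example and are not part of the assignment. Note that cosine is a similarity (1.0 means the vectors point in the same direction); if you need a distance, 1 minus the cosine is one common choice.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def term_counts(text):
    """Lowercase, keep runs of letters, drop stop words, count what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_u * norm_v)

doc_a = term_counts("The suffix is a morpheme.")
doc_b = term_counts("A prefix is a morpheme and a suffix is a morpheme.")
print(cosine(doc_a, doc_b))
```

Storing each vector as a dictionary keeps only the non-zero dimensions, which helps when the vocabulary is very large and most documents use only a small part of it.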
Task 1
For this task, you’ll want to build a vector space model that will separate the documents into two sets, those that are linguistic and those that are not. Your model will contain a vector representing each document, with key terms, phrases, or other “features” as its dimensions. The elements of the vectors can be binary (a simple 1 or 0 indicating the presence or absence of a term), but more varied values may prove more useful. For instance, integer or real values reflecting different or repeated usage across the document may prove useful, as might weighted values dependent on the relevance of a particular term. Because some terms will be inflected, stemming algorithms, such as the Porter Stemmer, may prove useful.
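The three kinds of vector element mentioned above (binary, counts, and weighted values) can be sketched as follows. This is only one possible approach: the weighting shown is a plain tf-idf variant, chosen here as an example of "weighted values" and not prescribed by the assignment, and the function names and toy documents are hypothetical. Tokenization and stemming are assumed to have happened already.

```python
import math
from collections import Counter

def binary_vector(tokens, vocabulary):
    """1 if the term occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def count_vector(tokens, vocabulary):
    """Raw number of occurrences of each vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def idf_weights(tokenized_docs, vocabulary):
    """log(N / df) per term; terms that occur in no document get weight 0."""
    n_docs = len(tokenized_docs)
    doc_sets = [set(doc) for doc in tokenized_docs]
    weights = []
    for term in vocabulary:
        df = sum(1 for s in doc_sets if term in s)
        weights.append(math.log(n_docs / df) if df else 0.0)
    return weights

def tfidf_vector(tokens, vocabulary, idf):
    """Term count scaled by the term's inverse document frequency."""
    return [tf * w for tf, w in zip(count_vector(tokens, vocabulary), idf)]

# Toy data, invented for illustration only.
docs = [["morpheme", "suffix", "suffix"], ["suffix", "price"], ["price", "market"]]
vocabulary = ["morpheme", "suffix", "price"]
idf = idf_weights(docs, vocabulary)
print(binary_vector(docs[0], vocabulary))
print(count_vector(docs[0], vocabulary))
print(tfidf_vector(docs[0], vocabulary, idf))
```

With this weighting, a term that appears in every document gets weight log(1) = 0, so it contributes nothing to any comparison, much as a stop word would.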
You’ll train and test your
model on the set of documents contained in ~/dropbox/08-09/570/project2/files1. A file,
labeled ~/dropbox/08-09/570/project2/files1-gold-standard.txt, gives the breakdown between linguistic and
non-linguistic documents. You may use
this list to help determine the vocabulary relevant to building your model
(such as to build a prototype vector containing the most relevant vocabulary). The Friday before the assignment is due, a
second set of documents will be provided in ~/dropbox/08-09/570/project2/files2. Test your
model against this second set to see how it categorizes these documents. To test, you’ll build a vector space for the
new set of documents, as you did for the first set. However, neither the dimensions nor the
weights of the model should be changed from the first to the second. In other words, don’t add new vocabulary or
features (dimensions) to accommodate the new set of documents. With this second set, there will be another
gold standard file, ~/dropbox/08-09/570/files2-gold-standard.txt,
that you can use for calculating your precision and recall numbers. (Please note: file names are not unique across files1 and files2. In other words, the file name alone cannot be used as a unique identifier.)
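The following Python sketch illustrates two points from this section: representing a new document using only the dimensions and weights fixed on the first set, and computing precision and recall against a gold standard. The function names, the example weights, and the use of (directory, file name) pairs as identifiers are assumptions made for illustration. The layout of the gold standard files is not described here, so reading them is left out.

```python
from collections import Counter

def project(tokens, vocabulary, weights):
    """Represent a document in a FIXED vocabulary with FIXED weights.

    Words outside the vocabulary are ignored, so no new dimensions
    are introduced for the second set of documents.
    """
    counts = Counter(tokens)
    return [counts[term] * w for term, w in zip(vocabulary, weights)]

def precision_recall(predicted, gold):
    """Both arguments are sets of identifiers for documents labeled linguistic."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Because file names repeat across the two sets, each identifier pairs
# the directory with the file name (hypothetical names shown).
predicted = {("files2", "a.txt"), ("files2", "b.txt"), ("files2", "c.txt")}
gold = {("files2", "a.txt"), ("files2", "b.txt"),
        ("files2", "d.txt"), ("files2", "e.txt")}
print(precision_recall(predicted, gold))
```

In this toy case two of the three predicted documents are correct (precision 2/3), and two of the four gold documents were found (recall 1/2).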
Task 2
For the second task, take input provided by the user (reading it at the prompt is fine) and return the documents that most closely match the terms that are given. You’ll take the user input, and structure it as a vector, which you will then compare against the document vectors you have created. The output should consist of a list of the documents that most closely match the query, where a document counts as a match if its proximity to the query exceeds some threshold value you have set. Only documents identified as linguistic in Task 1 should be output. A week before this task is due, you will be provided with a set of test queries.
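A minimal Python sketch of this query step is shown below, assuming document vectors are stored as dictionaries of term weights and that cosine is the proximity measure. The function names, the toy vectors, and the threshold of 0.3 are all hypothetical; choosing a sensible threshold is part of the task. In a real program the query string could come from input() at the prompt.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query, doc_vectors, linguistic_ids, threshold):
    """Return (doc_id, score) pairs at or above the threshold, best first."""
    query_vec = Counter(query.lower().split())
    hits = []
    for doc_id, vec in doc_vectors.items():
        if doc_id not in linguistic_ids:
            continue  # only documents classified as linguistic in Task 1
        score = cosine(query_vec, vec)
        if score >= threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration only.
doc_vectors = {
    "d1": {"suffix": 2, "morpheme": 1},
    "d2": {"syntax": 3},
    "d3": {"suffix": 1, "price": 4},
}
linguistic_ids = {"d1", "d2"}
results = search("suffix morpheme", doc_vectors, linguistic_ids, threshold=0.3)
for doc_id, score in results:
    print(doc_id, round(score, 4))
```

Here d3 is skipped because it was not classified as linguistic, and d2 falls below the threshold because it shares no terms with the query.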
Due Dates and What You Submit
Task 1: Submit your code and output. For the output, provide your precision and
recall numbers, and give a sorted list (by name) of the documents you
classified as linguistic (remember to include documents in both files1 and files2). Include any
commentary about the difficulties you had in building the model, and specifics
about its failings (for instance, why it might have failed for particular
documents).
Task 1 due date:
11:45 p.m., Tuesday, November 25th
Submit minimally the
following (in one zip or tar file):
Task 2: Submit your code and output. Your output should consist only of a sorted list of documents relevant to the queries provided to you, one list for each query, with the proximity score of each document to the given query. Only documents above some predefined threshold should be included. Include a comments file describing the methods you used, and describe how you arrived at the threshold value that selected documents relevant to the queries.
Task 2 due date: 11:45 p.m., Tuesday, November 25th
Submit minimally the
following (in one zip or tar file):
Here are some readings and
references that may be of use: