Linguistics 570

Project 2

 

For this project, you will build a vector space model over a set of documents.  For the first part of the project, your model should separate linguistic from non-linguistic documents.  For the second part, your model should facilitate search over the set of linguistic documents, and return those documents most relevant to a query.  All required files can be found in ~/dropbox/08-09/570/project2.

 

The Documents

 

The data consists of a set of approximately 710 text documents, all of which were converted from PDF documents downloaded from the Web.  The set is a mix of linguistic and non-linguistic documents, where the linguistic documents are mostly scholarly documents discussing language data (and sometimes containing analyzed language data).  The vocabulary used in the linguistic documents tends to be distinctive for particular sub-domains of linguistics, such as phonology, syntax, morphology, etc.  For example, documents that analyze a language’s morphology may use terms such as morpheme, prefix, suffix, clitic, inflection, derivation, etc.  More general linguistic terms are also likely to be used.  (Note:  (1) There is a little noise in some of the documents, due to the PDF conversion.  Your methods will likely ignore this noise, since it will be low-frequency across the set of documents.  (2) Although most documents are in English, some are not.  You may treat any of the non-English documents as non-linguistic.)

 

Vectors and Distance Measures

 

There are multiple ways to measure the distance between vectors in a multidimensional space, but the easiest to implement is cosine similarity.  You are free, however, to try other methods.  Because of the very large number of words that the set of documents will contain, you will need to filter out irrelevant words using a stop list.  Even this may not be adequate, and you are free to implement other strategies to reduce dimensionality (e.g., stemming), to adjust weights, or to classify the documents.
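As an illustration only, here is a minimal sketch of cosine similarity over sparse vectors stored as dictionaries, with a stop list applied during tokenization.  The names STOP_WORDS, tokenize, and cosine are hypothetical, and the stop list shown is far too small for real use; treat this as one possible starting point, not a required design.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop list for illustration; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

doc_a = Counter(tokenize("The suffix is a bound morpheme."))
doc_b = Counter(tokenize("A prefix is a bound morpheme too."))
print(cosine(doc_a, doc_b))
```

Storing each vector as a dictionary keeps only the non-zero dimensions, which matters when the vocabulary is large and each document uses a small part of it.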

 

Task 1

 

For this task, you’ll want to build a vector space model that will separate the documents into two sets, those that are linguistic and those that are not.  Your model will contain vectors that represent each document, mapping key terms, phrases, or other “features” to elements of the vectors.  The elements of the vectors can be binary (a simple 1 or 0 indicating the presence or absence of a term), but the model may prove more useful if more varied values are used.  For instance, integer or real values reflecting different or repeated usage across the document may prove useful, as might weighted values dependent on the relevance of a particular term.  Because some terms will be inflected, stemming algorithms, such as the Porter Stemmer, may prove useful.
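The three kinds of vector elements described above might be sketched as follows.  The function names are hypothetical, and the tf-idf formula shown (raw count times the log of inverse document frequency) is just one common weighting, assumed here for illustration; you may choose a different one.

```python
import math
from collections import Counter

def binary_vector(tokens):
    """1 for every term that appears at least once."""
    return {t: 1 for t in set(tokens)}

def count_vector(tokens):
    """Raw number of occurrences of each term."""
    return dict(Counter(tokens))

def document_frequencies(tokenized_docs):
    """Number of documents in which each term appears."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df

def tfidf_vector(tokens, df, n_docs):
    """Count weighted by log(N / df); terms with no document frequency are skipped."""
    counts = Counter(tokens)
    return {t: c * math.log(n_docs / df[t]) for t, c in counts.items() if df.get(t)}

# Toy, already-tokenized documents (made up for this example).
docs = [
    ["morpheme", "suffix", "morpheme"],
    ["suffix", "stock", "market"],
    ["stock", "price"],
]
df = document_frequencies(docs)
print(binary_vector(docs[0]))
print(count_vector(docs[0]))
print(tfidf_vector(docs[0], df, len(docs)))
```

Under this weighting, a term that appears in every document gets a weight of zero, so very common words contribute nothing even if they slip past the stop list.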

 

You’ll train and test your model on the set of documents contained in ~/dropbox/08-09/570/project2/files1.  A file, named ~/dropbox/08-09/570/project2/files1-gold-standard.txt, gives the breakdown between linguistic and non-linguistic documents.  You may use this list to help determine the vocabulary relevant to building your model (such as to build a prototype vector containing the most relevant vocabulary).  The Friday before the assignment is due, a second set of documents will be provided in ~/dropbox/08-09/570/project2/files2.  Test your model against this second set to see how it categorizes these documents.  To test, you’ll build a vector space for the new set of documents, as you did for the first set.  However, neither the dimensions nor the weights of the model should be changed from the first to the second.  In other words, don’t add new vocabulary or features (dimensions) to accommodate the new set of documents.  With this second set, there will be another gold standard file, ~/dropbox/08-09/570/files2-gold-standard.txt, that you can use for calculating your precision and recall numbers.  (Please note:  file names are not unique across files1 and files2.  In other words, the file name alone cannot be used as a unique identifier.)
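One way to keep the dimensions fixed between the two sets is to choose the vocabulary once, from files1, and then count only those terms in any later document.  The sketch below assumes the documents are already tokenized and the gold-standard labels are already loaded as booleans (the format of the gold standard file is not assumed here).  The selection heuristic and the names build_vocabulary and vectorize are hypothetical.

```python
from collections import Counter

def build_vocabulary(train_docs, labels, size):
    """Pick the `size` terms that appear in the most linguistic documents
    relative to non-linguistic ones (a simple, assumed heuristic)."""
    ling = Counter()
    non = Counter()
    for tokens, is_ling in zip(train_docs, labels):
        (ling if is_ling else non).update(set(tokens))
    scores = {t: ling[t] - non[t] for t in ling}
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:size]

def vectorize(tokens, vocabulary):
    """Count only terms in the frozen vocabulary; unseen terms are ignored."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]

# Toy training data standing in for files1 (made up for this example).
train_docs = [["morpheme", "suffix", "data"], ["clitic", "morpheme"], ["stock", "data"]]
labels = [True, True, False]
vocabulary = build_vocabulary(train_docs, labels, size=3)

# A document from the second set: "phoneme" is new, so it adds no dimension.
new_doc = ["morpheme", "phoneme", "morpheme", "suffix"]
print(vocabulary)
print(vectorize(new_doc, vocabulary))
```

Because vectorize only looks up terms already in the vocabulary, vectors built for the second set always have the same length and the same dimension order as those built for the first.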

 

Task 2

 

For the second task, take input provided by the user (reading it at the prompt is fine) and return the documents that most closely match the terms that are given.  You’ll take the user input and structure it as a vector, which you will then compare against the document vectors you have created.  The output should consist of a list of the documents that most closely match the query, where proximity is measured against some threshold value you have set.  Only documents identified as linguistic in Task 1 should be output.  A week before this task is due, you will be provided with a set of test queries.
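The query step might look roughly like the sketch below: the query is turned into a term-count vector, compared against each document vector with cosine similarity, and only documents at or above a threshold are returned, best match first.  The function search, the toy document vectors, and the threshold of 0.3 are all hypothetical; the sketch also assumes doc_vectors already contains only the documents classified as linguistic.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query, doc_vectors, threshold):
    """Return (score, name) pairs at or above threshold, highest score first."""
    query_vec = Counter(query.lower().split())
    scored = [(cosine(query_vec, vec), name) for name, vec in doc_vectors.items()]
    hits = [(score, name) for score, name in scored if score >= threshold]
    return sorted(hits, key=lambda pair: (-pair[0], pair[1]))

# Toy document vectors (made up for this example).
doc_vectors = {
    "a.txt": {"morpheme": 2, "suffix": 1},
    "b.txt": {"syntax": 3, "clause": 1},
    "c.txt": {"suffix": 1, "syntax": 1},
}
results = search("suffix morpheme", doc_vectors, threshold=0.3)
for score, name in results:
    print(f"{name}\t{score:.3f}")
```

In a full program, the query string would come from input() at the prompt, and the query would go through the same tokenizing, stop-list, and weighting steps as the documents so that the two kinds of vectors are comparable.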

 

Due Dates and What You Submit

 

Task 1:  Submit your code and output.  For the output, provide your precision and recall numbers, and give a sorted list (by name) of the documents you classified as linguistic (remember to include documents in both files1 and files2).  Include any commentary about the difficulties you had in building the model, and specifics about its failings (for instance, why it might have failed for particular documents).
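For reference, precision is the fraction of documents you classified as linguistic that really are linguistic, and recall is the fraction of truly linguistic documents that you found.  A minimal sketch follows; the function name is hypothetical, and because file names repeat across the two sets, each document is identified here by an assumed (set name, file name) pair.

```python
def precision_recall(predicted, gold):
    """Precision and recall of the predicted set against the gold set."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

# Toy data (made up): note that "a.txt" occurs in both sets.
predicted = {("files1", "a.txt"), ("files1", "b.txt"), ("files2", "a.txt")}
gold = {("files1", "a.txt"), ("files2", "a.txt"), ("files2", "c.txt")}
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```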

 

Task 1 due date:  11:45 p.m., Tuesday, November 25th

Submit minimally the following (in one zip or tar file):

  1. A shell script named project2-1.sh (should run your application and generate the output)
  2. Your code, written in Python, Java, Perl, C#, C, or Ruby.
  3. project2-1-output.txt (containing the output as described)
  4. project2-1-pr.txt (containing your precision and recall numbers)
  5. comments.txt – Commentary as described, including instructions if necessary.

 

Task 2:  Submit your code and output.  Your output should consist only of a sorted list of documents relevant to the queries provided you, one list for each query, and the proximity score for each document to the given query.  Only documents above some predefined threshold should be included.  Include a comments file describing the methods you used, and describe how you came to the threshold value that selected documents relevant to the queries.

 

Task 2 due date:  11:45 p.m., Tuesday, November 25th

Submit minimally the following (in one zip or tar file):

  1. A shell script named project2-2.sh (should run your application and generate the output)
  2. Your code, written in Python, Java, Perl, C#, C, or Ruby.
  3. project2-2-output.txt (containing the output as described, ordered by query)
  4. comments.txt – Commentary as described, including instructions if necessary.

 

 

Readings 

Here are some readings and references that may be of use: