Ling 571 - Deep Processing Techniques for NLP
Winter 2015
Homework #7: Due 11:59 March 4, 2015
Goals
Through this assignment you will:
- Investigate issues and design of distributional semantic models.
- Analyze the effects of different context sizes and types as well as
association measures in distributional similarity models.
- Evaluate distributional models relative to human assessments.
Background
Please review the class slides and readings in the textbook on distributional
semantics and models. You may implement the assignment in whatever language
you choose, provided that it runs on the CLMS cluster. In some cases below,
Python functions are referenced, but you can use alternate implementations
in other languages if you so choose.
Creating Local Context Bag-of-Words Representations
Create a program named hw7_bow.{py|java|*} to compute
distributional similarity models using a local context term cooccurrence
model. Your program should:
- Read in a corpus. In this case, you should use the Brown corpus provided
with NLTK in /corpora/nltk/nltk-data/corpora/brown/.
The corpus files are whitespace-tokenized, but all tokens are of the form "word/POS".
If you choose to use NLTK, you may use the Brown corpus reader as in:
brown_words = list(nltk.corpus.brown.words())
- For each target word in the corpus:
- Create a vector representation based on word cooccurrence in a specified window around the target word. For a window value of 2, the window should span the two words before and the two words after the current word.
- Pre-processing:
- Lowercase all tokens.
- Exclude stopwords and punctuation from the context. A standard stopword list appears in: /corpora/nltk/nltk-data/corpora/stopwords/english
- Each entry should receive weight according to the specified weighting, either:
- Frequency: the number of times the word appeared in the context of the target
- Point-wise Mutual Information: PMI as defined in the text; a worked sketch appears at the end of this section
- For each word pair in a provided file:
- Print the ten highest weighted features and their weights, in the form:
feature:weight
- Compute and print the similarity between the two words, based on cosine similarity as:
wd1,wd2:similarity
- Lastly, compute and print the Pearson correlation between the similarity scores you have computed and the human-generated similarity scores in the provided file as:
Correlation:computed_correlation
You may use any available software for computing the correlation. In Python,
you can use pearsonr from scipy.stats.
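Tying these pieces together, the following is a minimal sketch of the bag-of-words pipeline, assuming NLTK and SciPy are available on the cluster. All function names are illustrative, and filtering stopwords and punctuation before taking the window is only one defensible interpretation of the instructions:

import math
from collections import Counter, defaultdict

import nltk
from scipy.stats import pearsonr

STOPWORDS = set(nltk.corpus.stopwords.words('english'))

def is_content(tok):
    # Keep a token only if it is not a stopword and is not pure punctuation.
    return tok not in STOPWORDS and any(ch.isalnum() for ch in tok)

def build_vectors(window):
    # Lowercase, filter, then count cooccurrences in a +/- window.
    toks = [w.lower() for w in nltk.corpus.brown.words()]
    toks = [w for w in toks if is_content(w)]
    vectors = defaultdict(Counter)
    for i, w in enumerate(toks):
        context = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
        vectors[w].update(context)
    return vectors

def pmi_reweight(vectors):
    # PMI(w, c) = log2(P(w, c) / (P(w) * P(c))), estimated from the
    # cooccurrence counts; negative values are clipped to zero here.
    total = float(sum(sum(v.values()) for v in vectors.values()))
    w_marg = {w: sum(v.values()) for w, v in vectors.items()}
    c_marg = Counter()
    for v in vectors.values():
        c_marg.update(v)
    pmi = {}
    for w, v in vectors.items():
        pmi[w] = {c: max(math.log(n * total / (w_marg[w] * c_marg[c]), 2), 0.0)
                  for c, n in v.items()}
    return pmi

def cosine(v1, v2):
    # cos(v1, v2) = (v1 . v2) / (|v1| * |v2|)
    dot = sum(val * v2.get(f, 0.0) for f, val in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

The ten highest-weighted features for a word can be read off with sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:10], and the final correlation with pearsonr(model_scores, human_scores), whose first return value is the Pearson r.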
Creating Local Relation-based Models
Create a program named hw7_relation.{py|java|*} to compute
distributional similarity models using a local dependency relation-based
model, similar to Lin's. The basic structure should be similar to that
in the local cooccurrence model above, except:
- You should read the word-dependency information from the provided
file of dependency triples, created by Dekang Lin's dependency parser
over a large newswire corpus. Entries in this file are of the form:
<target>TAB<relation>TAB<word>TAB<count>
- You should compute the Lin association measure as defined in the slides
and text instead of the standard PMI measure.
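As a starting point, here is a minimal sketch of reading the triples file and applying Lin's association measure in the count form given in Lin (1998); presentations of this measure vary, so confirm the exact formulation against the course slides and text before relying on it:

import math
from collections import defaultdict

def lin_vectors(deps_path):
    triple = defaultdict(float)   # c(w, r, w')
    w_rel = defaultdict(float)    # c(w, r, *)
    rel_wp = defaultdict(float)   # c(*, r, w')
    rel = defaultdict(float)      # c(*, r, *)
    with open(deps_path) as f:
        for line in f:
            w, r, wp, n = line.rstrip('\n').split('\t')
            n = float(n)
            triple[(w, r, wp)] += n
            w_rel[(w, r)] += n
            rel_wp[(r, wp)] += n
            rel[r] += n
    # I(w, r, w') = log2((c(w,r,w') * c(*,r,*)) / (c(w,r,*) * c(*,r,w')))
    vectors = defaultdict(dict)
    for (w, r, wp), n in triple.items():
        val = math.log(n * rel[r] / (w_rel[(w, r)] * rel_wp[(r, wp)]), 2)
        vectors[w][(r, wp)] = max(val, 0.0)
    return vectors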
Files
Test Data Files
All files related to this assignment may be found on patas in
/dropbox/14-15/571/hw7/, as below:
- mc_similarity.txt: These are the pairs of words
whose similarity is to be evaluated under each of your models, along with
human similarity judgments from [Miller and Charles, 1991].
Distributional Semantic Analysis
Your program hw7_bow.*, which creates and evaluates your local context cooccurrence model with respect to human judgments, should take the parameters specified below:
- window: Specifies the size of the context window for your
model.
- weights: Specifies the weighting scheme to apply: "FREQ" or "PMI".
- mc_similarity.txt: The file of word pairs and human similarity scores to evaluate against. Each line is of the form:
wd1,wd2,similarity_score
- hw7_results_bow_<window>_<weights>.out: The output file with the results of computing similarities and correlations
over the word pairs. The file name should identify the configuration
under which it was run, e.g. hw7_results_bow_30_FREQ.out would hold the
results of running the bag of words model with context window of 30 and
frequency weights.
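For example, a run with a context window of 2 and frequency weighting might be invoked as below; the argument order shown is a suggestion, not a requirement:
$ hw7_bow.py 2 FREQ /dropbox/14-15/571/hw7/mc_similarity.txt hw7_results_bow_2_FREQ.out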
Your program hw7_relation.*, which creates and evaluates your dependency relation-based model with respect to human judgments, should take the parameters specified below:
- /dropbox/14-15/571/hw7/deps: File of Lin's dependency
triples.
- weights: Specifies the weighting scheme to apply: "FREQ" or "LIN".
- mc_similarity.txt: The file of word pairs and human similarity scores to evaluate against. Each line is of the form:
wd1,wd2,similarity_score
- hw7_results_relation_<weights>.out: The output file with the results of computing similarities and correlations
over the word pairs. The file name should identify the configuration
under which it was run, e.g. hw7_results_relation_FREQ.out would hold the
results of running the relation model with
frequency weights.
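An illustrative invocation with frequency weighting, again with the argument order up to you:
$ hw7_relation.py /dropbox/14-15/571/hw7/deps FREQ /dropbox/14-15/571/hw7/mc_similarity.txt hw7_results_relation_FREQ.out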
Testing
You should run your programs and store the results for the following configurations (shown schematically; these are not exact command-line invocations):
- hw7_bow.* 2 FREQ
- hw7_bow.* 30 FREQ
- hw7_bow.* 30 PMI
- hw7_relation.* FREQ
- hw7_relation.* LIN
Write-up
Describe and discuss your work in a write-up file. Include problems you came across and how (or if) you
were able to solve them, any insights, special features, and what you learned. Give examples if possible.
If you were not able to complete parts of the project, discuss what you tried and/or what did not work.
This will allow you to receive maximum credit for partial work.
NOTE: You should discuss your results in terms of the
different impacts of context type, context window, and weighting scheme.
You may reference both qualitative observations and your quantitative scores.
Please name the file readme.txt or readme.pdf, as appropriate for its format.
Condor Submission
Your programs must run on patas using:
$ condor_submit hw7.cmd
Note: Your condor script should run at least one configuration each of the hw7_bow.* and hw7_relation.* models. You can run the remaining configurations yourself and store the result files.
Please see the CLMS wiki pages on the basics of using the condor
cluster.
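As a starting point, a minimal hw7.cmd might look like the sketch below, which runs one bag-of-words configuration; the executable, arguments, and file names are illustrative, and the sketch assumes hw7_bow.py has a shebang line and is executable:
executable = hw7_bow.py
getenv = true
arguments = 2 FREQ /dropbox/14-15/571/hw7/mc_similarity.txt hw7_results_bow_2_FREQ.out
output = hw7_bow.stdout
error = hw7_bow.stderr
log = hw7.log
transfer_executable = false
queue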
All files created by the condor run should appear in the top level of
the directory.
Handing in your work
All homework should be handed in using the class CollectIt.
Use the tar command to build a single hand-in file named hw#.tar, where # is the number of the homework assignment (hw7.tar for this assignment), containing all the material necessary to test your assignment. Your hw7.cmd should be at the top level of whatever directory structure you are using.
For example, in your top-level directory, run:
$ tar cvf hw7.tar *