Ling 571 - Deep Processing Techniques for NLP
Winter 2015
Homework #7: Due 11:59 March 4, 2015
Goals
Through this assignment you will:
- Investigate issues and design of distributional semantic models.
- Analyze the effects of different context sizes and types as well as
association measures in distributional similarity models.
- Evaluate distributional models relative to human assessments.
Background
Please review the class slides and readings in the textbook on distributional
semantics and models. You may implement the assignment in whatever language
you choose, provided that it runs on the CLMS cluster. In some cases below,
Python functions are referenced, but you can use alternate implementations
in other languages if you so choose.
Creating Local Context Bag-of-Words Representations
Create a program named hw7_bow.{py|java|*} to compute
distributional similarity models using a local context term cooccurrence
model. Your program should:
- Read in a corpus. In this case, you should use the Brown corpus provided
with NLTK in /corpora/nltk/nltk-data/corpora/brown/.
The corpus files are whitespace-tokenized, but all tokens are of the form "word/POS".
If you choose to use NLTK, you may use the Brown corpus reader as in:
brown_words = list(nltk.corpus.brown.words())
- For each target word in the corpus:
- Create a vector representation based on word cooccurrence in a specified window around the target word. For a window value of 2, the window should span the two words before and the two words after the current word.
- Pre-processing:
- Lowercase all tokens.
- Exclude stopwords and punctuation from the context. A standard stopword list appears in: /corpora/nltk/nltk-data/corpora/stopwords/english
- Each entry should receive weight according to the specified weighting, either:
- Frequency: the number of times the word appeared in the context of the target
- Point-wise Mutual Information: PMI as defined in the text; a worked sketch appears at the end of this section
- For each word pair in a provided file:
- Print the ten highest weighted features and their weights, in the form:
feature:weight
- Compute and print the similarity between the two words, based on cosine similarity as:
wd1,wd2:similarity
- Lastly, compute and print the Pearson correlation between the similarity scores you have computed and the human-generated similarity scores in the provided file as:
Correlation:computed_correlation
You may use any available software for computing the correlation. In Python,
you can use pearsonr from scipy.stats.
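Tying these pieces together, the following is a minimal sketch of the bag-of-words pipeline, assuming NLTK and SciPy are available on the cluster. All function names are illustrative, and filtering stopwords and punctuation before taking the window is only one defensible interpretation of the instructions:

import math
from collections import Counter, defaultdict

import nltk
from scipy.stats import pearsonr

STOPWORDS = set(nltk.corpus.stopwords.words('english'))

def is_content(tok):
    # Keep a token only if it is not a stopword and is not pure punctuation.
    return tok not in STOPWORDS and any(ch.isalnum() for ch in tok)

def build_vectors(window):
    # Lowercase, filter, then count cooccurrences in a +/- window.
    toks = [w.lower() for w in nltk.corpus.brown.words()]
    toks = [w for w in toks if is_content(w)]
    vectors = defaultdict(Counter)
    for i, w in enumerate(toks):
        context = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
        vectors[w].update(context)
    return vectors

def pmi_reweight(vectors):
    # PMI(w, c) = log2(P(w, c) / (P(w) * P(c))), estimated from the
    # cooccurrence counts; negative values are clipped to zero here.
    total = float(sum(sum(v.values()) for v in vectors.values()))
    w_marg = {w: sum(v.values()) for w, v in vectors.items()}
    c_marg = Counter()
    for v in vectors.values():
        c_marg.update(v)
    pmi = {}
    for w, v in vectors.items():
        pmi[w] = {c: max(math.log(n * total / (w_marg[w] * c_marg[c]), 2), 0.0)
                  for c, n in v.items()}
    return pmi

def cosine(v1, v2):
    # cos(v1, v2) = (v1 . v2) / (|v1| * |v2|)
    dot = sum(val * v2.get(f, 0.0) for f, val in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

The ten highest-weighted features for a word can be read off with sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:10], and the final correlation with pearsonr(model_scores, human_scores), whose first return value is the Pearson r.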
Creating Local Relation-based Models
Create a program named hw7_relation.{py|java|*} to compute
distributional similarity models using a local dependency relation-based
model, similar to Lin's. The basic structure should be similar to that
in the local cooccurrence model above, except:
- You should read the word-dependency information from the provided
file of dependency triples, created by Dekang Lin's dependency parser
over a large newswire corpus. Entries in this file are of the form:
<target>TAB<relation>TAB<word>TAB<count>
- You should compute the Lin association measure as defined in the slides
and text instead of the standard PMI measure.
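As a starting point, here is a minimal sketch of reading the triples file and applying Lin's association measure in the count form given in Lin (1998); presentations of this measure vary, so confirm the exact formulation against the course slides and text before relying on it:

import math
from collections import defaultdict

def lin_vectors(deps_path):
    triple = defaultdict(float)   # c(w, r, w')
    w_rel = defaultdict(float)    # c(w, r, *)
    rel_wp = defaultdict(float)   # c(*, r, w')
    rel = defaultdict(float)      # c(*, r, *)
    with open(deps_path) as f:
        for line in f:
            w, r, wp, n = line.rstrip('\n').split('\t')
            n = float(n)
            triple[(w, r, wp)] += n
            w_rel[(w, r)] += n
            rel_wp[(r, wp)] += n
            rel[r] += n
    # I(w, r, w') = log2((c(w,r,w') * c(*,r,*)) / (c(w,r,*) * c(*,r,w')))
    vectors = defaultdict(dict)
    for (w, r, wp), n in triple.items():
        val = math.log(n * rel[r] / (w_rel[(w, r)] * rel_wp[(r, wp)]), 2)
        vectors[w][(r, wp)] = max(val, 0.0)
    return vectors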
Files
Test Data Files
All files related to this assignment may be found on patas in
/dropbox/14-15/571/hw7/, as below:
- mc_similarity.txt: These are the pairs of words
whose similarity is to be evaluated under each of your models, along with
human similarity judgments from [Miller and Charles, 1991].
Distributional Semantic Analysis
Your program hw7_bow.*, which creates and evaluates your local context cooccurrence model with respect to human judgments, should take the parameters specified below:
- window: Specifies the size of the context window for your
model.
- weights: Specifies the weighting scheme to apply: "FREQ" or "PMI".
- mc_similarity.txt: The file of word pairs and human similarity scores to evaluate against. Each line is of the form:
wd1,wd2,similarity_score
- hw7_results_bow_<window>_<weights>.out: The output file with the results of computing similarities and correlations
over the word pairs. The file name should identify the configuration
under which it was run, e.g. hw7_results_bow_30_FREQ.out would hold the
results of running the bag of words model with context window of 30 and
frequency weights.
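For example, a run with a context window of 2 and frequency weighting might be invoked as below; the argument order shown is a suggestion, not a requirement:
$ hw7_bow.py 2 FREQ /dropbox/14-15/571/hw7/mc_similarity.txt hw7_results_bow_2_FREQ.out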
Your program hw7_relation.*, which creates and evaluates your dependency relation-based model with respect to human judgments, should take the parameters specified below:
- /dropbox/14-15/571/hw7/deps: File of Lin's dependency
triples.
- weights: Specifies the weighting scheme to apply: "FREQ" or "LIN".
- mc_similarity.txt: The file of word pairs and human similarity scores to evaluate against. Each line is of the form:
wd1,wd2,similarity_score
- hw7_results_relation_<weights>.out: The output file with the results of computing similarities and correlations
over the word pairs. The file name should identify the configuration
under which it was run, e.g. hw7_results_relation_FREQ.out would hold the
results of running the relation model with
frequency weights.
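An illustrative invocation with frequency weighting, again with the argument order up to you:
$ hw7_relation.py /dropbox/14-15/571/hw7/deps FREQ /dropbox/14-15/571/hw7/mc_similarity.txt hw7_results_relation_FREQ.out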
Testing
You should run your programs and store the results for the following configurations (shown schematically; these are not exact command-line invocations):
- hw7_bow.* 2 FREQ
- hw7_bow.* 30 FREQ
- hw7_bow.* 30 PMI
- hw7_relation.* FREQ
- hw7_relation.* LIN
Write-up
Describe and discuss your work in a write-up file. Include problems you came across and how (or if) you
were able to solve them, any insights, special features, and what you learned. Give examples if possible.
If you were not able to complete parts of the project, discuss what you tried and/or what did not work.
This will allow you to receive maximum credit for partial work.
NOTE: You should discuss your results in terms of the
different impacts of context type, context window, and weighting scheme.
You may reference both qualitative observations and your quantitative scores.
Please name the file readme.txt or readme.pdf, as appropriate for its format.
Condor Submission
Your programs must run on patas using:
$ condor_submit hw7.cmd
Note: Your condor script should run at least one configuration each of the hw7_bow.* and hw7_relation.* models. You can run the remaining configurations yourself and store the result files.
Please see the CLMS wiki pages on the basics of using the condor
cluster.
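As a starting point, a minimal hw7.cmd might look like the sketch below, which runs one bag-of-words configuration; the executable, arguments, and file names are illustrative, and the sketch assumes hw7_bow.py has a shebang line and is executable:
executable = hw7_bow.py
getenv = true
arguments = 2 FREQ /dropbox/14-15/571/hw7/mc_similarity.txt hw7_results_bow_2_FREQ.out
output = hw7_bow.stdout
error = hw7_bow.stderr
log = hw7.log
transfer_executable = false
queue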
All files created by the condor run should appear in the top level of
the directory.
Handing in your work
All homework should be handed in using the class CollectIt.
Use the tar command to build a single hand-in file named hw#.tar, where # is the number of the homework assignment (hw7.tar for this assignment), containing all the material necessary to test your assignment. Your hw7.cmd should be at the top level of whatever directory structure you are using.
For example, in your top-level directory, run:
$ tar cvf hw7.tar *