Ling 571 - Deep Processing Techniques for NLP
Winter 2017
Homework #7: Due February 21, 2017, 23:45


Goals

Through this assignment you will:

Background

Please review the class slides and the textbook readings on distributional semantics and models, as well as the detailed assignment notes in HW7.pdf. You may implement the assignment in any language you choose, provided that it runs on the CLMS cluster. Some Python functions are referenced below, but you may use equivalent implementations in other languages. The count-based and word2vec models are to be implemented separately, so you may do the more extensive coding required for the count-based distributional model in your preferred programming language and then use the Python-based gensim package for the word2vec implementation.

Creating and Evaluating Count-based Models of Distributional Semantic Similarity

Implement a program to create and evaluate a distributional model of word similarity based on local context term cooccurrence. Your program should:

Programming

Create a program hw7_dist_similarity.{py|pl|etc} that implements the creation and evaluation of the distributional similarity model as described above and invoked as:
hw7_dist_similarity.{py|pl|etc} <window> <weighting> <judgment_filename> <output_filename>, where:
In this assignment, you should use the Brown corpus provided with NLTK in /corpora/nltk/nltk-data/corpora/brown/ as the source of cooccurrence information. The corpus files are whitespace-tokenized, but every token is of the form "word/POS". If you choose to use NLTK, you may use the Brown corpus reader as in:
import nltk
brown_words = list(nltk.corpus.brown.words())
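
If you read the raw corpus files yourself instead of using NLTK, the "word/POS" tokens need to be stripped of their tags before counting. The following is a minimal sketch of one way to do that and to collect cooccurrence counts within a window. The function names (strip_pos, read_words, count_cooccurrences) are hypothetical, and the choices to lowercase words, to use a symmetric window, and to let windows cross line boundaries are assumptions made for illustration, not requirements of the assignment; check HW7.pdf for the actual specification.

```python
from collections import Counter, defaultdict


def strip_pos(token):
    """Return the word part of a Brown-style "word/POS" token."""
    # Split on the LAST slash so that words containing "/" stay intact.
    word, sep, _tag = token.rpartition("/")
    return word if sep else token


def read_words(lines):
    """Flatten whitespace-tokenized "word/POS" lines into a list of words."""
    words = []
    for line in lines:
        for token in line.split():
            # Lowercasing is an assumption, not part of the assignment text.
            words.append(strip_pos(token).lower())
    return words


def count_cooccurrences(words, window):
    """Count context words within `window` positions on either side."""
    counts = defaultdict(Counter)
    for i, target in enumerate(words):
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                counts[target][words[j]] += 1
    return counts


# Tiny made-up sample in the same "word/POS" format as the corpus files.
sample = ["The/at dog/nn barked/vbd ./.", "The/at cat/nn slept/vbd ./."]
words = read_words(sample)
counts = count_cooccurrences(words, window=2)
```

Here counts["dog"] holds the context counts for "dog", which can serve as the raw row of a word-by-context matrix before any weighting is applied.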

Comparison to Continuous Bag of Words (CBOW) using Word2Vec

Implement a program to evaluate a predictive CBOW distributional model of word similarity using Word2Vec. Your program should:

Programming #2

Create a program hw7_cbow_similarity.{py|pl|etc} that implements the creation and evaluation of the Continuous Bag-of-Words similarity model as described above and invoked as:
hw7_cbow_similarity.{py|pl|etc} <window> <judgment_filename> <output_filename>, where:

Files

Test and Example Data Files

Aside from the Brown corpus, all files related to this assignment may be found on patas in /dropbox/16-17/571/hw7/, as below:

Submission Files

Handing in your work

All homework should be handed in using the class CollectIt.