Ling 571 - Deep Processing Techniques for NLP
Winter 2011
Homework #6: Due 11:59 March 8, 2011
Goals
Through this assignment you will:
- Explore issues in word sense disambiguation.
- Gain familiarity with WordNet and the WordNet API.
- Gain some further familiarity with NLTK.
- Implement a thesaurus-based word sense disambiguation technique on standard data.
Background
Please review the class slides and readings in the textbook on lexical semantics, including WordNet, and word sense disambiguation. Also please read the
article describing
Resnik's word sense disambiguation approach in detail.
Computing Semantic Similarity
Based on the examples in the text, class slides, and the article,
implement a procedure resnik_similarity that implements
Resnik's WordNet-based similarity measure. The procedure should take
two words and return the similarity of their most
similar sense pair.
Resnik's similarity measure relies on two components:
- the Wordnet taxonomy, and
- a corpus-based information content measure.
You may use any API to Wordnet that you wish. There are APIs in
a number of languages available for download here. NLTK provides a Python
implementation of the WordNet API, which you may use.
NOTE: You may use the API to access components of Wordnet,
extract synsets, identify hypernyms, etc. You may NOT
use the methods which directly implement the similarity measure or
the identification of the Least Common Subsumer. You must
implement those functions yourself as procedures for the similarity
calculation. You may use accessors such as common_hypernyms and
information_content. If you have questions about the
admissibility of a procedure, please ask; I'll clarify as quickly
as I can.
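For concreteness, here is a minimal sketch of such a procedure, assuming
the NLTK WordNet API and using only the permitted accessors. The ic
argument is an information-content table, loaded as shown below.

from nltk.corpus import wordnet as wn
from nltk.corpus.reader.wordnet import information_content

def resnik_similarity(word1, word2, ic):
    """Max IC of a common subsumer over all noun sense pairs."""
    best = 0.0
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            # The Least Common Subsumer is the shared hypernym with
            # the highest information content; taking the max over
            # all common hypernyms finds it without calling the
            # built-in LCS method.
            for subsumer in s1.common_hypernyms(s2):
                best = max(best, information_content(subsumer, ic))
    return best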
The NLTK corpus provides a number of resources for information content
calculation including frequency tables indexed by Wordnet offset and part-of-speech in /corpora/nltk/nltk-data/corpora/wordnet_ic/.
For consistency and quality, I would suggest that you use
/corpora/nltk/nltk-data/corpora/wordnet_ic/ic-brown-resnik-add1.dat,
which derives its counts from the 'balanced' Brown Corpus, using fractional
counts for ambiguous words (aka Resnik counting), and add1 smoothing to
avoid zero counts for words not in Wordnet. (Not that there aren't
other problems with words not in Wordnet...)
You may
use this source either through the NLTK API or directly through
methods that you implement yourself. The file is flat text.
NOTE: The IC files assume that you are using Wordnet 3.0.
If you choose to use a different API but want to use the precomputed IC
measures, you must make sure to use Wordnet version 3.0, or the IC measures
will be inconsistent.
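For example, to load the suggested table through the NLTK API (NLTK
resolves the filename against its data path, which on patas includes the
directory given above):

from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown-resnik-add1.dat')

A call such as resnik_similarity('dog', 'cat', brown_ic) then returns
the similarity score.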
NOTE: If you prefer, for extra credit, you may implement
a procedure to calculate the information content measure yourself using
one of the POS-tagged corpus excerpts provided with NLTK (such as the Brown Corpus or the Penn Treebank) or elsewhere on the patas cluster.
It should produce an output file of a format similar to that in the
NLTK Wordnet IC files.
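If you attempt this, the core of the counting step might look roughly
like the following sketch, which assumes the NLTK Brown corpus; the
exact output format should be checked against the IC files shipped with
NLTK.

from collections import defaultdict
from nltk.corpus import brown
from nltk.corpus import wordnet as wn

counts = defaultdict(float)
for word in brown.words():
    synsets = wn.synsets(word.lower(), pos=wn.NOUN)
    if not synsets:
        continue
    share = 1.0 / len(synsets)  # fractional (Resnik) counting
    for synset in synsets:
        counts[synset] += share  # credit the synset itself...
        for ancestor in synset.closure(lambda s: s.hypernyms()):
            counts[ancestor] += share  # ...and all of its hypernyms

The information content of a synset is then the negative log of its
count divided by the count of the taxonomy root, with add-1 smoothing
applied to the raw counts.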
Performing Word Sense Disambiguation
Based on the materials above, implement a word sense disambiguation
procedure that employs the Resnik similarity measure you created.
The procedure will only need to work on nouns. It should take
a word and a context, defined as a bag of words,
and return the WordNet synset for the ambiguous word selected by
the context.
The procedure should select
the preferred sense based on the similarities between the senses
of the probe word and the senses of the nouns in the context.
NOTE: You do not need to select senses for all
words, only for the probe word; this is a simplification of
the word group disambiguation model in the paper.
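One plausible way to realize this simplified model, sketched under the
same assumptions as the similarity code above (the sum-of-similarities
scoring scheme is illustrative, not prescribed):

def disambiguate(probe, context, ic):
    """Return the probe noun's synset best supported by the context."""
    best_sense, best_score = None, -1.0
    for sense in wn.synsets(probe, pos=wn.NOUN):
        score = 0.0
        for ctx_word in context:
            # IC of the best shared hypernym between this candidate
            # sense and any noun sense of the context word.
            sims = [information_content(h, ic)
                    for c in wn.synsets(ctx_word, pos=wn.NOUN)
                    for h in sense.common_hypernyms(c)]
            score += max(sims) if sims else 0.0
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense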
Data
The words and contexts to analyze are found in
this file.
A file where each input line is prefaced with gold standard sense labels
can be found here. Please note that
the Resnik WSD approach may not be able to disambiguate these words correctly.
It is more likely that you will be able to select the 'gold standard' sense
for the first eight examples than for the rest.
Running word sense disambiguation
Your program should:
- Optional: Compute an information content measure for the noun tree in Wordnet from a corpus. Print it to a file in a format consistent with the IC file provided for NLTK.
- Load the information content measure.
- For each (word,context) pair,
- Use your Resnik similarity function with your desired Wordnet API to compute the preferred Wordnet sense for the probe word given the context.
- On a single line, print the similarity between the probe word and each context word in the format (W1,W2,similarity)
- Print out the preferred sense, by synsetID, of the word.
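A rough driver for these steps, assuming a read_pairs() helper
(hypothetical) that parses the input file into (probe word, context word
list) tuples, and printing synsets by name as a stand-in for whatever
synset ID format you adopt:

for probe, context in read_pairs('input_file'):
    sims = ['(%s,%s,%s)' % (probe, c, resnik_similarity(probe, c, brown_ic))
            for c in context]
    print(' '.join(sims))
    sense = disambiguate(probe, context, brown_ic)
    print(sense.name() if sense else None)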
Files
Please name your program hw6.cmd and your output file results.
Please comment all code and remember to include your name in a comment at the
top of each file.
Testing
Your program must run on patas using:
$ condor_submit hw6.cmd
Please see the CLMA wiki pages on the basics of using the condor
cluster.
All files created by the condor run should appear in the top level of
the directory.
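For reference, a minimal hw6.cmd might look like the following sketch,
assuming hw6.sh is a small wrapper script that invokes your program; all
field values are illustrative, so consult the wiki for cluster-specific
settings.

universe            = vanilla
executable          = hw6.sh
getenv              = true
output              = results
error               = hw6.err
log                 = hw6.log
transfer_executable = false
queue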
Handing in your work
All homework should be handed in using the class CollectIt.
Use the tar command to build a single hand-in file named
hw#.tar, where # is the number of the homework assignment,
containing all the material necessary to test your assignment. Your
hw6.cmd should be at the top level of whatever directory structure
you are using.
For example, in your top-level directory, run:
$ tar cvf hw6.tar *