Ling 571 - Deep Processing Techniques for NLP
Winter 2015
Homework #8: Due 11:59 March 11, 2015
Goals
Through this assignment you will:
- Explore issues in word sense disambiguation.
- Gain familiarity with WordNet and the WordNet API.
- Gain some further familiarity with NLTK.
- Implement a thesaurus-based word sense disambiguation technique on standard data.
Background
Please review the class slides and readings in the textbook on lexical semantics, including WordNet, and word sense disambiguation. Also please read
Section 5.1 of the article, which describes
Resnik's approach to word sense disambiguation in word groupings in detail.
Please also see the hw8_notes
for a detailed discussion of useful implementation hints.
For additional information on NLTK's WordNet API and information
content measures, see:
Computing Semantic Similarity
Based on the examples in the text, class slides, and the article,
implement a procedure resnik_similarity that computes
Resnik's WordNet-based similarity measure. The procedure should take
two words and return the similarity of their most
similar sense pair, i.e., the information content of their most informative
subsumer.
Resnik's similarity measure relies on two components:
- the WordNet taxonomy, and
- a corpus-based information content measure.
NLTK provides a Python
implementation of the WordNet API, which you are encouraged to use.
There are other WordNet APIs, and you may use them, but they come with
no warranty, and may require substantial effort to work with.
NOTE: You may use the API to access components of Wordnet,
extract synsets, identify hypernyms, etc. You may NOT
use the methods which directly implement the similarity measure or
the identification of the Most Informative Subsumer. You must
implement those functions yourself as part of the similarity
calculation. You may use accessors such as common_hypernyms and
information_content. If you have questions about the
admissibility of a procedure, please ask; I'll clarify as quickly
as I can.
The NLTK corpus provides a number of resources for information content
calculation including frequency tables indexed by Wordnet offset and part-of-speech in /corpora/nltk/nltk-data/corpora/wordnet_ic/.
For consistency and quality, I would suggest that you use
/corpora/nltk/nltk-data/corpora/wordnet_ic/ic-brown-resnik-add1.dat,
which derives its counts from the 'balanced' Brown Corpus, using fractional
counts and add1 smoothing to
avoid zero counts for words not in Wordnet. (Not that there aren't
other problems with words not in Wordnet...)
You may
use this source either through the NLTK API (as in wnic = nltk.corpus.wordnet_ic.ic('ic-brown-resnik-add1.dat'))
or directly through
methods that you implement yourself. The file is flat text.
NOTE: The IC files assume that you are using Wordnet 3.0.
If you choose to use a different API but want to use the precomputed IC
measures, you must make sure to use Wordnet version 3.0, or the IC measures
will be inconsistent.
Performing Word Sense Disambiguation
Based on the materials above, implement a word sense disambiguation
procedure that employs the Resnik similarity measure you created.
The procedure will only need to work on nouns. It should take
a word and a context, defined as a bag of words,
and return the WordNet synset for the ambiguous word selected by
the context.
The procedure should select
the preferred sense based on the similarities between the senses
of the probe word and the senses of the nouns in the context.
NOTE: You do not need to select senses for all
words, only for the probe word; this is a simplification of
the word group disambiguation model in the paper.
Files
Data
All files are found in /dropbox/14-15/571/hw8/ on patas:
- wsd_contexts.txt: File of probe words with
disambiguation word grouping lists. Each line is formatted as:
probe_word\tword_grouping, where:
- probe_word is the word to disambiguate
- word_grouping is a comma-separated word list that serves as the disambiguation context
- wsd_contexts.txt.gold: Corresponding file with gold standard sense tags, in which the sense id and gloss are prepended to the
original line.
- example_results: Formatted (partial) example file
Note: the Resnik WSD approach may not be able to disambiguate
these words correctly.
You will probably achieve about 60% accuracy overall. The first several
instances are much easier than the later ones.
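The wsd_contexts.txt line format described above can be split with a small helper like this (the function name is my own, not part of the assignment):

```python
def parse_context_line(line):
    """Split one wsd_contexts.txt line into (probe_word, context_words).

    Lines look like: probe_word<TAB>word1,word2,word3
    """
    probe, grouping = line.rstrip('\n').split('\t')
    return probe, grouping.split(',')
```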
Word sense disambiguation program
Please create a program named hw8_wsd.{py|*} with the
following parameters:
- information_content_file: This should specify the name of the
information content file. Here, you should be using: ic-brown-resnik-add1.dat
- wsd_test_file: This file will contain the lines of
"probe-word, context-words" pairs on which to evaluate your system. You
should use the wsd_contexts.txt file specified above.
- hw8_results.out: This is your results file in the
format specified in the example_results file.
Your program should:
- Load the information content measure.
- For each (word,context) pair,
- Use your Resnik similarity function with a Wordnet API to compute the preferred Wordnet sense for the probe word given the context.
- On a single line, print the similarity between the probe word and each context word in the format (W1,W2,similarity)
- Print out the preferred sense, by synsetID, of the word.
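The authoritative output layout is whatever example_results shows; as a rough sketch of the two pieces described above (a line of (W1,W2,similarity) triples, then the preferred sense), with an illustrative helper name:

```python
def format_result(probe, context_sims, preferred):
    """context_sims: list of (context_word, similarity) pairs;
    preferred: identifier of the chosen synset (e.g. its name)."""
    pair_line = ' '.join('({},{},{})'.format(probe, w, sim)
                         for w, sim in context_sims)
    return pair_line + '\n' + str(preferred)
```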
Write-up
Describe and discuss your work in a write-up file. Include problems you came across and how (or if) you
were able to solve them, any insights, special features, and what you learned. Give examples if possible.
If you were not able to complete parts of the project, discuss what you tried and/or what did not work.
This will allow you to receive maximum credit for partial work.
Please name the file readme.{txt|pdf} with a suitable extension.
Testing
Your program must run on patas using:
$ condor_submit hw8.cmd
Please see the CLMS wiki pages on the basics of using the condor
cluster.
All files created by the condor run should appear in the top level of
the directory.
Handing in your work
All homework should be handed in using the class CollectIt.
Use the tar command to build a single hand-in file, named
hw#.tar where # is the number of the homework assignment and
containing all the material necessary to test your assignment. Your
hw8.cmd should be at the top level of whatever directory structure
you are using.
For example, in your top-level directory, run:
$ tar cvf hw8.tar *