Ling 571 - Deep Processing Techniques for NLP
Winter 2011
Homework #6: Due 11:59 March 8, 2011
Goals
Through this assignment you will:
- Explore issues in word sense disambiguation.
- Gain familiarity with WordNet and the WordNet API.
- Gain some further familiarity with NLTK.
- Implement a thesaurus-based word sense disambiguation technique on standard data.
Background
Please review the class slides and readings in the textbook on lexical semantics, including WordNet, and word sense disambiguation. Also please read the
article describing
Resnik's word sense disambiguation approach in detail.
Computing Semantic Similarity
Based on the examples in the text, class slides, and the article,
implement a procedure resnik_similarity that implements
Resnik's WordNet-based similarity measure. The procedure should take
two words and return the similarity of their most
similar sense pair.
Resnik's similarity measure relies on two components:
- the Wordnet taxonomy, and
- a corpus-based information content measure.
You may use any API to Wordnet that you wish. There are APIs in
a number of languages available for download here. NLTK provides a Python
implementation of the WordNet API, which you may use.
NOTE: You may use the API to access components of Wordnet,
extract synsets, identify hypernyms, etc. You may NOT
use the methods which directly implement the similarity measure or
the identification of the Least Common Subsumer. You must
implement those functions yourself as procedures for the similarity
calculation. You may use accessors such as common_hypernyms and
information_content. If you have questions about the
admissibility of a procedure, please ask; I'll clarify as quickly
as I can.
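For concreteness, here is a minimal sketch of such a procedure, assuming
the NLTK WordNet API and using only the permitted accessors. The ic
argument is an information-content table, loaded as shown below.

from nltk.corpus import wordnet as wn
from nltk.corpus.reader.wordnet import information_content

def resnik_similarity(word1, word2, ic):
    """Max IC of a common subsumer over all noun sense pairs."""
    best = 0.0
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            # The Least Common Subsumer is the shared hypernym with
            # the highest information content; taking the max over
            # all common hypernyms finds it without calling the
            # built-in LCS method.
            for subsumer in s1.common_hypernyms(s2):
                best = max(best, information_content(subsumer, ic))
    return best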
The NLTK corpus provides a number of resources for information content
calculation including frequency tables indexed by Wordnet offset and part-of-speech in /corpora/nltk/nltk-data/corpora/wordnet_ic/.
For consistency and quality, I would suggest that you use
/corpora/nltk/nltk-data/corpora/wordnet_ic/ic-brown-resnik-add1.dat,
which derives its counts from the 'balanced' Brown Corpus, using fractional
counts for ambiguous words (aka Resnik counting), and add1 smoothing to
avoid zero counts for words not in Wordnet. (Not that there aren't
other problems with words not in Wordnet...)
You may
use this source either through the NLTK API or directly through
methods that you implement yourself. The file is flat text.
NOTE: The IC files assume that you are using Wordnet 3.0.
If you choose to use a different API but want to use the precomputed IC
measures, you must make sure to use Wordnet version 3.0, or the IC measures
will be inconsistent.
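For example, to load the suggested table through the NLTK API (NLTK
resolves the filename against its data path, which on patas includes the
directory given above):

from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown-resnik-add1.dat')

A call such as resnik_similarity('dog', 'cat', brown_ic) then returns
the similarity score.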
NOTE: If you prefer, for extra credit, you may implement
a procedure to calculate the information content measure yourself using
one of the POS-tagged corpus excerpts provided with NLTK (such as the Brown Corpus or the Penn Treebank) or elsewhere on the patas cluster.
It should produce an output file of a format similar to that in the
NLTK Wordnet IC files.
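If you attempt this, the core of the counting step might look roughly
like the following sketch, which assumes the NLTK Brown corpus; the
exact output format should be checked against the IC files shipped with
NLTK.

from collections import defaultdict
from nltk.corpus import brown
from nltk.corpus import wordnet as wn

counts = defaultdict(float)
for word in brown.words():
    synsets = wn.synsets(word.lower(), pos=wn.NOUN)
    if not synsets:
        continue
    share = 1.0 / len(synsets)  # fractional (Resnik) counting
    for synset in synsets:
        counts[synset] += share  # credit the synset itself...
        for ancestor in synset.closure(lambda s: s.hypernyms()):
            counts[ancestor] += share  # ...and all of its hypernyms

The information content of a synset is then the negative log of its
count divided by the count of the taxonomy root, with add-1 smoothing
applied to the raw counts.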
Performing Word Sense Disambiguation
Based on the materials above, implement a word sense disambiguation
procedure that employs the Resnik similarity measure you created.
The procedure will only need to work on nouns. It should take
a word and a context, defined as a bag of words,
and return the WordNet synset for the ambiguous word selected by
the context.
The procedure should select
the preferred sense based on the similarities between the senses
of the probe word and the senses of the nouns in the context.
NOTE: You do not need to select senses for all
words, only for the probe word; this is a simplification of
the word group disambiguation model in the paper.
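One plausible way to realize this simplified model, sketched under the
same assumptions as the similarity code above (the sum-of-similarities
scoring scheme is illustrative, not prescribed):

def disambiguate(probe, context, ic):
    """Return the probe noun's synset best supported by the context."""
    best_sense, best_score = None, -1.0
    for sense in wn.synsets(probe, pos=wn.NOUN):
        score = 0.0
        for ctx_word in context:
            # IC of the best shared hypernym between this candidate
            # sense and any noun sense of the context word.
            sims = [information_content(h, ic)
                    for c in wn.synsets(ctx_word, pos=wn.NOUN)
                    for h in sense.common_hypernyms(c)]
            score += max(sims) if sims else 0.0
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense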
Data
The words and contexts to analyze are found in
this file.
A file where each input line is prefaced with gold standard sense labels
can be found here. Please note that
the Resnik WSD approach may not be able to disambiguate these words correctly.
It is more likely that you will be able to select the 'gold standard' sense
for the first eight examples than for the rest.
Running word sense disambiguation
Your program should:
- Optional: Compute an information content measure for the noun tree in Wordnet from a corpus. Print it to a file in a format consistent with the IC file provided for NLTK.
- Load the information content measure.
- For each (word,context) pair,
- Use your Resnik similarity function with your desired Wordnet API to compute the preferred Wordnet sense for the probe word given the context.
- On a single line, print the similarity between the probe word and each context word in the format (W1,W2,similarity)
- Print out the preferred sense, by synsetID, of the word.
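A rough driver for these steps, assuming a read_pairs() helper
(hypothetical) that parses the input file into (probe word, context word
list) tuples, and printing synsets by name as a stand-in for whatever
synset ID format you adopt:

for probe, context in read_pairs('input_file'):
    sims = ['(%s,%s,%s)' % (probe, c, resnik_similarity(probe, c, brown_ic))
            for c in context]
    print(' '.join(sims))
    sense = disambiguate(probe, context, brown_ic)
    print(sense.name() if sense else None)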
Files
Please name your program hw6.cmd and your output file results.
Please comment all code and remember to include your name in a comment at the
top of each file.
Testing
Your program must run on patas using:
$ condor_submit hw6.cmd
Please see the CLMA wiki pages on the basics of using the condor
cluster.
All files created by the condor run should appear in the top level of
the directory.
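For reference, a minimal hw6.cmd might look like the following sketch,
assuming hw6.sh is a small wrapper script that invokes your program; all
field values are illustrative, so consult the wiki for cluster-specific
settings.

universe            = vanilla
executable          = hw6.sh
getenv              = true
output              = results
error               = hw6.err
log                 = hw6.log
transfer_executable = false
queue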
Handing in your work
All homework should be handed in using the class CollectIt.
Use the tar command to build a single hand-in file named
hw#.tar, where # is the number of the homework assignment,
containing all the material necessary to test your assignment. Your
hw6.cmd should be at the top level of whatever directory structure
you are using.
For example, in your top-level directory, run:
$ tar cvf hw6.tar *