Ling 571 - Deep Processing Techniques for NLP
Winter 2015
Homework #8: Due 11:59 March 11, 2015
Goals
Through this assignment you will:
- Explore issues in word sense disambiguation.
- Gain familiarity with WordNet and the WordNet API.
- Gain some further familiarity with NLTK.
- Implement a thesaurus-based word sense disambiguation technique on standard data.
Background
Please review the class slides and readings in the textbook on lexical semantics, including WordNet, and word sense disambiguation. Also please read
Section 5.1 of the article, which describes
Resnik's approach to word sense disambiguation in word groupings in detail.
Please also see the hw8_notes
for a detailed discussion of useful implementation hints.
For additional information on NLTK's WordNet API and information
content measures, see:
Computing Semantic Similarity
Based on the examples in the text, class slides, and the article,
implement a procedure resnik_similarity that computes
Resnik's WordNet-based similarity measure. The procedure should take
two words and return the similarity of their most
similar sense pair, i.e., the information content of their most informative
subsumer.
Resnik's similarity measure relies on two components:
- the WordNet taxonomy, and
- a corpus-based information content measure.
NLTK provides a Python
implementation of the WordNet API, which you are encouraged to use.
There are other WordNet APIs, and you may use them, but they come with
no warranty, and may require substantial effort to work with.
NOTE: You may use the API to access components of Wordnet,
extract synsets, identify hypernyms, etc. You may NOT
use the methods which directly implement the similarity measure or
the identification of the Most Informative Subsumer. You must
implement those functions yourself as part of the similarity
calculation. You may use accessors such as common_hypernyms and
information_content. If you have questions about the
admissibility of a procedure, please ask; I'll clarify as quickly
as I can.
The NLTK corpus provides a number of resources for information content
calculation including frequency tables indexed by Wordnet offset and part-of-speech in /corpora/nltk/nltk-data/corpora/wordnet_ic/.
For consistency and quality, I would suggest that you use
/corpora/nltk/nltk-data/corpora/wordnet_ic/ic-brown-resnik-add1.dat,
which derives its counts from the 'balanced' Brown Corpus, using fractional
counts and add1 smoothing to
avoid zero counts for words not in Wordnet. (Not that there aren't
other problems with words not in Wordnet...)
You may
use this source either through the NLTK API (as in wnic = nltk.corpus.wordnet_ic.ic('ic-brown-resnik-add1.dat'))
or directly through
methods that you implement yourself. The file is flat text.
NOTE: The IC files assume that you are using Wordnet 3.0.
If you choose to use a different API but want to use the precomputed IC
measures, you must make sure to use Wordnet version 3.0, or the IC measures
will be inconsistent.
Performing Word Sense Disambiguation
Based on the materials above, implement a word sense disambiguation
procedure that employs the Resnik similarity measure you created.
The procedure will only need to work on nouns. It should take
a word and a context, defined as a bag of words,
and return the WordNet synset for the ambiguous word selected by
the context.
The procedure should select
the preferred sense based on the similarities between the senses
of the probe word and the senses of the nouns in the context.
NOTE: You do not need to select senses for all
words, only for the probe word; this is a simplification of
the word group disambiguation model in the paper.
Files
Data
All files are found in /dropbox/14-15/571/hw8/ on patas:
- wsd_contexts.txt: File of probe words with
disambiguation word grouping lists. Each line is formatted as:
probe_word\tword_grouping, where:
- probe_word is the word to disambiguate
- word_grouping is a comma-separated word list that serves as the disambiguation context
- wsd_contexts.txt.gold: Corresponding file with gold standard sense tags, in which the sense id and gloss are prepended to the
original line.
- example_results: Formatted (partial) example file
Note: the Resnik WSD approach may not be able to disambiguate
these words correctly.
You will probably achieve about 60% accuracy overall. The first several
instances are much easier than the later ones.
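The wsd_contexts.txt line format described above can be split with a small helper like this (the function name is my own, not part of the assignment):

```python
def parse_context_line(line):
    """Split one wsd_contexts.txt line into (probe_word, context_words).

    Lines look like: probe_word<TAB>word1,word2,word3
    """
    probe, grouping = line.rstrip('\n').split('\t')
    return probe, grouping.split(',')
```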
Word sense disambiguation program
Please create a program named hw8_wsd.{py|*} with the
following parameters:
- information_content_file: This should specify the name of the
information content file. Here, you should be using: ic-brown-resnik-add1.dat
- wsd_test_file: This file will contain the lines of
"probe-word, context-words" pairs on which to evaluate your system. You
should use the wsd_contexts.txt file specified above.
- hw8_results.out: This is your results file in the
format specified in the example_results file.
Your program should:
- Load the information content measure.
- For each (word,context) pair,
- Use your Resnik similarity function with a Wordnet API to compute the preferred Wordnet sense for the probe word given the context.
- On a single line, print the similarity between the probe word and each context word in the format (W1,W2,similarity)
- Print out the preferred sense, by synsetID, of the word.
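The authoritative output layout is whatever example_results shows; as a rough sketch of the two pieces described above (a line of (W1,W2,similarity) triples, then the preferred sense), with an illustrative helper name:

```python
def format_result(probe, context_sims, preferred):
    """context_sims: list of (context_word, similarity) pairs;
    preferred: identifier of the chosen synset (e.g. its name)."""
    pair_line = ' '.join('({},{},{})'.format(probe, w, sim)
                         for w, sim in context_sims)
    return pair_line + '\n' + str(preferred)
```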
Write-up
Describe and discuss your work in a write-up file. Include problems you came across and how (or if) you
were able to solve them, any insights, special features, and what you learned. Give examples if possible.
If you were not able to complete parts of the project, discuss what you tried and/or what did not work.
This will allow you to receive maximum credit for partial work.
Please name the file readme.{txt|pdf} with a suitable extension.
Testing
Your program must run on patas using:
$ condor_submit hw8.cmd
Please see the CLMS wiki pages on the basics of using the condor
cluster.
All files created by the condor run should appear in the top level of
the directory.
Handing in your work
All homework should be handed in using the class CollectIt.
Use the tar command to build a single hand-in file, named
hw#.tar where # is the number of the homework assignment and
containing all the material necessary to test your assignment. Your
hw8.cmd should be at the top level of whatever directory structure
you are using.
For example, in your top-level directory, run:
$ tar cvf hw8.tar *