Ling 571 - Deep Processing Techniques for NLP
Winter 2011
Homework #3: Due February 1, 2011
Goals
Through this assignment you will:
- Explore issues in probabilistic parser design for natural language processing.
- Learn how to extract rule probabilities and explore parser evaluation.
- Improve your understanding of the probabilistic CKY algorithm through implementation.
- Investigate the tradeoffs in probabilistic parser design in terms of speed and accuracy.
Background
Please review the class slides and readings in the textbook on the probabilistic Cocke-Kasami-Younger (CKY) algorithm, optimization, and evaluation.
Data
You will use the following files, derived from the ATIS subset of the
Penn Treebank as described in class.
- sents.test: over 50 sentences from the Air Travel Information System domain.
- parses.train: parses of 514 sentences from the Air Travel Information System domain.
- parses.test: parses of the 50+ test sentences from the Air Travel Information System domain.
Inducing a Probabilistic Context-free Grammar
Based on the material in the lectures and text, implement a procedure
that takes a set of context-free grammar parses of sentences (a small
treebank) and induces a probabilistic context-free grammar from them.
Your algorithm must create a grammar of the form:
A -> B C [0.38725]
All productions must have an associated probability. Write the resulting
grammar to a file called trained.pcfg.
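A standard way to induce the probabilities is relative-frequency (maximum-likelihood) estimation over the treebank rules: P(A -> beta) = Count(A -> beta) / Count(A). Below is a minimal sketch in Python, assuming parses.train holds one bracketed Penn-Treebank-style parse per line; the function names and file-handling details are illustrative, not required.

# Sketch: PCFG induction by relative-frequency (MLE) estimation.
# Assumes one bracketed Penn-Treebank-style parse per line in parses.train.
from collections import defaultdict
import re

def read_tree(text):
    """Parse a bracketed tree string into (label, children) tuples;
    leaves are plain token strings."""
    tokens = re.findall(r'\(|\)|[^\s()]+', text)
    pos = 0
    def parse():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(parse())
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # skip ')'
        return (label, children)
    return parse()

def count_rules(tree, rule_counts, lhs_counts):
    """Tally one production per tree node, recursing into subtrees."""
    label, children = tree
    rhs = []
    for child in children:
        if isinstance(child, tuple):
            rhs.append(child[0])
            count_rules(child, rule_counts, lhs_counts)
        else:
            rhs.append(child)
    rule_counts[(label, tuple(rhs))] += 1
    lhs_counts[label] += 1

rule_counts = defaultdict(int)
lhs_counts = defaultdict(int)
with open('parses.train') as f:
    for line in f:
        line = line.strip()
        if line:
            count_rules(read_tree(line), rule_counts, lhs_counts)

with open('trained.pcfg', 'w') as out:
    for (lhs, rhs), n in sorted(rule_counts.items()):
        out.write('%s -> %s [%.5f]\n' % (lhs, ' '.join(rhs), n / lhs_counts[lhs]))

Note that lexical rules come out in the same A -> ... [prob] format as the binary rules.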
Converting from CKY to Probabilistic CKY
Implement a probabilistic version of the CKY parsing algorithm. Given a
probabilistic context-free grammar and an input string, the algorithm
should return the highest probability parse tree for that input string.
You should follow the approach outlined in the textbook and course notes.
You may adapt the CKY implementation that you created for HW#2. You may
use any language that you like, provided that it can be run on the CLMA
cluster.
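For reference, here is a minimal sketch of the probabilistic CKY recurrence in Python. It assumes the grammar is in Chomsky Normal Form and has been indexed two ways: lex maps each word to a list of (A, prob) preterminal rules A -> word, and binary maps each (B, C) pair to a list of (A, prob) rules A -> B C; these names, and the start symbol TOP, are assumptions about your data structures, not requirements.

# Sketch of probabilistic CKY with Viterbi backpointers.
import math

def pcky_parse(words, lex, binary, start='TOP'):
    n = len(words)
    # table[i][j] maps nonterminal -> (log_prob, backpointer) over words i..j-1
    table = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for A, p in lex.get(w, []):
            table[i][i + 1][A] = (math.log(p), w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = table[i][j]
            for k in range(i + 1, j):  # split point
                for B, (bp, _) in table[i][k].items():
                    for C, (cp, _) in table[k][j].items():
                        for A, p in binary.get((B, C), []):
                            lp = math.log(p) + bp + cp
                            if A not in cell or lp > cell[A][0]:
                                cell[A] = (lp, (k, B, C))
    if start not in table[0][n]:
        return None
    def build(i, j, A):
        back = table[i][j][A][1]
        if isinstance(back, str):  # preterminal over a single word
            return '(%s %s)' % (A, back)
        k, B, C = back
        return '(%s %s %s)' % (A, build(i, k, B), build(k, j, C))
    return build(0, n, start)

The sketch returns None when no parse rooted in the start symbol covers the whole input; the caller can write that out as a blank line (see below).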
You will then parse the sentences in sents.test using the induced PCFG and
your probabilistic CKY implementation, writing the output to a file named
parses.hyp. Your output file MUST have the same number of lines as there
are input sentences. If your algorithm fails to parse a sentence, output a
blank line at the corresponding position in the output file.
Evaluating the PCKY parser
Use the evalb program to evaluate your parser. The executable
may be found in ~/dropbox/10-11/571/tools/ along with the
required parameter file. It should be run as:
$dir/evalb -p $dir/COLLINS.prm parses.test parses.hyp
where $dir is the directory where the program resides.
The results should be written to the file parses.hyp.eval.
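evalb prints its report to standard output, so (assuming the standard evalb behavior) the results can be captured with a shell redirect:
$dir/evalb -p $dir/COLLINS.prm parses.test parses.hyp > parses.hyp.eval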
Improving the parser
You now want to improve your parser, in either of two ways:
- Improving the accuracy of the parser as measured by evalb, or
- Improving the efficiency of the parser as measured by running time, with little or no degradation in accuracy (one possible technique is sketched below).
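For example, one common efficiency technique is beam pruning: after filling each chart cell, discard entries whose Viterbi log probability trails the cell's best entry by more than a fixed margin. This is a sketch only; the beam width is an arbitrary illustrative value, and pruning can in principle discard a constituent needed by the globally best parse, which is exactly the speed/accuracy tradeoff at issue.

def prune_cell(cell, beam=5.0):
    """Drop entries whose log prob trails the cell's best by more than beam.
    cell maps nonterminal -> (log_prob, backpointer), as in the PCKY
    sketch above."""
    if not cell:
        return cell
    best = max(lp for lp, _ in cell.values())
    return {A: (lp, bp) for A, (lp, bp) in cell.items() if lp >= best - beam}

In the PCKY sketch above, this would be applied to table[i][j] once the loop over split points k completes.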
The output of your improved parser should be written to parses.improved.hyp,
and the results of evaluating it with evalb should be written to
parses.improved.hyp.eval.
Finally, please provide a short (1-2 paragraph) description of your
approach and results. Any format, including flat text, is fine.
Files
Please name your program hw3.cmd
Please comment all code and remember to include your name in a comment at the
top of each file.
Testing
Your program must run on patas using:
$ condor_submit hw3.cmd
Please see the CLMA wiki pages on the basics of using the condor
cluster.
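Here, hw3.cmd is a condor submit description. A minimal sketch might look like the following, where hw3.sh is an assumed wrapper script that runs your induction, parsing, and evaluation steps; adjust it to your own file layout.

# Sketch of a minimal condor submit file (hw3.cmd); hw3.sh is assumed.
universe   = vanilla
executable = hw3.sh
getenv     = true
output     = hw3.out
error      = hw3.err
log        = hw3.log
queue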
All files created by the condor run should appear in the top level of
the directory.
Handing in your work
All homework should be handed in using the class CollectIt.
Use the tar command to build a single hand-in file, named
hw#.tar where # is the number of the homework assignment and
containing all the material necessary to test your assignment. Your
hw3.cmd should be at the top level of whatever directory structure
you are using.
For example, in your top-level directory, run:
$ tar cvf hw3.tar *