Ling 571 - Deep Processing Techniques for NLP
Winter 2011
Homework #3: Due February 1, 2011
Goals
Through this assignment you will:
- Explore issues in probabilistic parser design for natural language processing.
- Learn how to extract rule probabilities and explore parser evaluation.
- Improve your understanding of the probabilistic CKY algorithm through implementation.
- Investigate the tradeoffs in probabilistic parser design in terms of speed and accuracy.
Background
Please review the class slides and readings in the textbook on the probabilistic Cocke-Kasami-Younger (CKY) algorithm, optimization, and evaluation.
Data
You will use the following files, derived from the ATIS subset of the
Penn Treebank as described in class.
- sents.test: over 50 sentences from the Air Travel Information System domain.
- parses.train: parses of 514 sentences from the Air Travel Information System domain.
- parses.test: parses of the 50+ test sentences from the Air Travel Information System domain.
Inducing a Probabilistic Context-free Grammar
Based on the material in the lectures and text, implement a procedure
that takes a set of context-free grammar parses of sentences (a small
treebank) and induces a probabilistic context-free grammar from them.
Your algorithm must create a grammar of the form:
A -> B C [0.38725]
All productions must have an associated probability. Write the resulting
grammar to a file called trained.pcfg.
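A standard way to induce the probabilities is relative-frequency (maximum-likelihood) estimation over the treebank rules: P(A -> beta) = Count(A -> beta) / Count(A). Below is a minimal sketch in Python, assuming parses.train holds one bracketed Penn-Treebank-style parse per line; the function names and file-handling details are illustrative, not required.

# Sketch: PCFG induction by relative-frequency (MLE) estimation.
# Assumes one bracketed Penn-Treebank-style parse per line in parses.train.
from collections import defaultdict
import re

def read_tree(text):
    """Parse a bracketed tree string into (label, children) tuples;
    leaves are plain token strings."""
    tokens = re.findall(r'\(|\)|[^\s()]+', text)
    pos = 0
    def parse():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(parse())
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # skip ')'
        return (label, children)
    return parse()

def count_rules(tree, rule_counts, lhs_counts):
    """Tally one production per tree node, recursing into subtrees."""
    label, children = tree
    rhs = []
    for child in children:
        if isinstance(child, tuple):
            rhs.append(child[0])
            count_rules(child, rule_counts, lhs_counts)
        else:
            rhs.append(child)
    rule_counts[(label, tuple(rhs))] += 1
    lhs_counts[label] += 1

rule_counts = defaultdict(int)
lhs_counts = defaultdict(int)
with open('parses.train') as f:
    for line in f:
        line = line.strip()
        if line:
            count_rules(read_tree(line), rule_counts, lhs_counts)

with open('trained.pcfg', 'w') as out:
    for (lhs, rhs), n in sorted(rule_counts.items()):
        out.write('%s -> %s [%.5f]\n' % (lhs, ' '.join(rhs), n / lhs_counts[lhs]))

Note that lexical rules come out in the same A -> ... [prob] format as the binary rules.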
Converting from CKY to Probabilistic CKY
Implement a probabilistic version of the CKY parsing algorithm. Given a
probabilistic context-free grammar and an input string, the algorithm
should return the highest probability parse tree for that input string.
You should follow the approach outlined in the textbook and course notes.
You may adapt the CKY implementation that you created for HW#2. You may
use any language that you like, provided that it can be run on the CLMA
cluster.
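For reference, here is a minimal sketch of the probabilistic CKY recurrence in Python. It assumes the grammar is in Chomsky Normal Form and has been indexed two ways: lex maps each word to a list of (A, prob) preterminal rules A -> word, and binary maps each (B, C) pair to a list of (A, prob) rules A -> B C; these names, and the start symbol TOP, are assumptions about your data structures, not requirements.

# Sketch of probabilistic CKY with Viterbi backpointers.
import math

def pcky_parse(words, lex, binary, start='TOP'):
    n = len(words)
    # table[i][j] maps nonterminal -> (log_prob, backpointer) over words i..j-1
    table = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for A, p in lex.get(w, []):
            table[i][i + 1][A] = (math.log(p), w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = table[i][j]
            for k in range(i + 1, j):  # split point
                for B, (bp, _) in table[i][k].items():
                    for C, (cp, _) in table[k][j].items():
                        for A, p in binary.get((B, C), []):
                            lp = math.log(p) + bp + cp
                            if A not in cell or lp > cell[A][0]:
                                cell[A] = (lp, (k, B, C))
    if start not in table[0][n]:
        return None
    def build(i, j, A):
        back = table[i][j][A][1]
        if isinstance(back, str):  # preterminal over a single word
            return '(%s %s)' % (A, back)
        k, B, C = back
        return '(%s %s %s)' % (A, build(i, k, B), build(k, j, C))
    return build(0, n, start)

The sketch returns None when no parse rooted in the start symbol covers the whole input; the caller can write that out as a blank line (see below).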
You will then parse the sentences in sents.test using the induced PCFG and
your probabilistic CKY implementation, writing the output to a file named
parses.hyp. Your output file MUST have the same number of lines as there
are input sentences. If your algorithm fails to parse a sentence, output a
blank line at the corresponding position in the output file.
Evaluating the PCKY parser
Use the evalb program to evaluate your parser. The executable
may be found in ~/dropbox/10-11/571/tools/ along with the
required parameter file. It should be run as:
$dir/evalb -p $dir/COLLINS.prm parses.test parses.hyp
where $dir is the directory where the program resides.
The results should be written to the file parses.hyp.eval.
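evalb prints its report to standard output, so (assuming the standard evalb behavior) the results can be captured with a shell redirect:
$dir/evalb -p $dir/COLLINS.prm parses.test parses.hyp > parses.hyp.eval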
Improving the parser
You now want to improve your parser, in either of two ways:
- Improving the accuracy of the parser as measured by evalb, or
- Improving the efficiency of the parser as measured by running time, with little or no degradation in accuracy (one possible technique is sketched below).
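For example, one common efficiency technique is beam pruning: after filling each chart cell, discard entries whose Viterbi log probability trails the cell's best entry by more than a fixed margin. This is a sketch only; the beam width is an arbitrary illustrative value, and pruning can in principle discard a constituent needed by the globally best parse, which is exactly the speed/accuracy tradeoff at issue.

def prune_cell(cell, beam=5.0):
    """Drop entries whose log prob trails the cell's best by more than beam.
    cell maps nonterminal -> (log_prob, backpointer), as in the PCKY
    sketch above."""
    if not cell:
        return cell
    best = max(lp for lp, _ in cell.values())
    return {A: (lp, bp) for A, (lp, bp) in cell.items() if lp >= best - beam}

In the PCKY sketch above, this would be applied to table[i][j] once the loop over split points k completes.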
The output of your improved parser should be written to parses.improved.hyp,
and the results of evaluating it with evalb should be written to
parses.improved.hyp.eval.
Finally, please provide a short (1-2 paragraph) description of your
approach and results. Any format, including flat text, is fine.
Files
Please name your program hw3.cmd
Please comment all code and remember to include your name in a comment at the
top of each file.
Testing
Your program must run on patas using:
$ condor_submit hw3.cmd
Please see the CLMA wiki pages on the basics of using the condor
cluster.
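Here, hw3.cmd is a condor submit description. A minimal sketch might look like the following, where hw3.sh is an assumed wrapper script that runs your induction, parsing, and evaluation steps; adjust it to your own file layout.

# Sketch of a minimal condor submit file (hw3.cmd); hw3.sh is assumed.
universe   = vanilla
executable = hw3.sh
getenv     = true
output     = hw3.out
error      = hw3.err
log        = hw3.log
queue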
All files created by the condor run should appear in the top level of
the directory.
Handing in your work
All homework should be handed in using the class CollectIt.
Use the tar command to build a single hand-in file, named
hw#.tar where # is the number of the homework assignment and
containing all the material necessary to test your assignment. Your
hw3.cmd should be at the top level of whatever directory structure
you are using.
For example, in your top-level directory, run:
$ tar cvf hw3.tar *