Ling 571 - Deep Processing Techniques for NLP
Winter 2015
Homework #4: Due February 4, 2015: 11:59pm
Note: You MAY work in teams of two on this assignment. If you do so, your write-up must indicate this clearly and also describe the structure of
your collaboration.
Goals
Through this assignment you will:
- Explore issues in probabilistic parser design for natural language processing.
- Learn how to extract rule probabilities and explore parser evaluation.
- Improve your understanding of the probabilistic CKY algorithm through implementation.
- Investigate the tradeoffs in probabilistic parser design in terms of
speed and accuracy.
Background
Please review the class slides and readings in the textbook on the probabilistic Cocke-Kasami-Younger algorithm, optimization, and evaluation.
Additional slides on the homework itself may be found here.
1: Inducing a Probabilistic Context-free Grammar
Based on the material in the lectures and text, implement a procedure
that takes a set of context-free grammar parses of sentences (a small
treebank) and induces a probabilistic context-free grammar from them.
Your algorithm must create a grammar of the form:
A -> B C [0.38725]
All productions must have an associated probability.
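As a starting point, rule probabilities can be estimated by maximum likelihood: count each production observed in the treebank and divide by the total count of productions sharing the same left-hand side. The sketch below illustrates this on bracketed parse strings; it assumes one parse per line in a simple s-expression format, which may differ from the exact layout of parses.train, and the function names are illustrative, not required.

```python
from collections import defaultdict

def parse_tree(s):
    """Read a bracketed parse like (S (NP i) (VP (V fly))) into
    (label, children) tuples; leaves are plain strings."""
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()
    def helper(i):
        assert tokens[i] == '('
        label = tokens[i + 1]
        i += 2
        children = []
        while tokens[i] != ')':
            if tokens[i] == '(':
                child, i = helper(i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return (label, children), i + 1
    tree, _ = helper(0)
    return tree

def count_rules(tree, counts):
    """Tally each production LHS -> RHS found in the tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[label][rhs] += 1
    for c in children:
        if not isinstance(c, str):
            count_rules(c, counts)

def induce_pcfg(parses):
    """Return (lhs, rhs, probability) triples via relative-frequency
    estimation: P(LHS -> RHS) = count(LHS -> RHS) / count(LHS)."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in parses:
        count_rules(parse_tree(line), counts)
    rules = []
    for lhs, rhss in counts.items():
        total = sum(rhss.values())
        for rhs, n in rhss.items():
            rules.append((lhs, rhs, n / total))
    return rules
```

Each triple can then be printed in the required format, e.g. `print(f"{lhs} -> {' '.join(rhs)} [{p:.5f}]")`.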
2: Converting from CKY to Probabilistic CKY
Implement a probabilistic version of the CKY parsing algorithm. Given a
probabilistic context-free grammar and an input string, the algorithm
should return the highest probability parse tree for that input string.
You should follow the approach outlined in the textbook and course notes.
You may adapt the CKY implementation that you created for HW#3. You may
use any language that you like, provided that it can be run on the CLMS
cluster.
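The core change from CKY to probabilistic CKY is the Viterbi recurrence: each cell stores, for every nonterminal, the score of its best derivation plus a backpointer, and log probabilities are summed to avoid underflow. The following is a minimal sketch assuming a grammar already in Chomsky Normal Form, represented by two hypothetical dicts (lexical and binary rules); handling unary rules and CNF conversion of the induced grammar is left to your implementation.

```python
import math

def pcky(words, lexical, binary, start='S'):
    """lexical: {terminal: [(A, prob), ...]}
       binary:  {(B, C): [(A, prob), ...]}
       Returns (log_prob, bracketed_tree) for the best parse, or None."""
    n = len(words)
    # table[i][j] maps nonterminal -> (best log prob, backpointer)
    table = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    # Fill the diagonal from lexical rules A -> w.
    for i, w in enumerate(words):
        for A, p in lexical.get(w, []):
            table[i][i + 1][A] = (math.log(p), w)
    # Viterbi recurrence over increasing span lengths.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = table[i][j]
            for k in range(i + 1, j):
                for B, (pb, _) in table[i][k].items():
                    for C, (pc, _) in table[k][j].items():
                        for A, p in binary.get((B, C), []):
                            score = math.log(p) + pb + pc
                            if A not in cell or score > cell[A][0]:
                                cell[A] = (score, (k, B, C))
    if start not in table[0][n]:
        return None
    def build(i, j, A):
        """Follow backpointers to reconstruct the bracketed parse."""
        _, bp = table[i][j][A]
        if isinstance(bp, str):
            return f'({A} {bp})'
        k, B, C = bp
        return f'({A} {build(i, k, B)} {build(k, j, C)})'
    return table[0][n][start][0], build(0, n, start)
```

For example, with a toy grammar containing `S -> NP VP [1.0]`, `VP -> V NP [1.0]`, and lexical entries for "i", "fly", and "home", `pcky(['i', 'fly', 'home'], lexical, binary)` returns the single parse with its log probability.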
3: Evaluating the PCKY parser
Use the evalb program to evaluate your parser. The executable
may be found in ~/dropbox/14-15/571/hw4/tools/ along with the
required parameter file. It should be run as:
$dir/evalb -p $dir/COLLINS.prm <gold_standard_parse_file> <hypothesis_parse_file>
where $dir is the directory where the program resides.
4, 5: Improving the parser
You will also need to improve your baseline parser. You can improve the parser
in one or more of the following ways:
- Improving the coverage of the parser in terms of sentences parsed,
- Improving the accuracy of the parser as measured by evalb, or
- Improving the efficiency of the parser as measured by running time, with little or no degradation in accuracy.
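For the efficiency option, one common technique is beam pruning: after filling a chart cell, keep only its k highest-scoring entries so later spans consider fewer combinations. A minimal sketch, assuming cells are dicts of nonterminal -> (log probability, backpointer) as in a standard PCKY chart; the cutoff k is a tunable assumption, and overly aggressive pruning can hurt accuracy:

```python
def prune_cell(cell, k=10):
    """Keep only the k highest-scoring nonterminals in a chart cell
    mapping nonterminal -> (log_prob, backpointer)."""
    if len(cell) <= k:
        return cell
    best = sorted(cell.items(), key=lambda kv: kv[1][0], reverse=True)[:k]
    return dict(best)
```

In your write-up, report how the chosen beam width (or whichever improvement you implement) affects parse time and evalb scores relative to the baseline.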
Files
Training, Test, and Evaluation Data
You will use the following files, derived from the ATIS subset of the
Penn Treebank as described in class. All files can be found on patas in
/dropbox/14-15/571/hw4/data/
- parses.train: parses of 514 sentences from the Air Travel Information System domain. These parses will form your (relatively small) training treebank for this assignment. They will form the basis for inducing
your PCFG rules.
- sents.test: over 50 sentences from the Air Travel Information System domain. These are the test sentences your system must parse and will
be evaluated on, both for the baseline and for your improved system.
- parses.gold: parses of the 50+ test sentences from the Air Travel Information System domain. These parses provide the gold standard
and will be used to evaluate the output of your system on the test
sentences above.
Output Files
You should generate the output files, named as specified below, corresponding
to each of the main components of this assignment:
Running your code
Please name your program hw4.cmd. The hw4.cmd file should run all of these steps end-to-end: PCFG learning, PCKY parsing,
evaluation of baseline results, PCKY improvement, and evaluation of
improved results. You may wish to create a shell script to execute the
different steps required for the assignment, and call that in your
hw4.cmd condor file.
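A minimal hw4.cmd might look like the sketch below; this is only an illustration, the script name run_hw4.sh is hypothetical, and you should consult the CLMS wiki for the exact fields your setup requires.

```
# Hypothetical condor submit file; field values are illustrative.
universe   = vanilla
executable = run_hw4.sh
getenv     = true
output     = hw4.out
error      = hw4.err
log        = hw4.log
transfer_executable = false
queue
```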
Please remember to include your name in a comment at the
top of each code file.
Write-up
Describe and discuss your work in a write-up file. Include problems you came across and how (or if) you
were able to solve them, any insights, special features, and what you learned. Give examples if possible.
If you were not able to complete parts of the project, discuss what you tried and/or what did not work. Make sure to discuss the improvements you implemented
and compare your 'improved' results to your
baseline results.
This will allow you to receive maximum credit for partial work.
For team submissions, be sure to specify your collaboration.
Please name the file readme.{txt|pdf} with a suitable extension.
Testing
Your program must run on patas using:
$ condor_submit hw4.cmd
Please see the CLMS wiki pages on the basics of using the condor
cluster.
All output files created by the condor run should appear in the top level of
the directory.
Handing in your work
All homework should be handed in using the class CollectIt.
Use the tar command to build a single hand-in file, named
hw#.tar where # is the number of the homework assignment and
containing all the material necessary to test your assignment. Your
hw4.cmd should be at the top level of whatever directory structure
you are using.
For example, in your top-level directory, run:
$ tar cvf hw4.tar *