Ling 571 - Deep Processing Techniques for NLP
Winter 2017 
Homework #4: Due January 31, 2017: 23:45
Goals
Through this assignment you will:
- Explore issues in probabilistic parser design for natural language processing.
- Learn how to extract rule probabilities and explore parser evaluation.
- Improve your understanding of the probabilistic CKY algorithm through implementation.
-  Investigate the tradeoffs in probabilistic parser design in terms of
speed and accuracy.
NOTE: You may work in teams of two (2) on this assignment.  If
you do so:
- Submit the hand-in file (hw4.tar) to one teammate's CollectIt.
- Please put a note in the other teammate's CollectIt indicating where the
joint assignment should be found.
- Please include a brief discussion of each teammate's contribution in the readme.{txt|pdf} file.
Background
Please review the class slides and readings in the textbook on the probabilistic Cocke-Kasami-Younger algorithm, optimization, and evaluation.
Additional slides on the homework itself may be found here.
1: Inducing a Probabilistic Context-free Grammar
Based on the material in the lectures and text, implement a procedure
that takes a set of context-free grammar parses of sentences (a small
treebank) and induces a probabilistic context-free grammar from them.
Your algorithm must create a grammar of the form:
 A -> B C [0.38725]
All productions must have an associated probability. 
Specifically, the program should:
-  Read in a set of parsed sentences (a mini-treebank) from a file
-  Identify productions and estimate their probabilities
-  Print out the induced PCFG with productions of the form above.
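The induction steps above can be sketched as follows. This is a minimal illustration in Python, not a required design. It assumes each input line is one bracketed parse such as (S (NP I) (VP sleep)), estimates each rule probability by relative frequency, P(A -> beta) = count(A -> beta) / count(A), and writes terminals in double quotes on the assumption that the output should be readable as an NLTK-style grammar. The function names read_tree, productions, induce_pcfg, and format_rule are hypothetical.

```python
from collections import Counter


def read_tree(text):
    """Parse one bracketed tree, e.g. "(S (NP I) (VP sleep))", into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                          # skip "("
        node = [tokens[pos]]              # first item is the node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())      # nested constituent
            else:
                node.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1                          # skip ")"
        return node

    return parse()


def productions(tree):
    """Yield (lhs, rhs) for every internal node; terminals are double-quoted."""
    children = tree[1:]
    rhs = tuple(c[0] if isinstance(c, list) else '"%s"' % c for c in children)
    yield tree[0], rhs
    for child in children:
        if isinstance(child, list):
            yield from productions(child)


def induce_pcfg(tree_lines):
    """Relative-frequency estimate: count(A -> beta) / count(A)."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for line in tree_lines:
        if not line.strip():
            continue                      # ignore empty lines in the treebank file
        for lhs, rhs in productions(read_tree(line)):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


def format_rule(lhs, rhs, prob):
    """Render one production in the A -> B C [p] form."""
    return "%s -> %s [%s]" % (lhs, " ".join(rhs), prob)
```

For example, given the two parses (S (NP I) (VP sleep)) and (S (NP you) (VP sleep)), this sketch assigns probability 0.5 to NP -> "I" and probability 1.0 to S -> NP VP. By construction, the probabilities of all rules sharing a left-hand side sum to 1.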
Programming 1
Create a program named hw4_topcfg.{py|pl|etc} to perform
PCFG induction invoked as:
hw4_topcfg.{py|pl|etc} <treebank_filename> <output_PCFG_file>, where:
-  <treebank_filename> is the name of the file holding the 
parsed sentences, one parse per line, in Chomsky Normal Form.
-  <output_PCFG_file> is the name of the file where the induced
grammar should be written.
2: Converting from CKY to Probabilistic CKY
Implement a probabilistic version of the CKY parsing algorithm.  Given a
probabilistic context-free grammar and an input string, the algorithm
should return the highest probability parse tree for that input string.
You should follow the approach outlined in the textbook and course notes.
You may adapt the CKY implementation that you created for HW#3.  You may
use any language that you like, provided that it can be run on the CL
cluster.
 
Specifically, your program should:
-  Read in a PCFG in NLTK format as generated above
-  Read in a set of sentences to parse 
-  For each sentence:
-  Parse the sentence using a PCKY algorithm that you implement
-  Print the highest scoring parse to a file, on a single line 
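The core of the steps above can be sketched as follows. This is one possible shape for probabilistic CKY in Python, not the reference solution. It assumes the grammar is in Chomsky Normal Form and has already been loaded into two hypothetical dictionaries, lexical (word to rules A -> word) and binary (child pair to rules A -> B C). It uses log probabilities to avoid underflow on long sentences, and the default start symbol "S" is an assumption that may not match the root label in your treebank.

```python
import math


def pcky(words, lexical, binary, start="S"):
    """Return (log_prob, tree_string) for the best parse, or None if none exists.

    lexical: dict word   -> list of (A, prob) for rules A -> word
    binary:  dict (B, C) -> list of (A, prob) for rules A -> B C
    """
    n = len(words)
    # best[i][j] maps a nonterminal to (log_prob, backpointer) for words[i:j]
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Fill the diagonal with lexical rules.
    for i, word in enumerate(words):
        for lhs, prob in lexical.get(word, []):
            best[i][i + 1][lhs] = (math.log(prob), word)

    # Combine smaller spans into larger ones, keeping only the best score
    # for each nonterminal in each cell.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = best[i][j]
            for k in range(i + 1, j):
                for b, (lp_b, _) in best[i][k].items():
                    for c, (lp_c, _) in best[k][j].items():
                        for lhs, prob in binary.get((b, c), []):
                            score = math.log(prob) + lp_b + lp_c
                            if lhs not in cell or score > cell[lhs][0]:
                                cell[lhs] = (score, (k, b, c))

    if start not in best[0][n]:
        return None                       # no parse covers the whole input

    def build(i, j, label):
        """Follow backpointers to produce a single-line bracketed tree."""
        _, back = best[i][j][label]
        if isinstance(back, str):         # lexical entry: backpointer is the word
            return "(%s %s)" % (label, back)
        k, b, c = back
        return "(%s %s %s)" % (label, build(i, k, b), build(k, j, c))

    return best[0][n][start][0], build(0, n, start)
```

A None result corresponds to a sentence the grammar cannot parse. The nested loops over split points and child nonterminals are where most of the running time goes, so that is a natural place to look when considering efficiency.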
 
Programming 2
Create a program named hw4_parser.{py|pl|etc} to perform
PCKY parsing invoked as:
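hw4_parser.{py|pl|etc} <input_PCFG_file> <test_sentence_filename> <output_parse_filename>, where:
-  <input_PCFG_file> is the name of the file holding the PCFG, in NLTK format, as induced in part 1.
-  <test_sentence_filename> is the name of the file holding the sentences to parse.
-  <output_parse_filename> is the name of the file where the highest scoring parse of each sentence should be written, one parse per line.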
3: Evaluating the PCKY parser
Use the evalb program to evaluate your parser.  The executable 
may be found in ~/dropbox/16-17/571/hw4/tools/ along with the 
required parameter file. It should be run as:
 $dir/evalb -p $dir/COLLINS.prm <gold_standard_parse_file> <hypothesis_parse_file>
where 
- $dir is the directory where the program resides,
- <gold_standard_parse_file> is the name of the file
containing the gold standard parses for the sentences to evaluate over.
The file has one parse per line.
- <hypothesis_parse_file> is the name of the file containing
the parses output by your system to be evaluated against the gold standard
parses.  The file has one parse per line.
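To make the evaluation output easier to interpret, the following Python sketch illustrates the kind of labeled bracket precision, recall, and F1 that a bracket scorer reports. It is only an approximation for intuition and is not a replacement for evalb: the real program, together with COLLINS.prm, may apply additional conventions (for example around punctuation or the root label) that are not modeled here. The function names labeled_spans and bracket_scores are hypothetical.

```python
from collections import Counter


def labeled_spans(tree_text):
    """Count (label, start, end) for each constituent above the preterminal level."""
    tokens = tree_text.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()
    stack = []            # open constituents as [label, start_index, has_subtree]
    word_index = 0
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            if stack:
                stack[-1][2] = True       # parent contains a nested constituent
            stack.append([tokens[i + 1], word_index, False])
            i += 2                        # consume "(" and the label
        elif tok == ")":
            label, start, has_subtree = stack.pop()
            if has_subtree:               # skip preterminal (POS tag) nodes
                spans[(label, start, word_index)] += 1
            i += 1
        else:
            word_index += 1               # a terminal word
            i += 1
    return spans


def bracket_scores(gold_text, hyp_text):
    """Return (precision, recall, f1) over labeled brackets for one sentence."""
    gold, hyp = labeled_spans(gold_text), labeled_spans(hyp_text)
    matched = sum((gold & hyp).values())  # multiset intersection
    precision = matched / sum(hyp.values()) if hyp else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1
```

In this sketch a blank hypothesis line contributes no brackets, so it scores zero precision and recall for that sentence. How evalb itself treats such lines should be checked against its own output.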
4, 5: Improving the parser
You will also need to improve your baseline parser. You can improve the parser by:
-  Improving the coverage of the parser in terms of sentences parsed, 
-  Improving the accuracy of the parser as measured by evalb, or
-  Improving the efficiency of the parser as measured by running time, with little or no degradation in accuracy. 
Re-run the evaluation script on your new parses to demonstrate your improvement.
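As one illustration of a coverage improvement, the sketch below shows a common way to handle words that never appeared in training: rare training words are replaced by a placeholder before grammar induction, and unseen test words are mapped to the same placeholder before parsing. This is only one of many possible approaches and is not prescribed by the assignment. The token "<UNK>", the min_count threshold, and the function names are all hypothetical choices.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder token, not part of the assignment data


def build_vocab(training_sentences, min_count=2):
    """Keep only words seen at least min_count times in the training data."""
    counts = Counter(word for sent in training_sentences for word in sent)
    return {word for word, count in counts.items() if count >= min_count}


def replace_rare(words, vocab):
    """Map every out-of-vocabulary word to the placeholder token."""
    return [word if word in vocab else UNK for word in words]
```

If you take this route, the original words would presumably need to be restored in the output parse so that it still lines up with the gold standard sentence. It is also worth checking with evalb whether the added coverage comes at a cost in accuracy.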
Files
Training, Test, Evaluation, Example Data
You will use the following files, derived from the ATIS subset of the
Penn Treebank as described in class.  All files can be found on patas in
/dropbox/16-17/571/hw4/data/, unless otherwise mentioned:
-  parses.train: parses of 514 sentences from the Air Travel Information System domain.  These parses will form your (relatively small) training treebank for this assignment.  They will form the basis for inducing 
your PCFG rules. 
-  sentences.txt: over 50 sentences from the Air Travel Information System domain.  These are the test sentences your system must parse and will
be evaluated on, both for the baseline and for your improved system.
-  parses.gold: parses of the 50+ test sentences from the Air Travel Information System domain.  These parses will provide the gold standard
and will be used to evaluate the output of your system on the test
sentences above.  
-  toy.pcfg: This file contains a simple Probabilistic  Context-Free Grammar 
(in Chomsky Normal Form) that parses a simple toy set of sentences.
-  toy_sentences.txt: This file contains a small set of toy
example sentences.
-  toy_output.txt: Example output from parsing with the
PCFG, in the one-parse-per-line format required by evalb.
-  example_cky.py: Example implementation of CKY
in Python, available for reference. It is located in /dropbox/16-17/571/hw4/tools/.
Submission Files
- Output Files
 You should generate the output files, named as specified below, corresponding
to each of the main components of this assignment:
- 1: hw4_trained.pcfg: This file should contain the probabilistic context-free grammar trained on the parses.train treebank. 
- 2: parses_base.out: This file should contain the results
of parsing the sentences in sentences.txt using the PCFG induced in part 1 and your PCKY implementation. Your output file MUST have the same number of lines as there are input sentences to parse. If your
algorithm fails to parse a sentence, you should output a blank line in the
output file corresponding to that input.
- 3:  parses_base.eval:  This file should contain the  results of running evalb on your baseline parser output.
- 4: parses_improved.out: This file should contain the output of your improved parser, again run on the sentences.txt test 
data.  It should have the same format as parses_base.out.
- 5: parses_improved.eval: This file should contain the
results of running evalb  on your improved parser output.
 
-   Program Files 
-  hw4_topcfg.{py|pl|etc}: Code to induce the PCFG.
-  hw4_parser.{py|pl|etc}: Code to implement PCKY.
-  hw4_topcfg_improved.{py|pl|etc} and/or hw4_parser_improved.{py|pl|etc}: Code for your improved PCFG induction and/or PCKY parser.
-  hw4_run.sh: This file should run all of these steps end-to-end, and be called by your condor file: 
- PCFG grammar induction,
-  PCKY parsing, 
-  evaluation of baseline results,
-  PCKY improvement, and 
-  evaluation of improved results. 
 
 
 
-  hw4.cmd: Condor file which drives your code for this assignment.  
 
-  readme.{txt|pdf}: Write-up file
- hw4.tar: Your hand-in file
Handing in your work
All homework should be handed in using the class CollectIt.