Due: Feb. 2nd, 2010 at 11:59PM
1. Objectives and Overview
For this assignment you are asked to create a probabilistic CKY parser. You will experiment with optimizing your parser and become familiar with various evaluation techniques. You may use whatever language you like to complete this assignment.
2. Inputs
The data files for this assignment are:
3. Detailed instructions
Task 0:
Read J&M Chapter 14 (PCFGs and probabilistic CKY); review the lecture slides.
Task 1:
Induce a PCFG over train.trees. For this, create a class/function called inducePCFG. The input should be train.trees; the output should be a PCFG file called induced.pcfg, written according to this format.
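The standard way to induce a PCFG from a treebank is relative-frequency estimation: count every rule, then divide by the count of its left-hand side. Below is a minimal sketch of that idea, assuming the treebank contains one bracketed tree per line in the usual s-expression style (e.g. `(S (NP (DT the) (NN dog)) (VP (VBD barked)))`); the tree-reading helper and the in-memory dictionary representation are illustrative assumptions, not the required `induced.pcfg` format.

```python
from collections import defaultdict

def read_tree(s):
    """Parse one bracketed tree string into nested [label, children...] lists.
    (Assumes well-formed, whitespace-tokenizable input.)"""
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()
    pos = 0
    def parse():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(parse())
            else:
                children.append(tokens[pos]); pos += 1
        pos += 1  # consume ')'
        return [label] + children
    return parse()

def rules(tree):
    """Yield (lhs, rhs-tuple) for every internal node of a tree."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c[0] if isinstance(c, list) else c for c in children)
    yield (label, rhs)
    for c in children:
        if isinstance(c, list):
            yield from rules(c)

def inducePCFG(treebank_lines):
    """Relative-frequency estimate: P(A -> b) = count(A -> b) / count(A)."""
    rule_counts = defaultdict(int)
    lhs_counts = defaultdict(int)
    for line in treebank_lines:
        line = line.strip()
        if not line:
            continue
        for lhs, rhs in rules(read_tree(line)):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {(lhs, rhs): n / lhs_counts[lhs]
            for (lhs, rhs), n in rule_counts.items()}
```

Writing the resulting dictionary out as induced.pcfg is then a matter of serializing each (lhs, rhs, probability) triple in the required format.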
Task 2:
Create a PCKY parser (you may adapt your code from hw2) that returns the best parse for each of the 58 input sentences. If you get no parse for a particular sentence, output a blank line; this is necessary for the evaluation program used in the next step. The output, called baseline.trees, should look like this (notice the blank lines). Your results file should be 58 lines long.
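The core of probabilistic CKY is a chart of best-scoring derivations per span, filled bottom-up with Viterbi max over split points. Here is a compact sketch, assuming the grammar is in Chomsky normal form and stored as `{(lhs, rhs): prob}` where `rhs` is a 1-tuple (terminal) or 2-tuple (nonterminals); the function names and the start-symbol parameter are illustrative assumptions.

```python
import math

def pcky(words, grammar, start='TOP'):
    """Viterbi CKY over a CNF grammar; returns a bracketed tree string or None."""
    n = len(words)
    # chart[i][j] maps nonterminal -> (log-prob, backpointer)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                      # lexical rules
        for (lhs, rhs), p in grammar.items():
            if rhs == (w,):
                chart[i][i + 1][lhs] = (math.log(p), w)
    for span in range(2, n + 1):                       # binary rules, widest last
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                  # split point
                for (lhs, rhs), p in grammar.items():
                    if (len(rhs) == 2 and rhs[0] in chart[i][k]
                            and rhs[1] in chart[k][j]):
                        score = (math.log(p) + chart[i][k][rhs[0]][0]
                                 + chart[k][j][rhs[1]][0])
                        if lhs not in chart[i][j] or score > chart[i][j][lhs][0]:
                            chart[i][j][lhs] = (score, (k, rhs[0], rhs[1]))
    return build(chart, 0, n, start) if start in chart[0][n] else None

def build(chart, i, j, sym):
    """Read the backpointers out as a bracketed tree string."""
    bp = chart[i][j][sym][1]
    if isinstance(bp, str):
        return f'({sym} {bp})'
    k, left, right = bp
    return f'({sym} {build(chart, i, k, left)} {build(chart, k, j, right)})'
```

Returning None for an unparseable sentence maps directly onto the blank-line requirement above. Iterating over the whole grammar per cell is the simplest correct version; indexing rules by right-hand side is an easy speedup.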
Task 3:
Evaluate your parser by executing ~/dropbox/09-10/571/tools/EVALB/evalb, a script that calls the EVALB package. The package itself is in the dropbox if you want to play around with it. evalb should print a results file called baseline.score. Each group in the class should get the same score for the baseline.
Task 4:
Improve upon these baseline results using one or more of the techniques discussed in class. That is, you should have the baseline parser AND the improved parser. The output of the new parser should be called improved.trees. For full credit, either (1) show improvement in your EVALB scores or (2) show improved runtime with the same, or only a slightly worse, score. Run evalb again and print the results to a file called improved.score. Briefly discuss your results in a plain text file called README; a couple of paragraphs is sufficient, and there is no need to create a PDF or use Word, etc.
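One improvement route aimed at option (2), runtime, is beam pruning: after filling a chart cell, discard entries whose score falls too far below the cell's best. This is a hedged sketch of the idea only, not the required technique, and it assumes the cell representation used in a chart-based parser (nonterminal mapped to a log-probability and backpointer); note that pruning can hurt accuracy if the beam is too tight.

```python
def prune(cell, beam=5.0):
    """Keep only entries within `beam` log-prob of the cell's best entry.
    cell: {nonterminal: (log_prob, backpointer)}"""
    if not cell:
        return cell
    best = max(score for score, _ in cell.values())
    return {sym: (score, bp) for sym, (score, bp) in cell.items()
            if score >= best - beam}
```

Calling prune on each chart cell as it is completed shrinks the work done for wider spans; the beam width is a tuning knob to discuss in your README.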
Task 5:
Please comment your code; include your names AND NetIDs somewhere in the main file and/or cmd script.
NOTE: To avoid confusion, please have a look at this sample hw3.tar for the kind of directory structure we’re looking for.
4. Running your code
Your code should run on Patas without error. So that we can run your assignment in a semi-automated fashion, please include a single shell script file called, e.g., hw3.cmd. We will run your homework on Patas using the following command:
$ condor_submit hw3.cmd
Once we untar your assignment (see below), this shell script should be in the top level of whatever directory structure you’re using.
Within your hw3.cmd file, write your .out, .log, .error, etc., files to the top-level directory where the hw3.cmd file is. The script should call all necessary code; this way, you can use whatever language you like and whatever directory structure makes sense to you. Please refer to the detailed explanation of each assignment for the kinds of output and supplementary files required. See the CLMA wiki pages for help on this.
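Since the assignment is run via condor_submit, hw3.cmd must be an HTCondor submit description file. A minimal sketch follows; the executable name is a placeholder assumption, and the log/output/error lines illustrate the requirement above that these files land in the top-level directory.

```
# hw3.cmd -- HTCondor submit description file (sketch; adapt to your setup)
executable = run_hw3.sh      # placeholder: your top-level driver script
getenv     = true            # inherit your environment on the execute node
output     = hw3.out         # stdout, written to the top-level directory
error      = hw3.error       # stderr, likewise
log        = hw3.log         # condor job log, likewise
queue
```

The driver script named by `executable` should in turn call inducePCFG, both parsers, and evalb so that a single condor_submit reproduces all required output files.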
5. How to turn in your work
Turn in your assignment using CollectIt. Please TAR your files and name the tar’d file with the extension .tar. Please don’t use ZIP, tar.gz, gzip, rar, etc.
Name the file after whichever homework we're on, e.g., hw6.tar for homework 6. Yes, you will all have the same filename for your homeworks, but this doesn't matter because of the way that CollectIt handles things.
To tar (available on Patas) from the directory that your work is in:
$ tar -cvf hw6.tar *
6. Assessment
This homework is worth 15% of your total grade. Assessment criteria are explained here.