Ling 571 - Deep Processing Techniques for NLP
Winter 2015
Homework #2: Due January 21, 2015
Goals
Through this assignment you will:
- Begin development of an automatic parser. Homework #3 will require
the implementation of the CKY algorithm.
- Develop and manipulate a representation for context-free grammars.
- Improve your understanding of Chomsky Normal Form and weak grammatical
equivalence through implementation.
Background
Please review the class slides and readings in the textbook on Chomsky Normal Form conversion.
Converting a Grammar to Chomsky Normal Form
As noted in the text, the CKY algorithm requires a grammar in Chomsky Normal Form (CNF). While it is not always intuitively clear how to write a grammar from
scratch in CNF, it is fairly straightforward to convert a context-free grammar
into a weakly equivalent grammar in CNF.
Following the approach outlined in class, implement a procedure to
convert an input grammar of the form used for the first assignment to
a new weakly equivalent grammar in CNF.
You will want to create data structures corresponding to RULE, RHS, LHS, etc.
You may use whatever programming language you like, provided that it can
be run on the CLMS cluster using condor. You may use existing implementations
of these data structures in NLTK or other NLP toolkits (e.g. the Stanford
parser), but you must implement the conversion algorithm yourself.
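One possible starting point for the RULE, LHS, and RHS structures is sketched below. The names Rule, is_terminal, and parse_line are hypothetical, and the parser assumes a simplified one-rule-per-line subset of the NLTK .cfg format in which terminals are quoted and contain no spaces or "|" characters. It is a sketch under those assumptions, not a required design.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    """One production: a single LHS nonterminal and a tuple of RHS symbols."""
    lhs: str
    rhs: Tuple[str, ...]

    def __str__(self) -> str:
        return f"{self.lhs} -> {' '.join(self.rhs)}"


def is_terminal(symbol: str) -> bool:
    # Assumption: terminals are written in matching quotes, nonterminals are bare.
    return len(symbol) >= 2 and symbol[0] in "'\"" and symbol[-1] == symbol[0]


def parse_line(line: str) -> List[Rule]:
    """Split a line such as  A -> B C | 'd'  into one Rule per alternative."""
    line = line.strip()
    # Skip blank lines, comments, and anything that is not a rule.
    if not line or line.startswith("#") or "->" not in line:
        return []
    lhs, _, alternatives = line.partition("->")
    return [Rule(lhs.strip(), tuple(alt.split())) for alt in alternatives.split("|")]
```

Making Rule frozen keeps it hashable, so converted rules can be collected in a set to drop duplicates.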
Converting a general context-free grammar to Chomsky Normal Form
The program you submit should do the following:
- Read in an original context-free grammar.
- Convert this grammar to Chomsky Normal Form.
- Print out the rules of the converted grammar to a file.
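The conversion step can be sketched as follows. This is the common three-step textbook recipe (replace terminals inside longer rules, binarize long right-hand sides, remove unit productions), which may differ in detail from the approach presented in class. The function names and the generated nonterminal names (T1, X1, ...) are hypothetical and are assumed not to clash with names already in the grammar. The sketch also assumes the input has no empty (epsilon) rules and that terminals are quoted.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (LHS, RHS symbols)


def is_terminal(sym: str) -> bool:
    # Assumption: terminals start with a quote character.
    return sym[:1] in ("'", '"')


def is_cnf(rhs: Tuple[str, ...]) -> bool:
    """True for A -> 'a' or A -> B C with B and C nonterminals."""
    if len(rhs) == 1:
        return is_terminal(rhs[0])
    return len(rhs) == 2 and not any(is_terminal(s) for s in rhs)


def to_cnf(rules: List[Rule]) -> Set[Rule]:
    # Step 1: inside RHSs longer than one symbol, replace each terminal
    # with a new preterminal that rewrites only to that terminal.
    preterminal: Dict[str, str] = {}
    step1: List[Rule] = []
    for lhs, rhs in rules:
        if len(rhs) > 1:
            fixed = []
            for sym in rhs:
                if is_terminal(sym):
                    if sym not in preterminal:
                        preterminal[sym] = f"T{len(preterminal) + 1}"
                        step1.append((preterminal[sym], (sym,)))
                    sym = preterminal[sym]
                fixed.append(sym)
            rhs = tuple(fixed)
        step1.append((lhs, rhs))

    # Step 2: binarize by folding the two leftmost symbols into a new nonterminal.
    by_lhs: Dict[str, List[Tuple[str, ...]]] = defaultdict(list)
    fresh = 0
    for lhs, rhs in step1:
        while len(rhs) > 2:
            fresh += 1
            by_lhs[f"X{fresh}"].append(rhs[:2])
            rhs = (f"X{fresh}",) + rhs[2:]
        by_lhs[lhs].append(rhs)

    def is_unit(rhs: Tuple[str, ...]) -> bool:
        return len(rhs) == 1 and not is_terminal(rhs[0])

    # Step 3: for each nonterminal, follow chains of unit rules and copy up
    # every non-unit expansion found along the way.
    cnf: Set[Rule] = set()
    for start in list(by_lhs):
        reachable, stack = {start}, [start]
        while stack:
            for rhs in by_lhs.get(stack.pop(), []):
                if is_unit(rhs) and rhs[0] not in reachable:
                    reachable.add(rhs[0])
                    stack.append(rhs[0])
        for nt in reachable:
            for rhs in by_lhs.get(nt, []):
                if not is_unit(rhs):
                    cnf.add((start, rhs))
    return cnf
```

For example, given VP -> V NP PP, VP -> V, and V -> 'book', the result contains X1 -> V NP, VP -> X1 PP, and VP -> 'book', and no longer contains VP -> V. Nonterminals that become unreachable from the start symbol are left in place; removing them is optional for weak equivalence.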
Files
Please adhere to the naming conventions.
Programming
Create a program named hw2_tocnf.py to perform conversion
to Chomsky Normal Form with the following parameters ordered as below:
- /corpora/nltk/nltk-data/grammars/large_grammars/atis.cfg: a file holding grammar rules in the NLTK .cfg format.
- cnf_grammar.cfg: the output grammar file from your system with all rules in Chomsky Normal Form.
NOTE: The ATIS grammar is fairly large (193K), so consider
developing your algorithm on a subset of that grammar or another small grammar
like the NLTK "toy.cfg" or your HW#1 grammar.
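A minimal command-line skeleton for a program taking the two positional parameters above might look like the sketch below. The helper names parse_args and write_rules are hypothetical, and the conversion itself is left out.

```python
from typing import Iterable, List, Tuple


def parse_args(argv: List[str]) -> Tuple[str, str]:
    """Return (input grammar path, output grammar path) from positional arguments."""
    if len(argv) != 3:
        raise SystemExit(f"usage: {argv[0]} <input_grammar> <output_grammar>")
    return argv[1], argv[2]


def write_rules(lines: Iterable[str], path: str) -> None:
    """Write one rule per line, e.g. the strings  A -> B C  or  A -> 'a'."""
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
```

A script would typically call parse_args(sys.argv), read and convert the input grammar, and pass the resulting rule strings to write_rules.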
Verification
Using your system from HW#1,
- use the original ATIS grammar to parse the sentences in /dropbox/14-15/571/hw2/test_sentences.txt. The results should be stored in original_parses.out.
- use your new cnf_grammar.cfg to parse the sentences in /dropbox/14-15/571/hw2/test_sentences.txt. The results should be stored in cnf_parses.out.
Condor file
Please name your condor file hw2.cmd.
Write-up file
Please name your write-up readme.{txt|pdf} as appropriate.
Describe and discuss your work in a write-up file. Include problems you came across and how (or if) you were able to solve them, any insights, special features, and what you learned. Give examples if possible. If you were not able to complete parts of the project, discuss what you tried and/or what did not work.
Also, please review the parses generated by the original grammar and
those from the converted CNF grammar. Provide a brief discussion of
similarities and differences.
Testing
Your program must run on patas using:
$ condor_submit hw2.cmd
Please see the CLMS wiki pages on the basics of using the condor
cluster.
All files created by the condor run should appear in the top level of
the directory.
Handing in your work
All homework should be handed in using the class CollectIt.
Use the tar command to build a single hand-in file, named
hw#.tar where # is the number of the homework assignment and
containing all the material necessary to test your assignment. Your
hw2.cmd should be at the top level of whatever directory structure
you are using.
For example, in your top-level directory, run:
$ tar cvf hw2.tar *