Ling 571 - Deep Processing Techniques for NLP
Winter 2011
Homework #1: Due January 11, 2011
Goals
Through this assignment you will:
- Explore the basics of context-free grammar design.
- Identify some of the challenges in building grammars for natural languages.
- Begin to gain some familiarity with the Natural Language Toolkit (NLTK).
- Gain some experience with the cluster and condor.
Background
Please review the class slides and readings in the textbook on context-free grammars.
Building a Grammar
Based on the text and class notes, create a set of context-free grammar
rules that are adequate to analyze a small set of English natural language
sentences.
Your grammar should be able to produce parses for all
sentences in the data file (as well as other similar English sentences).
The grammar should capture the major clause types (S, FRAG, etc.), the major
phrase types (NP, VP, PP, etc.), the parts of speech (POS) (NN, VBZ, etc.), and any punctuation
or special symbols. The phrase and POS types specified in the Jurafsky and Martin text (Ch. 12 and inside front cover) provide a good basis for your grammar.
You may hard-code capitalization.
Data
The sentences to analyze are found in
this file.
Grammar Format
The grammar should be written in a format that can be read in by
nltk.data.load() and stored in a file named grammar.cfg.
A toy example grammar can be found here.
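As an illustration only, here is a minimal sketch of the plain-text CFG format that NLTK reads: one rule per line written as LHS -> RHS, alternatives separated by |, terminals in quotes, and # starting a comment line. The left-hand side of the first rule is taken as the start symbol. The rules, the words, the PUNC label, and the write_grammar helper are all hypothetical and are not taken from the assignment data or from the linked toy grammar.

```python
# Hypothetical toy grammar in the plain-text CFG format that NLTK reads.
# The words and rules are illustrative only, not taken from the assignment data.
TOY_GRAMMAR = """\
# The left-hand side of the first rule is the start symbol.
S -> NP VP PUNC
NP -> DT NN | NNP
VP -> VBZ NP | VBZ
DT -> 'the' | 'a'
NN -> 'dog' | 'cat'
NNP -> 'Kim'
VBZ -> 'sees' | 'sleeps'
PUNC -> '.'
"""


def write_grammar(path="grammar.cfg", text=TOY_GRAMMAR):
    """Write the grammar text to `path` so a loader can read it from disk."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path
```

A grammar like this would cover tokenized sentences such as "Kim sees the dog ." but nothing in the real data file; it only shows the file layout.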
Parsing
Create a program to parse the example sentences based on your grammar
and analyze the results. Specifically, your program should:
- Load your grammar.
- Build a parser for your grammar using nltk.parse.EarleyChartParser.
- Read in the example sentences.
- For each example sentence, output to a file:
  - the simple bracketed structure parse(s), and
  - the number of parses for that sentence.
- Finally, print the average number of parses per sentence obtained by your grammar.
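The steps above can be sketched roughly as follows. This is one possible outline, not a reference solution. It assumes NLTK 3, where parser.parse() yields trees (older releases exposed a different method), that each sentence sits on its own line with tokens separated by whitespace, and that a ValueError signals a word the grammar does not cover. The function names, the output layout, and the "Number of parses:" label are hypothetical choices.

```python
import os


def load_parser(grammar_path="grammar.cfg"):
    """Load a CFG from disk and wrap it in an Earley chart parser."""
    import nltk  # imported here so the helpers below can be used without NLTK
    # nltk.data.load() chooses the CFG reader from the .cfg extension;
    # the "file:" prefix tells it to read a local path.
    grammar = nltk.data.load("file:" + os.path.abspath(grammar_path))
    return nltk.parse.EarleyChartParser(grammar)


def parse_sentences(parser, sentences, out):
    """Write each sentence's parses and parse count to `out`; return the counts."""
    counts = []
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace-separated tokens
        try:
            trees = list(parser.parse(tokens))
        except ValueError:
            # Assumption: raised when a word is not covered by the grammar.
            trees = []
        out.write(sentence + "\n")
        for tree in trees:
            # Collapse the pretty-printed tree onto one bracketed line.
            out.write(" ".join(str(tree).split()) + "\n")
        out.write("Number of parses: %d\n\n" % len(trees))
        counts.append(len(trees))
    return counts


def average_parses(counts):
    """Mean number of parses per sentence (0.0 for an empty list)."""
    return sum(counts) / len(counts) if counts else 0.0


def main(grammar_path, sentence_path, out_path):
    """Hypothetical driver tying the steps together."""
    parser = load_parser(grammar_path)
    with open(sentence_path, encoding="utf-8") as handle:
        sentences = [line.strip() for line in handle if line.strip()]
    with open(out_path, "w", encoding="utf-8") as out:
        counts = parse_sentences(parser, sentences, out)
    print("Average parses per sentence: %.3f" % average_parses(counts))
```

Calling main("grammar.cfg", "sentences.txt", "hw1.out") would run the whole pipeline, where sentences.txt stands in for whatever the data file is actually called. Keeping the parser as an argument to parse_sentences makes the counting and averaging logic easy to check with a stand-in parser before the grammar is finished.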
Files
Please name your program hw1.cmd and your output file hw1.out.
Please comment all code and remember to include your name in a comment at the
top of each file.
Testing
Your program must run on patas using:
$ condor_submit hw1.cmd
Please see the CLMA wiki pages on the basics of using the condor
cluster.
All files created by the condor run should appear in the top level of
the directory.
Handing in your work
All homework should be handed in using the class CollectIt.
Use the tar command to build a single hand-in file, named
hw#.tar where # is the number of the homework assignment and
containing all the material necessary to test your assignment. Your
hw1.cmd should be at the top level of whatever directory structure
you are using.
For example, in your top-level directory, run:
$ tar cvf hw1.tar *