Ling 571 - Deep Processing Techniques for NLP
Winter 2016
Homework #1: Due January 12, 2016, 11:45pm
Goals
Through this assignment you will:
- Explore the basics of context-free grammar design.
- Identify some of the challenges in building grammars for natural languages.
- Begin to gain some familiarity with the Natural Language Toolkit (NLTK).
- Gain some experience with the cluster and Condor.
Background
Please review the class slides and readings in the textbook on context-free grammars.
Also, see Section 8.3 of the NLTK Book for examples of how to write grammars and configure the included parsers. We'll get to the later parts of that chapter soon.
Building a Grammar
Based on the text and class notes, create a set of context-free grammar
rules adequate to analyze a small set of English sentences.
Your grammar should be able to produce parses for all sentences in the
test sentence file, as well as for other similar English sentences.
The grammar should capture the major clause types (S, etc.), the major
phrase types (NP, VP, PP, etc.), the parts of speech (POS) (NN, VBZ, etc.), and any punctuation
or special symbols. The phrase and POS types specified in the Jurafsky and Martin text (Ch. 12 and inside front cover) provide a good basis for your grammar.
You may hard-code capitalization.
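As a rough illustration of the NLTK grammar format, the sketch below defines a tiny grammar inline with nltk.CFG.fromstring. The rules, the vocabulary, and the PUNC category name are made up for this example and are not taken from toy.cfg or the test sentences. Note that 'The' and 'the' are listed as separate terminals, which is one possible way to hard-code capitalization.

```python
import nltk

# Hypothetical mini-grammar for illustration only. The rule set, the words,
# and the PUNC category are assumptions, not the course's toy.cfg.
EXAMPLE_RULES = """
S -> NP VP PUNC
NP -> DT NN | NNP
VP -> VBZ NP | VBZ
DT -> 'the' | 'The'
NN -> 'dog' | 'cat'
NNP -> 'Kim'
VBZ -> 'sees' | 'sleeps'
PUNC -> '.'
"""

grammar = nltk.CFG.fromstring(EXAMPLE_RULES)

# The left-hand side of the first rule becomes the start symbol.
print(grammar.start())

# Each alternative separated by '|' is stored as its own production.
for production in grammar.productions():
    print(production)
```

The same rule text, saved to a file on its own, is what a .cfg grammar file contains.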
Parsing
Create a program to parse the test sentences based on your grammar
and analyze the results. Specifically, your program should:
- Load your grammar.
- Build a parser for your grammar using nltk.parse.EarleyChartParser.
- Read in the example sentences.
- For each example sentence, output to a file:
  - the sentence itself,
  - the simple bracketed structure parse(s), and
  - the number of parses for that sentence.
- Finally, print the average number of parses per sentence obtained by your grammar.
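The steps above could be sketched roughly as follows. This is a minimal sketch under several assumptions, not a reference solution: the function names are hypothetical, sentences are tokenized by splitting on whitespace, the inline grammar is a made-up example, and the output labels are placeholders (toy_output.txt is the authority on the expected layout). It assumes Python 3 division.

```python
import nltk

# Made-up grammar for demonstration; a real run would load hw1_grammar.cfg.
TOY_GRAMMAR = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN | NP PP
VP -> VBD NP | VP PP
PP -> IN NP
DT -> 'the' | 'The'
NN -> 'dog' | 'cat' | 'park'
VBD -> 'saw'
IN -> 'in'
""")


def load_grammar(path):
    """Read NLTK-format grammar rules from a file (hypothetical helper)."""
    with open(path) as grammar_file:
        return nltk.CFG.fromstring(grammar_file.read())


def parse_sentences(grammar, sentences):
    """Return a list of (sentence, [one-line bracketed parse, ...]) pairs."""
    parser = nltk.parse.EarleyChartParser(grammar)
    results = []
    for sentence in sentences:
        tokens = sentence.split()  # assumption: tokens are space-separated
        trees = list(parser.parse(tokens))
        # str(tree) may wrap across lines; collapse whitespace to one line.
        flat = [" ".join(str(tree).split()) for tree in trees]
        results.append((sentence, flat))
    return results


def average_parses(results):
    """Average number of parses per sentence (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(len(parses) for _, parses in results) / len(results)


def format_results(results):
    """Build the report text; the labels used here are placeholders."""
    lines = []
    for sentence, parses in results:
        lines.append(sentence)
        lines.extend(parses)
        lines.append("Number of parses: %d" % len(parses))
        lines.append("")
    lines.append("Average parses/sent: %.3f" % average_parses(results))
    return "\n".join(lines) + "\n"


demo = parse_sentences(
    TOY_GRAMMAR,
    ["The dog saw the cat in the park", "The dog saw the cat"],
)
print(format_results(demo))
```

The first demo sentence is structurally ambiguous, since the prepositional phrase can attach to either the noun phrase or the verb phrase, so it is a convenient check that multiple parses are being counted.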
Programming
Create a program named hw1_parse.py to perform the
parsing as described above, invoked as:
hw1_parse.py <grammar_file> <test_sentence_file> <output_file>
where
- <grammar_file> is the name of the file holding your grammar rules in the NLTK .cfg format.
- <test_sentence_file> is the name of the file holding the set of sentences to parse, one sentence per line.
- <output_file> is the name of the output file for your system.
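One possible way to handle the three positional arguments is with the standard argparse module, sketched below. The helper names build_arg_parser and read_sentences are hypothetical, and skipping blank lines in the sentence file is an assumption rather than a stated requirement.

```python
import argparse


def build_arg_parser():
    """Command-line interface with the three positional arguments."""
    parser = argparse.ArgumentParser(
        prog="hw1_parse.py",
        description="Parse sentences with an NLTK-format CFG.",
    )
    parser.add_argument("grammar_file",
                        help="grammar rules in NLTK .cfg format")
    parser.add_argument("test_sentence_file",
                        help="sentences to parse, one per line")
    parser.add_argument("output_file",
                        help="file to write the parses and counts to")
    return parser


def read_sentences(lines):
    """Strip each line and drop empty ones (assumption: blanks are skipped)."""
    return [line.strip() for line in lines if line.strip()]


# Example with an explicit argument list; a real script would call
# build_arg_parser().parse_args() so that sys.argv is used instead.
args = build_arg_parser().parse_args(
    ["hw1_grammar.cfg", "sentences.txt", "hw1_output.txt"]
)
print(args.grammar_file, args.test_sentence_file, args.output_file)
```

Since read_sentences accepts any iterable of lines, an open file object can be passed to it directly.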
Files
Please adhere to the naming conventions.
Test and Example Files
- sentences.txt: Sentences to test against
- toy.cfg: Toy example NLTK-format grammar file. Other examples can be found on patas under /corpora/nltk/nltk-data/grammars/
- toy_sentences.txt: Example set of sentences to parse with toy grammar
- toy_output.txt: Example output file based on toy grammar
Submission files
- hw1_parse.py: Primary program file
- hw1_grammar.cfg: Your grammar file
- hw1_output.txt: Results of running your parsing program on the test sentences with your grammar.
- hw1.cmd: Condor file which drives your parser.
- readme.{txt|pdf}: Write-up file. This file should describe and discuss your work on this assignment. Include problems you came across and how (or if) you were able to solve them, any insights, special features, and what you learned. Give examples if possible. If you were not able to complete parts of the project, discuss what you tried and/or what did not work.
- hw1.tar: Your hand-in file
Handing in your work
All homework should be handed in using the class CollectIt.