.. _hw1:

Homework 1
==========

Due: Tuesday, Jan 12th, 2010 at 11:59PM

**1. Objectives and Overview**

This assignment develops familiarity with natural language grammars: you will write a context-free grammar (CFG) for sentences of English and check the results with a parser from the Natural Language Toolkit (NLTK). The exercise will also familiarize you with the NLTK and its associated documentation.

**2. Inputs**

The input files for this assignment are:

- `sentences `_: a text file with 20 English sentences

**3. Detailed instructions**

**Task 0:** Review the lectures on context-free grammars and parts of speech inventories.

**Task 1:** Using the syntactic categories found in the readings and lectures, write a CFG that is adequate to account for the syntactic structure of all the example sentences in ``sentences``. *Adequate* means that, for each input sentence, your grammar accounts for:

- the clause type
- major phrase types
- the part of speech of each syntactic word
- punctuation and special symbols.

You may hard-code the capitalization, e.g., ``NNS --> dogs | Dogs``. Your grammar should account for all 20 input sentences (and may account for other sentences as well).

Encode the CFG in a file called ``grammar.cfg`` in a format readable by the function ``nltk.data.load()``. See the NLTK documentation for examples.

**Task 2:** Create a script called ``hw1.sh`` that processes ``grammar.cfg`` and ``sentences``. The script should call code that performs the following tasks:

- Load ``grammar.cfg``
- Initialize ``nltk.parse.EarleyChartParser`` with the grammar
- Read ``sentences`` line by line
- Parse each sentence
- Print the simple bracketed structure(s) for each parsed sentence, followed by the number of parses for that sentence, to a file called ``hw1.out``.
- Print the average number of parses per input sentence (this shows how ambiguous the sentences are with respect to the grammar).
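The Task 2 steps above can be sketched roughly as follows. This is a minimal illustration, not the required solution: the function names ``load_nltk_parse_fn`` and ``format_report``, the whitespace tokenization, and the output labels are all assumptions (the authoritative output format is the one in the sample output file). It also assumes NLTK 3.x, where ``parser.parse()`` yields trees; older NLTK releases exposed a different parsing method.

```python
from typing import Callable, Iterable, List


def load_nltk_parse_fn(grammar_path: str) -> Callable[[List[str]], List[str]]:
    """Return a function mapping a token list to bracketed parse strings.

    NLTK is imported lazily so the reporting helper below can be used
    (and tested) without it. Assumes NLTK 3.x is installed.
    """
    import nltk

    # nltk.data.load chooses the CFG reader from the ".cfg" extension.
    grammar = nltk.data.load("file:" + grammar_path)
    parser = nltk.parse.EarleyChartParser(grammar)

    def parse_fn(tokens: List[str]) -> List[str]:
        # Collapse each pretty-printed tree onto a single line.
        return [" ".join(str(tree).split()) for tree in parser.parse(tokens)]

    return parse_fn


def format_report(sentences: Iterable[str],
                  parse_fn: Callable[[List[str]], List[str]]) -> str:
    """Build the report text: parses and parse count per sentence, then the average.

    The labels used here are hypothetical placeholders.
    """
    lines: List[str] = []
    total_parses = 0
    sentence_count = 0
    for raw in sentences:
        sentence = raw.strip()
        if not sentence:
            continue  # skip blank lines
        # Assumption: tokens (including punctuation) are whitespace-separated.
        parses = parse_fn(sentence.split())
        lines.append(sentence)
        lines.extend(parses)
        lines.append("Number of parses: %d" % len(parses))
        total_parses += len(parses)
        sentence_count += 1
    average = total_parses / sentence_count if sentence_count else 0.0
    lines.append("Average parses per sentence: %.3f" % average)
    return "\n".join(lines) + "\n"
```

With NLTK available, a driver script could open ``sentences``, pass the file object and ``load_nltk_parse_fn("grammar.cfg")`` to ``format_report``, and write the returned text to ``hw1.out``. Keeping the parser behind a plain callable makes the counting and averaging logic easy to check with a stub.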
(See `sample.out `_ for the appropriate output format.)

**Task 3:** Please comment your code, and include your names somewhere in the main script file.

**4. Running your code**

Your code should run on Patas without error. So that we can run your assignment in a semi-automated fashion, please include a single shell script file called, e.g., ``hw1.sh``. We will run your homework on Patas using the following command:

``$ sh hw1.sh``

This shell script should be in the top level of whatever directory structure you're using, and it should call all necessary code. This way, you can use whatever language you like and whatever directory structure makes sense to you. Please refer to the detailed explanation of each assignment for what kinds of output files to produce and what kinds of supplementary files are required.

**5. How to turn in your work**

Turn in your assignment using `CollectIt `_. Please TAR your files and give the archive the extension ``.tar``. Please don't use ZIP, tar.gz, gzip, rar, etc. Name the file after the current homework, e.g., for homework 6 name your file ``hw6.tar``. Yes, you will all have the same filename for your homeworks, but this doesn't matter because of the way that CollectIt handles things.

To create the archive with ``tar`` (available on Patas), run the following from the directory that your work is in:

``$ tar -cvf hw6.tar *``

**6. Assessment**

This homework is worth 10% of your total grade. Assessment criteria are explained `here `_.