Homework 1

Due: Tuesday, Jan 12th, 2010 at 11:59PM

1. Objectives and Overview

This assignment is aimed at developing a familiarity with natural language grammars by writing a context-free grammar (CFG) for sentences of English. You will be asked to use a parser from the Natural Language Toolkit (NLTK) to check the results. This exercise will familiarize you with using the NLTK and its associated documentation.

2. Inputs

The input files for this assignment are:

  • sentences: a text file with 20 English sentences

3. Detailed instructions

Task 0: Review the lectures on context-free grammars and part-of-speech inventories.

Task 1: Using the syntactic categories found in the readings and lectures, write a CFG that is adequate to account for the syntactic structure of all the example sentences in sentences. Adequate means that, for each input sentence, your grammar accounts for:

  • the clause type (e.g., S, FRAG)
  • major phrase types (e.g., NP, VP)
  • the parts of speech of each syntactic word (e.g., NN, VBZ)
  • punctuation and special symbols (e.g., .,;%)

You may hard-code the capitalization, e.g., NNS -> 'dogs' | 'Dogs'. Your grammar should account for all 20 input sentences (and may account for other sentences as well).

Encode the CFG in a file called grammar.cfg in a format readable by the function nltk.data.load(). See the NLTK documentation for examples. The grammar should look like this sample.
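As a rough illustration of the format only, here is a minimal sketch of a toy grammar written in NLTK's CFG notation and loaded from a string. The categories, the words, and the PUNCT label are hypothetical and are not taken from the course sample or from sentences; the sketch also assumes a recent NLTK release in which nltk.CFG.fromstring is available.

```python
import nltk

# Hypothetical toy grammar, for illustration only. Terminals are quoted,
# alternatives are separated by "|", and the left-hand side of the first
# rule is taken as the start symbol.
GRAMMAR_TEXT = """
S -> NP VP PUNCT
NP -> DT NNS | NNS
VP -> VBP
DT -> 'the' | 'The'
NNS -> 'dogs' | 'Dogs'
VBP -> 'bark'
PUNCT -> '.'
"""

# Parsing the text directly keeps this sketch self-contained. If the same
# text were saved as grammar.cfg, nltk.data.load("file:grammar.cfg")
# would be the file-based route.
grammar = nltk.CFG.fromstring(GRAMMAR_TEXT)

print(grammar.start())             # start symbol of the grammar
print(len(grammar.productions()))  # each alternative counts as one production
```

Each alternative on the right-hand side becomes its own production, so a rule such as NNS -> 'dogs' | 'Dogs' contributes two productions.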

Task 2: Create a script called hw1.cmd that processes grammar.cfg and sentences. The script should call code that performs the following tasks:

  • Load grammar.cfg
  • Initialize nltk.parse.EarleyChartParser with the grammar
  • Read sentences line by line
  • Parse each sentence
  • Print the simple bracketed structure(s) for each parsed sentence, followed by the number of parses for that sentence, to a file called hw1.out.
  • Print the average number of parses per input sentence (this shows how ambiguous the sentences are with respect to the grammar).

(See sample.out for the appropriate output format.)
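The steps listed in Task 2 could be sketched roughly as follows. This is not a reference solution: the toy grammar, the two example sentences, the function name parse_sentences, and the output labels are all hypothetical, and sample.out remains the authority on the output format. The sketch assumes a recent NLTK release (where parse() returns an iterator of trees) and input that is already tokenized with spaces.

```python
import io
import nltk

# Hypothetical toy grammar with a classic PP-attachment ambiguity.
TOY_GRAMMAR = nltk.CFG.fromstring("""
S -> NP VP PUNCT
NP -> PRP | DT NN | NP PP
VP -> VBD NP | VP PP
PP -> IN NP
PRP -> 'I'
DT -> 'the'
NN -> 'man' | 'telescope'
VBD -> 'saw'
IN -> 'with'
PUNCT -> '.'
""")


def parse_sentences(grammar, lines, out):
    """Parse each line, write its trees and parse count, return the counts."""
    parser = nltk.parse.EarleyChartParser(grammar)
    counts = []
    for line in lines:
        tokens = line.split()  # assumption: tokens are whitespace-separated
        if not tokens:
            continue  # skip blank lines
        try:
            trees = list(parser.parse(tokens))
        except ValueError:
            # Raised when a token is not covered by the grammar.
            trees = []
        for tree in trees:
            # Collapse the pretty-printed tree onto a single line.
            out.write(" ".join(str(tree).split()) + "\n")
        out.write("Number of parses: %d\n\n" % len(trees))
        counts.append(len(trees))
    average = sum(counts) / len(counts) if counts else 0.0
    out.write("Average parses/sentence: %.3f\n" % average)
    return counts, average


# Demo on in-memory data; a real script would read the sentences file
# and write to hw1.out instead.
report = io.StringIO()
counts, average = parse_sentences(
    TOY_GRAMMAR,
    ["I saw the man .", "I saw the man with the telescope ."],
    report,
)
print(report.getvalue())
```

Passing the output stream as a parameter keeps the parsing logic separate from file handling, so the same function can write to hw1.out or to an in-memory buffer while debugging.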

Task 3: Please comment your code; include your names somewhere in the main script file.

4. Running your code

Your code should run on Patas without error. So that we can run your assignment in a semi-automated fashion, please include a single shell script file named for the homework (for this assignment, hw1.cmd). We will run your homework on Patas using the following command:

$ condor_submit hw1.cmd

Once we untar your assignment (see below), this shell script should be in the top level of whatever directory structure you’re using.

Within your hw1.cmd file, write your .out, .log, .error, etc. files to the top-level directory where the hw1.cmd file is. The script should call all necessary code. This way, you can use whatever language you like and whatever directory structure makes sense to you. Please refer to the detailed explanation of each assignment for what kinds of output files to produce, and what kinds of supplementary files are required. See the CLMA wiki pages for help on this.

5. How to turn in your work

Turn in your assignment using CollectIt. Please package your files with tar and give the archive the extension .tar. Please don’t use ZIP, tar.gz, gzip, rar, or any other format.

Name the archive after the current homework; for example, for homework 6, name your file hw6.tar. Yes, you will all have the same filename for your homeworks, but this doesn’t matter because of the way that CollectIt handles things.

To create the archive with tar (available on Patas), run the following from the directory that your work is in (shown here for homework 6):

$ tar -cvf hw6.tar *

Finally, see this sample of how you should package your code.
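Before submitting, it may help to confirm that the archive is a plain, uncompressed tar and that the script sits at its top level. The following is a minimal sketch using Python's standard tarfile module; the function name missing_top_level_files and the default list of required files are hypothetical, so adjust them to whatever the assignment actually requires.

```python
import tarfile


def missing_top_level_files(tar_path, required=("hw1.cmd",)):
    """Return the required files that are absent from the archive's top level."""
    # Mode "r:" accepts only uncompressed tar archives, so a .tar.gz that
    # was merely renamed to .tar raises tarfile.ReadError here.
    with tarfile.open(tar_path, "r:") as archive:
        names = set()
        for name in archive.getnames():
            # Members may be stored as "./hw1.cmd"; strip the leading "./".
            names.add(name[2:] if name.startswith("./") else name)
    # A top-level file has no directory component, so an exact match suffices.
    return [name for name in required if name not in names]
```

For example, calling missing_top_level_files("hw1.tar") returns an empty list when hw1.cmd is at the top level of the archive, and a list containing "hw1.cmd" when the script is missing or was packed inside a subdirectory.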

6. Assessment

This homework is worth 10 points toward your total grade.

  • 3.5 pts. Completeness (does the code run to completion?)
  • 3.5 pts. Output (does the code produce the expected output?)
  • 2 pts. Following directions (did you complete the assignment according to the instructions?)
  • 1 pt. Documentation (is the code adequately commented?)

General assessment criteria are explained here.