Ling/CSE 472: Assignment 3: Part-of-Speech Tagging and Parsing

Due May 1st, by 6:00pm

Part 1: Part-of-Speech Tagging

This assignment explores the relationship between the tagset chosen for a part-of-speech tagging task and the accuracy of taggers trained with that tagset. You will need to turn in one Python script and one file with answers to some questions, both of which should be submitted via CollectIt.

You will need the following three Python scripts to complete the assignment:

  • alltags.py
  • 36tags.py
  • 23tags.py

    Download them and save them to a directory in your Patas account. These Python scripts depend on the NLTK library, which is installed on Patas. They will not work on other machines, even if Python is installed, unless NLTK is installed as well.

    In alltags.py, a tagger is trained from Penn Treebank data using all 46 of the Penn Treebank tags. It is a bigram tagger that backs off to a unigram tagger for unseen bigrams and, by default, tags words it has never seen as NN.
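
    The backoff chain described above can be sketched in plain Python. This is only an illustration and not the code in alltags.py, which uses NLTK: the function name train_backoff_tagger, the toy training data, and the choice of (previous tag, current word) as the bigram context are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_backoff_tagger(tagged_sents, default_tag="NN"):
    """Return a tagging function: bigram lookup, then unigram, then default.

    Hypothetical helper for illustration; not part of NLTK or alltags.py.
    """
    unigram_counts = defaultdict(Counter)  # word -> counts of its tags
    bigram_counts = defaultdict(Counter)   # (previous tag, word) -> counts of tags

    for sent in tagged_sents:
        prev_tag = None  # None marks the start of a sentence
        for word, tag in sent:
            unigram_counts[word][tag] += 1
            bigram_counts[(prev_tag, word)][tag] += 1
            prev_tag = tag

    def tag_sentence(words):
        tagged, prev_tag = [], None
        for word in words:
            if (prev_tag, word) in bigram_counts:
                # Seen this word after this tag: use the most frequent tag.
                tag = bigram_counts[(prev_tag, word)].most_common(1)[0][0]
            elif word in unigram_counts:
                # Unseen bigram context: back off to the word's most frequent tag.
                tag = unigram_counts[word].most_common(1)[0][0]
            else:
                # Unseen word: fall back to the default tag.
                tag = default_tag
            tagged.append((word, tag))
            prev_tag = tag
        return tagged

    return tag_sentence


# Toy training data, invented for this example.
toy_train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
tagger = train_backoff_tagger(toy_train)
result = tagger(["the", "wug", "barks"])
print(result)
```

    Here "wug" never appears in the training data, so it receives the default tag NN, while "the" and "barks" are tagged from contexts seen in training.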

    In the other two files, a mapping has been defined which is used to convert the tags in the training and test data. For instance, using the mapping

    { 'NN':'NN', 'NNP':'NN', 'NNPS':'NN' ... }

    the corpus is converted so that every word tagged with 'NN', 'NNP' or 'NNPS' becomes tagged with 'NN'.
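
    As a rough sketch of how such a conversion might look, assuming the corpus is a list of sentences, each a list of (word, tag) pairs: the function name convert_tags and the toy sentence are hypothetical, and leaving unlisted tags unchanged is an assumption of this example, since the provided scripts may handle them differently.

```python
def convert_tags(tagged_sents, mapping):
    """Replace each tag with its mapped value.

    Tags missing from the mapping are left unchanged (an assumption
    made for this sketch).
    """
    return [
        [(word, mapping.get(tag, tag)) for word, tag in sent]
        for sent in tagged_sents
    ]


# Only the three entries shown in the assignment text; the rest is elided there.
mapping = {"NN": "NN", "NNP": "NN", "NNPS": "NN"}

# Toy sentence, invented for this example.
corpus = [[("Seattle", "NNP"), ("rain", "NN"), ("falls", "VBZ")]]
converted = convert_tags(corpus, mapping)
print(converted)
```

    The same conversion has to be applied to both the training data and the test data, so that the tagger is evaluated against the tagset it was trained on.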

    In the file 36tags.py, the mapping collapses the full 46 tags to 36 tags; in the file 23tags.py, the 46 tags are collapsed to just 23 tags.

    When you run one of these files with Python, for instance:

    $ python alltags.py

    it will train the tagger and print its accuracy against the test data.

    For this assignment, turn in two files. The first file should contain the answers to the following questions:

    The second file should be a Python script containing a mapping that leads to a tagger whose accuracy is higher than that of any of the three provided taggers. Copy either 36tags.py or 23tags.py and change the mapping however you like, as long as the resulting tagset contains at least two different tags.
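
    Before submitting, it may help to check that a mapping still produces at least two different tags. The following is a minimal sketch; the helper names and the example mappings are hypothetical and do not come from the provided scripts.

```python
def resulting_tagset(mapping):
    """Return the set of tags that remain after the mapping is applied."""
    return set(mapping.values())


def has_enough_tags(mapping, minimum=2):
    """Check the assignment's constraint of at least two different tags."""
    return len(resulting_tagset(mapping)) >= minimum


# Example mappings, invented for illustration.
too_coarse = {"NN": "X", "NNP": "X", "VBZ": "X"}
acceptable = {"NN": "NN", "NNP": "NN", "VBZ": "VB"}

print(has_enough_tags(too_coarse))
print(has_enough_tags(acceptable))
```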

    Part 2: Parsing

    For this part of the assignment, you will be using the LKB grammar development environment. The LKB is installed on the machines in the Treehouse. It is open source software, available for Linux and Windows. Installation instructions here.

    For this assignment, you will be asked to turn in typed answers. Here is a template answer file into which you can type your answers for Part 2.

    Be sure to get an early start so that you have time to ask questions on GoPost.

    [These directions assume you are working on the Treehouse machines, which run Linux.]

    Start the LKB, and load the grammar

    A. Charts and trees, edges and nodes

    B. Start symbols

