Linguistics 570

HW #2

 

This exercise consists of two parts, Homework #2a and Homework #2b. For HW#2a, you will process a large annotated corpus and determine the frequency of each type-tag pair across all tagged tokens in the corpus.  For HW#2b, you will (1) list the most frequent tag-tag sequences (e.g., DT followed by NN), (2) produce a Markov transition matrix for the most frequently occurring tags, and (3) produce an emission probability matrix for the most frequently occurring types and their associated tags.

 

Please note:  you will be using files from the Penn Treebank, licensed from the LDC to the University of Washington.  Please review the corpus use policies linked from the course home page.  The corpus you will be using is part of LDC95T7 and LDC99T42, which fall under the general licensing arrangements with the LDC.  It is important that you adhere to the licensing restrictions for this and other corpora.

 

HW#2a

 

The input corpus for the exercise consists of a set of files that can be found in the ~/dropbox/08-09/570/hw2 directory.  The files are tagged with PTB tags, a snippet of which follows:

 

Whenever/WRB

a/DT computer/NN

randomly/RB calls/VBZ

them/PRP

from/IN

jail/NN

,/,

the/DT former/JJ prisoner/NN

plugs/VBZ in/IN to/TO let/VB

corrections/NNS officials/NNS

know/VB

they/PRP

're/VBP in/IN

the/DT right/JJ place/NN

at/IN

the/DT right/JJ time/NN

./.
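The snippet above suggests that each token has the form word/TAG and that tokens are separated by white space. A minimal Python sketch of reading one line under that assumption follows. The function names parse_token and parse_line are hypothetical, and splitting on the last slash is a design choice made here so that a word that itself contains a slash keeps its tag; check it against the actual corpus files.

```python
def parse_token(token):
    """Split 'word/TAG' on the LAST slash so words containing '/' survive."""
    word, sep, tag = token.rpartition("/")
    if not sep or not word or not tag:
        return None  # not a word/TAG token (assumption: skip such tokens)
    return word, tag


def parse_line(line):
    """Return the (word, tag) pairs found on one whitespace-delimited line."""
    pairs = []
    for token in line.split():
        parsed = parse_token(token)
        if parsed is not None:
            pairs.append(parsed)
    return pairs


print(parse_line("the/DT former/JJ prisoner/NN"))
print(parse_line(",/,"))
```

Blank lines yield an empty list, so the same function can be applied to every line of a file without special cases.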

 

From the input, you will generate an output file that contains the 20 most frequent type-tag pairs, sorted by descending frequency, as in the following example (note that case has been removed from the words):

 

the/DT                          3

right/JJ                          2

in/IN                            2

whenever/WRB            1
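One possible way to produce this kind of listing is sketched below in Python, using a hash-based counter. The names count_type_tag_pairs and top_pairs are hypothetical, the sample lines are made up for illustration, and breaking frequency ties alphabetically is an assumption (the handout only specifies a tie rule for tag-tag sequences).

```python
from collections import Counter


def count_type_tag_pairs(lines):
    """Count lowercased word / uppercased tag pairs across text lines."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            word, sep, tag = token.rpartition("/")
            if sep and word and tag:
                counts[(word.lower(), tag.upper())] += 1
    return counts


def top_pairs(counts, n=20):
    """Sort by descending frequency; ties broken alphabetically (assumption)."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Illustrative input only; real input would come from the corpus files.
sample = [
    "the/DT former/JJ prisoner/NN",
    "the/DT right/JJ place/NN",
    "The/DT right/JJ time/NN",
]
for (word, tag), freq in top_pairs(count_type_tag_pairs(sample), n=3):
    print(f"{word}/{tag}\t{freq}")
```

Lowercasing happens before counting, so "The/DT" and "the/DT" are merged into one entry.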


HW#2b

 

Building on the work that you started under HW#2a, using the same training corpus, generate the following:

 

(1)   List the 20 most frequent tag-tag sequences, e.g.

 

DT NN            3

DT JJ               1

 

            Should two tag sequences have the same frequency, sort alphabetically by tag sequence.
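The counting and tie-breaking rule can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the tag list is invented, and treating the input as one continuous tag sequence (so that pairs may span line breaks) is an assumption about how the corpus should be read.

```python
from collections import Counter


def count_tag_bigrams(tags):
    """Count adjacent (first_tag, second_tag) pairs in one tag sequence."""
    return Counter(zip(tags, tags[1:]))


def top_bigrams(counts, n=20):
    """Descending frequency, then alphabetical by tag sequence for ties."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Illustrative tag sequence only.
tags = ["DT", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN", "IN", "DT", "NN"]
for (first, second), freq in top_bigrams(count_tag_bigrams(tags), n=4):
    print(f"{first} {second}\t{freq}")
```

Sorting on the key (-frequency, tag pair) handles both ordering requirements in a single pass.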

 

(2)   Produce a Markov transition matrix for the 10 most frequently occurring tags.  This table will be a 10x10 matrix which will represent part of the entire transition matrix you will build (in other words, you’ll have to build the whole matrix in order to output this part). The rows correspond to the tag that comes first, with the columns corresponding to the second (e.g., the logprob for DT JJ is -0.514573).  Use log probabilities, base 2.  All tags should be in upper case.

 

 

        DT          JJ          NN          VBD         VBZ
DT      -Inf        -0.514573   -1.689660   -10.965784  -Inf
JJ      -3.321928   -2.000000   -3.251539   -1.000000   -3.000000
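A minimal Python sketch of such a matrix follows, assuming the unsmoothed estimate P(second | first) = count(first second) / count(first as the first member of a pair). The function names are hypothetical, the tag sequence is invented, and listing the selected tags alphabetically is an assumption based on the layout of the example above.

```python
import math
from collections import Counter


def transition_logprobs(tags):
    """Return {(first, second): log2 P(second | first)} for one tag sequence."""
    bigrams = Counter(zip(tags, tags[1:]))
    first_totals = Counter(tags[:-1])  # how often each tag starts a pair
    return {
        (first, second): math.log2(count / first_totals[first])
        for (first, second), count in bigrams.items()
    }


def print_matrix(tags, top_n=10):
    """Print the top_n x top_n slice: rows = first tag, columns = second tag."""
    logprobs = transition_logprobs(tags)
    ranked = sorted(Counter(tags).items(), key=lambda item: (-item[1], item[0]))
    top = sorted(tag for tag, _ in ranked[:top_n])  # alphabetical (assumption)
    print("\t" + "\t".join(top))
    for first in top:
        cells = []
        for second in top:
            value = logprobs.get((first, second))
            # Unseen transitions have probability 0, i.e. log probability -Inf.
            cells.append("-Inf" if value is None else f"{value:.6f}")
        print(first + "\t" + "\t".join(cells))


# Illustrative tag sequence only.
print_matrix(["DT", "JJ", "NN", "VBZ", "DT", "NN", "VBZ", "DT", "JJ", "NN"], top_n=3)
```

The full dictionary of log probabilities is built first and the printed table is only a slice of it, which mirrors the requirement to build the whole matrix before outputting part of it.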

           

 

(3)   Produce an emission probability matrix for the 20 most frequently occurring types and their associated tags.  This matrix will be a 20 row by n column matrix, which will represent a part of the entire probability matrix you will build.  The columns (tags) should correspond to all tags seen in the corpus, and should be sorted alphabetically by tag name.  Rows should be sorted by frequency (the most frequently occurring type first).  Use log probabilities, base 2.  All tags should be in upper case.

 

 

        DT          JJ          IN          NN          TO          VB
the     -0.736966   -Inf        -Inf        -Inf        -Inf        -Inf
a       -1.321928   -Inf        -Inf        -Inf        -Inf        -Inf
in      -Inf        -Inf        -Inf        -Inf        -Inf        -Inf
book    -Inf        -11.702750  -Inf        -0.514573   -Inf        -1.736966
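The sketch below shows one way such a table could be assembled in Python. It assumes the usual HMM definition of an emission probability, P(word | tag) = count(word, tag) / count(tag), without smoothing; the handout does not spell out the formula, so treat that as an assumption. The function names and the sample pairs are hypothetical.

```python
import math
from collections import Counter


def emission_logprobs(pairs):
    """Return {(word, tag): log2 P(word | tag)} from (word, tag) pairs."""
    pair_counts = Counter((word.lower(), tag.upper()) for word, tag in pairs)
    tag_totals = Counter()
    for (_, tag), count in pair_counts.items():
        tag_totals[tag] += count
    return {
        (word, tag): math.log2(count / tag_totals[tag])
        for (word, tag), count in pair_counts.items()
    }


def emission_table(pairs, top_n=20):
    """Rows: top_n most frequent words; columns: every tag, alphabetical."""
    logprobs = emission_logprobs(pairs)
    word_counts = Counter(word.lower() for word, _ in pairs)
    ranked = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    words = [word for word, _ in ranked[:top_n]]
    tags = sorted({tag.upper() for _, tag in pairs})
    rows = [[""] + tags]
    for word in words:
        row = [word]
        for tag in tags:
            value = logprobs.get((word, tag))
            row.append("-Inf" if value is None else f"{value:.6f}")
        rows.append(row)
    return rows


# Illustrative (word, tag) pairs only; pass a list, since it is read twice.
pairs = [("the", "DT"), ("book", "NN"), ("a", "DT"),
         ("book", "NN"), ("the", "DT"), ("book", "VB")]
for row in emission_table(pairs, top_n=3):
    print("\t".join(row))
```

Returning the rows as lists of strings keeps the formatting (tab-delimited, 6 decimal places, -Inf for unseen pairs) separate from the probability estimates.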

 

Notes:

  1. You may do this exercise in Perl, Java, Python or C.
  2. Most of these languages provide a native hash (dictionary/map) type that will probably be of great use for this homework.
  3. For all tables, use white space (tabs, spaces) as column delimiters.  At least one white space character is required between columns.
  4. Numeric output in tables (2) and (3) should be given to 6 digits after the decimal point.

 

HW#2 due date:  11:45 p.m., Wednesday, October 8th

Submit the following (in one gz, tar, or zip file):

  1. hw2a.sh – A shell script to get hw2a results.  The script should allow the specification of only one parameter, namely the path to the input files (e.g., hw2a.sh ~/dropbox/08-09/570/hw2).  All output should be to standard out.
  2. output2a.txt (containing the output as described)
  3. hw2b.sh – A shell script to get hw2b results. The script should allow the specification of only one parameter, namely the path to the input files.
  4. output2b.txt (containing the output as described)
  5. Your code in Perl, Java, C, Python, Ruby, …
  6. readme.txt – Any comments or explanations.  Can be left empty if none.