Linguistics 570
HW #2
This exercise consists of two parts, one corresponding to Homework #2a and one to Homework #2b. For the HW#2a component, you will process a large annotated corpus and determine the frequency of each type-tag pair across all tagged tokens in the corpus. For the HW#2b component, you will (1) list the most frequent tag-tag sequences (e.g., DT followed by NN), (2) produce a Markov transition matrix for the most frequently occurring tags, and (3) produce an emission probability matrix for the most frequently occurring types and their associated tags.
Please note: you will be using files from the Penn Treebank, licensed from the LDC to the
HW#2a
The input corpus for the exercise consists of a set of files that can be found in the ~/dropbox/08-09/570/hw2 directory. The files are tagged with Penn Treebank (PTB) tags; a snippet from one file follows:
Whenever/WRB
a/DT computer/NN
randomly/RB calls/VBZ
them/PRP
from/IN
jail/NN
,/,
the/DT former/JJ prisoner/NN
plugs/VBZ in/IN to/TO let/VB
corrections/NNS officials/NNS
know/VB
they/PRP
're/VBP in/IN
the/DT right/JJ place/NN
at/IN
the/DT right/JJ time/NN
./.
From the input, you will generate an output file that contains the 20 most frequent type-tag pairs, sorted by descending frequency, for example (note that case is removed from the types):
the/DT 3
right/JJ 2
in/IN 2
whenever/WRB 1
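As a rough illustration of the counting step, here is a minimal Python sketch. It assumes each token has the form word/TAG and splits at the last slash (so a word that itself contains a slash stays intact), that tokens are whitespace-separated, and that ties are broken alphabetically. The handout does not specify a tie-break for HW#2a, so that ordering is an assumption, as are the function names count_type_tags and top_pairs.

```python
from collections import Counter

def parse_token(token):
    """Split 'word/TAG' at the LAST slash; return (lowercased word, uppercased tag)."""
    word, sep, tag = token.rpartition("/")
    if not sep or not word or not tag:
        return None  # not a well-formed word/TAG token
    return word.lower(), tag.upper()

def count_type_tags(lines):
    """Count every (type, tag) pair in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            pair = parse_token(token)
            if pair is not None:
                counts[pair] += 1
    return counts

def top_pairs(counts, n=20):
    """Return the n most frequent pairs; ties sorted alphabetically (an assumption)."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# The snippet from the handout, one string per input line.
sample = [
    "Whenever/WRB", "a/DT computer/NN", "randomly/RB calls/VBZ", "them/PRP",
    "from/IN", "jail/NN", ",/,", "the/DT former/JJ prisoner/NN",
    "plugs/VBZ in/IN to/TO let/VB", "corrections/NNS officials/NNS",
    "know/VB", "they/PRP", "'re/VBP in/IN", "the/DT right/JJ place/NN",
    "at/IN", "the/DT right/JJ time/NN", "./.",
]

for (word, tag), freq in top_pairs(count_type_tags(sample)):
    print(f"{word}/{tag} {freq}")
```

For the real corpus, the same functions could be fed the lines of every file in the input directory instead of the sample list.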
HW#2b
Building on the work that you started under HW#2a, using the same training corpus, generate the following:
(1) List the 20 most frequent tag-tag sequences, e.g.
DT NN 3
DT JJ 1
…
Should two tag sequences have the same frequency, sort alphabetically by tag sequence.
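The tag-tag counting and the tie-break rule above can be sketched as follows. This is only one possible approach: it assumes the tags are read as one continuous stream (so a pair may span a line break), which the handout does not state explicitly, and the names read_tags and top_tag_bigrams are hypothetical.

```python
from collections import Counter

def read_tags(lines):
    """Extract the upper-cased tag of every word/TAG token, in order."""
    tags = []
    for line in lines:
        for token in line.split():
            word, sep, tag = token.rpartition("/")
            if sep and tag:
                tags.append(tag.upper())
    return tags

def top_tag_bigrams(tags, n=20):
    """Count adjacent tag pairs; sort by descending count, then alphabetically."""
    counts = Counter(zip(tags, tags[1:]))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

lines = [
    "the/DT former/JJ prisoner/NN",
    "the/DT right/JJ place/NN at/IN",
    "the/DT right/JJ time/NN",
]

for (first, second), freq in top_tag_bigrams(read_tags(lines)):
    print(first, second, freq)
```

Sorting on the key (-count, pair) handles both criteria in one pass: the negated count puts frequent pairs first, and the tuple of tag names orders equal counts alphabetically.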
(2) Produce a Markov transition matrix for the 10 most frequently occurring tags. This table will be a 10x10 matrix which will represent part of the entire transition matrix you will build (in other words, you’ll have to build the whole matrix in order to output this part). The rows correspond to the tag that comes first, with the columns corresponding to the second (e.g., the logprob for DT JJ is -0.514573). Use log probabilities, base 2. All tags should be in upper case.
    | DT        | JJ        | NN        | VBD        | VBZ       |
DT  | -Inf      | -0.514573 | -1.689660 | -10.965784 | -Inf      |
JJ  | -3.321928 | -2.000000 | -3.251539 | -1.000000  | -3.000000 |
…
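One way to compute the transition log probabilities is sketched below. It assumes unsmoothed relative-frequency estimates, i.e. P(second | first) = count(first second) / count(first as the first member of a pair), which is consistent with the -Inf entries in the sample table but is not spelled out in the handout. The helper names and the tab-separated output layout are assumptions as well.

```python
import math
from collections import Counter

def make_logprob(tags):
    """Return a function giving log2 P(second | first) from a tag sequence."""
    bigrams = Counter(zip(tags, tags[1:]))
    firsts = Counter(tags[:-1])  # how often each tag starts a pair

    def logprob(first, second):
        count = bigrams[(first, second)]
        if count == 0:
            return float("-inf")  # unseen transition, no smoothing
        return math.log2(count / firsts[first])

    return logprob

def fmt(value):
    """Format like the sample table: six decimals, '-Inf' for unseen pairs."""
    return "-Inf" if math.isinf(value) else f"{value:.6f}"

def print_matrix(tags, top_n=10):
    """Print the sub-matrix for the top_n most frequent tags (rows = first tag)."""
    logprob = make_logprob(tags)
    top = [tag for tag, _ in Counter(tags).most_common(top_n)]
    print("\t".join([""] + top))
    for first in top:
        print("\t".join([first] + [fmt(logprob(first, second)) for second in top]))

tags = "DT JJ NN DT JJ NN IN DT NN".split()
print_matrix(tags)
```

Note that the full bigram and first-tag counts are built before the top tags are selected, matching the requirement that the whole matrix exists even though only part of it is printed.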
(3) Produce an emission probability matrix for the 20 most frequently occurring types and their associated tags. This matrix will be a 20 row by n column matrix, which will represent a part of the entire probability matrix you will build. The columns (tags) should correspond to all tags seen in the corpus, and should be sorted alphabetically by tag name. Rows should be sorted by frequency (the most frequently occurring type first). Use log probabilities, base 2. All tags should be in upper case.
     | DT        | JJ         | IN   | NN        | TO   | VB        |
the  | -0.736966 | -Inf       | -Inf | -Inf      | -Inf | -Inf      |
a    | -1.321928 | -Inf       | -Inf | -Inf      | -Inf | -Inf      |
in   | -Inf      | -Inf       | -Inf | -Inf      | -Inf | -Inf      |
book | -Inf      | -11.702750 | -Inf | -0.514573 | -Inf | -1.736966 |
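A possible sketch of the emission matrix computation follows. It assumes the usual HMM definition P(type | tag) = count(type, tag) / count(tag) with no smoothing, and that row ties are broken alphabetically. Neither detail is stated in the handout, and the function name emission_table is hypothetical.

```python
import math
from collections import Counter

def emission_table(pairs, top_n=20):
    """Build log2 P(type | tag) rows for the top_n most frequent types.

    pairs: iterable of (type, tag) tuples, already case-normalized.
    Returns (tags, rows): tags sorted alphabetically, rows as (type, [logprobs]).
    """
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    tag_counts = Counter(tag for _, tag in pairs)
    type_counts = Counter(word for word, _ in pairs)

    tags = sorted(tag_counts)  # columns: every tag seen, alphabetical
    ranked = sorted(type_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    for word, _ in ranked[:top_n]:
        row = []
        for tag in tags:
            count = pair_counts[(word, tag)]
            row.append(math.log2(count / tag_counts[tag]) if count else float("-inf"))
        rows.append((word, row))
    return tags, rows

pairs = [
    ("the", "DT"), ("the", "DT"), ("the", "DT"), ("a", "DT"), ("a", "DT"),
    ("book", "NN"), ("book", "VB"), ("time", "NN"),
]
tags, rows = emission_table(pairs)
print("\t".join([""] + tags))
for word, row in rows:
    cells = ["-Inf" if math.isinf(v) else f"{v:.6f}" for v in row]
    print("\t".join([word] + cells))
```

In this toy input, "the" accounts for 3 of the 5 DT tokens, so its DT cell is log2(0.6), and "a" accounts for the other 2, giving log2(0.4).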
Notes:
HW#2 due date: 11:45 p.m., Wednesday, October 8th
Submit the following (in one gz, tar, or zip file):