Linguistics 570
HW #4
For this homework you will modify the tools you developed in
HW#3 and use them for a slightly different task, related to HW#3b. You will build four Portuguese language models
and use perplexity to measure their accuracy against an independent test set
(represented as a model).
Description
Use the Portuguese language model porteci.lm
(found in dropbox/07-08/570/hw4) as your test set. Here's what you do:
- Build two bigram language models over the Portuguese data in /corpora/LDC/LDC99T40/RAW/afp,
one built over just the 96 corpus files and one built over all the files. (You can use the C&T software to
extract the bigram and unigram counts, if you want, or build the bigram
and unigram models yourself.)
- Use Add-1 smoothing over the two models, and then calculate their perplexity
(as defined in class, Day 10) against the porteci.lm
language model.
- Use Uniform Good Turing over the two models, and again calculate their
perplexity against the same model.
Use a threshold k=5.
- Output these perplexity scores (3 digits of significance) in a table like so:
Model 1 96, Add 1 n.nn
Model 2 all, Add 1 n.nn
Model 3 96, Good Turing n.nn
Model 4 all, Good Turing n.nn
- Output your frequency of frequency tables for Models 3 & 4 (c* to 3 digits of
significance) up to k:
Model 3 Good Turing
c Nc c*
0 nnnnn n.nn
1 nnnnn n.nn
2 nnnnn n.nn
3 nnnnn n.nn
4 nnnnn n.nn
5 nnnnn n.nn
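As a rough illustration of the two smoothing steps above, the Python sketch below computes an Add-1 bigram probability and a frequency of frequency table with Good Turing adjusted counts. It is only a sketch under stated assumptions: the function names are hypothetical, N0 is taken to be the number of possible bigrams never observed (V squared minus the number of observed bigram types), and c* uses the basic formula c* = (c+1) * N(c+1) / Nc, which may differ from the variant presented in class. Perplexity is left out, since it should follow the class definition.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


def add_one_prob(w1, w2, bigrams, unigrams, vocab_size):
    """Add-1 (Laplace) estimate of P(w2 | w1)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)


def good_turing_table(bigrams, vocab_size, k=5):
    """Return rows (c, Nc, c*) for c = 0..k, with c* = (c+1) * N(c+1) / Nc."""
    freq_of_freq = Counter(bigrams.values())
    # Assumption: N0 is the number of possible bigrams that were never seen.
    freq_of_freq[0] = vocab_size ** 2 - len(bigrams)
    rows = []
    for c in range(k + 1):
        n_c = freq_of_freq[c]
        n_next = freq_of_freq[c + 1]
        # c* is undefined when no bigram occurs exactly c times.
        c_star = (c + 1) * n_next / n_c if n_c else float("nan")
        rows.append((c, n_c, c_star))
    return rows


# Toy data, not the AFP corpus.
tokens = "a b a b a c".split()
unigrams = Counter(tokens)
bigrams = bigram_counts(tokens)
vocab_size = len(unigrams)

p_ab = add_one_prob("a", "b", bigrams, unigrams, vocab_size)
rows = good_turing_table(bigrams, vocab_size, k=5)
```

On real data, counts above the threshold k would be left unadjusted, and an Nc of zero below k (as happens in the toy data) usually signals that the corpus is too small for the estimate to be meaningful.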
Notes:
- You may do this exercise in Perl, Java, Python, C, C#, or Ruby.
- For all tables, use white space (tabs, spaces) as column delimiters. At least one white space character
is required.
- Number output in tables should be to 3 significant digits (except for the N
values in the second table).
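One possible way to meet the formatting notes above in Python is sketched here. The helper name sig3 is hypothetical, and the sketch assumes plain decimal output (no scientific notation) is wanted, matching the n.nn pattern in the sample tables.

```python
import math


def sig3(x, digits=3):
    """Format x to `digits` significant digits in plain decimal notation."""
    if x == 0:
        return "0." + "0" * (digits - 1)
    # Position of the leading digit: 0 for 1.23, -1 for 0.123, 3 for 1234.5
    exponent = math.floor(math.log10(abs(x)))
    decimals = max(digits - 1 - exponent, 0)
    return f"{round(x, digits - 1 - exponent):.{decimals}f}"


# A table row with tab-delimited columns: c, Nc (left as an integer), c*
row = "\t".join(["1", "12345", sig3(0.4567)])
print(row)
```

The Nc column is written unchanged because the 3 significant digit requirement does not apply to the N values.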
HW#4 due date: 11:45 p.m., Tuesday, November 4th
Submit the following (in one gz, tar, or zip file):
- hw4a.sh: A shell script to get the hw4 results
for the first table (write to standard out in the script; you may list the tables
serially to standard out)
- hw4b.sh: A shell script to get the hw4
results for the second and third tables (the Good Turing tables; you may
use stored results if you'd like, as long as the files are kept locally)
- output4-0.txt, containing the first
table (under bullet 4 above)
- output4-3.txt and output4-4.txt,
containing the tables for Models 3 & 4 (c, Nc, and c*, as described under
bullet 5)
- Your code
- comment.txt: Any comments or other notes
of relevance