Linguistics 570
HW #3
This homework has two parts. In the first part (3a), you’ll compare n-gram language models built over data from several languages to see how closely they resemble each other (a quasi-genetic study of the languages). In the second part (3b), you’ll test the quality of a language model by comparing it to a language model that you build yourself.
HW#3a
In class, we used an application built on the Cavnar and Trenkle (C&T) algorithm to do language ID. In some instances it erred, and in some of those cases it produced several language names. You may have noticed that when it erred, it sometimes returned the names of genetically related languages (sometimes in addition to the one we were expecting).
Build language models using the C&T package (in dropbox/08-09/570/CT/) for four languages: English, German, Spanish, and Portuguese. Then calculate KL-divergence between the language models as a measure of how far apart they are. Do a pairwise comparison, and be sure to smooth across each pair (to eliminate zero denominators). For the purposes of this exercise, compare only the unigram, bigram, and trigram models, each language against every other. You should output three tables (one for each value of n), one after another. They should be similar to the following:
            English  German  Spanish  Portuguese
English              n.nn    n.nn     n.nn
German      n.nn             n.nn     n.nn
Spanish     n.nn     n.nn             n.nn
Portuguese  n.nn     n.nn    n.nn
Since KL-divergence is not symmetric, use the probability distributions for languages on the left (the rows) as p(x) and probability distributions for those at the top (the columns) as q(x).
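The pairwise comparison can be sketched roughly as follows. This is a minimal illustration, not the required method: it assumes each model is available as a table of raw n-gram counts, and it uses add-lambda smoothing over the union of the two vocabularies, which is only one of several ways to "smooth across each pair". The names char_ngrams, smoothed_distributions, and kl_divergence are hypothetical helpers and are not part of the C&T package.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count the character n-grams of length n in text (illustrative helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def smoothed_distributions(p_counts, q_counts, lam=1.0):
    """Add-lambda smooth two count tables over their shared vocabulary.

    Assumption: add-lambda smoothing; the assignment does not name a method.
    Every n-gram seen in either model gets a nonzero probability in both,
    so no term of the KL sum has a zero denominator.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + lam * len(vocab)
    q_total = sum(q_counts.values()) + lam * len(vocab)
    p = {g: (p_counts.get(g, 0) + lam) / p_total for g in vocab}
    q = {g: (q_counts.get(g, 0) + lam) / q_total for g in vocab}
    return p, q


def kl_divergence(p, q):
    """D(p || q) in bits; p and q must be defined over the same n-grams."""
    return sum(p[g] * math.log2(p[g] / q[g]) for g in p)
```

For one table cell, the row language supplies p(x) and the column language supplies q(x):

```python
row_counts = char_ngrams("the cat sat on the mat", 2)        # toy stand-in for the English model
col_counts = char_ngrams("die katze sass auf der matte", 2)  # toy stand-in for the German model
p, q = smoothed_distributions(row_counts, col_counts)
print(f"{kl_divergence(p, q):.2f}")
```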
HW#3b
Build a C&T model over the Portuguese Newswire corpus (LDC99T40), which can be found in /corpora/LDC/LDC99T40/RAW/afp, for just the year 1996 (afp96*.sgm). You will want to build the model just over the Portuguese text, which is surrounded by the tags <TEXT></TEXT>. There are some other SGML tags (such as <p>) which will also need to be removed.
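One way to pull out the Portuguese text is sketched below. It is a rough illustration under stated assumptions: the files are treated as Latin-1 encoded (not confirmed by the corpus documentation here), every remaining <...> sequence inside a TEXT block is treated as a tag to discard, and SGML character entities are left untouched. The function names extract_text and read_corpus are hypothetical.

```python
import glob
import re

# Everything between <TEXT> and </TEXT>, across line breaks.
TEXT_BLOCK = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)
# Any remaining SGML tag, such as <p>.
TAG = re.compile(r"<[^>]+>")


def extract_text(sgml):
    """Return the text inside <TEXT>...</TEXT> with inner tags removed."""
    pieces = []
    for block in TEXT_BLOCK.findall(sgml):
        no_tags = TAG.sub(" ", block)
        # Collapse runs of whitespace left behind by removed tags and newlines.
        pieces.append(" ".join(no_tags.split()))
    return "\n".join(pieces)


def read_corpus(pattern, encoding="latin-1"):
    """Concatenate the extracted text of every file matching a glob pattern.

    Assumption: the encoding default is a guess; adjust it if the files differ.
    """
    texts = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding=encoding) as handle:
            texts.append(extract_text(handle.read()))
    return "\n".join(texts)
```

For the 1996 sample, this could be called as read_corpus("/corpora/LDC/LDC99T40/RAW/afp/afp96*.sgm"), and the result passed to whatever builds the C&T model.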
See how good your model is compared to the one that C&T provides by calculating KL-divergence between the two for the unigram, bigram, and trigram models (use yours as p(x), and C&T's as q(x)). Also, calculate the perplexity of your model (not theirs) for each n. Your output should look like the following:
                                  1-gram  2-gram  3-gram
1996 Portuguese model divergence
Perplexity
Do the same thing for a larger sample of Portuguese, built over the 1996 and 1997 texts (afp96*.sgm and afp97*.sgm), and again for an even larger sample covering 1996-1998.
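The perplexity row can be sketched as below. This assumes perplexity is taken as 2 raised to the entropy of your model's own n-gram distribution; if the course defines it over held-out text instead, substitute the cross-entropy on that text. The helper names normalize, entropy, and perplexity are hypothetical.

```python
import math


def normalize(counts):
    """Turn a table of raw n-gram counts into a probability table."""
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def entropy(p):
    """Shannon entropy H(p) in bits of a probability table {ngram: prob}."""
    return -sum(prob * math.log2(prob) for prob in p.values() if prob > 0)


def perplexity(p):
    """Perplexity of a distribution, taken here as 2 ** H(p) (an assumption)."""
    return 2 ** entropy(p)
```

As a sanity check, a uniform distribution over four n-grams has an entropy of 2 bits and therefore a perplexity of 4:

```python
uniform = normalize({"a": 5, "b": 5, "c": 5, "d": 5})
print(perplexity(uniform))  # 4.0
```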
Notes:
HW#3 due date: 11:45 p.m., Wednesday, October 29th
Submit the following (in one gz, tar, or zip file):