Linguistics 570

HW #4

 

For this homework you will modify the tools you developed in HW#3 and use them for a slightly different task, related to HW#3b.  You will build four Portuguese language models and use perplexity to measure their accuracy against an independent test set (represented as a model).

 

Description

 

Use the Portuguese language model (porteci.lm), which can be found in dropbox/07-08/570/hw4, as your test set.  Here’s what to do:

 

  1. Build two bigram language models over the Portuguese data in /corpora/LDC/LDC99T40/RAW/afp, one built over just the 96 corpus files and one built over all the files.  (You can use the C&T software and extract the bigram and unigram counts, if you want, or build the bigram and unigram models yourself.)
  2. Use Add-1 smoothing over the two models, and then calculate their perplexity (as defined in class, Day 10) against the porteci.lm language model.
  3. Use Uniform Good Turing over the two models, and again calculate their perplexity against the same model.  Use a threshold k=5.
  4. Output these perplexity scores (to 3 significant digits) in a table like this:

 

Model 1 – 96, Add 1               n.nn

Model 2 – all, Add 1                n.nn

Model 3 – 96, Good Turing      n.nn

Model 4 – all, Good Turing       n.nn
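
The Add-1 and perplexity steps above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the definition given in class: it assumes perplexity is 2 raised to the cross-entropy of the trained model, with each bigram weighted by a probability taken from the test model. The function names, the toy counts, and the test_probs dictionary are hypothetical, and reading porteci.lm is not shown because its file format is not described here.

```python
import math
from collections import Counter

def add_one_bigram_prob(bigram_counts, unigram_counts, vocab_size, w1, w2):
    """P(w2 | w1) with Add-1 (Laplace) smoothing."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)

def perplexity(test_probs, model_prob):
    """Return 2 ** H, where H is the cross-entropy of the trained model.

    Assumption: test_probs maps (w1, w2) -> a weight from the test model
    (weights sum to 1); model_prob(w1, w2) returns a smoothed probability.
    """
    cross_entropy = -sum(
        p * math.log2(model_prob(w1, w2)) for (w1, w2), p in test_probs.items()
    )
    return 2 ** cross_entropy

# Toy data (hypothetical), standing in for counts taken from the corpus.
tokens = "a b a b a c".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigram_counts)

def model(w1, w2):
    return add_one_bigram_prob(bigram_counts, unigram_counts, vocab_size, w1, w2)

toy_perplexity = perplexity({("a", "b"): 1.0}, model)
```

Because Counter returns 0 for unseen keys, unseen bigrams receive the smoothed probability 1 / (count(w1) + V) without any special casing.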

 

  5. Output your frequency-of-frequency tables for Models 3 & 4 (c* to 3 digits of significance) up to k:

 

Model 3 – Good Turing

 

c                      Nc                   c*

0                      nnnnn               n.nn

1                      nnnnn               n.nn

2                      nnnnn               n.nn

3                      nnnnn               n.nn

4                      nnnnn               n.nn

5                      nnnnn               n.nn
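
A table of this shape could be computed along the following lines. This is a hedged sketch, not necessarily the variant taught in class: it assumes the plain Good-Turing estimate c* = (c + 1) * N(c+1) / N(c) for c = 0..k, and it assumes N(0) is the number of possible bigrams (V squared) that were never observed. The function name and the toy counts are hypothetical. If the class uses a corrected form for the threshold k (for example Katz's), the c* formula would need to change.

```python
from collections import Counter

def good_turing_table(bigram_counts, vocab_size, k=5):
    """Return rows (c, N_c, c*) for c = 0..k using c* = (c + 1) * N_{c+1} / N_c."""
    # N_c: how many distinct bigrams were seen exactly c times.
    freq_of_freq = Counter(bigram_counts.values())
    # Assumption: N_0 counts the possible bigrams that never occurred.
    # This relies on bigram_counts holding only bigrams with count >= 1.
    freq_of_freq[0] = vocab_size ** 2 - len(bigram_counts)
    rows = []
    for c in range(k + 1):
        n_c = freq_of_freq[c]
        n_next = freq_of_freq[c + 1]
        # Leave c* undefined (NaN) when N_c is zero rather than dividing by zero.
        c_star = (c + 1) * n_next / n_c if n_c else float("nan")
        rows.append((c, n_c, c_star))
    return rows

# Toy counts (hypothetical): two bigrams seen once, one seen twice, V = 3.
toy_counts = Counter({("a", "b"): 1, ("b", "a"): 1, ("a", "c"): 2})
toy_rows = good_turing_table(toy_counts, vocab_size=3, k=1)
```

On real data, N(c) can be zero or erratic for some c, so it is worth checking the table by eye before using c* in the perplexity calculation.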

 

 


Notes:

  1. You may do this exercise in Perl, Java, Python, C, C#, or Ruby.
  2. For all tables, use white space (tabs, spaces) as column delimiters.  At least one white space character is required between columns.
  3. Numeric output in tables should be to 3 significant digits (except for the Nc values in the Good Turing tables, which are whole counts).
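
The formatting notes above could be met with a small helper like the one below. This is a sketch only; the names sig3 and format_row are hypothetical. It avoids Python's "g" format because that can switch to scientific notation and drops trailing zeros, whereas the sample tables show values such as n.nn.

```python
import math

def sig3(x):
    """Format x to 3 significant digits without scientific notation."""
    if x == 0:
        return "0.00"
    exponent = math.floor(math.log10(abs(x)))
    rounded = round(x, 2 - exponent)
    # Rounding can push a value up a power of ten (9.999 becomes 10.0),
    # so recompute the exponent before choosing the number of decimals.
    exponent = math.floor(math.log10(abs(rounded)))
    return f"{rounded:.{max(2 - exponent, 0)}f}"

def format_row(c, n_c, c_star):
    """One table row: c and N_c as integers, c* to 3 significant digits, tab-delimited."""
    return f"{c}\t{n_c}\t{sig3(c_star)}"
```

Tabs are used as the column delimiter here; any run of spaces would satisfy the same requirement.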

 

HW#4 due date:  11:45 p.m., Tuesday, November 4th

Submit the following (in one gz, tar, or zip file):

  1. hw4a.sh – A shell script that produces the hw4 results for the first table (write the results to standard output)
  2. hw4b.sh – A shell script that produces the hw4 results for the second and third tables (the Good Turing tables; you may print the tables one after another to standard output, and you may use stored results if you’d like, as long as the files are kept locally)
  3. output4-0.txt – containing the first table (under bullet 4 above)
  4. output4-3.txt and output4-4.txt – containing the Model 3 & 4 tables (c, Nc, and c*, as described under bullet 5)
  5. Your code
  6. comment.txt – Any comments or other notes of relevance