This assignment involves a little bit of coding with xfst. You'll need to turn in some code and results files, which should be submitted via CollectIt.
Using xfst, write a finite-state transducer that can generate and analyze a small set of verbs in all of their inflected forms. This FST will handle two spelling change rules: the rule that deletes a final e before -ing or -ed, and the rule that inserts a k when c appears between a vowel and -ing or -ed. The first rule is provided already. Your job is to write the second.
Note: xfst defines a language for regular expressions which makes it relatively easy to write morphophonological rewrite rules. For this problem, however, you must stay with the basic operators. No credit will be given for answers that use the xfst operator -> or its kin. On the other hand, if you get stuck, you might find it helpful to write the rule in that notation, and then examine the network that xfst produces.
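As a reference for the target behavior only, the two spelling rules described above can be written as plain string operations. This is a minimal Python sketch, not an FST and not an acceptable answer to the problem. The function name `surface`, the vowel set, and the assumption that lexical forms look like `stem+suffix` (as in `picnic+ing`) are illustrative choices, and k.xfst may represent forms differently.

```python
VOWELS = set("aeiou")      # assumption: only these five letters count as vowels
SUFFIXES = ("ing", "ed")   # the two suffixes that trigger the spelling rules


def surface(lexical: str) -> str:
    """Map a lexical form such as 'picnic+ing' to its expected spelling."""
    stem, sep, suffix = lexical.partition("+")
    if not sep:
        # No morpheme boundary: nothing to do.
        return lexical
    if suffix in SUFFIXES:
        if stem.endswith("e"):
            # Rule 1: delete a final e before -ing or -ed.
            stem = stem[:-1]
        elif len(stem) >= 2 and stem[-1] == "c" and stem[-2] in VOWELS:
            # Rule 2: insert k when c sits between a vowel and -ing or -ed.
            stem += "k"
    return stem + suffix


if __name__ == "__main__":
    for form in ("spruce+ed", "picnic+ing", "walk+ed"):
        print(form, "->", surface(form))
```

A helper like this can be used to produce the spellings you expect before comparing them with what `apply down` returns.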
To do this assignment, you'll need the following two files:
Copy them somewhere in your home directory on Patas. If you're using Windows to download them, make sure that it doesn't add any new file extensions.
verb_lexicon is the lexicon of verbs (in citation form) that we'll be working with.
k.xfst is the xfst script that does the work. It is the file you'll need to modify for this part of the assignment.
To start xfst, log onto Patas and type "xfst" (your $path variable should already be set appropriately). You'll get an xfst prompt.
To run the script, enter:
source k.xfst
After you've run the script, there should be an FST on the stack. To apply that FST, try:
apply up spruced
apply down picnic+ing
Observe that it doesn't yet have the right behavior in the second example.
Modify k.xfst until it has the right behavior. The files produced by the script (underlying, onerule, tworules, and threerules) should be helpful in testing it as you go. You can also use apply up and apply down to observe the behavior of the network. Here is a short summary of xfst syntax.
To examine a network, type:
print net
The network defined in k.xfst is too large to be usefully examined like this, but you might try some others:
read regex [a b c];
print net
read regex [a+ b c];
print net
read regex [e %+ -> 0 || _ [e|i] ];
print net
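For readers more used to conventional regular expressions, the three xfst expressions above have rough analogues in Python's `re` module. This is only an analogy under stated assumptions: xfst compiles each expression into a network (the third is a transducer that can be applied in both directions), while the Python versions below just match or rewrite strings. The names `ABC`, `A_PLUS_BC`, and `delete_e_plus` are made up for this sketch.

```python
import re

# [a b c]: the concatenation a, b, c.
ABC = re.compile(r"abc")

# [a+ b c]: one or more a's followed by b and c (unescaped + is Kleene plus).
A_PLUS_BC = re.compile(r"a+bc")


def delete_e_plus(s: str) -> str:
    """Rough analogue of [e %+ -> 0 || _ [e|i]].

    %+ is a literal plus sign in xfst, so the rule deletes the
    sequence "e+" whenever it is immediately followed by e or i.
    """
    return re.sub(r"e\+(?=[ei])", "", s)


if __name__ == "__main__":
    print(bool(ABC.fullmatch("abc")), bool(A_PLUS_BC.fullmatch("aaabc")))
    print(delete_e_plus("spruce+ed"), delete_e_plus("walk+ed"))
```

Note that the rewrite leaves `walk+ed` untouched, boundary symbol included, because its context does not match.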
The SRI Language Modeling Toolkit is a toolkit for creating and using N-gram language models. It is installed on Patas, at /NLP_TOOLS/ml_tools/lm/srilm. In this part of the exercise, you will use it to train a series of language models, and see how well they model various sets of test data.
Copy these files to a directory on Patas.
| File | Size | Description |
| --- | --- | --- |
| holmes.txt | 614,774 words | The complete Sherlock Holmes novels and short stories by A. Conan Doyle, with the exception of the collection of stories His Last Bow (see below) and the collection The Case Book of Sherlock Holmes (which is not yet in the public domain in this country). We will use this corpus to train the language models. |
| hislastbow.txt | 91,144 words | The collection of Sherlock Holmes short stories His Last Bow by A. Conan Doyle. |
| lostworld.txt | 89,600 words | The novel The Lost World by A. Conan Doyle. |
| otherauthors.txt | 52,516 words | Stories by English Authors: London, a collection of short stories written around the same time as the Sherlock Holmes canon and The Lost World. |
We will use two utilities, ngram-count and ngram, both found in /NLP_TOOLS/ml_tools/lm/srilm/srilm-1.5.3/bin/i686-m64/. I suggest setting your PATH variable to include this path, at least for the duration of this assignment, by adding the following to the end of the file .bashrc in your home directory:
PATH=/NLP_TOOLS/ml_tools/lm/srilm/srilm-1.5.3/bin/i686-m64:$PATH
Then open a new shell (or run source ~/.bashrc) and type
echo $PATH
to make sure /NLP_TOOLS/ml_tools/lm/srilm/srilm-1.5.3/bin/i686-m64 appears.
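Another way to confirm that the PATH change took effect is to ask which executables the shell would actually find. The following is a small Python sketch using the standard library's `shutil.which`. The helper name `locate_tools` is hypothetical, and the script assumes it is run from the same shell environment you will use for the assignment.

```python
import shutil


def locate_tools(names=("ngram-count", "ngram")):
    """Return a dict mapping each tool name to its full path, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as not found, recheck the line added to .bashrc.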
You can find basic documentation for ngram and ngram-count here, and more extensive documentation here.
The following command will create a bigram language model called wbbigram.bo, using Witten-Bell discounting, from the text file holmes.txt:
ngram-count -text holmes.txt -order 2 -wbdiscount -lm wbbigram.bo
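To give a feel for what Witten-Bell discounting does, here is a toy bigram estimator in Python. It is a sketch of the simple interpolated textbook form, P(w|h) = (c(h,w) + T(h) P(w)) / (c(h) + T(h)), where T(h) is the number of distinct word types seen after h. It is not a reimplementation of ngram-count, whose defaults (backoff structure, sentence boundary tokens, unknown-word handling) may differ. The function name `train_wb_bigram` and the unsmoothed unigram fallback are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def train_wb_bigram(tokens):
    """Return prob(w, h): an interpolated Witten-Bell bigram estimate (toy version)."""
    unigrams = Counter(tokens)
    total = sum(unigrams.values())
    bigrams = Counter(zip(tokens, tokens[1:]))
    history_count = Counter(tokens[:-1])  # c(h): times h occurs with a successor
    followers = defaultdict(set)          # distinct word types seen after each h
    for h, w in bigrams:
        followers[h].add(w)

    def prob(w, h):
        p_uni = unigrams[w] / total       # unsmoothed unigram estimate (assumption)
        c_h = history_count[h]
        if c_h == 0:
            # Unseen history: fall back entirely to the unigram estimate.
            return p_uni
        t_h = len(followers[h])
        # Mass t_h / (c_h + t_h) is reserved for the lower-order model.
        return (bigrams[(h, w)] + t_h * p_uni) / (c_h + t_h)

    return prob


if __name__ == "__main__":
    toks = "the cat sat on the mat the cat ran".split()
    p = train_wb_bigram(toks)
    print(p("cat", "the"), p("mat", "the"), p("sat", "the"))
```

Histories followed by many different word types reserve more probability mass for unseen continuations, which is the intuition behind the method.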
The following command will evaluate the language model wbbigram.bo against the test file hislastbow.txt:
ngram -lm wbbigram.bo -order 2 -ppl hislastbow.txt
The file NgramQ.txt has an outline of the items to turn in. Modify the file by adding your answers and turn it in via CollectIt.
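When reading the output of ngram -ppl, it may help to recall how perplexity relates to the total log probability. The Python sketch below assumes the convention described in the SRILM documentation: the log probability is base 10, ppl normalizes by the number of words minus OOVs plus one end-of-sentence token per sentence, and ppl1 leaves the end-of-sentence tokens out of the denominator. The function name `perplexities` and the numbers in the usage example are invented for illustration. Check the documentation for the authoritative definitions.

```python
def perplexities(logprob10, words, oovs, sentences):
    """Compute (ppl, ppl1) from a base-10 log probability and token counts.

    ppl  counts end-of-sentence tokens in the denominator;
    ppl1 does not. OOV words are excluded from both.
    """
    scored_words = words - oovs
    ppl = 10 ** (-logprob10 / (scored_words + sentences))
    ppl1 = 10 ** (-logprob10 / scored_words)
    return ppl, ppl1


if __name__ == "__main__":
    # Made-up numbers: one sentence of two words with total log10 probability -6.
    print(perplexities(-6.0, words=2, oovs=0, sentences=1))
```

Lower perplexity means the model assigns higher probability to the test text, so the same formula lets you compare the three test sets on an equal footing.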
Evaluate this language model against the other test sets, lostworld.txt and otherauthors.txt. In your writeup, tell us:
Now build trigram and 4-gram language models from the same training data (still using Witten-Bell discounting). Tell us:
Build more language models using different smoothing methods. In particular, use "Ristad's natural discounting law" (the -ndiscount flag) and Kneser-Ney discounting (the -kndiscount flag). Tell us: