Linguistics 570
Project 1
For this project, you will
build a Markov Tagger for Korean. Your
tagger will be trained on a subset of the data from the Morphologically
Annotated Korean Text, LDC2004T03, and be tested and evaluated against
another subset of the corpus. All
required files can be found in dropbox/08-09/570/project1.
The Data
The Morphologically
Annotated Korean Text is a corpus of Korean texts that have been annotated
using a modified Treebank tagset. The corpus file was split into two files, one
called korean-training.txt (the
first 1000 records) and the other called korean-testing.txt
(the final 574 records). Both are romanized versions
of the original corpus (to make it easier for you to use!). See the attached table for the list of tags used
in the corpus. One of the entries from
this corpus is shown here:
kunun ku/NPN+ngun/PAU
lunoka luno/NPR+ngi/PCA
3 3/NNU
ngwelmalkkaci ngwel/NNX+mal/NNX+kkaci/PAU
nginswuceynguy nginswu/NNC+ceynguy/NNC
sihanngul sihan/NNC+ngul/PCA
kacko kac/VV+ko/ECS
ngisstako ngiss/VX+ta/EFN+ko/PAD
tespwuthngyessta tespwuthngi/VV+ngess/EPF+ta/EFN
. ./SFN
^EOS
The corpus consists of two columns: the first holds the unannotated word and the second holds its annotated form. For purposes of this project, you can ignore the first column. Each line consists of one word, and each sentence ends with ^EOS. Because Korean is typologically distinct from English (Korean is agglutinating and English is quasi-analytic), tagging Korean presents challenges beyond those of tagging English. Note, for example, that the tags are often marked on morphemes within words (although for monomorphemic forms, the tag can apply to the word). Also note that the morphemes are separated by plus signs (“+”).
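As a rough illustration of the format described above, the following Python sketch splits the annotated column into (morpheme, tag) pairs and groups lines into sentences at ^EOS. The function names are hypothetical, and the sketch assumes that no morpheme contains a "+" character; check that assumption against the corpus before relying on it.

```python
def parse_annotated(token):
    """Split an annotated token such as 'ku/NPN+ngun/PAU' into (morpheme, tag) pairs.

    Assumption: morphemes never contain '+'. The tag is taken to be everything
    after the LAST '/', so a morpheme that is itself '/' is still handled.
    """
    pairs = []
    for piece in token.split("+"):
        morph, _, tag = piece.rpartition("/")
        pairs.append((morph, tag))
    return pairs


def read_sentences(lines):
    """Group corpus lines into sentences; each sentence is a list of parsed words."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line == "^EOS":
            sentences.append(current)
            current = []
            continue
        annotated = line.split()[-1]  # ignore the first (unannotated) column
        current.append(parse_annotated(annotated))
    if current:  # keep a trailing sentence that lacks ^EOS
        sentences.append(current)
    return sentences
```

For example, parse_annotated("kac/VV+ko/ECS") yields the pairs ("kac", "VV") and ("ko", "ECS").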
The Project
There are three parts to this
project: (1) analysis and write-up, (2)
generating the transition and emission probabilities central to the tagger
(training), and (3) running the model you have built against a subset of the
data (testing). For (3), you will use
(and possibly modify) a Viterbi decoder supplied to
you.
1. Analysis and Write-up
English-centric taggers rely
heavily on word-to-word transitions for their success. For instance, the transition from DT to JJ or
NN is a major predictor for these tags.
Will a similar approach work with Korean? How might the methodology be adapted? Since Korean tags apply to morphemes, how
might relying on the morphological structure of Korean improve the performance
of a tagger? As always, unknown words
(and morphemes) present challenges to Markov taggers. How can the unknown word/morpheme problem be
addressed for Korean?
Write up your analysis of the
Korean corpus, with an eye on what you intend to do in the development of your
own tagger. Feel free to think outside
the box, but recognize that your tagger must rely on n-gram transition
probabilities (bigram will probably be easiest to implement). Note:
it may prove worthwhile to write tools to do preliminary analyses of the
entire corpus to help guide your write-up.
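One possible preliminary analysis tool is sketched below in Python. It counts how often each tag occurs and how many morphemes each word contains, given a list of annotated tokens from the second column. The function name is hypothetical, and the sketch again assumes that morphemes do not contain "+".

```python
from collections import Counter


def corpus_stats(annotated_tokens):
    """Return (tag frequencies, distribution of morphemes per word).

    annotated_tokens: strings from the second corpus column,
    e.g. 'ku/NPN+ngun/PAU' (the ^EOS lines should be left out).
    """
    tag_counts = Counter()
    morphs_per_word = Counter()
    for token in annotated_tokens:
        pieces = token.split("+")  # assumption: '+' only separates morphemes
        morphs_per_word[len(pieces)] += 1
        for piece in pieces:
            tag_counts[piece.rpartition("/")[2]] += 1
    return tag_counts, morphs_per_word
```

Calling tag_counts.most_common(10) on the result gives a quick view of the most frequent tags.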
2. Training
The training phase of your
tagger will involve generating transition and emission probabilities calculated
over the corpus. Although you will have
access to the entire corpus, training should only be against the training set,
which consists of the first 1000 sentences (there are 1574 sentences total).
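A minimal Python sketch of the counting step follows, assuming a bigram model over morpheme tags and unsmoothed relative-frequency estimates. It treats each sentence as a flat sequence of (morpheme, tag) pairs, which ignores word boundaries; that is one design choice among several, not a requirement of the project. The names START and train_bigram are hypothetical.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start pseudo-tag


def train_bigram(sentences):
    """Estimate bigram transition and emission probabilities.

    sentences: list of sentences, each a flat list of (morpheme, tag) pairs.
    Returns (trans, emit) where trans[prev_tag][tag] and emit[tag][morpheme]
    are relative frequencies (no smoothing).
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sentence in sentences:
        prev = START
        for morph, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][morph] += 1
            prev = tag

    def normalize(table):
        # Divide each count by its row total to get a probability.
        return {
            ctx: {key: n / sum(row.values()) for key, n in row.items()}
            for ctx, row in table.items()
        }

    return normalize(trans_counts), normalize(emit_counts)
```

Because these estimates are unsmoothed, any tag bigram or morpheme not seen in training gets no entry at all, which is where the unknown word/morpheme problem shows up.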
3. Testing
For testing, adapt the Viterbi decoder supplied to you, viterbi.pl. This implementation of Viterbi
runs as a standalone application, accepting as input a tag vocabulary, and transition
and emission probabilities. It then
takes as input on the command line a quoted string that it evaluates, outputting
the best tag sequence for that input.
Try the application with the supplied matrices for English to see how it
performs.
You’ll want to adapt this
decoder for your purposes, mostly to integrate it with your code and also to
figure out how to present the Korean data to it. You may need to change the code to support
log probabilities since underflows may be possible with a larger input
set. Don’t forget that this decoder will
not handle unknown words/morphemes well, an issue you will need to address.
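To illustrate the log-probability idea, here is a small Python sketch of a bigram Viterbi decoder that adds log probabilities instead of multiplying raw ones. It is not the interface of viterbi.pl. The function name, the "<s>" start tag, and the floor value are all assumptions made for this example, and the floor is only a crude stand-in for a real treatment of unseen transitions and unknown morphemes.

```python
import math


def viterbi_log(morphs, tags, trans, emit, start="<s>", floor=1e-6):
    """Return the best tag sequence for morphs under a bigram model.

    trans[prev][tag] and emit[tag][morph] are probabilities; any missing
    entry is replaced by `floor` (an assumed placeholder, not real smoothing).
    """
    if not morphs:
        return []

    def lp(table, ctx, key):
        return math.log(table.get(ctx, {}).get(key, floor))

    # best[i][t] = (log score of best path ending in t at position i, backpointer)
    best = [{t: (lp(trans, start, t) + lp(emit, t, morphs[0]), None) for t in tags}]
    for morph in morphs[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + lp(trans, p, t), p) for p in tags)
            column[t] = (score + lp(emit, t, morph), prev)
        best.append(column)

    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]
```

Summing logs keeps the scores in a safe numeric range even for long sentences, where a product of many small probabilities could underflow to zero.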
Test against the remaining 574 sentences in the corpus, which have been separated into the testing set. Evaluate your tagger by computing accuracy, and give a count of the number of morphemes evaluated. Submit in your project1-output.txt the following (in this order):
a. the tagged output assigned to the final 574 sentences (the test set), formatted like the input corpus, with mismatches marked by asterisks (e.g., **NNX**),
b. the total number of morphemes evaluated, and
c. the accuracy figure.
Include in comments.txt an evaluation of your tagger’s successes and failures.
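The evaluation described above can be sketched in Python as follows. The sketch assumes the tagger is scored on the gold morpheme segmentation, so each word has exactly one predicted tag per gold morpheme, and it wraps a mismatched predicted tag in asterisks. The function names are hypothetical.

```python
def score_token(gold, predicted_tags):
    """Format one word and count its correct tags.

    gold: list of (morpheme, gold_tag) pairs for the word.
    predicted_tags: list of predicted tags, one per gold morpheme.
    Returns (formatted word, number correct, number evaluated).
    """
    pieces, correct = [], 0
    for (morph, gold_tag), predicted in zip(gold, predicted_tags):
        if predicted == gold_tag:
            correct += 1
            pieces.append(f"{morph}/{predicted}")
        else:
            pieces.append(f"{morph}/**{predicted}**")  # mark the mismatch
    return "+".join(pieces), correct, len(gold)


def summary_lines(correct, total):
    """Build the two summary lines that follow the tagged output."""
    return [
        f"Total # of morphemes evaluated: {total}",
        f"Accuracy: {100 * correct / total:.2f} %",
    ]
```

Summing the second and third return values of score_token over every word in the test set gives the counts that summary_lines expects.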
Write-up due date: 11:59 p.m., Monday, October 13th
Submit the following:
Project code and output due date: 11:59 p.m., Wednesday, October 22nd
Submit minimally the following (a single zip file is fine):
Han, Chung-hye and Martha Palmer. 2004. “A Morphological Tagger for Korean: Statistical Tagging Combined with Corpus-Based Morphological Rule Application.” Machine Translation 18(4).
Sarkar, Anoop and Chung-hye Han. 2005. “Statistical Morphological Tagging and Parsing of Korean with an LTAG Grammar.” Unpublished ms.
Category        Tag   Description

Noun            NNC   common noun
                NNU   numeric noun
                NNX   dependent noun
                NPN   pronoun
                NPR   proper noun
                NFW   foreign word

Postposition    PCA   case postposition
                PAD   adverbial
                PAN   adnominal
                PCJ   conjunctive

Predicate       VV    verb
                VJ    adjective
                VX    auxiliary predicate

Verbal ending   EPF   pre-final ending
                EFN   final ending
                ECS   non-final ending
                EAN   adnominal ending
                ENM   nominalization ending

                ADV   adverb
                ADC   conjunctive adverb
                DAN   adnominal modifier
                XSF   suffix
                XPF   prefix
                XSV   verbalization suffix
                XSJ   adjectivization suffix
                IJ    interjection

Symbol          SFN   sentence-final symbols . ? !! ......
                SCM   comma ,
                SLQ   left delimiters: " ' ( < [ {
                SRQ   right delimiters: " ' ) > ] }
                SSY   symbol
Sample output, where ** marks a mismatched tag:
ku/NPN+ngun/PAU
luno/NPR+ngi/PCA
3/NNU
ngwel/NNX+mal/**PAU**+kkaci/PAU
nginswu/NNC+ceynguy/NNC
sihan/NNC+ngul/PCA
kac/VV+ko/ECS
ngiss/VX+ta/**NNX**+ko/PAD
tespwuthngi/VV+ngess/EPF+ta/EFN
./SFN
^EOS
Total # of morphemes evaluated: nn
Accuracy: pp.pp %