Linguistics 570

Project 1

 

For this project, you will build a Markov Tagger for Korean.  Your tagger will be trained on a subset of the data from the Morphologically Annotated Korean Text, LDC2004T03, and be tested and evaluated against another subset of the corpus.   All required files can be found in dropbox/08-09/570/project1.

 

The Data

 

The Morphologically Annotated Korean Text is a corpus of Korean texts that have been annotated using a modified Treebank tagset.  The corpus file was split into two files, one called korean-training.txt (the first 1000 records) and the other called korean-testing.txt (the final 574 records). Both are romanized versions of the original corpus (to make it easier for you to use!).   See the attached table (below) for the list of tags used in the corpus.  One of the entries from this corpus is shown here:

 

kunun               ku/NPN+ngun/PAU

lunoka              luno/NPR+ngi/PCA

3                  3/NNU

ngwelmalkkaci      ngwel/NNX+mal/NNX+kkaci/PAU

nginswuceynguy      nginswu/NNC+ceynguy/NNC

sihanngul           sihan/NNC+ngul/PCA

kacko               kac/VV+ko/ECS

ngisstako           ngiss/VX+ta/EFN+ko/PAD

tespwuthngyessta   tespwuthngi/VV+ngess/EPF+ta/EFN

.                  ./SFN

^EOS

 

The corpus consists of two columns, the second annotated and the first not.  For purposes of this project, you can ignore the first column.  Each line consists of one word, with each sentence ending in ^EOS.  Because Korean is typologically distinct from English (Korean is agglutinating and English is quasi-analytic), tagging Korean presents challenges that tagging English does not.  Note, for example, that the tags are often marked on morphemes within words (although for monomorphemic forms, the tag can apply to the whole word).  Also note that the morphemes are separated by plus signs (“+”).
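As a starting point, the format described above can be read with a few lines of Python.  This is only a sketch: the function names parse_annotation and read_sentences are made up for illustration, and it assumes that no morpheme is itself a literal plus sign (check the corpus before relying on that).

```python
def parse_annotation(annotation):
    """Split 'ku/NPN+ngun/PAU' into [('ku', 'NPN'), ('ngun', 'PAU')]."""
    pairs = []
    for chunk in annotation.split("+"):
        # Split on the LAST slash so a morpheme containing "/" keeps its form.
        morpheme, _, tag = chunk.rpartition("/")
        pairs.append((morpheme, tag))
    return pairs


def read_sentences(lines):
    """Yield one sentence at a time as a list of words.

    Each word is a list of (morpheme, tag) pairs taken from the second
    (annotated) column; the first column is ignored.
    """
    sentence = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "^EOS":
            if sentence:
                yield sentence
            sentence = []
            continue
        sentence.append(parse_annotation(fields[-1]))
    if sentence:  # file did not end with ^EOS
        yield sentence
```

Passing an open file (or sys.stdin) as lines works, since both iterate line by line.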

 

The Project

 

There are three parts to this project:  (1) analysis and write-up, (2) generating the transition and emission probabilities central to the tagger (training), and (3) running the model you have built against a subset of the data (testing).  For (3), you will use (and possibly modify) a Viterbi decoder supplied to you.

 

1.      Analysis and Write-up

 

English-centric taggers rely heavily on word-to-word transitions for their success.  For instance, the transition from DT to JJ or NN is a major predictor for these tags.  Will a similar approach work with Korean?  How might the methodology be adapted?  Since Korean tags apply to morphemes, how might relying on the morphological structure of Korean improve the performance of a tagger?  As always, unknown words (and morphemes) present challenges to Markov taggers.  How can the unknown word/morpheme problem be addressed for Korean?

 

Write up your analysis of the Korean corpus, with an eye on what you intend to do in the development of your own tagger.  Feel free to think outside the box, but recognize that your tagger must rely on n-gram transition probabilities (bigram will probably be easiest to implement).  Note:  it may prove worthwhile to write tools to do preliminary analyses of the entire corpus to help guide your write-up.
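One example of such a preliminary tool, offered as a rough sketch only: the hypothetical function corpus_stats below counts how often each tag occurs and how many morphemes each word contains.  It assumes the two-column format shown earlier and that no morpheme is a literal plus sign.

```python
from collections import Counter


def corpus_stats(lines):
    """Return (tag_counts, morphemes_per_word) for corpus lines."""
    tag_counts = Counter()
    morphemes_per_word = Counter()
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "^EOS":
            continue  # skip blank lines and sentence boundaries
        chunks = fields[-1].split("+")  # annotated column, one chunk per morpheme
        morphemes_per_word[len(chunks)] += 1
        for chunk in chunks:
            tag = chunk.rpartition("/")[2]  # text after the last slash
            tag_counts[tag] += 1
    return tag_counts, morphemes_per_word
```

Calling tag_counts.most_common(10) on the result gives a quick picture of which tags dominate, which may help decide where unknown morphemes are most likely to fall.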

 

2.      Training

 

The training phase of your tagger will involve generating transition and emission probabilities calculated over the corpus.  Although you will have access to the entire corpus, training should only be against the training set, which consists of the first 1000 sentences (there are 1574 sentences total).
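For a bigram model, one common way to estimate these probabilities is by relative frequency: P(tag | previous tag) = count(previous tag, tag) / count(previous tag), and P(morpheme | tag) = count(tag, morpheme) / count(tag).  The Python sketch below illustrates that calculation and nothing more.  The names train and normalize and the boundary markers "<s>" and "</s>" are invented for the example, and it treats each sentence as one flat sequence of morphemes, ignoring word boundaries, which is only one of several designs you might choose.  It applies no smoothing.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def normalize(counts):
    """Turn {context: Counter(event -> n)} into {context: {event: probability}}."""
    probs = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        probs[context] = {event: n / total for event, n in counter.items()}
    return probs


def train(sentences):
    """Estimate bigram transition and emission probabilities.

    sentences: iterable of sentences, each a flat list of (morpheme, tag) pairs.
    Returns (transition, emission) as nested dicts of probabilities.
    """
    transition_counts = defaultdict(Counter)  # previous tag -> tag -> count
    emission_counts = defaultdict(Counter)    # tag -> morpheme -> count
    for sentence in sentences:
        previous = START
        for morpheme, tag in sentence:
            transition_counts[previous][tag] += 1
            emission_counts[tag][morpheme] += 1
            previous = tag
        transition_counts[previous][END] += 1
    return normalize(transition_counts), normalize(emission_counts)
```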

 

3.      Testing

 

For testing, adapt the Viterbi decoder supplied to you, viterbi.pl.  This implementation of Viterbi runs as a standalone application, accepting as input a tag vocabulary, and transition and emission probabilities.  It then takes as input on the command line a quoted string that it evaluates, outputting the best tag sequence for that input.  Try the application with the supplied matrices for English to see how it performs.

 

You’ll want to adapt this decoder for your purposes, mostly to integrate it with your code and also to figure out how to present the Korean data to it.  You may need to change the code to support log probabilities since underflows may be possible with a larger input set.  Don’t forget that this decoder will not handle unknown words/morphemes well, an issue you will need to address.
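To make the log-probability point concrete, here is a small Python sketch of bigram Viterbi decoding that adds log probabilities instead of multiplying raw ones.  It is not a port of viterbi.pl, whose internals are not described here; the function name viterbi_log, the "<s>" start marker, and the nested-dict tables are all assumptions made for the example.  The unknown_prob floor is a deliberately crude stand-in for a real unknown-morpheme strategy: it gives every unseen transition or emission the same small probability, so for an unseen morpheme the transitions alone decide the tag.

```python
import math


def viterbi_log(observations, tags, transition, emission,
                start="<s>", unknown_prob=1e-6):
    """Return the most probable tag sequence for observations (log space).

    transition: {previous_tag: {tag: probability}}
    emission:   {tag: {morpheme: probability}}
    unknown_prob: hypothetical floor used for anything unseen in training.
    """
    if not observations:
        return []

    def logp(table, context, event):
        p = table.get(context, {}).get(event, 0.0)
        return math.log(p) if p > 0 else math.log(unknown_prob)

    # best[i][tag] = (log score of the best path ending in tag at i, previous tag)
    best = [{
        tag: (logp(transition, start, tag) + logp(emission, tag, observations[0]), None)
        for tag in tags
    }]
    for obs in observations[1:]:
        column = {}
        for tag in tags:
            score, prev = max(
                (best[-1][p][0] + logp(transition, p, tag), p) for p in tags
            )
            column[tag] = (score + logp(emission, tag, obs), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag to the first position.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    path.reverse()
    return path
```

Because scores are sums of logs, a long sentence produces a large negative number rather than underflowing to zero, which is the failure mode the raw-probability version risks.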

 

Test against the remaining 574 sentences in the corpus, which have been separated into the testing set.  Evaluate your tagger by computing accuracy, and give a count of the number of morphemes evaluated.  Submit in your project1-output.txt the following (in this order):  a. the tagged output assigned to the final 574 sentences (the test set), which should be formatted similarly to the input corpus, noting mismatches with asterisks (e.g., **NNX**), b. the total number of morphemes evaluated, and c. the accuracy figure.  Include in comments.txt an evaluation of your tagger’s successes and failures.
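The scoring and output formatting could be sketched as follows in Python.  The helper names score_word and accuracy_percent are hypothetical, and the sketch assumes the predicted tags use the same morpheme segmentation as the gold annotation, so the two lists line up one to one.  Following the sample output, the asterisks wrap the tag your tagger assigned when it differs from the gold tag.

```python
def score_word(gold, predicted):
    """Compare one word's gold and predicted (morpheme, tag) lists.

    Returns (rendered_line, number_correct, number_evaluated).
    Mistagged morphemes are rendered as morpheme/**TAG**.
    """
    rendered = []
    correct = 0
    for (morpheme, gold_tag), (_, predicted_tag) in zip(gold, predicted):
        if gold_tag == predicted_tag:
            correct += 1
            rendered.append(f"{morpheme}/{predicted_tag}")
        else:
            rendered.append(f"{morpheme}/**{predicted_tag}**")
    return "+".join(rendered), correct, len(gold)


def accuracy_percent(correct, total):
    """Accuracy as a percentage of morphemes tagged correctly."""
    return 100.0 * correct / total if total else 0.0
```

Summing number_correct and number_evaluated over every word in the test set gives the two figures needed for the summary lines at the end of the output file.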

 

Write-up due date:  11:59 p.m., Monday, October 13th

Submit the following:

  1. Your analysis and write-up in .txt, .pdf, or .rtf format.  Indicate group members.

 

Project code and output due date:  11:59 p.m.,  Wednesday, October 22nd

Submit at minimum the following (a single zip file is fine):

  1. korean-tagger (.pl, .py, or .java and .class, etc.) and any additional code
  2. korean-tagger.sh (the shell script to run the tagger): read only from STDIN (thus, we should be able to run your tagger against any input via redirect or pipe)
  3. project1-output.txt (containing the output as described)
  4. comments.txt – Evaluation (as described), and other comments or explanations.

 

Readings:  There are two readings that may be of some relevance, the first more than the second (the second may be of interest to those in 571, since it discusses parsing more than tagging).  The first may only be accessible from on campus or through the library’s off-site proxy:

 

Han, Chung-Hye and Martha Palmer.  2004.  A Morphological Tagger for Korean: Statistical Tagging Combined with Corpus-Based Morphological Rule Application.  Machine Translation 18(4).

 

Sarkar, Anoop and Chung-hye Han.  2005.  Statistical Morphological Tagging and Parsing of Korean with an LTAG Grammar.  Unpublished ms.


 

 

  Noun      NNC      common noun   

            NNU      numeric noun  

            NNX      dependent noun

            NPN      pronoun       

            NPR      proper noun   

            NFW      foreign word      

  Post-     PCA      case postposition 

 position   PAD      adverbial           

            PAN      adnominal           

            PAU      auxiliary           

            PCJ      conjunctive   

 Predicate  VV       verb          

            VJ       adjective           

            VX       auxiliary predicate

  Verbal    EPF      pre-final ending    

  ending    EFN      final ending  

            ECS      non-final ending  

            EAN      adnominal ending  

            ENM      nominalization ending

  Etc       CO       copula        

            ADV      adverb        

            ADC      conjunctive adverb  

            DAN      adnominal modifier  

            XSF      suffix        

            XPF      prefix        

            XSV      verbalization suffix

            XSJ      adjectivization suffix

            IJ       interjection  

  Symbol    SFN      sentence-final symbols . ? !! ......

            SCM      comma ,

            SLQ      left delimiters: " ' ( < [ {

            SRQ      right delimiters: " ' ) > ] }

            SSY      symbol


Sample output, where **PAU** and **NNX** are mistags:

 

ku/NPN+ngun/PAU

luno/NPR+ngi/PCA

3/NNU

ngwel/NNX+mal/**PAU**+kkaci/PAU

nginswu/NNC+ceynguy/NNC

sihan/NNC+ngul/PCA

kac/VV+ko/ECS

ngiss/VX+ta/**NNX**+ko/PAD

tespwuthngi/VV+ngess/EPF+ta/EFN

./SFN

^EOS

 

Total # of morphemes evaluated:     nn

Accuracy:                           pp.pp %