Linguistics 570
HW #1a
Following is a set of Turkish words and their English glosses:
el '(the) hand'
eller 'hands'
elim 'my hand'
eve 'to (the) house'
ellerimiz 'our hands'
evlerde 'in (the) houses'
evden 'from (the) house'
ellerim 'my hands'
ellerinize 'to your(pl.) hands'
evlerim 'my houses'
elin 'your(sing.) hand'
evimiz 'our house'
evde 'in (the) house'
evimde 'in my house'
evlerimiz 'our houses'
evlerimizden 'from our houses'
evleriniz 'your(pl.) houses'
evim 'my house'
ellerimden 'from my hands'
evler 'houses'
eline 'to your(sing.) hand'
ellerin 'your(sing.) hands'
elimden 'from my hand'
evine 'to your(sing.) house'
Design an FST that will minimally accept the Turkish words shown above and that will output the English glosses for each of the morphemes (the glosses do not need to be output in “English” order). Be sure to design the transducer to be as efficient as possible (i.e., avoid redundant and empty arcs). Test your FST against the following three strings and show the English output for each of them: ev, evlerimde, elinize
(Please note: the level of granularity for the FST can be at the level of the morpheme. In other words, you do not have to have arcs for each of the sounds that compose a morpheme.)
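To make the morpheme-level granularity concrete, here is a minimal sketch of how such a transducer could be represented in Python. It is only an illustration under stated assumptions: it covers a toy subset of the data (two roots, the plural, and one possessive), the state numbers and the gloss label "PL" are invented for the example, and it is not the intended solution to the assignment.

```python
# Toy morpheme-level transducer (illustrative subset only).
# State numbers and the "PL" gloss label are assumptions made for this sketch.
# Each state maps to a list of arcs: (input morpheme, output gloss, next state).
ARCS = {
    0: [("el", "hand", 1), ("ev", "house", 1)],
    1: [("ler", "PL", 2), ("im", "my", 3)],
    2: [("im", "my", 3)],
}
FINAL_STATES = {1, 2, 3}


def transduce(word, state=0):
    """Return the list of glosses if the word is accepted, otherwise None."""
    if not word:
        # All input consumed: accept only if we stopped in a final state.
        return [] if state in FINAL_STATES else None
    for morpheme, gloss, next_state in ARCS.get(state, []):
        if word.startswith(morpheme):
            rest = transduce(word[len(morpheme):], next_state)
            if rest is not None:
                return [gloss] + rest
    # No arc from this state matches the remaining input.
    return None


print(transduce("evlerim"))  # glosses for ev + ler + im
print(transduce("evde"))     # not covered by this toy subset
```

Each arc consumes a whole morpheme rather than a single sound, which is the level of detail the note above permits. Words outside the toy subset are rejected (the function returns None) instead of receiving a partial gloss.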
due date: 5 p.m., Friday, September 26th
Submit a hard copy to Professor Lewis’s box in the Linguistics Office (Padelford A210), including the output from your test. If you are an online student or otherwise unable to come to campus, please scan your answer
and submit via
Linguistics 570
HW #1b
For this homework assignment, you will find a website, tokenize the “words” contained on that website, and output a sorted list of the 30 most frequent words, ordered by frequency.
Here’s what you need to do for the assignment:
1. Go to Literature.org and choose a chapter from one of the books listed there. Whatever chapter you choose, be sure that it consists mostly of English text and has at least 3,000 words.
2. Save the text for the page to a file.
3. Write a program in Perl, Python or Java that reads the file and generates output that contains a list of all the word types on the page with their token counts. The output should be tab-delimited and consist of a separate word type and count on each line, e.g.,
able 5
the 325
to 250
look 10
4. Take the output generated, sort it, and output only the first 30 most frequent words. The sort and truncation functions should be done in a shell script external to your application.
5. Submit a copy of your program using CollectIt.
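The counting step (step 3) might be sketched in Python as follows. This is one possible approach, not a reference solution: the tokenization rule (lowercased runs of letters, allowing internal apostrophes) and the function names `count_types` and `format_counts` are assumptions made for the example, and a different definition of "word" would give different counts.

```python
import re
import sys
from collections import Counter


def count_types(text):
    """Count tokens per word type, using a simple assumed tokenization rule."""
    # Lowercase the text, then treat runs of letters (optionally joined by
    # apostrophes, as in "don't") as tokens. This rule is an assumption.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


def format_counts(counts):
    """Render one 'type<TAB>count' pair per line, unsorted."""
    return "\n".join(f"{word}\t{n}" for word, n in counts.items())


if __name__ == "__main__":
    # Usage (hypothetical file name): python count_words.py chapter.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as handle:
            print(format_counts(count_types(handle.read())))
```

The output is left unsorted on purpose, since step 4 asks for sorting and truncation to happen in a shell script outside the program, for instance with something along the lines of `sort -k2,2nr` piped into `head -n 30`.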
Although you are encouraged to work with other students on this assignment, and are welcome to ask for help and advice if anything is unclear, the code you turn in must be your own.
due date: 11:59 p.m., Sunday, September 28th
Submit the following
via CollectIt (in one tar, gz, or zip file):