Lab 5 (due 2/7 11:59 pm)

Overview

This is our final lab with the customization system. It is also our first foray into MT. The focus will be on finishing up the choices files (though it's not expected that you will have used every part of the customization system) and on getting one sentence translating from English to your language. You will also work on collecting the MMT sentences for your language and use [incr tsdb()] to compare the initial and final state of your grammar for the week over the testsuites.

This lab entails the following general steps, which are not (fully) ordered with respect to each other.

Begin to collect the translations of the MMT sentences (and add the ones you are able to collect to your testsuite).
Choose three phenomena from labs 2-4 to refine further in your choices file (these can be ones you already worked on but that need more work!). Alternatively, you can add wh questions as one of the three phenomena.
Only when you have finished working with the customization system, if you have simple tdl fixes I have suggested, add these in.
Process your testsuite using [incr tsdb()], the LKB, and the grammar resulting from your updated choices file.
Examine the results of the second test run for coverage, accuracy and ambiguity, including as a diff to the final lab 4 test run
Try a first translation
Collect a mini test corpus
Write it all up :)

Begin collecting the MMT sentences for your language

We will be working with the sentences in eng.txt, but it is not expected that every grammar will cover every sentence. For this week, I ask you to:

1. Find translations (or approximations) for all of the words in the small vocabulary of those sentences.
   - For example, if you can't find park, you might replace it with field or beach.
2. Translate the first three sentences, and include them as items in your testsuite.txt file.
3. Update your testsuite skeleton to reflect the new testsuite.txt.
4. Create an iso.txt file for your language (with iso changed to your language code) with just the line you expect your grammar to parse for each sentence, one per line, and in the same order as eng.txt. For any you haven't found a translation for yet, just write SKIPPED.
5. Determine (and document) whether any of the other sentences will be impossible to translate given your resources and/or impossible to model (involving phenomena you don't expect to get to).

For the write up for this portion, I expect you to tell me about the process you went through and report on item 5 above.
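
Since iso.txt has to stay in the same order as eng.txt, with one line per sentence, a quick line-count check can catch alignment slips early. The following is a minimal sketch, assuming both files sit in the current directory; `check_mmt_alignment` is a hypothetical helper name, not part of the course tools.

```shell
#!/bin/sh
# Sketch: check that iso.txt lines up with eng.txt and report SKIPPED items.
# check_mmt_alignment is a hypothetical helper, not part of the course tools.
check_mmt_alignment() {
    eng_file=$1
    iso_file=$2
    # grep -c '' counts lines even when the last one lacks a trailing newline.
    eng_count=$(grep -c '' "$eng_file")
    iso_count=$(grep -c '' "$iso_file")
    if [ "$eng_count" -ne "$iso_count" ]; then
        echo "MISMATCH: $eng_file has $eng_count lines, $iso_file has $iso_count"
        return 1
    fi
    # Count lines that consist only of the SKIPPED placeholder.
    skipped=$(grep -c -x 'SKIPPED' "$iso_file")
    echo "OK: $iso_count lines, $skipped SKIPPED"
}
```

After defining the function, `check_mmt_alignment eng.txt iso.txt` (with iso replaced by your language code) prints either an OK summary or a MISMATCH line.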

Improve the choices file for three phenomena

For the three phenomena you chose, refine the choices file by hand (through the questionnaire or via direct editing or some combination). Please be sure to post lots of questions on Canvas as you work on this! I expect the write up of this portion to include a copy-paste of the specific choices values you changed as well as relevant IGT that I can use to test the effects.

Tdl edits

By now, you may have collected some suggested tdl edits (from labs 2-4 feedback or in class). Once you are all done refining things via the customization system, patch these into your grammar.

The only tdl edits this week should be things that I have suggested as bug fixes or workarounds. You are not expected to come up with tdl edits on your own. If I haven't suggested any to you, this section is a freebie: nothing to do here!

For the write up, please include the actual tdl changes and an explanation of their purpose.

Try a first translation

Preliminaries

In later labs, we will refine the variable property mapping and create small transfer grammars for each language by using it as the target language in two translation pairs, with English and another language (probably Pite Saami) as the inputs. For now, we'll be attempting to get just one sentence through. This will be one that doesn't actually require any transfer rules.

In all of the instructions below, replace "iso" with the ISO 639-3 code for your language.

Download and unpack mmt.tgz.
- Note: when I tested this on Jan 31, 2025, I couldn't unpack it on the UTM virtual machine. If this happens for you, try unpacking in your host OS and then moving the folder over.
Test eng2sje and sje2eng translation:
```
 cd mmt/
 ./translate-line.sh eng sje 1
```
- Note 1: If you aren't working on the VM, you'll need to fix the path to ace in the file translate-line.sh (and possibly install ace).
- Note 2: If you get "permission denied", it probably means that translate-line.sh isn't executable. This will fix that problem:
```
 chmod u+x translate-line.sh 
```
Look inside translate-line.sh; try changing which line is not commented out and see what different behaviors you get.

Make a symlink to your grammar in mmt/grammars/iso

    ln -s /path/to/your/grammar mmt/grammars/iso

and compile it afresh with ace:

  cd mmt/grammars/iso
  ace -G iso.dat -g ace/config.tdl
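
If the compile step fails right away, one thing worth ruling out is a dangling symlink. The following is a minimal sketch, assuming only what the steps above state (the link lives at mmt/grammars/iso and the grammar contains ace/config.tdl); `check_grammar_link` is a hypothetical helper name, not part of the MMT package.

```shell
#!/bin/sh
# Sketch: sanity-check the grammar symlink before compiling with ace.
# check_grammar_link is a hypothetical helper, not part of the MMT package.
check_grammar_link() {
    link=$1
    if [ ! -L "$link" ]; then
        echo "NOT A SYMLINK: $link"
        return 1
    fi
    # -d follows the link, so this fails for a dangling symlink.
    if [ ! -d "$link" ]; then
        echo "BROKEN LINK: $link does not point at a directory"
        return 1
    fi
    # The compile command reads ace/config.tdl inside the grammar.
    if [ ! -f "$link/ace/config.tdl" ]; then
        echo "MISSING: $link/ace/config.tdl"
        return 1
    fi
    echo "OK: $link"
}
```

For example, `check_grammar_link mmt/grammars/iso` can be run from the directory that contains mmt/.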

Move the generic transfer grammar mmt/tm/gen to mmt/tm/iso:
```
  cd mmt/tm
  mv gen iso
```
Compile that generic transfer grammar:
```
  ace -G iso.dat -g ace/config.tdl
```
Copy your MMT sentences to test_sentences/iso.txt.
Try translating the first sentence from eng to your language:
```
 ./translate-line.sh eng iso 1
```
This one should not require any transfer rules. If it doesn't work, there are several possible causes:
- A bug in your MT set up. If you are seeing errors that suggest this might be the problem, post to Canvas.
- Your grammar isn't generating. Confirm that this is the problem by trying monolingual generation (or iso2iso translation). Post to Canvas for help debugging.
- The MRSs don't match. Compare the eng (or sje) MRSs to yours. Can you spot any differences? If you find any, modify your grammar until the MRSs match. Post to Canvas for help. A subcase here is that the PREDs and their arguments match, but the variable properties don't. In that case, we might be debugging semi.vpm files. Either way, all changes should be in your grammar, and not in eng or sje.
- If you aren't working in the provided VM, your ace version may differ from that used to compile the eng and sje grammars (and also the transfer grammars). In this case, recompiling all of those may help.
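
When checking whether your grammar is generating, it can help to pull a single sentence out of test_sentences/iso.txt and feed it to ace by hand. The following is a minimal sketch; `nth_sentence` is a hypothetical helper name, and translate-line.sh may select its input line differently.

```shell
#!/bin/sh
# Sketch: print item N (1-based) from a one-sentence-per-line file.
# nth_sentence is a hypothetical helper; translate-line.sh may select lines differently.
nth_sentence() {
    file=$1
    n=$2
    # sed -n suppresses default output; "Np" prints only line N.
    sed -n "${n}p" "$file"
}
```

With ace on your PATH, monolingual generation for the first sentence might then look like `nth_sentence test_sentences/iso.txt 1 | ace -g grammars/iso/iso.dat -1T | ace -g grammars/iso/iso.dat -e`, run from inside mmt/. Here `-1T` (first reading only, MRS without trees) and `-e` (generate) are ACE options as I understand them, so check the ACE documentation if your version behaves differently.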

For your write up for this part, please describe what happened when you tried the steps above. What difficulties did you encounter and how did you resolve them? What output did you get?

Run the testsuite

Following the same procedure as usual, do a test run over your testsuite.

Collect the following information to provide in your write up:

How many items parsed?
What is the average number of parses per parsed item?
How many parses did the most ambiguous item receive?
What sources of ambiguity can you identify?
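
[incr tsdb()] reports these figures itself, but if you have the per-item parse counts in a plain text file, the first three numbers can be cross-checked on the command line. The following is a minimal sketch, assuming a hypothetical two-column input with one "item-id readings" pair per line; that layout is an assumption for illustration, not [incr tsdb()]'s native profile format, and `summarize_readings` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: summarize parse counts from lines of the form "<item-id> <readings>".
# The two-column input is an assumed export format, not [incr tsdb()]'s native one.
summarize_readings() {
    awk '
        { total++ }
        # An item counts as parsed when it has at least one reading.
        $2 > 0 { parsed++; sum += $2; if ($2 > max) { max = $2; worst = $1 } }
        END {
            avg = (parsed > 0) ? sum / parsed : 0
            printf "items=%d parsed=%d avg_readings=%.2f max_readings=%d (item %s)\n",
                   total, parsed, avg, max, (worst == "" ? "-" : worst)
        }
    ' "$1"
}
```

The average is taken over parsed items only, matching the question above about parses per parsed item.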

Test corpus

In order to get a sense of the coverage of our grammars over naturally occurring text, we are going to collect small test corpora. Minimally, these should consist of 5-10 sentences of naturally occurring text. Perhaps your grammar resource has a collection of stories, in which case you can take 5-10 consecutive sentences from one of them. Alternatively, you might locate 5-10 interesting example sentences in your resource that appear to be collected from naturally occurring discourse (rather than looking like simple constructed sentences). As a last resort, you might look for other resources for your language online.

Creating large test corpora is discouraged, unless:

Your language has a simple enough morphophonology that your grammar is directly targeting surface forms.
You have easy access to large digitized texts (i.e., you don't have to type something in by hand).
Someone has already provided the glossing (IGT) for those large digitized texts. You don't necessarily need IGT, but it is much harder to work with unglossed text.

Note: 1,000 sentences is the maximum practical size for any single [incr tsdb()] skeleton. You could of course split your test corpus over multiple skeletons, but I'd be surprised if anyone got close to 1,000 sentences!

Note also that our grammars won't cover anything without lexicon. If you have access to a digitized lexical resource that you can import lexical items from, you can address this to a certain extent. Otherwise, you'll want to limit your test corpus to a size that you are willing to hand-enter vocabulary for. (If you have access to a Toolbox lexicon for your language, contact me about importing via the customization system.)

For Lab 5, your task is to locate your test corpus (5-10 sentences is what is expected, more only if you want and you have access to the resources described above) and format it for [incr tsdb()]. If you have IGT to work with in the first place, it may be convenient to use the make_item script to create the test corpus skeleton. (Note that you want this to be separate from your regular test suite skeleton.) Otherwise, you can use [incr tsdb()]'s own import tool (File | Import | Test items), which expects a plain text file with one item per line. The result of that command is a testsuite profile from which you'll need to copy the item (and relations) files to create a testsuite skeleton.

Check list:

tsdb/skeletons directory should include two subdirectories: one for the test corpus, and one for the test suite.
tsdb/skeletons/Index.lisp should include two items in the list of directories: one for your test corpus and one for your test suite.
When the Skeletons Root is pointed at your tsdb/skeletons directory, File | Create should show two possibilities (test suite and test corpus).
The items in your test corpus should be in the format (standard orthography or transliteration, morpheme segmented or not) that your grammar expects.
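
The first two points in this list can be checked mechanically before opening [incr tsdb()]. The following is a minimal sketch, assuming each skeleton directory holds an item file and a relations file as described above; `check_skeletons` is a hypothetical helper name, and the Index.lisp test only looks for each directory name as a plain string.

```shell
#!/bin/sh
# Sketch: check a tsdb/skeletons directory against the list above.
# check_skeletons is a hypothetical helper, not part of [incr tsdb()].
check_skeletons() {
    root=$1
    status=0
    count=0
    if [ ! -f "$root/Index.lisp" ]; then
        echo "MISSING: $root/Index.lisp"
        status=1
    fi
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue          # the glob matched nothing
        name=$(basename "$dir")
        count=$((count + 1))
        for f in item relations; do
            if [ ! -f "$dir$f" ]; then
                echo "MISSING: $name/$f"
                status=1
            fi
        done
        # Loose check: the directory name should be mentioned in Index.lisp.
        if [ -f "$root/Index.lisp" ] && ! grep -qF "$name" "$root/Index.lisp"; then
            echo "NOT LISTED in Index.lisp: $name"
            status=1
        fi
    done
    if [ "$count" -lt 2 ]; then
        echo "EXPECTED 2 skeletons, found $count"
        status=1
    fi
    [ "$status" -eq 0 ] && echo "OK: $count skeletons"
    return "$status"
}
```

Running `check_skeletons tsdb/skeletons` from your grammar directory prints one line per problem found, or a single OK line.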

Write up

Your write up should be a plain text file (not .doc, .rtf or .pdf) which includes the following:

A description of the phenomena you improved in the choices file, including:
- Prose description of the phenomenon
- Prose description of your analysis
- The specific changes you made to choices (paste in the actual choices)
- Specific IGT I can use to test the analysis / investigate if something isn't working and you need help.
A description of any tdl edits you made and what they are for.
A description of your process for translating the MMT sentences and your documentation about which sentences may be impossible.
A description of what happened when you tried the MT set up. What difficulties did you encounter and how did you resolve them? What output did you get?
A description of what you collected for your test corpus and how you collected it.
A description of the performance of your final grammar for this week on the test suite, as compared to your starting grammar (see details above).

Submit your assignment

Be sure your write up and the text-file version of your test suite are included in your grammar directory.
Please do not include the mmt directory this week. I only need your grammar (and definitely don't need a second copy of it inside mmt).
Likewise, make sure that tsdb/home includes two profiles:
1. Final testsuite with initial grammar for the week
2. Final testsuite with final grammar for the week

Create a tarball:

      tar czf iso-lab5.tgz iso-lab5
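
Before uploading, it may be worth listing the tarball to confirm that the mmt directory stayed out and that both profiles made it in. The following is a minimal sketch, assuming the profiles are subdirectories of tsdb/home inside the archive; `check_tarball` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: inspect a submission tarball before uploading it.
# check_tarball is a hypothetical helper; adjust the patterns to your own layout.
check_tarball() {
    tarball=$1
    listing=$(tar tzf "$tarball") || return 1
    status=0
    # The mmt directory should not be part of the submission.
    if printf '%s\n' "$listing" | grep -Eq '(^|/)mmt(/|$)'; then
        echo "FOUND mmt directory in $tarball"
        status=1
    fi
    # Count the distinct profile directories under tsdb/home.
    profiles=$(printf '%s\n' "$listing" \
        | sed -n 's|.*tsdb/home/\([^/][^/]*\)/.*|\1|p' \
        | sort -u | grep -c '')
    if [ "$profiles" -lt 2 ]; then
        echo "EXPECTED 2 profiles under tsdb/home, found $profiles"
        status=1
    fi
    [ "$status" -eq 0 ] && echo "OK: $profiles profiles, no mmt directory"
    return "$status"
}
```

For example, `check_tarball iso-lab5.tgz` reports any problems it finds and returns a non-zero status if there are any.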

Upload the tarball to Canvas.
