Lab 3 (due 1/20 11:59 pm)
The primary goal of this lab is to take the grammar from last week and improve it, mainly by improving the morphological component (reducing ambiguity, streamlining the position classes, and possibly improving coverage). We will also extend testsuites to illustrate a few more phenomena. You'll be using [incr tsdb()] to test the resulting grammar and compare it to the grammar you built by the end of Lab 2.
This lab entails the following general steps:
- Testsuite extension: Choose 3 additional phenomena to work on (we'll try to come up with ones that make sense across many of the languages in class).
- Document those phenomena in a testsuite
- This is forward-looking work; we aren't trying to extend the grammar for these phenomena this week
- Process the extended testsuite with your lab 2 grammar.
- Morphological clean-up
- Process the test corpus and your testsuite using [incr tsdb()], the LKB, and the grammar resulting from your updated choices file.
- Examine the results of the second test run for coverage, accuracy and ambiguity, including as a diff to the final test run from last week.
- Write it all up :)
Create a small testsuite for three phenomena
Choose 3 phenomena to work on, from the following list:
Create a small testsuite of examples, according to the general instructions for testsuites and the formatting instructions, illustrating the phenomena you chose above. The testsuite doesn't need to be exhaustive (since we're working with test corpora this year), but it should include both positive and negative examples for each of the phenomena you work on in this section. I expect these testsuites to have about 20-30 examples total by the end of this week, though you can do more if you find that useful. All examples should be simple enough that your grammar either parses them or fails to parse them only because of the one thing that's wrong with them.
Create a test suite skeleton
- Make a subdirectory called lab3 inside
tsdb/skeletons for your test suite.
- Edit tsdb/skeletons/Index.lisp to include a line for this
directory, e.g.:
(
((:path . "matrix") (:content . "matrix: A test suite created automatically from the test sentences given in the Grammar Matrix questionnaire."))
((:path . "corpus") (:content . "IGT provided by the linguist"))
((:path . "lab3") (:content . "Test suite collected for Labs 2-3."))
)
- Download the python script make_item, make
sure it is executable, and run it on your test suite:
./make_item testsuite.txt
Notes on make_item:
- This script is going to be pretty picky about the format
of your test suite. If you have questions, please post to Canvas (10 minute rule!).
- It requires python3, which is on the current version of the Ubuntu+LKB appliance.
- Alternatively, you can copy your testsuite and make_item over to patas and run there, or install python3 (from http://python.org/download) on your host OS (mac or windows), and run make_item outside VirtualBox.
- If the above command is successful, testsuite.txt.item will be created in the working directory. If the testsuite contains errors, a lot of output may appear on stderr. It may be useful to redirect this into a file that you can go through to correct the errors one at a time. For example:
./make_item testsuite.txt item 2>errs
The command just above attempts to create 'item' in the working directory, and stderr messages are redirected to the file 'errs'.
make_item contains a default mapping from testsuite line types into particular fields of the [incr tsdb()] item file. The default mapping puts 'orth' into 'i-input', the field which is the input to the grammar. If your grammar targets a different testsuite line, override the default mapping with the -m/--map option:
./make_item --map orth-seg i-input testsuite.txt item
The invocation above maps the orth-seg line into the input field.
You can run make_item with -h/--help to see a summary of the options.
- Copy the .item file which is output by make_item
to tsdb/skeletons/lab3/item.
- Copy tsdb/skeletons/Relations to tsdb/skeletons/lab3/relations (notice the change from R to r).
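The directory creation and the two copy steps above can be scripted. The following is a minimal Python sketch, not part of the course tools: the function name install_skeleton and its arguments are hypothetical, and it assumes grammar_dir is the directory that contains tsdb/ and that item_file is the file written by make_item. It does not edit Index.lisp; it only reports whether an entry for the skeleton appears to be there.

```python
import shutil
from pathlib import Path


def install_skeleton(grammar_dir, item_file, name="lab3"):
    """Copy an item file and the shared Relations file into a skeleton.

    Hypothetical helper: grammar_dir is the directory containing tsdb/,
    item_file is the file produced by make_item.
    """
    skeletons = Path(grammar_dir) / "tsdb" / "skeletons"
    target = skeletons / name
    target.mkdir(parents=True, exist_ok=True)

    # Inside the skeleton the item file must be named exactly 'item'.
    shutil.copyfile(item_file, target / "item")
    # Note the lowercase 'r' in the destination file name.
    shutil.copyfile(skeletons / "Relations", target / "relations")

    # Index.lisp still has to be edited by hand; just check whether the
    # skeleton name occurs in it as a quoted string.
    index = skeletons / "Index.lisp"
    listed = index.exists() and f'"{name}"' in index.read_text()
    return target, listed
```

Called from the grammar directory as install_skeleton(".", "testsuite.txt.item"), it returns the skeleton path and a flag; if the flag is False, the Index.lisp entry is probably still missing.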
Initial testsuite run
Improve the morphotactics in the choices file
The kinds of improvements that might be required here depend on the choices files, but could be things like:
- Reducing ambiguity by combining multiple redundant position classes into one
- Reducing ambiguity by eliminating affixes that don't have any sensible morphosyntactic constraints associated with them
- Renaming the position classes and rules so that they have meaningful names.
- Making the overall system simpler, so that it is easier to work with, even if it has less coverage that way.
- Improving lexical coverage of the grammar by making sure that existing morphemes can attach to appropriate inputs.
- Improving lexical coverage by filling in additional lexical rules within existing position classes.
It may be tempting to rip the whole thing out and start from scratch, but doing so will lead to drastic reductions in coverage. If possible, it's better to improve on what's there.
Depending on the size of the choices file, the customization page might be really slow and you may find it more convenient to edit the choices file directly. This is okay, but proceed with caution: if you remove position classes that other position classes take as input, you'll need to update those other position classes as well.
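If you do edit the choices file by hand, a quick consistency check can catch inputs that point at a position class you have removed. The sketch below is only an illustration under stated assumptions: it assumes position classes appear as keys of the form PREFIX-pcN_... (for example verb-pc1_name) and that their inputs are a comma-separated list under PREFIX-pcN_inputs. The function name dangling_inputs is hypothetical, and only references to whole position classes are checked; references to lexical types or to individual lexical rule types are ignored.

```python
import re

# Assumed key shape: e.g. 'verb-pc1_inputs' or 'noun-pc12_lrt1_name'.
PC_KEY = re.compile(r"^(\w+-pc\d+)_")
# Assumed shape of a reference to a whole position class, e.g. 'verb-pc2'.
PC_REF = re.compile(r"^\w+-pc\d+$")


def dangling_inputs(choices_text):
    """Map each position class to the undefined position classes it takes as input."""
    defined = set()
    inputs = {}
    for line in choices_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue  # not a key=value line
        match = PC_KEY.match(key)
        if not match:
            continue  # not a position-class key
        pc = match.group(1)
        defined.add(pc)
        if key == pc + "_inputs":
            inputs[pc] = [v.strip() for v in value.split(",") if v.strip()]

    problems = {}
    for pc, refs in inputs.items():
        missing = [r for r in refs if PC_REF.match(r) and r not in defined]
        if missing:
            problems[pc] = missing
    return problems
```

Running dangling_inputs(open("choices").read()) on your file (the file name here is an assumption) should return an empty dictionary if every position class named as an input is still defined.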
This task could take up unlimited amounts of time, and we'd like to avoid that! You can get full credit for this part of the lab so long as:
- You have made changes to the choices file involving at least three position classes.
- Your write up clearly describes the changes you made and why.
- The resulting choices file can produce a customized grammar that loads and clearly improves on the previous grammar (more coverage or lower ambiguity without drastic reduction in coverage).
Make sure you can parse individual sentences
Once you have created your starter grammar (or each time you
create one, as you should iterate through grammar creation and
testing a few times as you refine your choices), try it out on a
couple of sentences interactively to see if it works:
- Load the grammar into the LKB.
- Using the parse dialog box (or 'C-c p' in emacs to get the parse
command inserted at your prompt), enter a sentence to parse. Alternatively, you can use Browse | Test items in [incr tsdb()] and just double click the item you want to parse. This sends the item to the LKB for interactive processing, but doesn't store the result in the test suite profile.
- Examine the results. If it does parse, check out the semantics (pop-up menu on the little trees). If it doesn't, look at the parse chart to see why not.
- Problems with lexical rules and lexical entries often become apparent here, too: If the LKB can't find an analysis for one of your words, it will say so, and (obviously) fail to parse the sentence.
Note that the questionnaire has a section for test sentences. If
you use this, then the parse dialog will be pre-filled with your test sentences.
Run both the test corpus and the testsuite
Following the same procedure as the first time you ran your test corpus, do test runs over both the testsuite and the test corpus.
Again, collect the following information to provide in your write up. Please present the answers to the first four questions as some kind of table contrasting before & after.
- How many items parsed?
- What is the average number of parses per parsed item?
- How many parses did the most ambiguous item receive?
- What sources of ambiguity can you identify?
- For 10 items (if you have at least that many parsing), do any of the parses look reasonable in the semantics?
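The first three questions are simple arithmetic over per-item parse counts. [incr tsdb()] reports these figures itself, so the following Python sketch is only meant to make the definitions explicit and to show one way of laying out the before/after table. The function names and the input lists (one parse count per item, 0 for an item that did not parse) are hypothetical.

```python
def summarize(readings):
    """Summarize a list of per-item parse counts (0 means the item did not parse)."""
    parsed = [n for n in readings if n > 0]
    return {
        "items": len(readings),
        "parsed": len(parsed),
        # The average is taken over parsed items only, not over all items.
        "avg_parses": sum(parsed) / len(parsed) if parsed else 0.0,
        "max_parses": max(parsed, default=0),
    }


def comparison_table(before, after):
    """Format a plain-text table contrasting two test runs."""
    b, a = summarize(before), summarize(after)
    rows = [f"{'':<12}{'before':>10}{'after':>10}"]
    for key in ("items", "parsed", "avg_parses", "max_parses"):
        fmt = ".2f" if key == "avg_parses" else "d"
        rows.append(f"{key:<12}{b[key]:>10{fmt}}{a[key]:>10{fmt}}")
    return "\n".join(rows)
```

For example, print(comparison_table(lab2_counts, lab3_counts)) gives a four-row table you can paste into a plain text write up; the sources of ambiguity still need to be described in prose.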
Write up
Your write up should be a plain text file (not .doc, .rtf or .pdf)
which includes the following:
- Your answers to the questions about the initial and final [incr tsdb()] runs, for both test corpus and test suite, repeated here:
- How many items parsed?
- What is the average number of parses per parsed item?
- How many parses did the most ambiguous item receive?
- What sources of ambiguity can you identify?
- For 10 items (if you have at least that many parsing), do any of the parses look reasonable in the semantics? [Can be the same items as before.]
- Documentation of the phenomena you have added to your testsuite,
illustrated with examples from the testsuite.
- Documentation of the improvements you made to the morphotactic choices. What did you change and why? Please include IGT that illustrate the effects of the changes so I can test them out.