Lab 8 (Due 2/24 11:45pm)

Preliminaries

These instructions will most likely get edited over the next couple of days. I'll try to flag changes.

As usual, check the write up instructions first.

Requirements for this assignment

0. Make sure you have a baseline test suite corresponding to your lab 7 grammar.
1. Add information structure marking constructions. You'll need to post to GoPost early in the week with a description of how these constructions work in your language so I can help with the syntax.
2. Check the semantics of your information structure marking constructions.
3. Make sure your grammar can still generate, and debug as necessary.
4. Use VPM to cut back on range of generation.
5. Test your grammar using [incr tsdb()]. [incr tsdb()] should be part of your test-development cycle. In addition, you'll need to run a final test suite instance for this lab to submit along with your baseline.
6. Write up the lab.

Run a baseline test suite

Before making any changes to your grammar for this lab, run a baseline test suite instance. If you decide to add items to your test suite for the material covered here, consider doing so before modifying your grammar so that your baseline can include those examples. (Alternatively, if you add examples in the course of working on your grammar and want to make the snapshot later, you can do so using the grammar you turned in for Lab 7.)

Background

The goal of this lab is to model morphosyntactic (and, if you prefer, pseudo-model prosodic) marking of information structural concepts. Information structure is a pragmatic phenomenon relating to how the speaker/author presents the information contained in an utterance. Only in rare cases do we find grammaticality affected by information structure marking. Rather, marking of information structure constrains the possible interpretations of an utterance.

There is no consensus yet among linguists as to the range of semantic/pragmatic distinctions that should be made in information structure, nor on how to represent these distinctions. Taking the approach of incremental development, we will start with a simple three way distinction between topic, focus, and unmarked. I take these to be properties not of referents, but of the linguistic expressions that refer to referents, and in particular, of semantic indices. Working loosely from Lambrecht 1996, topic and focus are defined as follows:

Topic: The expression referring to a known/given or inferrable entity that the rest of the sentence provides further information about.
Focus: The new information asserted by the speaker, against the background of presupposed information (topic and tail).
Tail: The rest of the sentence, that is neither topic nor focus.

A few things to note:

Every sentence is presumed to have a focus, but not every sentence has focus overtly marked.
Not every sentence has a topic. Furthermore, in many languages at least, topics don't need to be overtly marked as such.
Our "unmarked" means: not overtly marked as topic or focus, not "neither topic nor focus".
Focus is often said to "project" from the constituent on which it is overtly marked to larger constituents containing the focus-marked one. We won't be attempting to model this, but are assuming for now that it can be reconstructed out of the MRS downstream if need be. (This possibly isn't true, if syntactic structure guides focus projection, since the syntactic structure, by design, isn't available in the MRS.)

Representations

We will be representing information structure via a new feature within CONT, called ICONS (for "individual constraints"). ICONS will have a diff-list as its value, like HCONS or RELS. The items on the ICONS list will be feature structures of type info-str. Each of these has the features CLAUSE and TARGET, indicating which index has the topic/focus property (the TARGET) and with respect to which clause (CLAUSE). The subtype of the info-str feature structure will indicate which relation (topic or focus) is involved.
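
As a rough illustration of the shape of these types (not a copy of what the new matrix.tdl contains), the hierarchy described above could be sketched as follows. The supertype avm and the value type individual are assumptions; only the names info-str, CLAUSE, and TARGET, and the topic/focus distinction come from the description above.

```tdl
; Illustration only: the matrix.tdl supplied for this lab already provides
; the ICONS infrastructure, so do NOT add these definitions to your grammar.
; The supertype (avm) and the value type (individual) are assumptions.
info-str := avm &
  [ CLAUSE individual,
    TARGET individual ].

; The subtype names the relation. An element that stays plain info-str
; is "unmarked", i.e. not overtly marked as topic or focus.
topic := info-str.
focus := info-str.
```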

Each relation-bearing lexical entry will introduce an underspecified ICONS element into the ICONS list. Because we don't want to go digging around in diff-lists, the lexical entries also each maintain a pointer to the ICONS they introduced via the feature HOOK.--ICONS.

Sanghoun has prepared a new version of matrix.tdl which has the infrastructure you'll need for this new feature. Please download it, and place it in your grammar directory (overwriting the old matrix.tdl). Then try loading your grammar and parsing your test suite. As there have been a couple of other changes to matrix.tdl since you last used the customization system, you may find that you need to make adjustments to your my_language.tdl file unrelated to information structure.

Add information-structure marking constructions

NB: What we're targeting here is constructions that specifically mark information structure, rather than being strongly correlated with it. For example, English subjects tend to be topics, but aren't necessarily so. Therefore, we wouldn't mark subject position in English as [INFO-STR topic].

This section lists a few kinds of topic/focus marking that I'm aware of, with some sketches of how to implement them. It is expected that you will post the details of what's happening in your language to GoPost so I can make more specific suggestions. Please do this as early in the week as possible.

Position in the sentence

In some languages, distinguished positions (e.g., right before the verb, sentence-initial, etc.) are associated with topic or focus. The strategy here is to identify the rules that license elements in the relevant position, and then have the rules constrain the HOOK.--ICONS of the appropriate daughter. In some cases, you may need to create new rules: If there's a sentence-initial "topicalized" position, you may need a head-filler construction. If there's a preverbal "focus" position, you may need to bifurcate the head-final rules to create one series that insists on a lexical verb ([HEAD verb, LIGHT +]) as the head and another that allows larger constituents as the head. Only the former will constrain INFO-STR.
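
A minimal sketch of the preverbal-focus case, assuming your grammar already has a head-final complement rule: comp-head-phrase here is a placeholder for whatever that rule is called in your grammar, focus-comp-head-phrase is a made-up name, and the full feature paths are my expansion of the [HEAD verb, LIGHT +] shorthand. Check them against your own types before using any of this.

```tdl
; Hypothetical names: focus-comp-head-phrase and its supertype
; comp-head-phrase stand in for your grammar's own head-final rule types.
; This variant insists on a lexical verb as head and marks the
; preverbal (non-head) daughter as focus.
focus-comp-head-phrase := comp-head-phrase &
  [ HEAD-DTR.SYNSEM [ LOCAL.CAT.HEAD verb,
                      LIGHT + ],
    NON-HEAD-DTR.SYNSEM.LOCAL.CONT.HOOK.--ICONS focus ].
```

The sister rule that allows larger constituents as the head would leave HOOK.--ICONS alone. Both rules need instances in rules.tdl.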

Focus/topic clitics or adpositions

These are relatively straightforward. They are either heads combining with complements or modifiers combining with heads. The first step is to get the syntax right. Post the details to GoPost if it's not (immediately) clear how to do it (10 minute rule and all that).

Semantically, they constrain the HOOK.--ICONS value of the element they combine with (through either the COMPS list or the MOD list, depending).
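
The two options might look roughly like this. All four type names are placeholders for whatever your grammar uses for these word classes; the only substantive part is the constraint on the HOOK.--ICONS of the selected element, reached through COMPS in the first case and through MOD in the second.

```tdl
; Head variant: the marker selects the marked phrase as its complement.
; focus-marker-lex and adposition-lex are hypothetical names.
focus-marker-lex := adposition-lex &
  [ SYNSEM.LOCAL.CAT.VAL.COMPS < [ LOCAL.CONT.HOOK.--ICONS focus ] > ].

; Modifier variant: the marker modifies the marked phrase.
; topic-marker-lex and modifier-clitic-lex are hypothetical names.
topic-marker-lex := modifier-clitic-lex &
  [ SYNSEM.LOCAL.CAT.HEAD.MOD < [ LOCAL.CONT.HOOK.--ICONS topic ] > ].
```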

Cleft construction

Some languages mark focus with a construction that involves the copula and a relative clause, like English "It was Kim who left." where "Kim" is focused. Since we're not otherwise handling relative clauses, these are outside the scope of this lab.

Focus prosody

In many languages, the primary means for unambiguously marking focus is prosody (intonation). This isn't typically represented in the orthography, so we can only pseudo-model it. The plan here is to make up an affix (-FP, for "Focus Prosody") that attaches to the word bearing the focus marking. This affix should go last in the chain of lexical rules (so make its DTR value be the type of the last existing lexical rule, or a -dtr supertype inherited by the set of last existing lexical rules in case some of those are optional). It should also be optional, which can be achieved by making it lexeme-to-lexeme.

More specifically, this rule should be an infl-add-only-no-ccont-ltol-rule, and its only effect besides adding the -FP affix should be to constrain the INDEX.--ICONS to focus.
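
Something along these lines, as a sketch under assumptions rather than a tested rule. The names focus-prosody-lex-rule, focus-prosody-suffix, and last-lex-rule-dtr are made up. The path to --ICONS below follows the HOOK.--ICONS wording used earlier on this page, so check the new matrix.tdl for where the feature actually lives and adjust the path to match. The irules.tdl entry uses the usual suffix-rule format; the lowercase spelling is also an assumption.

```tdl
; In my_language.tdl. last-lex-rule-dtr is a placeholder for the type of
; your last existing lexical rule (or a shared -dtr supertype).
focus-prosody-lex-rule := infl-add-only-no-ccont-ltol-rule &
  [ DTR last-lex-rule-dtr,
    SYNSEM.LOCAL.CONT.HOOK.--ICONS focus ].

; In irules.tdl: spell the made-up affix as a suffix.
focus-prosody-suffix :=
%suffix (* -fp)
focus-prosody-lex-rule.
```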

Check your semantics

Once you're satisfied with the syntax of your topic and/or focus marking, take a look at the semantics. If you just look at the MRS the way we have been, you won't see any changes. This is because the ICONS feature is new, and not yet incorporated into the code that does the MRS display. (It's also not yet incorporated into the generator, and so generation will ignore the ICONS information, unfortunately.)

To see the ICONS information, you'll need to look at the feature structure for the top-most node in the tree, and then navigate to the CONT.ICONS. You should see one item on the ICONS list for every (non-semantically empty) lexical item. The ones that are marked as topic or focus should have the correct types, while the others should just be unmarked (i.e., just info-str).

Variable property mapping

Adding topic and/or focus marking is probably increasing your range of generation outputs. Unfortunately, the generator isn't paying attention to ICONS (yet), so we can't constrain this. Instead, this week we'll work on using VPM to constrain other kinds of generator output variation, e.g., multiple different tense/aspects from one input. The basic strategy is to take any underspecified values in variable properties and translate them, via VPM, to something that conflicts with any more specific values your grammar can produce.

The file semi.vpm provides a mapping between grammar-external features of indices (referential indices and events) and their values, and grammar-internal ones. For background on VPM, see the DELPH-IN wiki. As soon as you start using a VPM file, only variable properties (features on indices) that are handled in the file are actually preserved.

Save the file semi.vpm to your grammar directory. (This starter file should already handle the INFO-STR marking appropriately.)
Edit the file lkb/script to add the following line, right before the comment that starts "Next, the lexicon itself":
```
(mt:read-vpm (lkb-pathname (parent-directory) "semi.vpm") :semi)
```
If your grammar uses a PERNUM feature, you'll need to map separate PER and NUM features from the external (right-hand side) of the VPM to a single PERNUM feature on the internal (left-hand side). See the example under "Properties: An Example" on the DELPH-IN wiki page.
If your grammar encodes aspectual distinctions, you'll need to add an ASPECT section, modeled on tense. This should allow you to create and use a specific default value of ASPECT.
If you have any other features you have added on indices, you will need to provide VPM entries for them as well.
If your language has aspect marked in some sentences but other forms that are just underspecified for aspect, you'll want to have the default aspect be "no-aspect". Define this as a subtype of aspect in your grammar, but don't have anything other than the semi.vpm mention it otherwise.
You can do a similar trick for other kinds of generation ambiguity relating to variable properties.
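
On the grammar side, the "no-aspect" default amounts to a single type definition, assuming your grammar's aspect type is called aspect as above:

```tdl
; In my_language.tdl: a dummy value that nothing in the grammar selects.
; Only semi.vpm should mention it, as the default for ASPECT.
no-aspect := aspect.
```

Because no lexical entry or rule ever assigns no-aspect, it conflicts with every specific aspect value your grammar produces, which is what cuts down the generator output.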

Test your semi.vpm file by parsing and then generating. You should see fewer strings coming out.

Write up your analyses

Describe the topic and focus marking that you found in your language, including IGT examples that I can test.
Describe how you implemented the topic and focus marking.
If your implementation is incomplete, state how, and provide IGT examples illustrating problems, if you would like me to take a look.
Describe what steps you had to take to make your grammar generate, or, if it's not generating, any ideas you have on where the problem might be. If some examples generate but not others, provide an example of each for me to test & provide feedback.
Describe any changes you needed to make to the semi.vpm file, and the effects that including the semi.vpm had on generation.
Describe the current coverage of your grammar over your test suite (using numbers you can get from Analyze | Coverage and Analyze | Overgeneration in [incr tsdb()]) and a comparison between your baseline test suite run and your final one for this lab (see Compare | Competence).

Submit your assignment

If you're using svn for version control, run svn export to make a copy of your lab directory that does not include .svn files.
Remove extraneous [incr tsdb()] profiles from the copied directory. (I'd like the initial baseline and final result for both test suite and test corpus; only keep intermediate versions that you specifically want to say something about.)
Create a tarball of your grammar, your tsdb directory including both initial and final profiles, and your write up. The best way to do this (so that it unpacks most easily when I download from CollectIt) is to cd into the directory containing your lab (e.g., cd lab8/) and do:
tar czf lab8.tgz *
(When I download your submission from CollectIt, it comes in a directory named with your UWNetID. The above method avoids extra directory structure inside that directory.)
Upload the tarball to CollectIt.

ebender at u dot washington dot edu

Last modified: 2/18/12