Lab 8 (Due 2/26 11:45pm)

Preliminaries

These instructions might get edited a bit over the next couple of days. I'll try to flag changes.

As usual, check the write up instructions first.

Requirements for this assignment

0. Make sure you have a baseline test suite corresponding to your lab 7 grammar.
1. Add information structure marking constructions. You'll need to post to GoPost early in the week with a description of how these constructions work in your language so I can help with the syntax.
2. Check the semantics of your information structure marking constructions.
3. Make sure your grammar can still generate, and debug as necessary.
4. Use VPM to cut back on the range of generated outputs.
5. Test your grammar using [incr tsdb()]. [incr tsdb()] should be part of your test-development cycle. In addition, you'll need to run a final test suite instance for this lab to submit along with your baseline.
6. Write up the lab.

Run a baseline test suite

Before making any changes to your grammar for this lab, run a baseline test suite instance. If you decide to add items to your test suite for the material covered here, consider doing so before modifying your grammar so that your baseline can include those examples. (Alternatively, if you add examples in the course of working on your grammar and want to make the snapshot later, you can do so using the grammar you turned in for Lab 7.)

Background

The goal of this lab is to model morphosyntactic (and, if you prefer, pseudo-model prosodic) marking of information structural concepts. Information structure is a pragmatic phenomenon relating to how the speaker/author presents the information contained in an utterance. Only in rare cases do we find grammaticality affected by information structure marking. Rather, marking of information structure constrains the possible interpretations of an utterance.

There is no consensus yet among linguists as to the range of semantic/pragmatic distinctions that should be made in information structure, nor on how to represent these distinctions. Taking the approach of incremental development, we will start with a simple three-way distinction between topic, focus, and unmarked. I take these to be properties not of referents, but of the linguistic expressions that refer to referents, and in particular, of semantic indices. Working loosely from Lambrecht 1996, topic and focus are defined as follows:

Topic: The expression referring to a known/given or inferrable entity that the rest of the sentence provides further information about.
Focus: The new information asserted by the speaker, against the background of presupposed information (topic and tail).
Tail: The rest of the sentence, that is neither topic nor focus.

A few things to note:

Every sentence is presumed to have a focus, but not every sentence has focus overtly marked.
Not every sentence has a topic. Furthermore, in many languages at least, topics don't need to be overtly marked as such.
Our "unmarked" means: not overtly marked as topic or focus, not "neither topic nor focus".
Focus is often said to "project" from the constituent on which it is overtly marked to larger constituents containing the focus-marked one. We won't be attempting to model this, but are assuming for now that it can be reconstructed out of the MRS downstream if need be. (This possibly isn't true, if syntactic structure guides focus projection, since the syntactic structure---by design---isn't available in the MRS.)

Representations

As noted above, we're going to represent information structure as a property of indices. In particular, the ARG0 of a relation that is marked as topic or focus will have its INFO-STR feature constrained to show this information. To do this, we need to add the feature INFO-STR and set up its possible values.

Add the following to my_language.tdl:

individual :+ [ INFO-STR info-str ].

info-str := *top*.
marked := info-str.
unmarked := info-str.
topic := marked.
focus := marked.

By adding this feature to individual we are making it appropriate for both verby indices (events, also appropriate for adjectives, adpositions and adverbs) and nouny indices (ref-ind).

The purpose of the type unmarked is to contrast with both topic and focus. The grammar won't actually constrain anything to be this type --- things that are unmarked will be left underspecified as info-str --- but we'll use VPM (see below) to change underspecified to unmarked on the way in to the generator, and so cut down on the range of generated outputs.

Add information-structure marking constructions

NB: What we're targeting here is constructions that specifically mark information structure, rather than being strongly correlated with it. For example, English subjects tend to be topics, but aren't necessarily so. Therefore, we wouldn't mark subject position in English as [INFO-STR topic].

This section lists a few kinds of topic/focus marking that I'm aware of, with some sketches of how to implement them. It is expected that you will post the details of what's happening in your language to GoPost so I can make more specific suggestions. Please do this as early in the week as possible.

Position in the sentence

In some languages, distinguished positions (e.g., right before the verb, sentence-initial, etc.) are associated with topic or focus. The strategy here is to identify the rules that license elements in the relevant position, and then have the rules constrain the INDEX.INFO-STR of the appropriate daughter. In some cases, you may need to create new rules: If there's a sentence-initial "topicalized" position, you may need a head-filler construction. If there's a preverbal "focus" position, you may need to bifurcate the head-final rules to create one series that insists on a lexical verb ([HEAD verb, LIGHT +]) as the head and another that allows larger constituents as the head. Only the former will constrain INFO-STR.
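
As a rough illustration of the preverbal focus case, the sketch below splits a head-final head-complement rule in two. The rule names are hypothetical, the supertypes (basic-head-1st-comp-phrase, head-final) are assumed to be the ones your grammar already uses for this rule, and the path to LIGHT and the LIGHT value of phrasal heads may differ in your grammar.

```tdl
; Hypothetical sketch: rule names and feature paths are assumptions;
; adapt them to the head-final rules your grammar already has.

; Lexical verb as head: the sister sits in the preverbal focus position.
focus-comp-head-phrase := basic-head-1st-comp-phrase & head-final &
  [ HEAD-DTR.SYNSEM [ LOCAL.CAT.HEAD verb,
                      LIGHT + ],
    NON-HEAD-DTR.SYNSEM.LOCAL.CONT.HOOK.INDEX.INFO-STR focus ].

; Larger constituent as head: no information structure constraint.
; (Assumes phrasal heads are LIGHT - in your grammar.)
plain-comp-head-phrase := basic-head-1st-comp-phrase & head-final &
  [ HEAD-DTR.SYNSEM.LIGHT - ].
```

Each type would also need an instance in rules.tdl, and the same split would be repeated for any other head-final rules (e.g., subject-head) that can place an element right before the verb.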

Focus/topic clitics or adpositions

These are relatively straightforward. They are either heads combining with complements or modifiers combining with heads. The first step is to get the syntax right. Post the details to GoPost if it's not (immediately) clear how to do it (10 minute rule and all that).

Semantically, they constrain the INFO-STR value of the element they combine with (through either the COMPS list or the MOD list, depending).

NB: Russian "li" and Nishnaabemwin "na" fall into this category. Note also that if, when the clitic attaches to the verb, you get ambiguity between narrow focus on the verb and whole-sentence focus, it's probably best to model this with two entries for the clitic. One modifies any host, and constrains the INFO-STR of that host. The other modifies only verbs and doesn't constrain INFO-STR at all (but still, in the case of question clitics, introduces a non-empty YNQ value). That way, a question with the clitic on the verb will translate to a question with no particular focus marking in the other languages in the MT set-up.
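
A minimal sketch of the modifier version, assuming the clitic contributes no relation of its own and attaches through an intersective head-modifier rule. The type name, the choice of HEAD adv, and the supertype are all assumptions made for illustration, not something the Matrix prescribes for clitics.

```tdl
; Hypothetical sketch: type name, HEAD value, and supertype are assumptions.
; A focus clitic treated as a semantically empty modifier of its host.
focus-clitic-lex := no-hcons-lex-item &
  [ SYNSEM.LOCAL [ CAT [ HEAD adv &
                         [ MOD < [ LOCAL.CONT.HOOK.INDEX.INFO-STR focus ] > ],
                         VAL [ SUBJ < >,
                               COMPS < >,
                               SPR < >,
                               SPEC < > ] ],
                   CONT.RELS <! !> ] ].
```

If the marker is instead a head that selects its host, the same INFO-STR constraint would go on the INDEX of the sole element of its COMPS list. For the second, verb-only entry described above, you would drop the INFO-STR constraint and restrict the MOD element to HEAD verb.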

Cleft construction

Some languages mark focus with a construction that involves the copula and a relative clause, like English "It was Kim who left." where "Kim" is focused. Since we're not otherwise handling relative clauses, these are outside the scope of this lab.

Focus prosody

In many languages, the primary means for unambiguously marking focus is prosody (intonation). This isn't typically represented in the orthography, so we can only pseudo-model it. The plan here is to make up an affix (-FP, for "Focus Prosody") that attaches to the word bearing the focus marking. This affix should go last in the chain of lexical rules (so make its DTR value be the type of the last existing lexical rule, or a -dtr supertype inherited by the set of last existing lexical rules in case some of those are optional). It should also be optional, which can be achieved by making it lexeme-to-lexeme.

More specifically, this rule should be an infl-add-only-no-ccont-ltol-rule, and its only effect besides adding the -FP affix should be to constrain the INDEX.INFO-STR to focus.
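
Put together, the rule might look roughly like the following. The names focus-prosody-lex-rule, focus-prosody-suffix, and focus-prosody-rule-dtr are hypothetical; the last one stands in for whatever type (or -dtr supertype) your final existing lexical rules share.

```tdl
; Hypothetical sketch. In my_language.tdl:
; focus-prosody-rule-dtr is a placeholder for the type of your last
; existing lexical rule (or a -dtr supertype those rules inherit from).
focus-prosody-lex-rule := infl-add-only-no-ccont-ltol-rule &
  [ SYNSEM.LOCAL.CONT.HOOK.INDEX.INFO-STR focus,
    DTR focus-prosody-rule-dtr ].

; In irules.tdl: spell the pseudo-affix as the suffix -FP.
focus-prosody-suffix :=
%suffix (* -FP)
focus-prosody-lex-rule.
```

With something like this in place, a test suite item would carry -FP on the focused word, and parsing it should show focus on that word's index in the MRS.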

As-for topics

An "as for" topic is a topic that is only loosely connected to the clause it combines with, rather than filling an argument or adjunct position in that clause. In English, these are expressed with "as for", but in other languages, they can just fill the ordinary topic position or take ordinary topic marking. A Japanese example is given below:

Amerika wa supiido suketaa ga hayai
America TOP speed skater NOM fast
`As for America, the speed skaters are fast.'

Semantically, we can model these via a relation topic_p_rel that takes the INDEX of the topic-marked element as its ARG1 and the INDEX of the clause it attaches to (as a modifier, in the case of Japanese) as its ARG2.
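
One way this could be sketched for a Japanese-style topic marker is as an adposition that takes the topic NP as its complement and modifies the clause. Everything here is an assumption for illustration: the type name, the choice of single-rel-lex-item and arg12-ev-relation, the use of a string PRED value, and the assignment of the clause's INDEX to ARG2.

```tdl
; Hypothetical sketch: type name, supertypes, and role assignment are
; assumptions; adjust to your language's syntax.
astopic-marker-lex := single-rel-lex-item &
  [ SYNSEM [ LOCAL.CAT [ HEAD adp &
                         [ MOD < [ LOCAL [ CAT.HEAD verb,
                                           CONT.HOOK.INDEX #clause ] ] > ],
                         VAL [ SUBJ < >,
                               SPR < >,
                               SPEC < >,
                               COMPS < [ LOCAL [ CAT.HEAD noun,
                                                 CONT.HOOK.INDEX #topic ] ] > ] ],
             LKEYS.KEYREL arg12-ev-relation &
                          [ PRED "topic_p_rel",
                            ARG1 #topic,    ; INDEX of the topic-marked element
                            ARG2 #clause ] ] ].  ; INDEX of the modified clause
```

The point of the sketch is only the linking: the two reentrancy tags tie the topic NP and the clause to the arguments of topic_p_rel, so the topic need not fill any argument position of the verb.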

When as-for topics are marked the same way as "ordinary" topics in a language, and there is also pro-drop, we face a choice about ambiguity. Consider the following Japanese sentence:

Ohno wa hayai
Ohno TOP fast
`Ohno is fast'
`As for Ohno, he is fast.'

On one analysis, there are two parses for this sentence: one in which Ohno is a subject which is also marked as a topic, and one in which Ohno is an as-for topic and the subject is dropped (`As for Ohno, he is fast.'). Since we have to provide the as-for topic analysis, the question is whether we let that stand in for the other, or allow both parses.

For present purposes, if your topic marking is sentence-initial position, and as-for topics can go in that spot, and you have pro-drop, I recommend just assimilating everything to the as-for case. That'll mean a little extra work in the transfer rules for MT, but it will save dealing with SLASH (aka GAP) and head-filler rules.

Check your semantics

Once you're satisfied with the syntax of your topic and/or focus marking, take a look at the MRS. You should see INFO-STR as a property of both event and referential indices, with the value info-str when nothing has constrained it, topic on variables introduced as indices of topic-marked words or phrases, and focus on variables introduced as indices of focus-marked words or phrases.

At this point, we expect to see lots of output on generation: pretty much any combination of topic and focus marking output from the MRS of pretty much any sentence. The only constraint would be that something with explicit topic or focus marking on the input sentence shouldn't get the opposite marking on the output sentence.

Variable property mapping

Since we don't really want that much flexibility in generation, we're going to use variable property mapping to constrain the outputs so that only elements explicitly marked as foci can surface as focus-marked and only elements explicitly marked as topics can surface as topic-marked. The basic strategy is to take the underspecified value ([INFO-STR info-str]) in the input MRS and translate it, via VPM, to something that conflicts with both topic and focus, namely unmarked.

The file semi.vpm provides a mapping between grammar-external features of indices (referential indices and events) and their values, and grammar-internal ones. For background on VPM, see the DELPH-IN wiki. As soon as you start using a VPM file, then only variable properties (features on indices) that are handled in the file are actually preserved.

Save the file semi.vpm to your grammar directory. (This starter file should already handle the INFO-STR marking appropriately.)
Edit the file lkb/script to add the following line, right before the comment that starts "Next, the lexicon itself":
```
(mt:read-vpm (lkb-pathname (parent-directory) "semi.vpm") :semi)
```
If your grammar uses a PERNUM feature, you'll need to map separate PER and NUM features from the external (right-hand side) of the VPM to a single PERNUM feature on the internal (left-hand side). See the example under "Properties: An Example" on the DELPH-IN wiki page.
If your grammar encodes aspectual distinctions, you'll need to add an ASPECT section, modeled on tense. This should allow you to specify a default value of ASPECT as well.
If you have any other features you have added on indices, you will need to provide VPM entries for them as well.
If your language has aspect marked in some sentences but other forms that are just underspecified for aspect, you'll want to have the default aspect be "no-aspect". Define this as a subtype of aspect in your grammar, but don't have anything other than the semi.vpm mention it otherwise.
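
On the grammar side, the no-aspect default amounts to a single type definition. This is a sketch that assumes aspect is the supertype of your grammar's aspect values, as it is in the Matrix.

```tdl
; Sketch, assuming `aspect` is the supertype of your aspect values.
; No lexical entry or rule should refer to no-aspect; only semi.vpm
; mentions it, as the default for forms underspecified for aspect.
no-aspect := aspect.
```

Because nothing in the grammar constrains a form to no-aspect, the type only ever arises through the VPM mapping, where it conflicts with the overtly marked aspect values.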

Test your semi.vpm file by parsing and then generating. You should see fewer strings coming out.

Write up your analyses

Describe the topic and focus marking that you found in your language, including IGT examples that I can test.
Describe how you implemented the topic and focus marking.
If your implementation is incomplete, state how, and provide IGT examples illustrating problems, if you would like me to take a look.
Describe what steps you had to take to make your grammar generate, or, if it's not generating, any ideas you have on where the problem might be. If some examples generate but not others, provide an example of each for me to test & provide feedback.
Describe any changes you needed to make to the semi.vpm file, and the effects that including the semi.vpm had on generation.
Describe the current coverage of your grammar over your test suite (using numbers you can get from Analyze | Coverage and Analyze | Overgeneration in [incr tsdb()]) and a comparison between your baseline test suite run and your final one for this lab (see Compare | Competence).

Submit your assignment

Create a tarball of your grammar, your tsdb directory including both initial and final profiles, and your write up. The best way to do this (so that it unpacks most easily when I download from CollectIt) is to cd into the directory containing your lab (e.g., cd lab8/) and do:
tar czf lab8.tgz *
(When I download your submission from CollectIt, it comes in a directory named with your UWNetID. The above method avoids extra directory structure inside that directory.)
Upload the tarball to CollectIt.


ebender at u dot washington dot edu

Last modified: Sun Feb 21 00:22:42 PST 2010