Ling/CSE 472: Assignment 4: Probabilistic Grammars; Feature Structure Grammars

Due 5/15, 5pm

Part 1: Probabilities from a treebank

This part of the assignment concerns counting rule frequency in a treebank, such as would need to be done in order to build a PCFG.

The treebank in question is the Redwoods Treebank (version of 2004). Unlike the Penn Treebank and similar resources, the Redwoods Treebank is built by using a broad-coverage precision grammar (the English Resource Grammar (ERG)) to parse a corpus. Human annotators then choose among the candidate parses proposed by the ERG, potentially rejecting all of them if the ERG does not in fact propose an appropriate parse for a given sentence. The Redwoods system facilitates this annotation by calculating binary discriminators which distinguish sets of trees, and furthermore records the decisions that annotators make for each discriminator. As the grammar evolves, the corpus can be reparsed and the decisions can be rerun to produce a nearly disambiguated treebank consistent with the current grammar version. The additional annotation required is minimized.

Another distinguishing factor of the Redwoods Treebank is that the analyses stored for each sentence are not CFG trees, but rather full-fledge HPSG (feature-structure/unification grammar) representations, including both syntax and semantics. For the purposes of this assignment, we'll be working with simplified representations of those trees which are CFG-like: simple atomic symbols for the nodes. However, rather than use categories like S, NP, VP, I've chosen to export the trees using the rule identifiers for each node. Here is a list of rule identifiers and a brief description for each rule.

This version of the Redwoods treebank covers two domains: customer service email and spoken dialogues regarding travel and meeting planning. The files we'll be working with represent a sample of each. Here is a short snippet of each one, to give you the flavor: customer service email and travel/meeting planning.

The files you will actually be working with have been preprocessed with a script written using Jeremy Kahn's Lingua::Treebank Perl library. This script takes each tree and takes it apart, printing out the CFG rule that would generate each subtree of depth one.

Your tasks

Treating each of the files (email and dialogue) separately, find:

Turn in the above, along with an explanation of how you did it .

(Hint #1: The unix utilities uniq and sort might be helpful in this. sort takes a file and sorts the lines in ascii order. uniq takes a file and condenses consecutive occurrences of the same line down to one. When called with the -c flag, it produces a count for each of these. sort can be made to sort on something other than the first column of text. For example, sort -k 2 foo sorts the contents of the file foo based on the second thing in each line, where things are separated by whitespace (default). For more on sort and unique, see this brief tutorial.)

(Hint #2: Once you've gotten the counts for each of the rewrite rules, you might consider calculating relative frequencies using a Perl script or a spreadsheet program.)

2. Feature structures in the LKB

In order to give you some experience with feature structures and unification, this part of the assignment asks you to do some grammar engineering, using the LKB. In particular, you will implement the feature-based analysis of agreement (subject-verb and determiner-noun) described in the chapter, and a simple feature-based version of subcategorization.

Getting started

Background: the LKB, types, and tdl syntax

grammar1 (used in assignment 3) looked like a CFG, but in fact, it, like grammar2, is a typed feature-structure (TFS) grammar, since that is what the LKB can parse. (The CFGs are just CFGs coded up as TFS grammars.) Furthermore, they are stated in Type Description Language (TDL), with the following syntax:

typename := supertype1 & supertype2 & ... & supertype3 &
 [ FEATURE_1 value_1,
   FEATURE_2 value_2,
      ...
   FEATURE_N value_n ].

Features with complex values look like this:

typename := supertype &
  [ FEATURE_1 [ FEATURE_2 value_2,
                FEATURE_3 value_3 ],
    FEATURE_4 value_4 ].

If you just want to talk about one feature inside the value of another, you can write it like this: (Note the period in between the feature names.)

typename := supertype &
   [ FEATURE_1.FEATURE_2 value2 ].

NB: The following is not equivalent to the above:

typename := supertype &
   [ FEATURE_2 value2 ].

Reentrancy is indicated with variables beginning with #. For example:

typename := supertype &
 [ FEATURE_2 #same,
   FEATURE_3.FEATURE_4 #same ].

Lists are represented as feature structures with two features, FIRST and REST. The value of FIRST is the first element on the list. The value of REST is another list, or *null*. (The relevant types are defined at the bottom of types.tdl.) You can use these features to constrain particular members of lists, or you can use the following notation:

typename := supertype &
  [ LIST_FEATURE < [ FEATURE_1 value_1 ] , ... > ].

typename := supertype &
  [ LIST_FEATURE < othertype , [ FEATURE_1 value_1 ] > ].

The first example above involves a list with at least one element on it. The second involves a list with exactly two elements on it.

The LKB requires that all features be declared for exactly one type (they can be inherited and further constrained by subtypes). If you try to introduce features with the same name on different types (and one is not a subtype of the other) it will complain.

The LKB distinguishes between types and instances, the latter to be found (in this grammar) in rules.tdl and lexicon.tdl. It is only because of this that we can 'overload' symbols like s and use them as both rule names and type names.

Here is a partial LKB/Grammar Engineering FAQ , which may be helpful.

Feature-based analyses of agreement and valence

To turn in your grammar for this part, create a tar archive, and upload it to CollectIt. You can create the tar archive by running the following command in the directory above `grammar2'.

tar czf grammar2.tgz grammar2/

Part 3: Using types to capture generalizations

Make a copy of grammar2 to modify for this part of the assignment:

cp -r grammar2 grammar3

For this part of the assignment, you should modify grammar3.

You may have noticed that you were typing the same thing over and over again in Part 1. This part of the assignment asks you to make use of the types to state each constraint exactly once, and make any instance or type that needs it inherit from that supertype.

Hints:

When you have a grammar that doesn't have any repeated constraints and that accounts for all of the data (both positive and negative) in test.items, create a tar archive of it, and turn it in!


Back to main course page