Laboratory Exercise 5

Computational Methods in Linguistics (Bender/Wassink)

Goals:

svn

Once again, we recommend that you use svn for this assignment. Create a repository as you did for lab4, or simply add these files to that repository.

merge-demo-data.py

Like last week, we have provided a sample starter script, merge-demo-data-starter.py. Download the file, make a copy called merge-demo-data.py and add the copy to your svn repository.

The script is intended to be run on patas. It takes one argument: the complete path to the Switchboard transcript file to which demographic data should be added, e.g.:

python merge-demo-data.py /corpora/LDC/LDC93T4/trans/phase2/disc01/sw3294.txt
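As a minimal sketch of how a script can pick up that single argument, the function below checks the argument list and returns the path. The function name get_transcript_path and the usage message are hypothetical, chosen for illustration; the starter script may already handle this differently.

```python
import sys

def get_transcript_path(argv):
    """Return the transcript path from the argument list.

    argv is expected to look like sys.argv: the script name followed
    by exactly one argument. (Hypothetical helper, not part of the
    starter script.)
    """
    if len(argv) != 2:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("usage: python merge-demo-data.py <path-to-transcript-file>")
    return argv[1]
```

In the script itself this would be called as get_transcript_path(sys.argv).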

Once completed, the script will output a new file in which lines like this:

B.1:  Well, let's see.  

are replaced with lines like this:

"FEMALE",35,"SOUTH MIDLAND",B.1,"  Well, let's see."

It will also join utterances that are split over multiple lines (with just one speaker tag) into a single line. For example, the three input lines below become the one output line that follows them:

B.5:  Well, the thing I think that annoys me the most is, I have, I have young
children, a baby in the house and, and inevitably as soon as they're asleep,
someone calls on the phone trying to sell me something.

"FEMALE",35,"SOUTH MIDLAND",B.5,"  Well, the thing I think that annoys me the most is, I have, I have young children, a baby in the house and, and inevitably as soon as they're asleep, someone calls on the phone trying to sell me something."

That is, the demographic data from the file /corpora/LDC/LDC93T4/tables/tables/caller.tab is merged into each output line, which is written in CSV format.
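The two transformations shown in the examples above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the intended solution: it assumes every utterance begins with a speaker tag of the form A.1: or B.5:, it takes the demographic values as plain arguments rather than looking them up in caller.tab, and it doubles any double quotes inside the text (a common CSV convention that the examples above do not confirm). The names group_utterances and to_csv_line are hypothetical.

```python
import re

# Assumption: an utterance starts with a speaker tag such as "B.5:".
TAG = re.compile(r'^([AB]\.\d+):(.*)$')

def group_utterances(lines):
    """Join continuation lines onto the tagged line that starts the utterance."""
    utterances = []
    for line in lines:
        line = line.rstrip()          # drop the newline and trailing spaces
        if not line:
            continue                  # skip blank lines
        if TAG.match(line):
            utterances.append(line)   # a new utterance begins here
        elif utterances:
            utterances[-1] += " " + line  # continuation of the previous one
        # untagged lines before the first tag are ignored
    return utterances

def to_csv_line(sex, age, dialect, utterance):
    """Format one grouped utterance in the output style shown above."""
    tag, text = TAG.match(utterance).groups()
    text = text.replace('"', '""')    # assumed CSV-style quote escaping
    return '"%s",%d,"%s",%s,"%s"' % (sex, age, dialect, tag, text)
```

Calling group_utterances on the lines of a transcript and then to_csv_line on each result, with the speaker's demographic values, should produce lines in the format of the two examples above.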

The .doc files in this directory contain metadata of different kinds. The fields in /corpora/LDC/LDC93T4/tables/tables/caller.tab are described in the file /corpora/LDC/LDC93T4/tables/tables/caller.doc. We may wish to merge selected demographic data with particular tokens in the running text of the transcription file to facilitate different kinds of analysis, such as generating frequency counts for the linguistic forms a given speaker produces. In this way, we are using metadata as annotations.

Read the comments in the file merge-demo-data.py and translate them into Python code to create the desired functionality. Some of the comments are just descriptive. The ones that prompt you to write some code end with the line

#~*~

The next section of this web page describes some useful bits of Python. Please ask questions, early and often :) !

Useful python

To turn in