Linguistics 580: Computational Methods in Linguistic Analysis

Laboratory Exercise 5

Computational Methods in Linguistics (Bender/Wassink)

Goals:

Solve an example task related to merging information from two files
- in particular, add metadata for an individual speaker to a text file of running speech
Gain further experience with python
- especially subroutines, regular expressions and the split() function

svn

Once again, we recommend that you use svn for this assignment. Create a repository as you did for lab4, or simply add these files to that repository.

merge-demo-data.py

Like last week, we have provided a sample starter script, merge-demo-data-starter.py. Download the file, make a copy called merge-demo-data.py and add the copy to your svn repository.

The script is intended to be run on patas. It takes one argument: the complete path to the Switchboard transcript file to which demographic data should be added, e.g.:

python merge-demo-data.py /corpora/LDC/LDC93T4/trans/phase2/disc01/sw3294.txt

When it is complete, it will output a new file, with lines like this:

B.1:  Well, let's see.

replaced with lines like this:

"FEMALE",35,"SOUTH MIDLAND",B.1,"  Well, let's see."

It will also join utterances that are split over multiple lines (with just one speaker tag) into one line. In the example below, the three input lines are followed by the single output line they produce:

B.5:  Well, the thing I think that annoys me the most is, I have, I have young
children, a baby in the house and, and inevitably as soon as they're asleep,
someone calls on the phone trying to sell me something.

"FEMALE",35,"SOUTH MIDLAND",B.5,"  Well, the thing I think that annoys me the most is, I have, I have young children, a baby in the house and, and inevitably as soon as they're asleep, someone calls on the phone trying to sell me something."

That is, the demographic data from the file /corpora/LDC/LDC93T4/tables/tables/caller.tab is merged into each line in csv format.
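
As a rough illustration of the target format, the sketch below builds one output line from the values in the first example above. The name format_output_line and its parameters are hypothetical and are not part of the starter script; the sketch also assumes the transcript text contains no double quotes.

```python
def format_output_line(sex, age, dialect, label, text):
    """Build one csv-style line: the string fields are quoted, the age and label are not."""
    return '"%s",%s,"%s",%s,"%s"' % (sex, age, dialect, label, text)

# Values copied from the example output line above.
line = format_output_line("FEMALE", 35, "SOUTH MIDLAND", "B.1", "  Well, let's see.")
print(line)
```

The printed line should look the same as the sample output line shown earlier for B.1.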

The .doc files in the directory /corpora/LDC/LDC93T4/tables/tables/ contain metadata of different kinds. The fields in /corpora/LDC/LDC93T4/tables/tables/caller.tab are described in the file /corpora/LDC/LDC93T4/tables/tables/caller.doc. We may wish to merge selected demographic data with particular tokens in the running text of the transcription file to facilitate different kinds of analysis, such as generating frequency counts for the linguistic forms a given speaker produces. In this way, we are using metadata as annotations.

Read the comments in the file merge-demo-data.py and translate them to python code to create the desired functionality. Some of the comments are just descriptive. The ones that prompt you to write some code end with the line

#~*~

The next section of this web page describes some useful bits of python. Please ask questions, early and often :) !

Useful python

Defining subroutines: If you're going to do the same thing in several places in the code, it's best to create a subroutine that carries out the repeated operations and then call that subroutine. In this week's script, you are prompted to write one such subroutine, called create_demo_data(). create_demo_data() takes a list of fields (taken from one line of the caller table file) and returns a string with just the fields we're interested in, concatenated together and separated by commas.
In python, you define a subroutine with the keyword def, followed by the name of the subroutine, followed by parens containing zero or more variables naming the arguments to the subroutine, followed by a colon. The body of the subroutine is indented. Subroutines generally include a return statement specifying what value they should return when called.
Try the following at the python prompt:
```
def add(x,y):
    return x + y

add(1,2)
result = add(1,2)
result
```
Note: We hit return twice above after return x + y to get back to the python prompt. Then we call our subroutine with add(1,2).
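A subroutine can also take a list as an argument and return a string. The following is only a sketch on made-up data: pick_fields, sample_fields and the positions used are hypothetical, and the real positions for create_demo_data() have to be worked out from caller.doc. It uses the string method join(), which glues the items of a list together with the given separator.

```python
def pick_fields(fields, positions):
    """Return the items of fields at the given positions, joined by commas."""
    selected = []
    for i in positions:
        selected.append(fields[i])
    return ','.join(selected)

# Made-up data, not taken from caller.tab.
sample_fields = ['alpha', 'beta', 'gamma', 'delta']
result = pick_fields(sample_fields, [1, 3])
print(result)
```

Here result holds the items at positions 1 and 3 with a comma between them.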
Turning a line into a list of fields: You can turn a line into a list of fields with the split() function, as we did in freq.py.
Splitting on something other than whitespace: Note that if split() is given an argument, it splits on that argument instead of on whitespace (the default):
```
mystring = 'a::b::c::d'
fields = mystring.split('::')
fields
```
Accessing a field in a list: split() returns a list, whose items can be accessed by position, starting from 0. A range such as 0:3 returns a new list holding the items at positions 0 through 2. Continuing the example above, try:
```
fields[0]
fields[1]
fields[0:3]
```
This will be relevant when dealing with the demographic information in the file caller.tab. The file caller.doc explains the fields in caller.tab, in order.
Splitting just once: Sometimes, you want to split on only the first instance of a delimiter, but not later ones. split() supports this, too:
```
mystring = 'A: This is what we need: Wombats.'
twofields = mystring.split(':',1)
twofields
```
If you know how many items will be in a list (e.g., the list returned by split() when you've told it to split only on the first delimiter), you can assign those values to separate variables in one go:
```
[label, trans] = mystring.split(':',1)
label
trans
```
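The same idea can be tried on a transcript line of the kind shown at the top of this page. This is a small sketch, assuming the speaker tag is everything before the first colon:

```python
line = "B.1:  Well, let's see."

# Split only on the first colon, so any later colons stay in the transcript text.
label, trans = line.split(':', 1)

print(label)
print(repr(trans))   # repr() makes the leading spaces visible
```

Note that trans keeps the two spaces that follow the colon, just as in the sample output line for B.1.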
Testing lines for particular patterns: This week, we'll be using another function from the re package: re.search(). re.search() takes two arguments: a pattern (regular expression) and a string to try to match it to. If the pattern matches, it returns a "match object". If not, it returns None (which Python treats as a kind of False). When you try this at the python prompt, the match object will print out something like this: <_sre.SRE_Match object at 0x2b8778d49cb0> (the exact form depends on your version of python). This data structure has several methods defined for it (documented in the Python documentation for the re module), but for our purposes this week, it is enough to know that Python treats it as a kind of True. That is, if you use re.search() as the test in an if statement, and the pattern matches the string, then the test evaluates to True.
```
import re
mystring = 'A: This is what we need: Wombats.'
re.search(r'Wombats',mystring)
re.search(r'Wimbats',mystring)
```
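To see how this works as the test in an if statement, here is a minimal sketch. The subroutine name report is made up for illustration.

```python
import re

def report(pattern, string):
    """Return a short message saying whether pattern occurs anywhere in string."""
    if re.search(pattern, string):
        return 'match'
    else:
        return 'no match'

mystring = 'A: This is what we need: Wombats.'
print(report(r'Wombats', mystring))
print(report(r'Wimbats', mystring))
```

The first call takes the if branch because re.search() returns a match object; the second takes the else branch because it returns None.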
re.search() will look for the pattern anywhere in the string. If you want to anchor it to the beginning or end of the string, you can use the symbols ^ and $, respectively:
```
re.search(r'Wombats$',mystring)
re.search(r'Wombats\.$',mystring)
re.search(r'^A:',mystring)
re.search(r'^This',mystring)
```
The regular expression '^\s*$' matches any line that is empty or contains only white space.
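Putting these patterns together, one possible way to join continuation lines onto the line that carries the speaker tag is sketched below. This is not the starter script's code: group_utterances is a hypothetical name, the tag pattern assumes tags always look like A.1: or B.12: at the start of a line, and the A.6 line in the sample data is made up.

```python
import re

def group_utterances(lines):
    """Join continuation lines onto the preceding line that starts with a speaker tag."""
    utterances = []
    for line in lines:
        line = line.rstrip('\n')
        if re.search(r'^\s*$', line):
            continue                          # skip empty or whitespace-only lines
        if re.search(r'^[AB]\.\d+:', line):
            utterances.append(line)           # a new utterance starts here
        elif utterances:
            # continuation line: add it to the most recent utterance
            utterances[-1] = utterances[-1] + ' ' + line
    return utterances

sample = [
    "B.5:  Well, the thing I think that annoys me the most is, I have, I have young\n",
    "children, a baby in the house and, and inevitably as soon as they're asleep,\n",
    "someone calls on the phone trying to sell me something.\n",
    "\n",
    "A.6:  Uh-huh.\n",                       # made-up line for illustration
]
result = group_utterances(sample)
for utterance in result:
    print(utterance)
```

In this sketch, any untagged lines that come before the first speaker tag are simply ignored; whether that is the right treatment for the top of a real transcript file is something to check against the data.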

To turn in

Your finished version of merge-demo-data.py
A sample output file