Once again, we recommend that you use svn for this assignment. Create a repository as you did for lab4, or simply add these files to that repository.
Like last week, we have provided a sample starter script, merge-demo-data-starter.py. Download the file, make a copy called merge-demo-data.py and add the copy to your svn repository.
The script is intended to be run on patas. It takes one argument, the complete path to the Switchboard transcript file that demographic data should be added to, e.g.:
python merge-demo-data.py /corpora/LDC/LDC93T4/trans/phase2/disc01/sw3294.txt
When it is complete, it will output a new file, with lines like this:
B.1: Well, let's see.
replaced with lines like this:
"FEMALE",35,"SOUTH MIDLAND",B.1," Well, let's see."
It will also group the utterances which are split over multiple lines (with just one speaker tag) into one line:
B.5: Well, the thing I think that annoys me the most is, I have, I have young children, a baby in the house and, and inevitably as soon as they're asleep, someone calls on the phone trying to sell me something.
becomes:
"FEMALE",35,"SOUTH MIDLAND",B.5," Well, the thing I think that annoys me the most is, I have, I have young children, a baby in the house and, and inevitably as soon as they're asleep, someone calls on the phone trying to sell me something."
That is, the demographic data from the file /corpora/LDC/LDC93T4/tables/tables/caller.tab is merged into each line in csv format.
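As a rough sketch of the target format, one output line could be assembled with plain string formatting. The helper name format_row and its parameters are made up for illustration and are not part of the starter script. In the sample output the age and the speaker label are unquoted, so this sketch builds the string by hand instead of relying on a csv library's quoting rules.

```python
def format_row(sex, age, dialect, label, text):
    """Build one output line in the format shown above (hypothetical helper)."""
    # Sex, dialect and utterance text are quoted; age and speaker label are not.
    return '"%s",%d,"%s",%s,"%s"' % (sex, age, dialect, label, text)

row = format_row("FEMALE", 35, "SOUTH MIDLAND", "B.1", " Well, let's see.")
print(row)
```

The leading space in the utterance text is kept, matching the sample output above.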
The .doc files in the directory /corpora/LDC/LDC93T4/tables/tables/ contain metadata of different kinds. The fields in /corpora/LDC/LDC93T4/tables/tables/caller.tab are described in the file /corpora/LDC/LDC93T4/tables/tables/caller.doc. We may wish to merge selected demographic data with particular tokens in the running text of the transcription file to facilitate different kinds of analysis, such as generating frequency counts for the linguistic forms a given speaker produces. In this way, we are using metadata as annotations.
Read the comments in the file merge-demo-data.py and translate them into python code to create the desired functionality. Some of the comments are just descriptive. The ones that prompt you to write some code end with the line
#~*~
The next section of this web page describes some useful bits of python. Please ask questions, early and often :) !
In python, you define a subroutine with the keyword def, followed by the name of the function, followed by parens containing zero or more variables naming the arguments to the subroutine, followed by a colon. Subroutines generally include a return statement specifying what value they should return when called.
Try the following at the python prompt:
def add(x,y):
    return x + y

add(1,2)
result = add(1,2)
result
Note: We hit return twice above after return x + y to get back to the python prompt. Then we call our subroutine with add(1,2).
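The string method split() breaks a string into a list of fields wherever the separator occurs:
mystring = 'a::b::c::d'
fields = mystring.split('::')
fields
Individual fields, or a range of fields, can then be picked out by index:
fields[0]
fields[1]
fields[0:3]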
This will be relevant when dealing with the demographic information in the file caller.tab. The file caller.doc explains the fields in caller.tab, in order.
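A minimal sketch of applying split() to one line of a delimited table follows. The sample line, the comma separator and the quoting are assumptions made up for illustration; check caller.tab and caller.doc for the actual layout and field order. The helper name parse_fields is hypothetical.

```python
def parse_fields(line, sep=','):
    """Split a delimited line, then strip white space and double quotes from each field."""
    return [field.strip().strip('"') for field in line.split(sep)]

# Made-up sample line; the real field order is documented in caller.doc.
sample = '9999, "FEMALE", "SOUTH MIDLAND"'
fields = parse_fields(sample)
print(fields[1])
```

This simple approach would break if a quoted field itself contained the separator, so it is only suitable when that cannot happen.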
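split() also takes an optional second argument, the maximum number of splits to make:
mystring = 'A: This is what we need: Wombats.'
twofields = mystring.split(':',1)
twofields
The resulting fields can be assigned directly to variables:
[label, trans] = mystring.split(':',1)
label
trans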
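The re module lets you search a string for a regular expression. re.search() returns a match object if the pattern is found, and None otherwise:
import re
mystring = 'A: This is what we need: Wombats.'
re.search(r'Wombats',mystring)
re.search(r'Wimbats',mystring)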
re.search() will look for the pattern anywhere in the string. If you want to anchor it to the beginning or end of the string, you can use the symbols ^ and $, respectively:
re.search(r'Wombats$',mystring)
re.search(r'Wombats\.$',mystring)
re.search(r'^A:',mystring)
re.search(r'^This',mystring)
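An anchored pattern is one way to tell whether a transcript line begins with a speaker tag. The sketch below assumes tags always look like B.1: or A.12: (a speaker letter A or B, a period, a number and a colon), based only on the examples on this page; the function name is made up for illustration.

```python
import re

def starts_with_speaker_tag(line):
    """Return True if the line begins with a tag such as 'B.1:' (assumed shape)."""
    # ^ anchors the match to the start of the line; \. matches a literal period.
    return re.search(r'^[AB]\.\d+:', line) is not None

print(starts_with_speaker_tag("B.1: Well, let's see."))       # True
print(starts_with_speaker_tag("someone calls on the phone"))  # False
```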
The regular expression '^\s*$' matches any line that only contains white space.
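One possible way to combine these pieces when grouping utterances that are split over several lines is sketched below. It assumes that continuation lines carry no speaker tag, that tags look like B.5:, and that white-space-only lines can be skipped; the function name and the second sample utterance are made up for illustration, and the starter script's comments may call for a different structure.

```python
import re

def group_utterances(lines):
    """Join continuation lines onto the line that carries the speaker tag."""
    utterances = []
    for line in lines:
        line = line.rstrip('\n')
        if re.search(r'^\s*$', line):
            continue  # skip lines that contain only white space
        if re.search(r'^[AB]\.\d+:', line):
            utterances.append(line)  # a new utterance starts here
        elif utterances:
            # no speaker tag: treat as a continuation of the previous utterance
            utterances[-1] += ' ' + line.strip()
    return utterances

sample = [
    "B.5: Well, the thing I think that annoys me the most is,\n",
    "I have young children,\n",
    "\n",
    "A.6: Uh-huh.\n",  # made-up utterance for illustration
]
grouped = group_utterances(sample)
print(len(grouped))
```

Untagged lines that appear before the first speaker tag, such as a file header, are dropped by this sketch.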