Linguistics 580: Computational Methods in Linguistic Analysis

Laboratory Exercise 4

Computational Methods in Linguistics (Bender/Wassink)

Goals:

To get experience using svn
To become familiar with methods for counting strings in unannotated corpora:
- What python (and also R) functions are useful for frequency counts
- How to output frequency counts to a table or other textfile for further manipulation elsewhere
To become familiar with means of manipulating directory structures from within python

Specs

This lab will ask you to develop a python script in several versions to calculate the frequency of words in a corpus.

Version 1: counts the frequency of words (defined as strings of non-white-space characters) in a single file. Sends the results to an output file, while printing to STDOUT the total number of word types and tokens.
Version 2: same as Version 1, but removes punctuation characters.
Version 3: same as Version 2, but normalizes all words to lower case.
Version 4: handles a corpus that is stored in multiple files.

You will be asked to turn in the separate versions of the script as separate files, so be sure to save them all.

svn

We'll practice using svn for this assignment. The instructions below assume a command-line interface to svn, e.g., on patas or on a mac in a terminal window.

Create a directory called lab4: Navigate in the command line to the directory you want lab4 to be in, and then do:
```
mkdir lab4
```
Create a repository for lab4. Still in the directory above lab4, type (replacing "username" with your actual user name):
```
svn import lab4 svn://lemur.ling.washington.edu/students/username/lab4
```
Now remove the lab4 directory on patas:
```
rmdir lab4
```

And check out lab4 from the svn repository:

svn co svn://lemur.ling.washington.edu/students/username/lab4

Now whenever you add a file to that directory that you want svn to keep track of, use svn add. For example, when you've created freq1.py (see below), add it like this (from inside the lab4 directory):
```
svn add freq1.py
svn commit -m "Started on version 1."
```
Also use the svn commit command to save changes in existing files to lemur:
```
svn commit -m "A working version!"
```
And finally, you can check out the repository onto different machines (e.g., patas and your laptop). To pull down changes checked in from the other machine (again, from within lab4):
```
svn update
```

Version 1

Download the file freq-starter.py, which gives the general structure of the frequency calculating script. Rename the file to something like freq1.py. Fill in the missing pieces as described in the comments in that file. Below are some descriptions of useful pieces of python for this purpose. To get a sense of how these pieces work, you are encouraged to try them out at the python prompt.

Test this script on the file GastonTranscript.txt, and save the output to turn in. Be sure to look at the output to see if it makes sense. Since your output file will be in csv (comma-separated values) format, you can open it in Excel or another spreadsheet program, if you prefer. You can run it like this:

python freq1.py GastonTranscript.txt

The output file will be called GastonTranscript.txt.counts.

Reading a file line by line: The function readlines(), already included in the starter script, does this for you.
Splitting strings into lists of words: The method split() on strings splits the string into a list of elements. By default (if you don't give it an argument), it splits on white space. Try the following at the python prompt:
```
string = "Hello world\n"
words = string.split()
words
```
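A slightly messier input shows what "splitting on white space" means in practice. This is only an illustrative sketch; the variable names are made up for the example.

```python
line = "  Hello   world,\tgoodbye\n"
words = line.split()
# words is ['Hello', 'world,', 'goodbye']: runs of spaces, tabs, and newlines
# each count as a single separator, and punctuation stays attached to the word
```

The comma that stays attached to "world," is the reason Version 2 removes punctuation.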
Dictionaries: A dictionary is a data structure that stores pairs of keys and values. For example:
```
ex_dict = { "a":1, "b":2, "c":3 }
```
This dictionary has three keys (a, b, and c). The value associated with key a is 1. To ask for the value associated with a, you type:
```
ex_dict["a"]
```
(You can try that at the python prompt.)
The method keys returns a list of all the keys in the dictionary:
```
ex_dict.keys()
```
will return the keys "a", "b", and "c" (as a list in Python 2; the order is not guaranteed). If you want the keys sorted from highest value to lowest, you can use the following:
```
sorted(ex_dict, key=ex_dict.get, reverse=True)
```
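As a quick check of what this call returns, here is a small sketch with a hypothetical dictionary `freqs` whose values are all different. Slicing the sorted list is one possible way to get a "top N" list.

```python
# Hypothetical counts, for illustration only.
freqs = {"the": 5, "cat": 2, "saw": 1}

by_count = sorted(freqs, key=freqs.get, reverse=True)
# by_count is ['the', 'cat', 'saw']: keys ordered from highest to lowest value

top_two = by_count[:2]   # the first two keys, i.e. the two most frequent
```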
To increment (add 1 to) a value, you can use +=:
```
a = 1
a += 1
```
The result is that the value of a is now 2. You can add to the value associated with a key in a dictionary with this operator as well:
```
ex_dict = { "a":1, "b":2, "c":3 }
ex_dict["a"] += 1
```
Now the value of ex_dict["a"] is 2. However, this will give an error if "a" is not already a key in the dictionary (or if its value isn't numeric).

if statements in python look like this:

if ex_dict["a"] == 1:
   print "This is True!"
else:
   print "This is False!"

To test whether an element is in a list (e.g., in the value returned by ex_dict.keys()), use in:
```
"a" in ex_dict.keys()
```
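Putting the last few pieces together, here is a minimal sketch of counting words with a dictionary. The names `words` and `counts` are assumptions made for this example and are not taken from freq-starter.py. Testing membership with in before incrementing avoids the error that += gives on a missing key.

```python
# Illustrative sketch; variable names are assumptions, not the starter script's.
words = "the cat saw the other cat".split()

counts = {}
for word in words:
    if word in counts:       # same result as: word in counts.keys()
        counts[word] += 1    # seen before: add 1 to the existing count
    else:
        counts[word] = 1     # first occurrence: create the key

num_tokens = len(words)      # every occurrence of a word is one token
num_types = len(counts)      # every distinct word is one type
```

Here num_tokens is 6 and num_types is 4, since "the" and "cat" each occur twice.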
You can write to a file opened for writing (like outfile in the starter script) by using the method write: (NB this one won't work at the python prompt unless you define outfile first.)
```
outfile.write("hello world")
```
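To try write() at the python prompt, you would first need to define outfile yourself. The sketch below does that with a throwaway file; the path and file name are hypothetical, and the starter script defines its own outfile.

```python
import os
import tempfile

# Hypothetical location for a scratch file; not the name the starter script uses.
path = os.path.join(tempfile.mkdtemp(), "example.counts")

outfile = open(path, "w")         # "w" opens the file for writing
outfile.write("hello world\n")    # write() does not add a newline for you
outfile.close()                   # closing makes sure the text reaches the disk

contents = open(path).read()      # read it back to check what was written
```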
Concatenation: The argument to write() is a string. You can construct a string out of smaller parts by using the + operator. When + has two numbers as arguments, it signifies addition. When it has two strings, it signifies concatenation. When it gets one of each, it gives an error.
Converting integers to strings: You can convert an integer to a string with the function str:
```
my_string = str(1)
```
Division: Finally, you will want to use division to find the relative frequency (percent of total tokens) for each word type. The division operator in python is /. The statement from __future__ import division at the top of the file causes this operator to work the way you expect it to. (Without it, Python 2 discards the remainder when dividing two integers, so 1/4 gives 0.)
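The last three items (concatenation, str, and division) can be combined to build one line of csv output. This is a sketch under assumed values: `word`, `count`, and `total_tokens` are hypothetical, and the column layout is only one possibility, not necessarily the one the starter script expects.

```python
from __future__ import division   # makes / give 0.25 rather than 0 in Python 2

# Hypothetical values, for illustration only.
word = "the"
count = 3
total_tokens = 12

rel_freq = count / total_tokens   # 0.25; multiply by 100 for a percentage
line = word + "," + str(count) + "," + str(rel_freq) + "\n"
# line is "the,3,0.25\n", ready to be passed to outfile.write()
```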

Version 2

Once you get version 1 working, copy freq1.py to freq2.py. Also copy GastonTranscript.txt.counts to something like GastonTranscript.txt.counts1, since as you run freq2.py it will write output to that same file name. (Alternatively, you can edit the line defining outfile in freq2.py to write to something else.)

Now modify freq2.py to remove non-word characters from the "words" before checking whether the word is in the counts dictionary already. To do this, the re.sub() function will be helpful:

re.sub() takes three arguments: a regular expression to match in the string, a string to replace all instances of that regular expression with, and a string to do the matching in. Its output is a string with the replacement done. To see it in action, try the following at the python prompt:
1. Import the re package (this only needs to be done once per interactive session with python; in scripts, this is at the start of the file):
```
import re
```
2. Then try some examples:
```
re.sub(r'a','A','bananas')
re.sub(r'[aeiou]','*','facetiousness')
```
3. Note that the function re.sub() returns a modified string, but does not modify the string it is passed. To see the effect of this try the following at the Python prompt:
```
test_string = 'facetiousness'
re.sub(r'[aeiou]','*',test_string)
test_string
test_string = re.sub(r'[aeiou]','*',test_string)
test_string
```
Regular expressions are very useful, and you can read more about them in the documentation for Python's re module (among other places). For this assignment, however, the only special thing you really need to know is that \W matches any non-word character, that is, any character other than a letter, a digit, or the underscore.
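Here is a short sketch of \W in use. The word list is made up for the example. Two details are worth noticing: the underscore survives, and a "word" made up entirely of punctuation becomes the empty string, which you may or may not want to count.

```python
import re

# Made-up input, for illustration only.
raw_words = ["Hello,", "world!", "don't", "snake_case", "--"]

cleaned = []
for w in raw_words:
    cleaned.append(re.sub(r'\W', '', w))   # replace each \W match with nothing

# cleaned is ['Hello', 'world', 'dont', 'snake_case', '']
```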

Test your freq2.py on GastonTranscript.txt and compare its output to that of freq1.py.

Version 3

Once you get version 2 working, copy freq2.py to freq3.py. Also copy GastonTranscript.txt.counts to something like GastonTranscript.txt.counts2, since as you run freq3.py it will write output to that same file name. (Alternatively, you can edit the line defining outfile in freq3.py to write to something else.)

Now modify freq3.py to normalize all words to lower case before checking whether the word is in the counts dictionary already. To do this, the lower() method on strings will be helpful.

Case normalization: The method lower() on strings returns a version of the string with all lower case letters. As with re.sub() it does not modify the original string. Try the following at the python prompt to get a sense of how it works:
```
"Hello".lower()
test_string = "bAnAnAs"
test_string.lower()
test_string
test_string = test_string.lower()
test_string
```
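The sketch below shows the effect of both normalizations on the counts. The helper `normalize` is hypothetical (it is not part of the starter script); it simply chains re.sub() and lower() before the dictionary lookup.

```python
import re

def normalize(word):
    """Hypothetical helper: strip non-word characters, then lowercase."""
    return re.sub(r'\W', '', word).lower()

counts = {}
for word in "The cat. the CAT!".split():
    word = normalize(word)    # normalize BEFORE checking the dictionary
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1

# Without normalization there would be four types; with it there are two.
```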

Test your freq3.py on GastonTranscript.txt and compare its output to that of freq2.py.

Version 4

Once you get version 3 working, copy freq3.py to freq4.py.

Version 4 will be a bigger change from the previous ones, as the goal here is to generalize the script so that it can work on two different corpora stored on patas, each of which is stored across multiple files.

Running counts over an entire corpus is the kind of large job that we need to send to the "compute nodes" via condor, which will be described a bit more below. To develop the script, you will first create directories in your home directory to copy two sample files to. This can be done as follows on patas:

cd
mkdir LDC98T28-sample
mkdir LDC04T15-sample

Then use ls to see what is in the following two corpus directories:

/corpora/LDC/LDC98T28
/corpora/LDC/LDC04T15

You'll see that they have some additional structure under those top-level directories. Find the directory within those that has the actual data and copy two files from each into the appropriate -sample directory you made above. (The copy command on unix is cp.)

Modify freq4.py so that it takes a directory, rather than a file, and counts the frequencies of all words in all files in the directory. (You can assume that the directory has files in it, rather than further directories of files.) This will require the following:

Remove the statement infile = open(sys.argv[1],'r') from near the beginning and infile.close() from near the end.
Get the list of files in the directory. That can be done with the following:
```
files = os.listdir(sys.argv[1])
```
Wrap the while loop in a larger for loop that loops through the file names in the list files:
```
for f in files:

    infile = open(sys.argv[1] + "/" + f,'r')

    ... old while loop ...

    infile.close()
```
BE CAREFUL about whitespace as you do this, since python takes the level of indentation as the indication of which loop (or if statement, etc.) a line belongs to.
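The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the starter script: it builds a tiny throwaway directory instead of reading sys.argv[1], it uses a for loop over readlines() in place of the "old while loop", and it uses os.path.join where the snippet above uses + "/" +.

```python
import os
import shutil
import tempfile

# Build a tiny throwaway "corpus" directory so the sketch can run anywhere.
corpus_dir = tempfile.mkdtemp()
for name, text in [("a.txt", "one two\n"), ("b.txt", "two three\n")]:
    with open(os.path.join(corpus_dir, name), "w") as f:
        f.write(text)

counts = {}   # created once, BEFORE the loop over files, so counts accumulate
for fname in os.listdir(corpus_dir):
    # os.path.join does the same job as corpus_dir + "/" + fname
    infile = open(os.path.join(corpus_dir, fname), "r")
    for line in infile.readlines():
        for word in line.split():
            if word in counts:
                counts[word] += 1
            else:
                counts[word] = 1
    infile.close()   # indented one level: runs once per file, not once per line

shutil.rmtree(corpus_dir)   # clean up the throwaway directory
```

Note where counts is created: if that line were inside the loop over files, the dictionary would be reset for every file and only the last file's counts would survive.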

Test your freq4.py by running it on your sample directories:

python freq4.py ~/LDC98T28-sample
python freq4.py ~/LDC04T15-sample

Note that even on just two files, it can take a while. (For the two that I chose for LDC98T28, it takes about 20 seconds to run.) When you are satisfied with how it is working, use Condor to run the script on the actual directories under /corpora.

Documentation on running Condor on patas
A sample Condor submit file:
- Save this to your directory on patas.
- It's set up to run a script called freq4.py on the LDC98T28 directory.
- Look at the .cmd file and adjust it if necessary, then invoke it like this:
```
condor_submit freq.cmd
```
- The output (.counts) file will appear in the directory from which you invoke condor_submit, and you should receive an email notification (to your UW email) when the job is done.

To turn in

freq1.py, freq2.py, freq3.py, freq4.py
Counts from GastonTranscript.txt for each of freq1.py, freq2.py, and freq3.py.
A list of the top 10 most frequent words in each of the two corpora (per freq4.py).
Answers to the following questions:
1. Did all of freq1.py, freq2.py, and freq3.py find the same number of word tokens in GastonTranscript.txt? How many?
2. What number of word types did each of freq1.py, freq2.py, and freq3.py find?
3. How many and which of the top 10 most frequent words in your two-file sample of LDC98T28 were in the top 10 for the whole corpus? What about for LDC04T15?
4. How does the mark-up (annotation) in the corpus files affect the membership of the top 10 list?
5. How does it affect the relative frequency calculation?