This lab will ask you to develop a python script in several versions to calculate the frequency of words in a corpus.
You will be asked to turn in the separate versions of the script as separate files, so be sure to save them all.
We'll practice using svn for this assignment. The instructions below assume a command-line interface to svn, e.g., on patas or on a mac in a terminal window.
mkdir lab4
svn import lab4 svn://lemur.ling.washington.edu/students/username/lab4
rmdir lab4
svn co svn://lemur.ling.washington.edu/students/username/lab4
svn add freq1.py
svn commit -m "Started on version 1."
svn commit -m "A working version!"
svn update
Download the file freq-starter.py, which gives the general structure of the frequency calculating script. Rename the file to something like freq1.py. Fill in the missing pieces as described in the comments in that file. Below are some descriptions of useful pieces of python for this purpose. To get a sense of how these pieces work, you are encouraged to try them out at the python prompt.
Test this script on the file GastonTranscript.txt, and save the output to turn in. Be sure to look at the output to see if it makes sense. Since your output file will be in csv (comma-separated values) format, you can open it in Excel or another spreadsheet program, if you prefer. You can run the script like this:
python freq1.py GastonTranscript.txt
The output file will be called GastonTranscript.txt.counts.
string = "Hello world\n"
words = string.split()
words
ex_dict = { "a":1, "b":2, "c":3 }
This dictionary has three keys (a, b, and c). The value associated with key a is 1. To ask for the value associated with a, you type:
ex_dict["a"]
(You can try that at the python prompt.)
The method keys returns a list of all the keys in the dictionary:
ex_dict.keys()
will return ["a", "b", "c"]. If you want the keys sorted by their values, from largest to smallest, you can use the following:
sorted(ex_dict, key=ex_dict.get, reverse=True)
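As a small illustration of this idiom, the sketch below sorts a made-up dictionary of counts and prints each key with its value, largest value first. The words and numbers are invented for the example and are not taken from any corpus.

```python
# Hypothetical counts, invented for this example.
example_counts = {"the": 5, "cat": 2, "sat": 3}

# Sort the keys by their values, largest value first.
ordered_keys = sorted(example_counts, key=example_counts.get, reverse=True)

# Print one "word,count" pair per line; str() turns the number into text.
for word in ordered_keys:
    print(word + "," + str(example_counts[word]))
```

Leaving out reverse=True would give the same keys from smallest value to largest.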
a = 1
a += 1
The result is that the value of a is now 2. You can add to the value associated with a key in a dictionary with this operator as well:
ex_dict = { "a":1, "b":2, "c":3 }
ex_dict["a"] += 1
Now the value of ex_dict["a"] is 2. However, this will give an error if "a" is not already a key in the dictionary (or if its value isn't numeric).
if ex_dict["a"] == 1:
    print("This is True!")
else:
    print("This is False!")
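One way to avoid the error mentioned above when a key is missing is to combine an if statement with a membership test (the in test, shown on its own right after this sketch). This is only a minimal sketch on a made-up word list; the variable names are illustrative and are not necessarily the ones used in freq-starter.py.

```python
# A made-up list of "words", invented for this example.
words = ["a", "b", "a", "c", "a"]

counts = {}
for word in words:
    if word in counts:
        # The key already exists, so += is safe.
        counts[word] += 1
    else:
        # First time we see this word: create the key.
        counts[word] = 1

print(counts["a"])
```

Writing word in counts has the same effect as word in counts.keys().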
"a" in ex_dict.keys()
outfile.write("hello world")
my_string = str(1)
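These two pieces are typically used together, since write() only accepts strings and a count is a number. The sketch below builds one comma-separated line and writes it to a scratch file. The file name example.counts and the variable names are made up for illustration.

```python
import os
import tempfile

word = "hello"
count = 3

# Build one comma-separated line. str() turns the number into text,
# and "\n" ends the line (write() does not add a newline for you).
line = word + "," + str(count) + "\n"

# Write to a scratch file so the example does not touch your real output.
path = os.path.join(tempfile.mkdtemp(), "example.counts")
outfile = open(path, "w")
outfile.write(line)
outfile.close()
```

Trying word + "," + count without str() raises a TypeError, because python will not join a string and a number with +.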
Once you get version 1 working, copy freq1.py to freq2.py. Also copy GastonTranscript.txt.counts to something like GastonTranscript.txt.counts1, since as you run freq2.py it will write output to that same file name. (Alternatively, you can edit the line defining outfile in freq2.py to write to something else.)
Now modify freq2.py to remove non-word characters from the "words" before checking whether the word is in the counts dictionary already. To do this, the re.sub() function will be helpful:
import re
re.sub(r'a','A','bananas')
re.sub(r'[aeiou]','*','facetiousness')
test_string = 'facetiousness'
re.sub(r'[aeiou]','*',test_string)
test_string
test_string = re.sub(r'[aeiou]','*',test_string)
test_string
Regular expressions are very useful, and you can read more about them here (among other places). For this assignment, however, the only special thing you really need to know is that \W matches any non-alphanumeric character (that is, anything other than a letter, a digit, or the underscore).
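For instance, \W can be combined with re.sub() to strip punctuation from a single token, as in the short sketch below. The tokens are made up for illustration. Note that a token consisting only of punctuation becomes the empty string, so you may want to decide whether that should count as a word.

```python
import re

# A made-up token with punctuation attached on both sides.
token = '"Hello,'

# Replace every non-word character with nothing, i.e. delete it.
cleaned = re.sub(r'\W', '', token)
print(cleaned)

# A token made only of punctuation ends up empty.
dashes = re.sub(r'\W', '', '--')
print(len(dashes))
```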
Test your freq2.py on GastonTranscript.txt and compare its output to that of freq1.py.
Once you get version 2 working, copy freq2.py to freq3.py. Also copy GastonTranscript.txt.counts to something like GastonTranscript.txt.counts2, since as you run freq3.py it will write output to that same file name. (Alternatively, you can edit the line defining outfile in freq3.py to write to something else.)
Now modify freq3.py to normalize all words to lower case before checking whether the word is in the counts dictionary already. To do this, the lower() method on strings will be helpful.
"Hello".lower()
test_string = "bAnAnAs"
test_string.lower()
test_string
test_string = test_string.lower()
test_string
Test your freq3.py on GastonTranscript.txt and compare its output to that of freq2.py.
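When you compare the two outputs, you should expect fewer distinct entries after lowercasing, because tokens that differ only in case collapse into one. The sketch below shows the effect on a made-up token list; it uses a set (a collection that keeps one copy of each distinct item) purely for illustration, not as the data structure the script needs.

```python
# A made-up token list, invented for this example.
tokens = ["The", "the", "THE", "cat"]

# Distinct tokens before normalization.
distinct_before = set(tokens)

# Distinct tokens after lowercasing each one.
distinct_after = set()
for token in tokens:
    distinct_after.add(token.lower())

print(len(distinct_before))
print(len(distinct_after))
```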
Once you get version 3 working, copy freq3.py to freq4.py.
Version 4 will be a bigger change from the previous ones, as the goal here is to generalize the script so that it can work on two different corpora stored on patas, each of which is spread across multiple files.
Running counts over an entire corpus is the kind of large job that we need to send to the "compute nodes" via condor, which will be described a bit more below. To develop the script, you will first create directories in your home directory to copy two sample files to. This can be done as follows on patas:
cd
mkdir LDC98T28-sample
mkdir LDC04T15-sample
Then use ls to see what is in the following two corpus directories:
/corpora/LDC/LDC98T28
/corpora/LDC/LDC04T15
You'll see that they have some additional structure under those top-level directories. Find the directory within those that has the actual data and copy two files from each into the appropriate -sample directory you made above. (The copy command on unix is cp.)
Modify freq4.py so that it takes a directory, rather than a file, and counts the frequencies of all words in all files in the directory. (You can assume that the directory has files in it, rather than further directories of files.) This will require the following:
files = os.listdir(sys.argv[1])
for f in files:
    infile = open(sys.argv[1] + "/" + f,'r')
    ... old while loop ...
    infile.close()
BE CAREFUL about whitespace as you do this, since python takes the level of indentation as the indication of which loop (or if statement, etc.) a line belongs to.
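To make the nesting concrete, here is a minimal, self-contained sketch of a loop over files with a line-reading loop inside it. It builds a scratch directory with two tiny made-up files instead of reading sys.argv[1], and it only counts tokens. The file names, the variable names, and the use of os.path.join (instead of joining with "/") are choices made for this illustration, and the inner while loop is a guess at the general shape of the one in your script.

```python
import os
import tempfile

# Build a scratch directory with two tiny files (stand-ins for corpus files).
sample_dir = tempfile.mkdtemp()
for name, text in [("one.txt", "a b\n"), ("two.txt", "b c\n")]:
    scratch = open(os.path.join(sample_dir, name), "w")
    scratch.write(text)
    scratch.close()

total_words = 0
for filename in os.listdir(sample_dir):
    # Everything indented under the for line runs once per file.
    infile = open(os.path.join(sample_dir, filename), "r")
    line = infile.readline()
    while line:
        # Indented one level further: runs once per line of the file.
        total_words += len(line.split())
        line = infile.readline()
    # Back at the for-loop level: close each file after reading it.
    infile.close()

print(total_words)
```

If infile.close() were indented under the while loop, the file would be closed after its first line, which is the kind of mistake the warning above is about.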
Test your freq4.py by running it on your sample directories:
python freq4.py ~/LDC98T28-sample
python freq4.py ~/LDC04T15-sample
Note that even on just two files, it can take a while. (For the two that I chose for LDC98T28, it takes about 20 seconds to run.) When you are satisfied with how it is working, use Condor to run the script on the actual directories under /corpora.
condor_submit freq.cmd