Laboratory Exercise 4

Computational Methods in Linguistics (Bender/Wassink)

Goals:

Specs

This lab will ask you to develop a Python script, in several versions, to calculate the frequency of words in a corpus.

You will be asked to turn in the separate versions of the script as separate files, so be sure to save them all.

svn

We'll practice using svn for this assignment. The instructions below assume a command-line interface to svn, e.g., on patas or in a terminal window on a Mac.

Version 1

Download the file freq-starter.py, which gives the general structure of the frequency calculating script. Rename the file to something like freq1.py. Fill in the missing pieces as described in the comments in that file. Below are some descriptions of useful pieces of Python for this purpose. To get a sense of how these pieces work, you are encouraged to try them out at the Python prompt.
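As one example of the kind of Python involved, the core counting step can be sketched like this (the variable names here are illustrative, not necessarily those used in freq-starter.py):

```python
# Count how often each word occurs, using a dictionary keyed on the word.
counts = {}
line = "the cat sat on the mat"
for word in line.split():      # split() breaks a line on whitespace
    if word in counts:
        counts[word] += 1      # seen before: increment its count
    else:
        counts[word] = 1       # first occurrence: start at 1
# counts["the"] is now 2
```

In the real script this loop runs over every line of the input file rather than a single string.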

Test this script on the file GastonTranscript.txt, and save the output to turn in. Be sure to look at the output to see if it makes sense. Since your output file will be in CSV (comma-separated values) format, you can open it in Excel or another spreadsheet program, if you prefer. You can run it like this:

python freq1.py GastonTranscript.txt

The output file will be called GastonTranscript.txt.counts.

Version 2

Once you get version 1 working, copy freq1.py to freq2.py. Also copy GastonTranscript.txt.counts to something like GastonTranscript.txt.counts1, since as you run freq2.py it will write output to that same file name. (Alternatively, you can edit the line defining outfile in freq2.py to write to something else.)

Now modify freq2.py to remove non-word characters from the "words" before checking whether the word is in the counts dictionary already. To do this, the re.sub() function will be helpful.
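Here is a small sketch of how re.sub() can strip non-word characters (try it at the Python prompt):

```python
import re

# re.sub(pattern, replacement, string) returns a copy of string with
# every match of pattern replaced.  \W matches any single non-word
# character (anything other than a letter, digit, or underscore).
clean = re.sub(r"\W", "", "word,")   # strips the trailing comma
# Note: \W also matches apostrophes, so "don't" becomes "dont".
```

Whether you want apostrophes (and hyphens) treated as non-word characters is a design decision worth noticing when you compare your outputs.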

Test your freq2.py on GastonTranscript.txt and compare its output to that of freq1.py.

Version 3

Once you get version 2 working, copy freq2.py to freq3.py. Also copy GastonTranscript.txt.counts to something like GastonTranscript.txt.counts2, since as you run freq3.py it will write output to that same file name. (Alternatively, you can edit the line defining outfile in freq3.py to write to something else.)

Now modify freq3.py to normalize all words to lower case before checking whether the word is in the counts dictionary already. To do this, the lower() method on strings will be helpful.
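At the Python prompt, lower() behaves like this:

```python
# lower() returns a lower-cased copy; strings are immutable, so the
# result must be assigned somewhere to be kept.
word = "The"
word = word.lower()   # word is now "the"
```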

Test your freq3.py on GastonTranscript.txt and compare its output to that of freq2.py.

Version 4

Once you get version 3 working, copy freq3.py to freq4.py.

Version 4 will be a bigger change from the previous ones, as the goal here is to generalize the script so that it can work on two different corpora stored on patas, each of which spans multiple files.

Running counts over an entire corpus is the kind of large job that we need to send to the "compute nodes" via condor, which will be described a bit more below. To develop the script, you will first create directories in your home directory to hold two sample files copied from each corpus. This can be done as follows on patas:

cd
mkdir LDC98T28-sample
mkdir LDC04T15-sample

Then use ls to see what is in the following two corpus directories:

/corpora/LDC/LDC98T28
/corpora/LDC/LDC04T15

You'll see that they have some additional structure under those top-level directories. Find the directory within those that has the actual data and copy two files from each into the appropriate -sample directory you made above. (The copy command on unix is cp.)

Modify freq4.py so that it takes a directory, rather than a file, and counts the frequencies of all words in all files in that directory. (You can assume that the directory contains only files, not further subdirectories.) This will require iterating over every file in the directory.
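One way to loop over the files in a directory is with os.listdir() and os.path.join(); the sketch below is illustrative (the function and variable names are made up, not taken from the starter file):

```python
import os

def words_in_directory(directory):
    """Yield every whitespace-separated token from every file in directory."""
    for filename in os.listdir(directory):        # bare file names only
        path = os.path.join(directory, filename)  # build the full path
        with open(path) as infile:
            for line in infile:
                for word in line.split():
                    yield word
```

Your script can feed each word this produces into the same counting code you already have.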

Test your freq4.py by running it on your sample directories:

python freq4.py ~/LDC98T28-sample
python freq4.py ~/LDC04T15-sample

Note that even on just two files, it can take a while. (For the two that I chose for LDC98T28, it takes about 20 seconds to run.) When you are satisfied with how it is working, use Condor to run the script on the actual directories under /corpora.
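For the top-10 lists requested under "To turn in," one way to rank the counts dictionary is sorted() with a key function (a sketch with made-up numbers):

```python
# counts.items() gives (word, count) pairs; sort them by count,
# largest first, and keep the first ten.
counts = {"the": 120, "a": 75, "cat": 3}
ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
top10 = ranked[:10]
# ranked[0] is ("the", 120)
```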

To turn in

  1. freq1.py, freq2.py, freq3.py, freq4.py
  2. Counts from GastonTranscript.txt for each of freq1.py, freq2.py, and freq3.py.
  3. A list of the top 10 most frequent words in each of the two corpora (per freq4.py).
  4. Answers to the following questions:
    1. Did all of freq1.py, freq2.py, and freq3.py find the same number of word tokens in GastonTranscript.txt? How many?
    2. What number of word types did each of freq1.py, freq2.py, and freq3.py find?
    3. How many and which of the top 10 most frequent words in your two file sample of LDC98T28 were in the top 10 for the whole corpus? What about for LDC04T15?
    4. How does the mark-up (annotation) in the corpus files affect the membership of the top 10 list?
    5. How does it affect the relative frequency calculation?