Linguistics 580: Computational Methods in Linguistic Analysis

Laboratory Exercise 2

Computational Methods in Linguistics (Bender/Wassink)

Goals:

To gain hands-on experience running two different types of annotation-producing software
To understand what the output formats look like
To understand the extent to which the programs are reliable
To see what is needed to do basic phone alignment over a sound file that was recorded in a conventional way for sociophonetic analysis, together with its orthographic transcription
To get further practice with the Unix command line (including managing the input and output of programs)

Due: Friday 4/6, 5pm.

Both parts of this assignment are to be carried out on patas, so step one is logging in to patas.

Automatic parsing

Log in to patas.
Invoke the Stanford parser over a sample from the Brown corpus, which I've stored in a file called lab2-input.txt in my home directory.
```
lexparser.csh ~ebender/lab2-input.txt
```
This will cause the parser output to be printed out on your screen, which is maybe interesting, but not very helpful. (NB: Some of what is printed is error messages about the parser not having enough memory. If we were interested in using the output of the parser as automatic annotations, we would need to adjust its memory settings to get rid of these errors, but for present purposes they are not problematic.)
Now try again, redirecting the parser output to a file in your home directory:
```
lexparser.csh ~ebender/lab2-input.txt > ~/lab2-output.txt
```
This will put the output into the file called lab2-output.txt in your home directory, but still print some messages to the screen.
Try one more time, this time redirecting both the output and the messages to separate files:
```
lexparser.csh ~ebender/lab2-input.txt 1> ~/lab2-output.txt 2> ~/lab2-output.err
```
Use less to examine the contents of the output file and the error file.
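The difference between the three invocations above comes down to which output stream goes where. The following is a minimal sketch, not the parser itself: demo_tool is a hypothetical stand-in that writes one line to standard output (stream 1) and one line to standard error (stream 2), and the file names are made up for illustration.

```shell
# Hypothetical stand-in for a program that writes its results to stdout
# and its diagnostic messages to stderr (as the parser does).
demo_tool() {
    echo "(ROOT (S (NP (NN result))))"
    echo "warning: this is a diagnostic message" >&2
}

# Work in a scratch directory so nothing in your home directory is touched.
workdir=$(mktemp -d)

# Only stdout goes to the file; the warning still appears on the screen.
demo_tool > "$workdir/demo-output.txt"

# stdout and stderr go to separate files; nothing appears on the screen.
demo_tool 1> "$workdir/demo-output.txt" 2> "$workdir/demo-output.err"

# Count the lines in each file to confirm where each stream ended up.
wc -l < "$workdir/demo-output.txt"
wc -l < "$workdir/demo-output.err"
```

Note that > and 1> mean the same thing; writing the 1 explicitly just makes the contrast with 2> easier to see.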
You can also copy the file from your patas account to your local machine. If you have a Unix terminal on your local machine, you can do this with scp. Issue the following command on your local machine (replacing username with your actual patas username):
```
scp username@patas.ling.washington.edu:~/lab2-output.txt .
```
The computer should prompt you for your patas password and then copy the file to your current directory on your local machine.
This file was also annotated by hand as part of the Penn Treebank. We have it on patas at the following path:
```
/corpora/LDC/LDC99T42/RAW/parsed/mrg/brown/cf/cf01.mrg
```
Use less to examine the contents of the hand-annotated file. (Note that there are more sentences in that file.) You may also want to use a text editor, either on patas or on your local machine, to examine these files.
Answer the following questions:
1. How many sentences were given to the parser?
2. How many sentences did the parser find a parse for?
3. For the first sentence, list all of the NPs, according to the Stanford parser (e.g., the first one is "American romance").
4. For the first sentence, describe the differences between the automatic parse and the hand parse. In particular, provide one example of each of the following:
  1. A place where the structure is the same, but the labels are different.
  2. A place where the structure is different.
  Below the constituent structure for the sentence is a set of lines giving a dependency structure.
5. What do the numbers after the words in these lines indicate? Why would these be necessary?
6. Brainstorm a scenario in which either the constituent structure or the dependency structure could be useful for your research. Would the other structure be useful as well? Why or why not?
7. How would parser errors affect the usefulness of these automatic annotations for that research question?

Forced alignment

Background information

DARPABET: http://www.speech.cs.cmu.edu/cgi-bin/cmudict, http://en.wikipedia.org/wiki/Arpabet
Legend for transcriptional conventions (Dubois 1991): http://depts.washington.edu/sociolab/Documents/transcription%20conventions.pdf

Log in to patas.
Make a copy of the files we'll be using for this assignment in your home directory on patas:
```
cp -r ~ebender/p2fa-files ~
```
Copy the files from patas to your local machine, too (the following command should be issued from your local machine):
```
scp -r username@patas.ling.washington.edu:~/p2fa-files .
```
To orient yourself to this conversation, open and inspect the long version of the orthographic transcript for this soundfile (in Microsoft Word...horrors!). It is a typical conversation transcript, such as is commonly used in sociolinguistics. Observe how many speakers participated in the conversation.
p2fa requires a plain text transcript, with only the orthographic transcription and no further annotations. Copy the file shortscrappletranscript.txt into a new file called cleanscrappletranscript.txt. You can do this on patas with the text editor emacs, or on your local machine. If you do it on your local machine, be sure to copy the new file back to patas after you've made the modifications below.
- Open cleanscrappletranscript.txt in a text editor (so that when you save it, it is still .txt and not .rtf).
- Remove all coding conventions from cleanscrappletranscript.txt.
- Replace all smart quotes in cleanscrappletranscript.txt with regular quotes or apostrophes, as appropriate.
- If you've been working locally, copy cleanscrappletranscript.txt to patas:
```
scp cleanscrappletranscript.txt username@patas.ling.washington.edu:~/p2fa-files/
```
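If you would rather not replace the smart quotes by hand, the step can be sketched with sed. This is only an illustration under stated assumptions: the sample line and the file names sample.txt and sample-clean.txt are made up, and the sketch assumes the transcript is saved as UTF-8. It handles only the quote replacement, not the removal of coding conventions.

```shell
# Work in a scratch directory with a made-up sample line.
workdir=$(mktemp -d)
printf '%s\n' 'she said “that’s scrapple”' > "$workdir/sample.txt"

# Replace each curly quote character with its plain ASCII counterpart.
# Each character gets its own substitution so the command does not
# depend on how the locale handles multibyte bracket expressions.
sed -e 's/“/"/g' -e 's/”/"/g' \
    -e "s/‘/'/g" -e "s/’/'/g" \
    "$workdir/sample.txt" > "$workdir/sample-clean.txt"

# Show the cleaned line.
cat "$workdir/sample-clean.txt"
```

The output is written to a new file rather than over the input, so you can compare the two and check the result by eye before using it.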
Invoke p2fa on patas with the command align.py (the full path to align.py is /NLP_TOOLS/speech_tools/p2fa/latest/align.py, but we've added it to the $PATH variable on patas so it can be invoked as just align.py).
```
cd ~/p2fa-files
align.py SP19CF2J__SP20CF2J_conversationalshort.wav cleanscrappletranscript.txt scrapple.TextGrid
```
NB: Tab completion can be very useful here. When you type that command, rather than typing the full file name of the .wav file, type S then tab then _ then tab. You can do the same for the transcript file. The third argument there is the name of the output file. Since this doesn't exist until the program runs (the first time; the second time it will overwrite) you'll have to type it in full.
Observe what the program prints to the screen as it runs. Are there any words it doesn't recognize? Why not?
Copy the output file (scrapple.TextGrid) to your local machine. (You might need to switch local machines here to one in the socio lab, if you don't have Praat installed on your laptop...)
```
scp username@patas.ling.washington.edu:~/p2fa-files/scrapple.TextGrid .
```
Open scrapple.TextGrid with Praat, and examine the output.
In your write-up, answer the following questions:
1. How many speakers are there in the conversation?
2. What words did p2fa not recognize, and why?
3. Based on the contents of the file /NLP_TOOLS/speech_tools/p2fa/latest/model/dict what would you need to add to get p2fa to recognize those words?
4. Give the timestamp of the first location where the P2FA system gets "off-track".
5. At what timestamp does P2FA "get back on-track"?
6. Referring to the characteristics of the acoustic signal at the timestamp you noted in (4) above, why do you think it got "lost"?
7. Looking at the phone tier, list two errors that you see in P2FA's transcription.
8. Often we have to do some hand realignment to fix phone boundaries, or to modify the phone symbols we get from a forced aligner. In a few sentences, describe some of the types of modifications needed to clean up the TextGrid results file.
9. Imagine you are working with a corpus that is too large to fix with hand-alignment. Brainstorm a research question which could be asked of forced-aligned data with some noise (i.e., errors) in it.
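
When working on questions 2 and 3, it can help to look words up in the dictionary file with grep rather than paging through it. The sketch below runs against a small made-up excerpt, not the real file. It assumes one entry per line, with an uppercase word followed by whitespace and then its phone symbols; check the actual layout of /NLP_TOOLS/speech_tools/p2fa/latest/model/dict with less before relying on that assumption. The lookup function and the sample entries are hypothetical.

```shell
# Build a made-up dictionary excerpt in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/sample-dict" <<'EOF'
SCRAP  S K R AE1 P
SCRAPBOOK  S K R AE1 P B UH2 K
EOF

# Look up a whole entry: anchor the word at the start of the line and
# require whitespace after it, so "scrap" does not also match SCRAPBOOK.
lookup() {
    grep -i "^$1[[:space:]]" "$workdir/sample-dict"
}

# A word that is in the sample, and one that is not.
lookup scrap || echo "scrap: not found in the sample"
lookup wug || echo "wug: not found in the sample"
```

grep exits with a non-zero status when nothing matches, which is what triggers the "not found" message.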