Laboratory Exercise 2
Computational Methods in Linguistics (Bender/Wassink)
- To gain hands-on experience running two different types
of annotation-producing software
- To understand what the output formats look like
- To understand the extent to which the programs are reliable
- To see what needs to be done to do basic phone-alignment
over a soundfile that was generated in a conventional way for
sociophonetic analysis together with its orthographic transcription
- Further practice with the Unix command line (including managing
the input and output of programs)
Due: Friday 4/6, 5pm.
Both parts of this assignment are to be carried out on patas,
so step one is logging in to patas.
- Log in to patas.
- Invoke the Stanford parser over a sample from the Brown corpus, which I've stored in a file called lab2-input.txt in my home directory:
lexparser.csh ~ebender/lab2-input.txt
This will cause the parser output to be printed out on your screen, which
is maybe interesting, but not very helpful. (NB: Some of what is printed
is error messages about the parser not having enough memory. If we
were interested in using the output of the parser as automatic annotations,
we would need to adjust its memory settings to get rid of these errors,
but for present purposes they are not problematic.)
- Now try again, redirecting the parser output to a file in your home directory:
lexparser.csh ~ebender/lab2-input.txt > ~/lab2-output.txt
This will put the output into the file called lab2-output.txt
in your home directory, but still print some messages to the screen.
- Try one more time, this time redirecting both the output and
the messages to separate files:
lexparser.csh ~ebender/lab2-input.txt 1> ~/lab2-output.txt 2> ~/lab2-output.err
- Use less to examine the contents of the output file and
the error file.
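The two kinds of redirection above can be tried out without the parser. The sketch below is only an illustration: fake_parser is a made-up stand-in for lexparser.csh, its messages are invented, and the files go into a scratch directory rather than your home directory.

```shell
#!/bin/sh
# fake_parser is a hypothetical stand-in for lexparser.csh: it writes a
# "parse" to standard output (stream 1) and a message to standard error
# (stream 2).
fake_parser() {
    echo "(ROOT (NP (NN example)))"
    echo "message: parsing sentence 1" >&2
}

# Scratch directory, so nothing in your home directory is touched.
demo_dir=$(mktemp -d)

# Redirect only standard output: the message still reaches the screen.
fake_parser > "$demo_dir/only-out.txt"

# Redirect both streams to separate files: nothing reaches the screen.
fake_parser 1> "$demo_dir/out.txt" 2> "$demo_dir/err.txt"

# Show what ended up in each file.
cat "$demo_dir/out.txt"
cat "$demo_dir/err.txt"
```

The first call corresponds to the command with a single >, the second to the command with 1> and 2>.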
- You can also copy the file from your patas account to
your local machine. If you have a unix terminal on your local
machine, you can do this with scp. Issue the following
command on your local machine (replacing the address before
the colon with your own patas username and the patas host name):
scp firstname.lastname@example.org:~/lab2-output.txt .
The computer should prompt you for your patas password
and then copy the file to your current directory on your local machine.
- This file was also annotated by hand as part of the Penn Treebank.
We have it on patas at the following path:
- Use less to examine the contents of the hand-annotated
file. (Note that there are more sentences in that file.) You may
also want to use a text editor, either on patas or on your local
machine, to examine these files.
- Answer the following questions:
- How many sentences were given to the parser?
- How many sentences did the parser find a parse for?
- For the first sentence, list all of the NPs, according
to the Stanford parser (e.g.,
the first one is "American romance").
- For the first sentence, describe the differences between the
automatic parse and the hand parse. In particular, provide one example of
each of the following:
- A place where the structure is the same, but the labels are different.
- A place where the structure is different.
Below the constituent structure for the sentence is a set
of lines giving a dependency structure.
- What do the numbers after the words in these lines indicate? Why
would these be necessary?
- Brainstorm a scenario in which either the constituent structure
or the dependency structure could be useful for your research. Would
the other structure be useful as well? Why or why not?
- How would parser errors affect the usefulness of these automatic
annotations for that research question?
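For the counting questions, grep can do some of the work. The following is a rough sketch, not the required method: it assumes that every parse tree in the output starts on a line beginning with (ROOT, which you should confirm with less before trusting the count. The sample text is invented for illustration.

```shell
#!/bin/sh
# Invented sample in the assumed output layout: a bracketed tree, a blank
# line, dependency lines, a blank line, then the next tree.
sample=$(mktemp)
cat > "$sample" <<'EOF'
(ROOT
  (S
    (NP (DT The) (NN dog))
    (VP (VBD barked))))

det(dog-2, The-1)
nsubj(barked-3, dog-2)

(ROOT
  (NP (NN Hello)))

EOF

# Count the lines that begin with "(ROOT"; under the assumption above,
# that is one line per parse tree.
tree_count=$(grep -c '^(ROOT' "$sample")
echo "trees found: $tree_count"
```

On patas the same idea would be grep -c '^(ROOT' ~/lab2-output.txt, under the same assumption about the output format.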
- Log in to patas.
- Make a copy of the files we'll be using for this assignment
in your home directory on patas:
cp -r ~ebender/p2fa-files ~
- Copy the files from patas to your local machine, too (the
following command should be issued from your local machine):
scp -r email@example.com:~/p2fa-files .
- To orient yourself to this conversation, open and inspect the long
version of the orthographic transcript for this soundfile (in
Microsoft Word...horrors!). It is a typical conversation transcript,
such as is commonly used in sociolinguistics. Observe how many speakers
participated in the conversation.
- p2fa requires a plain text transcript, with only the
orthographic transcription and no further annotations. Copy
the file shortscrappletranscript.txt into a new file
called cleanscrappletranscript.txt. You can do
this on patas with the text editor emacs, or on your local
machine. If you do it on your local machine, be sure to
copy the new file back to patas after you've made the modifications.
- Open cleanscrappletranscript.txt in a plain-text editor (so that
when you save it, it is still .txt and not .rtf).
- Remove all coding conventions from cleanscrappletranscript.txt.
- Replace all smart quotes in cleanscrappletranscript.txt
with regular quotes or apostrophes, as appropriate.
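If you would rather not hunt for smart quotes by hand, sed can replace them. This is a minimal sketch under two assumptions: the file is UTF-8, and every curly single quote should become a plain apostrophe. The file names here are placeholders in a scratch directory, and the sample sentence is invented.

```shell
#!/bin/sh
work_dir=$(mktemp -d)

# Invented sample line containing curly double quotes and a curly apostrophe.
printf '%s\n' 'She said “I don’t know” and left.' > "$work_dir/short.txt"

# One substitution per character, so the command does not depend on how
# sed treats multi-byte characters inside bracket expressions.
# The result goes to a new file; the original is left unchanged.
sed -e 's/“/"/g' -e 's/”/"/g' \
    -e "s/‘/'/g" -e "s/’/'/g" \
    "$work_dir/short.txt" > "$work_dir/clean.txt"

cat "$work_dir/clean.txt"
```

Look over the result afterwards: this only handles the quotes, so the coding conventions still have to be removed separately.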
- If you've been working locally, copy cleanscrappletranscript.txt back to patas:
scp cleanscrappletranscript.txt firstname.lastname@example.org:~/p2fa-files/
- Invoke p2fa on patas with the command align.py (the full
path to align.py is /NLP_TOOLS/speech_tools/p2fa/latest/align.py, but we've added it to the $PATH variable on patas so
it can be invoked as just align.py).
align.py SP19CF2J__SP20CF2J_conversationalshort.wav cleanscrappletranscript.txt scrapple.TextGrid
NB: Tab completion can be very useful here. When you type that
command, rather than typing the full file name of the .wav
file, type S then tab then _ then tab. You can do the same
for the transcript file. The third argument there is the name of
the output file. Since this doesn't exist until the program
runs (the first time; the second time it will overwrite) you'll
have to type it in full.
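To see what adding a directory to $PATH does, here is a small self-contained sketch. hello-tool is a made-up script standing in for align.py, and the scratch directory stands in for the directory where align.py lives.

```shell
#!/bin/sh
tool_dir=$(mktemp -d)

# hello-tool is a hypothetical executable; it only prints one line.
cat > "$tool_dir/hello-tool" <<'EOF'
#!/bin/sh
echo "hello from hello-tool"
EOF
chmod +x "$tool_dir/hello-tool"

# Put the directory at the front of PATH so the shell searches it
# when a command is typed without a path.
PATH="$tool_dir:$PATH"
export PATH

# command -v reports where the shell finds a command name.
found_at=$(command -v hello-tool)
echo "found at: $found_at"

# The bare name now works from any directory.
tool_output=$(hello-tool)
echo "$tool_output"
```

On patas, command -v align.py should print the full path given above, if $PATH is set up as described.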
- Observe what the program prints to the screen as it runs.
Are there any words it doesn't recognize? Why not?
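One way to double-check which words are missing is to compare the transcript's words against the dictionary. The sketch below assumes the dictionary has one entry per line, with the word in upper case first and its pronunciation after it; check the real file with less before relying on that. The transcript and dictionary here are invented, and with LC_ALL=C any non-ASCII character is treated as a word separator.

```shell
#!/bin/sh
# Use the C locale so that sort and comm agree on ordering.
LC_ALL=C
export LC_ALL

scratch=$(mktemp -d)

# Invented transcript and a tiny invented dictionary.
printf '%s\n' 'The dog saw a wug.' > "$scratch/transcript.txt"
cat > "$scratch/dict" <<'EOF'
A  AH0
DOG  D AO1 G
SAW  S AO1
THE  DH AH0
EOF

# One word per line (letters and apostrophes), upper-cased, unique, sorted.
tr -cs "[:alpha:]'" '\n' < "$scratch/transcript.txt" \
    | tr '[:lower:]' '[:upper:]' \
    | sort -u > "$scratch/transcript-words.txt"

# First column of the dictionary, unique and sorted.
awk '{ print $1 }' "$scratch/dict" | sort -u > "$scratch/dict-words.txt"

# Lines that occur only in the transcript word list.
missing=$(comm -23 "$scratch/transcript-words.txt" "$scratch/dict-words.txt")
echo "$missing"
```

Compare whatever this reports for the real files with what align.py itself prints; the program's own messages are the authority.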
- Copy the output file (scrapple.TextGrid) to your local
machine. (You might need to switch local machines here to one
in the socio lab, if you don't have Praat installed on your laptop...)
scp email@example.com:~/p2fa-files/scrapple.TextGrid .
- Open scrapple.TextGrid with Praat, and examine it.
- In your write up, answer the following questions:
- How many speakers are there in the conversation?
- What words did p2fa not recognize, and why?
- Based on the contents of the file /NLP_TOOLS/speech_tools/p2fa/latest/model/dict what would you need to add to get p2fa to recognize those words?
- Give the timestamp of the first location where the P2FA system gets "off-track".
- At what timestamp does P2FA "get back on-track"?
- Referring to the characteristics of the acoustic signal at the timestamp where P2FA first gets "off-track", why do you think it got "lost"?
- Looking at the phone tier, list two errors that you see in P2FA's transcription.
- Often we will have to do some hand-realignment to fix phone boundaries, or modify the phone symbols we get from a forced aligner. In a few sentences, describe some of the types of modifications we'd need to make to clean up the resulting TextGrid file.
- Imagine you are working with a corpus that is too large to fix
with hand-alignment. Brainstorm a research question which could
be asked of forced-aligned data with some noise (i.e., errors) in it.