Laboratory Exercise 1
Computational Methods in Linguistics (Bender/Wassink)
Goals:
- To become familiar with corpus resources available on patas,
specifically:
- What kinds of corpora we have
- How to use the corpus database to discover locally installed corpora
- How to discover further corpora that could be installed
- Licensing conditions
- Gentle introduction to unix
- Creating accounts on patas, lemur
- Logging in to patas
- Navigating the directory structure using command line
Due: Friday 3/30, 5pm.
Collaboration
This assignment is to be done collaboratively in practicum groups.
Each student should submit their own answer file, but it is fine (and
in fact expected) for practicum groups to discuss answers and come up
with them together. As you work on this collaboratively, we encourage
you to each have a computer so that everyone gets hands-on experience.
1. Create accounts
If you do not already have accounts on patas (for access to corpora)
and lemur (for version control with svn), request them now.
For patas, we have a web form: https://vervet.ling.washington.edu/db/accountrequest-form.php
For lemur: To request an account, email linghelp@u with your UW NetID
and a statement of your affiliation with the Department of Linguistics
or one of the Linguistics laboratories. You will receive an email with
your temporary password when your account has been created. Your
password will be stored in the clear both on the server and on
machines on which you run the client, so don't use a password that you
use for anything else.
To change your password on lemur: https://lemur.ling.washington.edu/cgi-bin/svnpasswd.pl
2. Explore the corpus database
The database of installed corpora (plus others that are available
but not yet installed) can be viewed at this link:
https://vervet.ling.washington.edu/db/livesearch-corpus-form.php
Using that database, find answers to the following questions:
(Hint: For some of them, the "advanced search" page might be
helpful.)
- How many corpora are currently installed on patas?
- Does the LDC General license allow you to make a copy of a corpus
it applies to on your own computer?
- How many different languages are represented in all the corpora
listed in the database?
- How many installed corpora have both age and gender included
in demographic information?
- How many corpora (installed or available) have phonetic information
listed among their annotations?
3. Find the corpora on patas
These steps can be done once you have your patas account.
- If you are using a Mac or Linux, open a terminal. If you are on
Windows, you'll need to first install an ssh client, such as PuTTY.
- Connect to patas with the following command (replacing username
with your patas username):
ssh -l username patas.ling.washington.edu
- Patas will prompt you for your password. Use the password given to
you when your account was set up.
- Change your password using the passwd command:
passwd
- It will prompt you for your old password and then your new password
twice.
- Navigate to the directory that contains the TIMIT corpus (LDC93S1), and
examine the files inside the subdirectory TRAIN/DR1/FCJF0. Answer the following
questions:
- What is in the files whose names end in .TXT?
- What is in the files whose names end in .PHN?
- What is in the files whose names end in .WRD?
- What are the contents of the third line of the file SA1.PHN?
Some useful unix commands for carrying this out:
command | example | explanation |
cd | cd /corpora | change the current directory to the named directory; with no argument, it defaults to your home directory |
ls | ls TIMIT | list the contents of the named directory; with no argument, it defaults to the current directory |
pwd | pwd | print working directory (as path from root) |
less | less SA1.PHN | display the contents of a file |
Note: You do not have write permission on any files under /corpora, so
you can't break anything there by poking around.
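If you would like to try these commands before working in /corpora, one option is to practice in a throwaway directory. The following is a minimal sketch, not part of the assignment: the directory and file names (practice, DIR1, SPK1, EXAMPLE.TXT) are made up for illustration and do not correspond to anything on patas. Because less is interactive, the sketch uses head and sed to show file contents instead.

```shell
# Build a throwaway practice tree; every name below is hypothetical.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/practice/DIR1/SPK1"
printf 'line one\nline two\nline three\nline four\n' > "$sandbox/practice/DIR1/SPK1/EXAMPLE.TXT"

cd "$sandbox/practice"      # change into the practice directory
pwd                         # print the full path of the current directory
ls                          # list its contents (here: DIR1)
cd DIR1/SPK1                # a relative path can span several levels at once
ls -l                       # long listing, with sizes and dates
head -n 3 EXAMPLE.TXT       # show the first three lines without opening a pager

# Pull out a single line by number: -n suppresses normal output, 3p prints line 3.
third_line=$(sed -n '3p' EXAMPLE.TXT)
echo "$third_line"

cd ..                       # move up one level, back to DIR1
```

The same pattern (cd to a directory, ls to see what is there, then head, sed, or less on a file) applies when you explore the corpus directories; only the paths differ.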
4. Explore the LDC catalogue.
Most of the corpora we have installed come from the Linguistic Data
Consortium (LDC), and we have access in principle to any corpus
they have ever published. The LDC catalogue is available here:
http://www.ldc.upenn.edu/Catalog/
Using that database, find the following:
- A corpus that is not listed in the local database on vervet. (Give
both the name and the catalog number for the corpus.)
- Five different types of annotation that are available on at least
one LDC corpus. (Illustrate each type with one name and catalogue
number.)
- An LDC resource that is not a corpus (i.e., not a collection of
running text or speech), but rather some other type of linguistic
database. Give the name, catalog number, and a brief description of
what it is.
5. Find other collections of linguistic databases
Note that while the LDC is an extremely important
clearing house for linguistic databases, there are others as well.
Accordingly, your final task for the treasure hunt is to find
websites for:
- an organization outside the US that curates and distributes
linguistic databases
- a linguistic archive (distinct from your answer to the first item
above) specializing in resources describing endangered languages
- a site that makes collections of data available to researchers
which could be useful in linguistic work but which is not created
for linguists (at least not in the first instance)
6. Reflect on what you can do with a database.
- Choose one corpus with annotations from any of the collections
noted above (or listed in CorpusList.rtf) and explain what information
is captured in the annotations.
- Draft a research question in some subdiscipline of linguistics that
this corpus might be used to address.
- Describe how the annotations in the corpus can assist in addressing
the research question.