Laboratory Exercise 1

Computational Methods in Linguistics (Bender/Wassink)

Goals:

Due: Friday 3/30, 5pm.

Collaboration

This assignment is to be done collaboratively in practicum groups. Each student should submit their own answer file, but it is fine (and in fact expected) for practicum groups to discuss answers and come up with them together. As you work on this collaboratively, we encourage you to each have a computer so that everyone gets hands-on experience.

1. Create accounts

If you do not already have accounts on patas (for access to corpora) and lemur (for version control with svn), request them now.

For patas, we have a web form: https://vervet.ling.washington.edu/db/accountrequest-form.php

For lemur: To request an account, email linghelp@u with your UW NetID and a statement of your affiliation with the Department of Linguistics or one of the Linguistics laboratories. You will receive an email with your temporary password when your account has been created. Your password will be stored in the clear both on the server and on machines on which you run the client, so don't use a password that you use for anything else.

To change your password on lemur: https://lemur.ling.washington.edu/cgi-bin/svnpasswd.pl

2. Explore the corpus database

The database of installed corpora (plus others that are available but not yet installed) can be viewed at this link: https://vervet.ling.washington.edu/db/livesearch-corpus-form.php

Using that database, find answers to the following questions: (Hint: For some of them, the "advanced search" page might be helpful.)

  1. How many corpora are currently installed on patas?
  2. Does the LDC General license allow you to make a copy of a corpus it applies to on your own computer?
  3. How many different languages are represented in all the corpora listed in the database?
  4. How many installed corpora have both age and gender included in demographic information?
  5. How many corpora (installed or available) have phonetic information listed among their annotations?

3. Find the corpora on patas

These steps can be done once you have your patas account.

  1. If you are using a Mac or Linux, open a terminal. If you are on Windows, you'll first need to install an ssh client, such as PuTTY.
  2. Connect to patas with the following command (replacing username with your patas username):
       ssh -l username patas.ling.washington.edu
    
  3. Patas will prompt you for your password. Use the password given to you when your account was set up.
  4. Change your password using the passwd command:
       passwd
    
  5. It will prompt you for your old password and then your new password twice.
  6. Navigate to the directory that contains the TIMIT corpus (LDC93S1), and examine the files inside the subdirectory TRAIN/DR1/FCJF0. Answer the following questions:
    1. What is in the files whose names end in .TXT?
    2. What is in the files whose names end in .PHN?
    3. What is in the files whose names end in .WRD?
    4. What are the contents of the third line of the file SA1.PHN?

Some useful unix commands for carrying this out:

command   example        explanation
cd        cd /corpora    change the current directory to the named directory; with no argument, it defaults to your home directory
ls        ls TIMIT       list the contents of the named directory; with no argument, it defaults to the current directory
pwd       pwd            print the working directory (as a path from root)
less      less SA1.PHN   display the contents of a file

Note: You do not have write permission on any files under /corpora, so you can't break anything there by poking around.
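The commands above can be tried out safely before you log in to patas. The following is only a sketch: it builds a scratch directory with a placeholder file (the names practice and sample.txt and the file contents are made up for illustration and have nothing to do with the real corpus files), then walks through cd, pwd, and ls. It also uses sed -n '3p', which is not in the table above, as one possible non-interactive way to print just the third line of a file.

```shell
# Build a scratch directory with a placeholder file.
# (Hypothetical names and contents; this is NOT corpus data.)
workdir=$(mktemp -d)
mkdir -p "$workdir/practice"
printf 'first line\nsecond line\nthird line\n' > "$workdir/practice/sample.txt"

cd "$workdir"      # change the current directory to the scratch directory
pwd                # print the working directory as a path from root
ls practice        # list the contents of the named directory

# Print only line 3 of the file and keep it in a variable.
third=$(sed -n '3p' practice/sample.txt)
echo "$third"
```

On patas, the same pattern should apply to a file such as SA1.PHN once you have used cd to reach its directory; less remains the easier choice when you want to page through a whole file.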

4. Explore the LDC catalogue

Most of the corpora we have installed come from the Linguistic Data Consortium (LDC), and we have access in principle to any corpus they have ever published. The LDC catalogue is available here: http://www.ldc.upenn.edu/Catalog/

Using that database, find the following:

  1. A corpus that is not listed in the local database on vervet. (Give both the name and the catalog number for the corpus.)
  2. Five different types of annotation that are available on at least one LDC corpus. (Illustrate each type with one name and catalogue number.)
  3. An LDC resource that is not a corpus (i.e., not a collection of running text or speech), but rather some other type of linguistic database. Give the name, catalog number, and a brief description of what it is.

5. Find other collections of linguistic databases

Note that while the LDC is an extremely important clearing house for linguistic databases, there are others as well. Accordingly, your final task for the treasure hunt is to find websites for:

  1. an organization outside the US that curates and distributes linguistic databases
  2. a linguistic archive (distinct from your answer for item 1) specializing in resources describing endangered languages
  3. a site that makes collections of data available to researchers which could be useful in linguistic work but which is not created for linguists (at least not in the first instance)

6. Reflect on what you can do with a database

  1. Choose one corpus with annotations from any of the collections noted above (or listed in CorpusList.rtf) and explain what information is captured in the annotations.
  2. Draft a research question in some subdiscipline of linguistics that this corpus might be used to address.
  3. Describe how the annotations in the corpus can assist in addressing the research question.