Linguistics 580:
Computational Methods in Linguistic Analysis

Spring 2012

Course Info

Instructor Info

Links

Syllabus

Description

Linguistic research across all subfields involves the testing of hypotheses against data. Computational methods allow linguists to greatly extend the size of datasets and range of data brought to bear on their hypotheses. The purpose of this course is to introduce students to existing computational resources including text and speech corpora, annotations over those corpora and software for manipulating them. The course is project-driven: The target audience is graduate students who are engaged in research projects which can be enhanced with the computational methods we study. No background in computational linguistics or computer programming is assumed.

Course goals

By the end of this course students will:

Practicum groups

We will get hands-on experience working with large corpora of interest to students. Students will form practicum groups (pairs or small groups) and work together to accomplish specific database-related tasks across the quarter. Most of this work will be "behind the scenes", and groups will independently arrange how to conduct their work. There are two planned points in the quarter when practicum work will be presented in class. Each practicum group will be responsible for one syllabus day (a Wednesday) in the first half of the quarter, when it will give an in-class presentation that introduces us to its chosen database, shows how it is annotated, demonstrates how it is navigated, and shows how it might be of use to linguists. Note: a list of database resources of interest to linguists, CorpusList.rtf, is available online in the Sociolinguistics Wiki.

Ideally, each student will have a laptop available to use in class on Wednesdays when we do hands-on exercises.

Practicum group members will:

  1. Choose a database with which to work for the quarter.
  2. Work together to learn how to access the contents of this database.
  3. Present their database to the class, show how to access it, and write one research question that this database may be used to address.
  4. Individually annotate some portion of the database.
  5. Measure inter-annotator agreement on the portion annotated in step 4.
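
As an illustration of step 5, here is a minimal sketch of one common agreement measure, Cohen's kappa, for two annotators who labeled the same items. It assumes Python 3 and plain lists of category labels. The function name `cohens_kappa` and the part-of-speech labels are hypothetical, chosen only for this example, and are not taken from the course materials.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: proportion of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for the same six tokens.
annotator_1 = ["N", "V", "N", "ADJ", "V", "N"]
annotator_2 = ["N", "V", "ADJ", "ADJ", "V", "N"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

In this toy example the annotators agree on 5 of 6 tokens (observed agreement 0.833) and chance agreement is 0.333, so kappa works out to 0.75. Real annotation projects may call for other measures, such as Krippendorff's Alpha, when there are more than two annotators or missing labels.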

Exercises: Given out on Mondays in lecture. Some are practicum assignments to be worked on collaboratively on Wednesdays in class; others are individual assignments. All are submitted to Canvas on Fridays by 5pm.

Disability accommodations

To request academic accommodations due to a disability, please contact Disabled Student Services, 448 Schmitz, 206-543-8924 (V/TTY). If you have a letter from Disabled Student Services indicating that you have a disability which requires academic accommodations, please present the letter to the instructor so we can discuss the accommodations you might need in this class.

Requirements

Late homework policy

Schedule of Topics and Assignments (may be updated)

Each week has a technical topic and a theoretical topic. We seek to build skills in linguistic analysis, drawing on some issue from linguistic theory (Mondays) as well as technical computing skills (Wednesdays). You will, therefore, see the days of each week in the schedule labelled either "linguistics" or "technical" depending on their focus. This schedule may change in light of the projects that students are working on.

Each entry below gives the date, the topic, the reading(s) to be discussed this week, and what is due.

Wk 1 (3/26, 3/28)
  Topic:
    Linguistics: Intro; Projects of interest; What database resources exist
    Technical: Getting around in Unix
  Due:
    Corpus scavenger hunt

Wk 2 (4/2, 4/4)
  Topic:
    Linguistics: Processing resources: POS taggers, parsers, forced aligners
    Technical: Running software
  Reading(s):
    Bender & Langendoen 2010
  Due:
    Running a parser/POS tagger/forced aligner
    Presentation by Practicum Groups
    Term project: Define research questions

Wk 3 (4/9, 4/11)
  Topic:
    Linguistics: Discussion of project questions and how corpora can be brought to bear on them
    Technical: Python basics, SVN
  Reading(s):
    Bird et al. 2009, Chapters 0 and 1
    [R: Gries 2009, ch. 3]
  Due:
    Hello World
    Term project: Identify relevant resources

Wk 4 (4/16, 4/18)
  Topic:
    Linguistics: Lexical frequency
    Technical: Python (cont.)
  Reading(s):
    Jurafsky (skip or skim pp. 63-88)
    [R: Gries 2009, ch. 4]
  Due:
    Simple word frequency counter in Python (run over multiple corpora)
    Term project: Identify relevant resources

Wk 5 (4/23, 4/25)
  Topic:
    Linguistics: Metadata (DCMI, OLAC); Publishing data with papers
    Technical: Python (cont.)
  Due:
    Merging demographic data with transcripts (Python exercise)

Wk 6 (4/30, 5/2)
  Topic:
    Linguistics: Annotation: Inter-annotator agreement; Annotation guidelines
    Technical: Annotation software (xtdf; Excel; ELAN)
  Reading(s):
    Bird & Liberman 1999
    Morgan et al. (skip or skim Sec. 5, pp. 22-31)
  Due:
    Annotation (in pairs to produce dual-annotation)

Wk 7 (5/7, 5/9)
  Topic:
    Interim project reports, open discussion of normalized frequencies for student databases

Wk 8 (5/14, 5/16)
  Topic:
    Linguistics: Inter-annotator agreement
    Technical: Computing Cohen's Kappa and Krippendorff's Alpha
  Reading(s):
    Artstein & Poesio 2008
    Clopper 2011
  Due:
    Practicum groups compute Kappa/Alpha (Python)

Wk 9 (5/21, 5/23)
  Topic:
    Linguistics: Sampling; Choosing statistical tests
    Technical: Basic exploratory statistics in R
  Reading(s):
    Hay 2011
  Due:
    R assignment
    Term project: Choose statistical tests

Wk 10 (5/30)
  Topic:
    No class Monday, Project presentations

6/6
  Due:
    Final projects due
    No late projects accepted.


Last modified: 3/22/12 7:21 PM