Linguistics 473
Computational Linguistics Fundamentals
Summer 2017

ScheduleThis is the schedule for 2016, and is subject to change before the start of class and as the class progresses. Thursday, July 20, 2017Overview, Sets, and Counting
Introduction and overview of the field of computational linguistics. Brief overview of analytical and statistical techniques in NLP. Basic counting: elements can be grouped into collections and counted. Collections can be distinct or nondistinct, and can be ordered or unordered. A 'set' is a distinct, unordered collection. After looking at permutations, we'll derive the Binomial coefficient, which tells us about how many subsets can be formed from a given set.
Tuesday, July 25, 2017Linux, Cluster Computing, Regular Expressions
The UW Computational Linguistics Computing Cluster will be used throughout all CLMS and NLT coursework. This system runs Linux (RHEL) and also supports the Condor batch submission system. We'll review basic Linux use and focus on how to submit files to the Condor system. In addition to being a quick and handy way to express text pattern matching heuristics, regular expressions have an important place in theoretical reasoning about state machines. For now, we'll study the basic patterns of RegEx use that will help with corpus cleaning and analysis. Thursday, July 27, 2017Events, Probability, Independence
In Lecture 1, we looked at counting entities in a set. When we consider how an the outcome of an observational trial—or experiment—partitions a set, we define a probability space. In particular, the probability space is defined by the set of mutually exclusive, collectively exhaustive outcomes of the trial, and an event can be any subset of these. Events are independent when the product of their probabilities equals the probability of their intersection.
Tuesday, August 1, 2017Random Variables and Probability Distributions
We examine the link between measurement and probability. When we consider multiple event trials that behave in a characteristic way, we form a random variable. Random variables can be discrete or continuous. The probabilities associated with the random variable form a probability distribution, which allows us to generalize about unseen sample spaces that may exhibit similar characteristics.
Thursday, August 3, 2017Bayes' Theorem
In this lecture, we will carefully focus on understanding Bayes' Theorem, the theorem that allows us to relate empirical observation to hypothesis.
Tuesday, August 8, 2017Conditional Independence; Finite State Machines
We'll examine the most common discrete and continuous probability distributions, and look at the properties that characterize them: mean, variance, expected value, covariance. Then, to cover material relevant to project 3, we discuss finite state machines (FSMs), which are automata that process (FST) or accept (FSA) input streams. State transitions are a function of the (state,input) tuple, and can be deterministic or nondeterministic.
Thursday, August 10, 2017Probability Distributions
More on finite state machines and a look at C# and enumeration.
Tuesday, August 15, 2017Algorithms for Corpus Processing
Introduction to BigO notation. Text processing uses programming patterns and specialized algorithms that are particular to the task. As a case study, we examine the Trie, a storage pattern that allows efficient retrieval of all matching subsequences in an input.
Thursday, August 17, 2017Language Modeling and POStagging
Building models of language from corpora is fundamental to statistical NLP. Corpora can be annotated via automatic, human, or hybrid analysis. Automatic POS tagging is a key task which benefits from application of Bayes' theorem. Language modeling issues include choice of Ngram, smoothing, underflow, and data sparsity.
Tuesday, August 22, 2017Formal Grammars and Parsing
ContextFree Grammars (CFG), first applied to linguistics by Noam Chomsky in 1956, are the basis for analytically representing the configuration of syntactic constituents. CFGs comprise a bounded class of languages that can be accepted by 'nondeterministic pushdown automata.' Theoretical work in this area establishes the foundation for classifying formal languages according to their expressive power. One reason we're concerned with the formal class of a language is that it determines the tractability of parsing algorithms. Topdown and bottomup parsing will be examined. Thursday, August 24, 2017Chomsky Hierarchy
The Chomsky Hierarchy, Continued Tuesday, August 29, 2017Clustering and Classifiers
Classification and Information Theory
Thursday, August 31, 2017Evaluation
Tuesday, September 5, 2017Deep Processing
Thursday, September 7, 2017Review
