Linguistics 473
Computational Linguistics Fundamentals
Summer 2017

## Schedule

This is the schedule for 2016, and is subject to change before the start of class and as the class progresses.

#### Thursday, July 20, 2017

Overview, Sets, and Counting
Introduction and overview of the field of computational linguistics. Brief overview of analytical and statistical techniques in NLP. Basic counting: elements can be grouped into collections and counted. Collections can be distinct or non-distinct, and can be ordered or unordered. A 'set' is a distinct, unordered collection. After looking at permutations, we'll derive the Binomial coefficient, which tells us about how many subsets can be formed from a given set.

#### Tuesday, July 25, 2017

Linux, Cluster Computing, Regular Expressions
The UW Computational Linguistics Computing Cluster will be used throughout all CLMS and NLT coursework. This system runs Linux (RHEL) and also supports the Condor batch submission system. We'll review basic Linux use and focus on how to submit files to the Condor system. In addition to being a quick and handy way to express text pattern matching heuristics, regular expressions have an important place in theoretical reasoning about state machines. For now, we'll study the basic patterns of RegEx use that will help with corpus cleaning and analysis.

#### Thursday, July 27, 2017

Events, Probability, Independence
In Lecture 1, we looked at counting entities in a set. When we consider how an the outcome of an observational trial—or experiment—partitions a set, we define a probability space. In particular, the probability space is defined by the set of mutually exclusive, collectively exhaustive outcomes of the trial, and an event can be any subset of these. Events are independent when the product of their probabilities equals the probability of their intersection.

#### Tuesday, August 1, 2017

Random Variables and Probability Distributions
We examine the link between measurement and probability. When we consider multiple event trials that behave in a characteristic way, we form a random variable. Random variables can be discrete or continuous. The probabilities associated with the random variable form a probability distribution, which allows us to generalize about unseen sample spaces that may exhibit similar characteristics.

#### Thursday, August 3, 2017

Bayes' Theorem
In this lecture, we will carefully focus on understanding Bayes' Theorem, the theorem that allows us to relate empirical observation to hypothesis.

#### Tuesday, August 8, 2017

Conditional Independence; Finite State Machines
We'll examine the most common discrete and continuous probability distributions, and look at the properties that characterize them: mean, variance, expected value, covariance. Then, to cover material relevant to project 3, we discuss finite state machines (FSMs), which are automata that process (FST) or accept (FSA) input streams. State transitions are a function of the (state,input) tuple, and can be deterministic or non-deterministic.

#### Thursday, August 10, 2017

Probability Distributions
More on finite state machines and a look at C# and enumeration.

#### Tuesday, August 15, 2017

Algorithms for Corpus Processing
Introduction to Big-O notation. Text processing uses programming patterns and specialized algorithms that are particular to the task. As a case study, we examine the Trie, a storage pattern that allows efficient retrieval of all matching subsequences in an input.

#### Thursday, August 17, 2017

Language Modeling and POS-tagging
Building models of language from corpora is fundamental to statistical NLP. Corpora can be annotated via automatic, human, or hybrid analysis. Automatic POS tagging is a key task which benefits from application of Bayes' theorem. Language modeling issues include choice of N-gram, smoothing, underflow, and data sparsity.

#### Tuesday, August 22, 2017

Formal Grammars and Parsing
Context-Free Grammars (CFG), first applied to linguistics by Noam Chomsky in 1956, are the basis for analytically representing the configuration of syntactic constituents. CFGs comprise a bounded class of languages that can be accepted by 'non-deterministic pushdown automata.' Theoretical work in this area establishes the foundation for classifying formal languages according to their expressive power. One reason we're concerned with the formal class of a language is that it determines the tractability of parsing algorithms. Top-down and bottom-up parsing will be examined.

#### Thursday, August 24, 2017

Chomsky Hierarchy
The Chomsky Hierarchy, Continued

#### Tuesday, August 29, 2017

Clustering and Classifiers
Classification and Information Theory

Evaluation

Deep Processing

Review