LING 575 - Declarative Information Extraction
Spring 2015
Course Description and Policy

Course description

The goal of information extraction (IE) is to extract structured information from unstructured text. IE is an essential component of many applications that rely on unstructured text, including social media analytics, biomedical NLP, financial risk analysis, semantic search, regulatory compliance, legal discovery, and many others.

There are two major approaches to IE: rule-based and statistical. Both have advantages and limitations. In this seminar, we focus on a new paradigm called "Declarative IE", which has recently emerged as a powerful approach to building high-performance IE systems. One particular system, SystemT, developed at IBM Research, will be discussed in detail. Specifically, we will cover (1) the theoretical foundations of SystemT, including its underlying algebra and optimizer, and how it compares with earlier systems in expressivity and runtime performance; (2) AQL, the rule language of SystemT, in detail; and (3) algorithms for learning rules automatically from data.

This seminar is based on a course that the IBM team taught at UC Santa Cruz in Spring 2014. We are revising the course material to better fit the CLMS curriculum. We are very grateful for the generous support of the IBM team, which not only provides the course material and SystemT but also offers guest lectures and office hours. SystemT is a commercial product and is scheduled for official public release in March 2015, right before the start of this course.


There is no required textbook. Instead, the course readings will be drawn from contemporary articles and tutorials available online. Helpful background material can also be found in:




Course Policies