LING 575 - Declarative Information Extraction
Course Description and Policy
The goal of information extraction (IE) is to extract information from unstructured text. IE is an essential component for many applications that leverage unstructured text, including social media analytics, biomedical NLP, financial risk analysis, semantic search, regulatory compliance, legal discovery and many others.
There are two major approaches to IE: rule-based and statistical systems. Both have advantages and limitations. In this seminar, we focus on a new paradigm for IE called "Declarative IE", which has recently emerged as a powerful approach to building high-performance IE systems. One particular system -- SystemT developed by the IBM Research Center -- will be discussed in detail. In particular, we will discuss (1) the theoretical foundations of SystemT, including its underlying algebra and the optimizer, in comparison with earlier systems in terms of expressivity and runtime performance; (2) a detailed description of AQL, the rule language of SystemT; (3) algorithms for learning rules automatically from data.
This seminar is based on a course that the IBM team taught at UC Santa Cruz in Spring 2014. We are revising the course material to fit better with the CLMS curriculum. We are very grateful for the generous support of the IBM team, who not only provide the course material and SystemT but also offer guest lectures and office hours. SystemT is a commercial product and will have an official public release in March 2015, right before the start of this course.
There is no required textbook. Instead, the course readings will be drawn from contemporary articles and tutorials available online.
Helpful background material can also be found in:
- (M): Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, 2008. Introduction to Information Retrieval, Cambridge University Press. [pdf]
- (J&M): Daniel Jurafsky and James Martin, 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition.
- (M&S): Christopher D. Manning and Hinrich Schütze, 1999. Foundations of Statistical Natural Language Processing, Cambridge, MA: MIT Press.
- LING 570
- LING 572, Java, and Eclipse are a big plus, but
- SystemT is written in Java, and the IDE is Eclipse-based. Therefore, it would be much easier if you use Java as your programming language. On the other hand, you can choose whatever language you prefer if it works well with the SystemT API.
- For submission, students should tar all the required files and upload the tar file via CollectIt. Please include the shell scripts and note files as explained in class.
- The code must run on Patas. If the code does not work for whatever reason, please explain your work in the note file to get partial credit.
- 90%: Homework Assignments: Due at 11:45pm on Monday.
Students can work in a team. Each team has at most 3 people.
- 10%: Reading Assignments: Due at 11:45pm on Monday
- 10%: Class participation
- The lowest homework assignment grade will be dropped when computing grades and averages.
- Late assignment submission: There will be a 1% penalty for every hour after the deadline. For instance, if the assignment is due at 11:45pm and you turn it in at 1:45am the next morning, your grade would be x * 0.98, where x is the grade you would have received had you turned it in before the deadline. No assignments will be accepted two (2) days after the due date.
- Reading assignments: Reading assignments are given on the teaching slides, not on a separate handout. Submit them via CollectIt.
- Incomplete: According to UW policy, "incomplete grades may only be awarded if you are doing satisfactory work up until the last two weeks of the quarter." Therefore, it is crucial for you to hand in your homework on time. An "incomplete" grade is given only under extremely unusual circumstances (e.g., health issues, family emergency).
- GoPost and email: For all course-related questions, please post your questions to GoPost unless you don't want others to see the message (e.g., questions about your grades). For emails, please start the subject line with "ling572:". If you do not include the prefix, your mail might go unanswered. If you don't receive a reply from me within 24 hours, please send me a reminder.
- Collaboration: Students are encouraged to collaborate with their classmates in and outside the classroom. For instance, you can post questions about assignments to GoPost, and others are encouraged to reply to your post.
- Online section: Students who do not register for the online section can attend no more than 10% of sessions online.
- Laptop in class: Students are NOT allowed to use a laptop in class unless they are using it to take notes or to go over the slides for that day.
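
To make the late-submission arithmetic concrete, here is a minimal Python sketch of the 1%-per-hour rule described in the policy list above. It is an illustration only, not the official grading script. The function name, the rounding of partial hours up to the next full hour, and the reading of "two (2) days" as 48 hours are all assumptions; the policy itself states only the hourly rate and the x * 0.98 example.

```python
import math
from datetime import datetime, timedelta

PENALTY_PER_HOUR = 0.01      # 1% per hour late, as stated in the policy
CUTOFF = timedelta(days=2)   # assumption: "two (2) days" means 48 hours


def late_adjusted_grade(raw_grade, due, submitted):
    """Return the grade after the late penalty, or None if not accepted.

    Hypothetical helper: partial hours are rounded up, which is an
    assumption and not something the policy specifies.
    """
    lateness = submitted - due
    if lateness <= timedelta(0):
        return raw_grade             # on time: no penalty
    if lateness > CUTOFF:
        return None                  # past the cutoff: not accepted
    hours_late = math.ceil(lateness.total_seconds() / 3600)
    return raw_grade * (1 - PENALTY_PER_HOUR * hours_late)


if __name__ == "__main__":
    # Hypothetical dates mirroring the example in the policy:
    # due at 11:45pm, submitted at 1:45am the next morning.
    due = datetime(2015, 4, 6, 23, 45)
    submitted = datetime(2015, 4, 7, 1, 45)
    print(late_adjusted_grade(90.0, due, submitted))
```

With these assumptions, a submission two hours late is scaled by 0.98, matching the example in the policy, and anything more than 48 hours late returns None.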