Ling 573 - Natural Language Processing Systems and Applications
Spring 2015
Deliverable #2: Baseline Summarization System
Code and Results due: April 24, 2015: 23:59
Updated Project Report due: April 28, 2015: 09:00


Goals

In this deliverable, you will implement a baseline summarization system. You will

Baseline end-to-end system

For this deliverable, you will need to implement a baseline end-to-end system. You should build on approaches presented in class and readings. You may implement any effective strategy, but you are encouraged to implement an extractive summarization strategy. Your system should include:

Since this is a baseline system, it is not expected that your system will be as elaborate as those presented in class. You should concentrate on "connectivity first": get the system to work end-to-end first, and then work on refinements.
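To make "connectivity first" concrete, the following is a minimal sketch of one possible extractive strategy: score each sentence by the average frequency of its words across the document set, then greedily add the highest-scoring sentences until a word budget is reached. It is an illustration only, not a required design. The function names, the naive regex sentence splitter, and the max_words default of 100 are all assumptions made for the example; use whatever preprocessing and length limit your system and the task actually call for.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercased word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def summarize(documents, max_words=100):
    """Pick high-scoring sentences until the word budget is used up.

    documents: list of raw text strings for one document set.
    max_words: hypothetical length limit (an assumption of this sketch).
    """
    sentences = [s for doc in documents for s in split_sentences(doc)]
    freq = Counter(tok for s in sentences for tok in tokenize(s))

    def score(sentence):
        toks = tokenize(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    # sorted() is stable, so tied sentences keep their original order.
    ranked = sorted(sentences, key=score, reverse=True)

    summary, used = [], 0
    for sent in ranked:
        n_words = len(sent.split())
        # Skip exact duplicates and sentences that would exceed the budget.
        if sent in summary or used + n_words > max_words:
            continue
        summary.append(sent)
        used += n_words
    return summary
```

A sketch like this covers content selection only. Reading the document sets from the corpora, ordering the selected sentences, and writing the output files are separate steps that an end-to-end system would still need.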

Data

We will be focusing on the TAC summarization shared task. We will use one year's data as devtest for most of the term, and then use a new unseen year's data as final evaltest in the last deliverable.

Document Collection

The AQUAINT and AQUAINT-2 Corpora have been employed as the document collections for the summarization task for a number of years, and will form the basis of summarization for this deliverable. The collections can be found on patas in /corpora/LDC/LDC02T31/ (AQUAINT, 1996-2000) and /corpora/LDC/LDC08T25/ (AQUAINT-2, 2004-2006).

Training Data

You may use any of the DUC or TAC summarization data through 2009 for training and developing your system. For previous years, there are prepared document sets and model summaries to allow you to train and tune your summarization system.

All model files appear in /dropbox/14-15/573/Data/models.

All document specification files appear in /dropbox/14-15/573/Data/Documents.

Training data appear in the training subdirectories and devtest data in the devtest directory.

Development Test Data

You should evaluate on the TAC-2010 topic-oriented document sets and their corresponding model summaries. You should only evaluate your system on the 'A' sets. Development test data appear in the devtest subdirectories.

Evaluation

You will employ the standard automatic ROUGE method to evaluate the results from your baseline end-to-end summarization system. Code implementing the ROUGE metric is provided in /dropbox/14-15/573/code/ROUGE/ROUGE-1.5.5.pl. Example configuration files are given.
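For intuition about what ROUGE measures, the sketch below computes a simplified ROUGE-N recall: the fraction of reference n-grams that also occur in the system summary, with counts clipped and pooled over all references. It is not a substitute for the provided ROUGE-1.5.5.pl script, which should be used for all reported results and may differ in tokenization, stemming, and multi-reference handling. The function names and whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=1):
    """Clipped n-gram recall of a candidate against reference summaries.

    candidate: system summary as a string.
    references: list of model summary strings.
    """
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        total += sum(ref_counts.values())
        # Each reference n-gram is matched at most as often as it
        # appears in the candidate.
        matched += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0
```

For example, rouge_n_recall("the cat sat", ["the cat sat on the mat"]) matches three of the six reference unigrams and returns 0.5.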

Outputs

Create a directory D2 under the outputs directory containing the summaries based on running your baseline summarization system on the test data file. You should do this as follows:

Extending the project report

This extended version should include all the sections from the original report (with many still as stubs) and additionally include the following new material:

Please name your report D2.pdf.

Presentation

Your presentation may be prepared in any computer-projectable format, including HTML, PDF, PPT, and Word. Your presentation should take about 10 minutes to cover your main content, including:

Your presentation should be deposited in your doc directory, but it is not due until the actual presentation time. You may continue working on it after the main deliverable is due.

Summary