Ling 573 - Natural Language Processing Systems and Applications
Spring 2017
Deliverable #3: Summarization Improvement; Information Ordering
Code and Results due: May 14, 2017: 23:59
Updated Project Report due: May 16, 2017: 09:00 a.m.


In this deliverable, you will continue development and improvement of your summarization system. You will

Information Ordering

For this deliverable, one focus will be on improving your base summarization systems through enhanced information ordering. Information ordering can address:

You may build on techniques presented in class, described in the reading list, and proposed in other research articles.
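As one illustration of an ordering baseline, the sketch below orders already-selected sentences chronologically: first by the publication date of the source document, then by position within that document. It is a minimal sketch, not a required approach. The `SelectedSentence` record and its fields are hypothetical names, and it assumes your pipeline can supply a date and a sentence index for each selected sentence.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class SelectedSentence:
    """Hypothetical record for a sentence chosen by content selection."""
    text: str
    doc_date: date   # publication date of the source document (assumed available)
    position: int    # index of the sentence within its source document


def order_chronologically(sentences: List[SelectedSentence]) -> List[SelectedSentence]:
    """Sort by document date, breaking ties by position in the document."""
    return sorted(sentences, key=lambda s: (s.doc_date, s.position))


if __name__ == "__main__":
    picked = [
        SelectedSentence("Rescue efforts continued overnight.", date(2005, 3, 2), 0),
        SelectedSentence("Officials confirmed the damage.", date(2005, 3, 1), 4),
        SelectedSentence("A storm struck the coast.", date(2005, 3, 1), 1),
    ]
    for sentence in order_chronologically(picked):
        print(sentence.text)
```

Because Python's sort is stable, sentences with identical date and position keep the order in which content selection produced them.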

Content Selection Improvement

You should continue to revise and improve your content selection approach to enhance your summarization system. One strategy to do so in the context of TAC is through topic-focused summarization, as discussed below.

Topic-focused summarization

The TAC summarization task is a topic-focused, or "guided", summarization task. Summaries are expected to focus on the topic, specified by the title element given in the test topics XML file, and address the relevant aspects for the corresponding category. Most approaches augment existing content selection strategies to further focus on the desired topics. You may build on approaches presented in lecture or readings.
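One simple way to augment an existing content selection strategy is to interpolate each sentence's base score with its overlap with the topic title. The sketch below is only an illustration under assumptions: `base_scores` stands in for whatever scores your current system produces, and the tokenizer, the overlap measure, and the `weight` parameter are hypothetical choices rather than a prescribed method.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def topic_overlap(sentence: str, title: str) -> float:
    """Fraction of title terms that also appear in the sentence."""
    title_terms = tokenize(title)
    if not title_terms:
        return 0.0
    return len(tokenize(sentence) & title_terms) / len(title_terms)


def rerank(base_scores: Dict[str, float], title: str, weight: float = 0.5) -> List[str]:
    """Rank sentences by a weighted mix of base score and title overlap.

    weight=0.0 reproduces the base ranking; weight=1.0 uses overlap alone.
    """
    combined = {
        sentence: (1.0 - weight) * score + weight * topic_overlap(sentence, title)
        for sentence, score in base_scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


if __name__ == "__main__":
    scores = {
        "The shooting at the school shocked the town.": 0.6,
        "Officials discussed the budget.": 0.7,
    }
    print(rerank(scores, "school shooting"))
```

The interpolation assumes the base scores are roughly on a 0 to 1 scale; if they are not, normalize them first so that the two terms are comparable.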

High-level System Behavior

We will be focusing on the TAC (Text Analysis Conference) summarization shared task. Your system will:

Document Sets

Document sets to be summarized are provided in NIST standard XML files, identifying a set of topics consisting of:

NOTE: You should only evaluate your system on the 'A' document sets. The 'B' sets were designed to evaluate so-called "update" summaries.
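The sketch below shows one way to read a topics file and keep only the 'A' document sets, using Python's standard `xml.etree.ElementTree`. The element and attribute names (`topic`, `title`, `docsetA`, `docsetB`, `doc`, `id`, `category`) and the sample data are assumptions made for illustration; check them against the actual XML files before relying on this.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical sample; the real file layout may differ.
SAMPLE = """<TACtaskdata>
  <topic id="T1" category="1">
    <title> Example Topic </title>
    <docsetA id="T1-A"><doc id="DOC_A1"/><doc id="DOC_A2"/></docsetA>
    <docsetB id="T1-B"><doc id="DOC_B1"/></docsetB>
  </topic>
</TACtaskdata>"""


def read_a_sets(xml_text: str) -> List[Dict[str, object]]:
    """Return one record per topic, keeping only its 'A' document set."""
    root = ET.fromstring(xml_text)
    topics = []
    for topic in root.iter("topic"):
        docset = topic.find("docsetA")  # the 'B' (update) set is ignored
        if docset is None:
            continue
        topics.append({
            "topic_id": topic.get("id"),
            "category": topic.get("category"),
            "title": (topic.findtext("title") or "").strip(),
            "docset_id": docset.get("id"),
            "doc_ids": [doc.get("id") for doc in docset.iter("doc")],
        })
    return topics


if __name__ == "__main__":
    for record in read_a_sets(SAMPLE):
        print(record["docset_id"], record["title"], record["doc_ids"])
```

To read from disk instead of a string, `ET.parse(path).getroot()` can replace the `ET.fromstring` call.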

NOTE: The format of the document IDs and the organization of the two corpora exhibit some differences. You may include/hardcode information about this structure in your system directly, or in a configuration file as you choose.

Summary Outputs

Your system should produce one summary output file per document set (topic), structured as described below:


You will employ the standard automatic ROUGE method to evaluate the results from your base end-to-end summarization system. Code implementing the ROUGE metric is provided in /dropbox/16-17/573/code/ROUGE/, along with example configuration files. You will need to modify the configuration file to reference your own system's summary output.


Training, Development Test, and Example Files

We will use one year's data as development test data (devtest) for most of the term, and then use a new unseen year's data as final evaltest in the last deliverable.

Primary Document Collections

The AQUAINT and AQUAINT-2 Corpora have been employed as the document collections for the summarization task for a number of years, and will form the basis of summarization for this deliverable. The collections can be found on patas in

Core Files

The core training, development test, and evaluation files can be found in /dropbox/16-17/573/ on the CL cluster.

<file_set_type> ranges over:

Training Data

You may use any of the DUC or TAC summarization data through 2009 for training and developing your system. For previous years, there are prepared document sets and model summaries to allow you to train and tune your summarization system.

Development Test Data

For Deliverables 2 and 3, you should evaluate on the TAC-2010 topic-oriented document sets and their corresponding model summaries. You should only evaluate your system on the 'A' sets. Development test data appears in the devtest subdirectories.

Evaluation Example Files

A variety of example files are provided to help you familiarize yourself with the ROUGE evaluation software.

Specific Submission Files

In addition to your source code and resources needed to support your system, your repository should include the following:

Your System Generated Summarization Files

Extended Project Report

../doc/D3.pdf: This extended version should include content for all the sections of the report, though some of it will not be very detailed yet. You should specifically focus on the following new material:


../doc/D3_presentation.{pdf|pptx|etc}: Your presentation may be prepared in any computer-projectable format, including HTML, PDF, PPT, and Word. Your presentation should take about 10 minutes to cover your main content, including:

Your presentation should be deposited in your doc directory, but it is not due until the actual presentation time. You may continue working on it after the main deliverable is due.