University of Washington: Linguistics: Ling 573: Spring 2017: Deliverable #2

Ling 573 - Natural Language Processing Systems and Applications
Spring 2017
Deliverable #2: Base End-to-End Summarization System
Code and Results due: April 23, 2017: 23:59
Updated Project Report due: April 25, 2017: 09:00


Goals

In this deliverable, you will implement a base end-to-end summarization system. You will

Base end-to-end system

For this deliverable, you will need to implement a base end-to-end system. You should build on approaches presented in class and readings. You may implement any effective strategy, but you are encouraged to implement an extractive summarization strategy. Your system should include:

Since this is an initial system, it is not expected that your system will be as elaborate as those presented in class. You should concentrate on "connectivity first": get the system to work end-to-end first, and then work on refinements.
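As one illustration of "connectivity first", the sketch below is a minimal extractive baseline: it scores each sentence by the average corpus frequency of its words and greedily fills a word budget. It is not a prescribed approach. The function names, the naive regex sentence splitter, and the 100-word default budget are all assumptions made for this example, not requirements stated in this section.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercased alphanumeric tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def summarize(documents, max_words=100):
    """Pick high-frequency sentences until the (assumed) word budget is used."""
    sentences = [s for doc in documents for s in split_sentences(doc)]
    freqs = Counter(tok for s in sentences for tok in tokenize(s))

    def score(sentence):
        toks = tokenize(sentence)
        return sum(freqs[t] for t in toks) / len(toks) if toks else 0.0

    # Rank by descending score; the original position breaks ties.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: (-score(sentences[i]), i))

    chosen, used = [], 0
    for i in ranked:
        n_words = len(sentences[i].split())
        if used + n_words <= max_words:
            chosen.append(i)
            used += n_words

    # Emit the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]

docs = [
    "The storm hit the coast. Many homes lost power.",
    "The storm closed roads. Officials urged caution.",
]
summary = summarize(docs, max_words=8)
```

Once a baseline like this runs end-to-end, the scoring function is the natural place to swap in one of the content selection methods from class.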

High-level System Behavior

We will be focusing on the TAC (Text Analysis Conference) summarization shared task. Your system will:

Document Sets

Document sets to be summarized are provided in NIST standard XML files, identifying a set of topics consisting of:

NOTE: You should only evaluate your system on the 'A' document sets. The 'B' sets were designed to evaluate so-called "update" summaries.

NOTE: The format of the document IDs and the organization of the two corpora exhibit some differences. You may include/hardcode information about this structure in your system directly, or in a configuration file as you choose.
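One way to keep the corpus differences in a single place is a small table of ID patterns, as sketched below. The two ID shapes shown (an older compact form and a newer underscore-separated form) are assumptions for illustration and should be verified against the actual document sets. `ID_PATTERNS`, `DocRef`, and `parse_doc_id` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Assumed ID shapes (verify against the corpora before relying on them):
#   older style:  SRCYYYYMMDD.NNNN        e.g. "APW19990421.0284"
#   newer style:  SRC_LNG_YYYYMMDD.NNNN   e.g. "APW_ENG_20041007.0256"
ID_PATTERNS = {
    "AQUAINT": re.compile(
        r"^(?P<src>[A-Z]{3})(?P<date>\d{8})\.(?P<seq>\d{4})$"),
    "AQUAINT-2": re.compile(
        r"^(?P<src>[A-Z]{3})_(?P<lang>[A-Z]{3})_(?P<date>\d{8})\.(?P<seq>\d{4})$"),
}

@dataclass
class DocRef:
    corpus: str   # which collection the ID appears to belong to
    source: str   # three-letter source code taken from the ID
    year: str
    month: str
    day: str
    doc_id: str

def parse_doc_id(doc_id):
    """Map a document ID to the corpus and date fields needed to locate it."""
    for corpus, pattern in ID_PATTERNS.items():
        match = pattern.match(doc_id)
        if match:
            date = match.group("date")
            return DocRef(corpus, match.group("src"),
                          date[:4], date[4:6], date[6:], doc_id)
    raise ValueError(f"Unrecognized document ID: {doc_id!r}")

old_ref = parse_doc_id("APW19990421.0284")
new_ref = parse_doc_id("APW_ENG_20041007.0256")
```

The same table could equally live in a configuration file. The design goal is only that the rest of the system never needs to know which corpus a document came from.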

Summary Outputs

Your system should produce one summary output file per document set (topic), structured as described below:

Evaluation

You will employ the standard automatic ROUGE method to evaluate the results from your base end-to-end summarization system. Code implementing the ROUGE metric is provided in /dropbox/16-17/573/code/ROUGE/ROUGE-1.5.5.pl. Example configuration files are given. You will need to modify the configuration file to reference your own system's summary output.
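Since the configuration file must point at your own outputs, it may be easier to generate it than to edit it by hand. The sketch below builds a file in the general shape of a ROUGE-1.5.5 XML configuration. The provided example configuration files remain the authoritative reference for the expected layout. The function name, directory names, and file names here are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_rouge_config(peer_root, model_root, evals):
    """Build a ROUGE-style XML config string.

    evals maps an evaluation ID to a (peers, models) pair, where each
    is a dict of {ID: filename} relative to the corresponding root.
    """
    root = ET.Element("ROUGE_EVAL", version="1.5.5")
    for eval_id, (peers, models) in evals.items():
        ev = ET.SubElement(root, "EVAL", ID=eval_id)
        ET.SubElement(ev, "PEER-ROOT").text = peer_root
        ET.SubElement(ev, "MODEL-ROOT").text = model_root
        # "SPL" declares plain-text summaries with one sentence per line.
        ET.SubElement(ev, "INPUT-FORMAT", TYPE="SPL")
        peers_el = ET.SubElement(ev, "PEERS")
        for peer_id, filename in peers.items():
            ET.SubElement(peers_el, "P", ID=peer_id).text = filename
        models_el = ET.SubElement(ev, "MODELS")
        for model_id, filename in models.items():
            ET.SubElement(models_el, "M", ID=model_id).text = filename
    # Write explicit open/close tags rather than self-closing elements.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)

# Hypothetical topic and file names, for illustration only.
config_xml = build_rouge_config(
    "outputs", "models",
    {"topic1": ({"1": "topic1.system"}, {"A": "topic1.modelA"})},
)
```

Writing `config_xml` to a file yields a configuration that can be compared against the provided examples before it is passed to the ROUGE script.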

Files

Training, Development Test, and Example Files

We will use one year's data as development test data (devtest) for most of the term, and then use a new unseen year's data as final evaltest in the last deliverable.

Primary Document Collections

The AQUAINT and AQUAINT-2 Corpora have been employed as the document collections for the summarization task for a number of years, and will form the basis of summarization for this deliverable. The collections can be found on patas in

Core Files

The core training, development test, and evaluation files can be found in /dropbox/16-17/573/ on the CL cluster.

<file_set_type> ranges over:

Training Data

You may use any of the DUC or TAC summarization data through 2009 for training and developing your system. For previous years, there are prepared document sets and model summaries to allow you to train and tune your summarization system.

Development Test Data

For Deliverables 2 and 3, you should evaluate on the TAC-2010 topic-oriented document sets and their corresponding model summaries. You should only evaluate your system on the 'A' sets. Development test data appears in the devtest subdirectories.

Evaluation Example Files

A variety of example files are provided to help you familiarize yourself with the ROUGE evaluation software.

Specific Submission Files

In addition to your source code and resources needed to support your system, your repository should include the following:

Your System Generated Summarization Files

Extended Project Report

../doc/D2.pdf: This extended version should include content for all the sections of the report (no more lorem ipsums), though some of it will not be very detailed yet. You should specifically focus on the following:

Presentation

../doc/D2_presentation.{pdf|pptx|etc}: Your presentation may be prepared in any computer-projectable format, including HTML, PDF, PPT, and Word. Your presentation should take about 10 minutes to cover your main content, including:

Your presentation should be deposited in your doc directory, but it is not due until the actual presentation time. You may continue working on it after the main deliverable is due.

Summary