Ling 573 - Natural Language Processing Systems and Applications
Spring 2017
Deliverable #4: Final Summarization Systems
Code and Results due: May 28, 2017: 23:45
Final Report due: June 5, 2017: 17:00
Goals
In this deliverable, you will complete development of your summarization
system. You will
- Refine and finalize your end-to-end summarization system.
- Improve content realization, for improved content or enhanced readability.
- Exploit information from any source to improve your overall system.
- Perform final evaluation on a held-out test set, involving new document corpora and new test topics.
System Enhancement
This final deliverable must include substantive enhancements beyond your baseline system and further extensions over your D3 system.
Content Realization
For this deliverable, one focus will be on improving your system through enhanced content realization.
Content realization can address:
- sentence compression to remove extraneous content, either before or
after content selection, or
- sentence reformulation focusing on enhancing readability.
You may build on techniques presented in class, described in the reading
list, and proposed in other research articles.
We will also be conducting a manual readability evaluation
in addition to the ROUGE content scoring, to give improvements in this
area due credit.
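As one illustration of the compression option above, the sketch below drops parentheticals and a single comma-delimited, appositive-like insertion from a sentence. This is a deliberately naive, rule-based example, not a recommended approach; a real system would use syntactic parses or a learned compression model.

```python
import re

def compress_sentence(sentence):
    """Naively compress a sentence by dropping parentheticals and one
    comma-delimited, appositive-like insertion. Illustrative only."""
    # Drop parenthetical asides: "(born 1956)" etc.
    compressed = re.sub(r"\s*\([^)]*\)", "", sentence)
    # Drop a single comma-delimited insertion such as an appositive:
    # "X, a lawyer from Ohio, said ..." -> "X said ..."
    compressed = re.sub(r", [^,]+,", "", compressed, count=1)
    # Normalize any whitespace left behind by the deletions.
    return re.sub(r"\s+", " ", compressed).strip()
```

For example, `compress_sentence("John Smith, a lawyer from Ohio (born 1956), criticized the ruling.")` yields a shorter sentence with both insertions removed. Note that regex-based deletion can easily damage grammaticality, which is exactly what the manual readability evaluation will penalize.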
High-level System Behavior
We will be focusing on the TAC (Text Analysis Conference) summarization shared task.
Your system will:
- Perform multi-document summarization:
- Over newswire document sets, where each set is associated with a topic, as specified below.
- For each document set:
- Produce one high quality text summary.
- Evaluate the summaries output by your system with respect to human model summaries, using the standard ROUGE metrics.
Document Sets
Document sets to be summarized are provided in NIST standard XML files, identifying a set of topics consisting of:
- topic title,
- (in some files) topic narrative,
- docsetA, and
- docsetB, where
- docsets provide sets of document ids, referring to documents in the AQUAINT, AQUAINT-2, and Gigaword (see below) corpora, available on patas.
- Document IDs specify:
- publication source: e.g. APW, NYT
- publication date: as YYYYMMDD
- detail specifier: a digit sequence
NOTE: You should only evaluate your system on the 'A' document sets. The 'B' sets were designed to evaluate so-called
"update" summaries.
NOTE: The format of the document IDs and the organization of the three corpora exhibit some differences. You may include/hardcode information
about this structure in your system directly, or in a configuration
file as you choose.
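A document-ID parser following the field breakdown above might look like the sketch below. The two ID shapes shown in the comment are assumptions based on common AQUAINT/Gigaword conventions; verify them against the actual corpora on patas before hardcoding anything.

```python
import re

# Assumed ID shapes (verify against the actual corpora):
#   AQUAINT-style:            APW19980601.0007
#   AQUAINT-2/Gigaword-style: APW_ENG_20041001.0123
DOC_ID = re.compile(r"""
    (?P<source>[A-Z]+)          # publication source, e.g. APW, NYT
    (?:_(?P<lang>[A-Z]+))?      # optional language tag (AQUAINT-2/Gigaword)
    _?(?P<date>\d{8})           # publication date as YYYYMMDD
    \.(?P<detail>\d+)           # detail specifier: a digit sequence
""", re.VERBOSE)

def parse_doc_id(doc_id):
    """Split a document ID into (source, date, detail specifier)."""
    m = DOC_ID.fullmatch(doc_id)
    if m is None:
        raise ValueError("unrecognized document id: %s" % doc_id)
    return m.group("source"), m.group("date"), m.group("detail")
```

The publication date embedded in the ID also tells you which corpus (AQUAINT, AQUAINT-2, or Gigaword) holds the document, given the year ranges listed under Files below.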
Summary Outputs
Your system should produce one summary output file per document set (topic),
structured as described below:
- Each summary can be no longer than 100 words (whitespace-delimited tokens). Summaries over the size limit will be truncated.
- Each summary should be well-organized, in English, using complete sentences. It should have one sentence per line. (Other formats can be used, but require modifications to the scoring configuration.)
A blank line may be used to separate paragraphs, but no other formatting is allowed (such as bulleted points, tables, bold-face type, etc.).
- Summaries should be based only on the 'A' group of documents for each
of the topics in the specification file.
- All processing of documents and generation of summaries must be automatic.
- Please include a file for each summary, even if the file is empty.
- Each file will be read and assessed as a plain text file, so no special characters or markups are allowed.
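A helper that enforces the output constraints above (one sentence per line, at most 100 whitespace-delimited tokens) might be sketched as follows; truncating in your own code avoids losing credit to the scorer's hard cutoff mid-sentence in ways you did not anticipate.

```python
def format_summary(sentences, max_words=100):
    """Emit one sentence per line, truncating to at most max_words
    whitespace-delimited tokens overall (matching the scoring limit)."""
    lines, used = [], 0
    for sentence in sentences:
        tokens = sentence.split()
        if used + len(tokens) > max_words:
            tokens = tokens[:max_words - used]   # truncate mid-sentence
        if tokens:
            lines.append(" ".join(tokens))
        used += len(tokens)
        if used >= max_words:
            break
    return "\n".join(lines) + "\n"
```

A better content-selection strategy is of course to choose sentences that fit the budget, and use truncation only as a final guard.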
Evaluation
You will employ the standard automatic ROUGE method to evaluate
the results from your base end-to-end summarization system.
- You should provide results for ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-4, which have, in aggregate, been shown to correlate well with human assessments of responsiveness. This can be done with the "-n 4" switch in ROUGE.
Code implementing the ROUGE metric
is provided in /dropbox/16-17/573/code/ROUGE/ROUGE-1.5.5.pl. Example
configuration files are given. You will need to modify the configuration
file to reference your own system's summary output.
- You will need to change the "PEER-ROOT" to point to your own outputs.
- You will also need to adjust the "PEERS" filenames to handle differences in file naming.
- If you choose to develop on an alternative data set, you will need to make similar changes to the "MODEL" specifications.
- You should use the following flag settings for your official evaluation runs:
-e ROUGE_DATA_DIR -a -n 4 -x -m -c 95 -r 1000 -f A -p 0.5 -t 0 -l 100 -s -d CONFIG_FILE_WITH_PATH
- where ROUGE_DATA_DIR is /dropbox/16-17/573/code/ROUGE/data
- CONFIG_FILE_WITH_PATH is the location of your revised configuration file
- Output is written to standard output by default.
- Further usage information can be found using the -H flag or invoking ROUGE with
no parameters.
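Rather than editing configuration files by hand for each run, you may find it convenient to generate them. The sketch below builds a minimal ROUGE-1.5.5 configuration with PEER-ROOT, MODEL-ROOT, PEERS, and MODELS entries per topic. The element names and the SPL (sentence-per-line) input format follow the standard ROUGE-1.5.5 XML layout, but verify them against the provided rouge_run_ex.xml before relying on this.

```python
import xml.etree.ElementTree as ET

def make_rouge_config(peer_root, model_root, evals):
    """Build a ROUGE-1.5.5 XML config string.
    evals: list of (eval_id, peer_filename, [(model_id, model_filename), ...])."""
    root = ET.Element("ROUGE-EVAL", version="1.0")
    for eval_id, peer_file, models in evals:
        ev = ET.SubElement(root, "EVAL", ID=eval_id)
        ET.SubElement(ev, "PEER-ROOT").text = peer_root
        ET.SubElement(ev, "MODEL-ROOT").text = model_root
        ET.SubElement(ev, "INPUT-FORMAT", TYPE="SPL")  # one sentence per line
        peers = ET.SubElement(ev, "PEERS")
        ET.SubElement(peers, "P", ID="1").text = peer_file
        model_el = ET.SubElement(ev, "MODELS")
        for model_id, model_file in models:
            ET.SubElement(model_el, "M", ID=model_id).text = model_file
    return ET.tostring(root, encoding="unicode")
```

Pointing peer_root at your own outputs directory takes care of the "PEER-ROOT" and "PEERS" adjustments described above in one place.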
Files
Training, Development Test, Example and Evaluation Test Files
We are focusing on the TAC summarization shared task. Since this is the
final deliverable, you will evaluate not only on the 2010 devtest data you
have used all term, but also on held-out test data.
Devtest Document Collections
The AQUAINT and AQUAINT-2 Corpora have been employed as the document
collections for the summarization task for a number of years,
and will form the basis of summarization for this deliverable.
The collections can be found on patas in
- /corpora/LDC/LDC02T31/ (AQUAINT, 1996-2000) and
- /corpora/LDC/LDC08T25/ (AQUAINT-2, 2004-2006).
Evaltest Document Collection
The held-out document sets for the final evaluation are drawn from the
English Gigaword corpus, from years 2007 and 2008. This collection may be found on patas in:
(Note: Given the size of this corpus, it's
still fine if you use the main corpus as your background corpus.)
Core Files
The core training, development test, and evaluation files can be found in /dropbox/16-17/573/ on the CL cluster.
- /dropbox/16-17/573/Data/Documents/<file_set_type>/*.xml: All document set specification files
- /dropbox/16-17/573/Data/models/<file_set_type>/*: All human-created gold-standard model summary files. The training data sets are further placed in subdirectories by year.
- /dropbox/16-17/573/Data/peers/<file_set_type>/*: All automatically created official submission and baseline system summary files for the corresponding Shared Task event.
<file_set_type> ranges over:
- training
- devtest
- evaltest
Training Data
You may use any of the DUC or TAC summarization data through 2009
for training and developing your system. For previous years, there are prepared document sets and model summaries
to allow you to train and tune your
summarization system.
Development Test Data
You should evaluate on the TAC-2010 topic-oriented document sets and their corresponding model summaries. You should only evaluate your system on the 'A' sets. Development test data appears in the devtest subdirectories.
Evaluation Test Data
You should also evaluate on the TAC-2011 topic-oriented document sets and their corresponding model summaries, again only on the 'A' sets. This evaluation test data appears in the evaltest subdirectories.
Evaluation Example Files
A variety of example files are provided to help you familiarize yourself
with the ROUGE evaluation software.
- /dropbox/16-17/573/code/ROUGE/ROUGE-1.5.5.pl: Script implementing the ROUGE evaluation measure.
- /dropbox/16-17/573/code/ROUGE/rouge_run_ex.xml: Example configuration file to be used with the ROUGE script. The directories and filenames for the model summaries are set correctly for the
TAC 2010 evaluation. You would point them to alternative files/directories
if you wish to use other data, such as the 2009 data.
- /dropbox/16-17/573/code/ROUGE/rouge_run_ex_2011.xml: Example configuration file to be used with the ROUGE script. The directories and filenames for the model summaries are set correctly for the TAC 2011 evaluation (aka evaltest).
- /dropbox/16-17/573/code/ROUGE/rouge_example.out: Output of ROUGE evaluation script using example configuration on example summaries specified below.
- /dropbox/16-17/573/Data/mydata/*: Example summary files for practice runs of the evaluation scripts.
Specific Submission Files
In addition to your source code and resources needed to support your system,
your repository should include the following:
- Dx.cmd: Top-level Condor file, where x is the number of the deliverable, here "4".
- README: File explaining anything we'll need to know
to be able to run and review your system.
Your System Generated Summarization Files
Create two directories under the outputs directory containing the summaries based on running your final summarization system as below:
- .../outputs/D4_devtest/: directory containing the summaries based on running your summarization system on
the devtest data files.
- .../outputs/D4_evaltest/: directory containing the summaries based on running your summarization system on
the evaltest data files.
You should name your output files as:
- Given a topic ID, e.g., D0901A
- Split into:
- id_part1 = D0901, and
- id_part2 = A
- Output file name should be:
[id_part1]-A.M.100.[id_part2].[some_unique_alphanum]
The names must match the peer file names in your ROUGE configuration file.
- .../results/D4_devtest_rouge_scores.out: file containing scores
from running ROUGE evaluation on the summaries from your final summarization system on
the devtest data files.
- .../results/D4_evaltest_rouge_scores.out: file containing scores
from running ROUGE evaluation on the summaries from your final summarization system on
the evaltest data files.
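The file-naming rule above can be sketched as a small helper: the final letter of the topic ID is split off and reinserted as the model-set letter in the output name.

```python
def summary_filename(topic_id, unique_id):
    """Build an output summary filename from a topic ID such as 'D0901A'.
    The final letter is split off: D0901A -> D0901-A.M.100.A.<unique_id>."""
    id_part1, id_part2 = topic_id[:-1], topic_id[-1]
    return "%s-A.M.100.%s.%s" % (id_part1, id_part2, unique_id)
```

Using the same helper when writing summaries and when generating your ROUGE configuration guarantees that the peer file names in the configuration match the files on disk.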
Extended Project Report
../doc/D4.pdf: This final version should include all required sections, as well as a complete system architecture description and proper bibliography including all and only the papers you have actually referenced. See this document for full details. The final version of your project report must explicitly include:
- a substantive error analysis, and
- tables presenting the ROUGE-1 and ROUGE-2
scores of your final system on the devtest and evaltest data and comparisons with
your D3 results. You should present ROUGE recall values: the "average_R" row in the ROUGE evaluation output.
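To populate the score tables, you can pull the "Average_R" rows out of the ROUGE output programmatically. The line shape assumed in the comment below matches typical ROUGE-1.5.5 output, but check it against your own results files.

```python
import re

# Assumed shape of a ROUGE-1.5.5 result line (check against your output):
#   1 ROUGE-1 Average_R: 0.32480 (95%-conf.int. 0.31000 - 0.34000)
AVG_R = re.compile(r"^\S+ (ROUGE-\d+) Average_R: ([\d.]+)")

def recall_scores(rouge_output):
    """Collect the Average_R recall value for each ROUGE-N metric."""
    scores = {}
    for line in rouge_output.splitlines():
        m = AVG_R.match(line)
        if m:
            scores[m.group(1)] = float(m.group(2))
    return scores
```

Running this over both the devtest and evaltest score files gives you the ROUGE-1 and ROUGE-2 recall numbers needed for the comparison tables.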
Presentation
../doc/D4_presentation.{pdf|pptx|etc}: Your presentation may be prepared in any computer-projectable format,
including HTML, PDF, PPT, and Word. Your presentation should take
about 10 minutes to cover your main content, including:
- Improvements in content selection, information ordering, and/or content realization
- Issues and successes
- Related reading which influenced your approach
Your presentation should be deposited in your doc directory,
but it is not due until the actual presentation time. You may continue
working on it after the main deliverable is due.
Summary
- Finish coding and document all code.
- Verify that all code runs successfully on patas using Condor.
- Add any specific execution or other notes to the README.
- Create your D4.pdf and add it to the doc directory.
- Verify that all components have been added and any changes checked in.
- If using GIT, remember to tag your deliverable: D4, for the code/implementation, and D4.1 when you add your report document.