Ling 573 - Natural Language Processing Systems and Applications
Spring 2014
Deliverable #2: Baseline Question-Answering System
Code and Results due: April 25, 2014: 23:59
Updated Project Report due: April 29, 2014: 09:00
Goals
In this deliverable, you will implement a baseline question-answering
system. You will
- Create an end-to-end question-answering system going from questions to short answer passages, each supported by a document from the AQUAINT corpus.
- Implement a document retrieval component, based on techniques discussed in class and in the readings, to identify supporting documents.
- Identify the resources - software and corpus - that can support this task.
Baseline end-to-end system
For this deliverable, you will need to implement a baseline end-to-end system. You should build on approaches presented in class and readings.
You may implement any effective strategy, but you are
encouraged to implement a redundancy-based or web-based boosting
strategy such as those in the AskMSR or ARANEA systems. Your system should include:
- Simple query formulation, to prepare the TREC 'Question' for web search
- Search to retrieve passages/snippets
- Answer extraction, such as the n-gram generation and filtering approaches presented in lecture (a minimal sketch appears at the end of this section)
Since this is a baseline system, it is not expected that your system will be
as elaborate as those presented in class. You should concentrate on "connectivity first": get the system working end-to-end first, and then work on refinements.
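To make the n-gram generation and filtering idea concrete, here is a minimal sketch of a redundancy-based answer extractor. It assumes you already have a list of retrieved snippets and the raw question string; the function name and parameters are illustrative, and a fuller AskMSR-style system would also weight candidates by the query rewrite that produced them and filter by expected answer type.

    # Minimal redundancy-based n-gram answer extraction (illustrative sketch).
    # `snippets` are retrieved passages/snippets; `question` is the raw TREC question.
    from collections import Counter
    import re

    def extract_candidates(snippets, question, max_n=3, top_k=20):
        question_words = set(re.findall(r"\w+", question.lower()))
        counts = Counter()
        for snippet in snippets:
            tokens = re.findall(r"\w+", snippet.lower())
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    gram = tokens[i:i + n]
                    # drop candidates made up entirely of question words
                    if all(tok in question_words for tok in gram):
                        continue
                    counts[" ".join(gram)] += 1   # redundancy vote
        return [gram for gram, _ in counts.most_common(top_k)]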
Retrieval
For this deliverable, you will need to implement a retrieval system. TREC QA 'strict' evaluation requires document collection
support for all answers. Thus, web-based answers can be projected onto documents in the collection, and collection-retrieved results can be confirmed or
reranked based on web-based results. You may build on techniques presented in class,
described in the reading list, and proposed in other research articles.
Your system must include indexing and retrieval based on
a standard IR engine, such as those described in the resource list (a minimal indexing and passage-selection sketch appears after the list below).
In addition, your system may exploit:
- Query expansion: semantically-based, corpus-based, or based on
your redundancy-based results
- Passage selection, such as sentence-based retrieval, fixed window-based retrieval, or some other segmentation approach
- Passage reranking: redundancy-based, heuristic, rule-based, or machine-learning based
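As a sketch only, the fragment below shows the two retrieval steps in miniature: build an index over the collection, then search it and perform a crude sentence-based passage selection. It uses Whoosh purely as an example of a standard IR engine (substitute whichever engine you choose from the resource list); the document number and text are placeholders, and parsing of the AQUAINT SGML files is not shown.

    # Illustrative indexing + sentence-based passage selection using Whoosh.
    # Placeholders only; real code would iterate over the parsed AQUAINT files.
    import os, re
    from whoosh.index import create_in
    from whoosh.fields import Schema, ID, TEXT
    from whoosh.qparser import QueryParser, OrGroup

    schema = Schema(docno=ID(stored=True), text=TEXT(stored=True))
    os.makedirs("aquaint_index", exist_ok=True)
    ix = create_in("aquaint_index", schema)

    writer = ix.writer()
    writer.add_document(docno="PLACEHOLDER.0001",
                        text="Placeholder article text about the question topic.")
    writer.commit()

    def retrieve_passages(question, top_docs=10):
        q_terms = re.findall(r"\w+", question.lower())
        passages = []
        with ix.searcher() as searcher:
            query = QueryParser("text", ix.schema, group=OrGroup).parse(" ".join(q_terms))
            for hit in searcher.search(query, limit=top_docs):
                for sentence in hit["text"].split(". "):   # crude sentence split
                    overlap = len(set(q_terms) & set(sentence.lower().split()))
                    passages.append((hit["docno"], sentence, overlap))
        # highest question-term overlap first; a reranker could replace this
        return sorted(passages, key=lambda p: p[2], reverse=True)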
Data
Document Collection
The AQUAINT Corpus was employed as the document
collection for the question-answering task for a number of years,
and will form the basis of retrieval for this deliverable.
The collection can be found on patas in /corpora/LDC/LDC02T31/.
Training Data
You may use any of the TREC question collections through 2005
for training your system. For 2003, 2004, and 2005 there are prepared gold standard
documents and answer patterns to allow you to train and tune your
Q/A system.
All pattern files appear in /dropbox/13-14/573/Data/patterns.
All question files appear in /dropbox/13-14/573/Data/Questions.
Training data appear in the training subdirectories.
Development Test Data
You should evaluate on the TREC-2006 questions and their corresponding documents and answer string patterns. You are only required to test on the factoid questions. Development test data appears in the devtest subdirectories.
Evaluation
You will employ the standard mean reciprocal rank (MRR) measure to evaluate
the results from your baseline end-to-end question-answering system.
These scores should be placed in files called D2.results_strict and D2.results_lenient in the results directory. A simple script for calculating MRR based on the Litkowski pattern files
and your outputs is provided in /dropbox/13-14/573/code/compute_mrr.py.
It should be called as follows:
python2.6 compute_mrr.py pattern_file D2.outputs {type} where
- pattern_file is the factoid Litkowski pattern file,
- D2.outputs is your passage retrieval output file, and
- type is "strict" or "lenient". If you omit the type, it will default to "strict".
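For reference, MRR gives each question the reciprocal rank of its first correct answer (zero if no returned candidate matches a pattern), averaged over all questions. The sketch below is not the provided compute_mrr.py; it only illustrates the measure, assuming answers and patterns have already been loaded into dictionaries keyed by question ID.

    # Illustrative MRR computation (not the provided compute_mrr.py).
    import re

    def mean_reciprocal_rank(answers_by_qid, patterns_by_qid):
        # answers_by_qid: question id -> ranked answer strings (best first)
        # patterns_by_qid: question id -> regex patterns from the pattern file
        total = 0.0
        for qid, answers in answers_by_qid.items():
            patterns = [re.compile(p) for p in patterns_by_qid.get(qid, [])]
            for rank, answer in enumerate(answers, start=1):
                if any(p.search(answer) for p in patterns):
                    total += 1.0 / rank   # only the first correct answer counts
                    break
        return total / len(answers_by_qid) if answers_by_qid else 0.0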
Outputs
Create one output file in the outputs directory, based on running your baseline question-answering system on
the test data file.
You should do this as follows:
- Answer Extraction and Ranking
- You should return the top 20 answer candidates, where a candidate is
no longer than 250 characters in length. The required format for
the answer extraction phase appears here.
The file should be named D2.outputs and should appear in the outputs directory.
Extending the project report
This extended version should include all the sections from the
original report (with many still as stubs) and additionally
include the following new material:
- Approach
- System architecture
- Query processing
- Retrieval
- Answer candidate extraction and ranking
- Evaluation
- Baseline results: this subsection should describe the results of your baseline system, using both strict and lenient measures. Some error analysis will
help to motivate future improvements.
Please name your report D2.pdf.
Presentation
Your presentation may be prepared in any computer-projectable format,
including HTML, PDF, PPT, and Word. Your presentation should take
about 10 minutes to cover your main content, including:
- System architecture
- Query processing
- Retrieval
- Answer candidate extraction and ranking
- Issues and successes
- Related reading which influenced your approach
Your presentation should be deposited in your doc directory,
but it is not due until the actual presentation time. You may continue
working on it after the main deliverable is due.
Summary
- Finish coding and document all code.
- Verify that all code runs effectively on patas using Condor.
- Add any specific execution or other notes to a README.
- Create your D2.pdf and add it to the doc directory.
- Verify that all components have been added and any changes checked in.