Linguistics 580: Computational Methods in Linguistic Analysis

Laboratory Exercise 7

Computational Methods in Linguistics (Bender/Wassink)

Goals:

Gain hands-on experience with computing inter-annotator agreement
(Work with XML in python)
Understand how to transform annotation files to the required form
Explore differences in annotation between annotators
Reflect on implications for the reliability of annotated data

Steps:

Each of these is expanded below.

Calculate Cohen's kappa and Krippendorff's alpha scores and the confusion matrix for a toy example
Calculate kappa scores for your Lab 6 + your practicum groupmate's Lab 6 (for MUC7 named entity tags only), as follows:
- Whether the same words were tagged as entities
- For each word tagged by both annotators, whether the same types were chosen
- Both combined
Calculate the confusion matrix
Calculate how many times each annotator used each label
Print out points of disagreement, with enough info that they can be examined in the ELAN file

Calculate kappa, alpha and confusion matrix for a toy example

In this step, we'll use the nltk.metrics.agreement module, which is partly documented here. This module takes in data in the form of a list of triples, where each triple contains an annotator label, an item label and a tag. For example, the following is a snippet of what our data could look like:

[['1', 5723, 'ORG'],
 ['2', 5723, 'ORG'],
 ['1', 55829, 'LOC'],
 ['2', 55829, 'LOC'],
 ['1', 259742, 'PER'],
 ['2', 259742, 'LOC'],
 ['1', 269340, 'PER'],
 ['2', 269340, 'LOC']]

Here we have four items, each labeled by two different annotators. In two cases, the annotators agree. In two cases they don't.

Using the Python interpreter and the nltk metrics package, calculate inter-annotator agreement (both kappa and alpha) for this example. Note that AnnotationTask is a type of object, with methods kappa() and alpha(). When you call nltk.metrics.agreement.AnnotationTask() it returns an object of that type, which in the example below is stored in the variable task.

import nltk
toy_data = [['1', 5723, 'ORG'],['2', 5723, 'ORG'],['1', 55829, 'LOC'],['2', 55829, 'LOC'],['1', 259742, 'PER'],['2', 259742, 'LOC'],['1', 269340, 'PER'],['2', 269340, 'LOC']]
task = nltk.metrics.agreement.AnnotationTask(data=toy_data)
task.kappa()
task.alpha()
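To see what the kappa score is measuring, it can help to compute it by hand for the toy data. The following is a minimal sketch, not the nltk implementation: it assumes exactly two annotators and uses the standard Cohen's kappa formula, (observed - expected) / (1 - expected), where expected agreement comes from each annotator's own label frequencies. The function name cohen_kappa is made up for this illustration.

```python
from collections import Counter

toy_data = [['1', 5723, 'ORG'], ['2', 5723, 'ORG'],
            ['1', 55829, 'LOC'], ['2', 55829, 'LOC'],
            ['1', 259742, 'PER'], ['2', 259742, 'LOC'],
            ['1', 269340, 'PER'], ['2', 269340, 'LOC']]

def cohen_kappa(triples, coder_a, coder_b):
    """Cohen's kappa for two coders over [coder, item, label] triples."""
    labels_a = {item: label for coder, item, label in triples if coder == coder_a}
    labels_b = {item: label for coder, item, label in triples if coder == coder_b}
    # Only items labeled by both coders can be compared.
    items = sorted(set(labels_a) & set(labels_b))
    n = float(len(items))
    # Observed agreement: fraction of items with identical labels.
    observed = sum(labels_a[i] == labels_b[i] for i in items) / n
    # Expected agreement: chance that both coders pick the same label,
    # given how often each coder used each label.
    freq_a = Counter(labels_a[i] for i in items)
    freq_b = Counter(labels_b[i] for i in items)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in freq_a)
    return (observed - expected) / (1.0 - expected)

kappa = cohen_kappa(toy_data, '1', '2')
print(round(kappa, 4))
```

Here the observed agreement is 2/4 and the expected agreement is 1/16 + 3/16 = 1/4, so kappa is (0.5 - 0.25) / 0.75. You can compare this value with what task.kappa() reports.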

The nltk metrics package also provides for calculating and printing confusion matrices, a way of displaying which labels were 'mistaken' for which other ones. Unfortunately, this functionality requires a different format for the input. In particular, it wants two lists of labels (in the same order).

import nltk  # No need to do this twice in the same Python session
toy1 = ['ORG','LOC','PER','PER']
toy2 = ['ORG','LOC','LOC','LOC']
cm = nltk.metrics.ConfusionMatrix(toy1,toy2)
print(cm)
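A confusion matrix is just a count of how often each pair of labels occurs at the same position in the two lists. The sketch below builds those counts with plain Python so you can see what the nltk object is summarizing; the variable names pair_counts and labels are made up for this illustration, and the layout is not the same as nltk's printed format.

```python
from collections import Counter

toy1 = ['ORG', 'LOC', 'PER', 'PER']
toy2 = ['ORG', 'LOC', 'LOC', 'LOC']

# Count each (annotator 1 label, annotator 2 label) pair.
pair_counts = Counter(zip(toy1, toy2))
labels = sorted(set(toy1) | set(toy2))

# Rows are annotator 1's labels, columns are annotator 2's labels.
print('\t' + '\t'.join(labels))
for row in labels:
    cells = [str(pair_counts[(row, col)]) for col in labels]
    print(row + '\t' + '\t'.join(cells))
```

Cells on the diagonal are agreements; everything off the diagonal is a disagreement of a particular kind (here, PER vs. LOC twice).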

calculate-iaa.py

The rest of the lab will be done by writing a script to take the two .eaf files, extract the annotations from them and format them as nltk.metrics.agreement expects, and calculate the two measures. In addition, this script will print out the points of disagreement so you can examine them by hand.
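As a rough illustration of the disagreement listing, the sketch below assumes each annotator's data has already been reduced to a dictionary from annotation ID to a (word, tag) pair, with None standing for "no MUC tag". That data structure, the sample values, and the function name find_disagreements are all assumptions made for this example; the starter script may organize things differently.

```python
# Hypothetical data: annotation ID -> (word, tag); None means no MUC tag.
ann1 = {'a1': ('Seattle', 'LOC'), 'a2': ('Boeing', 'ORG'),
        'a3': ('Kim', 'PER'), 'a4': ('said', None)}
ann2 = {'a1': ('Seattle', 'LOC'), 'a2': ('Boeing', 'LOC'),
        'a3': ('Kim', None), 'a4': ('said', None)}

def find_disagreements(first, second):
    """Return (id, word, tag1, tag2) for every item the annotators tagged differently."""
    rows = []
    for item_id in sorted(set(first) & set(second)):
        word, tag1 = first[item_id]
        tag2 = second[item_id][1]
        if tag1 != tag2:
            rows.append((item_id, word, tag1, tag2))
    return rows

disagreements = find_disagreements(ann1, ann2)
for item_id, word, tag1, tag2 in disagreements:
    # The ID and the word together make the spot easy to locate in ELAN.
    print('%s\t%s\t%s\t%s' % (item_id, word, tag1, tag2))
```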

Trade .eaf files with your practicum group partner, so that you have two. Open these up with a text editor, and look at their structure. Note what kind of information is where in the file, and what happened with words that didn't get MUC annotations.

Download the starter script (calculate-iaa.py), and read through the comments and existing code to get a sense of what the script is doing, and what it is asking you to do. As before, the symbol

# ~*~

indicates a place where you need to fill in code to implement what's in the comment above.

The main subtasks of the script are as follows:

(Extract the information we're interested in out of the .eaf files (XML). --- This has been done for you, but it is worth looking through that part of the code to see how it works.)
Format the information as annotation tasks (differently depending on what we're measuring)
Calculate kappa and alpha for each task (and print to output file)
Calculate and print the confusion matrix
Calculate and print the number of times each annotator used each label
Print list of differing annotations

Since we need to extract the information from two files the same way, and since we need to format the information for three different tasks, the first two subtasks above are conceptualized as subroutines in the model script. There are a couple of other subroutines to define at the top of the script. You can wait to define those until you've reached the portion of the main body of the script that uses them.
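One way to think about the three formatting tasks is as three views of the same pair of tag dictionaries. The sketch below is only an illustration under stated assumptions: it assumes each annotator's tags are stored as a dictionary from item ID to tag (None for untagged words), and the function name make_triples, the mode names, and the placeholder labels 'TAGGED' and 'NONE' are all invented here rather than taken from the starter script.

```python
# Hypothetical input: item ID -> tag, with None for words left untagged.
tags1 = {'a1': 'LOC', 'a2': 'ORG', 'a3': 'PER', 'a4': None}
tags2 = {'a1': 'LOC', 'a2': 'LOC', 'a3': None, 'a4': None}

def make_triples(first, second, mode):
    """Build [annotator, item, label] triples for one of three comparisons."""
    triples = []
    for item in sorted(set(first) & set(second)):
        t1, t2 = first[item], second[item]
        if mode == 'extent':
            # Only ask whether each annotator tagged the word at all.
            t1 = 'TAGGED' if t1 is not None else 'NONE'
            t2 = 'TAGGED' if t2 is not None else 'NONE'
        elif mode == 'type':
            # Only compare types on words that both annotators tagged.
            if t1 is None or t2 is None:
                continue
        elif mode == 'combined':
            # Treat "untagged" as one more label alongside the real types.
            t1 = t1 if t1 is not None else 'NONE'
            t2 = t2 if t2 is not None else 'NONE'
        else:
            raise ValueError('unknown mode: %s' % mode)
        triples.append(['1', item, t1])
        triples.append(['2', item, t2])
    return triples

extent = make_triples(tags1, tags2, 'extent')
types = make_triples(tags1, tags2, 'type')
combined = make_triples(tags1, tags2, 'combined')
print(types)
```

Each returned list has the same shape as toy_data above, so it could be passed as the data argument when creating an AnnotationTask.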

Useful python

This script makes heavy use of xml.etree for handling XML, but the assignment does not ask you to modify that part of the code.
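If you want a feel for what that part of the code is doing, here is a small self-contained sketch using xml.etree.ElementTree on a heavily simplified, made-up fragment in the style of an .eaf file. The tier names "words" and "MUC" are assumptions for this example, and a real .eaf file contains much more (header, time slots, linguistic types), so treat this as an illustration and not as a description of the starter script.

```python
import xml.etree.ElementTree as ET

# Simplified, invented sample; tier names here are assumptions.
SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="words">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>Seattle</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>said</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="MUC" PARENT_REF="words">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a3" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>LOC</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

root = ET.fromstring(SAMPLE)

# Word annotations: annotation ID -> word.
words = {}
for ann in root.findall("./TIER[@TIER_ID='words']/ANNOTATION/ALIGNABLE_ANNOTATION"):
    words[ann.get('ANNOTATION_ID')] = ann.findtext('ANNOTATION_VALUE')

# MUC annotations point back at the word they label via ANNOTATION_REF.
muc = {}
for ann in root.findall("./TIER[@TIER_ID='MUC']/ANNOTATION/REF_ANNOTATION"):
    muc[ann.get('ANNOTATION_REF')] = ann.findtext('ANNOTATION_VALUE')

# Words without a MUC annotation come out as None.
tagged = dict((word_id, muc.get(word_id)) for word_id in words)
print(tagged)
```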

It also uses nltk.metrics. In addition to the notes above, the final thing you need to know is how to get the confusion matrix to print with the write() method on the output file object. Confusion matrix objects have a pp() method ("pretty print") which works together with the write method as follows:

outfile.write(cm.pp())

Appending to a list: You can concatenate two lists like so:

list1 = [1, 2, 3]
list2 = [4, 5, 6]
list1 = list1 + list2

To append a single element to a list, wrap it in a list first:

list1 = list1 + [7]

Getting the keys to a dictionary, sorted: The sorted() function is helpful here. Applied to a dictionary, it returns a sorted list of the dictionary's keys:

dict1 = {1:'a', 5:'b', 2:'c', 9:'a'}
sorted(dict1)
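Sorted keys are handy when printing the per-annotator label counts, since they keep the output in a stable order. The sketch below is one possible approach using collections.Counter on the toy label lists from earlier; the variable name counts and the output layout are choices made for this example only.

```python
from collections import Counter

toy1 = ['ORG', 'LOC', 'PER', 'PER']
toy2 = ['ORG', 'LOC', 'LOC', 'LOC']

# One Counter per annotator: label -> number of times used.
counts = {'1': Counter(toy1), '2': Counter(toy2)}

for annotator in sorted(counts):
    print('Annotator ' + annotator)
    for label in sorted(counts[annotator]):
        print('  %s\t%d' % (label, counts[annotator][label]))
```

A label that one annotator never used simply does not appear under that annotator, which is itself useful to notice when comparing the two.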

To turn in

Your calculate-iaa.py script
The output file of your calculate-iaa.py script.
Answers to the following questions:
- Which of the ways of measuring inter-annotator agreement (kappa, alpha) do you think best reflects the consistency of the annotations, and why?
- What kinds of patterns of disagreement do you find? What refinements to the annotation guidelines would you propose to reduce the amount of disagreement?
- How does this experience affect your perception of the reliability of annotated data? What would you look for in the description of a dataset in order to gauge its reliability?