## Laboratory Exercise 7

### Goals:

• Gain hands-on experience with computing inter-annotator agreement
• (Work with XML in python)
• Understand how to transform annotation files to the required form
• Explore differences in annotation between annotators
• Reflect on implications for the reliability of annotated data

### Steps:

Each of these is expanded below.

• Calculate Cohen's kappa and Krippendorff's alpha scores and the confusion matrix for a toy example
• Calculate kappa scores for your Lab 6 + your practicum groupmate's Lab 6 (for MUC7 named entity tags only), as follows:
• Whether the same words were tagged as entities
• For each word tagged by both annotators, whether the same types were chosen
• Both combined
• Calculate the confusion matrix
• Calculate how many times each annotator used each label
• Print out points of disagreement, with enough info that they can be examined in the ELAN file

### Calculate kappa, alpha and confusion matrix for a toy example

In this step, we'll use the nltk.metrics.agreement module, which is partly documented in the NLTK API documentation. This module takes data in the form of a list of triples, where each triple contains an annotator label, an item label, and a tag. For example, the following is a snippet of what our data could look like:

```python
[['1', 5723, 'ORG'],
 ['2', 5723, 'ORG'],
 ['1', 55829, 'LOC'],
 ['2', 55829, 'LOC'],
 ['1', 259742, 'PER'],
 ['2', 259742, 'LOC'],
 ['1', 269340, 'PER'],
 ['2', 269340, 'LOC']]
```

Here we have four items, each labeled by two different annotators. In two cases, the annotators agree. In two cases they don't.

Using the python interpreter and the nltk metrics package, calculate inter-annotator agreement (both kappa and alpha) for this example. Note that AnnotationTask is a type of object, with methods kappa() and alpha(). When you call nltk.metrics.AnnotationTask() it returns an object of that type, which in the example below is stored in the variable task.

```python
import nltk

toy_data = [['1', 5723, 'ORG'], ['2', 5723, 'ORG'],
            ['1', 55829, 'LOC'], ['2', 55829, 'LOC'],
            ['1', 259742, 'PER'], ['2', 259742, 'LOC'],
            ['1', 269340, 'PER'], ['2', 269340, 'LOC']]
task = nltk.metrics.AnnotationTask(data=toy_data)
print(task.kappa())
print(task.alpha())
```

The nltk metrics package also provides for calculating and printing confusion matrices, a way of displaying which labels were 'mistaken' for which other ones. Unfortunately, this functionality requires a different format for the input. In particular, it expects two parallel lists of labels, one per annotator, in the same item order.

```python
import nltk  # No need to do this twice in the same python session

toy1 = ['ORG', 'LOC', 'PER', 'PER']
toy2 = ['ORG', 'LOC', 'LOC', 'LOC']
cm = nltk.metrics.ConfusionMatrix(toy1, toy2)
print(cm)
```

### calculate-iaa.py

The rest of the lab will be done by writing a script to take the two .eaf files, extract the annotations from them and format them as nltk.metrics.agreement expects, and calculate the two measures. In addition, this script will print out the points of disagreement so you can examine them by hand.

Trade .eaf files with your practicum group partner, so that you have two. Open these up with a text editor, and look at their structure. Note what kind of information is where in the file, and what happened with words that didn't get MUC annotations.

Download the starter script (calculate-iaa.py), and read through the comments and existing code to get a sense of what the script is doing, and what it is asking you to do. As before, the symbol

`# ~*~`

indicates a place where you need to fill in code to implement what's in the comment above.

The main subtasks of the script are as follows:

1. Extract the information we're interested in from the .eaf files (XML). (This has been done for you, but it is worth looking through that part of the code to see how it works.)
2. Format the information as annotation tasks (differently depending on what we're measuring)
3. Calculate kappa and alpha for each task (and print to output file)
4. Calculate and print the confusion matrix
5. Calculate and print the number of times each annotator used each label
6. Print list of differing annotations

Since we need to extract the information from two files in the same way, and since we need to format the information for three different tasks, the first two subtasks above are conceptualized as subroutines in the model script. There are a couple of other subroutines to define at the top of the script. You can wait to define those until you've reached the portion of the main body of the script that uses them.
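To make the triple format concrete, the "format as annotation task" subroutine for the first measure (whether the same words were tagged as entities at all) might look roughly like the sketch below. The function name `make_task_data` and the word-index dictionaries are hypothetical illustrations, not the starter script's actual identifiers:

```python
# Hypothetical sketch: names and data layout are illustrative only.
def make_task_data(tags1, tags2):
    """Build the [annotator, item, label] triples nltk.metrics expects.

    tags1/tags2 map word indices to MUC tags, one dict per annotator.
    For the entity-or-not task, the label is just YES/NO.
    """
    data = []
    for index in sorted(set(tags1) | set(tags2)):
        data.append(['1', index, 'YES' if index in tags1 else 'NO'])
        data.append(['2', index, 'YES' if index in tags2 else 'NO'])
    return data

tags1 = {5723: 'ORG', 259742: 'PER'}   # annotator 1 (toy values)
tags2 = {5723: 'ORG', 269340: 'LOC'}   # annotator 2 (toy values)
print(make_task_data(tags1, tags2))
```

The same idea extends to the other two tasks: restrict to indices both annotators tagged and use the tag itself as the label, or combine the two schemes.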

### Useful python

This script makes heavy use of xml.etree for handling XML, but the assignment does not ask you to modify that part of the code.
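For orientation, here is a minimal sketch of the kind of xml.etree traversal involved, using a made-up, heavily simplified EAF-like fragment. Real .eaf files contain time slots, multiple tiers, and more attributes, so treat this only as an illustration of the parsing pattern:

```python
import xml.etree.ElementTree as ET

# A made-up, simplified fragment in the spirit of an ELAN .eaf file.
snippet = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="muc">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a1" ANNOTATION_REF="w1">
        <ANNOTATION_VALUE>ORG</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

root = ET.fromstring(snippet)
for tier in root.iter('TIER'):
    for ann in tier.iter('REF_ANNOTATION'):
        value = ann.find('ANNOTATION_VALUE').text
        print(tier.get('TIER_ID'), ann.get('ANNOTATION_REF'), value)
```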

It also uses nltk.metrics. In addition to the notes above, the final thing you need to know is how to print the confusion matrix with the write() method on the output file object. Confusion matrix objects have a pp() method ("pretty print"; in newer NLTK versions, pretty_format()), which works together with the write method as follows:

```python
outfile.write(cm.pp())
```

**Appending to a list:** You can concatenate two lists like so:

```python
list1 = [1, 2, 3]
list2 = [4, 5, 6]
list1 = list1 + list2
```

To append a single element to a list, wrap it in a list first:

```python
list1 = list1 + [7]
```
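Equivalently, the list's built-in append() method adds a single element in place:

```python
list1 = [1, 2, 3]
list1.append(7)  # in-place alternative to list1 = list1 + [7]
print(list1)     # [1, 2, 3, 7]
```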

**Getting the keys of a dictionary, sorted:** The sorted() function and the keys() method are helpful here:

```python
dict1 = {1: 'a', 5: 'b', 2: 'c', 9: 'a'}
sorted(dict1.keys())  # [1, 2, 5, 9]
```
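As a sketch for the label-count subtask, collections.Counter works well on the triple format from the toy example above (the data and annotator names here are just that toy example, not your lab files):

```python
from collections import Counter

# Count how often each annotator used each label in the toy triples.
toy_data = [['1', 5723, 'ORG'], ['2', 5723, 'ORG'],
            ['1', 55829, 'LOC'], ['2', 55829, 'LOC'],
            ['1', 259742, 'PER'], ['2', 259742, 'LOC'],
            ['1', 269340, 'PER'], ['2', 269340, 'LOC']]
counts = Counter((annotator, label) for annotator, item, label in toy_data)
for annotator, label in sorted(counts):
    print(annotator, label, counts[(annotator, label)])
```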