Linguistics 580: Computational Methods in Linguistic Analysis

Laboratory Exercise 6

Computational Methods in Linguistics (Bender/Wassink)

Goals:

Create a multi-level annotation using ELAN software.
- annotation layers include phones, named entities, and subjective comments
Learn 2 simple annotation schemes:
- MUC-7 for annotation of named entities
- An invented one for annotation of subjective comments
Gain experience applying annotation schemes to naturally occurring data.
Develop materials that can be used in next week's lab, which will introduce students to Kappa scoring for measuring inter-annotator agreement.

Get the necessary files for this lab

For this lab, you should copy to your local working directory the following files, which are stored in my home directory on patas:

lab6.wav

lab6.TextGrid

lab6phones.TextGrid

SimpleNamedEntityGuidelinesV6.4.pdf

We recommend creating a new folder called "lab6" in the working directory on your local computer. Working from this folder, copy the files from my patas account to your local machine. Recall that you issue the following command on your local machine (replacing username with your actual username):

scp username@patas.ling.washington.edu:~wassink/lab6.wav .

The computer should prompt you for your patas password and then copy the file to your current directory on your local machine. Then do the same for the other files.
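If you would rather not retype the command for each file, the repetition can be scripted. The following is only a sketch: it assumes Python 3 and scp are available on your local machine, and the helper names (build_scp_commands, fetch_lab_files) are invented for illustration. The host, directory, and file names are the ones given above.

```python
import subprocess

# Host and remote directory are taken from the scp command shown above.
REMOTE_HOST = "patas.ling.washington.edu"
REMOTE_DIR = "~wassink"
LAB_FILES = [
    "lab6.wav",
    "lab6.TextGrid",
    "lab6phones.TextGrid",
    "SimpleNamedEntityGuidelinesV6.4.pdf",
]


def build_scp_commands(username, files=LAB_FILES):
    """Return one scp argument list per lab file (hypothetical helper)."""
    return [
        ["scp", f"{username}@{REMOTE_HOST}:{REMOTE_DIR}/{name}", "."]
        for name in files
    ]


def fetch_lab_files(username):
    """Run each scp command in turn; scp itself prompts for the password."""
    for command in build_scp_commands(username):
        subprocess.run(command, check=True)
```

Calling fetch_lab_files("username") from inside your lab6 folder (with your actual username) would copy all four files into it. You may be prompted for your patas password once per file.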

Introduction to ELAN-Linguistic Annotator

For this lab, we will use ELAN annotation software. If you do not have ELAN installed on your computer, you can download it for free from here. You can also download a manual for ELAN, or you can download the Sociolinguistics Lab's Quick Guide to ELAN. (This guide was written for ELAN 4.0, and its screenshots match that version, but it works fine as a general reference for 4.3.) ELAN stands for EUDICO Linguistic Annotator, and EUDICO for European Distributed Corpora Project. ELAN is an annotation tool that allows you to create, edit, visualize and search annotations for video and audio data. This lab will also allow us to take advantage of ELAN's interoperability with Praat signal analysis software.

ELAN takes one or more media files, and one or more annotation files (.eaf format). Information generated in ELAN work sessions is saved to annotation files, never to your media file.

For Lab 2, we worked with an audiofile in Praat, and with an orthographic transcript in MS Word. We also generated a phone-level alignment using the Penn Phonetics Lab Forced Aligner for a very short portion of the "scrapple" recording. For this week's lab, we will be using a longer stretch of the same conversation, about 10 minutes. We can work with the soundfile and transcription in ELAN. ELAN can read audiofiles and Praat text grids (a Praat text grid is a time-aligned system of multiple annotated text tiers). However, ELAN wants to work with interval (not point) tiers. lab6.TextGrid contains an orthographic transcription of utterances for both PNWEnglish Speakers 19 and 20. lab6phones.TextGrid was generated by the Penn Phonetics Lab Forced Aligner, and contains both word- and phone-level transcriptions of the entire excerpt. All transcriptions (for Speakers 19, 20, and the two interviewers) were merged for the phone-level alignment. We will add 2 new tiers for named entities and participants' subjective comments.
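Because ELAN wants interval tiers, it can be useful to check which kinds of tier a TextGrid contains before importing it. The sketch below relies on the fact that Praat's text formats label interval tiers "IntervalTier" and point tiers "TextTier". The function names are invented for illustration, and the simple pattern match would be fooled by an annotation whose text happens to contain one of those quoted labels.

```python
import re

# Praat marks interval tiers as "IntervalTier" and point tiers as "TextTier".
TIER_CLASS = re.compile(r'"(IntervalTier|TextTier)"')


def tier_classes(textgrid_text):
    """Return the tier class names in the order they appear in the text."""
    return TIER_CLASS.findall(textgrid_text)


def has_point_tiers(textgrid_text):
    """True if at least one tier in the TextGrid text is a point tier."""
    return "TextTier" in tier_classes(textgrid_text)
```

Both functions expect the already decoded contents of the file, for example the result of open("lab6.TextGrid", encoding="utf-16").read().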

Part 1: ELAN basics

1. Launch ELAN.

2. Since we have a media file, but no .eaf file yet, we will create a new .eaf file:

From the File menu, choose New...

In the new window, click "Add media file..." and browse to the directory that contains the media file for this exercise, lab6.wav; select it.

Click OK.

The ELAN Main Window is displayed. At about the middle of the screen are the buttons for audio playback, a long thin bar (the "Annotation Density Viewer"), the Waveform Viewer (which shows timestamps, and the waveform for lab6.wav), and the area for annotation tiers. Since we have not yet created any annotation tiers, the only annotation tier we have is "default0".

3. Try playing out selected portions of the audiofile. Use your cursor to select a portion of the soundfile, then click the button with the RIGHT-TRIANGLE + S symbol in the central bank of controls. (This is ELAN's 'play selection button'. You'll typically want to use this one, and NOT the unmarked arrow in the left bank of controls, which plays out the entire file till you stop it.)

Note: If you can't see the waveform, it could be because the recording has a low signal-to-noise ratio. Audio playback will still work. However, you can use Vertical Zoom to increase the visibility of the waveform. Click with the right mouse button on the Waveform Viewer or the Timeline Viewer. A dropdown menu appears:

Select Vertical Zoom.
Select desired percentage, e.g., 3000%.

4. To zoom into and out of the waveform along the time dimension, click with the right mouse button on either the Waveform Viewer or the Timeline Viewer. A dropdown menu appears:

Select Zoom.
Select desired percentage, or:
If you have a selection highlighted, choose Zoom to selection.

Part 2: Import our pre-existing Praat TextGrids into ELAN

1. In ELAN, click File > Import > Praat TextGrid File...

2. Browse to the folder containing lab6.TextGrid, select this file, select UTF-16 encoding, and click Open.

(Note: You must ensure that ELAN uses the same file encoding as was set in Praat's 'write text settings', in this case UTF-16. Otherwise your import will fail, and you will get the error message 'operation interrupted, no tiers detected in the textgrid file'.)

3. You will be returned to the window 'Select Praat Text Grid containing interval tiers'. Click Next. The next window will allow you to select Linguistic Type. This is irrelevant for us at this stage. Just click Next. Click Finished on the next screen.

4. To see the transcription contents, select 'Grid' from the row of buttons at the top of ELAN's main viewer. Select 'OrthTrans_Spkr19' to view the numbered turns for this speaker. Note that when you click on a turn in the Grid viewer, ELAN moves the cursor automatically to the location of this turn in the Waveform Viewer.

(Note: The small number below each tier label (e.g., 101 for OrthTrans_Spkr19) indicates the number of annotations associated with that tier.)

5. Repeat steps 1-4 to import lab6phones.TextGrid, to add the phone- and word-level annotation tiers.
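If an import fails with the 'no tiers detected' message, one quick check is to look at the first bytes of the TextGrid for a byte-order mark (BOM). This is a minimal sketch, assuming the file was written with a BOM, as UTF-16 text files usually are. The function name is invented for illustration, and a result of None only means that no BOM was found, not that the file is unreadable.

```python
import codecs


def guess_textgrid_encoding(path):
    """Guess the encoding from a byte-order mark; None if no BOM is present."""
    with open(path, "rb") as handle:
        head = handle.read(3)
    if head.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    return None
```

For example, guess_textgrid_encoding("lab6.TextGrid") should agree with the encoding you select in ELAN's import dialog.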

Part 3: Create new tiers to hold our new annotations

We are now ready to add tiers to contain our named entity and subjective comment tags.

1. Click on Tier > Add New Tier....

The Add Tier dialog window appears.

2. Define tier attributes:

Go to Tier name. Enter a name for each tier, as follows:

Create a tier called MUC-7ENTITY

Create another tier called SUBJ_COMMENTS

Note: For our purposes, we are treating these as recording-level annotations, therefore we will not create separate MUC-7ENTITY and SUBJ_COMMENTS tiers for each speaker.

Your tiers will be visible in the area below the waveform. Each one can be selected independently, assigned its own color, and so on. You can also drag tiers to reorder them (or right-click to set their ordering).

Part 4: Add Named Entity Annotations

We will use a simple implementation of the MUC-7 NE Guidelines (see SimpleNamedEntityGuidelinesV6.4.pdf).

Review the annotation values in the table below.

Working in the MUC-7ENTITY tier, created above, supply an annotation (the 3-letter abbreviation only) for each named entity that occurs in the orthographic transcriptions for Speakers 19 and 20 and appears to satisfy one of the type descriptions.

Create annotations at the word-level. For our purposes, a word is a string of non-white space characters surrounded by white space. If one named entity spans two words (e.g., New York), annotate both words separately with the type for that entity.
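The word-level rule can be stated compactly in code. The sketch below is illustrative only (the function name is invented, and in the lab you enter the tags by hand in ELAN): it splits an entity on white space and gives every resulting word the same type abbreviation.

```python
def tag_entity_words(entity_text, entity_type):
    """Split an entity on white space and pair each word with the same tag."""
    return [(word, entity_type) for word in entity_text.split()]
```

Under this rule, tag_entity_words("New York", "LOC") yields one LOC annotation for "New" and another for "York".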

To add a new annotation:

1. Working in the Grid Viewer, select the Word tier from the pulldown. Scroll through the tier to locate the desired word. Click to highlight the word. Notice that the word and the stretch of the waveform with which it is associated are now highlighted in the Timeline and Waveform Viewers.

2. Hover your mouse over the tier to which the annotation is to be added (in the annotation tier area below the Waveform Viewer), and right-click. A pulldown menu appears. Select 'New annotation here.' A selection-length box will appear based upon the duration of the word.

(Alternatively, you may find it works better for you to double-click in this tier, in the area to the right of the tier name. A selection-length window should now appear. If it appears in color (not white), click it once again to get a blank white box containing a text-entry crosshair cursor.)

3. Type your annotation into this box.

4. You must type CMD + RETURN (CTRL + ENTER on Windows) to save each new annotation created. (Note: Annotations may be deleted by double-clicking to select the desired annotation, then right-clicking and selecting 'delete annotation'.)

Entity Type	Abbreviation	Description	Example
Person	(PER)	Person entities are limited to humans identified by name, nickname or alias.	Mom, Bill Smith
Title/Role	(TTL)	Named personal titles or roles. These are restricted to titles that occur directly before or after the person name they describe.	Vice President, Mr., Dr.
Organization	(ORG)	Organization entities are limited to corporations, institutions, government agencies and other groups of people defined by an established organizational structure.	University of Washington
Location	(LOC)	Location entities include names of politically or geographically defined places (cities, provinces, countries, international regions, bodies of water, mountains, etc.). Locations also include man-made structures like airports, highways, streets, factories and monuments.	Arizona, Philadelphia streets, Philadelphia

Part 5: Add Subjective Comment Annotations

At present, there appears to be no generally-accepted annotation scheme in sociolinguistics for annotating speakers' subjective comments in running speech. We will use the following provisional one.

Review the annotation values in the table below.

Working in the SUBJ_COMMENTS tier, created above, supply an annotation (the 3-letter abbreviation only) for each comment type that occurs in the orthographic transcriptions for Speakers 19 and 20 and appears to satisfy one of the type descriptions.

This time, create annotations at the turn-level, using the turns in the orthographic tiers as the basis for the annotations you add to SUBJ_COMMENTS.

Comment Type	Abbreviation	Description	Example
Metalinguistic Comment	(MET)	Metalinguistic comments are descriptive words or phrases intended to characterize how a linguistic form, or variety used by a group of speakers, sounds.	nasal, twangy, broad, harsh, smooth, fast, guttural, loud
Identity Claim	(IDC)	An identity claim is a statement made in the 1st person (singular or plural), that names a group or identity the respondent avows for him- or herself.	I consider myself..., I am a..., We are...
Historical Statement	(HIS)	Historical statements relate events, facts, or narrative material from the respondent's personal, familial, or community history.	Grandpa worked on the railroad..., When Norwegians first settled Ballard

Save Your Work

To save an ELAN annotation file, including your new annotation tiers and any changes made to the imported tiers:

Click on File menu.
Click on Save or Save as...
(Or use the shortcut key CTRL+S, or CMD+S on a Mac.)
Name your file yourname.eaf
Click OK.

Note: no alterations are made to your media file or the original imported annotation files.
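Because an .eaf file is plain XML, you can sanity-check your saved annotations outside ELAN. The sketch below assumes the standard EAF layout, in which each TIER element carries a TIER_ID attribute and each annotation's label sits in an ANNOTATION_VALUE element. The function names are invented for illustration, and the allowed tag sets are simply the abbreviations from the two tables in this lab.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Allowed labels from the two tables in this lab, keyed by tier name.
ALLOWED = {
    "MUC-7ENTITY": {"PER", "TTL", "ORG", "LOC"},
    "SUBJ_COMMENTS": {"MET", "IDC", "HIS"},
}


def tier_values(eaf_path):
    """Map each tier name to the list of its annotation values."""
    root = ET.parse(eaf_path).getroot()
    values = {}
    for tier in root.iter("TIER"):
        values[tier.get("TIER_ID")] = [
            (node.text or "").strip() for node in tier.iter("ANNOTATION_VALUE")
        ]
    return values


def unexpected_labels(eaf_path, allowed=ALLOWED):
    """Count values in the new tiers that are not in the lab's tag sets.

    Empty annotations show up under the empty string.
    """
    found = tier_values(eaf_path)
    return {
        name: Counter(v for v in found.get(name, []) if v not in tags)
        for name, tags in allowed.items()
    }
```

For example, {name: len(vals) for name, vals in tier_values("yourname.eaf").items()} gives per-tier annotation counts comparable to the small numbers ELAN shows below each tier label, and unexpected_labels("yourname.eaf") flags typos such as a lowercase "loc".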

Answer these Questions in your write-up:

Consider the MUC-7 Named Entity types. Were any of the types difficult to apply? If so, which ones? Why?
Were there any tokens in the transcription for which you had trouble deciding on a suitable tag? What did you decide to do? Why?
We treated our metalinguistic comments as recording-level comments for this lab. Formulate a research question for which it might be important to annotate these at the speaker-level instead.

To turn in...

Your ELAN file, containing all the components generated in this lab.
A PDF file responding to the questions above.