Linguistics 580: Computational Methods in Linguistic Analysis

Laboratory Exercise 8

Computational Methods in Linguistics (Bender/Wassink)

Goals:

Use descriptive statistical techniques to learn about the structure of your dataset(s)
Calculate normalized frequency for your targeted variables

Due 6/1 (Note: you have an additional week to complete this lab)

Explore your datasets

At this point in the quarter, you have chosen a research question on which to focus for your term project. You have located one or more corpora of interest that contain data appropriate for addressing your research question, and you have carefully considered how these data will be annotated. You have also figured out a method for extracting the tokens of interest from that dataset. In this lab, we want to explore the structure of the dataset further. Your lab work will be focused on generating exploratory statistics, to show the general properties of your dataset.

You may do your work in any software program with which you are familiar (e.g., SPSS, R, Python, etc.). Here is the R code for the simple examples used in class.

Summary Descriptives

Normalized frequencies. What is the appropriate unit for your counts? Of course, your research question defines your focus. Does your research question require you to count word frequencies? Phones? POS instantiations? Any large corpus *could* be described in terms of its lexical count, as is common in corpus linguistics, and our work in Week 7 focused on such counts. In this lab, we will adapt techniques discussed in Week 7 and combine them with other exploratory techniques from general descriptive statistics.

1a. What is the total size (n=) of your corpus? (What are the units for this level of analysis: abstract syntactic constituents, semantic types, words, phones, etc.?) For this lab, we will first calculate the word count for each dataset used. So, state your corpus size(s) in words. Then, go to 1b.
1b. Base of normalization. What is an appropriate base of normalization for the type of dataset you are analyzing? Justify your decision in 1-2 sentences.
1c. Next, calculate the raw token count(s) for your dependent variable(s) of interest. Do this separately for each dataset you are using.
1d. Calculate normalized frequency (nf) for your tokens.
1e. How frequent is/are your token(s) of interest in your datasets? Does this surprise you, given your expectations about the frequency of this form/these forms? Are there any other types of corpora in which you would expect the frequency count(s) to be wildly different? Discuss in a few (no more than 10) sentences.
1f. Calculate the corpus to corpus ratio for your dependent variable. (If you have more than one dependent variable, choose a subset--one or two will suffice for this lab.)
1g. Independent variables. As we saw in Week 7, there are factors that may affect the measures you wish to take (e.g., whether "tune" occurs as a noun or a verb, or whether /t/ occurs word-medially or word-finally). These factors are independent variables that may help us to account for or predict systematic differences in the outcome variable. What is/are your independent variable(s) of interest (again guided by your research question)? Recalculate your token counts, partitioned on these independent variables.
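
The calculations in 1a-1g can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the in-class R code: the two toy corpora, the whitespace tokenization, the target word "tune", the base of 1,000 words, and the hand-assigned POS tags are all hypothetical placeholders, and the corpus-to-corpus ratio is taken here to be the ratio of the two normalized frequencies.

```python
from collections import Counter

# Hypothetical toy corpora, tokenized on whitespace; substitute your own data.
corpus_a = "i like to tune the radio and hum a tune while i tune the guitar".split()
corpus_b = "the tune was stuck in my head all day long".split()

TARGET = "tune"  # hypothetical token of interest
BASE = 1000      # assumed base of normalization (occurrences per 1,000 words)


def normalized_frequency(raw_count, corpus_size, base=BASE):
    """Scale a raw count to occurrences per `base` words."""
    return raw_count * base / corpus_size


# 1a: corpus sizes in words
size_a, size_b = len(corpus_a), len(corpus_b)

# 1c: raw token counts, computed separately for each dataset
raw_a = corpus_a.count(TARGET)
raw_b = corpus_b.count(TARGET)

# 1d: normalized frequencies
nf_a = normalized_frequency(raw_a, size_a)
nf_b = normalized_frequency(raw_b, size_b)

# 1f: corpus-to-corpus ratio (assumed here: ratio of normalized frequencies)
ratio = nf_a / nf_b

# 1g: counts partitioned on an independent variable.
# The POS tags below are hypothetical and were assigned by hand for corpus A.
tagged_a = [("tune", "VERB"), ("tune", "NOUN"), ("tune", "VERB")]
counts_by_pos = Counter(pos for word, pos in tagged_a if word == TARGET)

print(f"A: n={size_a}, raw={raw_a}, nf={nf_a:.1f} per {BASE}")
print(f"B: n={size_b}, raw={raw_b}, nf={nf_b:.1f} per {BASE}")
print(f"A:B ratio = {ratio:.2f}")
print(dict(counts_by_pos))
```

With corpora this small, a base of 1,000 inflates the figures well beyond anything actually observed; for real data, a base close to the size of your smallest corpus is usually easier to defend in 1b.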

Exploratory Statistics

2a. Choose an appropriate graphical method (frequency table, histogram, box plot, scatterplot) for representing your datasets. Make sure you choose a method that allows for observations to be clustered within levels of any grouping variables used to partition the data.
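
One way to keep observations clustered within the levels of a grouping variable is a grouped frequency table, which is also the tabular form of a per-group histogram. The sketch below is illustrative only: the observations, the two group labels, and the bin width of 10 are hypothetical, and a plotting library would normally replace the text rendering.

```python
from collections import Counter, defaultdict

# Hypothetical observations: (level of the grouping variable, measured value).
observations = [
    ("word-medial", 42), ("word-medial", 55), ("word-medial", 48),
    ("word-final", 61), ("word-final", 73), ("word-final", 66),
    ("word-final", 58),
]
BIN_WIDTH = 10  # assumed bin width; choose one that suits your measurement scale


def grouped_frequency_table(obs, bin_width=BIN_WIDTH):
    """Count observations per bin, separately for each level of the grouping variable."""
    table = defaultdict(Counter)
    for group, value in obs:
        bin_start = (value // bin_width) * bin_width
        table[group][bin_start] += 1
    return table


table = grouped_frequency_table(observations)

# Render each group's bins as a simple text histogram.
for group in sorted(table):
    print(group)
    for bin_start in sorted(table[group]):
        count = table[group][bin_start]
        print(f"  {bin_start}-{bin_start + BIN_WIDTH - 1}: {'#' * count} ({count})")
```

Keeping the groups as separate rows makes it easy to see whether one level of the independent variable is shifted or more spread out than another before any summary statistics are computed.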

2b. Examine the shape of your datasets' distributions. You may calculate the relevant statistics individually (e.g., in R, we can separately call functions for the mean, the standard deviation, or the IQR, "interquartile range"). Alternatively, you may use a function such as those discussed in class, which will provide a frequency table or evaluate whether the sample distribution is normal (e.g., we can invoke the command summary() or ks.test()). Do your distributions satisfy our assumptions for:
2c. Given your responses in 2b, what type of inferential tests do you believe are most appropriate for your study data (parametric vs. non-parametric)?
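
For those working in Python rather than R, the descriptive measures named in 2b can be sketched with the standard library alone. Everything specific here is an assumption: the sample values are hypothetical, and the ks_statistic helper is a hand-written illustration of the one-sample Kolmogorov-Smirnov D statistic against a normal distribution fitted to the sample, not a replacement for R's ks.test().

```python
import statistics
from statistics import NormalDist

# Hypothetical sample (e.g., per-speaker normalized frequencies).
sample = [2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 3.3, 3.6, 4.8]

mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# Quartiles; Python's default 'exclusive' method can differ slightly
# from the default quantile method in R.
q1, median, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1


def ks_statistic(data, dist):
    """Largest gap between the empirical CDF of `data` and `dist.cdf`."""
    ordered = sorted(data)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        cdf = dist.cdf(x)
        # Compare the theoretical CDF with the empirical CDF just after
        # and just before the step at x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


d_stat = ks_statistic(sample, NormalDist(mean, sd))

print(f"mean={mean:.2f} sd={sd:.2f} median={median:.2f} IQR={iqr:.2f}")
print(f"KS D against fitted normal = {d_stat:.3f}")
```

D on its own is not a p-value. A statistics package (for example, scipy.stats.kstest in Python) can supply one, with the caveat that estimating the mean and standard deviation from the same sample tends to make the standard KS p-value conservative.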

Big-Picture Questions
3. Talk briefly in your write-up about the differences noted thus far between your corpora. How are they different in size? Does this combined dataset give you enough data for your ideal comparison of this phenomenon? What limitations do you see thus far in the ability of this dataset to allow you to address your research question?

To turn in