- Use descriptive statistical techniques to learn about the structure of your dataset(s)
- Calculate normalized frequency for your targeted variables

**Due 6/1** (Note: you have an additional week to complete this lab)

You may do your work in any software program with which you are familiar (e.g., SPSS, R, Python, etc.). Here is the R code for the simple examples used in class.

**Normalized frequencies**

What is the appropriate unit for your counts? Of course, your research question defines your focus. Does your research question require you to count word frequencies? Phones? POS instantiations? Any large corpus *could* be described in terms of the lexical count of the corpus, as is common in corpus linguistics, and our work in Week 7 focused on these. In this lab, we will adapt techniques discussed in Week 7 and combine them with other types of exploratory techniques from general descriptive statistics.

1a. What is the total size `(n=)` of your corpus? (What are the units for this level of analysis? Abstract syntactic constituents, semantic types, words, phones, etc.?) For this lab, we will first calculate the word count for each dataset used. So, state your corpus size(s) in words. Then, go to 1b.

1b. Base of normalization. What is an appropriate base of normalization for the type of dataset you are analyzing? Justify your decision in 1-2 sentences.
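If you are working in Python, the word count for 1a can be sketched with a simple tokenizer. This is a minimal illustration, not a prescribed method: the two "corpora" are invented toy strings, and `\w+` is a deliberately crude stand-in for whatever tokenization your project actually uses.

```python
import re

def word_count(text):
    """Count word tokens with a simple word-character tokenizer (illustration only)."""
    return len(re.findall(r"\w+", text.lower()))

# Toy stand-ins for your datasets:
corpus_a = "the tune was played and the band played on"
corpus_b = "tune in next week to hear the tune again"

n_a = word_count(corpus_a)  # corpus size in words for dataset A
n_b = word_count(corpus_b)  # corpus size in words for dataset B
print(n_a, n_b)
```

In your own work, replace the toy strings with the text of each dataset and report each resulting `n` in words.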

1c. Next, calculate the raw token count(s) for your dependent variable(s) of interest. Do this separately for each dataset you are using.

1d. Calculate normalized frequency *(nf)* for your tokens.

1e. How frequent in your databases is/are your token(s) of interest? Does this surprise you, given your expectations about the frequency of this form/these forms? Are there any other types of corpora in which you would expect the frequency count(s) to be wildly different? Discuss in a few (no more than 10) sentences.
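The calculation in 1d can be sketched in Python as follows. The base of 1,000 words is only a placeholder; substitute whatever base you justified in 1b, and note that the count and corpus size here are invented illustration values.

```python
def normalized_frequency(raw_count, corpus_size, base=1000):
    """Scale a raw token count to a common base (per `base` words)."""
    return raw_count / corpus_size * base

# e.g., 12 tokens in a 4,800-word corpus, normalized per 1,000 words:
nf = normalized_frequency(12, 4800)
print(nf)  # about 2.5 per 1,000 words
```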

1f. Calculate the corpus to corpus ratio for your dependent variable. (If you have more than one dependent variable, choose a subset--one or two will suffice for this lab.)
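One way to compute the ratio in 1f, taking it as the ratio of the normalized frequencies of the same form in the two corpora (all counts and sizes below are made-up illustration values):

```python
# Normalized frequencies (per 1,000 words) of one dependent variable in two corpora:
nf_a = 12 / 4800 * 1000   # about 2.5 per 1,000 words in corpus A
nf_b = 30 / 6000 * 1000   # about 5.0 per 1,000 words in corpus B

ratio = nf_a / nf_b       # about 0.5: the form is half as frequent in A as in B
print(ratio)
```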

1g. Independent variables. As we saw in Week 7, there are factors that may affect the measures you wish to take (whether "tune" occurs as a noun or a verb, whether /t/ occurs word-medially or word-finally). These factors are independent variables that may help us to account for or predict systematic differences in the outcome variable. What is/are your independent variable(s) of interest (again guided by your research question)? Recalculate your token counts partitioned on these independent variables.
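Partitioning counts on an independent variable, as 1g asks, can be sketched with a `Counter`. The tagged observations below are invented for illustration, using the noun/verb "tune" example from above:

```python
from collections import Counter

# Toy observations of the token "tune", each tagged for the independent
# variable POS (values invented for illustration):
observations = [
    ("tune", "noun"), ("tune", "verb"), ("tune", "noun"),
    ("tune", "noun"), ("tune", "verb"),
]

# Token counts partitioned on the levels of the independent variable:
by_level = Counter(pos for _token, pos in observations)
print(by_level)  # counts for "noun" vs. "verb"
```

Each partitioned count can then be normalized exactly as in 1d, using the same base.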

**Exploratory Statistics**

2a. Choose an appropriate graphical method (frequency table, histogram, box plot, scatterplot) for representing your datasets. Make sure you choose a method that allows for observations to be clustered within levels of any grouping variables used to partition the data.

2b. Examine the shape of your datasets' distributions. You may either calculate these by hand (e.g., in R, we can separately call functions for the mean, the standard deviation, or the IQR, "interquartile range"), or use a function, such as discussed in class, that will provide a frequency table or evaluate whether our sample distribution is normal (e.g., we can invoke `summary()` or `ks.test()`). Do your distributions satisfy our assumptions for:

- Normality?

- Homogeneity of variance?

- Independence of observations?
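If you are working in Python rather than R, the by-hand route of 2b can be sketched with the standard-library `statistics` module; the data values below are invented normalized frequencies. (The Python analogue of `ks.test()` would be `scipy.stats.kstest`, not shown here to keep the sketch dependency-free.)

```python
import statistics as st

# Toy sample of normalized frequencies (values invented for illustration):
data = [2.1, 2.2, 2.4, 2.5, 2.6, 2.8, 2.9, 3.0]

mean = st.mean(data)                       # central tendency
sd = st.stdev(data)                        # sample standard deviation
q1, median, q3 = st.quantiles(data, n=4)   # quartiles
iqr = q3 - q1                              # interquartile range
print(mean, sd, iqr)
```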

2c. Given your responses in 2b, what type of inferential tests do you believe are most appropriate for your study data (parametric vs. non-parametric)?

### Big-Picture Questions

3. Talk briefly in your write-up about the differences noted thus far between your corpora. How do they differ in size? Does this combined dataset give you enough data for your ideal comparison of this phenomenon? What limitations do you see thus far in the ability of this dataset to allow you to address your research question?

**To submit:**

- Your answers to the questions above (1a-g, 2a-c, 3).
- A .pdf file containing tables and graphical output produced (frequency tables, boxplots, histograms, etc.)
- A .txt file, formatted appropriately for your project (clear column headers, 1 case per row, or other format approved for your study), containing the toy dataset with which you worked for this lab.