Due 6/1 (Note: you have an additional week to complete this lab)
You may do your work in any software program with which you are familiar (e.g., SPSS, R, Python). Here is the R code for the simple examples used in class.
1a. What is the total size (n =) of your corpus? (What are the units at this level of analysis: abstract syntactic constituents, semantic types, words, phones, etc.?) For this lab, we will first calculate the word count for each dataset used, so state your corpus size(s) in words. Then go to 1b.
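For those working in R, here is a minimal sketch of a word count, assuming your corpus is a plain-text file (the file name my_corpus.txt and the whitespace-based tokenization are placeholders; substitute your own file and whatever tokenization your data require):

    # read the corpus into one string and split on whitespace (a rough word tokenization)
    corpus_text <- paste(readLines("my_corpus.txt"), collapse = " ")
    tokens      <- unlist(strsplit(tolower(corpus_text), "\\s+"))
    tokens      <- tokens[tokens != ""]   # drop any empty strings
    corpus_size <- length(tokens)         # total corpus size in words (n)
    corpus_size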
1b. Base of normalization. What is an appropriate base of normalization (e.g., per 1,000 or per 1,000,000 words) for the type of dataset you are analyzing? Justify your decision in 1-2 sentences.
1c. Next, calculate the raw token count(s) for your dependent variable(s) of interest. Do this separately for each dataset you are using.
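One way to do this in R, reusing the tokens vector from the 1a sketch; the target form "tune" is only a placeholder for your own dependent variable:

    target    <- "tune"                   # placeholder dependent variable
    raw_count <- sum(tokens == target)    # exact-match raw token count
    raw_count
    # for several forms at once: table(tokens[tokens %in% c("tune", "tunes", "tuned")])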
1d. Calculate normalized frequency (nf) for your tokens.
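Normalized frequency is the raw count divided by the corpus size, scaled to the base of normalization you chose in 1b. In R, continuing the sketch above (the base of 1,000 words is only an example):

    base <- 1000                              # base of normalization chosen in 1b
    nf   <- (raw_count / corpus_size) * base  # tokens per 1,000 words
    nf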
1e. How frequent is/are your token(s) of interest in your databases? Does this surprise you, given your expectations about the frequency of this form/these forms? Are there other types of corpora in which you would expect the frequency count(s) to be wildly different? Discuss in a few sentences (no more than 10).
1f. Calculate the corpus-to-corpus ratio for your dependent variable. (If you have more than one dependent variable, choose a subset; one or two will suffice for this lab.)
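One common way to express a corpus-to-corpus ratio is as the ratio of the normalized frequencies of the same form in the two corpora. A sketch, assuming nf_corpus1 and nf_corpus2 were each computed as in 1d (adjust if the definition used in class differs):

    ratio <- nf_corpus1 / nf_corpus2   # > 1 means the form is more frequent in corpus 1
    ratio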
1g. Independent variables. As we saw in Week 7, there are factors that may affect the measures you wish to take (e.g., whether "tune" occurs as a noun or a verb, or whether /t/ occurs word-medially or word-finally). These factors are independent variables that may help us account for or predict systematic differences in the outcome variable. What is/are your independent variable(s) of interest (again guided by your research question)? Recalculate your token counts partitioned on these independent variables.
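A sketch of partitioned counts in R, assuming a hypothetical data frame d with one row per token and columns form (the dependent variable) and pos (an assumed independent variable such as noun vs. verb); rename these to match your own coding:

    # cross-tabulate token counts by levels of the independent variable
    xtabs(~ form + pos, data = d)
    # equivalently: table(d$form, d$pos)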
2a. Choose an appropriate graphical method (frequency table, histogram, box plot, scatterplot) for representing your datasets. Make sure you choose a method that allows observations to be grouped within the levels of any grouping variables used to partition the data.
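For example, in base R a grouped box plot can be drawn with a formula, assuming the same hypothetical data frame d with a numeric outcome column (here called duration, a placeholder) and the grouping variable pos:

    # one box per level of the grouping variable
    boxplot(duration ~ pos, data = d,
            xlab = "Part of speech", ylab = "Duration (ms)")
    # a histogram of the ungrouped outcome, for comparison
    hist(d$duration, main = "Distribution of duration", xlab = "Duration (ms)")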
2b. Examine the shape of your datasets' distributions. You may either calculate these by hand (e.g., in R, by separately calling functions for the mean, standard deviation, or IQR, the interquartile range), or use a function such as those discussed in class that provide a frequency table or evaluate whether the sample distribution is normal (e.g., summary() or ks.test()). Do your distributions satisfy the assumptions for parametric tests (e.g., normality)?
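A sketch of both options in R, assuming a numeric vector x holding your outcome measure:

    mean(x); sd(x); IQR(x)                 # descriptives called one at a time
    summary(x)                             # five-number summary plus the mean
    ks.test(x, "pnorm", mean(x), sd(x))    # K-S test against a normal with x's mean and sd
    # shapiro.test(x) is another common normality check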
2c. Given your responses in 2b, what type of inferential test (parametric vs. non-parametric) do you believe is most appropriate for your study data?
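For illustration only, one common parametric/non-parametric pairing in R, assuming the hypothetical data frame d with a two-level grouping variable pos (your choice of test should follow from your own design and your answers in 2b):

    t.test(duration ~ pos, data = d)        # parametric: compares group means
    wilcox.test(duration ~ pos, data = d)   # non-parametric counterpart: compares rank distributions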