# SurveyAnalysis.R
# This script shows you how to calculate some descriptive statistics from our survey. Below I've
# supplied R code for 5 examples. After this, you can check out 'SurveyHypothesisTest.R' which
# shows you how to run hypothesis tests on these results.
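# Install ggplot2 only if it is not already installed:
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")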
library(ggplot2)
library(broom)
##
# First we'll clear the workspace and load in the survey data:
rm(list = ls())
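# (stringsAsFactors = TRUE loads text fields as factors; since R 4.0 the default is FALSE)
survey <- read.csv("http://www.courses.washington.edu/psy315/datasets/Psych315W19survey.csv",
                   stringsAsFactors = TRUE)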
# Our new variable 'survey' has a bunch of fields associated with it that
# correspond to your answers to each of the questions. A good way to see the
# list of fields is, if you're using RStudio, to go to the 'Data' window, find
# the 'survey' variable and click on the blue triangle. You'll see
# things like:
#
# gender : Factor w/ 2 levels "Female","Male": 1 2 2 ...
#
# This means that there is a field 'gender' which you can access with the
# dollar sign (survey$gender)
#
# 'Factor' means that this field is nominal data, and you can see that the 2 levels
# are 'Female' and 'Male'.
#
# Other fields are either 'int' (integers) or 'num' (decimals), which are both ratio
# scale data for our survey.
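#
# If you are not using RStudio, the same information is available from the
# console. A minimal sketch, using a small made-up data frame 'toy' (its
# values are hypothetical, not taken from the class survey):
toy <- data.frame(gender = factor(c("Female", "Male", "Male")),
                  height = c(64, 70.5, 68))
str(toy)             # lists each field with its type ('Factor', 'num', ...)
levels(toy$gender)   # the levels of a nominal (factor) field
class(toy$height)    # "numeric", which 'str' abbreviates as 'num'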
## Example 1: Is the mean height of women in our class different from 64 inches?
# This will be a study of the field 'height', which is a ratio scale. Ratio scale data is
# best visualized with a histogram with class intervals that you can set.
# We can look at the heights for female students like this:
ratio.data <- survey$height[survey$gender == "Female"]
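# Define class intervals based on the min and max. Rounding outward with
# 'floor' and 'ceiling' makes sure the intervals cover every height, even
# when heights are not whole numbers:
class.interval <- seq(floor(min(ratio.data)),
                      ceiling(max(ratio.data)),
                      1)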
hist(ratio.data,
     main = sprintf('Mean height: %5.2f inches', mean(ratio.data)),
     xlab = "Height (in)",
     col = "blue",
     xaxt = 'n',
     yaxt = 'n',
     breaks = class.interval
)
# The default axes were turned off above (xaxt = 'n', yaxt = 'n'), so we
# add our own axes with the 'axis' function.
# Axis 1 is 'x' and 2 is 'y':
axis(1, at=class.interval)
axis(2, at=seq(0,100,5),las = 1)
# We can summarize a ratio-scale value with means and standard deviations.
mean(ratio.data)
sd(ratio.data)
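# To see what 'mean' and 'sd' compute, here is a minimal sketch that works
# them out by hand. The vector 'toy.heights' is made up for illustration
# (it is not survey data). Note that 'sd' divides by n-1, not n.
toy.heights <- c(62, 64, 65, 67)
n.toy <- length(toy.heights)
mean.by.hand <- sum(toy.heights) / n.toy
sd.by.hand <- sqrt(sum((toy.heights - mean.by.hand)^2) / (n.toy - 1))
# These should match R's built-in functions:
all.equal(mean.by.hand, mean(toy.heights))
all.equal(sd.by.hand, sd(toy.heights))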
# Wikipedia says that the average US woman is 64 inches tall. Later on
# in the quarter we'll use a 't-test' to determine the probability of
# drawing our mean from the survey by chance, if the true mean is
# 64 inches.
## Example 2: Is there an equal number of men and women in our class?
# This requires visualizing the distribution of responses from a single nominal
# scale question.
#
# Nominal scale data can be visualized with a histogram too, but
# the x-axis categories are names, not numbers. R has a function
# 'table' that counts up the frequencies for nominal data. For
# example, for gender:
nominal.data <- survey$gender
freqs <- table(nominal.data)
freqs
# 'freqs' is a special list of numbers where the columns
# have names. In our case, the names are the genders.
# You can visualize the frequencies for nominal data with 'barplot'
barplot(freqs)
# You can color your bars using the 'col' option. Let's color the genders
# by their stereotypical colors:
barplot(freqs,
        col = c("pink", "blue"))
# Is there an equal ratio of men to women in this class? Later on
# we'll run a 'Chi-squared' test for frequency to determine the
# probability of getting a distribution like this by chance.
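#
# Frequencies are often easier to compare as proportions. A minimal sketch
# using 'prop.table', which divides each count by the total. The answers in
# 'toy.answers' are made up for illustration:
toy.answers <- c("Female", "Female", "Male", "Female")
toy.freqs <- table(toy.answers)
toy.props <- prop.table(toy.freqs)
toy.props
# An equal split between two categories would give 0.5 for each one.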
# Now that we've seen how to visualize the frequency distribution for
# ratio scale and nominal scale data, we'll move on to visualizing how
# to compare different variables. With these two types of data, there
# are three kinds of comparisons we can make: ratio to ratio, ratio to nominal,
# and nominal to nominal.
## Example 3: Does where you sit in class depend on gender?
#
# This requires the comparison of two nominal scale variables, 'sit' and 'gender'.
# Comparing nominal data to nominal data is typically asking if the
# distribution of frequencies for one variable depends on the level
# of another.
#
# R's 'table' function conveniently tabulates frequencies for more than
# one nominal variable:
myTable <- table(survey$gender,survey$sit)
# The result is a table with both rows and columns, with labels:
myTable
# The labels can be pulled out using 'row.names' and 'colnames' (note
# the inconsistency using '.' in the function names)
row.names(myTable)
colnames(myTable)
# You may or may not see that where you sit depends on gender.
# To visualize these frequencies, use 'barplot' again.
barplot(myTable,
        beside = TRUE,
        legend = row.names(myTable),
        col = c("Pink", "Blue"))
# I prefer 'beside=TRUE' over the default which stacks the bars on top
# of each other (try it)
# Can you see a difference in the frequency of where you sit for the two
# genders? Later on in the quarter we'll run a 'Chi-squared test for
# independence' which will determine the probability of getting results
# like this by chance.
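#
# When the two groups differ in size, raw counts can be hard to compare. One
# option is to convert each row to proportions with 'prop.table', using
# margin = 1 so that each row sums to 1. A minimal sketch with made-up
# answers (the 'toy' vectors are hypothetical, not survey data):
toy.gender <- c("Female", "Female", "Male", "Male", "Male", "Female")
toy.sit    <- c("Front", "Back", "Back", "Back", "Front", "Front")
toy.table  <- table(toy.gender, toy.sit)
toy.row.props <- prop.table(toy.table, margin = 1)
toy.row.props
# If seating did not depend on gender, the rows would look about the same.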
## Example 4: Is there a correlation between mother's and father's heights?
# This requires comparing two ratio scale variables, 'pheight' and 'mheight'.
#
# Comparing ratio scale data to ratio scale data is best done with
# a scatterplot, and summarized with a correlation. For example, to compare
# your mothers' heights to your fathers' heights, use:
ratio.data.x <- survey$mheight
ratio.data.y <- survey$pheight
plot(ratio.data.x, ratio.data.y,
     xlab = "Mother's Height",
     ylab = "Father's Height",
     pch = 20,
     col = "blue",
     asp = 1,
     cex = 2)
# We quantify this relation with the Pearson Correlation
cor(ratio.data.x,ratio.data.y, use = "complete.obs")
# Later on we'll find the probability of obtaining a correlation this
# large by chance using a 'correlation test for r=0'.
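#
# To see what 'cor' does with use = "complete.obs", here is a minimal sketch
# that computes the Pearson correlation by hand. The 'toy' vectors are made
# up for illustration. Pairs where either value is missing (NA) are dropped
# first, which is what "complete.obs" means.
toy.x <- c(60, 62, 65, NA, 70)
toy.y <- c(66, 69, 70, 72, NA)
keep <- complete.cases(toy.x, toy.y)   # TRUE where both values are present
x.ok <- toy.x[keep]
y.ok <- toy.y[keep]
r.by.hand <- sum((x.ok - mean(x.ok)) * (y.ok - mean(y.ok))) /
  sqrt(sum((x.ok - mean(x.ok))^2) * sum((y.ok - mean(y.ok))^2))
# This should match R's built-in function:
all.equal(r.by.hand, cor(toy.x, toy.y, use = "complete.obs"))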
## Example 5: Does your expected score on Exam 1 depend on where you like to sit
# in class?
# This example compares values of a ratio scale variable ('Exam1') across levels
# of a nominal scale variable ('sit').
#
# Specifically, we'll calculate the mean value of a ratio scale variable for
# students that fall into each level of a nominal scale variable.
#
# R has a function 'tapply' that calculates a summary statistic for each
# level of a nominal variable:
# (na.rm = TRUE tells 'mean' to skip any missing answers)
means <- tapply(survey$Exam1, survey$sit, mean, na.rm = TRUE)
# The result is a 'labeled' list, showing the mean predicted Exam 1 score
# for each of our three levels of 'sit'
means
# We can use 'barplot' to plot these means:
barplot(means,
        main = "Expected score for Exam 1")
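# To make it clearer what 'tapply' does, here is a minimal sketch on made-up
# data ('toy.scores' and 'toy.seat' are hypothetical). It splits the scores
# by seat and applies 'mean' to each group, skipping the missing score:
toy.scores <- c(90, 80, 70, 85, NA, 75)
toy.seat   <- c("Front", "Front", "Back", "Middle", "Middle", "Back")
toy.means  <- tapply(toy.scores, toy.seat, mean, na.rm = TRUE)
toy.means
# The group names come out in alphabetical order: Back, Front, Middle.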
# We can also plot these means with error bars representing the standard error
# of the mean. This requires a bit more work, but here's some code
# that will do it for you:
# First, create a 'data frame' that we'll call 'summary' that holds the
# means, sample sizes, sd's and sem's. Note the '=' (not '<-') inside
# data.frame(): '=' names the columns, while '<-' would instead create
# stray variables named 'mean', 'n' and 'sd' in your workspace.
summary <- data.frame(
  mean = tapply(survey$Exam1, survey$sit, mean, na.rm = TRUE),
  n    = tapply(survey$Exam1, survey$sit, function(x) sum(!is.na(x))),
  sd   = tapply(survey$Exam1, survey$sit, sd, na.rm = TRUE))
summary$sem <- summary$sd / sqrt(summary$n)
summary
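# The standard error of the mean (sem) is the standard deviation divided by
# the square root of the sample size. A minimal sketch for a single group,
# using made-up scores ('toy.exam' is hypothetical) with one missing answer:
toy.exam <- c(80, 90, 100, NA)
toy.n    <- sum(!is.na(toy.exam))        # count only the non-missing scores
toy.sd   <- sd(toy.exam, na.rm = TRUE)
toy.sem  <- toy.sd / sqrt(toy.n)
toy.sem
# Counting with sum(!is.na(...)) matters: length(toy.exam) would also count
# the NA and make the sem too small.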
# The levels in 'sit' come out in alphabetical order. To re-order them
# from 'front' to 'middle' to 'back' we'll define a list in the right order
# and use this list in ggplot:
levels <- row.names(summary)
levels <- levels[c(2,1,3)]
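# Another way to control the order, shown here only as a sketch on made-up
# data, is to give a factor its levels explicitly. The level names below are
# hypothetical and may not match the ones in the survey:
toy.seat.order <- factor(c("Back", "Front", "Middle", "Front"),
                         levels = c("Front", "Middle", "Back"))
table(toy.seat.order)   # counts now appear in the order Front, Middle, Back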
# Bar graph:
# Define y limits for the bar graph
ylimit <- c(min(summary$mean - 1.5 * summary$sem),
            max(summary$mean + 1.5 * summary$sem))
# Plot bar graph with error bar as one standard error (standard error of the mean/SEM)
ggplot(summary, aes(x = row.names(summary), y = mean)) +
  xlab("Where do you like to sit in class?") +
  geom_bar(position = position_dodge(), stat = "identity", fill = "blue") +
  geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = .5) +
  scale_y_continuous(name = "Predicted Exam 1 score") +
  scale_x_discrete(limits = levels) +
  coord_cartesian(ylim = ylimit)
# Does it look like the expected Exam 1 score varies across where you
# like to sit?
#
# Later on in the quarter we'll run an 'ANOVA', or analysis of variance,
# to determine the probability of getting means this far apart from each
# other by chance.
# In summary, we've gone through ways of summarizing and plotting data in 5 ways:
# 1) means and histograms for ratio scale data
# 2) frequencies and bar graphs for nominal scale data
# 3) frequencies and bar graphs of nominal vs. nominal scale data
# 4) correlations and scatterplots of ratio vs. ratio scale data
# 5) means and bar graphs of means for ratio scale across nominal scale levels
# Each of these 5 plots has a corresponding statistical test that we'll
# be covering during the quarter.