Project 2 - Data Modeling
For due date, see syllabus
Objective:
The purpose of this project is to provide you with experience in mapping between
phenomena in the real world (and data) and the possible distribution functions that can
model the data.
General Task:
Your task is to compare a possible model for a random variable with data for the random
variable. This should seem familiar: it is what we did in class when we looked at
data concerning the number of hat-wearing people and compared the data with the
predictions from a Geometric model.
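As a rough illustration of this kind of comparison, here is a minimal Python sketch. The data values are hypothetical, and the definition of the random variable (the number of people observed up to and including the first hat wearer) is an assumption made for the example, not necessarily the definition used in class. The function names are illustrative as well.

```python
from collections import Counter

# Hypothetical observations (not class data): number of people observed
# up to and including the first hat wearer, one value per observation run.
observations = [1, 3, 2, 1, 5, 2, 1, 4, 1, 2, 7, 1, 3, 2, 1, 6, 2, 1, 3, 1]


def geometric_pmf(k, p):
    """P(X = k) for a Geometric model that counts trials up to the first success."""
    return (1 - p) ** (k - 1) * p


def compare_to_geometric(data):
    """Return (p_hat, rows); each row is (k, observed rel. freq., modeled probability)."""
    n = len(data)
    p_hat = n / sum(data)  # the model's mean is 1/p, so estimate p as 1 / sample mean
    counts = Counter(data)
    rows = []
    for k in range(1, max(data) + 1):
        observed = counts.get(k, 0) / n
        rows.append((k, observed, geometric_pmf(k, p_hat)))
    return p_hat, rows


p_hat, rows = compare_to_geometric(observations)
print(f"estimated p = {p_hat:.3f}")
for k, observed, modeled in rows:
    print(f"k={k}  observed={observed:.3f}  model={modeled:.3f}")
```

Reading the two columns side by side shows where the data and the model agree and where they do not.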
Procedure:
Below, I have described the process for accomplishing the overall task. In addition, I
have indicated time estimates for each of the steps in the process. The time
estimates are based, in part, on your documenting your work in the report as you do it.
- Brainstorming Potential Random Variables (0.5 hours): Before committing
to a specific random variable, we would like you to generate a list of at least six
potential random variables. For each possible random variable, you should state the
definition of the random variable and suggest how data could be obtained (e.g., from the
web, through data collection, etc.). Your list should include at least three discrete
random variables and three continuous random variables.
- Identify the Random Variable for Study: Select a random variable from
your list of potential random variables. Ensure that your random variable is clearly
defined.
- Obtain data for the random variable: For the chosen random variable,
you should obtain some data. This could be done in at least two ways:
(a) Web repositories (0.5 hours): You might obtain this data by going to one of the web
sites containing data sets. The resource component of the course web page links to two
sites that contain numerous datasets. For example, I found the scores of all NCAA
basketball championship games (quarter-finals, semi-finals, and finals) since 1935.
(b) Data Collection (1 hour): You might simply collect some data. For example, if you were
interested in the rate at which people enter the dining hall, you could record the number
of people entering the dining hall during two-minute intervals. You could obtain 30 data
points during a one-hour lunch.
- Describe the Data using Descriptive Statistics (1 hour): Describe the
data using the tools and concepts from Chapter 2. At minimum, your description should
include a relative frequency diagram (a histogram variation), values for the sample mean
and variance, and a discussion of outliers.
- Propose and Develop a Model for the Random Variable (2 hours): Based on
the definition of the random variable and the characteristics of the data you collected,
identify a potential model for your data (e.g., Normal, Geometric, etc.) and explain your
choice. Present details of your model (e.g., a graph of the probability density function).
- Compare the Model and the Data (0.5 hours): Compare your model to your
data. Discuss how well the model fits the data. Think about the following questions:
Where do the data and model agree? Where do the data and model disagree? What might
explain any observed disparity?
- Write (1 hour): Write your report and submit it.
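The descriptive statistics step can be sketched in Python as follows. The counts are hypothetical values for the dining hall example (30 two-minute intervals), and the 1.5 x IQR rule used to flag outliers is an assumed convention; use whichever outlier criterion the course text gives.

```python
import statistics
from collections import Counter

# Hypothetical counts of people entering the dining hall during
# 30 two-minute intervals (illustrative values only).
arrivals = [4, 6, 3, 5, 7, 4, 5, 2, 6, 5,
            4, 3, 5, 8, 4, 6, 5, 3, 4, 15,
            5, 6, 4, 5, 3, 7, 4, 5, 6, 4]


def relative_frequencies(data):
    """Map each observed value to its share of the sample."""
    n = len(data)
    counts = Counter(data)
    return {value: counts[value] / n for value in sorted(counts)}


def iqr_outliers(data):
    """Flag values more than 1.5 * IQR outside the quartiles (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [x for x in data if x < low or x > high]


sample_mean = statistics.mean(arrivals)
sample_variance = statistics.variance(arrivals)  # divides by n - 1
rel_freq = relative_frequencies(arrivals)
outliers = iqr_outliers(arrivals)

print(f"sample mean = {sample_mean:.2f}, sample variance = {sample_variance:.2f}")
print(f"possible outliers: {outliers}")
# Text version of a relative frequency diagram: one star per observation.
for value, freq in rel_freq.items():
    print(f"{value:>3} | {'*' * round(freq * len(arrivals))} {freq:.3f}")
```

A flagged value is only a candidate outlier; the report should still discuss whether it reflects a recording error or a real event.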
Report Guidelines:
You should submit a report describing your activities. Your report should contain the
exact sections described below. The point values that will be assigned to the sections are
listed to the right of the section title. Note that the report starts with the
identification of the random variable; the brainstorming results go in an appendix.
- Problem Statement (10): The problem statement should state the goal of
your modeling efforts. The statement should include the definition of the random variable.
In addition, you should provide one reason why we might want to model the random variable.
- Procedure (5): The procedure should describe the steps as you executed
them (e.g., how did you brainstorm, how did you do your analysis). It is particularly
important that you state how you obtained your data.
- Data Description (20): This section should include the results of your
descriptive analysis. See the comments earlier for details.
- Proposed Model (25): This section should present your proposed model
and justification. See the comments earlier for details.
- Discussion of the "Fit" (10): In this section, you should
comment on the fit between the model and the data. See comments earlier for
details.
- Summary and Conclusions (10): In this section, you should summarize the
process you followed in the project and then state your conclusion about the type of model,
and the fit of the specific model you used, for your data.
- "What I learned" Statements (10): This section should contain
brief reflections (1-2 paragraphs) on what you learned from the modeling project. For
example, you might comment on the effort required to identify a random variable, the
amount of time involved, or the difficulty of judging the "fit" or quality of
the model. You might particularly focus on things that you did not expect to learn: what
surprised you, frustrated you, made you curious, etc. For example, did you have difficulty
in finding data on the web or in interpreting the data? As before, this component of the
report is to be done individually. If you are completing the project with another student,
this section should contain individual statements from each student.
- Appendix I. Possible Random Variables for Study (10): The results of
your brainstorming session should be included in an appendix at the end of the report.
Hints:
Some hints that we have already identified:
1. Choosing a Random Variable: Identify a random variable in which you
are interested and curious. This is an opportunity to learn about something in which you
are simply interested. (Did you know that 112 NCAA final four games have been won by only
1 point and that 78 have been lost by 50+ points?)
2. Identifying Random Variables: When generating your list of
potential random variables, you might want to check out the datasets available through the
course web page. You might also look through the sample problems in the book. Further, you
might look to the supplemental homework problems in which you identified random variables
for your discipline.
3. Size of Dataset: The size of the dataset is up to you. In general,
the larger the better. If you find your data from a web source, then you may be
constrained by what the source provides (although many web datasets are large). If you are collecting data, you may be
constrained by time. If you have questions about a specific situation, then send email or
come by office hours.
4. Creating a Model: Excel has a series of functions that can help you
create your model. The functions have names like NORMSDIST, CHIDIST, EXPONDIST,
HYPGEODIST, BINOMDIST, and POISSON. They take random variable values (and the model's
parameters) as arguments and return values for either the probability density (or mass)
function or the cumulative distribution function.
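If you prefer to work outside Excel, the same quantities can be computed with the Python standard library. The sketch below gives rough counterparts to four of the functions named in hint 4; the Python function names are made up for this example, and the `cumulative` flag mirrors the TRUE/FALSE argument that EXPONDIST, BINOMDIST, and POISSON take in Excel.

```python
import math
from statistics import NormalDist


def norm_s_dist(z):
    """Standard normal CDF, the quantity NORMSDIST(z) returns."""
    return NormalDist().cdf(z)


def expon_dist(x, rate, cumulative):
    """Exponential CDF (cumulative=True) or density (cumulative=False)."""
    if x < 0:
        return 0.0
    if cumulative:
        return 1 - math.exp(-rate * x)
    return rate * math.exp(-rate * x)


def binom_dist(k, n, p, cumulative):
    """Binomial P(X <= k) (cumulative=True) or P(X = k) (cumulative=False)."""
    def pmf(i):
        return math.comb(n, i) * p ** i * (1 - p) ** (n - i)
    return sum(pmf(i) for i in range(k + 1)) if cumulative else pmf(k)


def poisson(k, mean, cumulative):
    """Poisson P(X <= k) (cumulative=True) or P(X = k) (cumulative=False)."""
    def pmf(i):
        return math.exp(-mean) * mean ** i / math.factorial(i)
    return sum(pmf(i) for i in range(k + 1)) if cumulative else pmf(k)


print(norm_s_dist(0.0))              # probability of a standard normal value below 0
print(binom_dist(2, 4, 0.5, False))  # exactly 2 successes in 4 fair trials
print(poisson(3, 5.0, True))         # at most 3 events when the mean is 5
```

Evaluating one of these functions over the range of your data gives the model values to plot against the relative frequency diagram.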
Extra Credit:
If you are willing to have us publish your report on the web and would like a little
extra credit, send us your report electronically (as well as submitting the paper copy).
Other students may be interested in seeing your analysis.