Project 2 Description

Project 2 - Data Modeling

For due date, see syllabus

Objective:

The purpose of this project is to provide you with experience in mapping between phenomena in the real world (and data) and the possible distribution functions that can model the data.

General Task:

Your task is to compare a possible model for a random variable with data for that random variable. This should seem familiar: it is what we did in class when we looked at data on the number of hat-wearing people and compared the data with the predictions from a Geometric model.

Procedure:

Below, I have described the process for accomplishing the overall task and given a time estimate for each step. The time estimates assume, in part, that you document your work in the report as you go.

  1. Brainstorming Potential Random Variables (0.5 hours): Before committing to a specific random variable, we would like you to generate a list of at least six potential random variables. For each one, state the definition of the random variable and suggest how data could be obtained (e.g., from the web, through data collection, etc.). Your list should include at least three discrete random variables and at least three continuous random variables.
  2. Identify the Random Variable for Study: Select a random variable from your list of potential random variables. Ensure that your random variable is clearly defined.
  3. Obtain data for the random variable: For the chosen random variable, you should obtain some data. This could be done in at least two ways:

    (a) Web repositories (0.5 hours): You might obtain this data from one of the web sites containing data sets. The resource component of the course web page links to two sites that contain numerous datasets. For example, I found the scores of all NCAA basketball championship games (quarter-finals, semi-finals, and finals) since 1935.

    (b) Data Collection (1 hour): You might simply collect some data. For example, if you were interested in the rate at which people enter the dining hall, you could record the number of people entering the dining hall during two-minute intervals. You could obtain 30 data points during a one-hour lunch.
  4. Describe the Data using Descriptive Statistics (1 hour): Describe the data using the tools & concepts from chapter 2. At minimum, your description should include a relative frequency diagram (a histogram variation), values for the sample mean and variance, and a discussion of outliers.
  5. Propose and Develop a Model for the Random Variable (2 hours): Based on the definition of the random variable and the characteristics of the data you collected, identify a potential model for your data (e.g., Normal, Geometric, etc.) and explain your choice. Present details of your model (e.g., a graph of the probability density function).
  6. Compare the Model and the Data (0.5 hours): Compare your model to your data. Discuss how well the model fits the data. Think about the following questions: Where do the data and model agree? Where do they disagree? What might explain any observed disparity?
  7. Write (1 hour): Write your report and submit it.
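
As a rough illustration of steps 4 through 6, the following Python sketch works through the dining hall example. The counts are invented for illustration only (they are not real data), and the choice of a Poisson model, the helper names `poisson_pmf` and `relative_frequencies`, and the use of Python rather than Excel are all assumptions made for this sketch. Your own random variable may call for a different model.

```python
from collections import Counter
from math import exp, factorial
from statistics import mean, variance

# Hypothetical counts of people entering the dining hall in 30 two-minute
# intervals. These numbers are invented for illustration, not collected data.
counts = [3, 5, 2, 4, 6, 3, 4, 1, 5, 4,
          2, 3, 7, 4, 3, 5, 2, 4, 6, 3,
          4, 5, 3, 2, 4, 8, 3, 5, 4, 3]


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)


def relative_frequencies(data):
    """Map each observed value to its share of the sample."""
    n = len(data)
    return {value: count / n for value, count in sorted(Counter(data).items())}


# Step 4: descriptive statistics.
sample_mean = mean(counts)
sample_var = variance(counts)  # sample variance (divides by n - 1)
rel_freq = relative_frequencies(counts)

# Step 5: a Poisson model whose rate is estimated by the sample mean.
lam = sample_mean

# Step 6: observed relative frequency next to the model probability.
print(f"mean = {sample_mean:.2f}, variance = {sample_var:.2f}")
print(f"{'k':>2} {'observed':>9} {'model':>7}")
for k in range(max(counts) + 1):
    observed = rel_freq.get(k, 0.0)
    print(f"{k:>2} {observed:>9.3f} {poisson_pmf(k, lam):>7.3f}")
```

A Poisson random variable has equal mean and variance, so comparing the sample mean with the sample variance is a quick first check before looking at the table row by row. Rows where the observed and model columns differ noticeably are the places to discuss in step 6.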

Report Guidelines:

You should submit a report describing your activities. Your report should contain exactly the sections described below. The point value assigned to each section is listed to the right of the section title. Note that the report does not begin with step 1 of the process; the brainstorming results go in an appendix.

  1. Problem Statement (10): The problem statement should state the goal of your modeling efforts. The statement should include the definition of the random variable. In addition, you should provide one reason why we might want to model the random variable.
  2. Procedure (5): The procedure should describe the steps as you executed them (e.g., how did you brainstorm, how did you do your analysis). It is particularly important that you state how you obtained your data.
  3. Data Description (20): This section should include the results of your descriptive analysis. See the comments earlier for details.
  4. Proposed Model (25): This section should present your proposed model and justification. See the comments earlier for details.
  5. Discussion of the "Fit" (10): In this section, you should comment on the fit between the model and the data. See the comments earlier for details.
  6. Summary and Conclusions (10): In this section, you should summarize the process of the lab and then provide the concluding statement concerning the type of model, and the fit of the specific model you used, for your data.
  7. "What I learned" Statements (10): This section should contain brief reflections (1-2 paragraphs) on what was learned from the modeling project. For example, you might comment on the effort required to identify a random variable, the amount of time involved, and/or difficulty of judging the "fit" or quality of the model. You might particularly focus on things that you did not expect to learn - what surprised you, frustrated you, made you curious, etc. For example, did you have difficulty in finding data on the web or in interpreting the data? As before, this component of the report is to be done individually. If you are completing the project with another student, this section should contain individual statements from each student.
  8. Appendix I. Possible Random Variables for Study (10): The results of your brainstorming session should be included in an appendix at the end of the report.

Hints:

Some hints that we have already identified:

1. Choosing a Random Variable: Identify a random variable that interests you and makes you curious. This is an opportunity to learn about something simply because you find it interesting. (Did you know that 112 NCAA Final Four games have been won by only 1 point and that 78 have been lost by 50+ points?)

2. Identifying Random Variables: When generating your list of potential random variables, you might want to check out the datasets available through the course web page. You might also look through the sample problems in the book. Further, you might look to the supplemental homework problems in which you identified random variables for your discipline.

3. Size of Dataset: The size of the dataset is up to you. In general, the larger the better. If you obtain your data from a web source, the size may be limited by what the source provides (although many web datasets are large). If you are collecting data, you may be constrained by time. If you have questions about a specific situation, send email or come by office hours.

4. Creating a Model: Excel has a series of functions that can help you create your model. The functions have names like NORMSDIST, CHIDIST, EXPONDIST, HYPGEODIST, BINOMDIST, and POISSON. They take values of the random variable (along with the parameters of the distribution) as arguments and return values of either the probability distribution function or the cumulative distribution function. Check Excel's help for each function to see which of the two it returns.
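
If you prefer to work outside Excel, the same kinds of values can be computed with Python's standard library. The sketch below is only an illustration under that assumption: the helper names `binomial_pmf`, `poisson_pmf`, and `exponential_cdf` are made up for this example, and they are rough counterparts to the Excel functions above, not exact replacements (argument order and options differ).

```python
from math import comb, exp, factorial
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(X = k) for a Binomial(n, p) random variable."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)


def exponential_cdf(x, rate):
    """P(X <= x) for an Exponential random variable with the given rate."""
    return 1 - exp(-rate * x) if x >= 0 else 0.0


# Standard Normal distribution (mean 0, standard deviation 1).
standard_normal = NormalDist(mu=0.0, sigma=1.0)

print(standard_normal.cdf(1.96))   # cumulative distribution function at z = 1.96
print(standard_normal.pdf(0.0))    # probability density function at z = 0
print(binomial_pmf(3, 10, 0.5))    # P(3 successes in 10 trials with p = 0.5)
print(poisson_pmf(2, 3.0))         # P(2 events when the mean count is 3)
print(exponential_cdf(1.0, 2.0))   # P(waiting time <= 1 when the rate is 2)
```

Evaluating one of these functions over the range of values in your data gives the model column that you can plot against your relative frequency diagram.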

Extra Credit:

If you are willing to have us publish your report on the web and would like a little extra credit, send us your report electronically (as well as submitting the paper copy). Other students may be interested in seeing your analysis.