Assignment 3: Exploratory Data Analysis
Description
In this assignment you will use visualization software to perform exploratory data analysis on a real-world dataset related to your final project. The goal is to gain practice formulating and answering questions through visual analysis and to learn and critique a leading visualization tool.
This assignment will also serve as a "proof of concept" or "stress test" for the data sets and initial ideas your team is developing for your final projects. It need not become a prototype for your final project. Rather, it is an opportunity to use exploratory data analysis to familiarize yourself with the data and the types of questions and insights it can support, and to ensure that the data set(s) you have selected for your final project are sufficiently large and complex.
You must work in pairs formed from your project group for this assignment. The assignment should be 3,000 – 5,000 words in length (10 – 20 pages with images). You'll be turning it in online by uploading it to the class dropbox.
Assignment
For this assignment, you will use a visualization tool to analyze a data set. You may use Tableau Software or an alternate tool (see below).
OVERVIEW
First steps:
Step 1. Identify and select candidate data sets from your final project's domain and data
Step 2. Profile and clean the data (see data section below)
Step 3. Pose initial questions (see exploration process section below)
Iterate as needed.
Create visualizations:
Interact with the data and create intermediate views
Refine each of your 4 initial questions
Use the results of these initial explorations to develop a final question
For your writeup:
Keep a record of your analysis and the views you create
Discuss the process and results/findings of your 4 intermediate explorations including the graphs you have made
Prepare at least one final graphic and caption to answer an interesting question that has emerged from your exploratory analysis of your 4 initial questions
EXPLORATION PROCESS
During your exploration of the data, we encourage you to create and record various types of views, including bar charts, scatter plots, maps and time series as appropriate for the data and question you are exploring. Note how different views support different questions and may reveal areas for further questions or exploration.
- Look at the data and/or its description. Write down at least four initial questions that you think the data may answer, including a comparative question, a correlation question, a geographically-oriented question, or a time-related or trend question.
- Use the visualization tool to examine the data for answers to your initial questions. You may wish to look for, for example:
- relationships between pairs of variables (correlations, clusters)
- outliers of various kinds
- trends
- You may wish to refine your initial questions based on what you find as you explore. For example, you may wish to pose a related question about a subset of the data. Filtering, sorting, or other operations may be helpful.
- Use the visualization tool to explore the dataset and look for other unexpected kinds of relations. Note what features of the visualized data attracted your attention/focus. Try to highlight or otherwise isolate the subset of the data that contains an interesting feature.
- For some datasets it can be helpful to transform some of the data (e.g. by computing averages or medians, by converting numbers to percentages, etc.). You may apply these kinds of transformations if you feel they are necessary or helpful (Tableau supports this), but if they are not needed, then leave the data as is.
- Write up a discussion of what you found -- both expected and unexpected. This can include relationships that did not appear even though you thought they might. Try to report on at least one interesting or surprising piece of information. Be sure to illustrate your points with screenshots, but please scale them so they aren't too large.
- For each of your questions, create a visualization that answers it. Be sure to label your axes and include an appropriate caption.
- In your discussion, comment on your use of Tableau (or other tool). What features did you find useful? Which ones were intuitive to use, and which were hard to understand? Was there any functionality that the tool did not have that you wished it did? In other words, how would you improve on the tool?
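The transformations mentioned in the list above (averages, medians, percentages) can also be computed outside the visualization tool. The following is a minimal Python sketch, not a required method. The `spending` values and the group labels are made up purely for illustration.

```python
from statistics import mean, median

# Made-up values for illustration only; they come from no real dataset.
spending = {"A": 200.0, "B": 300.0, "C": 500.0, "D": 4000.0}

# Average and median of the raw values.
avg = mean(spending.values())
mid = median(spending.values())

# Convert each raw value to a percentage of the total.
total = sum(spending.values())
share_pct = {group: round(100 * value / total, 1)
             for group, value in spending.items()}

print("mean:", avg)
print("median:", mid)
print("share of total (%):", share_pct)
```

In this toy data the single large value pulls the mean (1250.0) well above the median (400.0). That kind of gap is often worth investigating as a possible outlier before you decide which summary to plot.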
Data
Please be proactive and plan to invest a fair amount of time in identifying and securing appropriate datasets for this assignment, keeping in mind that these will most likely also be used for your final projects. Do not underestimate the effort involved in acquiring good datasets. They should be sufficiently rich for you to be able to discover interesting information by exploring them. These sets should ideally contain a mix of nominal, ordinal, quantitative, geographical, and temporal data.
Keep in mind that your data may require a significant amount of pre-processing, cleaning or coding on your part; the data may contain empty fields, spelling errors, or other incomplete or faulty data, which you may need to address as you explore. It is very important to budget sufficient time for data cleaning, particularly if you are using data cleaning tools that are new or unfamiliar.
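If you choose to script your cleaning, the kinds of problems described above (empty fields, spelling errors, inconsistent labels) might be handled along these lines. This is a rough sketch under stated assumptions: the sample text, the column names (`state`, `party`, `amount`), and the `PARTY_FIXES` map are all hypothetical and would need to be replaced with rules that suit your own data.

```python
import csv
import io

# Hypothetical sample standing in for a raw file; the column names and
# values are made up for illustration only.
RAW = """state,party,amount
CA,Democrat,1200
ca ,democrat,
NY,Republcan,800
"""

# Hand-built map from lowercased labels (including a known misspelling)
# to canonical labels. This map is an assumption for the example.
PARTY_FIXES = {"republcan": "Republican", "democrat": "Democrat"}


def clean_rows(text):
    """Trim whitespace, normalize labels, and turn empty amounts into None."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        state = row["state"].strip().upper()
        party_key = row["party"].strip().lower()
        party = PARTY_FIXES.get(party_key, row["party"].strip())
        amount_text = row["amount"].strip()
        # Keep missing values explicit rather than silently using 0.
        amount = float(amount_text) if amount_text else None
        cleaned.append({"state": state, "party": party, "amount": amount})
    return cleaned


rows = clean_rows(RAW)
missing = sum(1 for r in rows if r["amount"] is None)
print(f"{len(rows)} rows, {missing} with a missing amount")
```

Recording missing amounts as `None` instead of zero keeps them from distorting sums and averages later, and makes it easy to report how many fields were empty in your dataset profile.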
You are encouraged to look at the data to get a feel for its contents, structure and scale before beginning your analysis. You may use Tableau to look at the underlying data after connecting to it (use the spreadsheet shaped button at the upper left under the word Data) or you may open the file in Excel, a text editor, or other appropriate software depending on your data source, before loading it into the visualization tool.
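For those using a scripting tool, a first look at contents, structure, and scale could be sketched as below. The sample text, its column names, and the `profile` helper are hypothetical. In practice you would read your own file rather than an inline string.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data with made-up district labels and amounts.
SAMPLE = """district,year,party,spending
AA-01,2014,Democrat,1500000
BB-02,2014,Republican,
AA-03,2016,Democrat,900000
"""


def profile(text):
    """Return the row count and, per column, empty and distinct value counts."""
    reader = csv.DictReader(io.StringIO(text))
    empty = Counter()
    distinct = {name: set() for name in reader.fieldnames}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in reader.fieldnames:
            value = row.get(name)
            if value is None or value.strip() == "":
                empty[name] += 1
            else:
                distinct[name].add(value.strip())
    columns = {
        name: {"empty": empty[name], "distinct": len(values)}
        for name, values in distinct.items()
    }
    return n_rows, columns


n_rows, columns = profile(SAMPLE)
print("rows:", n_rows)
for name, stats in columns.items():
    print(name, stats)
```

Counts like these can feed directly into the basic dataset profile (contents, size, perceived quality) that the writeup asks for.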
Since your dataset should include a large amount of data, a large number of questions could be asked at many levels of detail. Using congressional candidate spending as an example, one might want to investigate spending and contributions at an aggregated level, breaking down the data by political party at the national level. Alternatively, it is equally valid to filter out many of the attributes or entire sections of the data and explore, say, finances at a finer granularity, for example by investigating one's own state, or one's local and neighboring congressional districts.
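The two levels of detail in the congressional spending example can be illustrated with a short Python sketch. Everything here is assumed for illustration: the records are made up, the state and district labels are placeholders, and `total_by` is a hypothetical helper. The same aggregate-versus-filter choice can be made directly in Tableau or another tool.

```python
from collections import defaultdict

# Entirely made-up records; field names and labels are placeholders.
records = [
    {"state": "AA", "district": "AA-01", "party": "Democrat", "spending": 100},
    {"state": "AA", "district": "AA-02", "party": "Republican", "spending": 40},
    {"state": "BB", "district": "BB-01", "party": "Republican", "spending": 90},
    {"state": "BB", "district": "BB-01", "party": "Democrat", "spending": 60},
]


def total_by(rows, key):
    """Sum the spending field for each distinct value of the given key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["spending"]
    return dict(totals)


# Aggregated view: national totals broken down by party.
national_by_party = total_by(records, "party")

# Finer-grained view: filter to one state, then break down by district.
one_state = [row for row in records if row["state"] == "AA"]
state_by_district = total_by(one_state, "district")

print(national_by_party)
print(state_by_district)
```

Both views come from the same records. Only the filter and the grouping key change, which is why a single rich dataset can support questions at several levels of detail.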
Tools
For data cleaning, you may use any of a variety of tools, including but not limited to: Excel, Trifacta Wrangler, OpenRefine, or scripting tools such as Python.
You may use Tableau or other visualization tools of your choosing for this assignment. Possible choices include GGobi, R, and D3.
As a part of your final writeup, please provide a brief critique of the tool that you used.
Grading (50 pts)
Assignments will be graded based on:
- Clear questions and applicable dataset to support those questions
- Basic description (profile) of dataset contents, size & perceived quality
- The description of your visual exploration process
- Major view types included (bar charts, scatter plots, maps and time series) with appropriate related questions
- The depth of your analysis
- The design of your final visualizations
- Instructive image (does it answer the question?)
- Appropriate caption and description
- Expressiveness/effectiveness of the visualization
- Comments and evaluation of the visualization tool including any improvements you might make.