Assignment 3: Exploratory Data Analysis

Description

In this assignment you will use visualization software to perform exploratory data analysis on a real-world dataset related to your final project. The goal is to gain practice formulating and answering questions through visual analysis and to learn and critique a leading visualization tool.

This assignment will also serve as a "proof of concept" or "stress test" for the data sets and initial ideas your team is developing for your final project. It need not serve as a prototype for that project. Rather, it is an opportunity to use exploratory data analysis techniques to familiarize yourself with the data and the types of questions and insights it can support, and to ensure that the data set(s) you have selected are sufficiently large and complex.

You must work in pairs formed from your project group for this assignment. The assignment should be 3,000 – 5,000 words in length (10 – 20 pages with images). You'll be turning it in online by uploading it to the class dropbox.

Assignment

For this assignment, you will use a visualization tool to analyze a data set. You may use Tableau Software or an alternative tool (see below).

OVERVIEW

First steps:

Step 1. Identify and select candidate data sets from your final project domain

Step 2. Profile and clean the data (see data section below)

Step 3. Pose 4 initial questions (see exploration process section below)

Iterate as needed.

Create visualizations:

Interact with the data and create intermediate views

Refine each of your 4 initial questions

Use the results of these initial explorations to develop a final question

For your writeup:

Keep a record of your analysis and the views you create

Discuss the process and results/findings of your 4 intermediate explorations including the graphs you have made

Prepare at least one final graphic and caption to answer an interesting question that has emerged from your exploratory analysis of your 4 initial questions

EXPLORATION PROCESS

During your exploration of the data, we encourage you to create and record various types of views, including bar charts, scatter plots, maps, and time series, as appropriate for the data and the question you are exploring. Note how different views support different questions and may reveal areas for further questions or exploration.

Data

Please be proactive and plan to invest a fair amount of time in identifying and securing appropriate datasets for this assignment, keeping in mind that they will most likely also be used for your final project. Do not underestimate the effort involved in acquiring good datasets. They should be rich enough that you can discover interesting information by exploring them. Ideally, they should contain a mix of nominal, ordinal, quantitative, geographical, and temporal data.

Keep in mind that your data may require a significant amount of pre-processing, cleaning or coding on your part; the data may contain empty fields, spelling errors, or other incomplete or faulty data, which you may need to address as you explore. It is very important to budget sufficient time for data cleaning, particularly if you are using data cleaning tools that are new or unfamiliar.
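
If you choose a scripting tool for cleaning, the kinds of problems described above (empty fields, spelling errors, unparseable values) might be handled along the following lines. This is only a minimal sketch using the Python standard library. The sample data, the column names, and the PARTY_FIXES lookup table are all hypothetical and would need to be replaced with values drawn from your own data set.

```python
import csv
import io

# Hypothetical sample standing in for a raw export; column names are assumptions.
RAW = """candidate,party,state,spending
A. Smith,Democrat,CA,120000
B. Jones,democrat ,CA,
C. Lee,Republcan,NV,98000
D. Kim,Republican,OR,n/a
"""

# Hand-built map from observed spellings to canonical labels (an assumption;
# build yours by listing the distinct values in each nominal column).
PARTY_FIXES = {
    "democrat": "Democrat",
    "republcan": "Republican",
    "republican": "Republican",
}


def clean_row(row):
    """Return a cleaned copy of one record; unparseable numbers become None."""
    party_key = row["party"].strip().lower()
    try:
        spending = float(row["spending"])
    except ValueError:
        spending = None  # keep the row, but mark the value as missing
    return {
        "candidate": row["candidate"].strip(),
        "party": PARTY_FIXES.get(party_key, row["party"].strip()),
        "state": row["state"].strip().upper(),
        "spending": spending,
    }


cleaned = [clean_row(r) for r in csv.DictReader(io.StringIO(RAW))]
missing = sum(1 for r in cleaned if r["spending"] is None)
print(f"{len(cleaned)} rows, {missing} with missing spending")
```

Keeping rows with missing values, rather than dropping them, lets you decide later whether to exclude them from a particular view, and lets you report how much of the data was affected.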

You are encouraged to look at the data to get a feel for its contents, structure, and scale before beginning your analysis. You may use Tableau to look at the underlying data after connecting to it (use the spreadsheet-shaped button at the upper left under the word Data), or you may open the file in Excel, a text editor, or other appropriate software, depending on your data source, before loading it into the visualization tool.
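
If the file is too large to inspect comfortably by eye, a short script can give a similar first impression. The sketch below is one possible approach, not a required one. It reports, for each column, how many values are missing, how many distinct values appear, and which value is most common. The inline sample and its column names are hypothetical; in practice you would read your own file instead.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; in practice, read your own file with open(path, newline="").
RAW = """district,party,receipts,filed
CA-12,Democrat,250000,2008-03-01
CA-13,Republican,,2008-03-04
NV-02,Republican,91000,
"""


def profile(rows):
    """Summarize each column: missing count, distinct count, most common value."""
    summary = {}
    for column in rows[0].keys():
        values = [r[column].strip() for r in rows]
        present = [v for v in values if v]  # treat empty strings as missing
        counts = Counter(present)
        summary[column] = {
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "top": counts.most_common(1)[0][0] if counts else None,
        }
    return summary


rows = list(csv.DictReader(io.StringIO(RAW)))
report = profile(rows)
print(f"{len(rows)} rows, {len(report)} columns")
for column, stats in report.items():
    print(column, stats)
```

A column with very few distinct values is likely nominal or ordinal, while a column with nearly as many distinct values as rows is likely quantitative, temporal, or an identifier.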

Since your dataset should include a large amount of data, many questions could be asked at many levels of detail. Using congressional candidate spending as an example, one might investigate spending and contributions at an aggregated level, breaking down the data by political party at the national level. Alternatively, it is equally valid to filter out many of the attributes or entire sections of the data and explore finances at a finer granularity, for example by investigating your own state and your local and neighboring congressional districts.
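
The two levels of detail in the spending example can be illustrated with a small sketch. The records, field names, and dollar figures below are invented for illustration and do not come from any real data source. The same helper is applied first to the full data (aggregated by party) and then to a filtered subset (one state, broken down by district).

```python
from collections import defaultdict

# Hypothetical records; field names and figures are illustrative only.
records = [
    {"state": "CA", "district": 12, "party": "Democrat", "spending": 300.0},
    {"state": "CA", "district": 12, "party": "Republican", "spending": 100.0},
    {"state": "CA", "district": 13, "party": "Democrat", "spending": 50.0},
    {"state": "NV", "district": 2, "party": "Republican", "spending": 200.0},
]


def total_by(rows, key):
    """Sum the spending field over rows, grouped by the given key."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r["spending"]
    return dict(totals)


# Coarse view: all records, aggregated by party at the national level.
national = total_by(records, "party")

# Fine view: filter to a single state, then break down by district.
ca_only = [r for r in records if r["state"] == "CA"]
by_district = total_by(ca_only, "district")

print(national)
print(by_district)
```

In a tool such as Tableau, the same two views correspond roughly to changing the grouping dimension and adding a filter, so it is worth recording which combination produced each view you keep.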

Tools

For data cleaning, you may use any of a variety of tools, including but not limited to: Excel, Trifacta Wrangler, OpenRefine, or scripting tools such as Python.

You may use Tableau or other visualization tools of your choosing for this assignment. Possible choices include GGobi, R, and D3.

As a part of your final writeup, please provide a brief critique of the tool that you used.

Grading (50 pts)

Assignments will be graded based on: