BUSA 3600 Data Challenge Project – 30% of your final grade
Fall 2019,
last updated 9/26/2019
Overview
There are several publicly available data sets available to encourage “citizen data scientists” to use data to answer questions, investigate problems, or explore relationships. For instance, several questions that could be investigated with the consumer complaint database include:
Which banks provide the best service for mortgages in a specific state? (Service could be defined as having a low number of complaints relative to their total number of customers and providing a timely response to complaints that occur)
Looking at the open-ended responses provided by consumers, which are the top complaints related to mortgages? Do these complaints suggest additional subcategories of complaints that could be added?
How have complaints changed over time?
How does demographic information change the nature of the complaints (compare wealthy vs. poor areas, areas with high level of education vs. low, urban vs. suburban, etc.)?
Project Components and Due Dates
For this project, you will initially
work alone on Part 1, and then will
complete Part 2 with 1 other person. You will demonstrate your knowledge of BI tools, and analyze a data set with a specific question or questions in mind.
Getting Started In-Class Activity – Monday, September 30
Find a suitable and
Primary Fact Table
that is based on a personal area of interest by
Wednesday, Oct 2nd
11:59
. Aim for a table with 5000 rows, and at least 6-8 fields (columns) with both numbers and categorical (text) attributes. Less than 5000 records is “OK”, but fewer than 8 columns will be
Use Google searches to gain access to large data sets that are supportive of your interests, and include in you search terms which may render unique fields (usually categorical) that can be joined with other tables.
For instance, Google “Violent crime statistics by zip code”, “Education outcomes by University ID”, “NFL players by college institution”, “Section 8 housing statistics by FIPS code”.
FIPS codes, GEOID codes, hospital ID, county and State, Vehicle Identification Number (VIN), Airport Code are all unique identifiers. And the .gov websites have 200,000 data sets that you are free to use, as well as others that you may find at work or other sources during your research project.
Avoid using Year as your unique identifier. It will work, but your data set will end up summarized by year. Even with 10 years data, when you join with your dimension table you have an N of 10. There are 4000 unique FIPS codes in the US, 40,000 unique university ID’s…..
The important thing to keep in mind is that you are demonstrating skills learned in class using power pivot. If you search in vain for hours on a topic of interest, change your topic.
Start the following:
Load data into PowerPivot
Identify questions that would be of interest to you that may be supported
Identify unique fields that could be candidates for table joins with a dimension table
Part 1: Individual Data Model and Metadata – 8% of term grade
Due Sunday, October 27th by 11:59 pm. Accepted late with 20% penalty through Wednesday, October 29th
by 11:59 pm
This is an Individual Assignment! Even if you are already in a group, you will work by yourself for this part of the assignment. You will combine the work from both partners in Part 2.
The Excel file you submit should
Include Primary Fact table
Add at least 4 calculated columns to the Primary Fact table that could be relevant for your reporting.
Add 1 dimension table
of your choice to model. There must be a valid relationship between the table you add and the Primary Fact Table.
You cannot use a data set that I provided for you in class.
Add at least 4 calculated columns to the Dimension table that could be relevant for your reporting.
Create 3 PivotTables.
On each PivotTable, add textboxes describing what the tables mean
On each PivotTable, use appropriate formatting for the data type. For example, if working with money, change to use Currency data type. When displaying averages, show a reasonable number of numbers after the decimal point.
Use conditional formatting on at least 1 of the Pivot Tables
Use sorting on at least 1 of the Pivot Tables
Use filtering on at least 1 of the Pivot Tables
The Word document you submit should
Your research questions that you wish to investigate using your data.
Describe the metadata for your Primary Fact Table, and provide URL for your dataset
Describe the metadata for dimension table you added and provide the URL for your dataset
For the metadata, for each column in your data set list the Column Name, Data type, range of valid values or a sample list of values, and a brief description. For data sets with 1600 columns, I would delete at least 1550 that you will not use. Below is an excerpt of a well documented data set about Universities.
Column Name
|
Data Type
|
Valid Values
|
Description
|
Unit ID
|
Whole Number
|
ID > 10000
|
Universities Identifier
|
Universities Name
|
Text
|
no more than 50 characters
|
The name of the university
|
Address
|
Text
|
no more than 50 characters
|
The location of the university
|
City
|
Text
|
no more than 25 characters
|
City of the location of the university
|
State
|
Text
|
States abbrev (only 2 characters)
|
States within the Unites States
|
Zip Code
|
Whole Number
|
1 001- 99929
|
The zip code of the location of the university
|
School Address
|
Text
|
No more than 100 characters
|
Address of the university
|
Southern States
|
Text
|
True or False
|
State identifier
|
Title
|
Text
|
No more than 50 characters
|
The name of the in charge of the school
|
Public and private
|
Text
|
No more than 25 characters
|
Public and private identifiers
|
Pell grant value
|
Whole Number
|
1-100
|
Percentage of undergraduates receiving a Pell Grant
|