Programming Exercise: Create a Small Data Warehouse
Part One Dataset: U.S. Gazetteer Files (you will need to download the files)
Part Two Dataset: original-airport-codes.csv
In this exercise, you will create a small data warehouse using Spark to save data as if it were a table in a typical relational database. Once you create this data warehouse, you can query the tables you created using Structured Query Language (SQL).
For this exercise, you will be creating tables using U.S. Gazetteer files provided by the United States Census Bureau. These files provide listings of geographic areas for selected area types. You can find the Gazetteer files for 2017 and 2018 in the data directory under the gazetteer folder. These directories contain data for:
- congressional districts
- core statistical areas
- counties
- county subdivisions
- schools
- census tracts
- urban areas
- zip codes
- places of interest
You will combine the data from 2017 and 2018 (17 files from each year) and create tables named after the source file (e.g., places.csv is saved in the places table).
1. Gazetteer Data
a. Create Unmanaged Tables
The first step of this assignment involves loading the data from the CSV files, combining each file with the corresponding file from the other year, and saving the result to disk as a table. The following code should provide a template to help you combine tables and save them to the warehouse directory.
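A minimal sketch of this step is shown below. The input paths, the tab delimiter, the warehouse directory, and the added year column are assumptions; adjust them to match your downloaded files and directory layout.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("gazetteer-warehouse").getOrCreate()

def combine_and_save(name):
    """Load one Gazetteer file for each year, tag each row with its year,
    union the two years, and write the result to the warehouse directory."""
    # Paths and the tab delimiter are assumptions; adjust to your layout.
    df_2017 = (spark.read.option("header", "true").option("sep", "\t")
               .csv(f"gazetteer/2017/{name}.csv")
               .withColumn("year", lit(2017)))
    df_2018 = (spark.read.option("header", "true").option("sep", "\t")
               .csv(f"gazetteer/2018/{name}.csv")
               .withColumn("year", lit(2018)))
    combined = df_2017.unionByName(df_2018)
    combined.write.mode("overwrite").parquet(f"warehouse/{name}")

combine_and_save("places")
```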
For each CSV file in the 2017 and 2018 directories, load the data into Spark, combine it with the corresponding data from the other year and save it to disk. Once you have finished saving all of the files as tables, verify that you have loaded the files properly by loading the tables into Spark, and performing a simple row count on each table.
The following Python code should provide you with a template for loading the tables as external tables in Spark.
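A sketch of registering one saved dataset as an unmanaged (external) table is shown below; the table name and the warehouse location follow the naming convention used in the sketch above and are assumptions.

```python
# Register the Parquet files written above as an unmanaged (external) table.
# The LOCATION path is an assumption based on the warehouse directory used earlier.
spark.sql("""
    CREATE TABLE IF NOT EXISTS places
    USING PARQUET
    LOCATION 'warehouse/places'
""")
```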
As stated previously, in a typical Hadoop distribution, you could save these tables as persistent tables in Apache Hive, but since we are not introducing Hive in this class, we need to load these tables into Spark and query them using SQL within Python.
The following code shows how to count the number of rows in the places table and show the results.
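For example, a row count on the places table might look like the following (the table name assumes the convention used above).

```python
# Count the rows in the places table and display the result.
spark.sql("SELECT COUNT(*) AS row_count FROM places").show()
```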
As an aside, the spark.catalog module offers useful utility functions such as spark.catalog.listTables(), which lists all of the currently available tables. These functions are useful for inspecting the Spark SQL warehouse.
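For instance:

```python
# Inspect the tables currently registered in the Spark SQL catalog.
for table in spark.catalog.listTables():
    print(table.name, table.tableType)
```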
b. Load and Query Tables
Now that we have saved the data to external tables, we will load the tables back into Spark and create a report using Spark SQL. The report covers school districts for the states of Nebraska and Iowa, built from the elementary_schools, secondary_schools, and unified_school_districts tables. Using Spark SQL, create a report with the following information.
The report contains the number of elementary, secondary, and unified school districts in each state for each year. Note that the numbers in the example are notional and do not represent the actual results.
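One possible shape for this query is sketched below. The USPS state column and the year column added during the combine step are assumptions, as are the exact schemas of the three tables; adjust the column names to match your data.

```python
# Count school districts by state and year for Nebraska and Iowa.
report = spark.sql("""
    SELECT state,
           year,
           SUM(CASE WHEN kind = 'elementary' THEN 1 ELSE 0 END) AS elementary,
           SUM(CASE WHEN kind = 'secondary'  THEN 1 ELSE 0 END) AS secondary,
           SUM(CASE WHEN kind = 'unified'    THEN 1 ELSE 0 END) AS unified
    FROM (
        SELECT USPS AS state, year, 'elementary' AS kind FROM elementary_schools
        UNION ALL
        SELECT USPS AS state, year, 'secondary' AS kind FROM secondary_schools
        UNION ALL
        SELECT USPS AS state, year, 'unified' AS kind FROM unified_school_districts
    ) AS districts
    WHERE state IN ('NE', 'IA')
    GROUP BY state, year
    ORDER BY state, year
""")
report.show()
```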
2. Flight Data
In the previous exercise, you joined data from flights and airport codes to create a report. Create external tables for airport_codes and domestic_flights from the domestic-flights/flights.parquet and airport-codes/airport-codes.csv files. Recreate the report of the top ten airports for 2008 using Spark SQL instead of DataFrames.
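A sketch of one way to set this up is shown below. The file locations come from the exercise description, but the join keys and column names (iata_code, the airport code on the flight records, passengers, flight_year) are assumptions and must be adapted to the actual schemas and to whatever metric your earlier report used.

```python
# Register the flight and airport-code files as unmanaged tables.
spark.sql("""
    CREATE TABLE IF NOT EXISTS domestic_flights
    USING PARQUET
    LOCATION 'domestic-flights/flights.parquet'
""")
spark.sql("""
    CREATE TABLE IF NOT EXISTS airport_codes
    USING CSV
    OPTIONS (header 'true')
    LOCATION 'airport-codes/airport-codes.csv'
""")

# Top ten airports for 2008; join keys and column names are assumptions.
top_ten = spark.sql("""
    SELECT a.name AS airport,
           SUM(f.passengers) AS total_passengers
    FROM domestic_flights f
    JOIN airport_codes a
      ON f.destination_airport_code = a.iata_code
    WHERE f.flight_year = 2008
    GROUP BY a.name
    ORDER BY total_passengers DESC
    LIMIT 10
""")
top_ten.show()
```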
Provide the results via your notebook or export your code for each exercise.