Background
Amazing Prime loves the dataset and wants to keep it updated on a daily basis. Britta needs your help to create an automated pipeline that takes in new data, performs the appropriate transformations, and loads the data into existing tables. You’ll need to refactor the code from this module to create one function that takes in the three files—Wikipedia data, Kaggle metadata, and the MovieLens rating data—and performs the ETL process by adding the data to a PostgreSQL database.
What You're Creating
This new assignment consists of four technical analysis deliverables. You will submit the following:
- Deliverable 1: Write an ETL Function to Read Three Data Files
- Deliverable 2: Extract and Transform the Wikipedia Data
- Deliverable 3: Extract and Transform the Kaggle data
- Deliverable 4: Create the Movie Database
Files
Use the following links to download the Challenge starter code files:
- ETL Deliverable 1 starter code
- ETL Deliverable 2 starter code
- ETL Deliverable 3 starter code
Deliverable 1: Write an ETL Function to Read Three Data Files (25 points)
Deliverable 1 Instructions
Using your knowledge of Python, Pandas, the ETL process, and code refactoring, write a function that reads in the three data files and creates three separate DataFrames.
REWIND: For this deliverable, you've already done the following in this module:
Download the `ETL_Deliverable1_starter_code.ipynb` file, add it to your Movies-ETL GitHub folder, and rename the file `ETL_function_test.ipynb`. Follow the instructions below to refactor the code from this module as indicated by the numbered comments in the starter code file.
- In Step 1, create a function to read in the three files and give it a name (a minimal sketch of such a function appears after this list).
NOTE: You do not need to pass any variables inside this function yet. In Step 7, you'll use the provided code to set the three variables for the files you'll use equal to the function created in Step 1. This is done in order to reassign the variable names in Step 8, which will allow you to display each DataFrame in Steps 9-11.
- In Step 2, read in the Kaggle metadata and MovieLens ratings CSV files as Pandas DataFrames.
- In Step 3, open the Wikipedia JSON file and use the `json.load()` function to convert the JSON data to raw data.
- In Step 4, read in the raw Wikipedia movie data as a Pandas DataFrame.
- In Step 5, use the code provided to return the three DataFrames.
- In Step 6, use the variables provided to create a path to the Wikipedia data, the Kaggle metadata, and the MovieLens rating data files.
- In Step 7, set the three variables in Step 6 equal to the function created in Step 1.
- In Step 8, set the DataFrames from the return statement equal to the file names in Step 6. In this step, you are reassigning the variables created in Step 6 to the variables in the return statement.
- In Steps 9-11, check that all three files are converted to DataFrames. See the images below for confirmation:
  - The `wiki_movies_df` DataFrame
  - The `kaggle_metadata` DataFrame
- After you confirm that all three DataFrames are correct, save the `ETL_function_test.ipynb` file in your Movies-ETL GitHub folder.
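As a rough illustration of what the finished function might look like: the function name, the `Resources/` paths, and the condensing of Steps 7-8 into a single tuple assignment below are assumptions for the sketch, not the starter code itself.

```python
# Hypothetical sketch of the Deliverable 1 function; names and paths are
# illustrative assumptions, and Steps 7-8 are condensed into one assignment.
import json
import pandas as pd

def extract_transform_load(wiki_file, kaggle_file, ratings_file):
    # Step 2: read the Kaggle metadata and MovieLens ratings CSVs.
    kaggle_metadata = pd.read_csv(kaggle_file, low_memory=False)
    ratings = pd.read_csv(ratings_file)
    # Step 3: open the Wikipedia JSON file and load it as raw data.
    with open(wiki_file, mode='r') as file:
        wiki_movies_raw = json.load(file)
    # Step 4: read the raw Wikipedia movie data into a DataFrame.
    wiki_movies_df = pd.DataFrame(wiki_movies_raw)
    # Step 5: return the three DataFrames.
    return wiki_movies_df, kaggle_metadata, ratings

# Step 6: assumed local paths to the three data files.
file_dir = 'Resources/'
wiki_file = f'{file_dir}wikipedia.movies.json'
kaggle_file = f'{file_dir}movies_metadata.csv'
ratings_file = f'{file_dir}ratings.csv'

# Steps 7-8, condensed: call the function and reassign the results.
wiki_movies_df, kaggle_metadata, ratings = extract_transform_load(
    wiki_file, kaggle_file, ratings_file)
```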
Deliverable 1 Requirements
You will earn a perfect score for Deliverable 1 by completing all requirements below:
- An ETL function is written to read in the three data files. (10 pt)
- The function converts the Wikipedia JSON file to a Pandas DataFrame, and the DataFrame is displayed in the `ETL_function_test.ipynb` file. (5 pt)
- The function converts the Kaggle metadata file to a Pandas DataFrame, and the DataFrame is displayed in the `ETL_function_test.ipynb` file. (5 pt)
- The function converts the MovieLens ratings data file to a Pandas DataFrame, and the DataFrame is displayed in the `ETL_function_test.ipynb` file. (5 pt)
Deliverable 2: Extract and Transform the Wikipedia Data (30 points)
Deliverable 2 Instructions
Using your knowledge of Python, Pandas, the ETL process, and code refactoring, extract and transform the Wikipedia data so you can merge it with the Kaggle metadata. While extracting the IMDb IDs using a regular expression string and dropping duplicates, use a `try-except` block to catch errors.
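A minimal sketch of that pattern, assuming `wiki_movies_df` already exists with an `imdb_link` column and that IMDb IDs follow the "tt" plus seven digits convention used in the module:

```python
# Illustrative try-except block; the column names and regex are assumptions
# based on the module's conventions.
try:
    # Pull the IMDb ID (e.g., tt1234567) out of each imdb_link value;
    # expand=False keeps the result a Series rather than a DataFrame.
    wiki_movies_df['imdb_id'] = wiki_movies_df['imdb_link'].str.extract(
        r'(tt\d{7})', expand=False)
    # Drop rows that duplicate an already-seen IMDb ID.
    wiki_movies_df.drop_duplicates(subset='imdb_id', inplace=True)
except Exception as e:
    # Capture and print the exception rather than halting the pipeline.
    print(e)
```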
REWIND: For this deliverable, you've already done the following in this module:
Download the `ETL_Deliverable2_starter_code.ipynb` file, add it to your Movies-ETL GitHub folder, and rename the file `ETL_clean_wiki_movies.ipynb`. Follow the instructions below to refactor the code from this module as indicated by the numbered comments in the starter code file.
- In Step 1, add the code from this module for the clean movie function that takes in the argument `movie`.
- In Step 2, add the function you created in Deliverable 1 that reads in the three data files.
- In Step 3, inside the function you created in Deliverable 1, remove the code that creates the `wiki_movies_df` DataFrame from the `wiki_movies_raw` file, then write a list comprehension that filters out TV shows from the `wiki_movies_raw` file.
- In Step 4, write a list comprehension to iterate through the cleaned wiki movies list that you created in Step 3.
- In Step 5, read in the cleaned movies list from Step 4 as a DataFrame.
- In Step 6, write a `try-except` block that will catch errors while extracting the IMDb IDs with a regular expression string and dropping any `imdb_id` duplicates (sketched above, after the deliverable introduction). If there is an error, capture and print the exception.
- In Step 7, write a list comprehension to keep the columns that have non-null values from the DataFrame created in Step 5, then create a `wiki_movies_df` DataFrame from the list.
- In Step 8, create a variable that will hold all the non-null values from the "Box office" column.
- In Step 9, convert the box office data created in Step 8 to string values using the lambda and join functions.
- In Step 10, write a regular expression to match the six elements of `form_one` of the box office data (see the regex sketch after this list).
- In Step 11, write a regular expression to match the three elements of `form_two` of the box office data.
- In Step 12, add the `parse_dollars()` function.
- In Step 13, add the code that cleans the box office column in the `wiki_movies_df` DataFrame using the `form_one` and `form_two` lists created in Steps 10 and 11, respectively.
- In Step 14, add code that cleans the budget column in the `wiki_movies_df` DataFrame.
- In Step 15, add code that cleans the release date column in the `wiki_movies_df` DataFrame.
- In Step 16, add code that cleans the running time column in the `wiki_movies_df` DataFrame.
- In Step 17, use the variables provided to create a path to the Wikipedia data, the Kaggle metadata, and the MovieLens rating data files.
- In Step 18, set the three variables in Step 17 equal to the function created in Deliverable 1.
- In Step 19, set the `wiki_movies_df` equal to the `wiki_file` variable.
- In Step 20, check that your `wiki_movies_df` DataFrame looks like this image:
- In Step 21, add the columns from the `wiki_movies_df` DataFrame to a list, and confirm that they are the same as this image:
- After you confirm that the `wiki_movies_df` DataFrame is correct, save the `ETL_clean_wiki_movies.ipynb` file in your Movies-ETL GitHub folder.
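To ground Steps 8-11, here is a hedged sketch. The "Box office" column name follows the module, but the exact regex strings are illustrative assumptions; your module code may differ in details such as negative lookaheads for mixed forms.

```python
import re

# Steps 8-9: keep non-null box office values, then collapse any list
# entries into single strings with a lambda and join.
box_office = wiki_movies_df['Box office'].dropna()
box_office = box_office.apply(lambda x: ' '.join(x) if type(x) == list else x)

# Step 10 (illustrative): "$123.4 million"-style values -- a dollar sign,
# optional space, digits, an optional decimal part, optional space, and
# "million"/"billion" (the optional "i" tolerates the "millon" misspelling).
form_one = r'\$\s*\d+\.?\d*\s*[mb]illi?on'

# Step 11 (illustrative): "$123,456,789"-style values -- a dollar sign, a
# leading group of up to three digits, then comma-separated triples.
form_two = r'\$\s*\d{1,3}(?:,\d{3})+'

# Quick sanity check of both forms against sample strings.
print(bool(re.match(form_one, '$123.4 million', flags=re.IGNORECASE)))  # True
print(bool(re.match(form_two, '$123,456,789')))                         # True
```

Step 13 then extracts whichever form matches, along the lines of `box_office.str.extract(f'({form_one}|{form_two})', flags=re.IGNORECASE)`, and hands each match to `parse_dollars()` for conversion to a float.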
Deliverable 2 Requirements
You will earn a perfect score for Deliverable 2 by completing all requirements below:
- The TV shows are filtered out, and the `wiki_movies_df` DataFrame is created. (3 pt)
- A `try-except` block is used to catch errors while extracting the IMDb IDs with a regular expression and dropping duplicate IDs. (5 pt)
- The extraction and transformation of the Wikipedia data in the ETL function does the following:
  - A list comprehension is used to keep columns with non-null values. (3 pt)
  - The non-null box office data is converted to string values using the lambda and join functions. (3 pt)
  - A regular expression is used to match the six elements of `form_one` of the box office data. (2 pt)
  - A regular expression is used to match the three elements of `form_two` of the box office data. (2 pt)
  - The following columns are cleaned in the Wikipedia DataFrame: (8 pt)
    - The box office column
    - The budget column
    - The release date column
    - The running time column
- The cleaned Wikipedia data is converted to a Pandas DataFrame, and the DataFrame is displayed in the `ETL_clean_wiki_movies.ipynb` file. (4 pt)
Deliverable 3: Extract and Transform the Kaggle Data (30 points)
Deliverable 3 Instructions
Using your knowledge of Python, Pandas, the ETL process, and code refactoring, extract and transform the Kaggle metadata and MovieLens rating data, then convert the transformed data into separate DataFrames. Then, you'll merge the Kaggle metadata DataFrame with the Wikipedia movies DataFrame to create the `movies_df` DataFrame. Finally, you'll merge the MovieLens rating data DataFrame with the `movies_df` DataFrame to create the `movies_with_ratings_df` DataFrame.
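For the ratings portion of this deliverable (Step 9 below), a hedged sketch of the transform-and-merge logic follows. The `kaggle_id` merge key and the `rating_` column prefix are conventions drawn from the module, so adjust them if your code differs.

```python
import pandas as pd

# Count how many times each movie received each rating value, then pivot
# so each rating level becomes its own column.
rating_counts = (ratings.groupby(['movieId', 'rating'], as_index=False).count()
                        .rename({'userId': 'count'}, axis=1)
                        .pivot(index='movieId', columns='rating', values='count'))
rating_counts.columns = ['rating_' + str(col) for col in rating_counts.columns]

# Left-merge the counts onto movies_df, then fill movies that have no
# ratings data with 0 so the count columns stay numeric.
movies_with_ratings_df = pd.merge(movies_df, rating_counts,
                                  left_on='kaggle_id', right_index=True,
                                  how='left')
movies_with_ratings_df[rating_counts.columns] = (
    movies_with_ratings_df[rating_counts.columns].fillna(0))
```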
REWIND: For this deliverable, you've already done the following in this module:
Download the `ETL_Deliverable3_starter_code.ipynb` file, add it to your Movies-ETL GitHub folder, and rename the file `ETL_clean_kaggle_data.ipynb`. Follow the instructions below to refactor the code from this module as indicated by the numbered comments in the starter code file.
- In Step 1, add the function you created in Deliverable 1 that reads in the three data files and creates the `kaggle_metadata` and `ratings` DataFrames.
- Before Step 2, add all the code you wrote for Deliverable 2.
- In Step 2, below the code that cleans the running time column in the `wiki_movies_df` DataFrame from Deliverable 2, add the code that cleans the Kaggle metadata.
- In Step 3, merge the `wiki_movies_df` and `kaggle_metadata` DataFrames, then name the new DataFrame `movies_df` (see the sketch after this list).
- In Step 4, drop unnecessary columns from the `movies_df` DataFrame.
- In Step 5, add the `fill_missing_kaggle_data()` function that fills in the missing Kaggle data on the `movies_df` DataFrame.
- In Step 6, call the `fill_missing_kaggle_data()` function with the `movies_df` DataFrame and the Kaggle and Wikipedia columns to be cleaned as the arguments.
- In Step 7, filter the `movies_df` DataFrame to keep the necessary columns.
- In Step 8, rename the columns in the `movies_df` DataFrame.
- In Step 9, transform and merge the ratings DataFrame with the `movies_df` DataFrame, name the new DataFrame `movies_with_ratings_df`, then clean the `movies_with_ratings_df` DataFrame (see the sketch above, after the deliverable introduction).
- In Step 10, use the variables provided to create a path to the Wikipedia data, the Kaggle metadata, and the MovieLens rating data files.
- In Step 11, set the three variables from Step 17 of Deliverable 2 equal to the function created in Deliverable 1.
- In Step 12, set the DataFrames from the return statement after Step 9 equal to the file names in Step 11.
- In Step 13, check that your `wiki_movies_df` DataFrame is the same as in Deliverable 2.
- In Step 14, check that your `movies_with_ratings_df` DataFrame looks like this image:
- In Step 15, check that your `movies_df` DataFrame looks like this image:
- After you confirm that all three DataFrames are correct, save the `ETL_clean_kaggle_data.ipynb` file in your Movies-ETL GitHub folder.
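As a rough sketch of Steps 3, 5, and 6: the merge key, the `_wiki`/`_kaggle` suffixes, and the column pairs below are assumptions drawn from the module's conventions, not requirements of the starter code.

```python
import pandas as pd

# Step 3 (illustrative): inner-merge the two sources on IMDb ID; the
# suffixes keep overlapping column names from colliding.
movies_df = pd.merge(wiki_movies_df, kaggle_metadata,
                     on='imdb_id', suffixes=['_wiki', '_kaggle'])

# Step 5 (illustrative): keep the Kaggle value, but fall back to the
# Wikipedia value wherever the Kaggle column holds a zero.
def fill_missing_kaggle_data(df, kaggle_column, wiki_column):
    df[kaggle_column] = df.apply(
        lambda row: row[wiki_column] if row[kaggle_column] == 0
        else row[kaggle_column],
        axis=1)
    # The Wikipedia column is redundant once its values are folded in.
    df.drop(columns=wiki_column, inplace=True)

# Step 6 (illustrative): hypothetical column pairs to reconcile.
fill_missing_kaggle_data(movies_df, 'runtime', 'running_time')
fill_missing_kaggle_data(movies_df, 'budget_kaggle', 'budget_wiki')
fill_missing_kaggle_data(movies_df, 'revenue', 'box_office')
```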
Deliverable 3 Requirements
You will earn a perfect score for Deliverable 3 by completing all requirements below:
- The extraction and transformation of the Kaggle metadata using the ETL function does the following:
  - The Kaggle metadata is cleaned. (4 pt)
  - The Wikipedia and Kaggle DataFrames are merged. (3 pt)
  - The following is performed on the merged Wikipedia and Kaggle DataFrames to create the `movies_df` DataFrame: (8 pt)
    - Unnecessary columns are dropped.
    - A function is used to fill in the missing Kaggle data.
    - The `movies_df` DataFrame is filtered to keep specific columns.
    - The `movies_df` DataFrame columns are renamed.
- The extraction and transformation of the MovieLens ratings data using the ETL function does the following:
  - The ratings counts are cleaned. (3 pt)
  - The `movies_df` DataFrame is merged with the cleaned ratings DataFrame to create the `movies_with_ratings_df` DataFrame. (4 pt)
  - The empty values in the `movies_with_ratings_df` DataFrame are filled with "0". (3 pt)
- The `movies_with_ratings_df` and the `movies_df` DataFrames are displayed in the `ETL_clean_kaggle_data.ipynb` file. (5 pt)
Deliverable 4: Create the Movie Database (15 points)
Deliverable 4 Instructions
Use your knowledge of Python, Pandas, the ETL process, code refactoring, and PostgreSQL to add the `movies_df` DataFrame and MovieLens rating CSV data to a SQL database.
REWIND: For this deliverable, you've already done the following in this module:
Lesson 8.5.1: Create and connect to the database, then import data
Make a copy of the `ETL_clean_kaggle_data.ipynb` file in the Movies-ETL GitHub folder, and rename the file `ETL_create_database.ipynb`. Follow the instructions below to add the `movies_df` DataFrame and MovieLens rating CSV data to a SQL database.
- In the first cell, uncomment `# from config import db_password` so that this import works.
- Remove the return statement, `return wiki_movies_df, movies_with_ratings_df, movies_df`.
- After Step 9, "Transform and merge the ratings DataFrame," add the code to create the connection to the PostgreSQL database, then add the `movies_df` DataFrame to a SQL database (see the sketch after this list). Hint: Use `'replace'` for the `if_exists` parameter so that the `movies_df` DataFrame data won't be added to the table again.
- Before reading in the MovieLens rating CSV data, drop the `ratings` table in pgAdmin.
- Add the code that prints out the elapsed time to import each row.
- Refactor Step 11 of Deliverable 3 so that you pass the variables for the files created in Step 10 of Deliverable 3 into the function created in Deliverable 1.
- Run the program.
- After the program has finished, run a query on the PostgreSQL database that retrieves the number of rows for the `movies` and `ratings` tables.
- After you confirm that the `movies` table has 6,052 rows and the `ratings` table has 26,024,289 rows, take a screenshot of each query and the output, then save them as `movies_query.png` and `ratings_query.png`, respectively.
- Save the `ETL_create_database.ipynb` file in your Movies-ETL GitHub folder.
- Save the `movies_query.png` and `ratings_query.png` files in the Resources folder.
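A minimal sketch of the database-loading code, assuming a local PostgreSQL server (with a driver such as psycopg2 installed), a database named `movie_data`, and existing `movies_df` and `file_dir` variables; all of these are assumptions to adjust to your own setup.

```python
import time
import pandas as pd
from sqlalchemy import create_engine
from config import db_password  # assumes config.py defines db_password

# Connection string: the user, host, port, and database name are assumptions.
db_string = f'postgresql://postgres:{db_password}@127.0.0.1:5432/movie_data'
engine = create_engine(db_string)

# 'replace' drops and recreates the movies table, so rerunning the
# pipeline won't append duplicate rows.
movies_df.to_sql(name='movies', con=engine, if_exists='replace')

# Import the ~26M-row ratings CSV in chunks (drop the old ratings table in
# pgAdmin first), printing the elapsed time as each chunk is appended.
rows_imported = 0
start_time = time.time()
for data in pd.read_csv(f'{file_dir}ratings.csv', chunksize=1_000_000):
    data.to_sql(name='ratings', con=engine, if_exists='append')
    rows_imported += len(data)
    print(f'{rows_imported:,} rows imported; '
          f'{time.time() - start_time:.1f} total seconds elapsed')
```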
Deliverable 4 Requirements
You will earn a perfect score for Deliverable 4 by completing all requirements below:
- The data from the `movies_df` DataFrame replaces the current data in the movies table in the SQL database, as determined by the `movies_query.png`. (5 pt)
- The data from the MovieLens rating CSV file is added to the `ratings` table in the SQL database, as determined by the `ratings_query.png`. (5 pt)
- The elapsed time to add the data to the database is displayed in the `ETL_create_database.ipynb` file. (5 pt)
Submission
Once you’re ready to submit, make sure to check your work against the rubric to ensure you are meeting the requirements for this Challenge one final time. It’s easy to overlook items when you’re in the zone!
As a reminder, the deliverables for this Challenge are as follows:
- Deliverable 1: Write an ETL function to read three data files
- Deliverable 2: Extract and Transform the Wikipedia Data
- Deliverable 3: Extract and Transform the Kaggle Data
- Deliverable 4: Create the Movie Database
IMPORTANT: Don't clear the output of your Jupyter Notebook files. Doing so will result in a lower score.
Upload the following to your Movies-ETL GitHub repository:
- The `ETL_function_test.ipynb` file
- The `ETL_clean_wiki_movies.ipynb` file
- The `ETL_clean_kaggle_data.ipynb` file
- The `ETL_create_database.ipynb` file
- The Resources folder with the `wikipedia.movies.json`, `movies_metadata.csv`, `movies_query.png`, and `ratings_query.png` files
- A README.md that describes the purpose of the repository. Although there is no graded written analysis for this Challenge, it is encouraged and good practice to add a brief description of your project.
To submit your Challenge assignment in Canvas, click Submit, then provide the URL of your Movies-ETL GitHub repository for grading. Comments are disabled for graded submissions in BootCampSpot. If you have questions about your feedback, please notify your instructional staff or the Student Success Manager. If you would like to resubmit your work for an improved grade, you can use the Re-Submit Assignment button to upload new links. You may resubmit up to 3 times for a total of 4 submissions.
IMPORTANT: Once you receive feedback on your Challenge, make any suggested updates or adjustments to your work. Then, add this week's Challenge to your professional portfolio.
NOTE: You are allowed to miss up to two Challenge assignments and still earn your certificate. If you complete all Challenge assignments, your lowest two grades will be dropped. If you wish to skip this assignment, click Submit, then indicate you are skipping by typing "I choose to skip this assignment" in the text box.
Rubric
Module 8 Rubric

Deliverable 1: Write an ETL function to read three data files (25 pts)
- Mastery (25 to >23.0 pts): The ETL function does the following: ✓ The three data files are passed into the function. ✓ All three data sets are converted to DataFrames, and the DataFrames are correct and displayed.
- Approaching Mastery (23 to >19.0 pts): The ETL function does the following: ✓ The three data files are passed into the function. ✓ The Wikipedia JSON file is converted to a DataFrame, and the DataFrame is displayed. ✓ The Kaggle metadata and MovieLens ratings data are converted to DataFrames, but the wrong DataFrames are displayed.
- Progressing (19 to >16.0 pts): The ETL function does the following: ✓ The three data files are passed into the function. ✓ The Wikipedia JSON file is converted to a DataFrame, but it is not displayed. ✓ The Kaggle metadata and MovieLens ratings data are converted to DataFrames, but the DataFrames are incorrect.
- Emerging (16 to >0.0 pts): The ETL function does the following: ✓ The three data files are passed into the function. ✓ The Wikipedia JSON file is ONLY converted to a raw data file. ✓ The Kaggle metadata and MovieLens ratings data are converted to DataFrames, but they are not displayed.
- Incomplete (0 pts)

Deliverable 2: Extract and Transform the Wikipedia Data (30 pts)
- Mastery (30 to >27.0 pts): ✓ TV shows are filtered out, and the wiki_movies DataFrame is created. ✓ A try-except block is used successfully. ✓ All of the tasks for the extraction & transformation of the Wikipedia data are completed. ✓ The cleaned Wikipedia data is converted to a DataFrame, and the DataFrame is displayed.
- Approaching Mastery (27 to >25.0 pts): ✓ TV shows are filtered out, and the wiki_movies DataFrame is created. ✓ A try-except block is used successfully. During the extraction & transformation of the Wikipedia data, the following are done: ✓ Columns with null values are dropped. ✓ The non-null box office data is converted to string values. ✓ Regular expression codes for "form_one" and "form_two" of the box office data are correct. ✓ THREE of the FOUR columns are cleaned. ✓ Wikipedia data is not cleaned but is converted to a DataFrame and displayed.
- Progressing (25 to >20.0 pts): ✓ TV shows are filtered out, and the wiki_movies DataFrame is created. ✓ A try-except block is used successfully. During the extraction & transformation of the Wikipedia data, the following are done: ✓ Columns with null values are dropped. ✓ The non-null box office data is converted to string values. ✓ Regular expression codes for "form_one" and "form_two" of the box office data are correct. ✓ TWO of the FOUR columns are cleaned. ✓ Wikipedia data is not cleaned but is converted to a DataFrame and displayed.
- Emerging (20 to >0.0 pts): ✓ TV shows are filtered out, and the wiki_movies DataFrame is created. ✓ A try-except block is written but doesn't catch errors. During the extraction & transformation of the Wikipedia data, the following are done: ✓ Columns with null values are dropped. ✓ The non-null box office data is converted to string values. ✓ Regular expression codes for "form_one" and "form_two" of the box office data are correct. ✓ ONE of the FOUR columns is cleaned. ✓ Wikipedia data is not cleaned but is converted to a DataFrame and displayed.
- Incomplete (0 pts)

Deliverable 3: Extract and Transform the Kaggle Data (30 pts)
- Mastery (30 to >27.0 pts): During the extraction & transformation of the Kaggle metadata, the following are done: ✓ The metadata is cleaned. ✓ The Wikipedia and Kaggle DataFrames are merged. ✓ The "movies" DataFrame is created, and all FOUR tasks are performed. ✓ All THREE tasks are completed during the extraction & transformation of the MovieLens rating data. ✓ The Kaggle and ratings DataFrames are correct and displayed.
- Approaching Mastery (27 to >22.0 pts): During the extraction & transformation of the Kaggle metadata, the following are done: ✓ The metadata is cleaned. ✓ The Wikipedia and Kaggle DataFrames are merged. ✓ The "movies" DataFrame is created, but only THREE of the FOUR tasks are performed. ✓ All THREE tasks are completed during the extraction & transformation of the MovieLens rating data. ✓ The Kaggle and MovieLens rating DataFrames are displayed, but the "movies" DataFrame is incorrect.
- Progressing (22 to >16.0 pts): During the extraction & transformation of the Kaggle metadata, the following are done: ✓ The metadata is cleaned. ✓ The Wikipedia and Kaggle DataFrames are merged. ✓ The "movies" DataFrame is created, but only TWO of the FOUR tasks are performed. During the extraction & transformation of the MovieLens rating data, the following are done: ✓ The ratings counts are cleaned. ✓ The two DataFrames are merged. ✓ There is an attempt to fill the empty values with "0". ✓ The Kaggle and MovieLens rating DataFrames are displayed but incorrect.
- Emerging (16 to >0.0 pts): During the extraction & transformation of the Kaggle metadata, the following are done: ✓ The metadata is cleaned. ✓ The Wikipedia and Kaggle DataFrames are merged, but there is an error. ✓ The "movies" DataFrame is created, but only ONE of the FOUR tasks is performed. During the extraction & transformation of the MovieLens rating data, the following are done: ✓ The ratings counts are cleaned. ✓ The two DataFrames are merged, but there is an error. ✓ There is an attempt to fill the empty values with "0". ✓ The Kaggle and MovieLens rating DataFrames are displayed but incorrect.
- Incomplete (0 pts)

Deliverable 4: Create the Movie Database (15 pts)
- Mastery (15 to >14.0 pts): ✓ The data in the movies table in the SQL database is replaced. ✓ The ratings table is dropped, and the MovieLens rating CSV file is added to the SQL ratings table. ✓ The elapsed time to add the data to the database is displayed.
- Approaching Mastery (14 to >11.0 pts): ✓ The data in the movies table in the SQL database is replaced. ✓ The ratings table is dropped, but not all of the MovieLens rating CSV file is added to the ratings table. ✓ The elapsed time to add the data to the database is displayed.
- Progressing (11 to >8.0 pts): ✓ The data in the movies table in the SQL database is not replaced. ✓ The ratings table is not dropped, but the MovieLens rating CSV file is added to the ratings table. ✓ Not all of the elapsed time to add the data to the database is displayed.
- Emerging (8 to >0.0 pts): ✓ There is an error adding the movies table to the database. ✓ The ratings table is not dropped, and there is an error adding the MovieLens rating CSV data. ✓ The elapsed time to add the data to the database is partially displayed with an error.
- Incomplete (0 pts)

Total points: 100