Background Amazing Prime loves the dataset and wants to keep it updated on a daily basis. Britta needs your help to create an automated pipeline that takes in new data, performs the appropriate...

1 answer below »

Background


Amazing Prime loves the dataset and wants to keep it updated on a daily basis. Britta needs your help to create an automated pipeline that takes in new data, performs the appropriate transformations, and loads the data into existing tables. You’ll need to refactor the code from this module to create one function that takes in the three files—Wikipedia data, Kaggle metadata, and the MovieLens rating data—and performs the ETL process by adding the data to a PostgreSQL database.


What You're Creating


This new assignment consists of four technical analysis deliverables. You will submit the following:



  • Deliverable 1: Write an ETL Function to Read Three Data Files

  • Deliverable 2: Extract and Transform the Wikipedia Data

  • Deliverable 3: Extract and Transform the Kaggle data

  • Deliverable 4: Create the Movie Database


Files


Use the following links to download the Challenge starter codes.



ETL Deliverable 1 starter code(Links to an external site.)



ETL Deliverable 2 starter code(Links to an external site.)



ETL Deliverable 3 starter code(Links to an external site.)


Deliverable 1: Write an ETL Function to Read Three Data Files (25 points)


Deliverable 1 Instructions


Using your knowledge of Python, Pandas, the ETL process, and code refactoring, write a function that reads in the three data files and creates three separate DataFrames.


REWIND

For this deliverable, you’ve already done the following in this module:



Download theETL_Deliverable1_starter_code.ipynbfile, add it to your Movies-ETL GitHub folder, and rename the fileETL_function_test.ipynb. Follow the instructions below to refactor the code from this module as indicated by the numbered comments in the starter code file.



  1. In Step 1, create a function to read in the three files and give it a name.


NOTE

You do not need to pass any variables inside this function yet. In Step 7, you'll use the provided code to set the three variables for the files you'll use equal to the function created in Step 1. This is done in order to reassign the variable names in Step 8, which will allow you to display each DataFrame in steps 9-11.



  1. In Step 2, read in the Kaggle metadata and MovieLens ratings CSV files as Pandas DataFrames.

  2. In Step 3, open the Wikipedia JSON file and use thejson.load()function to convert the JSON data to raw data.

  3. In Step 4, read in the raw Wikipedia movie data as a Pandas DataFrame.

  4. In Step 5, use the code provided to return the three DataFrames.

  5. In Step 6, use the variables provided to create a path to the Wikipedia data, the Kaggle metadata, and the MovieLens rating data files.

  6. In Step 7, set the three variables in Step 6 equal to the function created in Step 1.

  7. In Step 8, set the DataFrames from the return statement equal to the file names in Step 6. In this step, you are reassigning the variables created in Step 6 to the variables in the return statement.

  8. In Steps 9-11, check that all three files are converted to a DataFrame. See the images below for confirmation:



  • Thewiki_movies_dfDataFrame



 The first five rows of the wiki_movies_df DataFrame.



  • Thekaggle_metadataDataFrame



 The first five rows of the kaggle_metadata DataFrame.



  • TheratingsDataFrame



The ratings DataFrame.



  1. After you confirm that all three DataFrames are correct, save theETL_function_test.ipynbfile in your Movies-ETL GitHub folder.


Deliverable 1 Requirements


You will earn a perfect score for Deliverable 1 by completing all requirements below:



  • An ETL function is written to read in the three data files.(10 pt)

  • The function converts the Wikipedia JSON file to a Pandas DataFrame, and the DataFrame is displayed in theETL_function_test.ipynbfile.(5 pt)

  • ​The function converts the Kaggle metadata file to a Pandas DataFrame, and the DataFrame is displayed in theETL_function_test.ipynbfile.(5 pt)

  • ​The function converts the MovieLens ratings data file to a Pandas DataFrame, and the DataFrame is displayed in theETL_function_test.ipynbfile.(5 pt)


Deliverable 2: Extract and Transform the Wikipedia Data (30 points)


Deliverable 2 Instructions


Using your knowledge of Python, Pandas, the ETL process, and code refactoring, extract and transform the Wikipedia data so you can merge it with the Kaggle metadata. While extracting the IMDb IDs using a regular expression string and dropping duplicates, use atry-exceptblock to catch errors.


REWIND

For this deliverable, you’ve already done the following in this module:



Download theETL_Deliverable2_starter_code.ipynbfile, add it to your Movies-ETL GitHub folder, and rename the fileETL_clean_wiki_movies.ipynb. Follow the instructions below to refactor the code from this module as indicated by the numbered comments in the starter code file.



  1. In Step 1, add the code from this module for the clean movie function that takes in the argument "movie".

  2. In Step 2, add the function you created in Deliverable 1 that reads in the three data files.

  3. In Step 3, inside the function you created in Deliverable 1, remove the code that creates thewiki_movies_dfDataFrame from thewiki_movies_rawfile, then write a list comprehension that filters out TV shows from thewiki_movies_rawfile.

  4. In Step 4, write a list comprehension to iterate through the cleaned wiki movies list that you created in Step 3.

  5. In Step 5, read in the cleaned movies list from Step 4 as a DataFrame.

  6. In Step 6, write atry-exceptblock that will catch errors while extracting the IMDb IDs with a regular expression string and dropping anyimdb_idduplicates. If there is an error, capture and print the exception.

  7. In Step 7, write a list comprehension to keep the columns that have non-null values from the DataFrame created in Step 5, then create awiki_movies_dfDataFrame from the list.

  8. In Step 8, create a variable that will hold all the non-null values from the "Box office" column.

  9. In Step 9, convert the box office data created in Step 8 to string values using the lambda and join functions.

  10. In Step 10, write a regular expression to match the six elements ofform_oneof the box office data.

  11. In Step 11, write a regular expression to match the three elements ofform_twoof the box office data.

  12. In Step 12, add theparse_dollars()function.

  13. In Step 13, add the code that cleans the box office column in thewiki_movies_dfDataFrame using theform_oneandform_twolists created in Steps 10 and 11, respectively.

  14. In Step 14, add code that cleans the budget column in thewiki_movies_dfDataFrame.

  15. In Step 15, add code that cleans the release date column in thewiki_movies_dfDataFrame.

  16. In Step 16, add code that cleans the running time column in thewiki_movies_dfDataFrame.

  17. In Step 17, use the variables provided to create a path to the Wikipedia data, the Kaggle metadata, and the MovieLens rating data files.

  18. In Step 18, set the three variables in Step 17 equal to the function created in Deliverable 1.

  19. In Step 19, set thewiki_movies_dfequal to thewiki_filevariable.

  20. In Step 20, check that yourwiki_movies_dfDataFrame looks like this image:



The ratings DataFrame.



  1. In Step 21, add the columns fromwiki_movies_dfDataFrame to a list, and confirm that they are the same as this image:



 The columns of the wiki_movies_df DataFrame.



  1. After you confirm that thewiki_movies_dfDataFrame is correct, save theETL_clean_wiki_movies.ipynbfile in your Movies-ETL GitHub folder.


Deliverable 2 Requirements


You will earn a perfect score for Deliverable 2 by completing all requirements below:



  • The TV shows are filtered out, and thewiki_movies_dfDataFrame is created.(3 pt)

  • Atry-exceptblock is used to catch errors while extracting the IMDb IDs with a regular expression and dropping duplicate IDs.(5 pt)

  • The extraction and transformation of the Wikipedia data in the ETL function does the following:

    • A list comprehension is used to keep columns with non-null values.(3 pt)

    • The non-null box office data is converted to string values using the lambda and join functions.(3 pt)

    • A regular expression is used to match the six elements of "form_one" of the box office data.(2 pt)

    • A regular expression is used to match the three elements of "form_two" of the box office data.(2 pt)

    • The following columns are cleaned in the Wikipedia DataFrame:(8 pt)

      • The box office column

      • The budget column

      • The release date column

      • The running time column





  • ​The cleaned Wikipedia data is converted to a Pandas DataFrame, and the DataFrame is displayed in theETL_clean_wiki_movies.ipynbfile.(4 pt)


Deliverable 3: Extract and Transform the Kaggle Data (30 points)


Deliverable 3 Instructions


Using your knowledge of Python, Pandas, the ETL process, and code refactoring, extract and transform the Kaggle metadata and MovieLens rating data, then convert the transformed data into separate DataFrames. Then, you’ll merge the Kaggle metadata DataFrame with the Wikipedia movies DataFrame to create themovies_dfDataFrame. Finally, you’ll merge the MovieLens rating data DataFrame with themovies_dfDataFrame to create themovies_with_ratings_df.


REWIND

For this deliverable, you’ve already done the following in this module:



Download theETL_Deliverable3_starter_code.ipynbfile, add it to your Movies-ETL GitHub folder, and rename the fileETL_clean_kaggle_data.ipynb. Follow the instructions below to refactor the code from this module as indicated by the numbered comments in the starter code file.



  1. In Step 1, add the function you created in Deliverable 1 that reads in the three data files and creates thekaggle_metadataandratingsDataFrames.

  2. Before Step 2, add all the code you wrote for Deliverable 2.

  3. In Step 2, below the code that cleans the running time column in thewiki_movies_dfDataFrame from Deliverable 2, add the code that cleans the Kaggle metadata.

  4. In Step 3, merge thewiki_movies_dfDataFrame and thekaggle_metadataDataFrames, then name the new DataFrame,movies_df.

  5. In Step 4, drop unnecessary columns from themovies_dfDataFrame.

  6. In Step 5, add thefill_missing_kaggle_data()function that fills in the missing Kaggle data on themovies_dfDataFrame.

  7. In Step 6, call thefill_missing_kaggle_data()function with themovies_dfDataFrame and the Kaggle and Wikipedia columns to be cleaned as the arguments.

  8. In Step 7, filter themovies_dfDataFrame to keep the necessary columns.

  9. In Step 8, rename the columns in themovies_dfDataFrame.

  10. In Step 9, transform and merge the ratings DataFrame with themovies_dfDataFrame, name the new DataFramemovies_with_ratings_df, then clean themovies_with_ratings_dfDataFrame.

  11. In Step 10, use the variables provided to create a path to the Wikipedia data, the Kaggle metadata, and the MovieLens rating data files.

  12. In Step 11, set the three variables from Step 17 of Deliverable 2 equal to the function created in Deliverable 1.

  13. In Step 12, set the DataFrames from the return statement after Step 9 equal to the file names in Step 11.

  14. In Step 13, check that yourwiki_movies_dfDataFrame is the same as in Deliverable 2.

  15. In Step 14, check that yourmovies_with_ratings_dfDataFrame looks like this image:



The first five rows of the movies_with_ratings_df DataFrame.



  1. In Step 15, check that yourmovies_dfDataFrame looks like this image:



The first five rows of the movies_df DataFrame.



  1. After you confirm that all three DataFrames are correct, save theETL_clean_kaggle_data.ipynbfile in your Movies-ETL GitHub folder.


Deliverable 3 Requirements


You will earn a perfect score for Deliverable 3 by completing all requirements below:



  • The extraction and transformation of the Kaggle metadata using the ETL function does the following:

    • The Kaggle metadata is cleaned.(4 pt)

    • The Wikipedia and Kaggle DataFrames are merged.(3 pt)

    • The following is performed on the merged Wikipedia and Kaggle DataFrames to create themovies_df:(8 pt)

      • Unnecessary columns are dropped.

      • A function is used to fill in the missing Kaggle data.

      • Themovies_dfDataFrame is filtered to keep specific columns.

      • Themovies_dfDataFrame columns are renamed.





  • The extraction and transformation of the MovieLens ratings data using the ETL function does the following:

    • The ratings counts are cleaned.(3 pt)

    • Themovies_dfDataFrame is merged with the cleaned ratings DataFrame to create themovies_with_ratings_dfDataFrame.(4 pt)

    • The empty values in themovies_with_ratings_dfDataFrame are filled with “0”.(3 pt)



  • Themovies_with_ratings_dfand themovies_dfDataFrames are displayed in theETL_clean_kaggle_data.ipynbfile.(5 pt)


Deliverable 4: Create the Movie Database (15 points)


Deliverable 4 Instructions


Use your knowledge of Python, Pandas, the ETL process, code refactoring, and PostgreSQL to add themovies_dfDataFrame and MovieLens rating CSV data to a SQL database.


REWIND

For this deliverable, you’ve already done the following in this module:





  • Lesson 8.5.1:
    Create and connect to the database, then import data


Make a copy of theETL_clean_kaggle_data.ipynbfile in the Movies-ETL GitHub, and rename the fileETL_create_database.ipynb. Follow the instructions below to add themovies_dfDataFrame and MovieLens rating CSV data to a SQL database.



  1. In the first cell, uncomment the# from config import db_passwordso this code is working.

  2. Remove the return statement,return wiki_movies_df, movies_with_ratings_df, movies_df.

  3. After Step 9,Transform and merge the ratings DataFrame, add the code to create the connection to the PostgreSQL database, then add themovies_dfDataFrame to a SQL database.



Hint:Use'replace'for theif_existsparameter so that themovies_dfDataFrame data won't be added to the table again.



  1. Before reading in the MovieLens rating CSV data, drop theratingstable in pgAdmin.

  2. Add the code that prints out the elapsed time to import each row.

  3. Refactor Step 11 of Deliverable 3 so that you pass in the variables for the files created in Step 10 of Deliverable 3 in the function created in Deliverable 1.

  4. Run the program.

  5. After the program has finished, run a query on the PostgreSQL database that retreives the number of rows for themoviesandratingstables.

  6. After you confirm that themoviestable has 6,052 rows and theratingstable has 26,024,289 rows, take a screenshot of each query and the output, then save them asmovies_query.pngandratings_query.png, respectively.

  7. Save theETL_create_database.ipynbfile in your Movies-ETL GitHub folder.

  8. Save themovies_query.pngandratings_query.pngfiles in the Resources folder.


Deliverable 4 Requirements


You will earn a perfect score for Deliverable 4 by completing all requirements below:



  • The data from themovies_dfDataFrame replaces the current data in the movies table in the SQL database, as determined by themovies_query.png.(5 pt)

  • The data from the MovieLens rating CSV file is added to theratingstable in the SQL database, as determined by theratings_query.png.(5 pt)

  • The elapsed time to add the data to the database is displayed in theETL_create_database.ipynbfile.(5 pt)


Submission


Once you’re ready to submit, make sure to check your work against the rubric to ensure you are meeting the requirements for this Challenge one final time. It’s easy to overlook items when you’re in the zone!


As a reminder, the deliverables for this Challenge are as follows:



  • Deliverable 1: Write an ETL function to read three data files

  • Deliverable 2: Extract and Transform the Wikipedia Data

  • Deliverable 3: Extract and Transform the Kaggle Data

  • Deliverable 4: Create the Movie Database


IMPORTANT

Don’t clear the output of your Jupyter Notebook files. Doing so will result in a lower score.


Upload the following to your Movies-ETL GitHub repository:



  1. TheETL_function_test.ipynbfile

  2. TheETL_clean_wiki_movies.ipynbfile

  3. TheETL_clean_kaggle_data.ipynbfile

  4. TheETL_create_database.ipynbfile

  5. The Resources folder with thewikipedia.movies.json,movies_metadata.csv,movies_query.png, andratings_query.pngfiles.

  6. A README.md that describes the purpose of the repository. Although there is no graded written analysis for this Challenge, it is encouraged and good practice to add a brief description of your project.


To submit your Challenge assignment in Canvas, click Submit, then provide the URL of your Movies-ETL GitHub repository for grading. Comments are disabled for graded submissions in BootCampSpot. If you have questions about your feedback, please notify your instructional staff or the Student Success Manager. If you would like to resubmit your work for an improved grade, you can use theRe-Submit Assignmentbutton to upload new links. You may resubmit up to 3 times for a total of 4 submissions.


IMPORTANT

Once you receive feedback on your Challenge, make any suggested updates or adjustments to your work. Then, add this week’s Challenge to your professional portfolio.


NOTE

You are allowed to miss up to two Challenge assignments and still earn your certificate. If you complete all Challenge assignments, your lowest two grades will be dropped. If you wish to skip this assignment, click Submit then indicate you are skipping by typing “I choose to skip this assignment” in the text box.


Rubric

Module-8 RubricModule-8 Rubric
































CriteriaRatingsPts
This criterion is linked to a learning outcomeDeliverable 1: Write an ETL function to read three data files










25to >23.0PtsMasteryThe ETL function does the following: ✓ The three data files are passed into the function. ✓ All three data sets are converted to DataFrames, and the DataFrames are correct and displayed.23to >19.0PtsApproaching MasteryThe ETL function does the following: ✓ The three data files are passed into the function. ✓ The Wikipedia JSON file is converted to DataFrame, and the DataFrame is displayed. ✓ The Kaggle metadata and MovieLens ratings data are converted to a DataFrames, but the wrong DataFrames are displayed.19to >16.0PtsProgressingThe ETL function does the following: ✓ The three data files are passed into the function. ✓ The Wikipedia JSON file is converted to DataFrame, but it is not displayed. ✓ The Kaggle metadata and MovieLens ratings data are converted to DataFrames, but the DataFrames are incorrect.16to >0.0PtsEmergingThe ETL function does the following: ✓ The three data files are passed into the function. ✓ The Wikipedia JSON file is ONLY converted to a raw data file. ✓ The Kaggle metadata and MovieLens ratings data are converted to DataFrames, but they are not displayed.0PtsIncomplete

25pts

This criterion is linked to a learning outcomeDeliverable 2: Extract and Transform the Wikipedia Data










30to >27.0PtsMastery✓ TV shows are filtered out, and the wiki_movies DataFrame is created. ✓ A try-except block is used successfully. ✓ All of the tasks for the extraction & transformation of the Wikipedia data are completed. ✓ The cleaned Wikipedia data is converted to a DataFrame, and the DataFrame is displayed.27to >25.0PtsApproaching Mastery✓ TV shows are filtered out, and the wiki_movies DataFrame is created. ✓A try-except block is used successfully. During the extraction & transformation of the Wikipedia data, the following are done: ✓ Columns with null values are dropped. ✓ The non-null box office data is converted to string values. ✓ Regular expression codes for "form_one" and "form_two" of the box office data are correct. ✓ THREE of the FOUR columns are cleaned. ✓ Wikipedia data is not cleaned but is converted to a DataFrame and displayed.25to >20.0PtsProgressing✓ TV shows are filtered out, and the wiki_movies DataFrame is created. ✓ A try-except block is used successfully. During the extraction & transformation of the Wikipedia data, the following are done: ✓ Columns with null values are dropped. ✓ The non-null box office data is converted to string values. ✓ Regular expression codes for "form_one" and "form_two" of the box office data are correct. ✓ TWO of the FOUR columns are cleaned. ✓ Wikipedia data is not cleaned but is converted to a DataFrame and displayed.20to >0.0PtsEmerging✓ TV shows are filtered out, and the wiki_movies DataFrame is created. ✓ A try-except block is written but doesn’t catch errors. During the extraction & transformation of the Wikipedia data, the following are done: ✓ Columns with null values are dropped. ✓ The non-null box office data is converted to string values. ✓ Regular expression codes for "form_one" and "form_two" of the box office data are correct. ✓ ONE of the FOUR columns is cleaned. ✓ Wikipedia data is not cleaned but is converted to a DataFrame and displayed.0PtsIncomplete

30pts

This criterion is linked to a learning outcomeDeliverable 3: Extract and Transform the Kaggle Data










30to >27.0PtsMasteryDuring the extraction & transformation of the Kaggle metadata, the following are done: ✓ The metadata is cleaned. ✓ The Wikipedia and Kaggle DataFrames are merged ✓ The "movies" DataFrame is created, and all FOUR tasks are performed. ✓ All THREE tasks are completed during the extraction & transformation of the MovieLens rating data. ✓ The Kaggle and ratings DataFrames are correct and displayed.27to >22.0PtsApproaching MasteryDuring the extraction & transformation of the Kaggle metadata, the following are done: ✓ The metadata is cleaned. ✓ The Wikipedia and Kaggle DataFrames are merged ✓ The "movies" DataFrame is created, but only THREE of the FOUR tasks are performed. ✓ All THREE tasks are completed during the extraction & transformation of the MovieLens rating data. ✓ The Kaggle and MovieLens rating DataFrames are displayed, but the "movies" DataFrame is incorrect.22to >16.0PtsProgressingDuring the extraction & transformation of the Kaggle metadata, the following are done: ✓ The metadata is cleaned. ✓ The Wikipedia and Kaggle DataFrames are merged ✓ The "movies" DataFrame is created, but only TWO of the FOUR tasks are performed. During the extraction & transformation of the MovieLens rating data, the following are done: ✓ The ratings counts are cleaned. ✓ The two DataFrames are merged. ✓ There is an attempt to fill the empty values with “0”. ✓ The Kaggle and MovieLens rating DataFrames are displayed but incorrect.16to >0.0PtsEmergingDuring the extraction & transformation of the Kaggle metadata, the following are done: ✓ The metadata is cleaned. ✓ The Wikipedia and Kaggle DataFrames are merged, but there is an error. ✓ The "movies" DataFrame is created, but only ONE of the FOUR tasks is performed. During the extraction & transformation of the MovieLens rating data, the following are done: ✓ The ratings counts are cleaned. ✓ The two DataFrames are merged, but there is an error ✓ There is an attempt to fill the empty values with “0”. ✓ The Kaggle and MovieLens rating DataFrames are displayed but incorrect.0PtsIncomplete

30pts

This criterion is linked to a learning outcomeDeliverable 4: Create the Movie Database










15to >14.0PtsMastery✓ The data in the movies table in the SQL database is replaced. ✓ The ratings table is dropped, and the MovieLens rating CSV file is added to the SQL ratings table. ✓ The elapsed time to add the data to the database is displayed.14to >11.0PtsApproaching Mastery✓ The data in the movies table in the SQL database is replaced. ✓ The ratings table is dropped, but not all of the MovieLens rating CSV file is added to the ratings table. ✓ The elapsed time to add the data to the database is displayed.11to >8.0PtsProgressing✓ The data in the movies table in the SQL database is not replaced. ✓ The ratings table is not dropped, but the MovieLens rating CSV file is added to the ratings table. ✓ All of the elapsed time to add the data to the database is not displayed8to >0.0PtsEmerging✓ There is an error adding the movies table in the database. ✓ The ratings table is not dropped, and there is an error adding the MovieLens rating CSV data. ✓ The elapsed time to add the data to the database is partially displayed with an error.0PtsIncomplete

15pts

Total points:100

8.5.1: Connect Pandas and SQL" style="float: left;">Previous
Module 8 Career Connection" style="float: right;">Next© 2020 - 2021 Trilogy Education Services, a 2U, Inc. brand. All Rights Reserved.
Answered 7 days AfterDec 20, 2021

Answer To: Background Amazing Prime loves the dataset and wants to keep it updated on a daily basis. Britta...

Sathishkumar answered on Dec 27 2021
129 Votes
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here