
Apache Spark Exercise: Start Setting up Spark, Load Data, and work with dataframes




Dataset: baby-names.csv






1. Loading Data


For this part of the exercise, you will load a comma-separated value (CSV) file into Spark as a text file (e.g., df_text = spark.read.text("baby-names.csv")). We will use the file baby-names.csv found in the baby-names directory of the data for this course. This data comes from the Social Security Administration and contains a list of the most popular baby names each year for each state. As an example, the row WY,M,2017,Kaison,5 means that in 2017, there were five males born in Wyoming with the name Kaison.
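A minimal setup sketch in PySpark, assuming a local installation; the application name "baby-names-exercise" is only an illustrative choice:

    from pyspark.sql import SparkSession

    # Entry point for all dataframe operations; getOrCreate() reuses an
    # existing session if one is already running (e.g., in a notebook).
    spark = SparkSession.builder.appName("baby-names-exercise").getOrCreate()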


Spark's Online Documentation will be helpful for the programming exercises in this class. In particular, the Apache Spark Quick Start Guide and Spark SQL, DataFrames and Datasets Guide will be helpful in completing this exercise.



a. Load Data and Show Schema


Load the baby-names.csv file into a Spark dataframe as a text file. Print the dataframe’s schema using the printSchema method.
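A possible sketch of this step, assuming baby-names.csv sits in the current working directory and the spark session from the setup sketch above:

    # Read the file as plain text: each line becomes one row with a single
    # string column named "value".
    df_text = spark.read.text("baby-names.csv")

    # Expected result: a schema with just that one column, i.e.
    # root
    #  |-- value: string (nullable = true)
    df_text.printSchema()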



b. Filtering and Counting


First, count the number of rows in the dataframe. Second, filter the dataframe so that it only contains rows that contain John. Count the number of rows in the filtered dataframe.
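One way this step might look, building on the df_text dataframe from part 1a:

    # Total number of lines in the file (a header line, if present, counts as a row here).
    total_rows = df_text.count()

    # Keep only lines whose raw text contains the substring "John"; note that
    # this also matches longer names such as "Johnny" or "Johnson".
    df_john_text = df_text.filter(df_text.value.contains("John"))
    john_rows = df_john_text.count()

    print(total_rows, john_rows)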



2. Working with DataFrames


In the previous part of the exercise, you loaded the data into the dataframe as a text file. As a consequence, Spark treated each line as a record with a single field. While this is useful for some applications (processing raw text), it is not useful when our original data contains structure. In this part of the exercise, load baby-names.csv as a CSV file instead of a text file.



a. Load Data and Show Schema


Load the baby-names.csv file as a CSV file instead of a text file. Print the schema for this dataframe. In addition to printing the dataframe’s schema, show the first 20 rows of data using the show method.
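A sketch of this step, assuming the file has a header row (as the column names state, sex, year, name, count in the saved output suggest) and reusing the spark session created earlier:

    # header=True takes column names from the first line; inferSchema=True lets
    # Spark guess the column types, so year and count come back as integers.
    df = spark.read.csv("baby-names.csv", header=True, inferSchema=True)

    df.printSchema()
    df.show()   # show() prints the first 20 rows by default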



b. Filtering and Counting


Filter the dataframe so that it only contains rows that contain the name John. Count the number of rows in the filtered dataframe.
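A possible implementation using an exact match on the name column; swap in col("name").contains("John") if substring matches are wanted instead:

    from pyspark.sql.functions import col

    # Keep only rows where the name column equals "John", then count them.
    df_john = df.filter(col("name") == "John")
    print(df_john.count())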



c. Sorting and Limits


For this step, filter the dataframe to include only males (sex=‘M’) born in Nebraska (state=‘NE’) in 1980 (year=‘1980’). Sort the dataframe by descending values of count and show the first ten rows. The result should be the top ten most popular boys’ names for 1980 in Nebraska.
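A sketch of this step, building on the CSV dataframe df from part 2a; because inferSchema=True reads year and count as integers, the comparison below uses numeric 1980 rather than the string ‘1980’:

    from pyspark.sql.functions import col

    # Apply the three filters, then sort by count in descending order.
    df_ne_m_1980 = (
        df.filter((col("state") == "NE") & (col("sex") == "M") & (col("year") == 1980))
          .orderBy(col("count").desc())
    )

    # Top ten most popular boys' names for Nebraska in 1980.
    df_ne_m_1980.show(10)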


Make sure that you save the files output from each task, as they may be needed for other tasks. These include prepared corpora, trained models, and reports.
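One way to persist a result for later tasks, continuing from the part 2c sketch above; the output directory name df_three_filters_part2c is borrowed from the submitted answer and is only illustrative, and note that Spark writes a directory of part files rather than a single .csv file:

    # coalesce(1) collapses the result into a single part file inside the
    # output directory; mode="overwrite" replaces any previous run.
    (
        df_ne_m_1980.limit(10)
            .coalesce(1)
            .write.csv("df_three_filters_part2c", header=True, mode="overwrite")
    )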


Answer (by Deep, Sep 11, 2021)
Apache_Spark_Grey_Nodes/df_three_filters_part2c.csv
state,sex,year,name,count
NE,M,1980,Matthew,434
NE,M,1980,Michael,426
NE,M,1980,Jason,409
NE,M,1980,Joshua,366
NE,M,1980,Christopher,359
NE,M,1980,Justin,337
NE,M,1980,Ryan,320
NE,M,1980,David,292
NE,M,1980,Andrew,281
NE,M,1980,Brian,278
Apache_Spark_Grey_Nodes/ReadMe_Apache_Spark_Assignment.docx
Please take note of the following:
1. The required results are stored as indicated by their respective names.
2. The code is in the ‘Apache_Spark_Assignment_Final.ipynb’ file.
3. The ‘Apache_Spark_Assignment_Final.ipynb’ file is a properly formatted Jupyter notebook. You can open this file through https://colab.research.google.com/
4. The DIRECT LINK for the CODE file is: https://colab.research.google.com/drive/1p6xq_HappXdRp8hWCEIOIRMr97nvOq5W
Apache_Spark_Grey_Nodes/apache-spark-exercise-090919-k4jbhcu4.docx