Apache Spark:
Streaming Data
Data set:
Baby names.csv (Dropbox link)
For this exercise, you will stream data from your local file system and write the results to the console. The examples used in this exercise are loosely based on examples from Spark’s Structured Streaming Programming Guide.
1. Stream Directory Data
In the first part of the exercise, you will create a simple Spark streaming program that reads an input stream from a file source. The file source stream reader reads data from a directory on a file system. When a new file is added to the folder, Spark adds that file’s data to the input data stream.
You can find the input data for this exercise in the baby-names/streaming directory. This directory contains the baby names CSV file, randomized and split into 98 individual files. You will use these files to simulate incoming streaming data.
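One way to simulate arriving data is to copy the split files into a watched directory on a timer. The following is a minimal sketch, not a required approach. The function name feed_files, the target directory name incoming, the *.csv file pattern, and the 10-second default delay are all assumptions chosen for illustration.

```python
import shutil
import time
from pathlib import Path


def feed_files(source_dir, incoming_dir, limit=10, delay_seconds=10.0):
    """Copy the first `limit` CSV files from source_dir into incoming_dir,
    pausing `delay_seconds` between files. Returns the destination paths."""
    source = Path(source_dir)
    incoming = Path(incoming_dir)
    incoming.mkdir(parents=True, exist_ok=True)

    copied = []
    # Assumption: the split files end in .csv; sorted order is arbitrary but stable.
    for index, path in enumerate(sorted(source.glob("*.csv"))[:limit]):
        if index > 0:
            time.sleep(delay_seconds)
        # Copy under a dot-prefixed temporary name, then rename, so the
        # finished file appears in the directory in a single step.
        temporary = incoming / f".{path.name}.tmp"
        shutil.copy(path, temporary)
        destination = incoming / path.name
        temporary.replace(destination)
        copied.append(destination)
    return copied
```

Calling feed_files("baby-names/streaming", "incoming") from a second terminal or notebook cell, while the streaming query is running, would deliver the first ten files at roughly 10-second intervals.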
a. Count the Number of Females
Create a Spark program that monitors an incoming directory. To simulate streaming data, you will copy CSV files from the baby-names/streaming directory into the incoming directory. Because the file source does not infer a schema for streaming CSV data by default, you will need to define a schema before you initialize the streaming DataFrame.
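Defining the schema and opening the stream might look like the sketch below, written in PySpark. The column names and types (state, sex, year, name, count), the header row, and the directory name incoming are assumptions, so check them against the first line of one of your CSV files. The helper name read_incoming is hypothetical.

```python
# Assumed layout of the baby names CSV, written as a DDL string.
# Adjust the names and types to match the actual files.
BABY_NAMES_SCHEMA = "state STRING, sex STRING, year INT, name STRING, count INT"


def read_incoming(spark, incoming_dir="incoming"):
    """Return a streaming DataFrame over CSV files that appear in incoming_dir.

    `spark` is an existing SparkSession.
    """
    return (
        spark.readStream
        .schema(BABY_NAMES_SCHEMA)  # streaming file sources need an explicit schema
        .option("header", "true")   # assumption: every split file has a header row
        .csv(incoming_dir)
    )
```

The returned DataFrame is lazy: no files are read until a streaming query is started on it.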
From this input data stream, create a simple output data stream that counts the number of females and writes the count to the console. Roughly every 10 seconds, copy a new file into the incoming directory and record the console output. Do this for the first ten files.
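A possible shape for this step in PySpark is sketched below. It assumes a sex column in which females are encoded as 'F', the same assumed column layout and header row as elsewhere, and a directory named incoming. It counts matching rows; if your file carries a per-row count column, summing that column would be a different reading of "number of females". All function names are hypothetical.

```python
# Assumed column layout; verify against your CSV files.
SCHEMA = "state STRING, sex STRING, year INT, name STRING, count INT"


def count_females(names):
    """Running count of rows whose sex column equals 'F' (assumed encoding)."""
    return names.where("sex = 'F'").groupBy("sex").count()


def start_console_query(counts):
    """Start a query that prints the aggregated result to the console."""
    return (
        counts.writeStream
        .outputMode("complete")  # print the full result table on every trigger
        .format("console")
        .start()
    )


def main(incoming_dir="incoming"):
    # Imported here so the helpers above can be loaded without starting Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("count-females").getOrCreate()
    names = (
        spark.readStream
        .schema(SCHEMA)
        .option("header", "true")  # assumption: files include a header row
        .csv(incoming_dir)
    )
    query = start_console_query(count_females(names))
    query.awaitTermination()  # blocks until the query is stopped
```

Calling main() starts the query. With no trigger specified, Spark starts a new micro-batch as soon as the previous one finishes and new data is available.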
2. Micro-Batching
Repeat the last step, but set a micro-batch trigger so that processing runs every 30 seconds. As before, copy a new file into the incoming directory roughly every 10 seconds and record the console output. Do this for the first ten files. How did the output differ from the previous example?
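In PySpark, a fixed micro-batch interval is set with a processing-time trigger on the stream writer. The sketch below assumes counts is the aggregated streaming DataFrame from the previous step; the function name start_triggered_query is hypothetical.

```python
def start_triggered_query(counts, interval="30 seconds"):
    """Start a console query that processes a micro-batch once per interval."""
    return (
        counts.writeStream
        .outputMode("complete")             # print the full result table each batch
        .format("console")
        .trigger(processingTime=interval)   # fixed-interval micro-batches
        .start()
    )
```

Only the trigger call differs from a query that uses the default trigger, which makes the two console outputs easy to compare.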