
Part A - Spark RDDs and SQL with text (8 marks)

In Part A your task is to answer a question about the data in a text file, first using Spark RDDs, and then using Spark DataFrames and SQL. By using both to answer the same question about the same file you can more readily see how the two techniques compare.

When you click the panel on the right you'll get a connection to a server that has, in your home directory, the text file "graph.txt". In this text file, each line is in the format "EdgeId FromNodeId ToNodeId Distance":

0 0 1 10.0
1 0 2 5.0
2 1 2 2.0
3 1 3 1.0
4 2 1 3.0
5 2 3 9.0
6 2 4 2.0
7 3 4 4.0
8 4 0 7.0
9 4 3 5.0

(There may be an empty line in the input file; you can use a filter to handle this.) These data represent a directed graph.

Given this directed graph, your task is to compute the average length of the out-going edges for each node, and to sort the nodes first by average length in descending order and then by node ID (as a numeric value) in ascending order. Your results should appear like the following (rounded to two decimal places):

0	7.5
4	6.0
2	4.67
3	4.0
1	1.5

Each line in the single output file is in the format "NodeID\tAverage length of out-going edges", where '\t' indicates the tab character. Please remove the nodes that have no out-going edges.

First (4 marks)

Write a Python program that uses Spark RDDs to do this. A file called "rdd.py" has been created for you - you just need to fill in the details. You should be able to modify programs that you have already seen in this week's content. To sort the RDD results, you can use sortBy, and here is an example of it. Hint:

>>> tmp = [('a', 3), ('b', 2), ('a', 1), ('d', 4), ('2', 5)]
>>> sc.parallelize(tmp).sortBy(lambda x: (x[0], x[1])).collect()
Output: [('2', 5), ('a', 1), ('a', 3), ('b', 2), ('d', 4)]

To test your program you first need to create your default directory in Hadoop, and copy graph.txt to it:

$ hdfs dfs -mkdir -p /user/user
$ hdfs dfs -put graph.txt

You can test your program by running the following command:

$ spark-submit rdd.py

Please save your results in the 'result-rdd' folder in HDFS.

Second (4 marks)

Write a Python program that uses Spark DataFrames and SQL to do this. A file called "sql.py" has been created for you - you just need to fill in the details. Again, you should be able to modify programs that you have already seen in this week's content. You can test your program by running the following command:

$ spark-submit sql.py

Please save your results in the 'result-sql' folder in HDFS.
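For orientation, here is a minimal sketch of one way rdd.py might be filled in. It assumes the script creates its own SparkContext and reads graph.txt from your HDFS home directory; the variable names are illustrative and the provided template may structure things differently:

from pyspark import SparkContext

sc = SparkContext(appName="PartA-RDD")

# Read the file, drop empty lines, and keep (FromNodeId, Distance) pairs
edges = (sc.textFile("graph.txt")
           .filter(lambda line: line.strip() != "")
           .map(lambda line: line.split())
           .map(lambda p: (int(p[1]), float(p[3]))))

# Average out-going edge length per node: accumulate (sum, count), then divide
averages = (edges.aggregateByKey((0.0, 0),
                                 lambda acc, d: (acc[0] + d, acc[1] + 1),
                                 lambda a, b: (a[0] + b[0], a[1] + b[1]))
                 .mapValues(lambda t: round(t[0] / t[1], 2)))

# Sort by average length descending, then node ID ascending, and write
# "NodeID\tAverage" lines; nodes with no out-going edges never appear as keys
(averages.sortBy(lambda kv: (-kv[1], kv[0]))
         .map(lambda kv: "{}\t{}".format(kv[0], kv[1]))
         .saveAsTextFile("result-rdd"))

A corresponding sketch of sql.py, assuming the space-delimited file is read with spark.read.csv; the column names edge_id, from_node, to_node and distance are introduced purely for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("PartA-SQL").getOrCreate()

# Read the space-delimited text file and name the columns
df = (spark.read.option("delimiter", " ")
                .csv("graph.txt")
                .toDF("edge_id", "from_node", "to_node", "distance"))

df.createOrReplaceTempView("edges")

# Average out-going edge length per node, sorted as required
result = spark.sql("""
    SELECT CAST(from_node AS INT) AS node,
           ROUND(AVG(CAST(distance AS DOUBLE)), 2) AS avg_len
    FROM edges
    WHERE from_node IS NOT NULL
    GROUP BY CAST(from_node AS INT)
    ORDER BY avg_len DESC, node ASC
""")

# Write tab-separated lines to the required output folder
(result.select(F.concat_ws("\t",
                           F.col("node").cast("string"),
                           F.col("avg_len").cast("string")))
       .write.text("result-sql"))

If a single output file is required, repartitioning to one partition (for example with coalesce(1) before saving) is one common way to get it.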
Part B - Spark RDDs and SQL with CSV (8 marks)

In Part B your task is to answer a question about the data in a CSV file, first using Spark RDDs, and then using Spark DataFrames and SQL. By using both to answer the same question about the same file you can more readily see how the two techniques compare.

When you click the panel on the right you'll get a connection to a server that has, in your home directory, the CSV file "orders.csv". It's one that you've seen before. Here are the fields in the file:

OrderDate (date)
ISBN (string)
Title (string)
Category (string)
PriceEach (decimal)
Quantity (integer)
FirstName (string)
LastName (string)
City (string)

Your task is to find the number of books ordered each day, sorted by the number of books descending, then order date ascending. Your results should appear as the following:

2009-04-03,10
2009-04-02,8
2009-04-01,7
2009-04-04,6
2009-03-31,5
2009-04-05,4
2009-04-08,4

First (4 marks)

Write a Python program that uses Spark RDDs to do this. A file called "rdd.py" has been created for you - you just need to fill in the details. You should be able to modify programs that you have already seen in this week's content. To sort the RDD results, you can use sortBy, and here is an example of it. Hint:

>>> tmp = [('a', 3), ('b', 2), ('a', 1), ('d', 4), ('2', 5)]
>>> sc.parallelize(tmp).sortBy(lambda x: (x[0], x[1])).collect()
Output: [('2', 5), ('a', 1), ('a', 3), ('b', 2), ('d', 4)]

To test your program you first need to create your default directory in Hadoop, and copy orders.csv to it:

$ hdfs dfs -mkdir -p /user/user
$ hdfs dfs -put orders.csv

You can test your program by running the following command:

$ spark-submit rdd.py

Please save your results in the 'result-rdd' folder in HDFS.

Second (4 marks)

Write a Python program that uses Spark DataFrames and SQL to do this. A file called "sql.py" has been created for you - you just need to fill in the details. Again, you should be able to modify programs that you have already seen in this week's content. You can test your program by running the following command:

$ spark-submit sql.py

Please save your results in the 'result-sql' folder in HDFS.
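Again for orientation, a minimal sketch of one way the Part B rdd.py might be filled in. It assumes orders.csv starts with a header line and that none of the fields relied on contain embedded commas; variable names are illustrative:

from pyspark import SparkContext

sc = SparkContext(appName="PartB-RDD")

lines = sc.textFile("orders.csv")

# Skip the header line (assumed present) and any blank lines
header = lines.first()
rows = lines.filter(lambda line: line != header and line.strip() != "")

# Sum Quantity (field index 5) per OrderDate (field index 0);
# a plain split on "," assumes no commas inside Title or the other fields
counts = (rows.map(lambda line: line.split(","))
              .map(lambda f: (f[0], int(f[5])))
              .reduceByKey(lambda a, b: a + b))

# Sort by number of books descending, then order date ascending, and write
(counts.sortBy(lambda kv: (-kv[1], kv[0]))
       .map(lambda kv: "{},{}".format(kv[0], kv[1]))
       .saveAsTextFile("result-rdd"))

And a sketch of the Part B sql.py under the same header assumption:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("PartB-SQL").getOrCreate()

# Read the CSV with its header row so the listed field names are available
orders = spark.read.option("header", "true").csv("orders.csv")
orders.createOrReplaceTempView("orders")

# Total books per day, sorted by count descending then date ascending
result = spark.sql("""
    SELECT OrderDate, SUM(CAST(Quantity AS INT)) AS books
    FROM orders
    GROUP BY OrderDate
    ORDER BY books DESC, OrderDate ASC
""")

# Write "OrderDate,count" lines to the required output folder
(result.select(F.concat_ws(",", F.col("OrderDate"), F.col("books").cast("string")))
       .write.text("result-sql"))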
Answer (Abr Writing, Apr 18, 2021):
Assignment.ipynb
Installing Spark in Colab

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar xf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"
import findspark
findspark.init()
findspark.find()

Out[1]: '/content/spark-3.1.1-bin-hadoop2.7'

Importing the necessary libraries

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc
from numpy import mean
from itertools import islice

Initializing Spark Session

In [3]:
spark = SparkSession.builder\
    .master("local")\
    .appName("Colab")\
    .config('spark.ui.port', '4050')\
    .getOrCreate()

Initializing Spark Context

In [4]:
sc = spark.sparkContext

Uploading the data into Colab

In [ ]:
(cell truncated in the source: its metadata references the Colab file-upload widget, nbextensions/google.colab/files.js, but the cell body is cut off at "data":...)