Answer To: Part A - Spark RDDs and SQL with text (8 marks). In Part A your task is to answer a question about...
Abr Writing answered on Apr 18 2021
Assignment.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Assignment.ipynb",
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "7ZklnGDnIVt5"
},
"source": [
"Installing Spark in Colab"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "uE3ArNleHdxu",
"outputId": "a51e48bc-0d5a-4703-8268-cc0834d16539"
},
"source": [
"!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
"!wget -q https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz\n",
"!tar xf spark-3.1.1-bin-hadoop2.7.tgz\n",
"!pip install -q findspark\n",
"import os\n",
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
"os.environ[\"SPARK_HOME\"] = \"/content/spark-3.1.1-bin-hadoop2.7\"\n",
"import findspark\n",
"findspark.init()\n",
"findspark.find()"
],
"execution_count": 1,
"outputs": [
{
"output_type": "execute_result",
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'/content/spark-3.1.1-bin-hadoop2.7'"
]
},
"metadata": {
"tags": []
},
"execution_count": 1
}
]
},
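{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional sanity check: confirm that the environment variables point at the unpacked Spark build and the Java 8 install set in the cell above."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Optional check: these should print the paths exported in the install cell\n",
"import os\n",
"print(os.environ.get(\"SPARK_HOME\"))\n",
"print(os.environ.get(\"JAVA_HOME\"))"
],
"execution_count": null,
"outputs": []
},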
{
"cell_type": "markdown",
"metadata": {
"id": "ym0ImrL-_udp"
},
"source": [
"Importing the necessary libraries"
]
},
{
"cell_type": "code",
"metadata": {
"id": "EeJCHfJH_uNA"
},
"source": [
"from pyspark.sql import SparkSession\n",
"from pyspark.sql.functions import col, desc\n",
"from numpy import mean\n",
"from itertools import islice"
],
"execution_count": 2,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "lNE2OV7uAJBz"
},
"source": [
"Initializing Spark Session"
]
},
{
"cell_type": "code",
"metadata": {
"id": "BAffZCcVAJbj"
},
"source": [
"spark = SparkSession.builder\\\n",
" .master(\"local\")\\\n",
" .appName(\"Colab\")\\\n",
" .config('spark.ui.port', '4050')\\\n",
" .getOrCreate()"
],
"execution_count": 3,
"outputs": []
},
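{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional sanity check: print the Spark version to confirm the session came up; nothing later depends on this cell."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Optional check: the running session reports its Spark version\n",
"print(spark.version)"
],
"execution_count": null,
"outputs": []
},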
{
"cell_type": "markdown",
"metadata": {
"id": "vHC7rJtLOHhx"
},
"source": [
"Initializing Spark Context"
]
},
{
"cell_type": "code",
"metadata": {
"id": "qNzSqOFSOIEO"
},
"source": [
"sc = spark.sparkContext"
],
"execution_count": 4,
"outputs": []
},
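{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional sanity check: a tiny RDD round-trip through the new context, just to confirm `sc` works before loading the real data."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Optional check: parallelize a small list, square each element, and collect the result\n",
"sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x).collect()"
],
"execution_count": null,
"outputs": []
},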
{
"cell_type": "markdown",
"metadata": {
"id": "csMpxwnTIbyE"
},
"source": [
"Uploading the data into Colab"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"resources": {
"http://localhost:8080/nbextensions/google.colab/files.js": {
"data":...