Hello I have a data science assignment that requires you to explore a range of data sets and analyse it by answering a range of questions. The assignment must be completed using Jupyter Notebook and coded in Python. The notebook will need to be neatly presented including a thorough analysis for each question. The questions and information are provided in the Jupyter notebook. I have attached the jupyter notebook and the datafiles. Please do not import any other libraries.
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sport Vouchers Program Analysis\n", "\n", "The goal of this Portfolio task is to explore data from the Federal Government Sport Vouchers program - this is a\n", "program that provides up to two $100 vouchers for kids to participate in organised sport. Here's the [NSW Active Kids page](https://www.service.nsw.gov.au/transaction/apply-active-kids-voucher), there are similar schemes in other states - this data is from South Australia.\n", "\n", "This is an exercise in exploring data and communicating the insights you can gain from it. The source data comes\n", "from the `data.gov.au` website and provides details of all Sport Vouchers that have been redeemed since February in SA 2015 as part of the Sport Voucher program: [Sports Vouchers Data](https://data.gov.au/dataset/ds-sa-14daba50-04ff-46c6-8468-9fa593b9f100/details). This download is provided for you as `sportsvouchersclaimed.csv`.\n", "\n", "To augment this data you can also make use of [ABS SEIFA data by LGA](http://stat.data.abs.gov.au/Index.aspx?DataSetCode=ABS_SEIFA_LGA#) which shows a few measures of Socioeconomic Advantage and Disadvantage for every Local Government Area. This data is provided for you as `ABS_SEIFA_LGA.csv`. This could enable you to answer questions about whether the voucher program is used equally by parents in low, middle and high socioeconomic areas. You might be interested in this if you were concerned that this kind of program might just benifit parents who are already advantaged (they might already be paying for sport so this program wouldn't be helping much).\n", "\n", "Questions:\n", "* Describe the distribution of vouchers by: LGA, Sport - which regions/sports stand out? \n", "* Are some sports more popular in different parts of the state?\n", "* Are any electorates over/under represented in their use of vouchers?\n", "* Is there a relationship between any of the SEIFA measures and voucher use in an LGA?\n", "\n", "A challenge in this task is to display a useful summary of the data given that there are a large number of LGAs and sports involved. Try to avoid long lists and large tables. Think about what plots and tables communicate the main points of your findings. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# read the sports vouchers data\n", "sa_vouchers = pd.read_csv(\"files/sportsvouchersclaimed.csv\")\n", "sa_vouchers.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SEIFA data includes row for each Local Government Area (LGA) but the names of the LGAs have a letter or letters in brackets after the name. To allow us to match this up with the voucher data we remove this and convert to uppercase. \n", "\n", "For each LGA the data includes a number of measures all of which could be useful in your exploration. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# read the SEIFA data, create an LGA column by removing the letters in brackets and converting to uppercase\n", "seifa = pd.read_csv('files/ABS_SEIFA_LGA.csv')\n", "lga = seifa[\"Local Government Areas - 2011\"].str.replace(' \\([ACSRCDMT]+\\)', '').str.upper()\n", "seifa['LGA'] = lga\n", "seifa.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since there are many rows per LGA we need to use `pivot_table` to create a new data frame with one row per LGA. Here\n", "is an example of doing this to create a table with the different SCORE measures and the population (URP) field. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "LGA_scores = seifa[seifa.MEASURE == 'SCORE'].pivot_table(index=\"LGA\", columns=[\"INDEX_TYPE\"], values=\"Value\")\n", "LGA_scores.head()\n", "LGA_pop = seifa[seifa.MEASURE == 'URP'].pivot_table(index=\"LGA\", columns=[\"INDEX_TYPE\"], values=\"Value\")\n", "LGA_scores['Population'] = LGA_pop.IEO\n", "LGA_scores.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data frame can then be joined with the vouchers data fram to create one master data frame containing both the voucher data and the SEIFA measures." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sa_vouchers_scores = sa_vouchers.join(LGA_scores, on='Participant_LGA')\n", "sa_vouchers_scores.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Challenge - Queensland\n", "\n", "_Note: this is an extra task that you might take on to get a better grade for your portfolio. You can get a good pass grade without doing this._ \n", "\n", "Queensland has a similar program called [Get Started](https://data.gov.au/dataset/ds-qld-3118838a-d425-48fa-bfc9-bc615ddae44e/details?q=get%20started%20vouchers) and we can retrieve data from their program in a similar format. \n", "\n", "The file [round1-redeemed_get_started_vouchers.csv](files/round1-redeemed_get_started_vouchers.csv) contains records of the vouchers issued in Queensland. The date of this data is not included but the program started in 2015 so it is probably from around then. \n", "\n", "The data includes the LGA of the individual but the name of the activity is slightly different. To do a comparable analysis you would need to map the activity names onto those from South Australia. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "qld_vouchers = pd.read_csv('files/round1-redeemed_get_started_vouchers.csv')\n", "qld_vouchers.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Join the QLD data with the LGA data as before to get population and SIEFA data integrated\n", "qld_vouchers['LGA'] = qld_vouchers['Club Local Government Area Name'].str.replace(' \\([RC]+\\)', '').str.upper()\n", "qld_vouchers_scores = qld_vouchers.join(LGA_scores, on='LGA')\n", "qld_vouchers_scores.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 } absseifalga-puje0z22.csv LGA_2011,Local Government Areas - 2011,INDEX_TYPE,Index Type,MEASURE,Measure,Value 10050,Albury (C),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,SCORE,Score,967 10050,Albury (C),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,MINS,Minimum score for SA1s in area,628 10050,Albury (C),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,MAXS,Maximum score for SA1s in area,1169 10050,Albury (C),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,URP,Usual resident population,47851 10110,Armidale Dumaresq (A),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,SCORE,Score,985 10110,Armidale Dumaresq (A),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,MINS,Minimum score for SA1s in area,727 10110,Armidale Dumaresq (A),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,MAXS,Maximum score for SA1s in area,1120 10110,Armidale Dumaresq (A),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,URP,Usual resident population,24122 10150,Ashfield (A),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,SCORE,Score,1031 10150,Ashfield (A),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,MINS,Minimum score for SA1s in area,926 10150,Ashfield (A),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,MAXS,Maximum score for SA1s in area,1124 10150,Ashfield (A),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,URP,Usual resident population,41216 10200,Auburn (C),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,SCORE,Score,944 10200,Auburn (C),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,MINS,Minimum score for SA1s in area,759 10200,Auburn (C),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,MAXS,Maximum score for SA1s in area,1183 10200,Auburn (C),IRSAD,Index of Relative Socio-economic Advantage and Disadvantage,URP,Usual resident population